Previously, we talked about IBM Demo Cloud, a free server with SSH access for learning Hadoop, Pig, Hive and so on without running your own installation. Here are basic commands showing how to process healthcare data in Hadoop and Pig using IBM Demo Cloud. Unlike server logs, healthcare data lacks a universal format. For server logs we can easily distribute scripts that will work anywhere on earth, but healthcare data, such as diabetes or blood sugar readings, is rarely distributed for educational purposes. As with a server log, we treat one line as one entry. We provide an example script written for server logs; the reader needs to customize it according to the format of their own plain-text data. If the data source has blood sugar values ranging from 120 to, theoretically, infinity, with some values occurring multiple times, we will get a list of values followed by the number of occurrences, like:
... (216,1) (320,6) (297,2) (276,1) (278,3) ...
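For illustration, and assuming the source is a plain text file with one reading per line (an assumption about your data, not a fixed format), blood-sugar.log might contain hypothetical values like these:
216
320
297
320
278
A minimal Pig sketch that turns such a file into those (value, occurrences) pairs is given at the end of this article.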
How To Process Healthcare Data in Hadoop, Pig
Let us assume that the file's name is blood-sugar.log. First, we will feed the data to Hadoop with this command:
hadoop fs -put blood-sugar.log
Depending on the setup, we may face an error after running the above command. In that case, we need to append the destination location, where diskX is the disk number and USERNAME is something like admin1234, in this way:
hadoop fs -put blood-sugar.log /diskX/home/USERNAME/
If we run:
hadoop fs -ls -R
we will get output like this, indicating success:
drwxrwxrwx+   - admin admin        0 2018-05-01 17:04 .staging
-rwxrwxrwx+   3 admin admin 12183708 2018-05-01 16:33 blood-sugar.log
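Optionally, as a quick sanity check (not a required step), you can print the first few lines of the uploaded file from HDFS; adjust the path if you had to use the /diskX/home/USERNAME/ form above:
hadoop fs -cat blood-sugar.log | head -5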
Now, we will create a script named script.pig at the same location with the following content. This script is written for a server log and needs editing depending on the layout of your data:
-- load the Apache common log loader from piggybank
DEFINE ApacheCommonLogLoader org.apache.pig.piggybank.storage.apachelog.CommonLogLoader();
-- load the log file and name its fields
logs = LOAD '/path/to/USERNAME/access.log.1' USING ApacheCommonLogLoader
    AS (addr: chararray, logname: chararray, user: chararray, time: chararray,
        method: chararray, uri: chararray, proto: chararray, status: int, bytes: int);
-- group entries by client address and count requests per address
addrs = GROUP logs BY addr;
counts = FOREACH addrs GENERATE flatten($0), COUNT($1) as count;
DUMP counts;
You need to change /path/to/USERNAME/access.log.1 in the above example to the real path and file name, then save the file. Now run this command:
locate piggybank.jar
At the end of the output, you'll get something like this:
...
/usr/iop/4.1.0.0/hive/lib/piggybank.jar
/usr/iop/4.1.0.0/pig/piggybank.jar
/usr/iop/4.1.0.0/pig/lib/piggybank.jar
We will use the /usr/iop/4.1.0.0/pig/piggybank.jar path. We will run the pig command to bring up the Grunt shell:
pig
Then, run this command and quit:
REGISTER '/usr/iop/4.1.0.0/pig/piggybank.jar';
quit
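If the jar does not stay registered for later runs, a common alternative is to put the REGISTER line at the top of script.pig itself, so it is loaded every time the script runs:
REGISTER '/usr/iop/4.1.0.0/pig/piggybank.jar';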
Then we will run the Pig script (which we provided above for a server log, and which you'll edit to meet your needs):
pig -x local script.pig
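Note that -x local runs Pig in local mode against the local filesystem. To run the job on the cluster against the data you put into HDFS earlier, MapReduce mode can be used instead (adjust the LOAD path in the script accordingly):
pig -x mapreduce script.pig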
This will return the intended output.
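The script above targets Apache-style server logs. As a minimal sketch of the customization mentioned at the beginning, assuming blood-sugar.log holds one integer reading per line (an assumption about your data) and using hypothetical alias names, a version for the blood sugar data could look like this:
-- minimal sketch: count how many times each blood sugar value occurs
-- assumes one integer reading per line in blood-sugar.log
readings = LOAD 'blood-sugar.log' USING PigStorage() AS (sugar:int);
by_value = GROUP readings BY sugar;
counts = FOREACH by_value GENERATE group AS sugar, COUNT(readings) AS occurrences;
DUMP counts;
This would print pairs such as the (value, occurrences) tuples shown at the start of this article.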