Previously, we talked about IBM Demo Cloud, a free server with SSH access for learning Hadoop, Pig, Hive and so on without running your own installation. Here are basic commands showing how to process healthcare data in Hadoop and Pig using IBM Demo Cloud. Unlike server logs, healthcare data lacks a universal format. For server logs we can easily distribute scripts that will work anywhere on earth, but healthcare data, such as diabetes or blood sugar readings, is rarely distributed for educational purposes. As with a server log, we treat one line as one entry. We provide an example script written for server logs; the reader needs to customize it according to the format of their own plain-text data. If the data source has blood sugar values ranging from 120 to, theoretically, infinity, with some values occurring multiple times, we will get a list of values followed by the number of occurrences, like:
... (216,1) (320,6) (297,2) (276,1) (278,3) ...
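For illustration, and assuming the source is a plain text file with one reading per line (an assumption about your data, not a fixed format), blood-sugar.log might contain hypothetical values like these:
216
320
297
320
278
A minimal Pig sketch that turns such a file into those (value, occurrences) pairs is given at the end of this article.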
How To Process Healthcare Data in Hadoop, Pig
Let us assume that the file's name is blood-sugar.log. First, we will feed the data to Hadoop with this command:
hadoop fs -put blood-sugar.log
Depending on the setup, we may face an error after running the above command. In that case, we need to append the destination location, where diskX is the disk number and USERNAME is something like admin1234, in this way:
hadoop fs -put blood-sugar.log /diskX/home/USERNAME/
If we run:
hadoop fs -ls -R
we will get output like this, indicating success:
drwxrwxrwx+   - admin admin        0 2018-05-01 17:04 .staging
-rwxrwxrwx+   3 admin admin 12183708 2018-05-01 16:33 blood-sugar.log
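Optionally, as a quick sanity check (not a required step), you can print the first few lines of the uploaded file from HDFS; adjust the path if you had to use the /diskX/home/USERNAME/ form above:
hadoop fs -cat blood-sugar.log | head -5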
Now, we will create a script named script.pig at the same location with the following content. This script is written for a server log and needs editing depending on the layout of your data:
-- load the Apache common log loader from piggybank
DEFINE ApacheCommonLogLoader org.apache.pig.piggybank.storage.apachelog.CommonLogLoader();
-- load the log file and name its fields
logs = LOAD '/path/to/USERNAME/access.log.1' USING ApacheCommonLogLoader
    AS (addr: chararray, logname: chararray, user: chararray, time: chararray,
        method: chararray, uri: chararray, proto: chararray, status: int, bytes: int);
-- group entries by client address and count requests per address
addrs = GROUP logs BY addr;
counts = FOREACH addrs GENERATE flatten($0), COUNT($1) as count;
DUMP counts;
You need to change /path/to/USERNAME/access.log.1 in the above example to the real path and file name, then save the file. Now run this command:
locate piggybank.jar
At the end of the output, you'll get something like this:
...
/usr/iop/4.1.0.0/hive/lib/piggybank.jar
/usr/iop/4.1.0.0/pig/piggybank.jar
/usr/iop/4.1.0.0/pig/lib/piggybank.jar
We will use the /usr/iop/4.1.0.0/pig/piggybank.jar path. We will run the pig command to bring up the Grunt shell:
pig
Then, run this command and quit:
REGISTER '/usr/iop/4.1.0.0/pig/piggybank.jar';
quit
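If the jar does not stay registered for later runs, a common alternative is to put the REGISTER line at the top of script.pig itself, so it is loaded every time the script runs:
REGISTER '/usr/iop/4.1.0.0/pig/piggybank.jar';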
Then we will run the Pig script (which we provided above for a server log, and which you'll edit to meet your needs):
pig -x local script.pig
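Note that -x local runs Pig in local mode against the local filesystem. To run the job on the cluster against the data you put into HDFS earlier, MapReduce mode can be used instead (adjust the LOAD path in the script accordingly):
pig -x mapreduce script.pig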
This will return the intended output.
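The script above targets Apache-style server logs. As a minimal sketch of the customization mentioned at the beginning, assuming blood-sugar.log holds one integer reading per line (an assumption about your data) and using hypothetical alias names, a version for the blood sugar data could look like this:
-- minimal sketch: count how many times each blood sugar value occurs
-- assumes one integer reading per line in blood-sugar.log
readings = LOAD 'blood-sugar.log' USING PigStorage() AS (sugar:int);
by_value = GROUP readings BY sugar;
counts = FOREACH by_value GENERATE group AS sugar, COUNT(readings) AS occurrences;
DUMP counts;
This would print pairs such as the (value, occurrences) tuples shown at the start of this article.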