Apache DataFu is a collection of well-tested libraries for data mining and statistics. It has two parts: Apache DataFu for Pig, a collection of user-defined functions (UDFs) for data analysis in Apache Pig, and Hourglass, an incremental processing framework for Apache Hadoop MapReduce. Apache DataFu for Pig has been used in production workflows at LinkedIn and is also included in Cloudera’s CDH and Apache Bigtop. This guide shows how to install Apache DataFu on a Debian system.
How to Install Apache DataFu: Steps
This is the official DataFu repository on GitHub:
    https://github.com/apache/datafu
There is also a repository on GitHub by LinkedIn, which has not been updated for a long time. This is Apache’s official subdomain for DataFu:
    https://datafu.apache.org/
The official wiki provides examples of how to use the functions for Statistics (median, quantiles, variance), Bag Operations (join, prepend, append, count items, concat), Set Operations (set intersection, union, difference), Sessions (sessionize streams of data), Sampling (random sample with/without replacement, weighted sample), Hashing (SHA and MD5), Link Analysis and so on.
The guides on our website assume installation on your own server from the original Apache repositories. You should therefore have a running system with Apache Hadoop, Apache Pig and Apache Bigtop installed. We do not show the commands to install and configure Java or the Bash profile for them.
From Apt on Debian or Ubuntu
This method is easy, but we suspect it will not suit most systems, and the version available through apt may be old. It probably works only with the CDH packages:
    sudo apt-get install pig-udf-datafu
    sudo updatedb
A file named datafu-0.0.X-cdhY.0.0.jar will appear in /usr/lib/pig, where X and Y stand for version numbers. You can find it with the locate command:
    locate datafu-0.0
Register the JAR. Replace the version string with the current DataFu and CDH version numbers.
    REGISTER /usr/lib/pig/datafu-<DataFu_version>-cdh<CDH_version>.jar
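To avoid typing the version string by hand, the lookup and the REGISTER line can be combined in a small shell helper. This is only a sketch: the function name datafu_register_line is hypothetical, the default directory is the /usr/lib/pig path mentioned above, and sort -V (version sort) is assumed to be available, as it is in GNU coreutils.

```shell
#!/bin/sh
# Print a Pig REGISTER statement for the newest DataFu JAR in a directory.
# datafu_register_line is a hypothetical helper, not part of DataFu or Pig.
datafu_register_line() {
    dir="${1:-/usr/lib/pig}"
    # Version-sort the matches so the last one is the highest version.
    # Assumes file names without spaces or newlines.
    jar=$(ls "$dir"/datafu-*.jar 2>/dev/null | sort -V | tail -n 1)
    if [ -z "$jar" ]; then
        echo "no DataFu JAR found in $dir" >&2
        return 1
    fi
    # Terminate the statement with a semicolon, as Pig scripts usually do.
    printf 'REGISTER %s;\n' "$jar"
}
```

Calling datafu_register_line with no argument searches /usr/lib/pig; the printed line can be pasted at the top of a Pig script.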
Building from source
The official source code download is at:
    http://www-us.apache.org/dist/datafu/
Download the latest release file with wget, or get a release from GitHub with git. We also need Gradle:
    https://gradle.org/install/
Replace the version numbers to match the file you downloaded:
    tar -xzvf apache-datafu-sources-1.4.0.tgz
    cd apache-datafu-sources-1.4.0
    gradle -b bootstrap.gradle
    ./gradlew assemble
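Since only the version number changes between releases, the same four commands could be wrapped in a function that takes the tarball path. The sketch below is an assumption-laden convenience, not an official script: the names datafu_src_dir and build_datafu are hypothetical, and it assumes the archive unpacks into a directory named after the tarball, as in the commands above.

```shell
#!/bin/sh
# Derive the source directory name from a tarball path, e.g.
# /tmp/apache-datafu-sources-1.4.0.tgz -> apache-datafu-sources-1.4.0
datafu_src_dir() {
    basename "$1" .tgz
}

# Unpack and build a DataFu source release (hypothetical wrapper).
# The body runs in a subshell so the caller's working directory is unchanged,
# and the && chain stops at the first failing step.
build_datafu() (
    tarball="$1"
    src_dir=$(datafu_src_dir "$tarball")
    tar -xzf "$tarball" &&
    cd "$src_dir" &&
    gradle -b bootstrap.gradle &&
    ./gradlew assemble
)
```

For example, build_datafu apache-datafu-sources-1.4.0.tgz would unpack into the current directory and build there; a non-zero exit status means one of the steps failed.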
The procedure will produce JARs in the following directories:
    datafu-pig/build/libs
    datafu-hourglass/build/libs
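A quick way to confirm the build produced something is to list the JARs in those two directories. The helper below is a minimal sketch; list_datafu_jars is a hypothetical name, and it takes the root of the unpacked source tree as its argument.

```shell
#!/bin/sh
# List the JARs produced by the build, given the root of the source tree.
# list_datafu_jars is a hypothetical helper name.
list_datafu_jars() {
    root="${1:-.}"
    for module in datafu-pig datafu-hourglass; do
        libs="$root/$module/build/libs"
        if [ -d "$libs" ]; then
            # Only regular files ending in .jar, directly inside build/libs
            find "$libs" -maxdepth 1 -type f -name '*.jar' | sort
        else
            echo "missing: $libs (did ./gradlew assemble succeed?)" >&2
        fi
    done
}
```

Run from inside the source directory, list_datafu_jars . prints one JAR path per line and reports a missing directory on standard error.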
The DataFu artifacts can be installed to the local Maven repository:
    ./gradlew install
We can run the tests for the Pig module:
    ./gradlew :datafu-pig:test
If the local Maven repository is at ~/foo, the artifacts will be under ~/foo/repository/org/apache/datafu/. For Maven, we need these settings:
    <dependency>
      <groupId>org.apache.datafu</groupId>
      <artifactId>datafu-pig</artifactId>
      <version>1.4.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.datafu</groupId>
      <artifactId>datafu-hourglass</artifactId>
      <version>1.4.0</version>
    </dependency>
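Before relying on these dependencies, it may help to check that the matching JARs are actually in the local repository. The sketch below assumes the standard Maven layout (group path, then artifact, then version, then artifact-version.jar) and defaults to Maven's usual ~/.m2/repository location; the function name datafu_in_local_repo is hypothetical.

```shell
#!/bin/sh
# Succeed if the given DataFu artifact and version exist in a local Maven repository.
# datafu_in_local_repo is a hypothetical helper; the default repository path
# assumes Maven's standard ~/.m2/repository location.
datafu_in_local_repo() {
    artifact="$1"   # e.g. datafu-pig or datafu-hourglass
    version="$2"    # e.g. 1.4.0
    repo="${3:-$HOME/.m2/repository}"
    # Standard Maven layout: <groupId as path>/<artifactId>/<version>/<artifactId>-<version>.jar
    jar="$repo/org/apache/datafu/$artifact/$version/$artifact-$version.jar"
    [ -f "$jar" ]
}
```

For a repository at ~/foo, a check such as datafu_in_local_repo datafu-pig 1.4.0 ~/foo/repository returns success only when the JAR is present.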
After this setup, it is possible to follow the official getting started guide:
    http://datafu.apache.org/docs/datafu/getting-started.html