Previously, we talked about the Apache Hadoop framework. Here is how to install Apache Hadoop on Ubuntu on a single cloud server instance in stand-alone mode, with the minimum system requirements and commands. Apache Hadoop is designed to run on commodity hardware that provides the best balance of performance and economy for a given workload.
Where Will I Install Apache Hadoop?
For a real cluster, two quad-core or hexa-core (or higher) CPUs running at least 2 GHz with 64 GB of RAM are expected per node. We are installing it as a single node cluster. A minimum of 6 to 8 GB of RAM on a virtual instance is practical. You can try a VPSDime 6 GB OpenVZ instance at $7/month. However, Hadoop is written in Java and OpenVZ is not exactly great for running Java applications; the host can kick you out if you drive their machine to a high load average. If you want VMware, then Aruba Cloud is cost effective and great. You can do testing and learning work on OpenVZ, but it is not practical to run high-load work on OpenVZ.
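Before proceeding, it is worth checking how much memory and swap your instance actually has; a quick look with free is enough (a generic check we are adding here, not tied to any particular provider) :

free -m    # total and available RAM in MB; aim for roughly 6 GB or more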
Steps To Install Apache Hadoop on Ubuntu on Single Cloud Server Instance
We will install a single-node Hadoop cluster on Ubuntu 16.04 LTS. First, prepare the server :
cd ~
apt update
apt upgrade
apt install default-jdk
default-jdk installs OpenJDK, which is the default Java Development Kit on Ubuntu 16.04. Now check the Java version :
java -version
Sample output :
openjdk version "1.8.0_91"
OpenJDK Runtime Environment (build 1.8.0_91-8u91-b14-3ubuntu1~16.04.1-b14)
OpenJDK 64-Bit Server VM (build 25.91-b14, mixed mode)
We will create a group named hadoop and add a user named hduser :
sudo addgroup hadoop
sudo adduser --ingroup hadoop hduser
Next we will install extra software, switch to hduser, generate a key, and set up passwordless SSH for hduser on localhost :
apt install ssh rsync
su hduser
ssh-keygen -t rsa -P ""
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
ssh localhost
exit
exit
sudo adduser hduser sudo
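Depending on the default umask, OpenSSH may refuse the key if permissions on the .ssh directory or the authorized_keys file are too open. Tightening them while logged in as hduser is a safe extra step (a general hardening tip added here, not part of the original commands) :

chmod 0700 $HOME/.ssh                    # only hduser may access the directory
chmod 0600 $HOME/.ssh/authorized_keys    # only hduser may read or write the key file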
Here are the releases of Apache Hadoop :
http://hadoop.apache.org/releases.html
https://dist.apache.org/repos/dist/release/hadoop/common/
Apache Hadoop 2.7.3 is the latest stable release at the time of publishing this guide. We will do these steps :
wget https://dist.apache.org/repos/dist/release/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz
tar xvzf hadoop*
rm hadoop-2.7.3.tar.gz
cd hadoop-2.7.3
sudo mkdir -p /usr/local/hadoop
sudo mv * /usr/local/hadoop
sudo chown -R hduser:hadoop /usr/local/hadoop
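To confirm the files landed where we expect, a quick listing should show the usual Hadoop directory layout (just a sanity check, not a required step) :

ls /usr/local/hadoop    # expect bin, etc, lib, sbin, share among others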
/usr/bin/java is a symlink to /etc/alternatives/java, which is a symlink to the default Java binary. We need the correct value for JAVA_HOME :
readlink -f /usr/bin/java | sed "s:bin/java::"
If the output is :
/usr/lib/jvm/java-8-openjdk-amd64/jre/
then we should open :
nano /usr/local/hadoop/etc/hadoop/hadoop-env.sh
and adjust :
#export JAVA_HOME=${JAVA_HOME}
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/
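If you prefer to make this change non-interactively, a sed one-liner like the sketch below should work, assuming hadoop-env.sh still contains the stock export JAVA_HOME line and that the path reported by readlink above is correct for your system :

# replace the existing JAVA_HOME export with the detected path
sed -i 's|^export JAVA_HOME=.*|export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/|' /usr/local/hadoop/etc/hadoop/hadoop-env.sh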
Now if we run :
/usr/local/hadoop/bin/hadoop
We will get output like :
Usage: hadoop [--config confdir] [COMMAND | CLASSNAME]
  CLASSNAME            run the class named CLASSNAME
Up to this step is the minimum, basic setup of Apache Hadoop on Ubuntu on a single cloud server instance. It means Hadoop is ready to be configured.
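As a quick sanity check before moving on, printing the version should show 2.7.3 and confirm that the JAVA_HOME we just set is being picked up (hadoop version is a standard subcommand) :

/usr/local/hadoop/bin/hadoop version    # expect a line like "Hadoop 2.7.3"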
Configuring Apache Hadoop
We need to modify the following files to get a complete Apache Hadoop setup:
~/.bashrc
/usr/local/hadoop/etc/hadoop/hadoop-env.sh
/usr/local/hadoop/etc/hadoop/core-site.xml
/usr/local/hadoop/etc/hadoop/mapred-site.xml.template
/usr/local/hadoop/etc/hadoop/hdfs-site.xml
Run :
update-alternatives --config java
nano ~/.bashrc
Add these :
#HADOOP START
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
#HADOOP END
Save the file. Run :
javac -version
which javac
readlink -f /usr/bin/javac
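Separately, to load the new variables from ~/.bashrc into the current shell and confirm that the Hadoop binaries are now on the PATH, something like this should work (assuming the ~/.bashrc you edited belongs to the user you are logged in as) :

source ~/.bashrc    # reload the environment in the current shell
hadoop version      # should resolve via $HADOOP_INSTALL/bin and print Hadoop 2.7.3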
Note the values from those javac commands; /usr/bin/javac is the output of the which javac command. Run :
nano /usr/local/hadoop/etc/hadoop/hadoop-env.sh
Modify :
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
The above is from previous outputs. Do not blindly copy-paste. Save the file. Now do these :
sudo mkdir -p /app/hadoop/tmp
sudo chown hduser:hadoop /app/hadoop/tmp
Open :
nano /usr/local/hadoop/etc/hadoop/core-site.xml
Modify :
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/app/hadoop/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
    <description>The name of the default file system. A URI whose
      scheme and authority determine the FileSystem implementation. The
      uri's scheme determines the config property (fs.SCHEME.impl) naming
      the FileSystem implementation class. The uri's authority is used to
      determine the host, port, etc. for a filesystem.</description>
  </property>
</configuration>
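Once this file is saved, you can check that Hadoop actually reads the value with the getconf tool; fs.default.name is the deprecated spelling of fs.defaultFS, so querying that key should return the same URI (a quick check we are adding here) :

/usr/local/hadoop/bin/hdfs getconf -confKey fs.defaultFS    # expect hdfs://localhost:54310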
Run :
cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml
nano /usr/local/hadoop/etc/hadoop/mapred-site.xml

Modify /usr/local/hadoop/etc/hadoop/mapred-site.xml :

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:54311</value>
    <description>The host and port that the MapReduce job tracker runs
      at. If "local", then jobs are run in-process as a single map
      and reduce task.
    </description>
  </property>
</configuration>
Run :
sudo mkdir -p /usr/local/hadoop_store/hdfs/namenode
sudo mkdir -p /usr/local/hadoop_store/hdfs/datanode
sudo chown -R hduser:hadoop /usr/local/hadoop_store
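A quick listing confirms that both directories exist and are owned by hduser:hadoop before we point HDFS at them (just a sanity check) :

ls -ld /usr/local/hadoop_store/hdfs/namenode /usr/local/hadoop_store/hdfs/datanode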
Open :
nano /usr/local/hadoop/etc/hadoop/hdfs-site.xml
Modify :
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <description>Default block replication.
      The actual number of replications can be specified when the file is created.
      The default is used if replication is not specified in create time.
    </description>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/usr/local/hadoop_store/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/usr/local/hadoop_store/hdfs/datanode</value>
  </property>
</configuration>
Try to run :
cd ~
hadoop namenode -format
The above command must be executed before we start using Hadoop for the first time. Basically, these commands are meant for a real physical server. You can read this guide :
https://wiki.apache.org/hadoop/Virtual%20Hadoop
The last command can fail on a given host-virtualisation technology. For that reason, in the last step we will show how to use the bundled MapReduce example program; if the above fails, you can still use Hadoop in that way. As you may be a new user with a limited budget, we tried to emulate physical servers for learning and also offer a universal working example.
Now, from a fresh SSH session, we can use it like this :
sudo su hduser
cd /usr/local/hadoop/sbin && ls
start-all.sh
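If the daemons started, the jps tool from the JDK should list the running Java processes; on a working single-node setup you would typically see entries such as NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager (this check is an addition to the original steps) :

jps    # lists running Java processes with their PIDs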
Actually, on localhost you can browse to :
http://localhost:50070/
You need to replace localhost with the server's fully qualified domain name to really see it. We have successfully configured Hadoop to run in stand-alone mode. We will now run the example MapReduce program. Run :
mkdir ~/input
cp /usr/local/hadoop/etc/hadoop/*.xml ~/input
/usr/local/hadoop/bin/hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar grep ~/input ~/grep_example 'principal[.]*'
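When the job finishes, the output directory holds the results of the grep example; viewing them is a simple cat (the exact counts depend on the contents of your configuration files) :

cat ~/grep_example/*    # shows each match of the regex with its count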
It is not possible to cover more in this guide; you may read further here :
https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html