Apache Kafka is a stream-processing platform that aims to provide low-latency handling of real-time data feeds. Its storage layer is a massively scalable pub/sub message queue architected as a distributed transaction log, which makes it valuable for processing streaming data. Kafka can also connect to external systems for data import and export, and it plays a part in Big Data analysis. Here are the steps to install Apache Kafka on Ubuntu 16.04 running on a single cloud server instance. We have a list of tutorials on Big Data cloud tools, but Kafka is preferred among many of those tools for various reasons, and its installation closely resembles that of other Apache Big Data tools.
Install Apache Kafka on Ubuntu 16.04
Apache Kafka needs a Java runtime environment and a user with sudo privileges. In this example, we will use the user name kafka. The adduser command will automatically create the user, its initial group, and a home directory. Run these commands one by one:

```shell
adduser kafka
id kafka
ls -lad /home/kafka/
```
Now we will grant the user sudo permission by adding it to the sudoers list:
```shell
echo 'kafka ALL=(ALL) ALL' >> /etc/sudoers
```
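Appending to /etc/sudoers directly requires you to already be root, and a syntax error in that file can lock you out of sudo entirely. As an alternative sketch, Ubuntu's stock sudo group grants the same privilege without touching the file by hand:

```shell
# add kafka to the sudo group instead of editing /etc/sudoers directly
usermod -aG sudo kafka
# confirm the group membership took effect
id kafka
```

Either approach works; the group-based one is easier to undo later with `deluser kafka sudo`.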
Type exit to end the session, then SSH back in with that username:

```shell
ssh kafka@your-IP-address
```
After login, type:

```shell
sudo su
```
Now you'll become the root user. At some point you may want to block SSH access for kafka. Open this file:

```shell
nano /etc/ssh/sshd_config
```

If needed, deny SSH access to kafka by adding a line with this syntax:

```shell
DenyUsers kafka
```
Save the file and restart the SSH service (on Ubuntu the service is named ssh, not sshd):

```shell
sudo service ssh restart
```
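Before restarting, it is worth validating the edited sshd_config, since a typo in a directive like DenyUsers can lock everyone out of SSH. OpenSSH ships a test mode for exactly this:

```shell
# -t performs a syntax check of sshd_config; it prints nothing on success
sudo sshd -t && echo "sshd_config OK"
```

Only restart the service once this check passes.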
During the initial installation, do not make these changes in /etc/ssh/sshd_config. You'll install the software by SSHing to the system as any permitted user and running su kafka to become kafka.
Next we will install Java. There are two ways; one is using oracle-java8:

```shell
sudo add-apt-repository -y ppa:webupd8team/java
sudo apt update
sudo apt install oracle-java8-installer -y
```
The other is the default Java runtime provided by apt:

```shell
sudo apt install default-jre
```
Follow whichever way you want, then verify that JDK 8 is installed properly:

```shell
java -version
```
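If you want to confirm the major version from a script rather than by eye, a small sketch like this parses the `java -version` output (the 1.8 check assumes the usual Oracle/OpenJDK version-string format):

```shell
# java -version prints to stderr, so redirect it; extract the quoted version string
ver=$(java -version 2>&1 | awk -F '"' '/version/ {print $2; exit}')
echo "Detected Java version: $ver"
case "$ver" in
  1.8*) echo "Java 8 found, suitable for Kafka 0.10.x" ;;
  *)    echo "Warning: expected Java 8, found: $ver" ;;
esac
```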
A Kafka cluster depends on ZooKeeper to perform operations such as detecting failed nodes. Kafka brokers need it to form a cluster, and configuration is stored in ZooKeeper nodes. Newer versions of Kafka have decoupled the clients (consumers and producers) from ZooKeeper, but we still need ZooKeeper to run the Kafka brokers. Once we install Kafka, we could use the ZooKeeper bundled with it, but here we will use the ZooKeeper package available in Ubuntu's repository:
```shell
sudo apt install zookeeperd
```
ZooKeeper will be started as a daemon automatically. By default, it listens on port 2181:

```shell
sudo netstat -nlpt | grep ':2181'
```
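Beyond checking the port, ZooKeeper answers simple four-letter-word commands over that same socket; ruok should come back as imok from a healthy server. This sketch assumes netcat is installed:

```shell
# ask ZooKeeper "are you ok?"; a healthy server replies "imok"
echo ruok | nc -q 1 localhost 2181
```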
Now go to:

```shell
cd /opt
```

We will install the current stable version from http://kafka.apache.org/downloads.html.
Version 0.10.2.1 is the latest at the time of publication. We are showing the example with one mirror; browse http://www-eu.apache.org/dist/kafka/ in your browser and click the 0.10.2.1 link. These are the files:
```
javadoc/                            2017-04-27 01:29     -
RELEASE_NOTES.html                  2017-04-27 01:29  5.4K
kafka-0.10.2.1-src.tgz              2017-04-27 01:29  3.8M
kafka_2.10-0.10.2.1-site-docs.tgz   2017-04-27 01:29  1.9M
kafka_2.10-0.10.2.1.tgz             2017-04-27 01:29   37M
kafka_2.11-0.10.2.1-site-docs.tgz   2017-04-27 01:29  1.9M
kafka_2.11-0.10.2.1.tgz             2017-04-27 01:29   36M
kafka_2.12-0.10.2.1-site-docs.tgz   2017-04-27 01:29  1.9M
kafka_2.12-0.10.2.1.tgz             2017-04-27 01:29   32M
...
```
You know that version 0.10.2.1 is the latest. We want a pre-built binary release rather than the source archive (the -src.tgz would need to be compiled first), so /0.10.2.1/kafka_2.11-0.10.2.1.tgz (built for Scala 2.11) will be your file:

```shell
# create the target directory and check it
sudo mkdir -p /usr/local/kafka
ls /usr/local/kafka
cd /opt
curl -O http://www-eu.apache.org/dist/kafka/0.10.2.1/kafka_2.11-0.10.2.1.tgz
# --strip-components=1 puts bin/, config/ etc. directly under /usr/local/kafka
sudo tar -xvf kafka_2.11-0.10.2.1.tgz -C /usr/local/kafka --strip-components=1
ls /usr/local/kafka
rm kafka_2.11-0.10.2.1.tgz
```
Open this file and look for delete.topic.enable:

```shell
nano /usr/local/kafka/config/server.properties
```
Make it like this:

```
delete.topic.enable = true
```
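If you prefer to script this instead of editing with nano, a small helper like the following appends the setting only when no delete.topic.enable line exists yet (the function name is mine, and the path assumes the layout above):

```shell
# enable_topic_delete FILE: add "delete.topic.enable = true" if the key is absent
enable_topic_delete() {
  cfg=$1
  if ! grep -q '^delete.topic.enable' "$cfg" 2>/dev/null; then
    echo 'delete.topic.enable = true' >> "$cfg"
  fi
}

# on the server you would run:
# enable_topic_delete /usr/local/kafka/config/server.properties
```

Running it twice is safe; the grep guard makes it idempotent.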
Save the file. Start Kafka by running the kafka-server-start.sh script:

```shell
sudo /usr/local/kafka/bin/kafka-server-start.sh /usr/local/kafka/config/server.properties
```
Now, we can check the listening ports of ZooKeeper and Kafka (the -p flag, which shows process names, needs root):

```shell
sudo netstat -antp | grep -E ':2181|:9092'
```
Expected output:

```
tcp6       0      0 :::9092      :::*      LISTEN      1367/java
tcp6       0      0 :::2181      :::*      LISTEN      -
```
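Note that kafka-server-start.sh keeps running in the foreground and stops when you close the terminal. If you want the broker to survive your SSH session, one simple hedge is nohup (a proper systemd unit would be the cleaner long-term choice; the log path here is my own pick):

```shell
# start the broker detached from the terminal and log its output
sudo nohup /usr/local/kafka/bin/kafka-server-start.sh \
  /usr/local/kafka/config/server.properties > /var/log/kafka-server.log 2>&1 &
```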
To stop Kafka, run the kafka-server-stop.sh script in the same way we started it above. Of course, you can now create a topic:

```shell
/usr/local/kafka/bin/kafka-topics.sh --create --topic topic-test --zookeeper localhost:2181 --partitions 1 --replication-factor 1
```
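To verify the topic end to end, Kafka ships console producer and consumer scripts, so you can pipe a test message through the broker (the consumer runs until you press Ctrl+C):

```shell
# list topics to confirm topic-test exists
/usr/local/kafka/bin/kafka-topics.sh --list --zookeeper localhost:2181

# publish a message ...
echo "hello kafka" | /usr/local/kafka/bin/kafka-console-producer.sh \
  --broker-list localhost:9092 --topic topic-test

# ... and read it back from the beginning of the topic
/usr/local/kafka/bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 \
  --topic topic-test --from-beginning
```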
Install Apache Kafka on Ubuntu 16.04 : Configure With Spark
I am showing the additional steps:

```shell
wget https://dl.bintray.com/sbt/debian/sbt-0.13.11.deb
sudo dpkg -i sbt-0.13.11.deb
sudo apt update
sudo apt install sbt
wget http://d3kbcqa49mib13.cloudfront.net/spark-2.0.0-bin-hadoop2.7.tgz
tar xvf spark-2.0.0-bin-hadoop2.7.tgz
sudo mv spark-2.0.0-bin-hadoop2.7 /usr/local/spark
nano ~/.profile
```
Add the Spark configuration to your profile:

```shell
# set PATH so it includes user's private bin directories
PATH="/usr/local/spark/bin:$HOME/bin:$HOME/.local/bin:$PATH"
export PYSPARK_PYTHON=python3
```
Source it:

```shell
source ~/.profile
```
Test the configuration:

```shell
pyspark
```
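If pyspark opens its interactive shell, Spark is on your PATH. For a non-interactive smoke test, the Spark distribution bundles example jobs you can launch through the run-example helper:

```shell
# compute an approximation of pi with the bundled SparkPi example
/usr/local/spark/bin/run-example SparkPi 10
```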
The rest you should read from https://spark.apache.org/docs/latest/streaming-kafka-integration.html.