Apache Airflow possibly needs a small introduction. It is a platform to programmatically author, schedule, and monitor workflows, commonly used to orchestrate jobs across servers running Big Data tools. The Airflow scheduler executes tasks on an array of workers while following the specified dependencies, and command-line utilities are included. Similar technology is behind Luigi, Azkaban, Oozie, etc.; Luigi is simpler in scope than Apache Airflow. With Apache Airflow one authors workflows as directed acyclic graphs (DAGs) of tasks. Here are the steps for installing Apache Airflow on an Ubuntu or CentOS cloud server.
Installing Apache Airflow On Ubuntu, CentOS Cloud Server
You can install Apache Airflow anywhere, including your Mac, because it is just a Python package with optional submodules. On your Mac, if you run:
pip search airflow
then you’ll get this kind of result:
abel-airflow (1.7.1.3.post3)      - Programmatically author, schedule and monitor data pipelines
airflow-alt-ldap (0.0.1)          - Alternative LDAP auth backend for airflow to support openLDAP installation without memberOf overlay
airflow-declarative (1.0)         - Airflow DAGs done declaratively
apache-airflow (1.8.2rc1)         - Programmatically author, schedule and monitor data pipelines
airflow (1.8.0)                   - Programmatically author, schedule and monitor data pipelines
airflow-imaging-plugins (2.4.1)   - Airflow plugins to support Neuroimaging tasks.
airflow_plugin_honeypot (1.0.0)   - Airflow plugin that captures, parses, and visualizes Hive log files
airflow_utils (0.3)               - collection of helpers to create airflow dags
AirflowOnTheDumpTruck (0.1.0)     - AirflowOnTheDumptruck bunch of airflow operators and hooks
fairflow (0.1.2)                  - Functional airflow.
hovertools (0.1.6)                - Tools for airflow-hovercraft
ussd_airflow (0.0.4.7)            - Ussd Airflow Library
These details are documented here:
https://airflow.incubator.apache.org/installation.html
As we are starting with a possibly blank server running a server OS like Ubuntu, we need to install these packages first:
apt update
apt upgrade
apt install unzip build-essential python-dev libsasl2-dev python-pandas
apt install binutils gcc
For CentOS, the commands will be like this:
yum update
yum groupinstall "Development tools"
yum install zlib-devel bzip2-devel openssl-devel ncurses-devel sqlite-devel python-devel wget cyrus-sasl-devel.x86_64
If you want a message broker, then you will need to install RabbitMQ. It is optional, so we only sketch the steps below.
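A rough sketch of a RabbitMQ installation, assuming the stock rabbitmq-server package (pulled from EPEL on CentOS) and a systemd-based distribution:

# Ubuntu
apt install rabbitmq-server
# CentOS (package lives in EPEL)
yum install -y epel-release
yum install -y rabbitmq-server
# enable and start the broker
systemctl enable rabbitmq-server
systemctl start rabbitmq-server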
If you want to use MySQL as the metadata database, you will need to install some MySQL dependencies:
apt install python-dev libmysqlclient-dev
# or on CentOS:
yum install -y mysql-devel mariadb-devel python-devel python-setuptools
pip install MySQL-python
From here on, the steps are common for any operating system. Run:
python -V
We need Python 2.7.x; that's it. Then install pip:
wget https://bootstrap.pypa.io/ez_setup.py
python ez_setup.py
unzip setuptools-*.zip
cd setuptools-*
easy_install pip
# Airflow needs it
pip install -U boto
Installing Airflow or its components is easy:
pip install airflow==1.8.0
pip install airflow[hive]==1.8.0
pip install airflow[celery]==1.8.0
...
Commonly you’ll need this set:
pip install airflow[async,celery,crypto,druid,jdbc,hdfs,hive,kerberos,ldap,mysql,password,rabbitmq,vertica]
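To confirm that the installation worked, you can ask the CLI for its version (the output will vary with the release you installed):

airflow version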
Possibly you’ll set the Airflow home directory, $AIRFLOW_HOME, from some sort of dot file:
export AIRFLOW_HOME=~/airflow
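For example, to persist that setting in a dot file (assuming bash is your login shell and ~/.bashrc is where you keep such exports):

# append the export to ~/.bashrc and reload it
echo 'export AIRFLOW_HOME=~/airflow' >> ~/.bashrc
source ~/.bashrc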
Initialize Airflow:
airflow initdb
Set up MySQL: create an airflow database and grant all privileges on it to the user Airflow will connect as. A minimal sketch follows.
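A minimal sketch, assuming a MySQL/MariaDB server on the same host; the airflow user name and the CHANGE_ME password are placeholders you should replace:

# run the SQL as the MySQL root user (you will be prompted for its password)
mysql -u root -p <<'SQL'
CREATE DATABASE airflow CHARACTER SET utf8;
GRANT ALL PRIVILEGES ON airflow.* TO 'airflow'@'localhost' IDENTIFIED BY 'CHANGE_ME';
FLUSH PRIVILEGES;
SQL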
In that Airflow home directory you need to have dags and logs directories:
mkdir dags
mkdir logs
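To have something to look at later in the web UI, you can drop a small example DAG into the dags directory. This is only an illustrative sketch; the hello_world DAG id, the task, and the schedule are all made up:

# write a hypothetical example DAG file into $AIRFLOW_HOME/dags
cat > $AIRFLOW_HOME/dags/hello_world.py <<'EOF'
# Example DAG: one BashOperator task that echoes a message once a day
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id='hello_world',
    start_date=datetime(2017, 1, 1),
    schedule_interval='@daily',
)

say_hello = BashOperator(
    task_id='say_hello',
    bash_command='echo "Hello from Airflow"',
    dag=dag,
)
EOF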
In the Airflow home directory there is an airflow.cfg file. It starts like this:
[core]
# The home folder for airflow, default is ~/airflow
airflow_home = /usr/local/airflow
...
It will need to have settings like these:
executor = CeleryExecutor
sql_alchemy_conn = mysql://{USERNAME}:{PASSWORD}@{MYSQL_HOST}:3306/airflow
dags_are_paused_at_creation = True
load_examples = False
celery_result_backend = db+mysql://{USERNAME}:{PASSWORD}@{MYSQL_HOST}:3306/airflow
# optional
default_queue = {YOUR_QUEUE_NAME_HERE}
# for RabbitMQ
broker_url = amqp://guest:guest@{RABBITMQ_HOST}:5672/
# for AWS SQS:
broker_url = sqs://{ACCESS_KEY_ID}:{SECRET_KEY}@
Again run:
airflow initdb
Now the last steps:
# start server
nohup airflow webserver $* >> ~/airflow/logs/webserver.logs &
# start Celery workers
nohup airflow worker $* >> ~/airflow/logs/worker.logs &
# start Scheduler
nohup airflow scheduler >> ~/airflow/logs/scheduler.logs &
# start Flower
nohup airflow flower >> ~/airflow/logs/flower.logs &
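To quickly verify that everything came up, you can check for the processes and poke the web server locally (a rough check, assuming the default port 8080):

# list running airflow processes
ps -ef | grep -v grep | grep airflow
# confirm the webserver answers on localhost
curl -I http://localhost:8080/admin/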
You can now navigate to the Airflow web UI in a browser:
http://IP.Address:8080/admin/
And the Flower web UI:
http://IP.Address:5555/