This is the second part of Installing Local Data Lake on Ubuntu Server for Prediction System. In this guide, we will prepare the data for processing by a Machine Learning or Business Intelligence (BI) tool. You must follow the first part of this series in order to understand this guide. Please adjust the version numbers, paths, usernames and passwords mentioned in this guide to match your setup.
Installing Local Data Lake on Ubuntu Server : Understanding The Structure
Here is an overview of a typical Data Lake architecture for a prediction system:
Data compatible with this guide can be downloaded to the Local Data Lake from this GitHub repo:
---
```
https://github.com/AbhishekGhosh/pyspark
```
Run these commands:
```shell
cd ~/tutorials/pyspark
git pull
ipython notebook
```
Select data-flows.ipynb. Next, we need to drop any old copy of the database, recreate it and load the dump:
```python
#! mysql -u hiveuser -phivepassword -D retail_db -e "DROP DATABASE retail_db" > /dev/null
import pymysql

db = pymysql.connect(host="localhost", user='hiveuser', password='hivepassword')
cur = db.cursor()
cur.execute("CREATE DATABASE retail_db")
cur.execute("USE retail_db")
cur.execute(open("dump.sql").read())
cur.close()
db.close()
```
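One caveat: with many DB-API drivers, including pymysql under its default client flags, `cursor.execute()` runs a single statement at a time, so feeding the whole dump.sql in one call can fail on multi-statement dumps. A minimal, hedged workaround is to split the script into individual statements first (a naive sketch that ignores edge cases such as semicolons inside quoted strings):

```python
def split_sql_statements(script):
    """Naively split an SQL script into statements on ';'.

    Caveat: this ignores semicolons inside string literals and comments,
    which is usually acceptable for simple mysqldump output but is not
    guaranteed for every dump file.
    """
    return [s.strip() for s in script.split(";") if s.strip()]

# Example with an inline two-statement script (stand-in for dump.sql):
statements = split_sql_statements(
    "CREATE TABLE t (id INT); INSERT INTO t VALUES (1);"
)
print(statements)
# Each statement would then be executed separately, e.g.:
# for stmt in statements:
#     cur.execute(stmt)
```

If the dump loads fine in one call on your setup, you can skip this step.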
and then verify the tables:
```python
import pymysql
import pandas as pd

db = pymysql.connect(host="localhost", user='hiveuser',
                     password='hivepassword', db="retail_db")
print(pd.read_sql("SHOW TABLES", db))
print(pd.read_sql("DESCRIBE customers", db))
print(pd.read_sql("SELECT * FROM customers LIMIT 5", db))
```
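The same `pandas.read_sql` pattern works against any DB-API connection, which is handy if you want to test the flow without a running MySQL server. Here is a self-contained sketch using an in-memory SQLite database with a tiny hypothetical customers table (the table contents are made up, purely for illustration):

```python
import sqlite3
import pandas as pd

# In-memory SQLite database standing in for the MySQL retail_db
db = sqlite3.connect(":memory:")
cur = db.cursor()
cur.execute("CREATE TABLE customers (customer_id INTEGER, customer_fname TEXT)")
cur.executemany("INSERT INTO customers VALUES (?, ?)",
                [(1, "Richard"), (2, "Mary")])
db.commit()

# Pull the query result straight into a DataFrame
df = pd.read_sql("SELECT * FROM customers LIMIT 5", db)
print(df)
db.close()
```

Once data lands in a DataFrame like this, it is ready for feature preparation with pandas.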
Check the output. The full notebook is available on GitHub:
```
https://github.com/AbhishekGhosh/pyspark/blob/master/data-flows.ipynb
```
You can get datasets to build a prediction system from the UCI Machine Learning Repository:
```
https://archive.ics.uci.edu/ml/datasets.html
```
We need to download the heart disease data files with wget:
```shell
wget https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data
wget https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.hungarian.data
wget https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.switzerland.data
wget https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.va.data
ls processed.*.data
```
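These `processed.*.data` files are comma-separated with no header row, and missing values are encoded as `?`. A minimal sketch of loading them into pandas follows; the column names are taken from the UCI heart-disease attribute documentation, and the two inline sample rows are stand-ins so the snippet runs without the download (when you have the real files, replace the `StringIO` object with the filename, e.g. `"processed.cleveland.data"`):

```python
import io
import pandas as pd

# 14 attributes per the UCI heart-disease dataset documentation;
# "num" is the diagnosis target (0 = no disease)
columns = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
           "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num"]

# Two sample rows in the files' format (stand-in for the downloaded data)
sample = io.StringIO(
    "63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0\n"
    "67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0,2\n"
)

# na_values="?" turns the dataset's missing-value marker into NaN
df = pd.read_csv(sample, names=columns, na_values="?")
print(df[["age", "chol", "num"]])
```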
As an example, if you import the heart disease dataset from there, you can follow this guide to build a prediction system:
```
http://nasdag.github.io/blog/2016/01/02/predicting-heart-disease-with-hadoop-spark-and-python/
```
There is no real difference between our tutorial series and that guide, except that we target more recent, compatible versions and the steps are distributed across many separate guides. This is a quite basic example of how we can load data and process it with the system. If you want to make it a realtime prediction system, it will become more complex and will need a dedicated server; the basic principle, however, remains the same.