New developers, particularly those interested in data analysis, frequently run into terminology rooted in the theory and practice of engineering and the analytical sciences. Because these developers come from a variety of domains, such phrases often confuse them. "What is data refining in big data?" sounds like an obvious question, yet the usual answer is written for people already versed in statistics and the analytical sciences: in data refining we refine disparate data to increase understanding of the data and to remove data variability. Clearly, that definition is not meaningful to many readers.
What is Data Refining in Big Data in Plain English?
No, refining is not a new term or buzzword, and neither is data refining. Refinement is a generic term in computer science that describes various approaches sharing the goal of producing correct, machine-processable programs and of simplifying programs. Data refinement converts raw data into the format specified as needed by a piece of software or an implementable program. Possibly the meaning is still not clear, so let us look at concrete examples.
In our guide on wrong concepts around auto-restarting MySQL, published two years ago, we shared a MySQL log as a gist on GitHub:
https://gist.github.com/AbhishekGhosh/66f3da024340c3fc3f1b
The first line is:
151011 10:38:40 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
The fifth line is:
2015-10-11 10:38:41 0 [Warning] 'ERROR_FOR_DIVISION_BY_ZERO' is deprecated and will be removed in a future release.
The 33rd line is:
2015-10-11 10:44:56 14344 [Warning] Unsafe statement written to the binary log using statement format since BINLOG_FORMAT = STATEMENT. INSERT... ON DUPLICATE KEY UPDATE on a table with more than one UNIQUE KEY is unsafe Statement: INSERT INTO wp_stt2_meta ( `post_id`,`meta_value`,`meta_count` ) VALUES ( '17878', 'learning in artificial network ann', 1 ) ON DUPLICATE KEY UPDATE `meta_count` = `meta_count` + 1
If I want to develop a chart from the [Warning] entries, that 5500-line log is not in proper condition for a computer to process: the log is essentially intended to be human readable. This was part of the data collection phase.
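As a quick illustration of why such a log needs refining before charting, the [Warning] entries can at least be counted with standard GNU tools. The file below is a hypothetical three-line excerpt resembling the lines quoted above, not the original 5500-line log:

```shell
# Create a tiny sample resembling the MySQL log excerpts above (illustrative only)
cat > /tmp/mysqld-sample.log <<'EOF'
151011 10:38:40 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
2015-10-11 10:38:41 0 [Warning] 'ERROR_FOR_DIVISION_BY_ZERO' is deprecated and will be removed in a future release.
2015-10-11 10:44:56 14344 [Warning] Unsafe statement written to the binary log using statement format since BINLOG_FORMAT = STATEMENT.
EOF

# Count the [Warning] lines, the raw material for a chart
grep -c '\[Warning\]' /tmp/mysqld-sample.log
```

For this sample the count is 2; on the real log the same command would give the total number of warnings, but still nothing a charting system can consume directly.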
Another good example is the Fail2Ban log. We have shown how to set up a Fail2Ban log analytics graph with badips.com. Here is a gist with a Fail2Ban log:
https://gist.github.com/AbhishekGhosh/48d84c020bdea9d8c8b96eec0a58a9f7
Notice these few lines:
2017-07-17 05:46:05,824 fail2ban.actions [935]: NOTICE [sshd] Unban 97.79.239.20
2017-07-17 05:49:49,237 fail2ban.actions [935]: NOTICE [sshd] Unban 61.166.73.121
2017-07-17 06:43:41,419 fail2ban.filter [935]: INFO [sshd] Found 90.189.242.131
2017-07-17 06:43:41,423 fail2ban.filter [935]: INFO [sshd] Found 90.189.242.131
2017-07-17 06:43:44,428 fail2ban.filter [935]: INFO [sshd] Found 90.189.242.131
2017-07-17 06:43:44,790 fail2ban.actions [935]: NOTICE [sshd] Ban 90.189.242.131
You will see that there is usable information here, but not in a form any system can easily compute. For that purpose we have shown, as a basic approach, bash commands to run on the Fail2Ban log to extract valuable information using simple GNU tools. We also have a bash script with those commands in another guide. Running that bash script gives this output:
Bad IPs from only from /var/log/fail2ban.log alone :
---Number-----IP-------------------------------------------------------------
1 p19229-ipngn10401marunouchi.tokyo.ocn.ne.jp (114.175.118.229)
6 182.23.66.171 (182.23.66.171)
6 78-58-187-40.static.zebra.lt (78.58.187.40)
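A minimal sketch of the kind of GNU-tools pipeline such a script can use follows. The file name and sample lines are illustrative, not taken from the original script:

```shell
# Sample Fail2Ban lines in the format shown above (illustrative)
cat > /tmp/fail2ban-sample.log <<'EOF'
2017-07-17 05:46:05,824 fail2ban.actions [935]: NOTICE [sshd] Unban 97.79.239.20
2017-07-17 06:43:44,790 fail2ban.actions [935]: NOTICE [sshd] Ban 90.189.242.131
2017-07-18 07:12:01,101 fail2ban.actions [935]: NOTICE [sshd] Ban 90.189.242.131
EOF

# Keep only the Ban actions (the pattern " Ban " is case sensitive, so
# "Unban" lines are excluded), take the last field (the IP),
# then count how often each IP occurs
awk '/ Ban /{print $NF}' /tmp/fail2ban-sample.log | sort | uniq -c | sort -rn
```

On this sample the pipeline reports 90.189.242.131 twice, which is exactly the per-IP count format the script output above uses.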
The IP 78.58.187.40 needs a ban via iptables. The script, or the commands, convert the data into human-readable output for data analysis: 78.58.187.40 was banned and unbanned six times by Fail2Ban.
But when we use software like Apache Hadoop with Spark, we do not need the human-readable format; we need a format that Apache Hadoop with Spark can process. This is a very basic, easy example of data refinement. The above log might be understandable to an analytics system in this format:
2017-07-17, 05:46:05, 824, Unban, 97.79.239.20
2017-07-17, 05:49:49, 237, Unban, 61.166.73.121
2017-07-17, 06:43:41, 419, Found, 90.189.242.131
2017-07-17, 06:43:41, 423, Found, 90.189.242.131
2017-07-17, 06:43:44, 428, Found, 90.189.242.131
2017-07-17, 06:43:44, 790, Ban, 90.189.242.131
On 17th July, between 05:46 and 06:43, the IP 90.189.242.131 attacked three times and Fail2Ban banned it.
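One hedged sketch of such a transformation with awk: split the timestamp on the comma and keep only the date, time, milliseconds, action, and IP. The field positions assume exactly the Fail2Ban line format shown above, and the file names are hypothetical:

```shell
# Refine raw Fail2Ban lines into comma-separated fields:
# date, time, milliseconds, action, IP
cat > /tmp/f2b-raw.log <<'EOF'
2017-07-17 05:46:05,824 fail2ban.actions [935]: NOTICE [sshd] Unban 97.79.239.20
2017-07-17 06:43:44,790 fail2ban.actions [935]: NOTICE [sshd] Ban 90.189.242.131
EOF

# $1 is the date, $2 is "time,millis" (split on the comma),
# $(NF-1) is the action and $NF is the IP address
awk '{split($2, t, ","); print $1 ", " t[1] ", " t[2] ", " $(NF-1) ", " $NF}' /tmp/f2b-raw.log
```

This emits rows like "2017-07-17, 05:46:05, 824, Unban, 97.79.239.20", the same comma-separated shape as the refined block above.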
In a data warehouse, there is a collective process called Extract, Transform, and Load (ETL). Extraction is the process of gathering data from the data sources. The data is then transformed to fit the need and made to abide by the rules of the data architecture framework, and finally it is loaded into the data warehouse.
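As a toy illustration of the three ETL steps on our log example (all file names are hypothetical, and a real load step would target Hadoop, Spark, or a database bulk loader rather than a flat file):

```shell
# Extract: gather the raw data from the source
cat > /tmp/etl-source.log <<'EOF'
2017-07-17 06:43:44,790 fail2ban.actions [935]: NOTICE [sshd] Ban 90.189.242.131
EOF

# Transform: reshape each line into the comma-separated format the target expects
awk '{split($2, t, ","); print $1 ", " t[1] ", " t[2] ", " $(NF-1) ", " $NF}' \
  /tmp/etl-source.log > /tmp/etl-transformed.csv

# Load: append the transformed rows into the "warehouse" (here just a flat file)
cat /tmp/etl-transformed.csv >> /tmp/etl-warehouse.csv
```

Each stage reads the previous stage's output, which is the essential shape of any ETL pipeline however large the tooling.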
The example I have shown with the logs and commands leans towards retrenchment, not refining. Retrenchment uses formal methods to address the perceived limitations of formal refinement in situations where refinement is practically unusable. This output bears no meaning unless the script or the article is read:
1 p19229-ipngn10401marunouchi.tokyo.ocn.ne.jp (114.175.118.229)
6 182.23.66.171 (182.23.66.171)
6 78-58-187-40.static.zebra.lt (78.58.187.40)
But this set is closer to being meaningful on its own:
2017-07-17, 05:46:05, 824, Unban, 97.79.239.20
2017-07-17, 05:49:49, 237, Unban, 61.166.73.121
2017-07-17, 06:43:41, 419, Found, 90.189.242.131
2017-07-17, 06:43:41, 423, Found, 90.189.242.131
2017-07-17, 06:43:44, 428, Found, 90.189.242.131
2017-07-17, 06:43:44, 790, Ban, 90.189.242.131
There is also free software like OpenRefine, which is useful for some of these purposes:
https://github.com/OpenRefine/OpenRefine
If we take logs as the data source, we have shown how to merge many log files into one big file in a simple way for testing purposes.