In a previous guide, we gave a step-by-step tutorial on how to create a data lake on a server and covered the basics around data lakes. A data lake comprises multiple repositories that provide data to an organisation for analytical processing, including analytics and reporting. In another guide, we talked about medical prediction using a data lake. It was James Dixon who coined the term data lake. In his comparison, a data mart is like a store of bottled water, “cleansed, packaged and structured for easy consumption”, while the data lake is a large body of water in a more natural state. Data flows from the streams into the lake, and users have access to the lake to do the work they want. In other words, just as electricity is often explained by analogy with water, big data can be too: a data lake is a source of data the way a lake is a source of water. Some companies have pushed the terminology and the comparison into an odd shape, and Gartner has a good article on this:
http://www.gartner.com/newsroom/id/2809117
What is a Data Lake in Big Data?
Officially, a data lake is a method of storing data within a system in its natural format, which makes it possible to keep data of various structural forms in one place. The stored items can be object blobs or files. That raw data needs to be transformed before it is ready to be used for tasks like various types of analytics, reporting, visualization, machine learning and so on. A data lake may include structured data from relational databases, semi-structured data such as CSV, logs, XML and JSON, unstructured data like emails and PDFs, and binary data like images, audio and video.
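To make the idea of "store raw, transform when needed" concrete, here is a minimal sketch in Python. It is only an illustration under our own assumptions: the lake is a temporary local folder, and the names `ingest`, `read_records` and `demo`, as well as the sample files, are hypothetical and do not come from any data lake product.

```python
from __future__ import annotations

import csv
import json
import tempfile
from pathlib import Path


def ingest(lake_dir: Path, name: str, payload: bytes) -> Path:
    """Store a payload in the lake unchanged, in its natural format."""
    target = lake_dir / name
    target.write_bytes(payload)
    return target


def read_records(path: Path) -> list[dict]:
    """Transform one raw file into a list of dicts, only when it is read."""
    suffix = path.suffix.lower()
    if suffix == ".csv":
        with path.open(newline="", encoding="utf-8") as handle:
            return list(csv.DictReader(handle))
    if suffix == ".json":
        data = json.loads(path.read_text(encoding="utf-8"))
        return data if isinstance(data, list) else [data]
    # Unstructured or binary data: this sketch exposes basic metadata only.
    return [{"file": path.name, "size_bytes": path.stat().st_size}]


def demo() -> dict[str, list[dict]]:
    """Fill a throwaway lake (hypothetical sample data) and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        lake = Path(tmp)
        ingest(lake, "orders.csv", b"id,amount\n1,9.50\n2,12.00\n")
        ingest(lake, "events.json", b'[{"user": "a", "action": "login"}]')
        ingest(lake, "scan.bin", bytes([0, 1, 2, 3]))
        return {p.name: read_records(p) for p in sorted(lake.iterdir())}


if __name__ == "__main__":
    for name, records in demo().items():
        print(name, records)
```

The point of the sketch is that nothing is cleaned or reshaped at ingestion time. Each file keeps its original form, and the transformation happens only when someone reads it for a particular task.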
There is another term: data warehouse. A data lake serves much the same goal as a data warehouse, but in real-life usage they are shaped for different purposes, much like a swimming pool and a lake. Matters like storage, agility, security and processing are not the same for a data lake and a data warehouse. There is really nothing complex about the term data lake.
---