The classic production factors of a company are based on the three pillars – work, land and capital. Knowledge is regarded as the fourth factor of production in our society and is the decisive competitive factor when it comes to the right use. Information is an integral part of knowledge. A company obtains its information from internal or external sources. The processing of this information is difficult because the data are in heterogeneous form, in different quality, in different data formats and in different data sources. The DWH is intended to unite the heterogeneous data and to enable an analytical use. A Data Warehouse is a knowledge-based, integrated, non-volatile, and time-variant collection of data in support of management’s decision-making process. This definition is not sufficient in practice. A data warehouse is a physical database that provides an integrated view of any data for analysis purposes. The Data Warehouse is part of Business Intelligence (BI). The Business Intelligence integrates the integration possibilities of strategies, processes, applications and technologies with the knowledge management. The idea of using company data with the help of computer-assisted systems for decision support has its origin in the 1960s and has been a major development to this day. The following is a brief description of this decision support system (DSS). Our current topic under discussion is What Data Warehousing Solutions Available in the Cloud and we will try to remain on-topic and brief. However it is important to introduce to some brief history.
Table of Contents |
Brief History of Data Warehousing Solutions
In the early 1960s, magnetic tapes were used to preserve data. With the help of programming languages such as COBOL and Fortran, these tapes were read out in order to generate reports. The great advantage of the magnetic tapes is the fact that they have a high capacity and because of their favorable price were cheap storage media. The readout, however, took a long time and the yield was an average of 5%. Due to difficult synchronization possibilities, which was due to the reading performance, data were stored redundantly. This fact increased the complexity in overlapping and global analysis. During 1970s, the development of direct access storage device (DASD) storage media allowed direct access to stored data. The first database applications were written with DASD. The goal of the database was to simplify the storage and access of the data:
---
- faster access times with DASD, since it does not have to be read sequentially
- databases were defined as a central storage location in order to minimize redundancy
- data analyzes became much more comfortable and faster
Because of the many advantages of a database, it was defined as the only source for data processing (“single-database paradigm”).
During 1980s, personal computers and fourth-generation-languages (4GL) formed the basis for OnLine Transactional Processing (OLTP). Business processes could thus not be processed and booked online via batch processing. In order not to burden the day-to-day business with analysis queries on the operational databases, extracting programs were developed. Their task was to use search criteria to select data in files and databases and to transport the results to other files or databases. The procedure of extracting was established. Soon there were extracts of extracts from extracts etc. The extraction process was also called naturally evolving architecture. Naturally growing architectures bring many challenges, such as:
- data credibility
- low productivity
- to transform the difficulty data into information
Analytical and informative processing is the processing of the data in order to assist management in decision-making. It is also referred to as DSS. Analytical processing does not record single data records, as in the case of operational processing, but determines trends using several data sets. Further distinguishing features are the frequency of the change within the different environments. For the operating systems, data is constantly changed. In analytical processing, the data is accessed frequently to collect analyzes. Data modification by the DSS analysts was seldom noticed. The user requirements for the operational and the strategic information systems fundamentally differ. Response times for analytical queries range from 30 minutes to 24 hours. Same response times in operational systems would lead to the closure of the business processes. Operational data have the requirement to be highly available. The points described above should make it clear that there are sufficient motivations to separate operation and information databases from each other.
In 2010, a large portion of households have internet access. The Internet is playing a more and more important role in today’s world. Web applications enable access to the customer potential through the economic use of electronic platforms, thus offering a further important sales market for companies. When properly integrated, the Warehousing can help to successfully operate an eBusiness-based web application. The Warehousing is intended to adopt the following support functions:
- receiving, processing and archiving of the data generated from the web application in order to counteract a utilization of the Internet
- providing integrated data via access to the high-performance ODS component
Today a main consideration with data warehousing starts with the matter while deciding between cloud Data Warehousing or on-premise Data Warehousing as the speed of deployment and uptime/maintenance are matters which these days we think more.
Basic of Data Warehousing
Although we have discussed the theoretical of Big Data as a separate series of articles, we will discuss it here briefly. The architectural requirements for a Data Warehouse result from the later usefulness which one would like to achieve with the system. A Data Warehouse is a database that has the task of providing integrated data for analysis queries. Therefore, some relevant points should be considered during the design of the architecture. The architecture must meet the requirements of independent use of individual components – in terms of capacity utilization. Furthermore, the long-term availability and the multiple usability of integrated data must be considered. The Data Warehouse architecture must offer the possibility of extensibility in order to be able to integrate new data sources if necessary. A further criterion is the automated processing within the individual process steps. Based on the reference architecture, the following individual components are contained in a Data Warehouse system in a too much simplified manner :
- Data Warehouse Manager (DWHM) is the central control component of the Data Warehouse system
- Data source is not part of the Data Warehouse system, but sources are the starting point of the data flow
- The Data Warehouse monitor recognizes data manipulations or changes within the data sources and takes care of the incremental alternation of the data in the base database and the Data Warehouse.
- The base database in the DWH reference model is located between the work area and the Data Warehouse.
- The workspace is the temporary location of the ETL process – The extraction (E) component is responsible for transferring the data from the data source to the work area. The transformation (T) component changes the structure and the content of the data with the aim of converting the data into a uniform internal format because of its heterogeneity. The load (L) component is responsible for transferring the data from the work area to the base database or to the Data Warehouse.
- An integration in data warehouse refers to pre-built system for transferring data from, or sometimes to a particular data source.
In order to make decisions, it is necessary to be able to access information quickly and in a complex way. For this, a two-dimensional view of information, such as in a table, is not enough. In such a table, only the relationships between two variables can be represented. To be able to include other variables in the analysis, OLAP technology provides the representation of the values in the form of a multidimensional cube. The abbreviation OLAP stands for the term Online Analytical Processing and is an analysis method, in which it is no longer a question of retrieving a single value from the data stock, but rather a multidimensional view of the data base Role. Particular characteristics of OLAP products are the architecture of the system, the existing interfaces, the definition of the multidimensional data model and the individual analysis functions and possibilities. Once a suitable software has been found, the potential users are to be prepared accordingly, so that the handling of the program can be smooth and success-oriented in practice. This can be achieved, in particular, through presentations and training courses.
Use Cases of Data Warehousing in Brief
It is really beyond the capacity of this article to describe the use cases to even provide a basic idea, however from point-of-sale (POS) data, enterprise resource planning (ERP) data, stored accounting data, patient records from an electronic health record (EHR) system, billing system – all actually been covered by Data Warehousing.
What Data Warehousing Solutions Available in the Cloud
There are a number of well-known software manufacturers who offer solutions for the construction of a Data Warehouse system or its individual components. IBM apparently has acquired technologies and developments through various acquisitions from other software companies such as Ascentail Software, Trigo, Cognos and SPSS to provide a Data Warehouse system with various components. This system is named IBM Smart Analytics System and has these components from our understanding :
- Balanced Configuration Units (BCU) : These are used to shorten the implementation time of a DWH system.
- IBM Balanced Warehouse : Provides predefined solutions for individual customer requirements. The system is module-based so that customers can assemble and expand their desired components themselves.
- IBM dashDB : Cloud-based data warehouse service built for analytic work suited for Cloudant JSON data, can accept data from a variety of sources by examining the structure of data when it is loaded.
- IBM DB2 on Cloud : Offers all features of on-premise DB2 deployment.
- IBM InfoSphere Warehouse : Based on IBM DB2, a relational database server, access to structured and unstructured data in DWH is possible in real-time.
- Cubing Services : Cubing Services is based on the functionality of a hybrid OLAP and allows direct access to the data in the DWH. They are created in real-time cache cubes, which are provided via standardized OLAP interfaces such as MDX (multidimensional), XMLA (XML for Analysis), or ODBO (OLEDB for OLAP). Access to these data is via BI clients such as Cognos 8 BI, Microsoft Excel, IBM DB2 AlphaBlox, and IBM DataQuant.
- The IBM solutions include various functions for data mining and text analysis, which include standard models (such as clustering and classification) and presentation components, as well as options for evaluating unstructured data from, for example, free text fields.
SAP probably mentionable as they offers several components and solutions to implement a Data Warehouse system in a company.
- SAP High-Performance Analytic Appliance (SAP HANA) performs data analysis from different data sources is performed in real-time via the so-called in-memory appliance.
- SAP NetWeaver Business Warehouse is used to provide a Data Warehouse including the functions ETL routines, metadata, InfoCubes, data models, and reports.
Obviously, there are several companies who offers Data Warehousing softwares. Other known companies with services (not only software) are Amazon Redshift, Teradata, Oracle Exadata etc. Of course there are hundreds of companies to try their luck.
Conclusion
Today’s cloud-based Data Warehousing require less on-premise resources such as own servers, own IT team dedicated for the job. A cloud Data Warehousing solution with instant integration systems can be set up rapidly for all data sources. However for cloud-based data warehouse we are dependent on provider’s uptime and maintenance. On-premise solutions does avoid those issues but we become responsible for outages (do not forget the cost factor). May be one want to prioritise granular control over convenience but for the most cloud-based at least best to start or develop, learn. Now among the companies, Google is risky company around data. Google has lot of things but unless free (except AdWords) peoples usually do not buy. Microsoft means “no UNIX” (although they are loving UNIX-like out of need of saving their business). We have no guide on Microsoft’s cloud server. Amazon never have quite reliable hardware (usually inexperienced startups use their services). Remaining are Oracle and SAP. Both really not no way closest to bad. Decision to choose a company of course depends on checking many matters like which are commonly applicable for cloud.