In increasingly complex application landscapes, handling data flows is becoming more and more difficult. Read how the big data technology Apache Kafka can help. Linking data from different systems is at the top of the to-do list in application management. There are many solutions for this, each with its own advantages and disadvantages. Apache Kafka promises to solve the problem in the way users have always wanted.

Apache Kafka is being adopted by more and more companies. This is not surprising: for data distribution tasks it forms a solid basis for any kind of data integration, together with Apache Avro as the payload format, whether at the table level, at the business object level or for real-time replication. From Apache Kafka's perspective, these are just different use cases. Even complex transformations can be attached in a variety of ways, from conventional ETL tools to stream processing tools.
Big data tools that reach their limits
Many technologies for data integration have been invented over the years. ETL (Extract-Transform-Load) tools, for example, focus primarily on transforming data. They are usually powerful tools, but they only work well for batch processing. EII (Enterprise Information Integration) tools promised to make this easier by not copying the data at all, but instead linking it together at runtime.
Enterprise Application Integration (EAI) has yet another focus: application objects, not tables, should be the basis. After all, the key information is that, for example, a new customer master record has been created, not that five tables have changed. Real-time replication, finally, can only copy data, but it does so with very low latency. If you look at all of these approaches from a technical perspective, you constantly run into physical limits and contradictions.
---
The right big data technology?
The current situation, with its mix of approaches and tools, can be observed at many companies. Software vendors offer a whole range of ETL tools on the market (see the Gartner Magic Quadrant for Data Integration). The EII approach, i.e. data federation, is implemented in many business intelligence tools.
At an abstract level, an application object represents a logical data model, and tables are one possible physical implementation. A much more direct representation would be storage in JSON, XML or an even more suitable format; the main thing is that it allows a hierarchical structure of the data. In the customer example above, such an object should contain the customer name, the customer's various addresses, possible contact information and the like. If the data integration supports this format properly, there is no longer any reason to build separate tools for data integration and application integration. A table is then just a particularly trivial, flat business object. Such non-relational structures are common in the big data world anyway, and Apache Avro is a suitable format for them because it combines the format definition, efficient storage and fast processing.

The second basic decision revolves around the question of how many consumers there are for the respective data. Real IT landscapes have never been that simple, and they are becoming increasingly complex. It makes little sense, for example, for ten target systems to pester the source system every few seconds with the question "What has changed?"; that only creates a high base load on the ERP system. In such a landscape it would be much smarter to have a central distributor: the ERP system only has to pass the data on to this service once, and all data consumers get the data from there.
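Coming back to the payload format: the following is a minimal sketch, using the Avro Java API, of how such a hierarchical customer object could be described as an Avro schema. The namespace and field names (example.masterdata, Customer, Address and so on) are purely illustrative and not taken from any real system.

```java
import org.apache.avro.Schema;

public class CustomerSchemaExample {
    // Hypothetical schema for a "Customer" business object: the nested
    // "addresses" array is what a flat table cannot express directly.
    private static final String CUSTOMER_SCHEMA_JSON = """
        {
          "type": "record",
          "name": "Customer",
          "namespace": "example.masterdata",
          "fields": [
            {"name": "customerId",   "type": "string"},
            {"name": "name",         "type": "string"},
            {"name": "contactEmail", "type": ["null", "string"], "default": null},
            {"name": "addresses",    "type": {
               "type": "array",
               "items": {
                 "type": "record",
                 "name": "Address",
                 "fields": [
                   {"name": "kind",    "type": "string"},
                   {"name": "street",  "type": "string"},
                   {"name": "city",    "type": "string"},
                   {"name": "country", "type": "string"}
                 ]
               }
            }}
          ]
        }
        """;

    public static void main(String[] args) {
        // Avro keeps the structure definition together with the data,
        // so producer and consumer always agree on the payload format.
        Schema schema = new Schema.Parser().parse(CUSTOMER_SCHEMA_JSON);
        System.out.println(schema.toString(true));
    }
}
```

The nested addresses array is exactly the kind of structure that would otherwise have to be spread over several relational tables.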
This approach has been tried several times before. Remember the SOA (Service Oriented Architecture) movement, or IBM MQ as the best-known representative of the enterprise message bus category. None of these solutions was bad, but most were overwhelmed by the real-world requirements. The obstacles were simple things, but they still prevented widespread use. I would like to highlight two points in particular: the payload format and the handling of the queues.
If you couple two systems, you have to agree on an interface definition. What exactly does a customer master record look like: which fields, data types, permitted values, and so on? Any change to the interface means that both the sender and the receiver must be updated synchronously. This is manageable with two systems, but as soon as many systems are involved, it quickly becomes confusing and practically impossible.
You need a technique that, on the one hand, prescribes a fixed structure definition and, on the other hand, allows a certain tolerance. The classic way to do this is through different versions of the APIs. But it can also be done more easily, and databases do it all the time: what happens if you add a column to a table? With clean programming, nothing; all read accesses continue to work. In the big data world, this approach has been developed further and is called schema evolution. Apache Avro supports it very nicely.
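As a small illustration of schema evolution with Avro's Java library (the Customer record and its fields are invented for this example), the following sketch checks that data written with an old schema can still be read with a newer one that adds a field with a default value:

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;

public class SchemaEvolutionExample {
    public static void main(String[] args) {
        // Version 1: the schema the existing producers still write with.
        // (A fresh parser is used per schema because both share the record name.)
        Schema v1 = new Schema.Parser().parse("""
            {"type": "record", "name": "Customer", "fields": [
              {"name": "customerId", "type": "string"},
              {"name": "name",       "type": "string"}
            ]}
            """);

        // Version 2: a new field was added, but with a default value,
        // so records written with v1 can still be read.
        Schema v2 = new Schema.Parser().parse("""
            {"type": "record", "name": "Customer", "fields": [
              {"name": "customerId",  "type": "string"},
              {"name": "name",        "type": "string"},
              {"name": "loyaltyTier", "type": "string", "default": "NONE"}
            ]}
            """);

        // Avro can verify that a reader using v2 understands data written with v1.
        SchemaCompatibility.SchemaPairCompatibility result =
                SchemaCompatibility.checkReaderWriterCompatibility(v2, v1);
        System.out.println("v2 reader vs. v1 writer: " + result.getType()); // COMPATIBLE
    }
}
```

Because the new loyaltyTier field carries a default, a reader using version 2 can still process every record that was written with version 1; this is the Avro equivalent of adding a column to a table without breaking existing read accesses.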
The advantages Apache Kafka brings
In the past, a publish/subscribe model was mostly used for messaging: a change is sent out, and everyone who has registered as a data recipient receives the corresponding data. This sounds obvious at first, but it does not meet real-world requirements. As a consumer, you want to decide which data you get and when you get it. In many cases that will be immediately, in real time, but there are other scenarios:
- The data warehouse only reads once a day.
- In the event of an error in the processing logic, the last few hours must be processed again.
- During development and testing, you want to receive the same data over and over again.
This is exactly the advantage of Apache Kafka: it does all of this without creating new drawbacks, and in a way that is simple, convenient and fast. The logical principle is that all change data is appended to the end of a log. Every consumer specifies where it wants to read from, and Kafka transfers the data accordingly. The connection is left open, however, so that new records arrive with a latency in the millisecond range (a small consumer sketch follows the list below). The use cases include:
- Messaging
- Website activity tracking
- Log aggregation
- Stream processing
- Event sourcing
- Commit log
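Here is the consumer sketch referenced above. The broker address localhost:9092, the topic name customer-changes and the plain string payload are assumptions made for this example; the point is that the consumer itself decides where to start reading, here by rewinding to the position of three hours ago and then continuing in real time:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;

import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

public class ReplayConsumerExample {
    public static void main(String[] args) {
        // Hypothetical broker address; key and value are read as plain strings.
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // The consumer decides where to start reading: here we rewind the
            // hypothetical "customer-changes" topic to the position of three hours ago.
            List<TopicPartition> partitions = consumer.partitionsFor("customer-changes").stream()
                    .map(p -> new TopicPartition(p.topic(), p.partition()))
                    .collect(Collectors.toList());
            consumer.assign(partitions);

            long threeHoursAgo = Instant.now().minus(Duration.ofHours(3)).toEpochMilli();
            Map<TopicPartition, OffsetAndTimestamp> offsets = consumer.offsetsForTimes(
                    partitions.stream().collect(Collectors.toMap(tp -> tp, tp -> threeHoursAgo)));
            offsets.forEach((tp, offset) -> {
                if (offset != null) {
                    consumer.seek(tp, offset.offset());
                }
            });

            // The connection stays open: after the replay, new records keep
            // arriving with millisecond latency.
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d key=%s value=%s%n",
                            record.offset(), record.key(), record.value());
                }
            }
        }
    }
}
```

The same mechanism covers the other scenarios from the list above: a data warehouse that connects once a day simply continues from where it left off, and during testing you can seek back to the same offset as often as you like.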
This kills two birds with one stone: the consumer gets exactly the data it requested, so it stays in control, and it gets that data in real time, at least until it closes the connection itself. Furthermore, several instances of the same consumer can run in parallel, and Kafka takes care of the load balancing: if another consumer instance is started, or an existing instance stops responding, Kafka handles each of these situations automatically. This makes programming simple and robust (see the closing sketch below). So if you have to deal with data integration, pay attention to how well your architecture and the tools you use harmonize with Apache Kafka.
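To close, a minimal sketch of the consumer-group behaviour described above (again, broker address, topic name and group id are made up for the example): every instance started with the same group.id shares the work, and Kafka redistributes the partitions automatically when instances come and go.

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class GroupConsumerExample {
    public static void main(String[] args) {
        // Hypothetical broker address, topic and group id.
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "dwh-loader"); // all parallel instances share this id
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Every instance started with the same group.id subscribes identically;
            // Kafka assigns partitions across the instances and rebalances them
            // automatically when an instance joins or stops responding.
            consumer.subscribe(List.of("customer-changes"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```

Starting a second copy of this program is all it takes to scale out: no coordination code is needed on the consumer side, as long as the topic has enough partitions to share.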