Data lineage, or data origin, is the problem in a data warehouse system of determining, for given aggregated data records, the original data records from which they were created. Data lineage includes methods and tools that make the life cycle of data traceable and answer the questions of who, when, where, why and how. It is a discipline within metadata management and is often also a function of data catalogs. Data lineage capabilities allow users to understand the context of the data they use for decision-making and other business purposes.
Typically, in a data warehouse system, data is extracted from various sources, transformed according to specific rules, and made available for analysis (see ETL process). Data lineage has to describe the reverse path, from the analysis results back to the sources. For this purpose, the transformations are modeled mathematically so that, for given output values of a transformation, the associated input values can be determined (input-process-output principle, abbreviated EVA in German). Databases are the first choice when it comes to retaining, updating, querying, deleting and presenting data. Developers depend on data consistency so that APIs can perform the right transactions and applications can access the right data. Data scientists who develop machine learning models or create data visualizations also rely on consistent data.
All processing steps are modeled as transformations $T$ that produce an output $A$ from an input $E$: $T(E) = A$. The lineage $T'$ of a data record $a$ of the output is defined as the subset $E'$ of the input that was involved in the construction of $a$: $E' = T'(a, E)$. The lineage of a set of records is composed of the lineage of its elements.
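The definition above can be illustrated with a minimal Python sketch. The transformation `transform` (squaring each number) and the tracing function `lineage` are hypothetical examples chosen for illustration. They are not part of any particular data warehouse tool.

```python
from typing import Iterable, List

# Hypothetical toy transformation T: squares every input number, T(E) = A.
def transform(inputs: List[int]) -> List[int]:
    return [e * e for e in inputs]

# Hypothetical tracing function T'(a, E): the subset E' of the input
# that was involved in constructing the output record a.
def lineage(a: int, inputs: List[int]) -> List[int]:
    return [e for e in inputs if e * e == a]

# The lineage of a set of output records is composed of (here: the
# duplicate-free union of) the lineage of its elements.
def lineage_of_set(outputs: Iterable[int], inputs: List[int]) -> List[int]:
    result: List[int] = []
    for a in outputs:
        for e in lineage(a, inputs):
            if e not in result:
                result.append(e)
    return result

E = [-2, 1, 2, 3]
A = transform(E)                   # [4, 1, 4, 9]
print(lineage(4, E))               # [-2, 2]
print(lineage_of_set([4, 9], E))   # [-2, 2, 3]
```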
---
All transformations can be divided into three classes. It is assumed that the transformations are stable and deterministic, that is, no new output objects are invented and the same input always produces the same output.
Black box
A black box is a transformation for which no special properties can be specified. Each element of the output can depend on any element of the input. An example of a black box is a function that returns, for each number of a set, its deviation from the mean of the set.
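The deviation-from-mean example can be sketched in Python as follows. The function names are illustrative assumptions. The point is that changing a single input value changes every output value, so without further knowledge the lineage of each output element is the entire input.

```python
from typing import List

# Black-box transformation: deviation of each number from the mean.
def deviation_from_mean(values: List[float]) -> List[float]:
    mean = sum(values) / len(values)
    return [v - mean for v in values]

# Hypothetical tracing procedure for a black box: since no special
# property is known, every input element may have contributed.
def trace_black_box(a: float, inputs: List[float]) -> List[float]:
    return list(inputs)

E = [1.0, 2.0, 6.0]
print(deviation_from_mean(E))                # [-2.0, -1.0, 3.0]
# Changing only the last input changes all three outputs.
print(deviation_from_mean([1.0, 2.0, 9.0]))  # [-3.0, -2.0, 5.0]
print(trace_black_box(-2.0, E))              # [1.0, 2.0, 6.0]
```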
Dispatcher
A dispatcher is a transformation that handles elements of the input independently. Each input element can generate any number of output elements (even zero). The lineage of an element $a$ of the output of a dispatcher consists of all elements $e$ of the input that were involved in the transformation to $a$.
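A minimal sketch of a dispatcher and its lineage, assuming a hypothetical transformation `split_words` that splits each input line into words. Because a dispatcher handles each input element independently, an input element belongs to the lineage of an output element exactly when applying the transformation to that input element alone produces the output element.

```python
from typing import Callable, List

# Hypothetical dispatcher: each input line yields zero or more words.
def split_words(lines: List[str]) -> List[str]:
    out: List[str] = []
    for line in lines:
        out.extend(line.split())
    return out

# Lineage of output element a: all input elements e that, taken on
# their own, generate a.
def trace_dispatcher(
    transform: Callable[[List[str]], List[str]], a: str, inputs: List[str]
) -> List[str]:
    return [e for e in inputs if a in transform([e])]

E = ["red green", "", "green blue"]
print(split_words(E))                             # ['red', 'green', 'green', 'blue']
print(trace_dispatcher(split_words, "green", E))  # ['red green', 'green blue']
```

The empty line generates no output element and therefore never appears in any lineage.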
Aggregator
An aggregator is a transformation in which each input element participates in at least one output element and the input can be divided into disjoint partitions in such a way that each partition is responsible for exactly one output element. Each element of the output can thus be unambiguously assigned to a group of input elements. A special case of aggregators are key-preserving aggregators, in which input elements contribute to the same output element only if their key attributes match, and that key also occurs in the output element.
Another class of aggregators are context-free aggregators, where the mapping of an input element to a particular partition is independent of the values of other input elements.
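As an illustration, a sum grouped by key behaves as a key-preserving aggregator. The sketch below is an assumption-laden example: the record layout `(key, amount)` and the function names are invented for illustration. Each output record carries the key of its partition, so the partition can be recovered from the key alone.

```python
from typing import Dict, List, Tuple

Record = Tuple[str, int]  # hypothetical layout: (key, amount)

# Aggregator: one output record per key, holding the sum of the amounts.
def sum_by_key(records: List[Record]) -> List[Record]:
    totals: Dict[str, int] = {}
    for key, amount in records:
        totals[key] = totals.get(key, 0) + amount
    return sorted(totals.items())

# Key-preserving tracing: the lineage of an output record consists of
# the input records that carry the same key.
def trace_by_key(a: Record, inputs: List[Record]) -> List[Record]:
    return [e for e in inputs if e[0] == a[0]]

E = [("north", 10), ("south", 5), ("north", 7)]
print(sum_by_key(E))                   # [('north', 17), ('south', 5)]
print(trace_by_key(("north", 17), E))  # [('north', 10), ('north', 7)]
```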
A transformation that maps every input element to itself (identity) or subjects each input element to a simple calculation (e.g. format conversion) is both a dispatcher and an aggregator and is also referred to as a filter.
Data Lineage Calculation
The data lineage of a given output can be determined with a tracing procedure if the class of the transformation is known.
- For dispatchers, each element of the input is checked to see if it generates the output and, in this case, added to the data lineage.
- For context-free aggregators, the partitions are formed first and then the one that leads to the output is selected. The partitions are determined by successively adding each input element to an existing partition, provided that the output of that partition still consists of a single element.
- For key-preserving aggregators, the keys of the input elements are checked.
- For filters, the data lineage corresponds to the output.
- For general aggregators or black boxes, the effort for tracing is too great, since power sets of the input elements would have to be formed.
Therefore, to determine the data lineage of a transformation effectively, either an explicit tracing procedure must be known or an inverse function must be used. The inverse function of a transformation can be used as a tracing procedure only for aggregators, because otherwise it is not necessarily unique.
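The partition-based procedure for context-free aggregators described in the list above can be sketched as follows. This is a simplified illustration under stated assumptions: `count_by_parity` is a hypothetical aggregator, and the transformation is assumed to be context-free so that the greedy assignment of elements to partitions is valid.

```python
from typing import Callable, Dict, List, Tuple

Output = Tuple[str, int]

# Hypothetical context-free aggregator: counts even and odd numbers.
def count_by_parity(numbers: List[int]) -> List[Output]:
    counts: Dict[str, int] = {}
    for n in numbers:
        label = "even" if n % 2 == 0 else "odd"
        counts[label] = counts.get(label, 0) + 1
    return sorted(counts.items())

# Form partitions by adding each input element to the first existing
# partition whose output still consists of a single element.
def build_partitions(
    transform: Callable[[List[int]], List[Output]], inputs: List[int]
) -> List[List[int]]:
    partitions: List[List[int]] = []
    for e in inputs:
        for part in partitions:
            if len(transform(part + [e])) == 1:
                part.append(e)
                break
        else:  # no existing partition accepted e
            partitions.append([e])
    return partitions

# Select the partition that leads to the given output element a.
def trace_context_free(
    transform: Callable[[List[int]], List[Output]], a: Output, inputs: List[int]
) -> List[int]:
    for part in build_partitions(transform, inputs):
        if transform(part) == [a]:
            return part
    return []

E = [3, 4, 7, 10]
print(count_by_parity(E))                               # [('even', 2), ('odd', 2)]
print(build_partitions(count_by_parity, E))             # [[3, 7], [4, 10]]
print(trace_context_free(count_by_parity, ("odd", 2), E))  # [3, 7]
```

The sketch calls the transformation once per candidate partition and input element, which is far cheaper than enumerating the power set of the input as a general aggregator or black box would require.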
In order to determine the data lineage for an entire chain of transformations without having to store all intermediate results, the transformations are normalized: some of them are combined in such a way that their special properties (aggregator, dispatcher, filter…) are preserved, so that effective tracing remains possible. The optimal order for tracing a sequence of consecutive transformations also depends on the cost model.
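One simple case of such a combination can be sketched in Python. The example assumes two hypothetical dispatchers, `split_words` and `keep_long_upper`. Since both handle their input elements independently, their composition is again a dispatcher and can be traced directly against the original input, without storing the intermediate word list. This is an illustrative sketch, not a general normalization algorithm.

```python
from typing import Callable, List

Transform = Callable[[List[str]], List[str]]

# Hypothetical dispatcher 1: split each line into words.
def split_words(lines: List[str]) -> List[str]:
    out: List[str] = []
    for line in lines:
        out.extend(line.split())
    return out

# Hypothetical dispatcher 2: keep words longer than three characters,
# converted to upper case.
def keep_long_upper(words: List[str]) -> List[str]:
    return [w.upper() for w in words if len(w) > 3]

# Combine two transformations into a single one.
def compose(first: Transform, second: Transform) -> Transform:
    return lambda inputs: second(first(inputs))

# Dispatcher tracing: input elements that generate a on their own.
def trace_dispatcher(transform: Transform, a: str, inputs: List[str]) -> List[str]:
    return [e for e in inputs if a in transform([e])]

E = ["red green", "blue", "green yellow"]
pipeline = compose(split_words, keep_long_upper)
print(pipeline(E))                          # ['GREEN', 'BLUE', 'GREEN', 'YELLOW']
print(trace_dispatcher(pipeline, "GREEN", E))  # ['red green', 'green yellow']
```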