Data cleansing includes various methods for removing and correcting data errors in databases or other information systems. For example, the errors may consist of incorrect (originally incorrect or outdated), redundant, inconsistent, or incorrectly formatted data. Key steps for data cleansing are duplicate detection (detecting and merging the same data sets) and data fusion (merging and completing patchy data). Data cleansing is a contribution to improving the quality of information. However, information quality also affects many other characteristics of data sources (credibility, relevance, availability, costs, etc) that cannot be improved by means of data cleansing.
Data Cleansing Process
The process of cleaning up the data is divided into five successive steps):
- Make a backup copy of the file/table
- Data Quality – Setting Data Requirements
- Analysis of the data
- Standardization
- Cleanup of the data
Data Quality Requirements
---
High-quality and reliable data must meet certain requirements :
- valid data: same data type, certain maximum values, etc.
- complete data
- uniform data: same unit (i.e. currency, weight, length)
- integral data: Data must be protected from intentional and/or unintentional manipulation.
Analysis of data
Once the requirements have been clarified, the data must be checked, i.e. with the help of the checklists, whether the data is of the required quality.
Standardization of data before cleanup
For a successful cleanup, the data must first be standardized. For this purpose, these are first structured and then standardized. The structuring brings the data into a uniform format, for example, a date is brought into a uniform data format (01.09.2009) or composite data is broken down into its components, i.e. a customer’s name into the name components Salutation, Title, First Name and Last Name. In most cases, such structuring is not trivial and is carried out with the help of complex parsers.
During standardization, the existing values are mapped to a standardized value list. This standardization may be carried out, for example, academic titles or company additions.
Cleaning up data
There are six methods to clean up the data that can be applied individually or in combination:
- Derive from other data: The correct values are derived from other data (i.e. salutation from the gender).
- Replace with other data: The corrupted data is replaced by other data (i.e. from other systems).
- Use Default values: Default values are used instead of the incorrect data.
- Remove incorrect data: The data is filtered out and not further processed.
- Remove duplicates: Duplicates are identified through duplicate detection, the non-redundant data is
- Consolidation from the duplicates, and a single data set is formed from them.
- Split summary: In contrast to the removal of duplicates, incorrectly summarized data is separated again.
Storage of the faulty data
Before cleaning up the data, you should save the original, erroneous data as a copy, and not simply delete it after the cleanup. Otherwise, the adjustments would not be comprehensible, and such a process would not be audit-proof.
An alternative is to store the corrected value in an additional column. Because additional disk space is required, this approach is recommended for only a few columns in a record to correct. Another option is to store it in an additional line, which increases the memory requirement even more. Therefore, it is only possible to correct a small number of records. The last option for a large number of columns and rows to correct is to create a separate table.