In business informatics and computational linguistics, unstructured data is digitized information that has no formalized structure and therefore cannot be accessed in aggregate by computer programs through a single interface. Examples include digital texts in natural language and digital sound recordings of human speech. Unstructured data is distinguished from structured and semi-structured data. An e-mail, for example, has a certain structure: it contains a recipient, a sender, and possibly a subject line. This makes it semi-structured data. The content of the e-mail itself, however, is unstructured.
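The e-mail case can be illustrated with a minimal sketch that uses Python's standard `email` module. The message text, the addresses, and the helper name `split_email` are made up for illustration. The header fields can be read by name, while the body comes back as one opaque string.

```python
from email import message_from_string

# Hypothetical raw message; addresses and wording are invented for this example.
RAW_MESSAGE = """\
From: alice@example.com
To: bob@example.com
Subject: Delivery delay

Hi Bob, the shipment we discussed will arrive two days late
because the supplier ran out of packaging material.
"""


def split_email(raw: str) -> tuple[dict[str, str], str]:
    """Return the semi-structured header fields and the unstructured body."""
    msg = message_from_string(raw)
    # Header fields are addressable by name: this is the structured part.
    headers = {name: msg.get(name, "") for name in ("From", "To", "Subject")}
    # The body of a non-multipart message is returned as a plain string.
    body = msg.get_payload()
    return headers, body


headers, body = split_email(RAW_MESSAGE)
print(headers)
print(body)
```

A program can filter or sort on `headers["From"]` directly, but answering a question such as "why is the shipment late?" requires further analysis of `body`.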
The automatic usability of unstructured data is limited by the fact that it has no data model and usually no metadata. In text documents, metadata and data are also mixed. Extracting structure from such data requires modeling. The term is also used in connection with documents that are stored without any data warehousing. As a result, such documents cannot be indexed and cannot be searched together. Much data is unstructured at its origin and gains structure only when human intervention brings it into a schema. This structuring process has drawbacks, because it is often associated with a loss of information. In the business environment, important information is often held in unstructured data, and the failure to capture it can also cause legal problems. The fields of knowledge management and data management therefore deal with its integration and management. For structuring unstructured data, the open-source framework UIMA (Unstructured Information Management Architecture) is available. It is a framework for building applications that process unstructured information.
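The loss of information during structuring can be made concrete with a small sketch. It does not use UIMA. The schema `Complaint`, the regular expressions, and the sample note are hypothetical and chosen only for illustration. Free text is forced into two fields, and everything that does not fit the schema is discarded.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Complaint:
    """Hypothetical target schema: only these two fields survive structuring."""
    order_id: Optional[str]
    amount_eur: Optional[float]


# Assumed text conventions: "order #1234" and amounts written like "59.90 EUR".
ORDER_RE = re.compile(r"order\s+#?(\d+)", re.IGNORECASE)
AMOUNT_RE = re.compile(r"(\d+(?:\.\d{2})?)\s*EUR")


def structure(text: str) -> Complaint:
    """Extract the schema fields from free text; missing fields become None."""
    order = ORDER_RE.search(text)
    amount = AMOUNT_RE.search(text)
    return Complaint(
        order_id=order.group(1) if order else None,
        amount_eur=float(amount.group(1)) if amount else None,
    )


note = ("Customer called about order #4711, wants 59.90 EUR refunded "
        "and sounded very upset about the repeated delays.")
record = structure(note)
print(record)
```

The resulting record can be indexed and queried, but the customer's mood and the mention of repeated delays are no longer present in it.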
Handling Unstructured Data
The following methods can be used to structure the data:
- Text analysis and text mining have been on the market for many years, and the products have reached solid market maturity. Several small, specialized vendors have developed tools for this purpose, and some business intelligence software vendors have acquired such technologies under market pressure. Text mining can be done manually, with statistical procedures, with machine learning, or with natural language processing. It can provide terms and concepts in thesauri that can become essential for further business intelligence analysis.
- Machine learning is based on statistical methods such as Bayes classifiers, artificial neural networks, or latent semantic analysis (LSA). It can be considerably more effective than traditional statistical methods, but it is not applicable everywhere. The models have to be trained and monitored, and, as with data mining procedures, deep knowledge of the subject matter is necessary.
- Linguistic techniques can be faster than machine learning and are sometimes more accurate. They can reduce ambiguity, but they still need human intervention. Their models are easier to understand than those of LSA and machine learning.
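As one illustration of the methods listed above, the following sketch implements a small multinomial naive Bayes classifier in plain Python. It assigns a category label to free text, which is one simple way of adding structure. The class name, the labels, and the four training sentences are hypothetical. A real application would need far more training data and the subject knowledge mentioned above.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self) -> None:
        self.doc_counts: Counter = Counter()            # documents per label
        self.word_counts: dict[str, Counter] = defaultdict(Counter)
        self.vocab: set[str] = set()

    def train(self, text: str, label: str) -> None:
        self.doc_counts[label] += 1
        for tok in tokenize(text):
            self.word_counts[label][tok] += 1
            self.vocab.add(tok)

    def classify(self, text: str) -> str:
        total_docs = sum(self.doc_counts.values())
        tokens = tokenize(text)
        scores = {}
        for label, n_docs in self.doc_counts.items():
            total_words = sum(self.word_counts[label].values())
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                count = self.word_counts[label][tok]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            scores[label] = score
        return max(scores, key=scores.get)


# Hypothetical training data, invented for this example.
TRAINING = [
    ("invoice payment overdue please pay the amount", "billing"),
    ("refund for the invoice amount was charged twice", "billing"),
    ("package arrived damaged and the delivery was late", "shipping"),
    ("delivery delayed, tracking shows the package is lost", "shipping"),
]

model = NaiveBayes()
for text, label in TRAINING:
    model.train(text, label)

print(model.classify("where is my package, the delivery is late"))
print(model.classify("I was charged twice on my invoice"))
```

The training step is where human supervision enters: someone has to supply correctly labeled examples, and the labels themselves encode knowledge of the domain.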