Text mining, or textual data mining, is a bundle of algorithm-based analysis methods for discovering meaning structures in unstructured or weakly structured text data. Using statistical means, text mining software uncovers structures in texts that are intended to let users quickly recognize the core information of the processed texts. Ideally, text mining systems deliver information that users did not previously know was contained in the processed texts, or whether it was contained in them at all. When applied in a targeted way, text mining tools can also generate hypotheses, test them and refine them step by step.
Text mining, introduced into research terminology in 1995 as Knowledge Discovery from Text (KDT), is not a clearly defined term. By analogy with data mining in Knowledge Discovery in Databases (KDD), text mining is a largely automated process of knowledge discovery in textual data, designed to enable effective and efficient use of available text archives. More broadly, text mining can be seen as a process of compiling and organizing, formally structuring and algorithmically analyzing large document collections in order to extract information as needed and to discover hidden content relationships between texts and text fragments.
Typologies, Related Procedures and Areas of Application
The different conceptions of text mining can be ordered using various typologies. Forms of information retrieval (IR), document clustering, text data mining and KDD are repeatedly named as subforms of text mining.
In IR, it is known in advance that the text data contains certain facts, which are to be found using suitable search queries. From the data mining perspective, text mining is understood as “data mining on textual data”, that is, the exploration of data from texts that requires interpretation. The most far-reaching type of text mining is KDT proper, in which new, previously unknown information is to be extracted from the texts.
Text mining is related to several other methods, from which it can be distinguished as follows. Text mining is most similar to data mining. It shares many procedures with data mining, but not the subject matter: while data mining is usually applied to highly structured data, text mining deals with much more weakly structured text data. In text mining, therefore, the primary data is first given a stronger structure so that it can be analyzed with data mining methods. In contrast to most data mining tasks, assigning a text to several classes at once is usually expressly desired in text mining.
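The structuring step described above can be illustrated with a bag-of-words representation, which is one common way (not the only one) of turning weakly structured text into a table that data mining methods can work on. The following is a minimal sketch using only the Python standard library; the function names `tokenize` and `document_term_matrix` and the two sample sentences are assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def document_term_matrix(documents):
    """Return (vocabulary, rows); rows[i][j] counts term j in document i."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    # The vocabulary is the sorted set of all terms seen in any document.
    vocabulary = sorted(set().union(*counts))
    rows = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, rows


# Hypothetical sample documents, chosen only for illustration.
docs = [
    "Text mining finds structure in text.",
    "Data mining needs structured data.",
]
vocab, matrix = document_term_matrix(docs)
for term, column in zip(vocab, zip(*matrix)):
    print(term, column)
```

Each row of `matrix` is a fixed-length numeric record for one document, which is the kind of highly structured input that clustering or classification procedures expect. Real systems typically add steps such as stop-word removal, stemming or term weighting, which are left out here.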
Furthermore, text mining uses information retrieval methods, which are designed to find those text documents that are expected to be relevant for answering a search query. In contrast to data mining, information retrieval does not uncover potentially unknown meaning structures in the text material as a whole; instead, it identifies a set of individual documents that are hoped to be relevant, based on known keywords.
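Keyword-based retrieval of this kind can be sketched with a small inverted index, which maps each term to the documents that contain it. This is a simplified illustration under stated assumptions, not a description of any particular IR system: the names `build_index` and `search`, the sample documents and the rule that every query keyword must occur in a document are all chosen for the example.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map every lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for token in re.findall(r"[a-z]+", text.lower()):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return sorted ids of documents that contain every query keyword."""
    terms = re.findall(r"[a-z]+", query.lower())
    if not terms:
        return []
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        # Intersect with the documents containing the next keyword.
        result &= index.get(term, set())
    return sorted(result)


# Hypothetical sample documents, chosen only for illustration.
docs = [
    "Web mining analyses web content.",
    "Sentiment detection extracts attitudes.",
    "Web content mining is a text mining application.",
]
index = build_index(docs)
print(search(index, "web mining"))
print(search(index, "sentiment"))
```

The search only returns documents that match keywords the user already knows; it does not reveal structures across the collection, which is the distinction drawn above.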
Methods of information extraction aim to extract individual facts from texts. Information extraction often uses the same or similar procedural steps as text mining and is therefore sometimes regarded as a subfield of text mining. In contrast to many other types of text mining, at least the categories for which information is sought are known here: the user knows what they do not know.
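The idea that the categories are fixed in advance can be shown with a rule-based sketch in which each category is defined by a regular expression. The two categories (`year`, `acronym`), their patterns and the sample sentence are assumptions for illustration; practical information extraction systems usually rely on richer linguistic or statistical models.

```python
import re

# Categories are known in advance; only their values are unknown.
PATTERNS = {
    "year": re.compile(r"\b(?:19|20)\d{2}\b"),
    "acronym": re.compile(r"\(([A-Z]{2,})\)"),
}


def extract_facts(text):
    """Return a dict mapping each known category to the values found in text."""
    facts = {}
    for category, pattern in PATTERNS.items():
        # Use the first capture group if the pattern has one, else the whole match.
        facts[category] = [
            m.group(m.lastindex or 0) for m in pattern.finditer(text)
        ]
    return facts


sample = (
    "Text mining was introduced in 1995 as Knowledge Discovery from Text (KDT), "
    "in analogy to Knowledge Discovery in Databases (KDD)."
)
print(extract_facts(sample))
```

The extractor can only fill the slots it was given, so anything outside the predefined categories goes unnoticed.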
Methods of automatic text summarization (text extraction) produce a condensed version of a text or a collection of texts; unlike text mining, however, this does not go beyond what is explicitly present in the texts. Argumentation mining can be considered a continuation of text mining. Its aim is to extract argumentation structures.
Web mining, especially web content mining, is an important field of application for text mining. Attempts to establish text mining as a method of social science content analysis are still relatively new; one example is sentiment detection, which automatically extracts attitudes towards a topic.