This article continues the second part of the series on text mining in web content mining. Information retrieval, using the methods described previously, has already filtered the documents that match the user's request. Despite this filtering, the user still faces a very large number of relevant documents, and reading and processing all of them takes too much effort. Additional analysis tools are therefore needed. They must be able to recognize the structures within a text and let users extract the information they are looking for.
Natural language text is often considered unstructured because it lacks the structure familiar from databases. Contrary to this assumption, natural language texts have a generic structure of words, phrases and sentences, and this structure can be processed automatically once it is understood how words, phrases and sentences are built. This allows more effective information extraction than was previously possible with pattern recognition techniques and string manipulation alone. Natural language is made up of words and governed by rules on how these words may be ordered. To develop effective text mining systems, it is therefore necessary to use these rules, as well as word meanings, when processing texts. Of the broad field of computational linguistics, the areas most significant for text mining are described in more detail here: the morphology, syntax, and semantics of texts.
A word consists of a word stem, affixes (prefixes and suffixes) and inflections. The core of a word is the word stem, which is often itself a word. Affixes change the meaning of the word stem. Inflections are modifications of a word that express grammatical features such as number and tense. Morphological analysis supports text mining: it helps to reduce the complexity of the analysis and to represent word meanings.
---
To extract linguistically relevant features (for example words, phrases and texts) from the stream of characters, documents must first be prepared for machine processing. To this end, the individual units (tokens) are separated out of the text in the so-called tokenization step. The resulting tokens are then usually enriched with grammatical information. A part-of-speech tagger (POS tagger) classifies each token by its part of speech. This additional meta-information is attached to the tokens as tags.
Finally, a chunk parser combines the prepared words into phrasal structures. This process does not produce complete syntactic structures; it only identifies partial structures (chunks). This information is attached to the text as phrasal tags so that subsequent techniques can use it.
Reducing words to their word stem substantially reduces the complexity of text analysis. Once inflected forms are mapped to a common stem, the occurrences of each word can be counted, which is already a good indicator of how important a topic is in the respective document. In addition, morphology makes it possible to incorporate resources such as dictionaries and encyclopaedias and to recognize related words and phrases (for example, multi-word proper names).
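The three preparation steps (tokenization, part-of-speech tagging, chunking) can be illustrated with a minimal sketch. The lexicon `TOY_LEXICON`, the tag names and the chunk rule (a run of determiners, adjectives and nouns that contains at least one noun) are assumptions made for this example; a real tagger would be trained on an annotated corpus.

```python
import re

# Hypothetical toy lexicon; real POS taggers learn tags from annotated corpora.
TOY_LEXICON = {
    "the": "DET", "a": "DET",
    "quick": "ADJ", "lazy": "ADJ",
    "fox": "NOUN", "dog": "NOUN",
    "jumps": "VERB", "over": "PREP",
}

NP_TAGS = {"DET", "ADJ", "NOUN"}


def tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def pos_tag(tokens):
    """Attach a part-of-speech tag to each token; unknown tokens get 'UNK'."""
    return [(token, TOY_LEXICON.get(token.lower(), "UNK")) for token in tokens]


def chunk_noun_phrases(tagged):
    """Collect runs of DET/ADJ/NOUN tokens that contain at least one noun."""
    chunks, current = [], []
    # The sentinel at the end flushes a run that reaches the last token.
    for token, tag in tagged + [("", "END")]:
        if tag in NP_TAGS:
            current.append((token, tag))
            continue
        if any(t == "NOUN" for _, t in current):
            chunks.append(" ".join(tok for tok, _ in current))
        current = []
    return chunks


tagged = pos_tag(tokenize("The quick fox jumps over the lazy dog."))
print(chunk_noun_phrases(tagged))
```

As in the description above, the chunker does not build a full parse tree. It only marks partial structures that later steps can work with.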
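A minimal sketch of this idea: inflected forms are mapped to a common stem and the stems are counted. The suffix list `SUFFIXES` and the minimum stem length are assumptions for illustration; established stemmers use far more elaborate rule sets.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny suffix list; checked in this order.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word):
    """Strip the first matching suffix if at least three letters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stem_frequencies(text):
    """Count how often each stem occurs in the text."""
    return Counter(naive_stem(w) for w in re.findall(r"[A-Za-z]+", text))


counts = stem_frequencies("He walks, she walked, they are walking")
print(counts.most_common(1))
```

Without stemming, "walks", "walked" and "walking" would be counted as three unrelated words, each occurring once.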
In linguistics, the word is the smallest unit that can stand alone in the grammatical sense. Words are made up of morphemes, the smallest units that carry meaning. Morphemes can be divided into two categories: free morphemes, which can occur alone as a word, and bound morphemes, which do not occur alone as a word in the text but are attached to a word as prefixes or suffixes. Furthermore, morphemes are divided into content morphemes and functional morphemes. Content morphemes are typically word stems, which carry meaning independently of grammar. Functional morphemes behave differently: they serve to adapt a word to the grammar. Finally, morphemes are classified as inflectional or derivational. Derivational morphemes create new words, for example by converting a verb into a noun. Inflectional morphemes do not create new words but extend the word stem to fit the grammatical requirements. This shows how important morphology is for text mining. A good frequency analysis is only possible if the word stems can be identified and the individual words can be recognized with their grammatical function. Only the combination of content and functional morphemes makes it possible to find the required grammatical information of a word.
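The distinction between stems, derivational morphemes and inflectional morphemes can be sketched as a simple segmenter. The affix lists and the function `segment` are hypothetical and cover only a handful of English affixes. Because it strips purely by rule, the sketch over-segments some words (for example, it treats the "re" in "reader" as a prefix), which is why real systems combine such rules with a lexicon.

```python
# Hypothetical, deliberately tiny affix lists for illustration only.
PREFIXES = ("un", "re")
DERIVATIONAL_SUFFIXES = ("able", "ness", "er")
INFLECTIONAL_SUFFIXES = ("ing", "ed", "s")
MIN_STEM = 3  # assumed minimum number of letters left after stripping


def _strip_suffix(text, suffixes):
    """Return (remaining text, stripped suffix); suffix is '' if none matched."""
    for suffix in suffixes:
        if text.endswith(suffix) and len(text) - len(suffix) >= MIN_STEM:
            return text[: -len(suffix)], suffix
    return text, ""


def segment(word):
    """Split a word into prefix, stem, derivational and inflectional suffix."""
    rest = word.lower()
    prefix = ""
    for candidate in PREFIXES:
        if rest.startswith(candidate) and len(rest) - len(candidate) >= MIN_STEM:
            prefix, rest = candidate, rest[len(candidate):]
            break
    # Inflection sits on the outside of the word, so it is removed first.
    rest, inflection = _strip_suffix(rest, INFLECTIONAL_SUFFIXES)
    stem, derivation = _strip_suffix(rest, DERIVATIONAL_SUFFIXES)
    return {
        "prefix": prefix,
        "stem": stem,
        "derivation": derivation,
        "inflection": inflection,
    }


print(segment("teachers"))
print(segment("unbreakable"))
```

In "teachers", the derivational suffix "er" turns the verb into a noun, while the inflectional "s" only marks the plural.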
Just as morphology has practical implications for the analysis of words, syntax serves the same function for phrases and sentences. This is possible because the rules of syntax describe how words can be combined into phrases and sentences. Syntactic analysis can recognize noun, verb, prepositional and adjective phrases. These phrases are the building blocks from which more complicated phrases and sentences are formed. In combination with syntactic rules, the individual phrases can be arranged in a hierarchy that represents how they relate to each other.
By analyzing the relationship between verb and noun phrases, the role of nouns in a sentence can be determined. Just as word stems restrict which affixes can be used, verbs limit the number and type of noun phrases that can appear in a sentence. The resulting so-called case assignments are stored in lexicons and used to search for patterns. In combination with syntactic rules, morphological information can be used to recognize structures in the form of word and phrase patterns, thus providing the basis for semantic analysis.
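How a lexicon of case assignments might be used can be sketched as follows. The lexicon `CASE_FRAMES`, the role names and the purely positional mapping of noun phrases to roles are assumptions for this example; real systems also consult the syntactic analysis to decide which phrase fills which role.

```python
# Hypothetical case-frame lexicon: each verb lists the roles it allows, in order.
CASE_FRAMES = {
    "give": ["agent", "recipient", "theme"],
    "read": ["agent", "theme"],
    "sleep": ["agent"],
}


def assign_roles(verb, noun_phrases):
    """Map noun phrases to the roles of the verb's case frame.

    Returns None if the verb is unknown or if the number of noun phrases
    does not match what the verb allows.
    """
    frame = CASE_FRAMES.get(verb)
    if frame is None or len(frame) != len(noun_phrases):
        return None
    return dict(zip(frame, noun_phrases))


print(assign_roles("give", ["the teacher", "the student", "a book"]))
print(assign_roles("sleep", ["the cat", "a book"]))  # too many noun phrases
```

The second call returns `None` because the assumed frame for "sleep" allows only one noun phrase. This is the restriction that verbs place on the nouns in a sentence.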
Semantics is the part of linguistics that deals with the meaning of natural language expressions. Unlike syntax, semantics deals with the meaning of words (lexical semantics), sentences (sentence semantics) and texts (discourse semantics).
The goal is a space-efficient representation of meaning that allows programs to make decisions immediately. One possible approach that meets these requirements is the use of semantic networks. Semantic networks use nodes and arrows to represent connections between objects, events, and concepts. This type of network has proven useful for classifying and generalizing topics, because it makes it possible to search for topics instead of keywords. The problem is the large amount of time required to build general networks with a rich vocabulary; in practice, semantic networks can only be used in limited domains.
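A semantic network of this kind can be sketched as a small labeled graph. The class `SemanticNetwork`, the relation name `is_a` and the example concepts are assumptions chosen for illustration. The point is that a search for a topic also finds terms that generalize to it.

```python
from collections import defaultdict


class SemanticNetwork:
    """Nodes are concepts; labeled arrows are stored as (source, relation) -> targets."""

    def __init__(self):
        self.edges = defaultdict(set)

    def add(self, source, relation, target):
        self.edges[(source, relation)].add(target)

    def generalizations(self, concept):
        """Return every concept reachable from `concept` via 'is_a' arrows."""
        seen, stack = set(), [concept]
        while stack:
            node = stack.pop()
            for parent in self.edges.get((node, "is_a"), ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

    def matches_topic(self, term, topic):
        """A term matches a topic if it is the topic or generalizes to it."""
        return term == topic or topic in self.generalizations(term)


net = SemanticNetwork()
net.add("poodle", "is_a", "dog")
net.add("dog", "is_a", "animal")
net.add("dog", "has_part", "tail")

print(net.matches_topic("poodle", "animal"))
```

A keyword search for "animal" would miss a document that only mentions "poodle", while the topic search over the network does not. The cost is that every such relation has to be entered first, which is why the approach is limited to small domains.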
Statistical language processing
Text processing techniques based on morphology, syntax and semantics are powerful tools for extracting information from texts. They make it possible to find documents based on topics or keywords and to scan texts for meaningful phrase patterns, so that key features and their relationships can be extracted. In addition, documents can be stored in a simple form that allows easy navigation and further extraction of information, which goes far beyond the possibilities of information retrieval techniques.
Like much else in life, the techniques presented have their limitations. Among the problems are the correct recognition of the roles of identified noun phrases, on which correct information extraction depends, and the orderly representation of abstract concepts. Semantic networks are well suited for representing part-whole relations (compositions and aggregations) and subset relations (inheritance). It proves much more difficult to draw inferences without the complexity becoming too great. Synonyms and specialized domains, in which many very similar concepts are described, also prove problematic. A general classification system would need too many concepts to really be able to classify all kinds of topics, and with that many concepts they could no longer all be represented at once. Statistical techniques can eliminate some of these problems by combining the results of the linguistic analysis with simple statistical measures.
The typical tasks of text mining include the automated creation of text summaries. As already explained, this task can be approached by simply finding the most significant concepts. To avoid semantic networks and their limitations, word frequencies can be used to find the most essential concepts of a text. The significance of a word can be determined simply by counting the occurrences of shared word stems, and the importance of a sentence can in turn be determined from the importance of the words it contains. Extracting the most important sentences yields a simple but effective summary of a text. Combining linguistics with statistical techniques also simplifies semantic networks. In such networks, each node usually represents a word or term, and the relationships between nodes are described by arrows. These arrows carry the degree of correlation between the nodes, which shows how often the words are used side by side. Although this method cannot represent the full meaning of a text, it does provide information about how important topics are relative to each other. Simple word-frequency counts could thus be replaced by applications that additionally consider how terms are related to one another. The linguistic approaches are entirely sufficient for individual message texts. Statistical methods, in turn, are particularly well suited for large text collections such as newsgroups or newspaper archives.
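Both ideas, the frequency-based summary and the counting of words that occur together, can be sketched in a few lines. The stop word list, the sentence splitter and the use of "same sentence" as an approximation of "side by side" are assumptions for this example. For brevity the sketch counts plain words rather than word stems.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical stop word list; real systems use much longer ones.
STOPWORDS = {"the", "a", "an", "of", "and", "is", "are", "in", "to", "it"}


def content_words(sentence):
    """Lowercased words of a sentence without stop words."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOPWORDS]


def split_sentences(text):
    """Very rough sentence splitter: break after '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def summarize(text, max_sentences=1):
    """Return the highest-scoring sentences in their original order."""
    sentences = split_sentences(text)
    freq = Counter(w for s in sentences for w in content_words(s))

    def score(sentence):
        # A sentence is as important as the sum of its word frequencies.
        return sum(freq[w] for w in content_words(sentence))

    top = sorted(sentences, key=score, reverse=True)[:max_sentences]
    return [s for s in sentences if s in top]


def cooccurrence(text):
    """Count how often two content words appear in the same sentence."""
    pairs = Counter()
    for sentence in split_sentences(text):
        for a, b in combinations(sorted(set(content_words(sentence))), 2):
            pairs[(a, b)] += 1
    return pairs


text = "Text mining extracts information from text. The weather is nice. Mining needs data."
print(summarize(text))
print(cooccurrence(text)[("mining", "text")])
```

The co-occurrence counts correspond to the weights on the arrows of the simplified network. Note that summing frequencies favors long sentences, so a practical scorer would probably normalize by sentence length.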
Macrostructures
The techniques presented so far treat every part of a text in the same way. This becomes problematic as soon as a larger text consists of different sections with different contents and emphases. A frequency-based analysis would include only the sentences with the most commonly used terms in the summary. Consequently, the summary would contain significantly more information about the longer sections. However, the importance of a section cannot be determined simply by its length. Furthermore, it has been ignored so far that in some texts, such as memos or reports, the information carries a different weight for different target groups.
In contrast to the mostly artificially created macrostructures, which are used to better structure large volumes of text, microstructures describe the language level of a text. Macrostructures include subdivisions such as chapters and headings, as well as presentation and meaning information for text elements, expressed in tags such as HTML or XML. They also include references such as hyperlinks in documents on the Internet, which are used for easy navigation between relevant documents and can also be used to analyze the importance of a document. This is done by measuring how many references point from a document to other documents (making it a so-called hub) and how many references from other documents point to that document (making it a so-called authority). Google, the best-known search engine, uses link analysis of this kind under the name PageRank. In addition to weighting by the frequency of search terms, it refines the search results using the importance derived from the existing link structure.
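One conceivable way to keep long sections from dominating is to summarize each section separately. The following sketch assumes that the sections are already available as a mapping from title to text, and it scores sentences by average rather than total word frequency. Both are choices made for this example, not a method prescribed above.

```python
import re
from collections import Counter


def words(text):
    """Lowercased alphabetic words of a text."""
    return re.findall(r"[a-z]+", text.lower())


def best_sentence(section_text):
    """Pick the sentence with the highest average word frequency in its section."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", section_text) if s.strip()]
    if not sentences:
        return ""
    freq = Counter(words(section_text))

    def score(sentence):
        tokens = words(sentence)
        # Averaging keeps long sentences from winning by length alone.
        return sum(freq[w] for w in tokens) / max(len(tokens), 1)

    return max(sentences, key=score)


def summarize_by_section(sections):
    """Return one representative sentence per section, whatever its length."""
    return {title: best_sentence(body) for title, body in sections.items()}


report = {
    "Intro": "Mining is fun. Text mining mines text.",
    "Results": "Recall improved slightly. Recall drives recall-based ranking.",
}
print(summarize_by_section(report))
```

Each section contributes exactly one sentence here. Weighting sections differently for different target groups would require additional information, such as the section titles or the reader's role.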
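The hub and authority idea can be sketched as an iterative computation over a link graph. This is a minimal version of hub/authority scoring (often called HITS), not Google's PageRank formula, which is a related but different link-analysis method. The function name `hub_authority_scores`, the example graph and the fixed iteration count are assumptions.

```python
import math


def hub_authority_scores(links, iterations=20):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of the hub scores of all pages linking to the page.
        auth = {p: 0.0 for p in pages}
        for source, targets in links.items():
            for target in targets:
                auth[target] += hub[source]
        # Hub: sum of the authority scores of all pages the page links to.
        hub = {p: sum(auth[t] for t in links.get(p, ())) for p in pages}
        # Normalize both score sets so the values stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


links = {"a": ["c"], "b": ["c"], "c": []}
hub, auth = hub_authority_scores(links)
print(max(auth, key=auth.get))  # page with the highest authority score
print(max(hub.values()))        # highest hub score
```

In this small graph, page "c" is linked to by both other pages and therefore becomes the authority, while "a" and "b" act as equally strong hubs.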
Result presentation
Text mining results are usually presented in the browser. Because of the size of the result set, easy navigation through the documents must be possible. A simplified representation of the information makes pattern recognition faster, which is why visualization tools play an ever-increasing role. Visualization makes it easier for users to recognize keywords and to choose between documents. In the early days of text mining, users could not interact with the graphics offered, so it was very cumbersome to bring new findings into the search. Interactive graphics have resolved this issue: the user can make a selection with simple mouse clicks, which refines or changes the search. In some text mining systems, users can also design their own query dialogs from the outset. In the fourth part of this series, we discuss the areas of application and tasks of text mining.