In the third part of Uses of Text Mining in Web Content Mining, we announced that this part would cover the application areas, or tasks, of text mining. Basically, the different methods analyze texts and make implicit information explicit. They then relate the information found in different texts and highlight and visualize these relations. The descriptions below give an overview of the many technologies and their role in text mining.
Information extraction
Information extraction processes text to identify selected information, such as specific types of names or specified characteristics of events. For names, it is sufficient to find them in the text and recognize their type. For events, the critical information (people, objects, date, location, etc.) must be extracted from the text and placed into a given structure. In information extraction, this structure is called a template.
The following example is a template for extracting information about management changes in companies:
---
<TEMPLATE> :=
    DOC_NR: "NUMBER"
    CONTENT: <SUCCESSION_EVENT> *
<SUCCESSION_EVENT> :=
    SUCCESSION_ORG: <ORGANIZATION>
    POST: "POSITION TITLE" | "no title"
    IN_AND_OUT: <IN_AND_OUT> +
    VACANCY_REASON: {DEPART_WORKFORCE, REASSIGNMENT, NEW_POST_CREATED, OTH_UNK}
<IN_AND_OUT> :=
    IO_PERSON: <PERSON>
    NEW_STATUS: {IN, IN_ACTING, OUT, OUT_ACTING}
    ON_THE_JOB: {YES, NO, UNCLEAR}
    OTHER_ORG: <ORGANIZATION>
    REL_OTHER_ORG: {SAME_ORG, RELATED_ORG, OUTSIDE_ORG}
<ORGANIZATION> :=
    ORG_NAME: "NAME"
    ORG_ALIAS: "ALIAS" *
    ORG_DESCRIPTOR: "DESCRIPTOR"
    ORG_TYPE: {GOVERNMENT, COMPANY, OTHER}
    ORG_LOCALE: LOCALE-STRING {{LOC_TYPE}} *
    ORG_COUNTRY: NORMALIZED-COUNTRY-OR-REGION | COUNTRY-OR-REGION STRING *
<PERSON> :=
    PER_NAME: "NAME"
    PER_ALIAS: "ALIAS" *
    PER_TITLE: "TITLE" *
LOC_TYPE :: {CITY, PROVINCE, COUNTRY, REGION, UNK}
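To make the idea of filling a template more concrete, here is a minimal Python sketch. It mirrors only a few slots of the template above (`SUCCESSION_ORG`, `POST`, `IN_AND_OUT`, `IO_PERSON`, `NEW_STATUS`, `VACANCY_REASON`). The class names, the regular expression and the example sentence are assumptions made for illustration; real extraction systems use far richer linguistic patterns.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Person:
    per_name: str


@dataclass
class Organization:
    org_name: str


@dataclass
class InAndOut:
    io_person: Person
    new_status: str  # one of IN, IN_ACTING, OUT, OUT_ACTING


@dataclass
class SuccessionEvent:
    succession_org: Organization
    post: str
    in_and_out: List[InAndOut] = field(default_factory=list)
    vacancy_reason: str = "OTH_UNK"  # unknown unless the text states a reason


# Hypothetical pattern covering a single phrasing: "<New> succeeds <Old> as <post> of <Org>."
PATTERN = re.compile(
    r"(?P<new>[A-Z]\w+ [A-Z]\w+) succeeds (?P<old>[A-Z]\w+ [A-Z]\w+)"
    r" as (?P<post>[\w ]+?) of (?P<org>[A-Z][\w ]+?)\."
)


def extract_succession(sentence: str) -> Optional[SuccessionEvent]:
    """Fill a SuccessionEvent from one sentence, or return None if nothing matches."""
    match = PATTERN.search(sentence)
    if match is None:
        return None
    return SuccessionEvent(
        succession_org=Organization(match.group("org")),
        post=match.group("post"),
        in_and_out=[
            InAndOut(Person(match.group("new")), "IN"),
            InAndOut(Person(match.group("old")), "OUT"),
        ],
    )


# Made-up example sentence.
event = extract_succession(
    "John Smith succeeds Jane Doe as chief executive of Acme Corp."
)
```

The point of the sketch is the direction of the data flow: unstructured text goes in, and a structure with fixed, predefined slots comes out.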
Information extraction systems are evaluated along two dimensions, precision and completeness (recall), which are combined in the so-called “F-measure”. Because the systems require neither full text comprehension nor a complete grammatical analysis, they can achieve high precision and completeness. Information extraction is always designed for a specific information need and systematically analyzes texts for predefined data, phrases and text segments.
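As a rough illustration of how precision, completeness (recall) and the F-measure relate, the following Python sketch compares a set of extracted items with a hand-annotated reference set. The function name and the example data are assumptions made for this illustration.

```python
def f_measure(extracted, gold, beta=1.0):
    """Return (precision, recall, F_beta) for extracted items against a gold set."""
    correct = len(extracted & gold)
    # Precision: share of extracted items that are correct.
    precision = correct / len(extracted) if extracted else 0.0
    # Recall (completeness): share of gold items that were found.
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta * beta
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f


# Made-up example: names found by a system vs. names annotated by hand.
extracted = {"Acme Corp", "John Smith", "Jane Doe", "Paris"}
gold = {"Acme Corp", "John Smith", "Mary Jones"}
precision, recall, f1 = f_measure(extracted, gold)
```

With `beta=1.0` the F-measure is the harmonic mean of precision and recall; other values of `beta` weight recall more or less heavily than precision.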
Topic Detection and Tracking
Topic Detection and Tracking (TDT) refers to automatic techniques for finding thematically related material in data streams. These techniques can be very valuable in applications where efficient and timely access to information is essential.
One use of TDT is a free notification service that automatically emails the user whenever new results for a given search term appear on the Internet.
Beyond news sources, TDT is used in many other areas. In science, it can ensure that the latest references and publications in a particular area of research are always available. Likewise, it can be used on the stock market to provide traders with the latest information and news about a company so that they can reconsider or adjust their investments. A company can also use a TDT system to monitor itself, its competitors and the products on the market.
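One simple way to picture topic tracking in a stream is single-pass clustering: each incoming story is compared with the topics seen so far and either joins the most similar topic or starts a new one. The Python sketch below assumes a bag-of-words representation, cosine similarity and a threshold of 0.3. These choices and the example headlines are illustrative assumptions, not a description of any particular TDT system.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Turn a text into a bag-of-words vector (word -> count)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def track(stream, threshold=0.3):
    """Assign each story to the most similar known topic or open a new topic."""
    topics = []  # each topic: {"centroid": Counter, "stories": [str, ...]}
    for story in stream:
        vec = vectorize(story)
        best = max(topics, key=lambda t: cosine(vec, t["centroid"]), default=None)
        if best is not None and cosine(vec, best["centroid"]) >= threshold:
            best["centroid"].update(vec)  # fold the story into the topic
            best["stories"].append(story)
        else:
            topics.append({"centroid": Counter(vec), "stories": [story]})
    return topics


# Made-up headlines arriving one after another.
stream = [
    "Acme shares fall after weak earnings report",
    "Storm hits coast and floods several towns",
    "Acme earnings report sends shares lower",
    "Coast towns recover after storm floods",
]
topics = track(stream)
```

A notification service could then alert the user whenever a new story joins a topic they follow.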
Automatic Text Summarization
A summary is generally defined as follows: it is a text that is produced from one or more texts, that contains a significant portion of the information in the original text(s), and that is no longer than half of the original text(s). Hence the main task of a summary is to reproduce the key messages of the text in fewer words.
There are two different approaches to creating summaries: extraction and abstraction. In sentence extraction, a combination of statistical heuristics assigns each sentence an individual score. The highest-scoring sentences are considered the most important and are extracted to form the summary. The abstraction approach simplifies and compresses the text. This requires an understanding of the topics and the ability to rewrite the text. Given these requirements, abstraction is much harder to implement than extraction, which is why extraction is much more common in automated text summarization.
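The extraction approach can be sketched in a few lines of Python. The scoring heuristic used here (average corpus frequency of a sentence's non-stopword terms) is only one simple assumption among many possible statistical heuristics, and the stopword list and example text are made up for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; real systems use much larger ones.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "that",
             "for", "on", "are", "as", "with"}


def tokenize(text):
    """Lowercase words without stopwords."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def summarize(text, max_sentences=2):
    """Return the highest-scoring sentences in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(tokenize(text))

    def score(sentence):
        tokens = tokenize(sentence)
        # Average term frequency, so long sentences are not favored automatically.
        return sum(freq[w] for w in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in chosen)


text = ("Text mining finds patterns in text. The weather was nice. "
        "Text mining tools analyze text automatically.")
summary = summarize(text, max_sentences=2)
```

Here the off-topic sentence about the weather receives the lowest score and is left out of the summary.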
An important application of automatic text summarization is found in search engines. Internet search engines allow the user to search countless web pages for specific content, but presenting the results to the user is a problem. Here, automatic text summarization systems can play an important role by summarizing the results so that the user can assess the relevance of the hits more quickly. Google, the world’s most popular search engine, has used such a technique for several years to present its results in simplified form.
Because dynamic web applications increasingly blur the boundaries of individual documents, the algorithms used today must also be able to process multiple text documents.
A simple example of automatic text summarization is the “AutoSummarize” feature of older versions of Microsoft Word. The user could choose the percentage of the total text to extract for the summary. (The feature was removed in Word 2010.) Another use case arises when researchers, medical staff, or companies receive thousands of potentially relevant documents and need summaries to decide which ones to read.
Summarizing websites for smaller devices has also been discussed. Websites are designed for large monitors and are not very reader-friendly on smartphones. Automatic summarization would convert them into a meaningful, easy-to-read and above all searchable format. Today, with the proliferation of smartphones, it is more likely that stand-alone, more compact websites will be created for mobile devices.
Categorization
The goal of text categorization is to classify documents into a fixed number of predefined categories. Each document can be assigned to several categories, exactly one, or none. The documents to be classified can be texts, pictures, music or anything else, and each type poses its own classification challenges. The documents can then be classified according to their topics or other characteristics (author, year, type, etc.).
For categorization, a document is treated as just a collection of words; the text is not analyzed in the way it is for, say, information extraction. Instead, the words are simply counted, and the counts identify the main topics of the document. For this purpose, a thesaurus of the given topics is often used, and the relationships are determined by searching for sub-terms, synonyms and related terms.
Like automatic text summarization and Topic Detection and Tracking, categorization can be used to further determine how relevant a document is to the information you are looking for. For example, many companies offer customer support or need to answer individual customer questions. When their documents have been categorized by a system, the end user can find the information they need much more quickly. For today’s search engines, automatic text categorization has become indispensable, because new documents are published on the Internet, and old ones removed, far too quickly for manual processing. For this reason, popular search engines use the technique to always provide an up-to-date and closely related set of results.
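The word-counting idea can be sketched as follows in Python. The small thesaurus, the category names and the `min_hits` threshold are assumptions chosen for illustration; production systems typically rely on much larger vocabularies or on trained classifiers.

```python
import re
from collections import Counter

# Hypothetical thesaurus: category -> terms, synonyms and related terms.
THESAURUS = {
    "finance": {"stock", "shares", "investor", "market", "earnings"},
    "medicine": {"patient", "doctor", "therapy", "clinical", "drug"},
}


def categorize(text, thesaurus=THESAURUS, min_hits=2):
    """Return every category whose terms occur at least min_hits times."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    scores = {
        category: sum(counts[term] for term in terms)
        for category, terms in thesaurus.items()
    }
    # A document may end up in several categories, exactly one, or none.
    return sorted(c for c, hits in scores.items() if hits >= min_hits)


one = categorize("The stock market rallied as investor confidence grew")
none = categorize("Hello world")
both = categorize(
    "The drug maker's shares rose after clinical trial results; "
    "the stock market cheered the new drug"
)
```

Because the document is treated purely as a bag of words, word order and sentence structure play no role in the result.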
Clustering
Clustering is a technique for grouping similar documents together while keeping the groups themselves as different from each other as possible. It differs from categorization in that the documents are grouped on the fly rather than assigned to predefined topics. A further advantage is that a document can appear in several groups, which ensures that a relevant document is not omitted from the search results. A basic clustering algorithm creates a vector of topics for each document and evaluates how well the document fits into the different clusters.
Clustering algorithms group a set of documents into subsets, or clusters. Their goal is to create clusters that are internally coherent but clearly different from each other. In other words, documents within a cluster should be as similar as possible, and documents in one cluster should be as dissimilar as possible from documents in other clusters. If, for example, you search for the term “cell”, the results can be grouped into clusters such as “biology”, “battery” and “prison”.
In the fifth and final part, we discuss Concept Linkage, Information Visualization and Question-Answering Systems, and draw a conclusion.
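The “cell” example can be sketched with a small k-means-style procedure in Python. The bag-of-words vectors, the cosine similarity, the seeding with the first `k` documents and the example snippets are simplifying assumptions for illustration. This variant also assigns each document to exactly one cluster, whereas other clustering methods allow overlapping groups.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Turn a text into a bag-of-words vector (word -> count)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def cluster(docs, k, iterations=10):
    """Return a cluster index for every document."""
    vectors = [vectorize(d) for d in docs]
    # Simplification: seed the clusters with the first k documents.
    centroids = [Counter(v) for v in vectors[:k]]
    assignment = [0] * len(docs)
    for _ in range(iterations):
        # Assign each document to the centroid it fits best.
        assignment = [
            max(range(k), key=lambda c: cosine(v, centroids[c])) for v in vectors
        ]
        # Recompute each centroid as the sum of its members
        # (the scale does not matter for cosine similarity).
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                total = Counter()
                for v in members:
                    total.update(v)
                centroids[c] = total
    return assignment


# Made-up result snippets for the query "cell".
docs = [
    "cell membrane nucleus biology organism",
    "battery cell voltage lithium charge",
    "prison cell inmate guard jail",
    "biology organism cell division nucleus",
    "lithium battery charge cell capacity",
    "jail inmate prison cell sentence",
]
assignment = cluster(docs, k=3)
```

No topic labels are given in advance; the three groups emerge only from the word overlap between the snippets.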