The term data mining is a bit misleading, because it is about gaining knowledge from existing data, not about generating the data itself.
What is Data Mining?
Data mining is the systematic application of statistical methods to large databases with the aim of identifying new patterns and trends. The mere capture, storage, and processing of large amounts of data is sometimes also referred to by the buzzword data mining. In the scientific context, however, the term primarily refers to the extraction of knowledge that is statistically valid, previously unknown, and potentially useful for determining regularities and hidden relationships.
Many of the methods used in data mining come from statistics, especially multivariate statistics. For use in data mining they are often adapted only to reduce their computational complexity, frequently by approximation at the expense of accuracy. The loss of accuracy is often associated with a loss of statistical validity, so that from a purely statistical point of view the procedures can sometimes even be wrong. In data mining, however, experimentally verified utility and acceptable runtime often matter more than statistically proven correctness.
Machine learning is also closely related, but data mining focuses on finding new patterns, while machine learning is primarily about having the computer automatically recognize known patterns in new data. A clean separation is not always possible, however. For example, extracting association rules from data is a typical data mining task, yet the extracted rules also fulfill the goals of machine learning. Conversely, unsupervised learning, a subfield of machine learning, is very closely related to data mining. Machine learning techniques are often used in data mining and vice versa.
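To make the association rule example concrete, here is a minimal sketch that derives rules between pairs of items from a handful of shopping baskets, using the usual support and confidence measures. The function name `pair_rules`, the thresholds, and the sample baskets are assumptions made up for illustration. Real tools use more scalable algorithms and handle larger item sets.

```python
from collections import Counter
from itertools import combinations


def pair_rules(transactions, min_support=0.5, min_confidence=0.6):
    """Return (antecedent, consequent, support, confidence) for item pairs."""
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in transactions:
        items = sorted(set(basket))  # ignore duplicates within one basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # share of baskets containing both items
        if support < min_support:
            continue
        # Each frequent pair yields two candidate rules: a -> b and b -> a.
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]  # estimate of P(rhs | lhs)
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)


# Hypothetical sample data.
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["bread", "jam"],
    ["butter", "milk"],
]

rules = pair_rules(baskets)
for lhs, rhs, support, confidence in rules:
    print(f"{lhs} -> {rhs}: support={support:.2f}, confidence={confidence:.2f}")
```

With these baskets, the rule "milk -> butter" has a confidence of 1.0, because every basket that contains milk also contains butter.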
---
Research into database systems, and in particular into index structures, plays an important role in data mining when it comes to reducing complexity. Typical tasks such as nearest neighbor search can be sped up significantly with a suitable database index, which in turn improves the runtime of a data mining algorithm.
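As a rough illustration of why an index helps, the sketch below compares a brute-force nearest neighbor search with a lookup in a sorted list. The one-dimensional sorted list only stands in for a real database index, which would typically be a tree or spatial structure. The class name `SortedIndex` and the sample values are hypothetical.

```python
import bisect


def nearest_linear(values, query):
    """Brute force: inspect every value, O(n) per query."""
    return min(values, key=lambda v: abs(v - query))


class SortedIndex:
    """A minimal one-dimensional index: a sorted copy of the values."""

    def __init__(self, values):
        self._sorted = sorted(values)  # built once, reused for every query

    def nearest(self, query):
        """Binary search, then compare the two neighbours: O(log n) per query."""
        if not self._sorted:
            raise ValueError("index is empty")
        pos = bisect.bisect_left(self._sorted, query)
        # Only the elements just before and at the insertion point can be nearest.
        candidates = self._sorted[max(pos - 1, 0):pos + 1]
        return min(candidates, key=lambda v: abs(v - query))


# Hypothetical sample data.
values = [42.0, 7.5, 19.0, 3.2, 88.1]
index = SortedIndex(values)
print(nearest_linear(values, 20.0), index.nearest(20.0))
```

Building the index costs one sort, but every later query only touches a few elements instead of the whole data set. That is the effect a suitable database index has on algorithms that issue many nearest neighbor queries.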
Information retrieval (IR), the computer-aided search for complex content and its presentation to the user, is another field that benefits from the findings of data mining. Data mining techniques such as clustering are used to improve search results and their presentation, for example by grouping similar search results. Text mining and web mining are two specializations of data mining that are closely related to information retrieval.
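Grouping similar search results can be sketched in a few lines. The example below is a deliberately simple greedy grouping based on the Jaccard similarity of the words in each title. It is not the method of any particular search engine or clustering library, and the function names, threshold, and sample titles are assumptions for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_results(titles, threshold=0.3):
    """Greedy single pass: join the first group whose first title is similar enough."""
    groups = []  # each entry: (token set of the group's first title, member titles)
    for title in titles:
        tokens = set(title.lower().split())
        for rep_tokens, members in groups:
            if jaccard(tokens, rep_tokens) >= threshold:
                members.append(title)
                break
        else:
            # No existing group was similar enough, so start a new one.
            groups.append((tokens, [title]))
    return [members for _, members in groups]


# Hypothetical search result titles for the ambiguous query "jaguar".
results = [
    "jaguar car price list",
    "jaguar cat habitat",
    "jaguar car dealer price",
    "jaguar cat diet and habitat",
]

grouped = group_results(results)
for members in grouped:
    print(members)
```

Here the car-related and the animal-related titles end up in separate groups, which a search interface could then present under separate headings.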
Collecting information in a systematic way is an important prerequisite for obtaining valid results with data mining. If the data was not collected in a statistically sound way, it may contain a systematic error (bias) that then resurfaces in the data mining step. In that case the result may reflect the way the data was collected rather than the objects that were observed.
Data mining involves six common classes of tasks:
- Anomaly detection (outlier/change/deviation detection)
- Association rule learning (dependency modelling)
- Clustering
- Classification
- Regression
- Summarization
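To give a feel for one of these task classes, the following minimal sketch performs anomaly detection on a list of numbers. It uses the modified z-score based on the median and the median absolute deviation (MAD), with the commonly used cutoff of 3.5. The function name `mad_outliers` and the sample readings are hypothetical, and this is only one of many possible outlier detection techniques.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds the threshold."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # At least half of the values equal the median; the score is undefined.
        return []
    # 0.6745 rescales the MAD so the score is comparable to a standard z-score
    # for normally distributed data.
    return [v for v in values if abs(0.6745 * (v - med) / mad) > threshold]


# Hypothetical sensor readings with one obvious anomaly.
readings = [10, 11, 9, 10, 12, 10, 50]
outliers = mad_outliers(readings)
print(outliers)
```

The median-based score is used here instead of mean and standard deviation because a single extreme value distorts the median far less than it distorts the mean.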
The knowledge discovery in databases (KDD) process is commonly defined with the stages:
- Selection
- Pre-processing
- Transformation
- Data mining
- Interpretation/evaluation
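The stages listed above can be sketched as a small pipeline in which each stage is one function. Everything in this example is an assumption made for illustration: the records, the field names, and the choice of a simple per-region summary as the data mining step. A real KDD process is iterative and far more involved.

```python
from statistics import mean

# Hypothetical raw records.
RAW = [
    {"customer": "a", "region": "north", "spend": 120.0},
    {"customer": "b", "region": "north", "spend": None},
    {"customer": "c", "region": "south", "spend": 80.0},
    {"customer": "d", "region": "south", "spend": 40.0},
    {"customer": "e", "region": "north", "spend": 200.0},
]


def select(records):
    """Selection: keep only the fields relevant to the question."""
    return [{"region": r["region"], "spend": r["spend"]} for r in records]


def preprocess(records):
    """Pre-processing: drop records with missing values."""
    return [r for r in records if r["spend"] is not None]


def transform(records):
    """Transformation: rescale spend to the range [0, 1]."""
    lo = min(r["spend"] for r in records)
    hi = max(r["spend"] for r in records)
    span = (hi - lo) or 1.0  # avoid division by zero if all values are equal
    return [{**r, "spend": (r["spend"] - lo) / span} for r in records]


def mine(records):
    """Data mining: summarize the scaled spend per region."""
    groups = {}
    for r in records:
        groups.setdefault(r["region"], []).append(r["spend"])
    return {region: mean(vals) for region, vals in groups.items()}


def interpret(summary):
    """Interpretation/evaluation: turn the pattern into a readable statement."""
    top = max(summary, key=summary.get)
    return f"highest average scaled spend: {top} ({summary[top]:.2f})"


result = interpret(mine(transform(preprocess(select(RAW)))))
print(result)
```

Keeping each stage as a separate function mirrors the KDD stages one to one and makes it easy to swap, for instance, the summarization step for a clustering or classification step.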
While most data mining methods try to deal with the most general data possible, there are also specializations for more specific types of data.
- Text mining
- Web mining
- Time series analysis
Use Cases of Data Mining
In addition to applications in the related fields of computer science, data mining is also increasingly used in industry:
- Decision support systems
- In the financial sector:
  - Invoice verification for fraud detection
  - Credit scoring to determine default probabilities, which can be seen as a classic example of data mining
- In marketing:
  - Market segmentation, for example grouping customers by similar buying behavior or interests for targeted advertising
  - Shopping cart analysis for price optimization and product placement in the supermarket
  - Audience selection for advertising campaigns
  - Customer profiling for customer relationship management (CRM) systems
  - Business intelligence
- On the Internet:
  - Attack detection
  - Recommendation services for products such as movies and music
  - Network analysis in social networks
  - Web usage mining to analyze user behavior
  - Text mining for the analysis of large text collections
- Pharmacovigilance (post-market surveillance for unknown adverse events)
- Medicine
- Nursing
- Bibliometrics
- Exploratory data analysis
- Process analysis and optimization:
  With the help of data mining, technical processes can be analyzed and the relationships between the individual process variables can be determined. This helps to control and optimize processes. Early successes have already been achieved in the chemical industry and in plastics processing.
Examples of Data Mining Software
The following applications are available under free/open-source software (F/OSS) licenses:
- Carrot2: A framework for clustering text and search results.
- ELKI: A university research project with advanced cluster analysis and outlier detection methods, written in Java.
- GATE: A natural language processing and language engineering tool.
- KNIME: The Konstanz Information Miner, a user-friendly and comprehensive data analytics framework.
- Massive Online Analysis (MOA): A tool for real-time mining of big data streams with concept drift, written in Java.
- MEPX: A cross-platform tool for regression and classification problems, based on a Genetic Programming variant.
- ML-Flex: A software package that enables users to integrate with third-party machine learning packages written in any programming language, execute classification analyses in parallel across multiple computing nodes, and produce HTML reports of classification results.
- MLPACK: A collection of ready-to-use machine learning algorithms, written in C++.
- NLTK (Natural Language Toolkit): A suite of libraries and programs for symbolic and statistical natural language processing (NLP) in Python.
- OpenNN: An open neural networks library.
- Orange: A component-based data mining and machine learning software suite, written in Python.
- R: A programming language and software environment for statistical computing, data mining, and graphics. It is part of the GNU Project.
- scikit-learn: An open-source machine learning library for Python.
- Torch: An open-source deep learning library and scientific computing framework for the Lua programming language, with wide support for machine learning algorithms.
- UIMA (Unstructured Information Management Architecture): A component framework for analyzing unstructured content such as text, audio, and video, originally developed by IBM.
- Weka: A suite of machine learning software applications, written in Java.