The abundance of data, coupled with the need for powerful data analysis tools, has been described as a data-rich but information-poor situation. Data mining refers to extracting or “mining” knowledge from large amounts of data. The term is actually a misnomer: the mining of gold from rocks or sand is referred to as gold mining rather than rock or sand mining.
Thus, “data mining” should have been more appropriately named “knowledge mining from data”, which is unfortunately somewhat long. “Knowledge mining”, a shorter term, may not reflect the emphasis on mining from large amounts of data. Nevertheless, mining is a vivid term characterizing the process that finds a small set of precious nuggets.
Many people treat data mining as a synonym for another popularly used term, “Knowledge Discovery from Data” (KDD). Knowledge discovery is a process that consists of an iterative sequence of the following steps:
- Data Cleaning: To remove noise, irrelevant data, or inconsistent data.
- Data Integration: Where multiple data sources may be combined.
- Data Selection: Where data relevant to the analysis task are retrieved from the database.
- Data Transformation: Where data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations.
- Data Mining: An essential process where intelligent methods are applied to extract data patterns.
- Pattern Evaluation: To identify the truly interesting patterns representing knowledge, based on some interestingness measures.
- Knowledge Presentation: Where visualization and knowledge representation techniques are used to present the mined knowledge to the users.
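The steps above can be sketched as a small pipeline. This is only an illustrative sketch under stated assumptions, not a definitive implementation: the two toy data sources, the function names, and the choice of co-occurrence counting with a support threshold as the mining and evaluation steps are all hypothetical and do not come from the text.

```python
from collections import Counter
from itertools import combinations

# Two hypothetical data sources of purchase records (toy data, assumed).
store_a = [
    {"customer": "c1", "items": ["bread", "milk"]},
    {"customer": "c2", "items": ["bread", "butter", "milk"]},
    {"customer": "c3", "items": None},  # noisy record with missing items
]
store_b = [
    {"customer": "c4", "items": ["bread", "milk", "eggs"]},
    {"customer": "c5", "items": ["butter", "eggs"]},
]

def clean(records):
    """Data cleaning: drop records whose item list is missing or empty."""
    return [r for r in records if r["items"]]

def integrate(*sources):
    """Data integration: combine several sources into one list of records."""
    return [r for source in sources for r in source]

def select(records):
    """Data selection: keep only the attribute relevant to the analysis task."""
    return [r["items"] for r in records]

def transform(baskets):
    """Data transformation: normalize each basket to a sorted tuple of unique items."""
    return [tuple(sorted(set(b))) for b in baskets]

def mine(baskets):
    """Data mining: count how often each pair of items occurs together."""
    counts = Counter()
    for basket in baskets:
        counts.update(combinations(basket, 2))
    return counts

def evaluate(pair_counts, n_baskets, min_support=0.5):
    """Pattern evaluation: keep pairs whose support meets the threshold."""
    return {
        pair: count / n_baskets
        for pair, count in pair_counts.items()
        if count / n_baskets >= min_support
    }

def present(patterns):
    """Knowledge presentation: render the patterns as readable lines."""
    return [f"{a} + {b}: support {s:.2f}" for (a, b), s in sorted(patterns.items())]

baskets = transform(select(integrate(clean(store_a), clean(store_b))))
patterns = evaluate(mine(baskets), len(baskets))
report = present(patterns)
for line in report:
    print(line)
```

In a real system each step would be far more involved, and the sequence is usually repeated, with the results of one pass guiding the cleaning and selection of the next.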
In addition, huge volumes of data are sometimes accumulated beyond databases and data warehouses.
Typical examples include the World Wide Web and data streams, where the data flow in and out like streams, as in applications such as video conferencing, telecommunication, and sensor networks.
This raises the question: why do we require data warehouses?
The answer is that the effective and efficient analysis of data in such different forms becomes a challenging task.
So what is data mining?
“Data mining refers to extracting or mining knowledge from large amounts of data.” Mining is a vivid term characterizing the process that finds a small set of precious nuggets from a great deal of raw material.
Note: Many other terms carry a similar or slightly different meaning to data mining, such as knowledge mining from data, knowledge extraction, data/pattern analysis, data archaeology, and data dredging.
Strictly speaking, data mining is a step in the knowledge discovery process. However, the term is becoming more popular than the longer “knowledge discovery from data”; therefore, we use the term data mining.
Based on this view, the architecture of a typical data mining system may have the following components:
- Database, Data warehouse, World Wide Web, or Other Information Repository: This is one or a set of databases, data warehouses, spreadsheets, or other kinds of information repositories. Data cleaning and data integration techniques may be performed on the data.
- Database or Data Warehouse Server: The database or data warehouse server is responsible for fetching the relevant data, based on the user’s data mining request.
- Knowledge Base: This is domain knowledge that is used to guide the search or evaluate the interestingness of resulting patterns. Such knowledge can include concept hierarchies, used to organize attributes or attribute values into different levels of abstraction. Knowledge such as user beliefs, which can be used to assess a pattern’s interestingness based on its unexpectedness, may also be included. Other examples of domain knowledge are additional interestingness constraints or thresholds, and metadata (e.g., describing data from multiple heterogeneous sources).
- Data Mining Engine: This is essential to the data mining system and ideally consists of a set of functional modules for tasks such as characterization, association analysis, classification, evolution and deviation analysis, cluster analysis, and prediction.
- Pattern Evaluation Module: This component typically employs interestingness measures and interacts with the data mining modules to focus the search towards interesting patterns. It may use interestingness thresholds to filter out discovered patterns. Alternatively, the pattern evaluation module may be integrated with the mining module, depending on the implementation of the data mining method used. For efficient data mining, it is highly recommended to push the evaluation of pattern interestingness as deep as possible into the mining process, so as to confine the search to only the interesting patterns.
- Graphical User Interface: This module communicates between users and the data mining system, allowing the user to interact with the system by specifying a data mining query or task, providing information to help focus the search, and performing exploratory data mining based on the intermediate data mining results. In addition, this component allows the user to browse database and data warehouse schemas or data structures, evaluate mined patterns, and visualize the patterns in different forms.
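The recommendation to push interestingness evaluation deep into the mining process can be illustrated with a small sketch. It assumes support (the fraction of transactions containing an itemset) as the interestingness measure and uses a simplified level-wise search in the style of Apriori; the transactions, function names, and thresholds are hypothetical and chosen only for illustration. The threshold is applied at every level, so larger candidates are built only from itemsets that have already passed it.

```python
from itertools import combinations

# Hypothetical transactions, standing in for data fetched by the database server.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "eggs", "milk"},
    {"butter", "eggs"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def frequent_itemsets(transactions, min_support):
    """Level-wise search that applies the support threshold at every level.

    Returns the itemsets that meet the threshold with their support, plus the
    number of multi-item candidates whose support had to be computed.
    """
    items = sorted(set().union(*transactions))
    current = [
        frozenset([item])
        for item in items
        if support(frozenset([item]), transactions) >= min_support
    ]
    result = {}
    evaluated = 0
    size = 1
    while current:
        for itemset in current:
            result[itemset] = support(itemset, transactions)
        size += 1
        # Candidates are built only from itemsets that already passed the threshold.
        candidates = {a | b for a, b in combinations(current, 2) if len(a | b) == size}
        evaluated += len(candidates)
        current = [c for c in candidates if support(c, transactions) >= min_support]
    return result, evaluated

loose, loose_evaluated = frequent_itemsets(transactions, min_support=0.5)
strict, strict_evaluated = frequent_itemsets(transactions, min_support=0.75)
print(loose_evaluated, strict_evaluated)
```

With the stricter threshold, fewer single items survive the first level, so fewer pairs are ever generated and counted. Filtering only after all patterns had been mined would give the same final patterns but would not save any of that work.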
By performing data mining, interesting knowledge, regularities, or high-level information can be extracted from databases and viewed from different angles.
The discovered knowledge can be applied to decision making, process control, information management, and query processing. Therefore, data mining is considered one of the most important frontiers in database and information systems and one of the most promising interdisciplinary developments in information technology.