Forms of data preprocessing: Data in a warehouse can be noisy, incomplete, and inconsistent. Noisy data can even make it difficult to record the correct values and to retrieve them later. Consider the figure below:
[Figure: forms of data preprocessing. The data transformation panel shows the values -2, 32, 100, 59, 48 being converted to -0.02, 0.32, 1.00, 0.59, 0.48.]
The figure above depicts the forms of data preprocessing. They are as follows:
Forms of Data Preprocessing
- Data Cleaning
- Data Integration
- Data Transformation
- Data Reduction
Data Cleaning
As the name suggests, data cleaning removes noise, errors, and other discrepancies, such as missing values, from the data. Its goal is not only to locate these discrepancies but also to resolve them. This matters because dirty data can confuse the mining procedure and produce unreliable output.
Therefore, to make the output reliable and consistent, we need to run the data through data cleaning routines before mining it.
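As an illustration, one common cleaning routine fills in missing values with the mean of the observed values. The sketch below assumes missing entries are represented as `None`. The function name `fill_missing_with_mean` and the sample ages are made up for this example, and mean imputation is only one of several possible strategies.

```python
from statistics import mean

def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed (non-None) values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to compute a mean from")
    fill = mean(observed)
    return [fill if v is None else v for v in values]

# Hypothetical attribute with two missing readings.
ages = [25, None, 31, 40, None]
cleaned = fill_missing_with_mean(ages)
print(cleaned)  # [25, 32, 31, 40, 32]
```

Whether the mean, the median, or a constant is the right fill value depends on the attribute, so treat the choice here as an assumption rather than a recommendation.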
Data Integration
When an analysis has to include data from multiple sources, we need to combine multiple databases, data cubes, or files. This is called data integration. Data cleaning and data integration are typically used together when preparing data for a data warehouse, and a further round of data cleaning can then be used to remove the anomalies and redundancies that may result from integration.
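A minimal sketch of integration follows, assuming two hypothetical sources (`crm` and `billing`) that describe the same customers and share an `id` field. Records with the same key are merged into one, which removes the redundancy, and a later source only fills in fields that are still missing. The function name `integrate` and all sample data are illustrative.

```python
def integrate(primary, secondary, key):
    """Merge records from two sources into one record per key value."""
    merged = {}
    for record in primary + secondary:
        combined = merged.setdefault(record[key], {})
        for field, value in record.items():
            # Keep the first non-missing value seen for each field.
            if combined.get(field) is None:
                combined[field] = value
    return list(merged.values())

# Hypothetical sources describing overlapping customers.
crm = [
    {"id": 1, "name": "Asha", "city": None},
    {"id": 2, "name": "Ben", "city": "Pune"},
]
billing = [
    {"id": 1, "city": "Delhi"},
    {"id": 3, "name": "Chen", "city": "Agra"},
]

customers = integrate(crm, billing, "id")
for customer in customers:
    print(customer)
```

Real integration also has to deal with conflicting values and with attributes that are named differently across sources. This sketch ignores both and simply trusts the first source that supplies a value.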
Data Transformation
Data transformation operations are additional preprocessing activities that contribute to the success of the mining process. Examples of data transformation include normalization and aggregation.
As the name suggests, transformation converts data from one form into another. Data transformation can involve the following processes:
- Normalization
- Smoothing
- Aggregation
- Generalisation
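To make normalization concrete, the sketch below rescales the sample values shown for data transformation in the figure (-2, 32, 100, 59, 48) by dividing each one by the largest absolute value. This max-abs scaling is an assumption chosen because it reproduces the scaled values 0.02 through 1.00. Other schemes such as min-max or z-score normalization would give different numbers. The function name `scale_by_max_abs` is illustrative.

```python
def scale_by_max_abs(values):
    """Divide every value by the largest absolute value, giving results in [-1, 1]."""
    largest = max(abs(v) for v in values)
    if largest == 0:
        # All values are zero, so there is nothing to scale.
        return [0.0 for _ in values]
    return [v / largest for v in values]

raw = [-2, 32, 100, 59, 48]
scaled = scale_by_max_abs(raw)
print([round(v, 2) for v in scaled])  # [-0.02, 0.32, 1.0, 0.59, 0.48]
```

Scaling attributes to a common range keeps an attribute with large raw values from dominating one with small raw values in distance-based mining methods.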
Data Reduction
As the name implies, data reduction produces a reduced representation of the data set that is much smaller in volume yet yields the same (or almost the same) analytical results. Several strategies can be adopted for data reduction:
- Data aggregation.
- Attribute subset selection.
- Dimensionality reduction.
- Numerosity reduction.
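As a sketch of the first strategy, data aggregation, the example below collapses quarterly sales records into one total per year, so the reduced data set has fewer rows but still answers questions about annual sales. The function name `aggregate_by_year` and the sales figures are hypothetical.

```python
from collections import defaultdict

def aggregate_by_year(quarterly_sales):
    """Sum (year, quarter, amount) records into one total per year."""
    totals = defaultdict(int)
    for year, _quarter, amount in quarterly_sales:
        totals[year] += amount
    return dict(totals)

# Hypothetical quarterly sales records.
sales = [
    (2022, "Q1", 100), (2022, "Q2", 150), (2022, "Q3", 120), (2022, "Q4", 130),
    (2023, "Q1", 200), (2023, "Q2", 180),
]

annual = aggregate_by_year(sales)
print(annual)  # {2022: 500, 2023: 380}
```

The six input records shrink to two. The quarterly detail is lost, so aggregation of this kind only suits analyses that work at the yearly level.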
Note: Data can also be reduced by generalizing it using concept hierarchies.
An example of data reduction by generalisation is replacing a low-level concept, such as the city of a customer's location, with a higher-level concept such as region, province, or state.
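The generalisation described above can be sketched as a lookup from a lower level to a higher level. The mapping `CITY_TO_STATE`, the function name `generalise`, and the customer locations are all made up for illustration. A real hierarchy would usually come from the warehouse's dimension tables.

```python
from collections import Counter

# Hypothetical mapping from the lower level (city) to the higher level (state).
CITY_TO_STATE = {
    "Mumbai": "Maharashtra",
    "Pune": "Maharashtra",
    "Jaipur": "Rajasthan",
}

def generalise(cities, hierarchy):
    """Replace each city with its higher-level concept, or 'Unknown' if unmapped."""
    return [hierarchy.get(city, "Unknown") for city in cities]

customer_cities = ["Mumbai", "Pune", "Jaipur", "Pune"]
customer_states = generalise(customer_cities, CITY_TO_STATE)
print(customer_states)
print(Counter(customer_states))
```

Three distinct cities become two distinct states, so the attribute has fewer distinct values to analyse after generalisation.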
A concept hierarchy organizes concepts into varying levels of abstraction.
Data preprocessing techniques are used to improve the quality of the data and, with it, the accuracy and efficiency of the mining process.
Data preprocessing is therefore an important step in the knowledge discovery process, since quality decisions must be based on quality data. Detecting data anomalies, rectifying them early, and reducing the data to be analyzed can lead to huge pay-offs for decision making.