Data Cleaning in Data Mining

YASH PAL, 31 July 2024

Data cleaning means removing inaccuracies, noise, and other discrepancies from the data. Real-world data tends to be noisy, inconsistent, and incomplete. Data cleaning is also called data cleansing. Data cleaning routines attempt to fill in missing values, smooth out noise, and correct inconsistencies in the data.

Data cleaning deals with three basic problems:

Missing values
Noisy data
Inconsistent data

Missing value

Here we consider a situation where many tuples have no recorded value for several attributes. The question is how to fill in the missing values for such an attribute. The following strategies can be used:

Ignore the tuple: The tuple that has missing values is simply dropped. This is usually done when the class label is missing. The method is not very effective unless the tuple contains several attributes with missing values, and it is especially poor when the percentage of missing values per attribute varies considerably.

Fill in the missing values manually: This approach is time-consuming and is not feasible for data sets that have many missing values.

Use a global constant to fill in the missing value: All missing values are replaced by the same constant, for instance “unknown” or “−∞”. This method is simple, but it is not foolproof, because a mining program may treat the constant as a meaningful value shared by all those tuples.

Use the attribute mean to fill in the missing value: The mean value of the attribute is used to replace the missing values of that attribute.

Use the attribute mean for all samples belonging to the same class as the given tuple: The mean is computed only over the tuples that share the class of the tuple with the missing value.

Use the most probable value to fill in the missing value: The most probable value can be determined using regression, Bayesian formulae, or a decision tree.
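The constant-based and mean-based strategies listed above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the records, the attribute names `income` and `cls`, and the helper function names are all hypothetical and do not come from any particular library.

```python
from statistics import mean

# Hypothetical toy records; None marks a missing value.
records = [
    {"cls": "low", "income": 20.0},
    {"cls": "low", "income": None},
    {"cls": "low", "income": 30.0},
    {"cls": "high", "income": 80.0},
    {"cls": "high", "income": None},
    {"cls": "high", "income": 100.0},
]


def fill_constant(rows, attr, constant="unknown"):
    """Replace every missing value of `attr` with one global constant."""
    return [{**r, attr: constant if r[attr] is None else r[attr]} for r in rows]


def fill_mean(rows, attr):
    """Replace missing values of `attr` with the mean of the observed values."""
    observed = [r[attr] for r in rows if r[attr] is not None]
    attr_mean = mean(observed)
    return [{**r, attr: attr_mean if r[attr] is None else r[attr]} for r in rows]


def fill_class_mean(rows, attr, class_attr):
    """Replace missing values with the mean of `attr` within the same class.

    Assumes every class has at least one observed value for `attr`.
    """
    observed_by_class = {}
    for r in rows:
        if r[attr] is not None:
            observed_by_class.setdefault(r[class_attr], []).append(r[attr])
    class_means = {c: mean(vals) for c, vals in observed_by_class.items()}
    return [
        {**r, attr: class_means[r[class_attr]] if r[attr] is None else r[attr]}
        for r in rows
    ]


by_constant = fill_constant(records, "income")
by_mean = fill_mean(records, "income")
by_class_mean = fill_class_mean(records, "income", "cls")
```

With this toy data, the overall mean fills both gaps with 57.5, while the class-wise mean fills the "low" tuple with 25.0 and the "high" tuple with 90.0, which shows why the class-conditional variant usually gives more plausible values.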
As an example of the most probable value strategy, a decision tree can be constructed to predict a missing value for an attribute such as age or sales from the other customer attributes in the data set.

Note: The filled-in value may not be correct. In some cases, a missing value does not indicate an error in the data.

Out of the above six strategies, which is the most feasible and most widely used? Using the most probable value to fill in the missing value is highly popular and quite feasible, because it uses the most information from the available data to predict the missing value.

Noisy Data

How exactly can we define noise? Noise is a random error or variance in a measured variable. It can be thought of as an unwanted signal that has been added to the required data, and so it is a kind of discrepancy in the data. Noise can be removed from the data using data smoothing methods. Some smoothing techniques are:

Binning methods
Clustering
Combined computer and human inspection
Regression

Binning Methods

A binning method smooths sorted data by consulting the values around each value. The sorted values are distributed into a number of bins, or buckets. Because it consults only nearby values, binning is a local smoothing technique. Each value is then replaced by a value derived from its bin.

There are three types of binning techniques:

Smoothing by bin means: Each value in a bin is replaced by the mean value of the bin. For example, the mean of the values 4, 8, and 15 in Bin 1 is 9. Therefore, each original value in this bin is replaced by the value 9.

Smoothing by bin medians: Each value in a bin is replaced by the bin median. If the bin contains an odd number n of sorted values, the median is the ((n+1)/2)-th term.
In short, for odd n: median = the ((n+1)/2)-th term. If the number of values n in the bin is even, the median is the average of the (n/2)-th and ((n/2)+1)-th terms: median = [(n/2)-th term + ((n/2)+1)-th term] / 2.

Smoothing by bin boundaries: The minimum and maximum values in a given bin are identified as the bin boundaries, and each value in the bin is then replaced by the closest boundary value.

Note: The larger the bin width, the greater the effect of the smoothing.

Clustering

A cluster, as the name signifies, is a group of similar data values. Outliers may be detected by clustering: similar values are organised into clusters, and a value that falls outside every cluster is termed an outlier.

The accompanying figure shows three data clusters. Each cluster centroid is marked with a “+”, representing the average point in space for that cluster. Outliers may be detected as the values falling outside the set of clusters.

Regression

Regression smooths the data by fitting it to a mathematical function. Linear regression involves finding the best line to fit two variables such that one variable can be used to predict the other.

Combined Computer and Human Inspection

Human inspection mode: In the human inspection mode, the computer presents the human with a series of printed circuit board (PCB) images. As each image is displayed, the subject visually searches for faults. Once the search is completed, the subject classifies the image based on the severity of the fault(s). Once the image is classified, the inspector can view the next image. The system is designed to operate in two separate modes: with immediate feedback, as in training, and without feedback, as in practice.
Computer inspection mode: In the purely automated inspection mode, the operation of the computer parallels that of the human system, with the exception that no feedback is provided. The role of the human is supervisory in nature.

Hybrid inspection mode: In the hybrid inspection mode, both the search and decision-making functions are designed so that they can be performed cooperatively. The alternatives with parallel human and machine activities enable dynamic allocation of search and decision-making functions.

Studies conducted using the simulator have illustrated its potential for training and have also examined the interaction between human inspectors and computers, specifically the effect of system response bias on inspection quality performance. Search strategy training has also been tested on the PCB simulator using eye-tracking information from an expert inspector. The search strategy information was provided to the trainees in three modes, including a static display obtained from the eye-tracking information. The results showed that the training was effective in improving performance and in promoting the adoption of a systematic search strategy.
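Returning to the binning methods described under Noisy Data, the three smoothing variants can be sketched as follows. This is a minimal illustration, not a definitive implementation: it assumes equal-depth bins (the same number of values in each bin), the function names are hypothetical, and the data values beyond 4, 8, and 15 are made up for the example.

```python
from statistics import mean, median


def equal_depth_bins(values, depth):
    """Sort the values and split them into consecutive bins of `depth` items."""
    ordered = sorted(values)
    return [ordered[i:i + depth] for i in range(0, len(ordered), depth)]


def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean."""
    return [[mean(b)] * len(b) for b in bins]


def smooth_by_medians(bins):
    """Replace every value in a bin by the bin median."""
    return [[median(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace each value by the nearer of the bin's minimum and maximum.

    Ties are resolved in favour of the minimum (an arbitrary choice).
    """
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # bins are sorted, so these are min and max
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


# Hypothetical sorted data; the first bin matches the 4, 8, 15 example.
data = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_depth_bins(data, 3)

means = smooth_by_means(bins)
medians = smooth_by_medians(bins)
boundaries = smooth_by_boundaries(bins)
```

For this data the bins are [4, 8, 15], [21, 21, 24], and [25, 28, 34]. Smoothing by means gives 9, 22, and 29 for the three bins, smoothing by medians gives 8, 21, and 28, and smoothing by boundaries turns the bins into [4, 4, 15], [21, 21, 24], and [25, 25, 34].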