Functionalities of Data Mining

YASH PAL, 28 January 2024, 28 May 2024

We have observed different types of databases and information repositories on which data mining can be performed. Let us now examine the kinds of data patterns that can be mined.

Figure: Data mining tasks

Data mining tasks are divided into two categories, as shown in the figure above:

Descriptive Mining Tasks: They characterize the general properties of the data in the database.
Predictive Mining Tasks: They perform inference on the current data to make predictions.

Sometimes users have no idea what kinds of data patterns could be of interest. To address this, a data mining system should be able to mine multiple kinds of data patterns in order to cater to the different needs and requirements of users. The system should also offer a way of searching for data patterns, so that a user who is short on time can simply enter a pattern hint in the search field and retrieve the desired pattern.

Consider the data mining patterns given below:

Figure: Data Mining Patterns

Let us discuss the patterns one by one.

Concept/Class Descriptions

In object-oriented programming (OOP), we have studied classes and objects. Recalling those ideas helps us elaborate on the idea of a concept/class description.

Definition of object: An object is an identifiable entity that has some meaning.
Definition of class: A class is a collection of objects (entities).

Data can be associated with classes or concepts. For example, consider a company named Enterprise. In the “Enterprise” store, the classes of items for sale include computers and printers, and the concepts of customers include big spenders and budget spenders. It is essential to describe individual classes and concepts in a summarized and precise form.
As the name suggests, a concept/class description is a description of a class or concept. How can we derive a class/concept description?

Figure: Class/Concept description

Let us discuss this hierarchy.

Data Characterization

Data characterization is the summarization of the general characteristics or features of a target class of data. The data corresponding to the user-specified class is typically collected using a database (SQL) query.

As an example, consider a company that develops software products every year. Suppose we need to study the characteristics of the software products whose sales increased by 10% in the last year. How can we retrieve that data? We retrieve the historical data by running a database (SQL) query. The results can then be summarized and presented in several forms, such as pie charts, bar charts, curves, data cubes, and crosstabs.

Now we move on to data discrimination.

Data Discrimination

Data discrimination is the comparison of the general features of the target class data objects with the general features of objects from one or a set of contrasting classes. The target and contrasting classes are specified by the user, and the corresponding data objects are, here too, retrieved through database (SQL) queries.

To explain data discrimination, we take an example similar to the one discussed before. In the “Enterprise” company, we need to compare the general features of the software products whose sales increased by 10% in the last year with those of the products whose sales decreased by 30% during the same period. Several data discrimination methods can be used in this kind of scenario.

Mining Frequent Patterns, Associations, and Correlations

Frequent patterns, as the name suggests, are patterns that occur frequently in the data.
Examples include itemsets, subsequences, and substructures. A frequent itemset is a set of items that frequently appear together in a data set. A substructure refers to a structural form (such as a graph, tree, or lattice) that may, depending on the requirements, be combined with itemsets or subsequences.

Association Analysis

Association analysis is the discovery of association rules showing attribute-value conditions that occur frequently together in a given set of data. It is used for market basket or transaction data analysis.

Consider an example of such a rule. Take the transaction database of the “Enterprise” company, which holds the list of buyers and the products they bought:

buys(x, "computer") => buys(x, "software")

Here, x is a variable representing a customer.

Association rules are always of the form $X \Rightarrow Y$, that is, $A_1 \wedge \dots \wedge A_m \rightarrow B_1 \wedge \dots \wedge B_n$, where $A_i$ (for $i \in \{1, \dots, m\}$) and $B_j$ (for $j \in \{1, \dots, n\}$) are attribute-value pairs.

How do we interpret $X \Rightarrow Y$? The association rule $X \Rightarrow Y$ is interpreted as follows: database tuples that satisfy the conditions in $X$ are also likely to satisfy the conditions in $Y$.

Note: The rule above involves a single predicate (buys), so it is a single-dimensional association rule. A rule that involves more than one attribute or predicate (e.g., age, income, and buys) is an association between several attributes. In the terminology of multidimensional databases, each attribute is referred to as a dimension, so such a rule is called a multidimensional association rule.

Classification and Prediction

Let us recall the general meaning of classification and prediction.

Classification: In general usage, classification means dividing something into its constituent categories. In data mining, classification is the process of finding a set of models (or functions) that describe and distinguish data classes or concepts.

Now the question is: why do we need to classify the data?
The purpose of classification is to use the derived model to predict the class of objects whose class label is unknown. Let us elaborate on classification and prediction step by step.

The derived model is based on the analysis of a set of training data, that is, data objects whose class label is known.

How do we represent the derived model? It can be represented in several ways, such as classification (if-then) rules, decision trees, mathematical formulae, or neural networks.

A decision tree is a flowchart-like tree structure. Each node denotes a test on an attribute value, each branch represents an outcome of the test, and the tree leaves represent classes or class distributions.

A neural network is a collection of neuron-like processing units with weighted connections between the units. It can also be described as a collection of linear threshold units that can be trained to distinguish objects of different classes.

Note: A decision tree can easily be converted into classification rules.

Prediction: Prediction is a technique used to predict missing or unavailable values. It has two types:

Numeric prediction
Class label prediction

Although the term prediction can refer to both class label prediction and numeric prediction, it is used primarily for numeric prediction.

Consider the figures below:

Figure: Decision Tree
Figure: Neural network

Cluster Analysis

A cluster, as the name suggests, is a group that holds a number of objects. Unlike classification and prediction, cluster analysis analyzes data objects without consulting a known class label.
Consider the figure below:

Figure: Clustering

The objects are clustered or grouped based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity. That is, clusters are formed so that objects within a cluster are highly similar to one another but dissimilar to objects in other clusters.

Outlier Analysis

Outliers are data objects that do not comply with the general behavior of the data. Most data mining methods discard outliers as noise or exceptions. Outlier mining is the technique of analyzing outlier data. Outliers may be detected using statistical tests that assume a distribution or probability model for the data, or using distance measures, where objects that are at a substantial distance from any cluster are considered outliers.

Evolution and Deviation Analysis

Data evolution analysis describes and models regularities or trends for objects whose behavior changes over time. Although this may include characterization, discrimination, association, classification, or clustering of time-related data, distinct features of such an analysis include time-series data analysis, sequence or periodicity pattern matching, and similarity-based data analysis.

In the analysis of time-related data, it is often desirable not only to model the general evolutionary trend of the data but also to identify the deviations that occur over time. Deviations are differences between measured values and corresponding references, such as previous values or normative values. A data mining system performing deviation analysis may, upon the detection of a set of deviations, describe the characteristics of the deviations, try to explain the reasons behind them, and suggest actions to bring the deviated values back to their expected values.
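The idea of a deviation as the difference between a measured value and a reference can be illustrated with a minimal Python sketch. It assumes the reference is simply the previous value in a series. The function name `find_deviations`, the `tolerance` threshold, and the sample sales figures are all hypothetical, chosen only for illustration; a real deviation analysis would normally use a proper trend or normative model.

```python
def find_deviations(values, tolerance):
    """Return (index, deviation) pairs where a value differs from the
    previous value by more than `tolerance` (an assumed threshold)."""
    flagged = []
    for i in range(1, len(values)):
        # Deviation = measured value minus its reference (the previous value).
        deviation = values[i] - values[i - 1]
        if abs(deviation) > tolerance:
            flagged.append((i, deviation))
    return flagged


# Hypothetical monthly sales figures, for illustration only.
monthly_sales = [100, 102, 101, 140, 103, 104]
print(find_deviations(monthly_sales, tolerance=10))
# prints [(3, 39), (4, -37)]
```

The two flagged positions correspond to the jump to 140 and the return to the usual level. Describing and explaining such deviations, and suggesting corrective actions, are the later steps that this sketch does not attempt.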