DATA MINING AND DATA WAREHOUSING
B.TECH - VI SEMESTER
R17 Regulation
Dr. Anupriya Koneru
Reference: Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, 2/e, Elsevier, 2006.
UNIT-II
Need for Pre-processing the Data (CO: Identify the need of pre-processing)
Incomplete data: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data. e.g., occupation=“ ”.
Noisy data: containing errors or outliers. e.g., Salary=“-10”
Inconsistent data: containing discrepancies in codes or names. e.g., Age=“42” Birthday=“03/07/1997”
Major Tasks in Data Preprocessing: data cleaning, data integration, data transformation, data reduction, and data discretization.
Descriptive Data Summarization (CO: Understand the statistical analysis representation of data)
Measuring the Central Tendency
To analyze data, we need to use the measure of central tendency that is most appropriate for the data at hand: the mean, the median, or the mode.
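As a quick illustration (the salary values below are made up), the three measures can be computed with Python's statistics module:

import statistics

salaries = [30, 31, 31, 32, 33, 35, 70]      # hypothetical, right-skewed data

print(statistics.mean(salaries))    # ≈ 37.43, pulled upward by the outlier 70
print(statistics.median(salaries))  # 32, robust to the outlier
print(statistics.mode(salaries))    # 31, the most frequent value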
Measures of Dispersion
Boxplot (based on the five-number summary):
(i) Draw a box to represent the middle 50% of the observations of the data set.
(ii) Show the median by drawing a vertical line within the box.
(iii) Draw the lines (called whiskers) from the lower and upper ends of the box to the minimum and maximum values of the data set respectively, as shown in the following diagram.
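A minimal matplotlib sketch of this construction, reusing the small price list that appears in the binning example later in this unit:

import matplotlib.pyplot as plt

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

# whis=(0, 100) stretches the whiskers to the minimum and maximum values,
# matching step (iii); the box covers Q1..Q3 and the inner line is the median
plt.boxplot(prices, vert=False, whis=(0, 100))
plt.savefig("boxplot.png")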
Graphic Displays of Basic Descriptive Data Summaries
Histogram
Scatter Plot
Loess curve
Box plot
Quantile plot
Quantile-Quantile plots (Q-Q plot)
The steps in constructing a Q-Q plot are: sort each sample, compute the same set of quantiles (e.g., percentiles) for both samples, and plot the quantiles of one sample against the corresponding quantiles of the other, usually together with a 45-degree reference line.
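A minimal numpy/matplotlib sketch of that construction, with two made-up samples (e.g., unit prices recorded at two branches):

import numpy as np
import matplotlib.pyplot as plt

branch1 = np.array([45, 47, 52, 55, 60, 63, 70, 72, 80, 95], dtype=float)
branch2 = np.array([40, 44, 49, 53, 58, 61, 66, 74, 82, 90], dtype=float)

q = np.linspace(0.01, 0.99, 50)        # a common set of quantile levels
x = np.quantile(branch1, q)            # quantiles of the first sample
y = np.quantile(branch2, q)            # quantiles of the second sample

plt.plot(x, y, "o")
plt.plot([x.min(), x.max()], [x.min(), x.max()])   # 45-degree reference line
plt.xlabel("branch1 quantiles")
plt.ylabel("branch2 quantiles")
plt.savefig("qq_plot.png")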
Data Cleaning (CO: Apply different methods of Data Cleaning)
Data cleaning routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data.
Various methods exist for handling these problems:
Missing Values
Noisy data
Missing Values
The various methods for handling the problem of missing values in data tuples include:
(a) Ignoring the tuple: this is usually done when the class label is missing; it is not very effective unless the tuple contains several attributes with missing values.
(b) Manually filling in the missing value: this is time-consuming and may not be feasible for a large data set with many missing values, e.g., the first row below.
I1 | I2 | I3 | Output |
20 | | | |
25 | 200 | 100 | 1 |
20 | 400 | 80 | 0 |
(c) Using a global constant to fill in the missing value:
I1 | I2 | I3 | Output |
20 | 400 | NA | 0 |
25 | NA | 100 | 1 |
20 | 400 | 80 | 0 |
(d) Using the attribute mean for quantitative (numeric) values or attribute mode for categorical (nominal) values, for all samples belonging to the same class as the given tuple: For example, if classifying customers according to credit risk, replace the missing value with the average income value for customers in the same credit risk category as that of the given tuple.
(e) Using the most probable value to fill in the missing value: This may be determined with regression, inference-based tools using Bayesian formalism, or decision tree induction.
Numeric attributes (NA filled with the attribute mean):
I1 | I2 | I3 | Output
20 | 400 | NA | 0
25 | NA | 100 | 1
20 | 400 | 80 | 0
30 | 200 | 99 | 1

Categorical attributes (NA filled with the attribute mode):
I1 | I2 | I3 | Output
20 | good | NA | 0
25 | NA | good | 1
20 | bad | bad | 0
30 | good | good | 1
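A small pandas sketch of these fills on the two tables above (the global-constant label "Unknown" and the class-conditional variant are illustrative assumptions):

import pandas as pd

num = pd.DataFrame({"I1": [20, 25, 20, 30],
                    "I2": [400, None, 400, 200],
                    "I3": [None, 100, 80, 99],
                    "Output": [0, 1, 0, 1]})
cat = pd.DataFrame({"I1": [20, 25, 20, 30],
                    "I2": ["good", None, "bad", "good"],
                    "I3": [None, "good", "bad", "good"],
                    "Output": [0, 1, 0, 1]})

# (c) global constant: every missing value gets the same label
filled_const = cat.fillna("Unknown")

# attribute mean for numeric columns, attribute mode for categorical columns
filled_mean = num.fillna(num.mean(numeric_only=True))
filled_mode = cat.fillna(cat.mode().iloc[0])

# (d) class-conditional variant: fill with the mean of tuples in the same Output class
filled_class = num.copy()
for col in ["I1", "I2", "I3"]:
    filled_class[col] = num.groupby("Output")[col].transform(lambda s: s.fillna(s.mean()))

print(filled_mean)
print(filled_mode)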
Data Cleaning: Noisy Data
Several data smoothing techniques:
Binning methods: in this technique, the sorted data values are distributed into a number of "buckets", or bins, and each value is smoothed by consulting its neighborhood, i.e., the values around it (for example, by replacing it with the bin mean, the bin median, or the closest bin boundary).
Example: Binning Methods for Data Smoothing
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
Partition into (equi-depth) bins, with a depth of 4 since each bin contains four values:
Bin 1: 4, 8, 9, 15
Bin 2: 21, 21, 24, 25
Bin 3: 26, 28, 29, 34
Smoothing by bin means: -
Bin 1: 9, 9, 9, 9
Bin 2: 23, 23, 23, 23
Bin 3: 29, 29, 29, 29
Smoothing by bin boundaries: -
Bin 1: 4, 4, 4, 15
Bin 2: 21, 21, 25, 25
Bin 3: 26, 26, 26, 34
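A short sketch that reproduces the smoothing above (bin means are rounded to whole dollars to match the slide):

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]   # already sorted
depth = 4
bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

# smoothing by bin means: every value becomes its bin's (rounded) mean
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# smoothing by bin boundaries: every value moves to the nearer of its bin's min or max
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

print(by_means)    # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)   # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]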
Clustering: outliers may be detected by clustering, where similar values are organized into groups, or "clusters". Values that fall outside of the set of clusters may be considered outliers and smoothed or removed.
Regression: data can be smoothed by fitting the data to a function. Linear regression, for example, finds the "best" line to fit two attributes, so that one attribute can be used to predict the other.
Data Cleaning as a Process
“So, how can we proceed with discrepancy detection?”
Data Integration and Transformation (CO: Apply different data integration and transformation techniques): Data Integration
There are a number of issues to consider during data integration:
1. Schema integration and object matching: how can equivalent real-world entities from multiple data sources be matched up? This is referred to as the entity identification problem; for example, customer_id in one database and cust_number in another may refer to the same attribute.
2. Redundancy is another important issue. An attribute may be redundant if it can be "derived" from another attribute or set of attributes. Some redundancies can be detected by correlation analysis. For numerical attributes, the correlation coefficient rA,B always lies in the range −1 ≤ rA,B ≤ +1: values close to +1 or −1 mean that A and B are strongly (positively or negatively) correlated, so one of them may be removed as redundant, while a value near 0 indicates no linear correlation.
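A tiny numpy check of this idea on two made-up numeric attributes (np.corrcoef returns the Pearson correlation coefficient rA,B):

import numpy as np

A = np.array([2, 4, 6, 8, 10], dtype=float)
B = np.array([1.1, 1.9, 3.2, 3.9, 5.1])      # roughly B ≈ A/2

r = np.corrcoef(A, B)[0, 1]
print(r)    # close to +1, so B carries little information beyond A (redundant)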
Correlation analysis of categorical attributes using χ²:
Example: Suppose that a group of 1,500 people was surveyed. The gender of each person was noted, and each person was polled as to whether his or her preferred type of reading material was fiction or nonfiction. Thus we have two attributes, gender and preferred_reading. The observed frequency (count) of each possible joint event is summarized in the contingency table below, where the numbers in parentheses are the expected frequencies:

        | fiction   | nonfiction | Total
male    | 250 (90)  | 50 (210)   | 300
female  | 200 (360) | 1000 (840) | 1200
Total   | 450       | 1050       | 1500

The expected frequencies can be verified for each cell. For example, the expected frequency for the cell (male, fiction) is
e11 = count(male) × count(fiction) / N = 300 × 450 / 1500 = 90
χ² = (250 − 90)²/90 + (50 − 210)²/210 + (200 − 360)²/360 + (1000 − 840)²/840
   = 284.44 + 121.90 + 71.11 + 30.48 = 507.93
The χ² statistic tests the hypothesis that A and B are independent. The test is based on a significance level, with (r − 1) × (c − 1) degrees of freedom.
For this 2 × 2 table, the degrees of freedom are (2 − 1)(2 − 1) = 1. For 1 degree of freedom, the χ² value needed to reject the hypothesis at the 0.001 significance level is 10.828.
Since our computed value of 507.93 is above this, we can reject the hypothesis that gender and preferred reading are independent and conclude that the two attributes are (strongly) correlated for the given group of people.
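The same test can be reproduced with scipy (correction=False turns off the Yates continuity correction so the result matches the hand calculation above):

import numpy as np
from scipy.stats import chi2_contingency

#                     fiction  nonfiction
observed = np.array([[250,      50],     # male
                     [200,    1000]])    # female

chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(chi2, dof)     # ≈ 507.93 with 1 degree of freedom
print(expected)      # [[ 90. 210.] [360. 840.]]
print(p < 0.001)     # True: reject independence of gender and preferred reading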
Question: Are gender and education level dependent at 5% level of significance? In other words, given the data collected above, is there a relationship between the gender of an individual and the level of education that they have obtained?
Null Hypothesis: Gender and Education level are independent
Alternative Hypothesis: Gender and Education level are correlated.
3. Detection and resolution of data value conflicts: for the same real-world entity, attribute values from different sources may differ, for example because of differences in representation, scaling, or encoding (a weight attribute may be stored in metric units in one system and in British imperial units in another).
Careful integration of the data from multiple sources can help reduce and avoid redundancies and inconsistencies in the resulting data set.
This can help improve the accuracy and speed of the subsequent mining process.
Data Integration and Transformation: Data Transformation
In data transformation, the data are transformed or consolidated into forms appropriate for mining. Data transformation can involve the following:
Smoothing - which works to remove noise from the data. Such techniques include binning, regression, and clustering.
Aggregation - where summary or aggregation operations are applied to the data. For example, the daily sales data may be aggregated so as to compute monthly and annual total amounts. This step is typically used in constructing a data cube for analysis of the data at multiple granularities.
Generalization of the data - where low-level or “primitive” (raw) data are replaced by higher-level concepts through the use of concept hierarchies. For example, categorical attributes, like street, can be generalized to higher-level concepts, like city or country. Similarly, values for numerical attributes, like age, may be mapped to higher-level concepts, like youth, middle-aged, and senior.
Normalization - where the attribute data are scaled so as to fall within a small specified range, such as −1.0 to 1.0, or 0.0 to 1.0.
Attribute construction (Feature construction) - where new attributes are constructed and added from the given set of attributes to help the mining process.
There are many methods for data normalization. We study three:
Min-max normalization:
Performs a linear transformation on the original data. Suppose that minA and maxA are the minimum and maximum values of an attribute A. Min-max normalization maps a value v of A to a value v' in the range [new_minA, new_maxA] by computing
v' = ((v − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA
z-score normalization:
The values for an attribute A are normalized based on the mean Ā and standard deviation σA of A. A value v of A is normalized to v' by computing
v' = (v − Ā) / σA
Normalization by decimal scaling:
Normalizes by moving the decimal point of values of attribute A. The number of decimal places moved depends on the maximum absolute value of A. A value v of A is normalized to v' by computing
v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1.
Example (min-max normalization): Suppose that the minimum and maximum values for the attribute income are $12,000 and $98,000, respectively, and that we want to map income to the range [0.0, 1.0]. By min-max normalization, a value of $73,600 for income is transformed to
((73,600 − 12,000) / (98,000 − 12,000)) × (1.0 − 0.0) + 0.0 = 0.716
Example (z-score normalization): Suppose that the mean and standard deviation of the values for the attribute income are $54,000 and $16,000, respectively. With z-score normalization, a value of $73,600 for income is transformed to
(73,600 − 54,000) / 16,000 = 1.225
Example (decimal scaling): Suppose that the recorded values of A range from −986 to 917. The maximum absolute value of A is 986. To normalize by decimal scaling, we divide each value by 1,000 (i.e., j = 3), so that −986 normalizes to −0.986 and 917 normalizes to 0.917.
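A small sketch of the three normalizations, checked against the worked numbers above:

def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    return (v - mean_a) / std_a

def decimal_scale(v, max_abs):
    j = 0
    while max_abs / (10 ** j) >= 1:      # smallest j such that max(|v'|) < 1
        j += 1
    return v / (10 ** j)

print(min_max(73600, 12000, 98000))      # ≈ 0.716
print(z_score(73600, 54000, 16000))      # 1.225
print(decimal_scale(-986, 986))          # -0.986
print(decimal_scale(917, 986))           # 0.917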
Data Reduction (CO: Apply different data reduction techniques)
Data Reduction Strategies: data cube aggregation, attribute subset selection (dimensionality reduction), data compression, numerosity reduction, and data discretization and concept hierarchy generation.
Data discretization is a form of numerosity reduction that is very useful for the automatic generation of concept hierarchies.
1. Data Cube Aggregation
Why Dimensionality Reduction?
2. Attribute Subset Selection (Dimensionality Reduction)
Also known as feature selection.
3. Data Compression
Wavelet Transforms
Haar Wavelet Transform:
Example: the input vector is [9, 7, 3, 5].
Output, using pairwise averages and half-differences: [6, 2, 1, −1] (a code sketch follows the procedure below).
1. The length, L, of the input data vector must be an integer power of 2. This condition can be met by padding the data vector with zeros as necessary (L ≥ n).
2. Each transform involves applying two functions. The first applies some data smoothing, such as a sum or weighted average. The second performs a weighted difference, which acts to bring out the detailed features of the data.
3. The two functions are applied to pairs of data points in X, that is, to all pairs of measurements (x2i,x2i+1). This results in two sets of data of length L/2. In general, these represent a smoothed or low-frequency version of the input data and the high frequency content of it, respectively.
4. The two functions are recursively applied to the sets of data obtained in the previous loop, until the resulting data sets obtained are of length 2.
5. Selected values from the data sets obtained in the above iterations are designated the wavelet coefficients of the transformed data.
Equivalently, a matrix multiplication can be applied to the input data in order to obtain the wavelet coefficients, where the matrix used depends on the given DWT.
The matrix must be orthonormal, meaning that the columns are unit vectors and are mutually orthogonal, so that the matrix inverse is just its transpose.
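A sketch of the recursive procedure above, using the unnormalized (averaging) Haar convention and applied to the example vector [9, 7, 3, 5]; an orthonormal DWT would divide by √2 instead of 2 at each step.

def haar(x):
    # unnormalized Haar transform: pairwise averages (smoothing) and half-differences (detail)
    details = []
    while len(x) > 1:
        avg = [(x[i] + x[i + 1]) / 2 for i in range(0, len(x), 2)]
        dif = [(x[i] - x[i + 1]) / 2 for i in range(0, len(x), 2)]
        details = dif + details      # keep detail coefficients, finest resolution last
        x = avg                      # recurse on the smoothed, half-length data
    return x + details               # overall average followed by the details

print(haar([9, 7, 3, 5]))            # [6.0, 2.0, 1.0, -1.0]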
Principal Components Analysis
1. The input data are normalized, so that each attribute falls within the same range. This step helps ensure that attributes with large domains will not dominate attributes with smaller domains.
2. PCA computes N orthonormal vectors which provide a basis for the normalized input data. These are unit vectors that each point in a direction perpendicular to the others. These vectors are referred to as the principal components. The input data are a linear combination of the principal components.
3. The principal components are sorted in order of decreasing "significance" or strength. The principal components essentially serve as a new set of axes for the data, providing important information about variance. That is, the sorted axes are such that the first axis shows the most variance among the data, the second axis shows the next highest variance, and so on. This information helps identify groups or patterns within the data.
4. Because the components are sorted in decreasing order of "significance", the size of the data can be reduced by eliminating the weaker components, i.e., those with low variance. Using the strongest principal components, it should be possible to reconstruct a good approximation of the original data.
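A compact numpy sketch of steps 1-4 (normalize, find orthonormal directions, sort them by variance, keep the strongest); the data here are randomly generated for illustration:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # 200 tuples, 5 attributes

# 1. normalize each attribute so that no attribute dominates
Xn = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. orthonormal basis: eigenvectors of the covariance matrix
eigvals, eigvecs = np.linalg.eigh(np.cov(Xn, rowvar=False))

# 3. sort components by decreasing variance (eigenvalue)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. keep only the strongest components (here 2) and project the data onto them
X_reduced = Xn @ eigvecs[:, :2]
print(X_reduced.shape)              # (200, 2)
print(eigvals / eigvals.sum())      # fraction of total variance per component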
4. Numerosity reduction
Regression and log-linear models: these approximate the given data with a model, so that only the model parameters need to be stored instead of the actual data. In (simple) linear regression, for example, the data are modeled to fit a straight line y = wx + b, and only the two regression coefficients w and b are kept (see the sketch below).
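A minimal numpy sketch of this idea with made-up points: the six (x, y) pairs are replaced by just the two coefficients w and b.

import numpy as np

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([2.1, 4.2, 5.9, 8.1, 9.8, 12.2])   # roughly y = 2x

w, b = np.polyfit(x, y, deg=1)      # least-squares slope and intercept
print(w, b)                         # the reduced (parametric) representation
print(w * x + b)                    # approximate reconstruction of the data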
Histograms
Example: The following data are a list of prices of commonly sold items at AllElectronics (rounded to the nearest dollar). The numbers have been sorted:
1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30.
How are the buckets determined and the attribute values partitioned?
There are several partitioning rules, including the following:
1. Equi-width: In an equi-width histogram, the width of each bucket range is constant (such as the width of $10 for the buckets in Figure ).
2. Equi-depth (or equi-height): In an equi-depth histogram, the buckets are created so that, roughly, the frequency of each bucket is constant (that is, each bucket contains roughly the same number of contiguous data samples).
3. V-Optimal: If we consider all of the possible histograms for a given number of buckets, the V-optimal histogram is the one with the least variance. Histogram variance is a weighted sum of the original values that each bucket represents, where bucket weight is equal to the number of values in the bucket.
4. MaxDiff: In a MaxDiff histogram, we consider the difference between each pair of adjacent values. A bucket boundary is established between each pair for pairs having the β − 1 largest differences, where β is user-specified.
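A small numpy sketch contrasting rules 1 and 2 above on the sorted price list:

import numpy as np

prices = np.array([1]*2 + [5]*5 + [8]*2 + [10]*4 + [12] + [14]*3 + [15]*6 +
                  [18]*8 + [20]*7 + [21]*4 + [25]*6 + [28]*2 + [30]*3)

# 1. equi-width: three buckets of width $10 (1-10, 11-20, 21-30)
counts, _ = np.histogram(prices, bins=[1, 11, 21, 31])
print(counts)                                    # [13 25 15]

# 2. equi-depth: three buckets holding roughly the same number of values
for b in np.array_split(prices, 3):
    print(int(b.min()), int(b.max()), len(b))    # (low, high, frequency) per bucket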
Clustering: cluster representations of the data, such as cluster centroids and their diameters, can be stored in place of the actual data. The effectiveness of this technique depends on how "clustered" the data are.
Sampling: allows a large data set D, containing N tuples, to be represented by a much smaller random sample (subset) s of the data. Common ways to draw s include:
1. Simple random sample without replacement (SRSWOR) of size n: This is created by drawing n of the N tuples from D (n < N), where the probability of drawing any tuple in D is 1/N, i.e., all tuples are equally likely.
2. Simple random sample with replacement (SRSWR) of size n: This is similar to SRSWOR, except that each time a tuple is drawn from D, it is recorded and then replaced. That is, after a tuple is drawn, it is placed back in D so that it may be drawn again.
3. Cluster sample: If the tuples in D are grouped into M mutually disjoint "clusters", then an SRS of m clusters can be obtained, where m < M. For example, tuples in a database are usually retrieved a page at a time, so that each page can be considered a cluster. A reduced data representation can be obtained by applying, say, SRSWOR to the pages, resulting in a cluster sample of the tuples.
4. Stratified sample: If D is divided into mutually disjoint parts called "strata", a stratified sample of D is generated by obtaining an SRS at each stratum. This helps ensure a representative sample, especially when the data are skewed. For example, a stratified sample may be obtained from customer data, where a stratum is created for each customer age group. In this way, the age group having the smallest number of customers will be sure to be represented.
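A short pandas sketch of SRSWOR, SRSWR, and a stratified sample of the schemes above (the tiny customer table and age groups are made up):

import pandas as pd

D = pd.DataFrame({"cust_id": range(1, 11),
                  "age_group": ["youth"] * 2 + ["middle_aged"] * 5 + ["senior"] * 3})

srswor = D.sample(n=4, replace=False, random_state=1)   # without replacement
srswr  = D.sample(n=4, replace=True,  random_state=1)   # with replacement

# stratified sample: an SRS drawn within every age_group stratum (one tuple each here)
stratified = D.groupby("age_group", group_keys=False).sample(n=1, random_state=1)
print(stratified)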
Data Discretization and Concept Hierarchy Generation
Entropy-Based Discretization:
Entropy-based discretization is a supervised, top-down splitting technique that uses class distribution information and the measure of information gain. It selects the value of A that minimizes the expected information requirement (equivalently, maximizes the information gain) as a split-point, and recursively partitions the resulting intervals to arrive at a hierarchical discretization.
The basic method for entropy-based discretization of an attribute A within the data set D is as follows:
1. Each value of A can be considered as a potential interval boundary or split-point (denoted split point) to partition the range of A. That is, a split-point for A can partition the tuples in D into two subsets satisfying the conditions A ≤ split point and A > split point, respectively, thereby creating a binary discretization.
2. Entropy-based discretization, as mentioned above, uses information regarding the class label of tuples. Suppose we want to classify the tuples in D by partitioning on attribute A and some split-point. Ideally, we would like this partitioning to result in an exact classification of the tuples. For example, if we had two classes, we would hope that all of the tuples of, say, class C1 will fall into one partition, and all of the tuples of class C2 will fall into the other partition. However, this is unlikely. For example, the first partition may contain many tuples of C1, but also some of C2. How much more information would we still need for a perfect classification, after this partitioning? This amount is called the expected information requirement for classifying a tuple in D based on partitioning by A. It is given by
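Following the reference text, the expected information requirement can be written as:

InfoA(D) = (|D1| / |D|) × Entropy(D1) + (|D2| / |D|) × Entropy(D2)

where D1 and D2 are the tuples in D satisfying A ≤ split_point and A > split_point respectively, Entropy(Di) = −Σj pj log2(pj), and pj is the proportion of tuples in Di that belong to class Cj. The split-point that minimizes InfoA(D), i.e., maximizes the information gain, is selected.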
3. The process of determining a split-point is recursively applied to each partition obtained, until some stopping criterion is met, such as when the minimum information requirement on all candidate split-points is less than a small threshold, ε, or when the number of intervals is greater than a threshold, max_interval.
Entropy-based discretization can reduce data size.
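A compact sketch of one round of steps 1 and 2, on a small made-up labeled attribute: compute InfoA(D) for every candidate split-point and keep the best one.

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(values, labels):
    # return the split-point of A that minimizes the expected information requirement
    best = None
    for sp in sorted(set(values))[:-1]:                    # candidate split-points
        left  = [l for v, l in zip(values, labels) if v <= sp]
        right = [l for v, l in zip(values, labels) if v > sp]
        info = (len(left) / len(values)) * entropy(left) + \
               (len(right) / len(values)) * entropy(right)
        if best is None or info < best[1]:
            best = (sp, info)
    return best

A      = [21, 24, 30, 35, 42, 50, 61, 65]                  # hypothetical attribute values
labels = ["C1", "C1", "C1", "C1", "C2", "C2", "C2", "C2"]  # class labels
print(best_split(A, labels))    # splitting at 35 separates the classes, InfoA(D) = 0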
Segmentation by natural partitioning: the 3-4-5 rule can be used to segment numeric data into relatively uniform, natural-seeming intervals. It partitions a given data range into 3, 4, or 5 relatively equal-width intervals, recursively and level by level, based on the range of distinct values at the most significant digit:
(a) If an interval covers 3, 6, 7 or 9 distinct values at the most significant digit, then partition the range into 3 intervals (3 equi-width intervals for 3, 6, 9, and three intervals in the grouping of 2-3-2 for 7);
(b) if it covers 2, 4, or 8 distinct values at the most significant digit, then partition the range into 4 equi-width intervals; and
(c) if it covers 1, 5, or 10 distinct values at the most significant digit, then partition the range into 5 equi-width intervals.
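A rough sketch of the top-level partitioning step of this rule (the rounding of the interval bounds at the most significant digit is a simplifying assumption):

import math

def three_four_five(low, high):
    # split [low, high] into 3, 4, or 5 intervals based on the most significant digit
    unit = 10 ** int(math.floor(math.log10(high - low)))   # most significant digit
    low_r = math.floor(low / unit) * unit                  # round low down at that digit
    high_r = math.ceil(high / unit) * unit                 # round high up at that digit
    distinct = round((high_r - low_r) / unit)              # distinct values at that digit

    if distinct == 7:                                      # 2-3-2 grouping
        w = (high_r - low_r) / 7
        return [low_r, low_r + 2 * w, low_r + 5 * w, high_r]
    if distinct in (3, 6, 9):
        k = 3
    elif distinct in (2, 4, 8):
        k = 4
    else:                                                  # 1, 5 or 10
        k = 5
    w = (high_r - low_r) / k
    return [low_r + i * w for i in range(k + 1)]

print(three_four_five(-159876, 1838761))   # [-1000000.0, 0.0, 1000000.0, 2000000.0]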
The following example illustrates the use of the 3-4-5 rule for the automatic construction of a numeric hierarchy:
Example of the 3-4-5 rule
Concept hierarchy generation for categorical data
Specification of a set of attributes: a concept hierarchy can be generated automatically based on the number of distinct values of each attribute in the given set. The attribute with the most distinct values is placed at the lowest level of the hierarchy, and the attribute with the fewest at the top (for example, street < city < state < country).
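A tiny pandas sketch of this heuristic (the location table is made up): counting distinct values per attribute yields the ordering street < city < state < country.

import pandas as pd

loc = pd.DataFrame({
    "country": ["Canada", "Canada", "USA", "USA", "USA", "USA"],
    "state":   ["BC", "ON", "NY", "NY", "IL", "IL"],
    "city":    ["Vancouver", "Toronto", "New York", "Buffalo", "Chicago", "Chicago"],
    "street":  ["Main St", "King St", "5th Ave", "Elm St", "Lake St", "Oak St"],
})

# fewest distinct values -> top of the hierarchy; most distinct -> bottom
order = loc.nunique().sort_values().index.tolist()
print(" < ".join(reversed(order)))      # street < city < state < country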
Data Mining Primitives
Data Mining Query Language
Mining classification rules: Suppose, as a marketing manager of AllElectronics, you would like to classify customers based on their buying patterns. You are especially interested in those customers whose salary is no less than $40,000, and who have bought more than $1,000 worth of items, each of which is priced at no less than $100. In particular, you are interested in the customer’s age, income, the types of items purchased, the purchase location, and where the items were made. You would like to view the resulting classification in the form of rules.
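Leaving the DMQL syntax itself to the reference text, the data selection described here can be sketched in pandas; the table and column names below are assumptions for illustration only.

import pandas as pd

customers = pd.DataFrame({"cust_id": [1, 2], "age": [35, 28], "income": [52000, 38000]})
items = pd.DataFrame({"item_id": [10, 11], "type": ["TV", "PC"],
                      "price": [450, 900], "place_made": ["Japan", "USA"]})
transactions = pd.DataFrame({"cust_id": [1, 1, 2], "item_id": [10, 11, 10],
                             "place": ["Vancouver", "Vancouver", "Toronto"]})

df = transactions.merge(customers, on="cust_id").merge(items, on="item_id")

# salary (income) >= $40,000 and items priced >= $100
df = df[(df["income"] >= 40000) & (df["price"] >= 100)]

# customers whose qualifying purchases total more than $1,000
totals = df.groupby("cust_id")["price"].sum()
promising = totals[totals > 1000].index
print(df[df["cust_id"].isin(promising)]
        [["cust_id", "age", "income", "type", "place", "place_made"]])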
Concept Description
Data Generalization and Summarization-based Characterization
Characterization: Data Cube Approach
Characterization: Attribute-Oriented Induction
Analytical characterization: Analysis of attribute relevance
Mining Class Comparisons
THANK YOU