1 of 104

DATA MINING AND DATA WAREHOUSING

B.TECH - VI SEMESTER

R17 Regulation

Dr. Anupriya Koneru

Reference: Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, 2/e, Elsevier, 2006.

2 of 104

UNIT-II

  • Data Pre-processing: 
    • Need for Pre-processing the Data
    • Data Cleaning
    • Data Integration and Transformation
    • Data Reduction
    • Discretization and Concept Hierarchy Generation
    • Data Mining Primitives
    • Data Mining Query Languages
    • Concept Description
  • Data Mining Primitives, Languages, And System Architectures: 
    • Data Generalization and Summarization based Characterization
    • Analytical Characterization
    • Discriminating between Different Classes
    • Mining Descriptive Statistical Measures in Large Databases

3 of 104

Need for Pre-processing the Data - CO: Identify the need for pre-processing

  • Data in the real world is dirty.
  • It can be incomplete, noisy, and inconsistent in form.
  • Such data needs to be preprocessed in order to improve the quality of the data and, in turn, the quality of the mining results.
  • If there is no quality data, then there are no quality mining results.
  • Quality decisions are always based on quality data.
  • If there is much irrelevant and redundant information, or noisy and unreliable data, then knowledge discovery during the training phase is more difficult.

Incomplete data: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data. e.g., occupation=“ ”.

Noisy data: containing errors or outlier values. e.g., Salary=“-10”

Inconsistent data: containing discrepancies in codes or names. e.g., Age=“42” Birthday=“03/07/1997”

4 of 104

Need for Pre-processing the Data - CO: Identify the need for pre-processing

  • Incomplete data may come from
    • “Not applicable” data value when collected
    • Different considerations between the time when the data was collected and when it is analyzed.
    • Human/hardware/software problems
  • Noisy data (incorrect values) may come from
    • Faulty data collection by instruments
    • Human or computer error at data entry
    • Errors in data transmission
  • Inconsistent data may come from
    • Different data sources
    • Functional dependency violation (e.g., modify some linked data)

5 of 104

Need for Pre-processing the Data - CO: Identify the need for pre-processing

Major Tasks in Data Preprocessing

  • Data cleaning
    • Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
  • Data integration
    • Integration of multiple databases, data cubes, or files
  • Data transformation
    • Normalization and aggregation
  • Data reduction
    • Obtains reduced representation in volume but produces the same or similar analytical results
  • Data discretization
    • Part of data reduction but with particular importance, especially for numerical data

6 of 104

Need for Pre-processing the Data - CO: Identify the need for pre-processing

7 of 104

Descriptive Data Summarization - CO: Understand the statistical analysis and representation of data

Measure the Central Tendency

    • A measure of central tendency is a single value that attempts to describe a set of data by identifying the central position within that set of data.
    • As such, measures of central tendency are sometimes called measures of central location.
    • In other words, in many real-life situations, it is helpful to describe data by a single number that is most representative of the entire collection of numbers.
    • Such a number is called a measure of central tendency.
    • The most commonly used measures are the mean, median, and mode.
  • Mean: the mean, or average, of n numbers is the sum of the numbers divided by n; that is, mean = (x1 + x2 + ... + xn) / n.

8 of 104

Descriptive Data Summarization - CO: Understand the statistical analysis and representation of data

Measure the Central Tendency

  • Midrange - The midrange of a data set is the average of the minimum and maximum values.
  • Median: the median of a set of numbers is the middle number when the numbers are written in order. If n is even, the median is the average of the two middle numbers.
  • Trimmed mean - a trimmed mean eliminates extreme observations by removing observations from each end of the ordered sample. It is calculated by discarding a certain percentage of the lowest and highest scores and then computing the mean of the remaining scores.
  • Mode: the mode of a set of numbers is the number that occurs most frequently. If two numbers tie for most frequent occurrence, the collection has two modes and is called bimodal.
    • It is possible for a set of data values to have more than one mode.
    • If there are two data values that occur most frequently, we say that the set of data values is bimodal.
    • If there are three data values that occur most frequently, we say that the set of data values is trimodal.
    • If two or more data values occur most frequently, we say that the set of data values is multimodal.
    • If no data value occurs more frequently than the others, we say that the set of data values has no mode.
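The measures above can be computed directly. The following is a minimal Python sketch using the standard library's statistics module; the sample data and the trimming proportion are hypothetical choices for illustration.

```python
from statistics import mean, median, multimode

data = [20, 25, 25, 30, 45, 50, 110]  # hypothetical sample

print("mean    :", mean(data))                   # sum of the values divided by n
print("median  :", median(data))                 # middle value of the ordered data
print("mode(s) :", multimode(data))              # most frequent value(s)
print("midrange:", (min(data) + max(data)) / 2)  # average of the minimum and maximum

# Trimmed mean: discard a percentage of the lowest and highest values,
# then average what remains (here 20% from each end).
def trimmed_mean(values, proportion=0.2):
    values = sorted(values)
    k = int(len(values) * proportion)   # number of observations to drop per end
    trimmed = values[k:len(values) - k] if k else values
    return mean(trimmed)

print("trimmed mean:", trimmed_mean(data))
```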

9 of 104

Descriptive Data Summarization - CO: Understand the statistical analysis and representation of data

Measure the Central Tendency

To analyze data using the mean, median and mode, we need to use the most appropriate measure of central tendency. The following points should be remembered:

    • The mean is useful for predicting future results when there are no extreme values in the data set. However, the impact of extreme values on the mean may be important and should be considered. E.g. the impact of a stock market crash on average investment returns.
    • The median may be more useful than the mean when there are extreme values in the data set as it is not affected by the extreme values.
    • The mode is useful when the most common item, characteristic or value of a data set is required.

10 of 104

Descriptive Data Summarization - CO: Understand the statistical analysis and representation of data

Measures of Dispersion

    • Measures of dispersion measure how spread out a set of data is.
    • The two most commonly used measures of dispersion are the variance and the standard deviation. Rather than showing how similar the data are, they show how the data differ, that is, their variation, spread, or dispersion.
    • Other measures of dispersion that may be encountered include quartiles, the interquartile range (IQR), the five-number summary, the range, and box plots.
  • Variance and Standard Deviation
    • Very different sets of numbers can have the same mean.
    • You will now study two measures of dispersion, which give you an idea of how much the numbers in a set differ from the mean of the set.
    • These two measures are called the variance of the set and the standard deviation of the set

11 of 104

Descriptive Data Summarization - CO: Understand the statistical analysis and representation of data

  • Variance and Standard Deviation
    • It is a measure of how much a typical number in the set differs from the mean.
    • The greater the standard deviation, the more the numbers vary.
  • Percentile
    • Percentiles are values that divide a sample of data into one hundred groups containing (as far as possible) equal numbers of observations.
    • The pth percentile of a distribution is the value such that p percent of the observations fall at or below it.
  • Quartiles
    • Quartiles are numbers that divide an ordered data set into four portions, each containing approximately one-fourth of the data.
    • Twenty-five percent of the data values come before the first quartile (Q1).
    • The median is the second quartile (Q2); 50% of the data values come before the median.
    • Seventy-five percent of the data values come before the third quartile (Q3).
    • The position of Q1 (the 25th percentile) in the ordered data is approximately n×25/100, where n is the total number of data values in the data set.
    • Q2 = median = 50th percentile, at position approximately n×50/100.
    • Q3 = 75th percentile, at position approximately n×75/100.

12 of 104

Descriptive Data Summarization - CO: Understand the statistical analysis and representation of data

  • Quartiles
    • Interquartile range (IQR): the interquartile range is the length of the interval between the lower quartile (Q1) and the upper quartile (Q3).
    • This interval indicates the central, or middle, 50% of a data set.
    • IQR = Q3 − Q1
  • Range
    • The range of a set of data is the difference between its largest (maximum) and smallest (minimum) values.
    • In the statistical world, the range is reported as a single number, the difference between the maximum and the minimum.
    • Sometimes, the range is instead reported as “from (the minimum) to (the maximum),” i.e., two numbers.
  • Five-Number Summary
    • The Five-Number Summary of a data set is a five-item list comprising the minimum value, first quartile, median, third quartile, and maximum value of the set.
    • {MIN, Q1, MEDIAN (Q2), Q3, MAX}
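As an illustration of the measures above, here is a small Python sketch; the data set is hypothetical, and the quartiles are computed with the statistics module, whose interpolation may differ slightly from the n×p/100 position rule given earlier.

```python
from statistics import quantiles

data = sorted([6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49])  # hypothetical sample

q1, q2, q3 = quantiles(data, n=4, method="inclusive")  # quartiles Q1, Q2 (median), Q3
iqr = q3 - q1                                          # interquartile range
data_range = max(data) - min(data)                     # range = max - min

five_number_summary = (min(data), q1, q2, q3, max(data))
print("Q1, Q2, Q3:", q1, q2, q3)
print("IQR       :", iqr)
print("Range     :", data_range)
print("Five-number summary:", five_number_summary)
```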

13 of 104

Descriptive Data Summarization - CO: Understand the statistical analysis and representation of data

  • Box plots
    • A box plot is a graph used to represent the range, median, quartiles and inter quartile range of a set of data values.
    • Constructing a Box plot: To construct a box plot:

(i) Draw a box to represent the middle 50% of the observations of the data set.

(ii) Show the median by drawing a vertical line within the box.

(iii) Draw the lines (called whiskers) from the lower and upper ends of the box to the minimum and maximum values of the data set, respectively.

    • Outliers: an outlier is a data value that falls far outside the rest of the data. Outliers are taken to be any points below Q1 − 1.5×IQR or above Q3 + 1.5×IQR.
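A small sketch of the 1.5×IQR outlier rule just described; the data are hypothetical, and the exact quartile interpolation may differ from a hand calculation.

```python
from statistics import quantiles

data = [35, 36, 38, 40, 42, 43, 45, 47, 120]  # hypothetical data with one extreme value

q1, _, q3 = quantiles(sorted(data), n=4, method="inclusive")
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower_fence or x > upper_fence]
print("fences  :", lower_fence, upper_fence)
print("outliers:", outliers)   # 120 falls above the upper fence
```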

14 of 104

Descriptive Data Summarization - CO: Understand the statistical analysis and representation of data - Graphic Displays of Basic Descriptive Data Summaries

Histogram

  • A histogram is a way of summarizing data that are measured on an interval scale (either discrete or continuous).
  • It is often used in exploratory data analysis to illustrate the major features of the distribution of the data in a convenient form.
  • It divides up the range of possible values in a data set into classes or groups.
  • For each group, a rectangle is constructed with a base length equal to the range of values in that specific group, and an area proportional to the number of observations falling into that group.
  • This means that the rectangles might be drawn of non-uniform height.
  • The histogram is only appropriate for variables whose values are numerical and measured on an interval scale.
  • It is generally used when dealing with large data sets (>100 observations). A histogram can also help detect any unusual observations (outliers), or any gaps in the data set.

15 of 104

Descriptive Data Summarization - CO: Understand the statistical analysis and representation of data - Graphic Displays of Basic Descriptive Data Summaries

Scatter Plot

  • A scatter plot is a useful summary of a set of bivariate data (two variables), usually drawn before working out a linear correlation coefficient or fitting a regression line.
  • It gives a good visual picture of the relationship between the two variables, and aids the interpretation of the correlation coefficient or regression model.
  • Each unit contributes one point to the scatter plot, on which points are plotted but not joined.
  • The resulting pattern indicates the type and strength of the relationship between the two variables.
  • A scatter plot will also show up a non-linear relationship between the two variables and whether or not there exist any outliers in the data.

16 of 104

Descriptive Data Summarization - CO: Understand the statistical analysis and representation of data - Graphic Displays of Basic Descriptive Data Summaries

Loess curve

  • It is another important exploratory graphic aid that adds a smooth curve to a scatter plot in order to provide better perception of the pattern of dependence.
  • The word loess is short for “local regression.”

Box plot

  • The picture produced consists of the most extreme values in the data set (maximum and minimum values), the lower and upper quartiles, and the median.

Quantile plot

  • Displays all of the data (allowing the user to assess both the overall behavior and unusual occurrences)
  • Plots quantile information
  • For data xi sorted in increasing order, fi indicates that approximately 100·fi% of the data are below or equal to the value xi.

17 of 104

Descriptive Data Summarization - CO: Understand the statistical analysis and representation of data - Graphic Displays of Basic Descriptive Data Summaries

Quantile-Quantile plots (Q-Q plot)

  • Quantile-quantile plots allow us to compare the quantiles of two sets of numbers.
  • This kind of comparison is much more detailed than a simple comparison of means or medians.
  • A normal distribution is often a reasonable model for the data.
  • Without inspecting the data, however, it is risky to assume a normal distribution.
  • There are a number of graphs that can be used to check the deviations of the data from the normal distribution.
  • The most useful tool for assessing normality is a quantile or QQ plot.
  • This is a scatter plot with the quantiles of the scores on the horizontal axis and the expected normal scores on the vertical axis.

18 of 104

Descriptive Data Summarization - CO: Understand the statistical analysis and representation of data - Graphic Displays of Basic Descriptive Data Summaries

Quantile-Quantile plots (Q-Q plot)

The steps in constructing a QQ plot are as follows:

  • First, we sort the data from smallest to largest.
  • A plot of these scores against the expected normal scores should reveal a straight line.
  • The expected normal scores are calculated by taking the z-scores of (I - ½)/n where I is the rank in increasing order.
  • Curvature of the points indicates departures from normality.
  • This plot is also useful for detecting outliers.
  • The outliers appear as points that are far away from the overall pattern of points.

19 of 104

Data Cleaning - CO: Apply different methods of Data Cleaning

Data cleaning routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data.

Various methods for handling this problem:

Missing Values

Noisy data

20 of 104

Data Cleaning - CO: Apply different methods of Data Cleaning - Missing Values

Missing Values

The various methods for handling the problem of missing values in data tuples include:

  (a) Ignoring the tuple: 
    • This is usually done when the class label is missing (assuming the mining task involves classification or description).
    • This method is not very effective unless the tuple contains several attributes with missing values.
    • It is especially poor when the percentage of missing values per attribute varies considerably

(b) Manually filling in the missing value:

    • In general, this approach is time-consuming and may not be a reasonable task for large data sets with many missing values, especially when the value to be filled in is not easily determined.

(Example table with input attributes I1, I2, I3 and a class attribute Output, illustrating tuples that contain missing values; the original row layout is not recoverable from the slide.)

21 of 104

Data Cleaning - CO: Apply different methods of Data Cleaning - Missing Values

Missing Values

The various methods for handling the problem of missing values in data tuples include:

(c) Using a global constant to fill in the missing value:

    • Replace all missing attribute values by the same constant, such as a label like “Unknown,” or −∞.
    • If missing values are replaced by, say, “Unknown,” then the mining program may mistakenly think that they form an interesting concept, since they all have a value in common — that of “Unknown.”
    • Hence, although this method is simple, it is not recommended.

I1   I2   I3   Output
20   400  NA   0
25   NA   100  1
20   400  80   0

22 of 104

Data Cleaning - CO: Apply different methods of Data Cleaning - Missing Values

Missing Values

The various methods for handling the problem of missing values in data tuples include:

(d) Using the attribute mean for quantitative (numeric) values or attribute mode for categorical (nominal) values, for all samples belonging to the same class as the given tuple: For example, if classifying customers according to credit risk, replace the missing value with the average income value for customers in the same credit risk category as that of the given tuple.

(e) Using the most probable value to fill in the missing value: This may be determined with regression, inference-based tools using Bayesian formalism, or decision tree induction.

  • For example, using the other customer attributes in your data set, you may construct a decision tree to predict the missing values for income.

Numeric attributes (fill each NA with the attribute mean):

I1   I2   I3   Output
20   400  NA   0
25   NA   100  1
20   400  80   0
30   200  99   1

Categorical attributes (fill each NA with the attribute mode):

I1   I2    I3    Output
20   good  NA    0
25   NA    good  1
20   bad   bad   0
30   good  good  1
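A minimal sketch of filling missing values with the attribute mean (numeric columns) or mode (categorical columns), mirroring the numeric table above. The column names and the use of None as the NA marker are assumptions for illustration; a class-conditional variant, as described in (d), would compute the mean or mode within each Output class instead.

```python
from statistics import mean, multimode

# Rows mirror the numeric table above; None marks a missing value (NA).
rows = [
    {"I1": 20, "I2": 400,  "I3": None, "Output": 0},
    {"I1": 25, "I2": None, "I3": 100,  "Output": 1},
    {"I1": 20, "I2": 400,  "I3": 80,   "Output": 0},
    {"I1": 30, "I2": 200,  "I3": 99,   "Output": 1},
]

def fill_missing(rows, column, strategy):
    """Replace None in `column` with the mean or mode of the observed values."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = mean(observed) if strategy == "mean" else multimode(observed)[0]
    for r in rows:
        if r[column] is None:
            r[column] = fill
    return rows

fill_missing(rows, "I2", "mean")   # NA in I2 -> mean(400, 400, 200) = 333.33...
fill_missing(rows, "I3", "mean")   # NA in I3 -> mean(100, 80, 99) = 93
for r in rows:
    print(r)
```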

23 of 104

Data Cleaning - CO: Apply different methods of Data Cleaning - Noisy Data

  • Noise is a random error or variance in a measured variable.
  • Data smoothing techniques are used to remove such noisy data.

Several Data smoothing techniques:

Binning methods:

    • Binning methods smooth a sorted data value by consulting its “neighborhood,” that is, the values around it.
    • The sorted values are distributed into a number of “buckets,” or bins.
    • Because binning methods consult the neighborhood of values, they perform local smoothing.

In this technique:

    • The data are first sorted.
    • The sorted list is then partitioned into equal-frequency (equi-depth) bins.
    • One can then smooth by bin means, by bin medians, or by bin boundaries:
      • Smoothing by bin means: Each value in the bin is replaced by the mean value of the bin.
      • Smoothing by bin medians: Each value in the bin is replaced by the bin median.
      • Smoothing by boundaries: The min and max values of a bin are identified as the bin boundaries. Each bin value is replaced by the closest boundary value.

24 of 104

Data Cleaning - CO: Apply different methods of Data Cleaning - Noisy Data

Binning methods:

Example: Binning Methods for Data Smoothing

Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

Partition into three equal-frequency (equi-depth) bins, each containing four values:

Bin 1: 4, 8, 9, 15

Bin 2: 21, 21, 24, 25

Bin 3: 26, 28, 29, 34

Smoothing by bin means (each value replaced by the rounded bin mean):

Bin 1: 9, 9, 9, 9

Bin 2: 23, 23, 23, 23

Bin 3: 29, 29, 29, 29

Smoothing by bin boundaries (each value replaced by the closest bin boundary):

Bin 1: 4, 4, 4, 15

Bin 2: 21, 21, 25, 25

Bin 3: 26, 26, 26, 34

A short code sketch of this procedure follows.
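The sketch below reproduces the equi-depth binning and the two smoothing variants on the same price data; the rounding of bin means is an assumption made to match the slide's integer output.

```python
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # already sorted

def equi_depth_bins(values, depth):
    """Partition a sorted list into bins of `depth` values each."""
    return [values[i:i + depth] for i in range(0, len(values), depth)]

def smooth_by_means(bins):
    # Replace every value by the (rounded) mean of its bin.
    return [[round(sum(b) / len(b)) for _ in b] for b in bins]

def smooth_by_boundaries(bins):
    # Replace each value by the closer of the bin's min/max boundary.
    return [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

bins = equi_depth_bins(prices, depth=4)
print("bins           :", bins)
print("bin means      :", smooth_by_means(bins))       # [[9,9,9,9], [23,...], [29,...]]
print("bin boundaries :", smooth_by_boundaries(bins))  # [[4,4,4,15], [21,21,25,25], [26,26,26,34]]
```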

25 of 104

Data Cleaning - CO: Apply different methods of Data Cleaning - Noisy Data

Clustering:

  • Outliers in the data may be detected by clustering, where similar values are organized into groups, or ‘clusters’.
  • Values that fall outside of the set of clusters may be considered outliers.

Regression :

  • Data can be smoothed by fitting it to a regression function.
  • Linear regression involves finding the best line to fit two variables, so that one variable can be used to predict the other.
  • Multiple linear regression is an extension of linear regression, where more than two variables are involved and the data are fit to a multidimensional surface.
  • Using regression to find a mathematical equation to fit the data helps smooth out the noise.

26 of 104

Data Cleaning - CO: Apply different methods of Data Cleaning - Data Cleaning as a Process

  • Data cleaning is a big job.
  • What about data cleaning as a process?
  • How exactly does one proceed in tackling this task?
  • Are there any tools out there to help?
  • The first step in data cleaning as a process is discrepancy detection.
  • Discrepancies can be caused by several factors:
    • Poorly designed data entry forms that have many optional fields
    • Human error in data entry
    • Deliberate errors (e.g., respondents not wanting to divulge information about themselves)
    • Data decay (e.g., outdated addresses).
    • Discrepancies may also arise from inconsistent data representations and the inconsistent use of codes.
    • Errors in instrumentation devices that record data, and system errors, are another source of discrepancies.
    • Errors can also occur when the data are (inadequately) used for purposes other than originally intended.
    • There may also be inconsistencies due to data integration (e.g., where a given attribute can have different names in different databases)

27 of 104

Data Cleaning - CO: Apply different methods of Data Cleaning - Data Cleaning as a Process

“So, how can we proceed with discrepancy detection?”

  • Data discrepancy detection
    • Use metadata (e.g., domain, range, dependency, distribution)
    • Check for field overloading: a source of errors that typically occurs when developers compress new attribute definitions into unused portions of already defined attributes.
    • Check the uniqueness rule, consecutive rule, and null rule:
      • A uniqueness rule says that each value of the given attribute must be different from all other values of that attribute.
      • A consecutive rule says that there can be no missing values between the lowest and highest values of the attribute, and that all values must also be unique.
      • A null rule specifies the use of blanks, question marks, special characters, or other strings that may indicate the null condition, and how such values should be handled.
    • Use commercial tools
      • Data scrubbing: use simple domain knowledge (e.g., postal code, spell-check) to detect errors and make corrections
      • Data auditing: by analyzing data to discover rules and relationship to detect violators (e.g., correlation and clustering to find outliers)

28 of 104

Data Cleaning - CO: Apply different methods of Data Cleaning - Data Cleaning as a Process

“So, how can we proceed with discrepancy detection?”

    • Use commercial tools
      • Some data inconsistencies may be corrected manually using external references.
      • For example, errors made at data entry may be corrected by performing a paper trace.
      • Most errors, however, will require data transformations.
      • This is the second step in data cleaning as a process.
      • That is, once we find discrepancies, we typically need to define and apply (a series of) transformations to correct them.
      • Commercial tools can assist in the data transformation step.
      • Data migration tools allow simple transformations to be specified, such as to replace the string “gender” by “sex”.
      • ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface
  • Integration of the two processes
    • Discrepancy detection and data transformation iterate, making data cleaning an iterative and interactive process.

29 of 104

Data Integration and Transformation - CO: Apply different data integration and transformation techniques - Data Integration

  • Data mining often requires data integration—the merging of data from multiple data stores.
  • The data may also need to be transformed into forms appropriate for mining.
  • Data integration, which combines data from multiple sources into a coherent data store, as in data warehousing.

There are a number of issues to consider during data integration:

1. Schema integration and object matching

How can equivalent real-world entities from multiple data sources be matched up?

    • This is referred to as the entity identification problem.
    • For example, how can the data analyst or the computer be sure that customer id in one database and cust-number in another refer to the same attribute?
    • Examples of metadata for each attribute include its name, meaning, data type, range of permitted values, and null rules for handling blank, zero, or null values; such metadata can be used to help avoid errors in schema integration.

30 of 104

Data Integration and Transformation - CO: Apply different data integration and transformation techniques - Data Integration

There are a number of issues to consider during data integration:

2. Redundancy is another important issue.

    • An attribute (such as annual revenue, for instance) may be redundant if it can be “derived” from another attribute or set of attributes.
    • Inconsistencies in attribute or dimension naming can also cause redundancies in the resulting data set.
    • Some redundancies can be detected by Correlation Analysis.
    • Given two attributes, such analysis can measure how strongly one attribute implies the other, based on the available data.
    • For numerical attributes, we can evaluate the correlation between two attributes, A and B, by computing the correlation coefficient (also known as Pearson’s product moment coefficient, named after its inventor, Karl Pearson):

      r(A,B) = Σ(ai − Ā)(bi − B̄) / (N·σA·σB) = (Σ(ai·bi) − N·Ā·B̄) / (N·σA·σB)

      where
      • N is the number of tuples
      • ai and bi are the values of A and B in tuple i
      • Ā and B̄ are the mean values of A and B
      • σA and σB are the standard deviations of A and B
      • Σ(ai·bi) is the sum of the AB cross-product

The value always lies in the range −1 ≤ r(A,B) ≤ +1.
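A minimal sketch of this computation in plain Python; the paired attribute values are hypothetical.

```python
from math import sqrt

# Hypothetical paired observations of attributes A and B.
A = [2.0, 4.0, 6.0, 8.0, 10.0]
B = [1.0, 3.0, 5.0, 9.0, 12.0]

def pearson(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    # Population standard deviations, matching the 1/N form of the formula.
    sd_a = sqrt(sum((x - mean_a) ** 2 for x in a) / n)
    sd_b = sqrt(sum((y - mean_b) ** 2 for y in b) / n)
    cross = sum(x * y for x, y in zip(a, b))           # sum of the AB cross-product
    return (cross - n * mean_a * mean_b) / (n * sd_a * sd_b)

r = pearson(A, B)
print("r(A,B) =", round(r, 3))   # close to +1: strongly positively correlated
```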

31 of 104

Data Integration and Transformation - CO: Apply different data integration and transformation techniques - Data Integration

There are a number of issues to consider during data integration:

2. Redundancy is another important issue.

    • If rA,B is greater than 0, then A and B are positively correlated, meaning that the values of A increase as the values of B increase. The higher the value, the stronger the correlation
    • A higher value may indicate that A (or B) may be removed as a redundancy.
    • If the resulting value is equal to 0, then A and B are independent and there is no correlation between them.
    • If the resulting value is less than 0, then A and B are negatively correlated, where the values of one attribute increase as the values of the other attribute decrease.
    • This means that each attribute discourages the other.
    • Scatter plots can also be used to view correlations between attributes

32 of 104

Data Integration and Transformation - CO: Apply different data integration and transformation techniques - Data Integration

There are a number of issues to consider during data integration:

2. Redundancy is another important issue.

    • For categorical (discrete) data, a correlation relationship between two attributes, A and B, can be discovered by a χ 2 (chi-square) test.
    • Suppose A has c distinct values, namely a1,a2,...ac. B has r distinct values, namely b1,b2,...br .
    • The data tuples described by A and B can be shown as a contingency table, with the c values of A making up the columns and the r values of B making up the rows.
    • Let (Ai ,Bj) denote the event that attribute A takes on value ai and attribute B takes on value bj , that is, where (A = ai ,B = bj).
    • Each and every possible (Ai ,Bj) joint event has its own cell (or slot) in the table.
    • The χ2 value (also known as the Pearson χ2 statistic) is computed as

      χ2 = Σi Σj (oij − eij)² / eij

      where oij is the observed frequency (actual count) of the joint event (Ai, Bj) and eij is the expected frequency, eij = count(A = ai) × count(B = bj) / N.

33 of 104

Data Integration and Transformation - CO: Apply different data integration and transformation techniques - Data Integration

There are a number of issues to consider during data integration:

2. Redundancy is another important issue.

Correlation analysis of categorical attributes using χ 2 .

Suppose that a group of 1,500 people was surveyed. The gender of each person was noted. Each person was polled as to whether their preferred type of reading material was fiction or nonfiction. Thus, we have two attributes, gender and preferred reading. The observed frequency (or count) of each possible joint event is summarized in the contingency table below, where the numbers in parentheses are the expected frequencies:

                male        female       Total
fiction         250 (90)    200 (360)    450
non-fiction     50 (210)    1000 (840)   1050
Total           300         1200         1500

we can verify the expected frequencies for each cell. For example, the expected frequency for the cell (male, fiction) is

e11 = count(male)×count(fiction) / N = 300×450/ 1500 = 90

χ2 = (250−90)²/90 + (50−210)²/210 + (200−360)²/360 + (1000−840)²/840

= 284.44+121.90+71.11+30.48 = 507.93.

The χ 2 statistic tests the hypothesis that A and B are independent. The test is based on a significance level, with

(r −1)×(c−1) degrees of freedom.

34 of 104

Data Integration and Transformation - CO: Apply different data integration and transformation techniques - Data Integration

There are a number of issues to consider during data integration:

2. Redundancy is another important issue.

Correlation analysis of categorical attributes using χ 2 .

For this 2 × 2 table, the degrees of freedom are (2 − 1)(2 − 1) = 1. For 1 degree of freedom, the χ 2 value needed to reject the hypothesis at the 0.001 significance level is 10.828

Since our computed value is above this, we can reject the hypothesis that gender and preferred reading are independent

Two attributes are (strongly) correlated for the given group of people.

The χ 2 statistic tests the hypothesis that A and B are independent. The test is based on a significance level, with

(r −1)×(c−1) degrees of freedom.
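The χ2 computation for the example above can be reproduced with a short sketch in plain Python (no external statistics library); the table layout follows the reconstructed contingency table.

```python
# Observed 2x2 contingency table from the example:
# rows = preferred reading (fiction, non-fiction), columns = gender (male, female)
observed = [
    [250, 200],    # fiction
    [50, 1000],    # non-fiction
]

N = sum(sum(row) for row in observed)
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / N      # expected frequency e_ij
        chi2 += (o - e) ** 2 / e                   # (observed - expected)^2 / expected

df = (len(observed) - 1) * (len(observed[0]) - 1)  # degrees of freedom = 1
print("chi-square =", round(chi2, 2), "df =", df)  # ~507.93, well above 10.828
```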

35 of 104

  • Is gender independent of education level? A random sample of 395 people were surveyed and each person was asked to report the highest education level they obtained. The data that resulted from the survey is summarized in the following table:

Question: Are gender and education level dependent at 5% level of significance? In other words, given the data collected above, is there a relationship between the gender of an individual and the level of education that they have obtained?

Null Hypothesis: Gender and Education level are independent

Alternative Hypothesis: Gender and Education level are correlated.

36 of 104

37 of 104

Data Integration and Transformation - CO: Apply different data integration and transformation techniques - Data Integration

There are a number of issues to consider during data integration:

3. Detection and resolution of data value conflicts.

    • For the same real-world entity, attribute values from different sources may differ.
    • This may be due to differences in representation, scaling, or encoding.
    • An attribute in one system may be recorded at a lower level of abstraction than the “same” attribute in another.
    • For example, the total sales in one database may refer to one branch of All Electronics, while an attribute of the same name in another database may refer to the total sales for All Electronics stores in a given region.
    • When matching attributes from one database to another during integration, special attention must be paid to the structure of the data.

Careful integration of the data from multiple sources can help reduce and avoid redundancies and inconsistencies in the resulting data set.

This can help improve the accuracy and speed of the subsequent mining process.

38 of 104

Data Integration and Transformation - CO: Apply different data integration and transformation techniques - Data Transformation

In data transformation, the data are transformed or consolidated into forms appropriate for mining. Data transformation can involve the following:

Smoothing - which works to remove noise from the data. Such techniques include binning, regression, and clustering.

Aggregation - where summary or aggregation operations are applied to the data. For example, the daily sales data may be aggregated so as to compute monthly and annual total amounts. This step is typically used in constructing a data cube for analysis of the data at multiple granularities.

Generalization of the data - where low-level or “primitive” (raw) data are replaced by higher-level concepts through the use of concept hierarchies. For example, categorical attributes, like street, can be generalized to higher-level concepts, like city or country. Similarly, values for numerical attributes, like age, may be mapped to higher-level concepts, like youth, middle-aged, and senior.

Normalization - where the attribute data are scaled so as to fall within a small specified range, such as −1.0 to 1.0, or 0.0 to 1.0.

Attribute construction (Feature construction) - where new attributes are constructed and added from the given set of attributes to help the mining process.

There are many methods for data normalization. We study three:

    • min-max normalization
    • z-score normalization
    • normalization by decimal scaling

39 of 104

Data Integration and Transformation - CO: Apply different data integration and transformation techniques - Data Transformation

Min-max normalization:

Performs a linear transformation on the original data.

Suppose that minA and maxA are the minimum and maximum values of an attribute A.

Min-max normalization maps a value v of A to a value v′ in the range [new_minA, new_maxA] by computing:

v′ = ((v − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA

z-score normalization:

The values for an attribute A, are normalized based on the mean and standard deviation of A.

A value v of A is normalized to v′ by computing:

v′ = (v − Ā) / σA

where Ā and σA are the mean and standard deviation of A.

Normalization by decimal scaling

Normalizes by moving the decimal point of values of attribute A.

The number of decimal places moved depends on the maximum absolute value of A.

A value v of A is normalized to v′ by computing:

v′ = v / 10^j

where j is the smallest integer such that max(|v′|) < 1.

40 of 104

Data Integration and Transformation - CO: Apply different data integration and transformation techniques - Data Transformation

Min-max normalization:

Min-max normalization. Suppose that the minimum and maximum values for the attribute income are $12,000 and $98,000, respectively. We would like to map income to the range [0.0, 1.0]. By min-max normalization, a value of $73,600 for income is transformed to ((73,600 − 12,000) / (98,000 − 12,000)) × (1.0 − 0) + 0 = 0.716.

z-score normalization:

z-score normalization. Suppose that the mean and standard deviation of the values for the attribute income are $54,000 and $16,000, respectively. With z-score normalization, a value of $73,600 for income is transformed to (73,600 − 54,000) / 16,000 = 1.225.

Normalization by decimal scaling

Suppose that the recorded values of A range from −986 to 917. The maximum absolute value of A is 986. To normalize by decimal scaling, we therefore divide each value by 1,000 (i.e., j = 3) so that −986 normalizes to −0.986 and 917 normalizes to 0.917.
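A compact sketch of the three normalization methods, reproducing the income and decimal-scaling examples above; the function names are illustrative choices.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    return (v - mean_a) / std_a

def decimal_scaling(v, max_abs):
    j = 0
    while max_abs / (10 ** j) >= 1:   # smallest j with max(|v'|) < 1
        j += 1
    return v / (10 ** j)

print(round(min_max(73_600, 12_000, 98_000), 3))              # 0.716
print(round(z_score(73_600, 54_000, 16_000), 3))              # 1.225
print(decimal_scaling(-986, 986), decimal_scaling(917, 986))  # -0.986 0.917
```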

41 of 104

Data Reduction - CO: Apply different data reduction techniques

Data Reduction Strategies:

  1. Data cube aggregation - where aggregation operations are applied to the data in the construction of a data cube.
  2. Attribute subset selection /Dimension reduction - where irrelevant, weakly relevant or redundant attributes or dimensions may be detected and removed.
  3. Data compression - where encoding mechanisms are used to reduce the data set size. Examples: wavelet transforms and principal components analysis.
  4. Numerosity reduction - where the data are replaced or estimated by alternative, smaller data representations such as parametric models (which need store only the model parameters instead of the actual data) or nonparametric methods such as clustering, sampling, and the use of histograms.
  5. Discretization and concept hierarchy generation - where raw data values for attributes are replaced by ranges or higher conceptual levels.

Data Discretization is a form of numerosity reduction that is very useful for the automatic generation of concept hierarchies.

42 of 104

Data Reduction - CO: Apply different data reduction techniques

1 Data Cube Aggregation

  • Imagine that you have collected the data for your analysis.
  • These data consist of sales per quarter, for the years 2002 to 2004.
  • You are, however, interested in the annual sales (total per year), rather than the total per quarter.
  • Thus the data can be summed up so that the resulting data summarise the total sales per year instead of per quarter.
  • The resulting data set is smaller in volume, without loss of information necessary for the analysis task.

43 of 104

Data Reduction - CO: Apply different data reduction techniques

  • A database or data warehouse may store terabytes of data.
  • So it may take very long to perform data analysis and mining on such huge amounts of data.
  • Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume but still contain critical information.

Why Dimensionality Reduction?

  • Most machine learning and data mining techniques may not be effective for high-dimensional data.
  • Curse of dimensionality - dimensionality becomes a serious obstacle to the efficiency of most DM algorithms. It has been estimated that as the number of dimensions increases, the sample size needs to increase exponentially in order to obtain an effective estimate of multivariate densities.
  • Query accuracy and efficiency degrade rapidly as the dimension increases.
  • The intrinsic dimension may be small.
  • For example, the number of genes responsible for a certain type of disease may be small.
  • Visualization: projection of high-dimensional data onto 2D or 3D.
  • Data compression: efficient storage and retrieval.
  • Noise removal: positive effect on query accuracy.

44 of 104

Data Reduction - CO: Apply different data reduction techniques

2 Attribute subset selection /Dimension reduction

  • It reduces the data set size by removing irrelevant or redundant attributes.
  • To do this, methods of attribute subset selection are applied.
  • Heuristic methods of attribute subset selection are explained here:

Feature selection

  • Feature selection is a must for any data mining product.
  • That is because, when you build a data mining model, the dataset frequently contains more information than is needed to build the model.
  • For example, a dataset may contain 500 columns that describe characteristics of customers, but perhaps only 50 of those columns are used to build a particular model.
  • If you keep the unneeded columns while building the model, more CPU and memory are required during the training process, and more storage space is required for the completed model.
  • The goal is to select a minimum set of features such that the probability distribution of the different classes, given the values for those features, is as close as possible to the original distribution given the values of all features.

45 of 104

Data Reduction - CO: Apply different data reduction techniques

  • Basic heuristic methods of attribute subset selection include the following techniques, some of which are illustrated below:
  • Step-wise forward selection: The procedure starts with an empty set of attributes. The best of the original attributes is determined and added to the set. At each subsequent iteration or step, the best of the remaining original attributes is added to the set.
  • Step-wise backward elimination: The procedure starts with the full set of attributes. At each step, it removes the worst attribute remaining in the set.
  • Combination forward selection and backward elimination: The step-wise forward selection and backward elimination methods can be combined, where at each step one selects the best attribute and removes the worst from among the remaining attributes.
  • Decision tree induction: Decision tree induction constructs a flow-chart-like structure where each internal (non-leaf) node denotes a test on an attribute, each branch corresponds to an outcome of the test, and each external (leaf) node denotes a class prediction. At each node, the algorithm chooses the “best” attribute to partition the data into individual classes. When decision tree induction is used for attribute subset selection, a tree is constructed from the given data. All attributes that do not appear in the tree are assumed to be irrelevant. The set of attributes appearing in the tree form the reduced subset of attributes.
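A minimal sketch of step-wise forward selection under an assumed scoring function: the callable score(subset) is a hypothetical placeholder for any subset-quality measure (e.g., cross-validated accuracy), and the toy scoring function and attribute names are purely illustrative.

```python
def forward_selection(attributes, score, max_features=None):
    """Greedy step-wise forward selection.

    attributes   : list of candidate attribute names
    score(subset): assumed callable returning a quality measure (higher is better)
    """
    selected, remaining = [], list(attributes)
    best_score = float("-inf")
    while remaining and (max_features is None or len(selected) < max_features):
        # Try adding each remaining attribute and keep the best single addition.
        candidate, candidate_score = None, best_score
        for attr in remaining:
            s = score(selected + [attr])
            if s > candidate_score:
                candidate, candidate_score = attr, s
        if candidate is None:          # no addition improves the score: stop
            break
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = candidate_score
    return selected

# Toy scoring function: rewards the two "useful" attributes, penalizes subset size.
toy_score = lambda subset: len([a for a in subset if a in {"income", "age"}]) - 0.1 * len(subset)
print(forward_selection(["income", "age", "zip", "id"], toy_score))  # ['income', 'age']
```

Step-wise backward elimination works the same way in reverse: start from the full attribute set and repeatedly drop the attribute whose removal improves (or least harms) the score.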

46 of 104

Data Reduction - CO: Apply different data reduction techniques

47 of 104

Data Reduction - CO: Apply different data reduction techniques

3. Data compression-

  • In Data compression, data encoding or transformations are applied so as to obtain a reduced or “compressed” representation of the original data.
  • If the original data can be reconstructed from the compressed data without any loss of information, the data reduction is called lossless.
  • If, instead, we can reconstruct only an approximation of the original data, then the data reduction is called lossy.
  • Two popular and effective methods of lossy data compression:
    • wavelet transforms
    • Principal Components Analysis.

Wavelet Transforms

  • The discrete wavelet transform(DWT) is a linear signal processing technique that, when applied to a data vector X, transforms it to a numerically different vector, X’, of wavelet coefficients.
  • The two vectors are of the same length. When applying this technique to data reduction, we consider each tuple as an n-dimensional data vector, that is, X = (x1,x2, . . . ,xn), depicting n measurements made on the tuple from n database attributes

48 of 104

Haar Wavelet Transform:

The given input vector is: [ 9 7 3 5 ]

Output: applying pairwise averaging and halved differencing gives averages [8, 4] and detail coefficients [1, −1] at the first level; repeating on [8, 4] gives 6 and 2, so the transformed vector is [6, 2, 1, −1].
49 of 104

Data Reduction - CO: Apply different data reduction techniques

  • The general procedure for applying a discrete wavelet transform uses a hierarchical pyramid algorithm that halves the data at each iteration, resulting in fast computational speed. The method is as follows:

1. The length, L, of the input data vector must be an integer power of 2. This condition can be met by padding the data vector with zeros as necessary (L ≥ n).

2. Each transform involves applying two functions. The first applies some data smoothing, such as a sum or weighted average. The second performs a weighted difference, which acts to bring out the detailed features of the data.

3. The two functions are applied to pairs of data points in X, that is, to all pairs of measurements (x2i,x2i+1). This results in two sets of data of length L/2. In general, these represent a smoothed or low-frequency version of the input data and the high frequency content of it, respectively.

4. The two functions are recursively applied to the sets of data obtained in the previous loop, until the resulting data sets obtained are of length 2.

5. Selected values from the data sets obtained in the above iterations are designated the wavelet coefficients of the transformed data.

Equivalently, a matrix multiplication can be applied to the input data in order to obtain the wavelet coefficients, where the matrix used depends on the given DWT.

The matrix must be orthonormal, meaning that the columns are unit vectors and are mutually orthogonal, so that the matrix inverse is just its transpose.

  • The “fast DWT” algorithm has a complexity of O(n) for an input vector of length n.
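A small sketch of the pyramid algorithm above, using the simple pairwise averaging and halved-differencing form of the Haar DWT applied to the vector [9, 7, 3, 5] from the earlier slide; an orthonormal Haar transform would additionally scale by sqrt(2), which is an implementation choice not shown here.

```python
def haar_dwt(signal):
    """Un-normalized Haar DWT via pairwise averaging and differencing."""
    data = list(signal)                      # length assumed to be a power of 2
    output = list(signal)
    length = len(data)
    while length > 1:
        averages, details = [], []
        for i in range(0, length, 2):
            averages.append((data[i] + data[i + 1]) / 2)   # smoothing (low frequency)
            details.append((data[i] - data[i + 1]) / 2)    # detail (high frequency)
        data[:len(averages)] = averages
        output[:length] = averages + details
        length //= 2
    return output

print(haar_dwt([9, 7, 3, 5]))   # [6.0, 2.0, 1.0, -1.0]
```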

50 of 104

Data Reduction - CO: Apply different data reduction techniques

Principal Components Analysis

  • Suppose that the data to be compressed consist of N tuples or data vectors, each described by k dimensions (attributes).
  • Principal components analysis (PCA) searches for c k-dimensional orthogonal vectors that can best be used to represent the data, where c ≤ k.
  • The original data are thus projected onto a much smaller space, resulting in data compression.
  • PCA can be used as a form of dimensionality reduction. However, unlike attribute subset selection, which reduces the attribute set size by retaining a subset of the initial set of attributes, PCA “combines” the essence of attributes by creating an alternative, smaller set of variables.
  • The initial data can then be projected onto this smaller set.
  • The basic procedure is as follows:

1. The input data are normalized, so that each attribute falls within the same range. This step helps ensure that attributes with large domains will not dominate attributes with smaller domains.

2. PCA computes k orthonormal vectors which provide a basis for the normalized input data. These are unit vectors that each point in a direction perpendicular to the others. These vectors are referred to as the principal components. The input data are a linear combination of the principal components.

51 of 104

Data Reduction - CO: Apply different data reduction techniques

Principal Components Analysis

3. The principal components are sorted in order of decreasing “significance” or strength. The principal components essentially serve as a new set of axes for the data, providing important information about variance. That is, the sorted axes are such that the first axis shows the most variance among the data, the second axis shows the next highest variance, and so on. This information helps identify groups or patterns within the data.

4. Since the components are sorted in decreasing order of “significance,” the size of the data can be reduced by eliminating the weaker components, i.e., those with low variance. Using the strongest principal components, it should be possible to reconstruct a good approximation of the original data.

  • PCA can be applied to ordered and unordered attributes, and can handle sparse data and skewed data.
  • Multidimensional data of more than two dimensions can be handled by reducing the problem to two dimensions.
  • For example, a 3-D data cube for sales with the dimensions item type, branch, and year must first be reduced to a 2-D cube, such as with the dimensions item type, and branch * year.
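A minimal NumPy sketch of the four steps above: normalize (center), compute orthonormal components from the covariance matrix, sort them by variance, and project onto the strongest c components. The small data matrix is hypothetical.

```python
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
              [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1]])  # N tuples x k attributes

# 1. Normalize so that no attribute dominates (here: center to zero mean).
Xc = X - X.mean(axis=0)

# 2. Orthonormal vectors (principal components) from the covariance matrix.
cov = np.cov(Xc, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)       # columns are unit eigenvectors

# 3. Sort components by decreasing variance (eigenvalue).
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# 4. Keep the strongest c components and project the data onto them.
c = 1
reduced = Xc @ eigenvectors[:, :c]                    # N x c reduced representation
print(eigenvalues)        # variance explained by each component
print(reduced[:3])        # first three tuples in the reduced space
```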

52 of 104

Data Reduction - CO: Apply different data reduction techniques

4. Numerosity reduction

  • “Can we reduce the data volume by choosing alternative, ‘smaller’ forms of data representation?” Techniques of numerosity reduction can indeed be applied for this purpose.
  • These techniques may be parametric or non-parametric.
  • For parametric methods, a model is used to estimate the data, so that typically only the data parameters need be stored, instead of the actual data. (Outliers may also be stored). Log-linear models, which estimate discrete multidimensional probability distributions, are an example.
  • Non-parametric methods for storing reduced representations of the data include histograms, clustering, and sampling.

Regression and log-linear models:

  • Regression and log-linear models can be used to approximate the given data.
  • In linear regression, the data are modelled to fit a straight line.
  • For example, a random variable, Y (called a response variable), can be modelled as a linear function of another random variable, X (called a predictor variable), with the equation Y = α + βX,

53 of 104

Data Reduction - CO: Apply different data reduction techniques

  • where the variance of Y is assumed to be constant. The coefficients α and β (called regression coefficients) specify the Y-intercept and slope of the line, respectively.
  • These coefficients can be solved for by the method of least squares, which minimizes the error between the actual line separating the data and the estimate of the line.
  • Multiple regression is an extension of linear regression allowing a response variable Y to be modeled as a linear function of a multidimensional feature vector.
  • Log-linear models approximate discrete multidimensional probability distributions.
  • The method can be used to estimate the probability of each cell in a base cuboid for a set of discretized attributes, based on the smaller cuboids making up the data cube lattice.
  • This allows higher order data cubes to be constructed from lower order ones.
  • Log-linear models are therefore also useful for data compression (since the smaller order cuboids together typically occupy less space than the base cuboid) and data smoothing (since cell estimates in the smaller order cuboids are less subject to sampling variations than cell estimates in the base cuboid).

54 of 104

Data Reduction - CO: Apply different data reduction techniques

Histograms

  • Histograms use binning to approximate data distributions and are a popular form of data reduction.
  • A histogram for an attribute A partitions the data distribution of A into disjoint subsets, or buckets.
  • The buckets are displayed on a horizontal axis, while the height (and area) of a bucket typically reflects the average frequency of the values represented by the bucket.
  • If each bucket represents only a single attribute-value/frequency pair, the buckets are called singleton buckets. Often, buckets instead represent continuous ranges for the given attribute.

Example: The following data are a list of prices of commonly sold items at AllElectronics (rounded to the nearest dollar). The numbers have been sorted:

1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30.

55 of 104

Data Reduction - CO: Apply different data reduction techniques

How are the buckets determined and the attribute values partitioned?

There are several partitioning rules, including the following:

1. Equi-width: In an equi-width histogram, the width of each bucket range is constant (such as a width of $10 for each bucket in the price example above).

2. Equi-depth (or equi-height): In an equi-depth histogram, the buckets are created so that, roughly, the frequency of each bucket is constant (that is, each bucket contains roughly the same number of contiguous data samples).

3. V-Optimal: If we consider all of the possible histograms for a given number of buckets, the V-optimal histogram is the one with the least variance. Histogram variance is a weighted sum of the original values that each bucket represents, where bucket weight is equal to the number of values in the bucket.

4. MaxDiff: In a MaxDiff histogram, we consider the difference between each pair of adjacent values. A bucket boundary is established between each pair for pairs having the β -1 largest differences, where β is user-specified.

  • V-Optimal and MaxDiff histograms tend to be the most accurate and practical.
  • Histograms are highly effective at approximating both sparse and dense data, as well as highly skewed, and uniform data.
  • The histograms described above for single attributes can be extended for multiple attributes. Multidimensional histograms can capture dependencies between attributes.
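A short sketch of equi-width and equi-depth bucketing applied to the price list above; the bucket width and the number of buckets are illustrative choices.

```python
from collections import Counter

prices = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15,
          15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20,
          21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30]

# Equi-width buckets of width $10: 1-10, 11-20, 21-30.
width = 10
equi_width = Counter((p - 1) // width for p in prices)
print({f"${b*width + 1}-${(b + 1)*width}": n for b, n in sorted(equi_width.items())})

# Equi-depth buckets: roughly the same number of values per bucket.
def equi_depth(values, n_buckets):
    values = sorted(values)
    size = len(values) // n_buckets
    return [values[i*size: (i + 1)*size if i < n_buckets - 1 else len(values)]
            for i in range(n_buckets)]

for bucket in equi_depth(prices, 3):
    print(len(bucket), "values spanning", bucket[0], "to", bucket[-1])
```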

56 of 104

Data Reduction - CO: Apply different data reduction techniques

Clustering

  • Clustering techniques consider data tuples as objects.
  • They partition the objects into groups, or clusters, so that objects within a cluster are “similar” to one another and “dissimilar” to objects in other clusters.
  • Similarity is commonly defined in terms of how “close” the objects are in space, based on a distance function.
  • The “quality” of a cluster may be represented by its diameter, the maximum distance between any two objects in the cluster.
  • Centroid distance is an alternative measure of cluster quality, and is defined as the average distance of each cluster object from the cluster centroid (denoting the “average object,” or average point in space for the cluster).

57 of 104

Data Reduction - CO: Apply different data reduction techniques

Sampling:

  • Sampling can be used as a data reduction technique since it allows a large data set to be represented by a much smaller random sample (or subset) of the data.
  • Suppose that a large data set, D, contains N tuples. Let's have a look at some possible samples for D.

1. Simple random sample without replacement (SRSWOR) of size n: This is created by drawing n of the N tuples from D (n < N), where the probability of drawing any tuple in D is 1/N, i.e., all tuples are equally likely.

2. Simple random sample with replacement (SRSWR) of size n: This is similar to SRSWOR, except that each time a tuple is drawn from D, it is recorded and then replaced. That is, after a tuple is drawn, it is placed back in D so that it may be drawn again.

3. Cluster sample: If the tuples in D are grouped into M mutually disjoint “clusters,” then a simple random sample (SRS) of m clusters can be obtained, where m < M. For example, tuples in a database are usually retrieved a page at a time, so that each page can be considered a cluster. A reduced data representation can be obtained by applying, say, SRSWOR to the pages, resulting in a cluster sample of the tuples.

4. Stratified sample: If D is divided into mutually disjoint parts called “strata,” a stratified sample of D is generated by obtaining an SRS at each stratum. This helps to ensure a representative sample, especially when the data are skewed. For example, a stratified sample may be obtained from customer data, where a stratum is created for each customer age group. In this way, the age group having the smallest number of customers will be sure to be represented.
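A minimal sketch of SRSWOR, SRSWR, and stratified sampling using only the standard library; the dataset of tuples tagged with an age-group stratum is hypothetical, and cluster sampling (item 3) would simply group the tuples into pages first and sample the pages.

```python
import random

random.seed(1)
D = [{"id": i, "age_group": g} for i, g in enumerate(["youth", "middle", "senior"] * 10)]
n = 6

# 1. SRSWOR: draw n tuples without replacement.
srswor = random.sample(D, n)

# 2. SRSWR: draw n tuples with replacement (a tuple may be drawn more than once).
srswr = [random.choice(D) for _ in range(n)]

# 4. Stratified sample: an SRS within each stratum (here 2 tuples per age group).
strata = {}
for t in D:
    strata.setdefault(t["age_group"], []).append(t)
stratified = [t for group in strata.values() for t in random.sample(group, 2)]

print(len(srswor), len(srswr), len(stratified))
```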

58 of 104

Data Reduction - CO: Apply different data reduction techniques

59 of 104

Data Discretization and Concept Hierarchy Generation

  • Data discretization techniques can be used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals.
  • Interval labels can then be used to replace actual data values.
  • Replacing numerous values of a continuous attribute by a small number of interval labels thereby reduces and simplifies the original data.
  • This leads to a concise, easy-to-use, knowledge-level representation of mining results.
  • Discretization techniques can be categorized based on how the discretization is performed, such as whether it uses class information or which direction it proceeds (i.e., top-down vs. bottom-up).
  • If the discretization process uses class information, then we say it is supervised discretization.
  • Otherwise, it is unsupervised.
  • If the process starts by first finding one or a few points (called split points or cut points) to split the entire attribute range, and then repeats this recursively on the resulting intervals, it is called top-down discretization or splitting.
  • This contrasts with bottom-up discretization or merging, which starts by considering all of the continuous values as potential split-points, removes some by merging neighborhood values to form intervals, and then recursively applies this process to the resulting intervals.

60 of 104

Data Discretization and Concept Hierarchy Generation

  • Discretization can be performed recursively on an attribute to provide a hierarchical or multiresolution partitioning of the attribute values, known as a concept hierarchy.
  • Concept hierarchies are useful for mining at multiple levels of abstraction.
  • A concept hierarchy for a given numerical attribute defines a discretization of the attribute.
  • Concept hierarchies can be used to reduce the data by collecting and replacing low-level concepts (such as numerical values for the attribute age) with higher-level concepts (such as youth, middle-aged, or senior).
  • Although detail is lost by such data generalization, the generalized data may be more meaningful and easier to interpret.
  • This contributes to a consistent representation of data mining results among multiple mining tasks, which is a common requirement.
  • Concept hierarchies for numerical attributes can be constructed automatically based on data discretization.
  • We examine the following methods: binning, histogram analysis, entropy-based discretization, χ²-merging, cluster analysis, and discretization by intuitive partitioning.
  • In general, each method assumes that the values to be discretized are sorted in ascending order.

61 of 104

Data Discretization and Concept Hierarchy Generation

Entropy-Based Discretization:

  • Entropy is one of the most commonly used discretization measures.
  • It was first introduced by Claude Shannon in pioneering work on information theory and the concept of information gain.

  • Entropy-based discretization is a supervised, top-down splitting technique.
  • It explores class distribution information in its calculation and determination of split-points (data values for partitioning an attribute range).
  • To discretize a numerical attribute, A, the method selects the value of A that has the minimum entropy as a split-point, and recursively partitions the resulting intervals to arrive at a hierarchical discretization.

  • Such discretization forms a concept hierarchy for A.
  • Let D consist of data tuples defined by a set of attributes and a class-label attribute.
  • The class-label attribute provides the class information per tuple.

62 of 104

Data Discretization and Concept Hierarchy Generation

Entropy-Based Discretization:

The basic method for entropy-based discretization of an attribute A within the set is as follows:

1. Each value of A can be considered as a potential interval boundary or split-point (denoted split point) to partition the range of A. That is, a split-point for A can partition the tuples in D into two subsets satisfying the conditions A ≤ split point and A > split point, respectively, thereby creating a binary discretization.

2. Entropy-based discretization, as mentioned above, uses information regarding the class label of tuples. Suppose we want to classify the tuples in D by partitioning on attribute A and some split-point. Ideally, we would like this partitioning to result in an exact classification of the tuples. For example, if we had two classes, we would hope that all of the tuples of, say, class C1 will fall into one partition, and all of the tuples of class C2 will fall into the other partition. However, this is unlikely. For example, the first partition may contain many tuples of C1, but also some of C2. How much more information would we still need for a perfect classification, after this partitioning? This amount is called the expected information requirement for classifying a tuple in D based on partitioning by A. It is given by

InfoA(D) = (|D1| / |D|) × Entropy(D1) + (|D2| / |D|) × Entropy(D2)

63 of 104

Data Discretization and Concept Hierarchy Generation

Entropy-Based Discretization:

The basic method for entropy-based discretization of an attribute A within the set is as follows:

  • where D1 and D2 correspond to the tuples in D satisfying the conditions A ≤ split point and A > split point, respectively; |D| is the number of tuples in D, and so on. The entropy function for a given set is calculated based on the class distribution of the tuples in the set. For example, given m classes, C1, C2, ..., Cm, the entropy of D1 is

Entropy(D1) = − Σ (i = 1 to m) pi log2(pi)

  • where pi is the probability of class Ci in D1, determined by dividing the number of tuples of class Ci in D1 by |D1|, the total number of tuples in D1. Therefore, when selecting a split-point for attribute A, we want to pick the attribute value that gives the minimum expected information requirement (i.e., min(InfoA(D))). This would result in the minimum amount of expected information (still) required to perfectly classify the tuples after partitioning by A ≤ split point and A > split point. This is equivalent to the attribute-value pair with the maximum information gain.
  • We use the split-point to partition the range of A into two intervals, corresponding to A ≤ split point and A > split point.

64 of 104

Data Discretization and Concept Hierarchy Generation

Entropy-Based Discretization:

3. The process of determining a split-point is recursively applied to each partition obtained, until some stopping criterion is met, such as when the minimum information requirement on all candidate split-points is less than a small threshold, ε, or when the number of intervals is greater than a threshold, max_interval.

Entropy-based discretization can reduce data size.
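A minimal Python sketch of the procedure described in steps 1-3, assuming the data are given as (value, class_label) pairs: every distinct value except the largest is tried as a candidate split-point, InfoA(D) is computed from the class entropies of the two resulting subsets, the split with the minimum expected information requirement is chosen, and the process recurses. The gain-based stopping threshold used here is one simple assumed variant of the stopping criteria mentioned in step 3; all function and variable names are illustrative.

import math

def entropy(labels):
    """Entropy(D) = -sum over classes of p_i * log2(p_i)."""
    n = len(labels)
    counts = {}
    for c in labels:
        counts[c] = counts.get(c, 0) + 1
    return -sum((k / n) * math.log2(k / n) for k in counts.values())

def info_after_split(data, split):
    """InfoA(D) = |D1|/|D| * Entropy(D1) + |D2|/|D| * Entropy(D2)."""
    d1 = [c for v, c in data if v <= split]
    d2 = [c for v, c in data if v > split]
    n = len(data)
    return (len(d1) / n) * entropy(d1) + (len(d2) / n) * entropy(d2)

def discretize(data, epsilon=0.1, splits=None):
    """Supervised, top-down, entropy-based discretization (sketch)."""
    if splits is None:
        splits = []
    values = sorted({v for v, _ in data})
    if len(values) < 2:
        return splits
    candidates = values[:-1]            # every distinct value but the largest
    best = min(candidates, key=lambda s: info_after_split(data, s))
    gain = entropy([c for _, c in data]) - info_after_split(data, best)
    if gain < epsilon:                  # assumed stopping criterion (gain threshold)
        return splits
    splits.append(best)
    discretize([t for t in data if t[0] <= best], epsilon, splits)
    discretize([t for t in data if t[0] > best], epsilon, splits)
    return sorted(splits)

data = [(21, "no"), (25, "no"), (30, "yes"), (35, "yes"), (45, "yes"), (50, "no")]
print(discretize(data))   # split-points that minimize InfoA(D) at each step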

65 of 104

Data Discretization and Concept Hierarchy Generation

Segmentation by natural partitioning

  • Although binning, histogram analysis, clustering, and entropy-based discretization are useful in the generation of numerical hierarchies, many users would like to see numerical ranges partitioned into relatively uniform, easy-to-read intervals that appear intuitive or "natural".
  • For example, annual salaries broken into ranges like [$50,000, $60,000) are often more desirable than ranges like [$51,263.98, $60,872.34), obtained by some sophisticated clustering analysis.
  • The 3-4-5 rule can be used to segment numeric data into relatively uniform, "natural" intervals.
  • In general, the rule partitions a given range of data into either 3, 4, or 5 relatively equi-length intervals, recursively and level by level, based on the value range at the most significant digit.
  • The rule is as follows:

(a) If an interval covers 3, 6, 7 or 9 distinct values at the most significant digit, then partition the range into 3 intervals (3 equi-width intervals for 3, 6, 9, and three intervals in the grouping of 2-3-2 for 7);

(b) if it covers 2, 4, or 8 distinct values at the most significant digit, then partition the range into 4 equi-width intervals; and

(c) if it covers 1, 5, or 10 distinct values at the most significant digit, then partition the range into 5 equi-width intervals.

66 of 104

Data Discretization and Concept Hierarchy Generation

Segmentation by natural partitioning

  • The rule can be recursively applied to each interval, creating a concept hierarchy for the given numeric attribute.
  • Since there could be some dramatically large positive or negative values in a data set, the top-level segmentation, based merely on the minimum and maximum values, may produce distorted results.
  • For example, the assets of a few people could be several orders of magnitude higher than those of others in the data set. Segmentation based on the maximal asset values may lead to a highly biased hierarchy.
  • Thus the top-level segmentation can be performed based on the range of data values representing the majority (e.g., the 5th percentile to the 95th percentile) of the given data.
  • The extremely high or low values beyond the top-level segmentation form distinct interval(s), which can be handled separately, but in a similar manner.

The following example illustrates the use of the 3-4-5 rule for the automatic construction of a numeric hierarchy:

67 of 104

Data Discretization and Concept Hierarchy Generation - Segmentation by natural partitioning: Example of the 3-4-5 rule (figure; the steps are worked through on the next slide)

68 of 104

Data Discretization and Concept Hierarchy Generation

Segmentation by natural partitioning

  • Step 1 – Min = -$351,976, Max = $4,700,896, low (5th percentile) = -$159,876, high (95th percentile) = $1,838,761
  • Step 2 – For low and high, the most significant digit position is $1,000,000; rounding low down gives -$1,000,000 and rounding high up gives $2,000,000
  • Step 3 – The interval ranges over 3 distinct values at the most significant digit ((2,000,000 - (-1,000,000)) / 1,000,000 = 3), so by the 3-4-5 rule partition into 3 equi-width intervals: (-$1,000,000 to $0], ($0 to $1,000,000], and ($1,000,000 to $2,000,000]
  • Step 4 – Examine Min and Max to see how they “fit” into the first-level partitions: the first partition covers Min, so its left boundary is adjusted to make the partition smaller (rounding Min down gives -$400,000); the last partition does not cover Max, so a new partition is created by rounding Max up to the next significant digit, giving ($2,000,000 to $5,000,000]
  • Step 5 – Recursively, each interval can be further partitioned using the 3-4-5 rule to form the next lower level of the hierarchy (a simplified sketch of the top-level step follows)
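A minimal Python sketch of the top-level step just walked through, assuming low and high have already been rounded at the most significant digit (as in Step 2): it counts the distinct values at that digit and partitions the range into 3, 4, or 5 equi-width intervals, using the 2-3-2 grouping for 7. The boundary adjustment of Step 4 and the recursion of Step 5 are omitted here, so this is a sketch of one stage of the rule, not a full implementation.

import math

def msd_width(low, high):
    """Width of the most significant digit position of the range."""
    return 10 ** int(math.floor(math.log10(high - low)))

def three_four_five(low, high):
    """Top-level 3-4-5 partitioning of the range, per the rule above."""
    width = msd_width(low, high)
    distinct = round((high - low) / width)     # distinct values at the msd
    if distinct == 7:                          # 2-3-2 grouping
        return [(low, low + 2 * width),
                (low + 2 * width, low + 5 * width),
                (low + 5 * width, high)]
    if distinct in (3, 6, 9):
        parts = 3
    elif distinct in (2, 4, 8):
        parts = 4
    else:                                      # 1, 5, or 10 distinct values
        parts = 5
    step = (high - low) / parts
    return [(low + i * step, low + (i + 1) * step) for i in range(parts)]

# Worked example from Step 3 above (values already rounded at the msd):
print(three_four_five(-1_000_000, 2_000_000))
# [(-1000000.0, 0.0), (0.0, 1000000.0), (1000000.0, 2000000.0)]

On the example range the function reproduces the three intervals listed in Step 3.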

69 of 104

Data Discretization and Concept Hierarchy Generation

Concept hierarchy generation for categorical data

  • Specification of a partial ordering of attributes explicitly at the schema level by users or experts
    • Example : a relational database may contain: street, city, province_or_state, country
    • Expert defines ordering of hierarchy such as street < city < province_or_state < country
  • Specification of a portion of a hierarchy by explicit data grouping
    • Example : for province_or_state and country, {Alberta, Saskatchewan, Manitoba} ⊂ prairies_Canada and {British Columbia, prairies_Canada} ⊂ Western_Canada
  • Specification of a set of attributes, but not of their partial ordering
    • Auto-generate the attribute ordering based upon the observation that an attribute defining a high-level concept usually has a smaller number of distinct values than an attribute defining a lower-level concept
    • Example : country (15), province_or_state (365), city (3567), street (674,339)
  • Specification of only a partial set of attributes
    • Try to parse the database schema to determine the complete hierarchy

70 of 104

Data Discretization and Concept Hierarchy Generation

Specification of a set of attributes

  • Concept hierarchy can be automatically generated based on the number of distinct values per attribute in the given attribute set.
  • The attribute with the most distinct values is placed at the lowest level of the hierarchy; a minimal sketch of this heuristic follows.
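A minimal Python sketch of that heuristic, using the distinct-value counts from the previous slide: attributes are sorted by their counts, so the attribute with the fewest distinct values ends up at the top (most general) level and the one with the most at the bottom.

# Fewer distinct values -> higher (more general) level of the hierarchy.
distinct_counts = {
    "country": 15,
    "province_or_state": 365,
    "city": 3567,
    "street": 674_339,
}

levels = sorted(distinct_counts, key=distinct_counts.get)   # general -> specific
print(" < ".join(reversed(levels)))
# street < city < province_or_state < country

Note that this heuristic is not foolproof: for example, weekday has only 7 distinct values, yet it is not more general than month.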

71 of 104

Data Mining Primitives

  • Each user will have a data mining task in mind, that is, some form of data analysis that he or she would like to have performed.
  • A data mining task can be specified in the form of a data mining query, which is input to the data mining system.
  • A data mining query is defined in terms of data mining task primitives.
  • These primitives allow the user to interactively communicate with the data mining system during discovery in order to direct the mining process, or examine the findings from different angles or depths.
  • The data mining primitives specify the following:

72 of 104

  • The set of task-relevant data to be mined: This specifies the portions of the database or the set of data in which the user is interested. This includes the database attributes or data warehouse dimensions of interest (referred to as the relevant attributes or dimensions).
  • The kind of knowledge to be mined: This specifies the data mining functions to be performed, such as characterization, discrimination, association or correlation analysis, classification, prediction, clustering, outlier analysis, or evolution analysis.
  • The background knowledge to be used in the discovery process: This knowledge about the domain to be mined is useful for guiding the knowledge discovery process and for evaluating the patterns found. Concept hierarchies are a popular form of background knowledge, which allow data to be mined at multiple levels of abstraction. A concept hierarchy for the attribute (or dimension) age is a typical example. User beliefs regarding relationships in the data are another form of background knowledge.

73 of 104

  • The interestingness measures and thresholds for pattern evaluation: They may be used to guide the mining process or, after discovery, to evaluate the discovered patterns. Different kinds of knowledge may have different interestingness measures. For example, interestingness measures for association rules include support and confidence. Rules whose support and confidence values are below user-specified thresholds are considered uninteresting.
  • The expected representation for visualizing the discovered patterns: This refers to the form in which discovered patterns are to be displayed, which may include rules, tables, charts, graphs, decision trees, and cubes. (A compact sketch of all five primitives follows.)
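To make the five primitives concrete, a mining task could first be captured as a simple structure before being translated into a query language such as DMQL. The field names and example values in this Python sketch are illustrative assumptions that loosely mirror the AllElectronics scenario used in the next section.

# Illustrative sketch: the five data mining task primitives captured as a
# plain Python structure (field names and values are assumptions).
mining_query = {
    "task_relevant_data": {
        "database": "AllElectronics_db",
        "relations": ["customer", "item", "transaction"],
        "attributes": ["age", "income", "type", "place_made", "branch"],
    },
    "kind_of_knowledge": "classification",
    "background_knowledge": ["location_hierarchy", "age_hierarchy"],
    "interestingness_measures": {"min_support": 0.05, "min_confidence": 0.7},
    "presentation": "rules",
}

for primitive, value in mining_query.items():
    print(f"{primitive}: {value}")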

74 of 104

75 of 104

Data Mining Query Language

  • A data mining query language can be designed to incorporate these primitives, allowing users to flexibly interact with data mining systems.
  • Having a data mining query language provides a foundation on which user-friendly graphical interfaces can be built.
  • This facilitates a data mining system’s communication with other information systems and its integration with the overall information processing environment.
  • Designing a comprehensive data mining language is challenging because data mining covers a wide spectrum of tasks, from data characterization to evolution analysis.
  • Each task has different requirements.
  • The design of an effective data mining query language requires a deep understanding of the power, limitation, and underlying mechanisms of the various kinds of data mining tasks.

76 of 104

Data Mining Query Language

  • There are several proposals on data mining languages and standards.
  • We use a data mining query language known as DMQL (Data Mining Query Language), which was designed as a teaching tool based on the above primitives.
  • The language adopts an SQL-like syntax, so that it can easily be integrated with the relational query language, SQL.
  • Let’s look at how it can be used to specify a data mining task.

Mining classification rules: Suppose, as a marketing manager of AllElectronics, you would like to classify customers based on their buying patterns. You are especially interested in those customers whose salary is no less than $40,000, and who have bought more than $1,000 worth of items, each of which is priced at no less than $100. In particular, you are interested in the customer’s age, income, the types of items purchased, the purchase location, and where the items were made. You would like to view the resulting classification in the form of rules.

77 of 104

Data Mining Query Language

  • This data mining query is expressed in DMQL as follows:
  • use database AllElectronics_db
  • use hierarchy location_hierarchy for T.branch, age_hierarchy for C.age
  • mine classification as promising_customers
  • in relevance to C.age, C.income, I.type, I.place_made, T.branch
  • from customer C, item I, transaction T
  • where I.item_ID = T.item_ID and C.cust_ID = T.cust_ID and C.income ≥ 40,000 and I.price ≥ 100
  • group by T.cust_ID
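The scenario above also asks that only customers who have bought more than $1,000 worth of items be considered, and that the result be displayed as rules. In DMQL these would typically be expressed by additional clauses following group by, such as a having condition on sum(I.price) and a display as rules directive; the exact wording of those clauses is not shown here and should be taken as an assumption.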

78 of 104

Concepts Description

79 of 104

Data Generalization and Summarization based Characterization

80 of 104

Characterization –Data Cube Approach

81 of 104

Characterization –Attribute Oriented Induction

82 of 104

Characterization –Attribute Oriented Induction

83 of 104

Characterization –Attribute Oriented Induction

84 of 104

Characterization –Attribute Oriented Induction

85 of 104

Analytical characterization: Analysis of attribute relevance

86 of 104

Analytical characterization: Analysis of attribute relevance

87 of 104

Analytical characterization: Analysis of attribute relevance

88 of 104

Analytical characterization: Analysis of attribute relevance

89 of 104

Analytical characterization: Analysis of attribute relevance

90 of 104

Analytical characterization: Analysis of attribute relevance

91 of 104

Analytical characterization: Analysis of attribute relevance

92 of 104

Analytical characterization: Analysis of attribute relevance

93 of 104

Analytical characterization: Analysis of attribute relevance

94 of 104

Analytical characterization: Analysis of attribute relevance

95 of 104

Analytical characterization: Analysis of attribute relevance

96 of 104

Analytical characterization: Analysis of attribute relevance

97 of 104

Analytical characterization: Analysis of attribute relevance

98 of 104

Mining Class Comparisons

99 of 104

Mining Class Comparisons

100 of 104

Mining Class Comparisons

101 of 104

Mining Class Comparisons

102 of 104

Mining Class Comparisons

103 of 104

Mining Class Comparisons

104 of 104

THANK YOU