1 of 47

DATA PREPROCESSING

Data Mining

Unit - II

2 of 47

AGGREGATION

  • “less is more”
  • Aggregation - combining of two or more objects into a single object.
  • Example:
    • One way to aggregate transactions for this data set is to replace all the transactions of a single store with a single storewide transaction.
    • This reduces the number of records (one record per store).
  • How an aggregate transaction is created
    • Quantitative attributes, such as price, are typically aggregated by taking a sum or an average.
    • A qualitative attribute, such as item, can either be omitted or summarized as the set of all the items that were sold at that location.
  • Can also be viewed as a multidimensional array, where each attribute is a dimension.
    • Used in OLAP
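
A minimal sketch of the storewide aggregation described above, assuming each transaction is a record with hypothetical store, item, and price fields:

```python
from collections import defaultdict

# Hypothetical transaction records: one dict per item sold (field names are assumptions).
transactions = [
    {"store": "S1", "item": "bread", "price": 2.50},
    {"store": "S1", "item": "milk",  "price": 1.20},
    {"store": "S2", "item": "soap",  "price": 3.00},
]

# One aggregate (storewide) transaction per store: the quantitative price is summed,
# the qualitative item attribute is summarized as the set of items sold.
storewide = defaultdict(lambda: {"total_price": 0.0, "items": set()})
for t in transactions:
    agg = storewide[t["store"]]
    agg["total_price"] += t["price"]
    agg["items"].add(t["item"])

for store, agg in storewide.items():
    print(store, round(agg["total_price"], 2), sorted(agg["items"]))
```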

3 of 47

AGGREGATION

  • Motivations for aggregation
    • Smaller data sets require less memory and processing time, which allows the use of more expensive data mining algorithms.
    • Change of scope or scale
      • aggregation provides a high-level view of the data instead of a low-level view.
    • Behavior of groups of objects or attributes is often more stable than that of individual objects or attributes.
  • Disadvantage of aggregation
    • potential loss of interesting details.

4 of 47

AGGREGATION

Example: the average yearly precipitation has less variability than the average monthly precipitation, illustrating the greater stability of aggregated values.

5 of 47

SAMPLING

  • Approach for selecting a subset of the data objects to be analyzed.
  • Data miners sample because it is too expensive or time consuming to process all the data.
  • The key principle for effective sampling is the following:
    • Using a sample will work almost as well as using the entire data set if the sample is representative.
      • A sample is representative if it has approximately the same property (of interest) as the original set of data.
    • Choose a sampling scheme/technique that gives a high probability of obtaining a representative sample.

6 of 47

SAMPLING

  • Sampling Approaches:
    • Simple random sampling
      • equal probability of selecting any particular item.
      • Two variations on random sampling:
        • (1) sampling without replacement—as each item is selected, it is removed from the set of all objects that together constitute the population, and
        • (2) sampling with replacement—objects are not removed from the population as they are selected for the sample.
      • Problem: When the population consists of different types of objects, with widely different numbers of objects, simple random sampling can fail to adequately represent those types of objects that are less frequent.
    • Stratified sampling:
      • starts with prespecified groups of objects
      • Simpler version: equal numbers of objects are drawn from each group, even though the groups are of different sizes.
      • Other version: the number of objects drawn from each group is proportional to the size of that group.
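
A small sketch of these sampling approaches using only the Python standard library; the group labels and sizes are made-up values for illustration:

```python
import random
from collections import defaultdict

# Hypothetical population with two groups of very different sizes.
population = [("rare", i) for i in range(10)] + [("common", i) for i in range(990)]

# Simple random sampling.
without_replacement = random.sample(population, k=50)   # items removed once selected
with_replacement = random.choices(population, k=50)     # items can be selected repeatedly

# Stratified sampling: draw from each prespecified group separately.
groups = defaultdict(list)
for label, obj in population:
    groups[label].append((label, obj))

equal_counts = [x for g in groups.values() for x in random.sample(g, k=5)]
proportional = [x for g in groups.values()
                for x in random.sample(g, k=max(1, round(0.05 * len(g))))]
```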

7 of 47

SAMPLING

Sampling and Loss of Information

  • Larger sample sizes increase the probability that a sample will be representative, but they also eliminate much of the advantage of sampling.
  • Conversely, with smaller sample sizes, patterns may be missed or erroneous patterns can be detected.

8 of 47

SAMPLING

Determining the Proper Sample Size

  • Example: the data consists of 10 groups (clusters), and the desired outcome is that at least one point is obtained from each group.
  • The probability of getting one object from each of the 10 groups increases as the sample size increases from 10 to 60.
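
A quick Monte Carlo sketch of this effect; the assumption of 10 equal-sized groups and sampling with replacement is mine, made only to show how the probability grows with sample size:

```python
import random

def prob_all_groups_hit(sample_size, n_groups=10, trials=10_000):
    """Estimate P(a random sample contains at least one object from every group)."""
    hits = 0
    for _ in range(trials):
        # Assumes equal-sized groups, so each draw is equally likely to hit any group.
        drawn = {random.randrange(n_groups) for _ in range(sample_size)}
        hits += (len(drawn) == n_groups)
    return hits / trials

for size in (10, 20, 30, 40, 50, 60):
    print(size, prob_all_groups_hit(size))
```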

9 of 47

SAMPLING

  • Adaptive/Progressive Sampling:
    • Proper sample size - Difficult to determine
    • Start with a small sample, and then increase the sample size until a sample of sufficient size has been obtained.
    • Eliminates the need to determine the correct sample size in advance
    • Stop increasing the sample size at the leveling-off point (where no further improvement in the outcome is observed).
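
A hedged sketch of such a progressive sampling loop; evaluate() is a hypothetical placeholder for whatever outcome measure (e.g., accuracy of a model built on the sample) the task uses:

```python
import random

def progressive_sample(data, evaluate, start=100, growth=2.0, tol=1e-3):
    """Grow the sample until the evaluation score levels off."""
    size = start
    prev_score = None
    while size <= len(data):
        sample = random.sample(data, size)
        score = evaluate(sample)                  # quality obtained from this sample
        if prev_score is not None and score - prev_score < tol:
            return sample                         # leveling-off point reached
        prev_score = score
        size = int(size * growth)                 # increase the sample size
    return list(data)                             # fall back to the full data set
```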

10 of 47

DIMENSIONALITY REDUCTION

  • Data sets can have a large number of features.
  • Example
    • a set of documents, where each document is represented by a vector whose components are the frequencies with which each word occurs in the document.
    • thousands or tens of thousands of attributes (components), one for each word in the vocabulary.

11 of 47

DIMENSIONALITY REDUCTION

  • Benefits to dimensionality reduction.
    • Many data mining algorithms work better if the dimensionality is lower.
      • It eliminates irrelevant features and reduces noise
    • Lead to a more understandable model
      • fewer attributes
    • Allow the data to be more easily visualized.
    • Amount of time and memory required by the data mining algorithm is reduced with a reduction in dimensionality.
  • Reduce the dimensionality of a data set by creating new attributes that are a combination of the old attributes.
  • Feature subset selection or feature selection:
    • The reduction of dimensionality by selecting attributes that are a subset of the old ones.

12 of 47

DIMENSIONALITY REDUCTION

  • The Curse of Dimensionality
    • Data analysis becomes significantly harder as the dimensionality of the data increases.
    • data becomes increasingly sparse
  • Classification
    • there are not enough data objects to allow the creation of a model that reliably assigns a class to all possible objects.
  • Clustering
    • the definitions of density and the distance between points become less meaningful

13 of 47

DIMENSIONALITY REDUCTION

  • Linear Algebra Techniques for Dimensionality Reduction
    • Principal Components Analysis (PCA)
      • for continuous attributes
      • finds new attributes (principal components) that
        • (1) are linear combinations of the original attributes,
        • (2) are orthogonal (perpendicular) to each other, and
        • (3) capture the maximum amount of variation in the data.
    • Singular Value Decomposition (SVD)
      • Related to PCA
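
A minimal PCA sketch with NumPy, shown only to make the three properties above concrete (center the data, find orthogonal directions, keep those capturing the most variation); it is a sketch, not a substitute for a library implementation:

```python
import numpy as np

def pca(X, n_components=2):
    """Project X (n_samples x n_features) onto its top principal components."""
    X_centered = X - X.mean(axis=0)                 # work with centered attributes
    cov = np.cov(X_centered, rowvar=False)          # covariance of the attributes
    eigvals, eigvecs = np.linalg.eigh(cov)          # eigenvectors are orthogonal
    order = np.argsort(eigvals)[::-1]               # sort by decreasing variance
    components = eigvecs[:, order[:n_components]]   # linear combinations of old attributes
    return X_centered @ components                  # new, lower-dimensional representation

X = np.random.rand(100, 5)
print(pca(X, n_components=2).shape)   # (100, 2)
```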

14 of 47

FEATURE SUBSET SELECTION

  • Another way to reduce the dimensionality - use only a subset of the features.
  • Redundant Features
    • Example:
      • Purchase price of a product and the amount of sales tax paid
        • Redundant to each other
        • contain much of the same information.
  • Irrelevant features contain almost no useful information for the data mining task at hand.
    • Example: Students’ ID numbers are irrelevant to the task of predicting students’ grade point averages.
  • Redundant and irrelevant features
    • reduce classification accuracy and the quality of the clusters that are found.
    • can sometimes be eliminated immediately by using common sense or domain knowledge,
    • but a systematic approach is often needed to select the best subset of features.
    • Ideal approach: try all possible subsets of features as input to the data mining algorithm of interest, and then take the subset that produces the best results (impractical when there are many features, since n features have 2^n subsets).

15 of 47

FEATURE SUBSET SELECTION

    • 3 standard approaches to feature selection:
      • Embedded
      • Filter
      • Wrapper

16 of 47

FEATURE SUBSET SELECTION

  • Embedded approaches:
    • Feature selection occurs naturally as part of the data mining algorithm.
    • During execution, the algorithm itself decides which attributes to use and which to ignore.
    • Example: algorithms for building decision tree classifiers
  • Filter approaches:
    • Features are selected before the data mining algorithm is run,
    • using an approach that is independent of the data mining task.
  • Wrapper approaches:
    • Uses the target data mining algorithm as a black box to find the best subset of attributes
    • typically without enumerating all possible subsets.

17 of 47

FEATURE SUBSET SELECTION

  • An Architecture for Feature Subset Selection :
  • The feature selection process is viewed as consisting of four parts:
    1. a measure for evaluating a subset,
    2. a search strategy that controls the generation of a new subset of features,
    3. a stopping criterion, and
    4. a validation procedure.
  • Filter methods and wrapper methods differ only in the way in which they evaluate a subset of features.
    • wrapper method – uses the target data mining algorithm
    • filter approach - evaluation technique is distinct from the target data mining algorithm.

18 of 47

FEATURE SUBSET SELECTION

19 of 47

FEATURE SUBSET SELECTION

  • Feature subset selection is a search over all possible subsets of features.
  • Evaluation step - determine the goodness of a subset of attributes with respect to a particular data mining task
    • Filter approach: predict how well the actual data mining algorithm will perform on a given set of attributes.
    • Wrapper approach: run the target data mining algorithm and measure the result of the data mining.
  • Stopping criterion
    • conditions involving the following:
      • the number of iterations,
      • whether the value of the subset evaluation measure is optimal or exceeds a certain threshold,
      • whether a subset of a certain size has been obtained,
      • whether simultaneous size and evaluation criteria have been achieved, and
      • whether any improvement can be achieved by the options available to the search strategy.
  • Validation:
    • Finally, the results of the target data mining algorithm on the selected subset should be validated.
    • An evaluation approach: run the algorithm with the full set of features and compare the full results to results obtained using the subset of features.
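
A sketch of the search / evaluate / stop loop above as a greedy forward-selection wrapper; evaluate(subset) is a hypothetical callback that runs the target data mining algorithm on the chosen attributes and returns a quality score:

```python
def forward_selection(all_features, evaluate, max_features=None, min_gain=1e-4):
    """Greedy wrapper-style search over subsets of features."""
    selected, best_score = [], float("-inf")
    candidates = list(all_features)
    while candidates:
        # Search strategy: try adding each remaining feature to the current subset.
        score, feature = max((evaluate(selected + [f]), f) for f in candidates)
        # Stopping criterion: no meaningful improvement, or the size limit is reached.
        if score - best_score < min_gain:
            break
        selected.append(feature)
        candidates.remove(feature)
        best_score = score
        if max_features is not None and len(selected) >= max_features:
            break
    # Validation (comparing against the full feature set) is left to the caller.
    return selected, best_score
```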

20 of 47

FEATURE SUBSET SELECTION

  • Feature Weighting
    • An alternative to keeping or eliminating features.
    • One approach
      • more important features receive a higher weight
      • less important features receive a lower weight
    • Another approach: weights are determined automatically
      • Example: classification schemes such as support vector machines
    • Other approach
      • the normalization of objects that occurs when the cosine similarity is used effectively acts as a weighting

21 of 47

FEATURE CREATION

  • Create a new set of attributes that captures the important information in a data set much more effectively than the original attributes.
  • The number of new attributes can be smaller than the number of original attributes.
  • Three related methodologies for creating new attributes:
    1. Feature extraction
    2. Mapping the data to a new space
    3. Feature construction

22 of 47

FEATURE CREATION

  • Feature Extraction
    • The creation of a new set of features from the original raw data
      • Example: classify a set of photographs according to whether or not they contain a human face
        • Raw data (a set of pixels) is not suitable for many types of classification algorithms.
        • If the data is processed to provide higher-level features (the presence or absence of certain types of edges and areas that are highly correlated with the presence of human faces), then a much broader set of classification techniques can be applied to this problem.
    • Feature extraction is highly domain-specific
      • Applying it to a new area means developing new features and feature extraction methods.

23 of 47

FEATURE CREATION

Mapping the Data to a New Space

  • A totally different view of the data can reveal important and interesting features.
  • Example: a time series. If there is only a single periodic pattern and not much noise, then the pattern is easily detected.
  • If, however, there are a number of periodic patterns and a significant amount of noise is present, then these patterns are hard to detect in the original representation.
  • Such patterns can be detected by applying a Fourier transform to the time series in order to change to a representation in which frequency information is explicit.
  • Example:
    • Power spectrum that can be computed after applying a Fourier transform to the original time series.
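
A small sketch of this mapping using NumPy's FFT; the two frequencies (7 and 17 cycles) and the noise level are made-up values:

```python
import numpy as np

# A noisy time series containing two hidden periodic patterns.
t = np.linspace(0.0, 1.0, 512, endpoint=False)
series = np.sin(2 * np.pi * 7 * t) + 0.5 * np.sin(2 * np.pi * 17 * t)
series += np.random.normal(scale=1.0, size=t.size)          # significant noise

# Map to the frequency domain: the power spectrum makes frequency information explicit.
power = np.abs(np.fft.rfft(series)) ** 2
freqs = np.fft.rfftfreq(t.size, d=t[1] - t[0])

# The largest peaks (ignoring the constant component) sit near 7 and 17.
top_two = freqs[np.argsort(power[1:])[-2:] + 1]
print(sorted(top_two))
```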

24 of 47

FEATURE CREATION

  • Feature Construction
    • Sometimes the features in the original data set contain the necessary information, but not in a form suitable for the data mining algorithm.
    • In this case, new features constructed from the original features can be more useful than the original ones.
  • Example (Density).
    • A data set contains the volume and mass of historical artifacts that are to be classified by material.
    • A density feature constructed from the mass and volume features, i.e., density = mass/volume, would most directly yield an accurate classification.

25 of 47

DISCRETIZATION AND BINARIZATION

    • Some classification algorithms require that the data be in the form of categorical attributes.
    • Algorithms that find association patterns require that the data be in the form of binary attributes.
    • Discretization - transforming a continuous attribute into a categorical attribute
    • Binarization - transforming both continuous and discrete attributes into one or more binary attributes

26 of 47

DISCRETIZATION AND BINARIZATION

    • Binarization of a categorical attribute (Simple technique):
      • If there are m categorical values, then uniquely assign each original value to an integer in the interval [0, m − 1].
      • If the attribute is ordinal, then order must be maintained by the assignment.
      • Next, convert each of these m integers to a binary number using n binary attributes
        • n = ⌈log2(m)⌉ binary digits are required to represent these integers

27 of 47

DISCRETIZATION AND BINARIZATION

Example: a categorical variable with m = 5 values

{awful, poor, OK, good, great}

requires n = ⌈log2 5⌉ = 3 binary variables x1, x2, and x3 (see the sketch below).
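
A minimal sketch of this conversion in Python; the ordering of the five values and the names x1, x2, x3 follow the example above:

```python
from math import ceil, log2

values = ["awful", "poor", "OK", "good", "great"]      # ordinal, so the order is kept
m = len(values)
n = ceil(log2(m))                                      # n = 3 binary attributes

# Step 1: assign each value an integer in [0, m - 1]; step 2: write it using n bits.
for i, value in enumerate(values):
    x1, x2, x3 = format(i, f"0{n}b")
    print(f"{value:>5}: x1={x1} x2={x2} x3={x3}")
# awful -> 000, poor -> 001, OK -> 010, good -> 011, great -> 100
```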

28 of 47

DISCRETIZATION AND BINARIZATION

    • Discretization of Continuous Attributes (for classification or association analysis)
      • Transformation of a continuous attribute to a categorical attribute involves two subtasks:
        • decide how many categories to use
        • decide how to map the values of the continuous attribute to these categories.
      • Step 1: sort the attribute values and divide them into n intervals by specifying n − 1 split points.
      • Step 2: map all the values in one interval to the same categorical value.

29 of 47

DISCRETIZATION AND BINARIZATION

    • Discretization of Continuous Attributes
      • Problem of discretization is
        • Deciding how many split points to choose and
        • where to place them.
      • The result can be represented either as
        • a set of intervals {(x0, x1], (x1, x2], ..., (xn−1, xn)},

where x0 and xn may be −∞ or +∞, respectively,

or

        • as a series of inequalities x0 < x ≤ x1,..., xn−1 < x < xn.

30 of 47

DISCRETIZATION AND BINARIZATION

    • Unsupervised Discretization
      • Discretization methods for classification:
        • Supervised - class information is used
        • Unsupervised - class information is not used (or is unavailable)
      • Equal width approach:
        • divides the range of the attribute into a user-specified number of intervals, each having the same width.
        • can perform poorly when outliers are present
      • Equal frequency (equal depth) approach:
        • puts the same number of objects into each interval
      • K-means clustering can also be used (the first two approaches are sketched below)
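
A sketch of the two simplest unsupervised schemes with NumPy; the skewed data and the choice of 4 intervals are arbitrary, and the k-means variant is omitted because it needs an extra clustering step:

```python
import numpy as np

values = np.random.exponential(scale=10.0, size=1000)    # skewed data with a long tail
n_bins = 4

# Equal width: split [min, max] into intervals of the same width.
width_edges = np.linspace(values.min(), values.max(), n_bins + 1)
equal_width = np.digitize(values, width_edges[1:-1])      # categorical codes 0..3

# Equal frequency (equal depth): split points chosen so each interval
# receives roughly the same number of objects.
freq_edges = np.quantile(values, np.linspace(0.0, 1.0, n_bins + 1))
equal_freq = np.digitize(values, freq_edges[1:-1])

print(np.bincount(equal_width))   # outliers in the tail leave most bins nearly empty
print(np.bincount(equal_freq))    # roughly 250 objects in each bin
```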

31 of 47

DISCRETIZATION AND BINARIZATION

Unsupervised Discretization

Original Data

32 of 47

DISCRETIZATION AND BINARIZATION

Unsupervised Discretization

Equal Width Discretization

33 of 47

DISCRETIZATION AND BINARIZATION

Unsupervised Discretization

Equal Frequency Discretization

34 of 47

DISCRETIZATION AND BINARIZATION

Unsupervised Discretization

K-means Clustering (better result)

35 of 47

DISCRETIZATION AND BINARIZATION

    • Supervised Discretization
      • When additional information (class labels) is used, discretization often produces better results.
      • Key concerns: the purity of an interval and the minimum size of an interval.
      • statistically based approaches:
        • start with each attribute value as a separate interval and create larger intervals by merging adjacent intervals that are similar according to a statistical test.
      • Entropy based approaches:

36 of 47

DISCRETIZATION AND BINARIZATION

    • Supervised Discretization
      • Entropy based approaches:
        • Entropy of the ith interval:

          e_i = − Σ_{j=1}^{k} p_ij log2(p_ij)

        • p_ij = m_ij / m_i - probability of class j in the ith interval
        • k - no. of different class labels
        • m_i - no. of values in the ith interval of a partition
        • m_ij - no. of values of class j in interval i

37 of 47

DISCRETIZATION AND BINARIZATION

    • Supervised Discretization
      • Entropy

38 of 47

DISCRETIZATION AND BINARIZATION

    • Supervised Discretization
      • Entropy based approaches:
        • Total entropy, e, of the partition is the weighted average of the individual interval entropies:

          e = Σ_{i=1}^{n} w_i e_i

          • m - no. of values
          • w_i = m_i / m - fraction of values in the ith interval
          • n - no. of intervals
        • Perfectly pure interval: entropy is 0
          • the interval contains only values of one class
        • Most impure interval: entropy is maximum (log2 k)
          • the classes of values in the interval occur equally often

39 of 47

DISCRETIZATION AND BINARIZATION

    • Supervised Discretization
      • Entropy based approaches:
        • Simple approach for partitioning a continuous attribute:
          • starts by bisecting the initial values so that the resulting two intervals give minimum entropy.
          • consider each value as a possible split point
          • The splitting process is then repeated,
            • choosing the interval with the worst (highest) entropy to split next,
            • until a user-specified number of intervals is reached, or
            • a stopping criterion is satisfied.
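
A hedged sketch of the bisection step for one continuous attribute with class labels; it implements the general idea above, not the exact procedure of any particular tool:

```python
import math
from collections import Counter

def entropy(labels):
    """e = -sum_j p_j log2(p_j) over the class labels in an interval."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def best_split(points):
    """Try every value as a possible split point; keep the split with minimum weighted entropy."""
    points = sorted(points)                      # (value, class_label) pairs
    best = None
    for i in range(1, len(points)):
        left, right = points[:i], points[i:]
        w = len(left) / len(points)
        e = w * entropy([c for _, c in left]) + (1 - w) * entropy([c for _, c in right])
        if best is None or e < best[0]:
            best = (e, points[i][0], left, right)
    return best   # (weighted entropy, first value of the right interval, left, right)

data = [(1.0, "a"), (1.2, "a"), (2.9, "b"), (3.1, "b"), (3.3, "a")]
print(best_split(data)[:2])
```

Re-applying best_split to whichever current interval has the highest entropy continues the procedure until the stopping criterion is met.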

40 of 47

DISCRETIZATION AND BINARIZATION

    • Supervised Discretization
    • Entropy based approaches:
      • Example: discretization into 3 categories for both x and y

41 of 47

DISCRETIZATION AND BINARIZATION

    • Supervised Discretization
    • Entropy based approaches:
      • Example: discretization into 5 categories for both x and y
  • Observation:
    • no further improvement for 6 categories

42 of 47

DISCRETIZATION AND BINARIZATION

  • Categorical Attributes with Too Many Values
    • If the categorical attribute is ordinal,
        • techniques similar to those for continuous attributes can be used.
    • If the categorical attribute is nominal,
        • domain knowledge can be used to combine values into larger groups.
        • Example:
          • University that has a large number of departments.
          • department name attribute - dozens of diff. values.
          • combine departments into larger groups, such as
              • engineering,
              • social sciences, or
              • biological sciences.

43 of 47

Variable Transformation

  • Transformation that is applied to all the values of a variable.
  • Example: if only the magnitude of a variable is important,
    • then the values of the variable can be transformed by taking the absolute value.
  • Simple Function Transformation:
    • A simple mathematical function is applied to each value individually.
    • If x is a variable, then examples of such transformations include
      • x^k,
      • log x,
      • e^x,
      • √x,
      • 1/x,
      • sin x, or |x|

44 of 47

Variable Transformation

  • Variable transformations should be applied with caution since they change the nature of the data.
  • Example:-
    • transformation function is 1/x
    • for values of 1 or greater, it reduces the magnitude of the values
      • values {1, 2, 3} go to {1, 1/2, 1/3}
    • for values between 0 and 1, it increases the magnitude of the values
      • values {1, 1/2, 1/3} go to {1, 2, 3}.
  • So it is better to ask questions such as the following:
    • Does the order need to be maintained?
    • Does the transformation apply to all values (negative values and 0)?
    • What is the effect of the transformation on the values between 0 & 1?

45 of 47

Variable Transformation

  • Normalization or Standardization
  • Goal of standardization or normalization
    • To make an entire set of values have a particular property.
  • A traditional example is that of “standardizing a variable” in statistics.
    • x̄ - mean (average) of the attribute values
    • s_x - standard deviation
    • Transformation:

      x' = (x − x̄) / s_x

    • creates a new variable that has a mean of 0 and a standard deviation of 1.
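
A small NumPy sketch of this standardization on made-up values:

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

x_bar = x.mean()                  # mean of the attribute values
s_x = x.std()                     # standard deviation
x_new = (x - x_bar) / s_x         # the transformation above

print(round(x_new.mean(), 6), round(x_new.std(), 6))   # ~0.0 and 1.0
```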

46 of 47

Variable Transformation

  • Normalization or Standardization
  • If different variables are to be combined, a transformation is necessary to avoid having a variable with large values dominate the results of the calculation.
  • Example:
    • comparing people based on two variables: age and income.
    • For any two people, the difference in income will likely be much higher in absolute terms (hundreds or thousands of dollars) than the difference in age (less than 150).
    • Income values (the larger values) will dominate the calculation.

47 of 47

Variable Transformation

  • Normalization or Standardization
  • The mean and standard deviation are strongly affected by outliers.
  • The mean is therefore replaced by the median, i.e., the middle value, and the standard deviation by the absolute standard deviation.
  • If x is the variable, its absolute standard deviation is

    σ_A = Σ_{i=1}^{m} |x_i − µ|

  • x_i - ith value of the variable,
  • m - number of objects, and
  • µ - mean or median.
  • Other approaches
    • computing estimates of the location (center) and
    • spread of a set of values in the presence of outliers
  • These measures can also be used to define a standardization transformation.
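
A sketch of an outlier-resistant standardization along these lines, using the median for the location and the absolute standard deviation for the spread (the data, with one large outlier, is made up):

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0, 200.0])   # 200.0 is an outlier

mu = np.median(x)                       # robust estimate of the location (center)
sigma_a = np.sum(np.abs(x - mu))        # absolute standard deviation
x_new = (x - mu) / sigma_a              # standardization transformation

print(mu, sigma_a)
print(np.round(x_new, 3))
```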