1 of 47

DATA PREPROCESSING

Data Mining

Unit - II

2 of 47

AGGREGATION

  • “less is more”
  • Aggregation - combining of two or more objects into a single object.
  • Example:
    • One way to aggregate transactions for this data set is to replace all the transactions of a single store with a single storewide transaction.
    • This reduces the number of records (one record per store).
  • How an aggregate transaction is created
    • Quantitative attributes, such as price, are typically aggregated by taking a sum or an average.
    • A qualitative attribute, such as item, can either be omitted or summarized as the set of all the items that were sold at that location.
  • Can also be viewed as a multidimensional array, where each attribute is a dimension.
    • Used in OLAP
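
A minimal sketch of the storewide aggregation described above, assuming each transaction is a record with hypothetical store, item, and price fields:

```python
from collections import defaultdict

# Hypothetical transaction records: one dict per item sold (field names are assumptions).
transactions = [
    {"store": "S1", "item": "bread", "price": 2.50},
    {"store": "S1", "item": "milk",  "price": 1.20},
    {"store": "S2", "item": "soap",  "price": 3.00},
]

# One aggregate (storewide) transaction per store: the quantitative price is summed,
# the qualitative item attribute is summarized as the set of items sold.
storewide = defaultdict(lambda: {"total_price": 0.0, "items": set()})
for t in transactions:
    agg = storewide[t["store"]]
    agg["total_price"] += t["price"]
    agg["items"].add(t["item"])

for store, agg in storewide.items():
    print(store, round(agg["total_price"], 2), sorted(agg["items"]))
```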

3 of 47

AGGREGATION

  • Motivations for aggregation
    • Smaller data sets require less memory and processing time, which allows the use of more expensive data mining algorithms.
    • Change of scope or scale
      • aggregation provides a high-level view of the data instead of a low-level view.
    • Behavior of groups of objects or attributes is often more stable than that of individual objects or attributes.
  • Disadvantage of aggregation
    • potential loss of interesting details.

4 of 47

AGGREGATION

Example: the average yearly precipitation has less variability than the average monthly precipitation, illustrating the greater stability of aggregated values.

5 of 47

SAMPLING

  • Approach for selecting a subset of the data objects to be analyzed.
  • Data miners sample because it is too expensive or time consuming to process all the data.
  • The key principle for effective sampling is the following:
    • Using a sample will work almost as well as using the entire data set if the sample is representative.
      • A sample is representative if it has approximately the same property (of interest) as the original set of data.
    • Choose a sampling scheme/technique that gives a high probability of obtaining a representative sample.

6 of 47

SAMPLING

  • Sampling Approaches:
    • Simple random sampling
      • equal probability of selecting any particular item.
      • Two variations on random sampling:
        • (1) sampling without replacement—as each item is selected, it is removed from the set of all objects that together constitute the population, and
        • (2) sampling with replacement—objects are not removed from the population as they are selected for the sample.
      • Problem: When the population consists of different types of objects, with widely different numbers of objects, simple random sampling can fail to adequately represent those types of objects that are less frequent.
    • Stratified sampling:
      • starts with prespecified groups of objects
      • Simpler version: equal numbers of objects are drawn from each group, even though the groups are of different sizes.
      • Other version: the number of objects drawn from each group is proportional to the size of that group.
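
A small sketch of these sampling approaches using only the Python standard library; the group labels and sizes are made-up values for illustration:

```python
import random
from collections import defaultdict

# Hypothetical population with two groups of very different sizes.
population = [("rare", i) for i in range(10)] + [("common", i) for i in range(990)]

# Simple random sampling.
without_replacement = random.sample(population, k=50)   # items removed once selected
with_replacement = random.choices(population, k=50)     # items can be selected repeatedly

# Stratified sampling: draw from each prespecified group separately.
groups = defaultdict(list)
for label, obj in population:
    groups[label].append((label, obj))

equal_counts = [x for g in groups.values() for x in random.sample(g, k=5)]
proportional = [x for g in groups.values()
                for x in random.sample(g, k=max(1, round(0.05 * len(g))))]
```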

7 of 47

SAMPLING

Sampling and Loss of Information

  • Larger sample sizes increase the probability that a sample will be representative, but they also eliminate much of the advantage of sampling.
  • Conversely, with smaller sample sizes, patterns may be missed or erroneous patterns can be detected.

8 of 47

SAMPLING

Determining the Proper Sample Size

  • Example: the data consists of 10 groups (clusters), and the desired outcome is that at least one point is obtained from each group.
  • The probability of getting one object from each of the 10 groups increases as the sample size increases from 10 to 60.
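
A quick Monte Carlo sketch of this effect; the assumption of 10 equal-sized groups and sampling with replacement is mine, made only to show how the probability grows with sample size:

```python
import random

def prob_all_groups_hit(sample_size, n_groups=10, trials=10_000):
    """Estimate P(a random sample contains at least one object from every group)."""
    hits = 0
    for _ in range(trials):
        # Assumes equal-sized groups, so each draw is equally likely to hit any group.
        drawn = {random.randrange(n_groups) for _ in range(sample_size)}
        hits += (len(drawn) == n_groups)
    return hits / trials

for size in (10, 20, 30, 40, 50, 60):
    print(size, prob_all_groups_hit(size))
```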

9 of 47

SAMPLING

  • Adaptive/Progressive Sampling:
    • Proper sample size - Difficult to determine
    • Start with a small sample, and then increase the sample size until a sample of sufficient size has been obtained.
    • Eliminates the need to determine the correct sample size in advance
    • Stop increasing the sample size at the leveling-off point (where no further improvement in the outcome is observed).
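
A hedged sketch of such a progressive sampling loop; evaluate() is a hypothetical placeholder for whatever outcome measure (e.g., accuracy of a model built on the sample) the task uses:

```python
import random

def progressive_sample(data, evaluate, start=100, growth=2.0, tol=1e-3):
    """Grow the sample until the evaluation score levels off."""
    size = start
    prev_score = None
    while size <= len(data):
        sample = random.sample(data, size)
        score = evaluate(sample)                  # quality obtained from this sample
        if prev_score is not None and score - prev_score < tol:
            return sample                         # leveling-off point reached
        prev_score = score
        size = int(size * growth)                 # increase the sample size
    return list(data)                             # fall back to the full data set
```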

10 of 47

DIMENSIONALITY REDUCTION

  • Data sets can have a large number of features.
  • Example
    • a set of documents, where each document is represented by a vector whose components are the frequencies with which each word occurs in the document.
    • thousands or tens of thousands of attributes (components), one for each word in the vocabulary.

11 of 47

DIMENSIONALITY REDUCTION

  • Benefits to dimensionality reduction.
    • Many data mining algorithms work better if the dimensionality is lower.
      • It eliminates irrelevant features and reduces noise
    • Lead to a more understandable model
      • fewer attributes
    • Allow the data to be more easily visualized.
    • Amount of time and memory required by the data mining algorithm is reduced with a reduction in dimensionality.
  • Reduce the dimensionality of a data set by creating new attributes that are a combination of the old attributes.
  • Feature subset selection or feature selection:
    • The reduction of dimensionality by selecting attributes that are a subset of the old ones.

12 of 47

DIMENSIONALITY REDUCTION

  • The Curse of Dimensionality
    • Data analysis becomes significantly harder as the dimensionality of the data increases.
    • data becomes increasingly sparse
  • Classification
    • there are not enough data objects to allow the creation of a model that reliably assigns a class to all possible objects.
  • Clustering
    • the definitions of density and the distance between points become less meaningful

13 of 47

DIMENSIONALITY REDUCTION

  • Linear Algebra Techniques for Dimensionality Reduction
    • Principal Components Analysis (PCA)
      • for continuous attributes
      • finds new attributes (principal components) that
        • (1) are linear combinations of the original attributes,
        • (2) are orthogonal (perpendicular) to each other, and
        • (3) capture the maximum amount of variation in the data.
    • Singular Value Decomposition (SVD)
      • Related to PCA
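
A minimal PCA sketch with NumPy, shown only to make the three properties above concrete (center the data, find orthogonal directions, keep those capturing the most variation); it is a sketch, not a substitute for a library implementation:

```python
import numpy as np

def pca(X, n_components=2):
    """Project X (n_samples x n_features) onto its top principal components."""
    X_centered = X - X.mean(axis=0)                 # work with centered attributes
    cov = np.cov(X_centered, rowvar=False)          # covariance of the attributes
    eigvals, eigvecs = np.linalg.eigh(cov)          # eigenvectors are orthogonal
    order = np.argsort(eigvals)[::-1]               # sort by decreasing variance
    components = eigvecs[:, order[:n_components]]   # linear combinations of old attributes
    return X_centered @ components                  # new, lower-dimensional representation

X = np.random.rand(100, 5)
print(pca(X, n_components=2).shape)   # (100, 2)
```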

14 of 47

FEATURE SUBSET SELECTION

  • Another way to reduce the dimensionality - use only a subset of the features.
  • Redundant Features
    • Example:
      • Purchase price of a product and the amount of sales tax paid
        • Redundant to each other
        • contain much of the same information.
  • Irrelevant features contain almost no useful information for the data mining task at hand.
    • Example: Students’ ID numbers are irrelevant to the task of predicting students’ grade point averages.
  • Redundant and irrelevant features
    • reduce classification accuracy and the quality of the clusters that are found.
    • can sometimes be eliminated immediately by using common sense or domain knowledge,
    • but a systematic approach is often needed to select the best subset of features.
    • Ideal approach: try all possible subsets of features as input to the data mining algorithm of interest, and then take the subset that produces the best results (impractical when there are many features, since n features have 2^n subsets).

15 of 47

FEATURE SUBSET SELECTION

    • 3 standard approaches to feature selection:
      • Embedded
      • Filter
      • Wrapper

16 of 47

FEATURE SUBSET SELECTION

  • Embedded approaches:
    • Feature selection occurs naturally as part of the data mining algorithm.
    • During execution, the algorithm itself decides which attributes to use and which to ignore.
    • Example: algorithms for building decision tree classifiers
  • Filter approaches:
    • Features are selected before the data mining algorithm is run,
    • using an approach that is independent of the data mining task.
  • Wrapper approaches:
    • Uses the target data mining algorithm as a black box to find the best subset of attributes
    • typically without enumerating all possible subsets.

17 of 47

FEATURE SUBSET SELECTION

  • An Architecture for Feature Subset Selection :
  • The feature selection process is viewed as consisting of four parts:
    1. a measure for evaluating a subset,
    2. a search strategy that controls the generation of a new subset of features,
    3. a stopping criterion, and
    4. a validation procedure.
  • Filter methods and wrapper methods differ only in the way in which they evaluate a subset of features.
    • wrapper method – uses the target data mining algorithm
    • filter approach - evaluation technique is distinct from the target data mining algorithm.

18 of 47

FEATURE SUBSET SELECTION

19 of 47

FEATURE SUBSET SELECTION

  • Feature subset selection is a search over all possible subsets of features.
  • Evaluation step - determine the goodness of a subset of attributes with respect to a particular data mining task
    • Filter approach: predict how well the actual data mining algorithm will perform on a given set of attributes.
    • Wrapper approach: run the target data mining algorithm and measure the result of the data mining.
  • Stopping criterion
    • conditions involving the following:
      • the number of iterations,
      • whether the value of the subset evaluation measure is optimal or exceeds a certain threshold,
      • whether a subset of a certain size has been obtained,
      • whether simultaneous size and evaluation criteria have been achieved, and
      • whether any improvement can be achieved by the options available to the search strategy.
  • Validation:
    • Finally, the results of the target data mining algorithm on the selected subset should be validated.
    • An evaluation approach: run the algorithm with the full set of features and compare the full results to results obtained using the subset of features.
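
A sketch of the search / evaluate / stop loop above as a greedy forward-selection wrapper; evaluate(subset) is a hypothetical callback that runs the target data mining algorithm on the chosen attributes and returns a quality score:

```python
def forward_selection(all_features, evaluate, max_features=None, min_gain=1e-4):
    """Greedy wrapper-style search over subsets of features."""
    selected, best_score = [], float("-inf")
    candidates = list(all_features)
    while candidates:
        # Search strategy: try adding each remaining feature to the current subset.
        score, feature = max((evaluate(selected + [f]), f) for f in candidates)
        # Stopping criterion: no meaningful improvement, or the size limit is reached.
        if score - best_score < min_gain:
            break
        selected.append(feature)
        candidates.remove(feature)
        best_score = score
        if max_features is not None and len(selected) >= max_features:
            break
    # Validation (comparing against the full feature set) is left to the caller.
    return selected, best_score
```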

20 of 47

FEATURE SUBSET SELECTION

  • Feature Weighting
    • An alternative to keeping or eliminating features.
    • One approach
      • more important features receive a higher weight
      • less important features receive a lower weight
    • Another approach: weights are determined automatically
      • Example: classification schemes such as support vector machines
    • Other approach
      • the normalization of objects that occurs when the cosine similarity is used effectively acts as a weighting

21 of 47

FEATURE CREATION

  • Create a new set of attributes that captures the important information in a data set much more effectively than the original attributes.
  • The number of new attributes can be smaller than the number of original attributes.
  • Three related methodologies for creating new attributes:
    1. Feature extraction
    2. Mapping the data to a new space
    3. Feature construction

22 of 47

FEATURE CREATION

  • Feature Extraction
    • The creation of a new set of features from the original raw data
      • Example: classify a set of photographs according to whether or not they contain a human face
        • Raw data (a set of pixels) is not suitable for many types of classification algorithms.
        • If the data is processed to provide higher-level features (the presence or absence of certain types of edges and areas that are highly correlated with the presence of human faces), then a much broader set of classification techniques can be applied to this problem.
    • Feature extraction is highly domain-specific
      • Applying it to a new area means developing new features and feature extraction methods.

23 of 47

FEATURE CREATION

Mapping the Data to a New Space

  • A totally different view of the data can reveal important and interesting features.
  • Example: a time series. If there is only a single periodic pattern and not much noise, then the pattern is easily detected.
  • If, however, there are a number of periodic patterns and a significant amount of noise is present, then these patterns are hard to detect in the original representation.
  • Such patterns can be detected by applying a Fourier transform to the time series in order to change to a representation in which frequency information is explicit.
  • Example:
    • Power spectrum that can be computed after applying a Fourier transform to the original time series.
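
A small sketch of this mapping using NumPy's FFT; the two frequencies (7 and 17 cycles) and the noise level are made-up values:

```python
import numpy as np

# A noisy time series containing two hidden periodic patterns.
t = np.linspace(0.0, 1.0, 512, endpoint=False)
series = np.sin(2 * np.pi * 7 * t) + 0.5 * np.sin(2 * np.pi * 17 * t)
series += np.random.normal(scale=1.0, size=t.size)          # significant noise

# Map to the frequency domain: the power spectrum makes frequency information explicit.
power = np.abs(np.fft.rfft(series)) ** 2
freqs = np.fft.rfftfreq(t.size, d=t[1] - t[0])

# The largest peaks (ignoring the constant component) sit near 7 and 17.
top_two = freqs[np.argsort(power[1:])[-2:] + 1]
print(sorted(top_two))
```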

24 of 47

FEATURE CREATION

  • Feature Construction
    • Sometimes the features in the original data set contain the necessary information, but not in a form suitable for the data mining algorithm.
    • In this case, new features constructed from the original features can be more useful than the original ones.
  • Example (Density).
    • A data set contains the volume and mass of historical artifacts that are to be classified by material.
    • A density feature constructed from the mass and volume features, i.e., density = mass/volume, would most directly yield an accurate classification.

25 of 47

DISCRETIZATION AND BINARIZATION

    • Some classification algorithms require that the data be in the form of categorical attributes.
    • Algorithms that find association patterns require that the data be in the form of binary attributes.
    • Discretization - transforming a continuous attribute into a categorical attribute
    • Binarization - transforming both continuous and discrete attributes into one or more binary attributes

26 of 47

DISCRETIZATION AND BINARIZATION

    • Binarization of a categorical attribute (Simple technique):
      • If there are m categorical values, then uniquely assign each original value to an integer in the interval [0, m − 1].
      • If the attribute is ordinal, then order must be maintained by the assignment.
      • Next, convert each of these m integers to a binary number using n binary attributes
        • n = ⌈log2(m)⌉ binary digits are required to represent these integers

27 of 47

DISCRETIZATION AND BINARIZATION

Example: a categorical variable with m = 5 values

{awful, poor, OK, good, great}

requires n = ⌈log2 5⌉ = 3 binary variables x1, x2, and x3 (see the sketch below).
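
A minimal sketch of this conversion in Python; the ordering of the five values and the names x1, x2, x3 follow the example above:

```python
from math import ceil, log2

values = ["awful", "poor", "OK", "good", "great"]      # ordinal, so the order is kept
m = len(values)
n = ceil(log2(m))                                      # n = 3 binary attributes

# Step 1: assign each value an integer in [0, m - 1]; step 2: write it using n bits.
for i, value in enumerate(values):
    x1, x2, x3 = format(i, f"0{n}b")
    print(f"{value:>5}: x1={x1} x2={x2} x3={x3}")
# awful -> 000, poor -> 001, OK -> 010, good -> 011, great -> 100
```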

28 of 47

DISCRETIZATION AND BINARIZATION

    • Discretization of Continuous Attributes (for classification or association analysis)
      • Transformation of a continuous attribute to a categorical attribute involves two subtasks:
        • decide how many categories to use
        • decide how to map the values of the continuous attribute to these categories.
      • Step 1: sort the attribute values and divide them into n intervals by specifying n − 1 split points.
      • Step 2: map all the values in one interval to the same categorical value.

29 of 47

DISCRETIZATION AND BINARIZATION

    • Discretization of Continuous Attributes
      • Problem of discretization is
        • Deciding how many split points to choose and
        • where to place them.
      • The result can be represented either as
        • a set of intervals {(x0, x1], (x1, x2], ..., (xn−1, xn)},

where x0 and xn may be −∞ or +∞, respectively,

or

        • as a series of inequalities x0 < x ≤ x1,..., xn−1 < x < xn.

30 of 47

DISCRETIZATION AND BINARIZATION

    • Unsupervised Discretization
      • Discretization methods for classification:
        • Supervised - class information is used
        • Unsupervised - class information is not used (or is unavailable)
      • Equal width approach:
        • divides the range of the attribute into a user-specified number of intervals, each having the same width.
        • can perform poorly when outliers are present
      • Equal frequency (equal depth) approach:
        • puts the same number of objects into each interval
      • K-means clustering can also be used (the first two approaches are sketched below)
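
A sketch of the two simplest unsupervised schemes with NumPy; the skewed data and the choice of 4 intervals are arbitrary, and the k-means variant is omitted because it needs an extra clustering step:

```python
import numpy as np

values = np.random.exponential(scale=10.0, size=1000)    # skewed data with a long tail
n_bins = 4

# Equal width: split [min, max] into intervals of the same width.
width_edges = np.linspace(values.min(), values.max(), n_bins + 1)
equal_width = np.digitize(values, width_edges[1:-1])      # categorical codes 0..3

# Equal frequency (equal depth): split points chosen so each interval
# receives roughly the same number of objects.
freq_edges = np.quantile(values, np.linspace(0.0, 1.0, n_bins + 1))
equal_freq = np.digitize(values, freq_edges[1:-1])

print(np.bincount(equal_width))   # outliers in the tail leave most bins nearly empty
print(np.bincount(equal_freq))    # roughly 250 objects in each bin
```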

31 of 47

DISCRETIZATION AND BINARIZATION

Unsupervised Discretization

Original Data

32 of 47

DISCRETIZATION AND BINARIZATION

Unsupervised Discretization

Equal Width Discretization

33 of 47

DISCRETIZATION AND BINARIZATION

Unsupervised Discretization

Equal Frequency Discretization

34 of 47

DISCRETIZATION AND BINARIZATION

Unsupervised Discretization

K-means Clustering (better result)

35 of 47

DISCRETIZATION AND BINARIZATION

    • Supervised Discretization
      • When additional information (class labels) is used, discretization often produces better results.
      • Key concerns: the purity of an interval and the minimum size of an interval.
      • statistically based approaches:
        • start with each attribute value as a separate interval and create larger intervals by merging adjacent intervals that are similar according to a statistical test.
      • Entropy based approaches:

36 of 47

DISCRETIZATION AND BINARIZATION

    • Supervised Discretization
      • Entropy based approaches:
        • Entropy of the ith interval:

          e_i = − Σ_{j=1}^{k} p_ij log2(p_ij)

        • p_ij = m_ij / m_i - probability of class j in the ith interval
        • k - no. of different class labels
        • m_i - no. of values in the ith interval of a partition
        • m_ij - no. of values of class j in interval i

37 of 47

DISCRETIZATION AND BINARIZATION

    • Supervised Discretization
      • Entropy

38 of 47

DISCRETIZATION AND BINARIZATION

    • Supervised Discretization
      • Entropy based approaches:
        • Total entropy, e, of the partition is the weighted average of the individual interval entropies:

          e = Σ_{i=1}^{n} w_i e_i

          • m - no. of values
          • w_i = m_i / m - fraction of values in the ith interval
          • n - no. of intervals
        • Perfectly pure interval: entropy is 0
          • the interval contains only values of one class
        • Most impure interval: entropy is maximum (log2 k)
          • the classes of values in the interval occur equally often

39 of 47

DISCRETIZATION AND BINARIZATION

    • Supervised Discretization
      • Entropy based approaches:
        • Simple approach for partitioning a continuous attribute:
          • starts by bisecting the initial values so that the resulting two intervals give minimum entropy.
          • consider each value as a possible split point
          • The splitting process is then repeated,
            • choosing the interval with the worst (highest) entropy to split next,
            • until a user-specified number of intervals is reached, or
            • a stopping criterion is satisfied.
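
A hedged sketch of the bisection step for one continuous attribute with class labels; it implements the general idea above, not the exact procedure of any particular tool:

```python
import math
from collections import Counter

def entropy(labels):
    """e = -sum_j p_j log2(p_j) over the class labels in an interval."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def best_split(points):
    """Try every value as a possible split point; keep the split with minimum weighted entropy."""
    points = sorted(points)                      # (value, class_label) pairs
    best = None
    for i in range(1, len(points)):
        left, right = points[:i], points[i:]
        w = len(left) / len(points)
        e = w * entropy([c for _, c in left]) + (1 - w) * entropy([c for _, c in right])
        if best is None or e < best[0]:
            best = (e, points[i][0], left, right)
    return best   # (weighted entropy, first value of the right interval, left, right)

data = [(1.0, "a"), (1.2, "a"), (2.9, "b"), (3.1, "b"), (3.3, "a")]
print(best_split(data)[:2])
```

Re-applying best_split to whichever current interval has the highest entropy continues the procedure until the stopping criterion is met.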

40 of 47

DISCRETIZATION AND BINARIZATION

    • Supervised Discretization
    • Entropy based approaches:
      • Example: discretization into 3 categories for both x and y

41 of 47

DISCRETIZATION AND BINARIZATION

    • Supervised Discretization
    • Entropy based approaches:
      • Example: discretization into 5 categories for both x and y
  • Observation:
    • no further improvement for 6 categories

42 of 47

DISCRETIZATION AND BINARIZATION

  • Categorical Attributes with Too Many Values
    • If the categorical attribute is ordinal,
        • techniques similar to those for continuous attributes can be used.
    • If the categorical attribute is nominal,
        • domain knowledge can be used to combine values into larger groups.
        • Example:
          • University that has a large number of departments.
          • department name attribute - dozens of diff. values.
          • combine departments into larger groups, such as
              • engineering,
              • social sciences, or
              • biological sciences.

43 of 47

Variable Transformation

  • Transformation that is applied to all the values of a variable.
  • Example: if only the magnitude of a variable is important,
    • then the values of the variable can be transformed by taking the absolute value.
  • Simple Function Transformation:
    • A simple mathematical function is applied to each value individually.
    • If x is a variable, then examples of such transformations include
      • x^k,
      • log x,
      • e^x,
      • √x,
      • 1/x,
      • sin x, or |x|

44 of 47

Variable Transformation

  • Variable transformations should be applied with caution since they change the nature of the data.
  • Example:-
    • transformation function is 1/x
    • for values of 1 or greater, it reduces the magnitude of the values
      • values {1, 2, 3} go to {1, 1/2, 1/3}
    • for values between 0 and 1, it increases the magnitude of the values
      • values {1, 1/2, 1/3} go to {1, 2, 3}.
  • So it is better to ask questions such as the following:
    • Does the order need to be maintained?
    • Does the transformation apply to all values (negative values and 0)?
    • What is the effect of the transformation on the values between 0 & 1?

45 of 47

Variable Transformation

  • Normalization or Standardization
  • Goal of standardization or normalization
    • To make an entire set of values have a particular property.
  • A traditional example is that of “standardizing a variable” in statistics.
    • x̄ - mean (average) of the attribute values
    • s_x - standard deviation
    • Transformation:

      x' = (x − x̄) / s_x

    • creates a new variable that has a mean of 0 and a standard deviation of 1.
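
A small NumPy sketch of this standardization on made-up values:

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

x_bar = x.mean()                  # mean of the attribute values
s_x = x.std()                     # standard deviation
x_new = (x - x_bar) / s_x         # the transformation above

print(round(x_new.mean(), 6), round(x_new.std(), 6))   # ~0.0 and 1.0
```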

46 of 47

Variable Transformation

  • Normalization or Standardization
  • If different variables are to be combined, a transformation is necessary to avoid having a variable with large values dominate the results of the calculation.
  • Example:
    • comparing people based on two variables: age and income.
    • For any two people, the difference in income will likely be much higher in absolute terms (hundreds or thousands of dollars) than the difference in age (less than 150).
    • Income values (the larger values) will dominate the calculation.

47 of 47

Variable Transformation

  • Normalization or Standardization
  • The mean and standard deviation are strongly affected by outliers.
  • The mean is therefore replaced by the median, i.e., the middle value, and the standard deviation by the absolute standard deviation.
  • If x is the variable, its absolute standard deviation is

    σ_A = Σ_{i=1}^{m} |x_i − µ|

  • x_i - ith value of the variable,
  • m - number of objects, and
  • µ - mean or median.
  • Other approaches
    • computing estimates of the location (center) and
    • spread of a set of values in the presence of outliers
  • These measures can also be used to define a standardization transformation.
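
A sketch of an outlier-resistant standardization along these lines, using the median for the location and the absolute standard deviation for the spread (the data, with one large outlier, is made up):

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0, 200.0])   # 200.0 is an outlier

mu = np.median(x)                       # robust estimate of the location (center)
sigma_a = np.sum(np.abs(x - mu))        # absolute standard deviation
x_new = (x - mu) / sigma_a              # standardization transformation

print(mu, sigma_a)
print(np.round(x_new, 3))
```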