1 of 35

CLASSIFICATION

DATAMINING

UNIT III

2 of 35

BASIC CONCEPTS

  • Input data -> collection of records.
  • Record / instance / example -> tuple (x, y)
    • x - attribute set
    • y - special attribute (class label / category / target attribute)
  • Attribute set - properties of a Data Object – Discrete / Continuous
  • Class label –
    • Classification – y is Discrete attribute
    • Regression (Predictive Modeling Task) - y is a continuous attribute.

3 of 35

BASIC CONCEPTS

  • Definition:
    • Classification is the task of learning a target function f that maps each attribute set x to one of the predefined class labels y.
    • The target function is also known informally as a classification model.

4 of 35

BASIC CONCEPTS

  • A classification model is useful for the following purposes.
    • Descriptive modeling: A classification model can serve as an explanatory tool to distinguish between objects of different classes.

5 of 35

BASIC CONCEPTS

  • A classification model is useful for the following purposes.
    • Predictive Modeling:
      • A classification model can also be used to predict the class label of unknown records.
      • Automatically assigns a class label when presented with the attribute set of an unknown record.
  • Classification techniques are best suited for binary or nominal categories.
  • They do not consider the implicit order among ordinal class labels.
  • Relationships among classes (e.g., superclass/subclass) are also ignored.

6 of 35

General approach to solving a classification problem

  • Classification technique (or classifier)
    • Systematic approach to building classification models from an input data set.
    • Examples
      • Decision tree classifiers,
      • Rule-based classifiers,
      • Neural networks,
      • Support vector machines, and
      • Naïve Bayes classifiers.
  • Learning algorithm
    • Used by the classifier
    • To identify a model
      • That best fits the relationship between the attribute set and class label of the input data.

7 of 35

General approach to solving a classification problem

  • Model
    • Generated by a learning algorithm
    • Should satisfy the following:
      • Fit the input data well
      • Correctly predict the class labels of records it has never seen before.
  • Training set
    • Consisting of records whose class labels are known
    • used to build a classification model

8 of 35

General approach to solving a classification problem

  • Confusion Matrix
    • Used to evaluate the performance of a classification model
    • Holds details about
      • counts of test records correctly and incorrectly predicted by the model.
      • Table 4.2 depicts the confusion matrix for a binary classification problem.
      • fij – no. of records from class i predicted to be of class j.
      • f01 – no. of records from class 0 incorrectly predicted as class 1.
      • total no. of correct predictions made (f11 + f00)
      • total number of incorrect predictions (f10 + f01).

9 of 35

General approach to solving a classification problem

  • Performance Metrics:
    1. Accuracy = (no. of correct predictions) / (total no. of predictions) = (f11 + f00) / (f11 + f10 + f01 + f00)

    2. Error rate = (no. of wrong predictions) / (total no. of predictions) = (f10 + f01) / (f11 + f10 + f01 + f00)
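Using the f_ij counts defined on the previous slide, both metrics can be sketched as follows (a minimal illustration; the counts used in the example are made up):

```python
# Accuracy and error rate from a 2x2 confusion matrix.
# f[i][j] = number of class-i records predicted as class j,
# following the f_ij notation above.
def accuracy(f):
    correct = f[1][1] + f[0][0]
    total = f[1][1] + f[1][0] + f[0][1] + f[0][0]
    return correct / total

def error_rate(f):
    wrong = f[1][0] + f[0][1]
    total = f[1][1] + f[1][0] + f[0][1] + f[0][0]
    return wrong / total

f = {1: {1: 40, 0: 10}, 0: {1: 5, 0: 45}}  # illustrative counts
print(accuracy(f))    # (40 + 45) / 100 = 0.85
print(error_rate(f))  # (10 + 5) / 100 = 0.15
```

Note that accuracy and error rate always sum to 1, so maximizing one is equivalent to minimizing the other.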

10 of 35

DECISION TREE INDUCTION

Working of Decision Tree

    • We can solve a classification problem by asking a series of carefully crafted questions about the attributes of the test record.
    • Each time we receive an answer, a follow-up question is asked until we reach a conclusion about the class label of the record.
    • The series of questions and their possible answers can be organized in the form of a decision tree
    • Decision tree is a hierarchical structure consisting of nodes and directed edges.

11 of 35

DECISION TREE INDUCTION

Working of Decision Tree

  • Three types of nodes:
      • Root node
        • No incoming edges
        • Zero or more outgoing edges.
      • Internal nodes
        • Exactly one incoming edge and
        • Two or more outgoing edges.
      • Leaf or terminal nodes
        • Exactly one incoming edge and
        • No outgoing edges.
    • Each leaf node is assigned a class label.
    • Non-terminal nodes (root & other internal nodes) contain attribute test conditions to separate records that have different characteristics.

12 of 35

DECISION TREE INDUCTION

Working of Decision Tree

13 of 35

DECISION TREE INDUCTION

Building Decision Tree

  • Hunt’s algorithm:
    • basis of many existing decision tree induction algorithms, including
      • ID3,
      • C4.5, and
      • CART.
    • In Hunt’s algorithm, a decision tree is grown in a recursive fashion by partitioning the training records into successively purer subsets.
    • Dt - set of training records associated with node t
    • y = {y1, y2,..., yc} -> set of class labels.
    • Steps of Hunt’s algorithm:
    • Step 1: If all the records in Dt belong to the same class yt, then t is a leaf node labeled as yt.
    • Step 2: If Dt contains records that belong to more than one class, an attribute test condition is selected to partition the records into smaller subsets. A child node is created for each outcome of the test condition and the records in Dt are distributed to the children based on the outcomes.
    • Note: the algorithm is then recursively applied to each child node.
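The two steps above can be sketched recursively (an illustrative, unoptimized version; attribute selection is simplified to taking the first remaining attribute, whereas real algorithms choose by an impurity-based criterion):

```python
from collections import Counter

def hunt(records, labels, attributes):
    """Recursive sketch of Hunt's algorithm. Records are dicts of
    attribute values; returns a nested-dict tree whose leaves are
    plain class labels."""
    # Step 1: all records belong to the same class -> leaf with that label.
    if len(set(labels)) == 1:
        return labels[0]
    # Identical attribute values (no attributes left to test) ->
    # majority-class leaf, per the additional conditions on the next slide.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Step 2: select a test attribute, create a child per outcome,
    # distribute the records, and recurse on each child.
    attr = attributes[0]
    tree = {attr: {}}
    for value in set(r[attr] for r in records):
        subset = [(r, y) for r, y in zip(records, labels) if r[attr] == value]
        tree[attr][value] = hunt([r for r, _ in subset],
                                 [y for _, y in subset],
                                 attributes[1:])
    return tree
```

For example, two records split on a hypothetical `owner` attribute yield a one-level tree with a leaf per attribute value.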

14 of 35

DECISION TREE INDUCTION

Building Decision Tree

  • Example: predicting whether a loan applicant will repay the loan or default.
    • Construct a training set by examining the records of previous borrowers.

15 of 35

DECISION TREE INDUCTION

Building Decision Tree

  • Hunt’s algorithm will work fine
    • if every combination of attribute values is present in the training data and
    • if each combination has a unique class label.
  • Additional conditions
    1. If a child node is empty (no records in the training set are associated with it), declare it a leaf node with the same class label as the majority class of training records associated with its parent node.
    2. If records have identical attribute values but different class labels, it is not possible to split further; declare the node a leaf with the same class label as the majority class of training records associated with this node.
  • Design Issues of Decision Tree Induction
    • 1. How should the training records be split?
      • Test condition to divide the records into smaller subsets.
      • provide a method for specifying the test condition
      • measure for evaluating the goodness of each test condition.

    • 2. How should the splitting procedure stop?
      • A stopping condition is needed
      • stop when either all the records belong to the same class or all the records have identical attribute values.

16 of 35

DECISION TREE INDUCTION

Methods for Expressing Attribute Test Conditions

  • Test condition for Binary Attributes
    • The test condition for a binary attribute generates two potential outcomes.

17 of 35

DECISION TREE INDUCTION

Methods for Expressing Attribute Test Conditions

  • Test condition for Nominal Attributes
    • nominal attribute can have many values
    • Test condition can be expressed in two ways
      • Multiway split - number of outcomes depends on the number of distinct values
      • Binary splits (used in CART) - produces binary splits by considering all 2^(k−1) − 1 ways of creating a binary partition of k attribute values.

18 of 35

DECISION TREE INDUCTION

Methods for Expressing Attribute Test Conditions

  • Test condition for Ordinal Attributes
    • Ordinal attributes can also produce binary or multiway splits.
    • values can be grouped without violating the order property.
    • Figure 4.10(c) is invalid because it groups values in a way that violates the order property.

19 of 35

DECISION TREE INDUCTION

Methods for Expressing Attribute Test Conditions

  • Test condition for Continuous Attributes
    • Test condition - Comparison test (A < v) or (A ≥ v) with binary outcomes,

or

    • Test condition - a range query with outcomes of the form vi ≤ A < vi+1, for i = 1,..., k.
      • Multiway split
      • Apply the discretization strategies

20 of 35

DECISION TREE INDUCTION

Measures for Selecting the Best Split

  • p(i|t) - fraction of records belonging to class i at a given node t.
    • Sometimes written simply as pi.
  • Two-class problem
    • (p0, p1) - class distribution at any node
    • p1 = 1 − p0
    • A (0.5, 0.5) distribution means there are an equal number of records from each class.
    • In the example, splitting on Car Type results in purer partitions.

21 of 35

DECISION TREE INDUCTION

Measures for Selecting the Best Split

  • selection of best split is based on the degree of impurity of the child nodes
  • Node with class distribution (0, 1) has zero impurity,
  • Node with uniform class distribution (0.5, 0.5) has the highest impurity.
  • p - fraction of records that belong to one of the two classes.
  • Impurity is maximum when p = 0.5 (the class distribution is even).
  • Impurity is minimum when p = 0 or p = 1 (all records belong to the same class).
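For a two-class node, the standard impurity measures can be sketched as functions of p (a minimal illustration; function names are my own):

```python
import math

def gini(p):
    """Gini index of a binary node, p = fraction in one class."""
    return 1.0 - (p**2 + (1.0 - p)**2)

def entropy(p):
    """Entropy (in bits) of a binary node; taken as 0 at p = 0 or 1."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def classification_error(p):
    """Misclassification error of a binary node."""
    return 1.0 - max(p, 1.0 - p)

# All three peak at p = 0.5 (even distribution) and vanish at p = 0 or 1.
print(gini(0.5), gini(0.0))        # 0.5 0.0
print(entropy(0.5), entropy(1.0))  # 1.0 0.0
```

Although the three measures differ in scale, they rank candidate splits similarly in most cases.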

22 of 35

DECISION TREE INDUCTION

Measures for Selecting the Best Split

  • Node N1 has the lowest impurity value, followed by N2 and N3.

23 of 35

DECISION TREE INDUCTION

Measures for Selecting the Best Split

  • To determine the performance of a test condition, compare the degree of impurity of the parent node (before splitting) with the degree of impurity of the child nodes (after splitting).
  • The larger their difference, the better the test condition.
  • Information Gain:

    • Δ = I(parent) − Σj [N(vj)/N] · I(vj), summed over the k children vj
    • I(·) - impurity measure of a given node,
    • N - total no. of records at the parent node,
    • k - no. of attribute values,
    • N(vj) - no. of records associated with the child node vj.
  • When entropy is used as the impurity measure I(·), the difference Δ is known as the information gain, ∆info.
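The gain computation above can be sketched over per-child class counts (an illustrative version; function names and the example counts are my own):

```python
def gini_counts(counts):
    """Gini index from a list of class counts, e.g. [3, 1]."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def weighted_child_impurity(child_counts, impurity):
    """Sum of N(vj)/N * I(vj) over the children."""
    n = sum(sum(c) for c in child_counts)
    return sum(sum(c) / n * impurity(c) for c in child_counts)

def gain(parent_counts, child_counts, impurity=gini_counts):
    """Delta = I(parent) - weighted average impurity of the children."""
    return impurity(parent_counts) - weighted_child_impurity(child_counts, impurity)

# Parent (4, 4) split into children (3, 1) and (1, 3):
print(gain([4, 4], [[3, 1], [1, 3]]))  # 0.5 - 0.375 = 0.125
```

A perfect split into pure children, e.g. (4, 0) and (0, 4), achieves the maximum gain of 0.5 for this parent.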

24 of 35

Calculate Impurity using Gini

  • Gini(t) = 1 − Σi [p(i|t)]²
  • Find out which attribute is selected.

25 of 35

26 of 35

DECISION TREE INDUCTION

Measures for Selecting the Best Split

  • Splitting of Binary Attributes
    • Before splitting, the Gini index is 0.5
      • because there are an equal number of records from both classes.
    • If attribute A is chosen to split the data,
      • Gini index
        • node N1 = 0.4898, and
        • node N2 = 0.480.
      • Weighted average of the Gini index for the descendent nodes is
        • (7/12) × 0.4898 + (5/12) × 0.480 = 0.486.
    • Weighted average of the Gini index for attribute B is 0.375.
    • Attribute B is selected because its weighted Gini index (0.375) is smaller than A’s (0.486).
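The weighted average for attribute A can be checked numerically from the node Gini values quoted above (7 records in N1 and 5 in N2, as in the slide's example):

```python
# Weighted-average Gini for attribute A's binary split.
g_n1, g_n2 = 0.4898, 0.480          # Gini of child nodes N1 and N2
weighted_a = (7/12) * g_n1 + (5/12) * g_n2
print(round(weighted_a, 3))         # 0.486

# Attribute B's weighted Gini (0.375, from the slide) is smaller,
# so B produces the purer split and is selected.
assert 0.375 < weighted_a
```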

27 of 35

28 of 35

DECISION TREE INDUCTION

Measures for Selecting the Best Split

  • Splitting of Nominal Attributes
    • First Binary Grouping
      • Gini index of {Sports, Luxury} is 0.4922 and
      • the Gini index of {Family} is 0.3750.
      • The weighted average Gini index

16/20 × 0.4922 + 4/20 × 0.3750 = 0.468.

    • Second binary grouping of {Sports} and {Family, Luxury},
      • weighted average Gini index is 0.167.

  • The second grouping has a lower Gini index because its corresponding subsets are much purer.

29 of 35

DECISION TREE INDUCTION

Measures for Selecting the Best Split

  • Splitting of Continuous Attributes

  • A brute-force method: take every value of the attribute in the N records as a candidate split position v.
    • For each candidate v, count the number of records with annual income less than or greater than v (computationally expensive).
  • To reduce the complexity, the training records are first sorted based on their annual income.
  • Candidate split positions are then identified by taking the midpoints between adjacent sorted values.
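The midpoint strategy can be sketched as follows (the income values are illustrative):

```python
# Candidate split positions for a continuous attribute:
# midpoints between adjacent values after sorting.
incomes = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
sorted_vals = sorted(incomes)
candidates = [(a + b) / 2 for a, b in zip(sorted_vals, sorted_vals[1:])]
print(candidates)
# [65.0, 72.5, 80.0, 87.5, 92.5, 97.5, 110.0, 122.5, 172.5]
```

This reduces the work per attribute to one sort plus a single pass, since the class counts on either side of each candidate can be updated incrementally while scanning the sorted values.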

30 of 35

DECISION TREE INDUCTION

Measures for Selecting the Best Split

  • Gain Ratio
    • Problem:
      • A test on an attribute such as Customer ID produces a large number of pure partitions.
      • Yet Customer ID is not a predictive attribute, because its value is unique for each record.
    • Two Strategies:
      • First strategy(used in CART)
        • restrict the test conditions to binary splits only.
      • Second strategy (used in C4.5: gain ratio to determine the goodness of a split)
        • modify the splitting criterion
        • consider - number of outcomes produced by the attribute test condition.

31 of 35

DECISION TREE INDUCTION

Measures for Selecting the Best Split

  • Gain Ratio
    • Gain ratio = ∆info / Split Info
    • Split Info = −Σi P(vi) log2 P(vi), where P(vi) is the fraction of records assigned to child vi and k is the total number of splits.
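The split-info penalty used by C4.5 can be sketched as follows (function names are my own):

```python
import math

def split_info(child_sizes):
    """Split Info = -sum over children of P(vi) * log2 P(vi)."""
    n = sum(child_sizes)
    return -sum(s / n * math.log2(s / n) for s in child_sizes if s)

def gain_ratio(delta_info, child_sizes):
    """Gain ratio = information gain / split info (C4.5).
    An attribute like Customer ID with many outcomes has a large
    split info, which penalizes its otherwise pure partitions."""
    return delta_info / split_info(child_sizes)

# An even 2-way split costs 1 bit; an even 4-way split costs 2 bits.
print(split_info([50, 50]))          # 1.0
print(split_info([25, 25, 25, 25]))  # 2.0
```

So for the same information gain, a test with more (evenly populated) outcomes receives a lower gain ratio.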

32 of 35

33 of 35

Tree-Pruning

  • After building the decision tree,
    • Tree-pruning step - to reduce the size of the decision tree.
    • Pruning -
      • trims the branches of the initial tree
      • improves the generalization capability of the decision tree.
  • Decision trees that are too large are susceptible to a phenomenon known as overfitting.

34 of 35

35 of 35