1 of 27

Machine Learning I:

Decision Trees

Patrick Hall

Visiting Faculty, Department of Decision Sciences

George Washington University

2 of 27

Lecture 4 Agenda

  • Brief Introduction
  • Decision Tree Algorithm
    • Greedy Heuristics
    • Impurity & Information Gain
    • Tree Pruning
  • Banknote Case: Python Example
  • Readings

3 of 27

Where are we in the modeling lifecycle?

[Figure: modeling lifecycle diagram — Data Collection & ETL, Feature Selection & Engineering, Supervised Learning, Unsupervised Learning, Assessment & Validation, and Deployment, spanning cost-intensive to revenue-generating stages]

4 of 27

Brief Introduction

General Overview

5 of 27

Decision Tree Anatomy

  • Three node types:
    • Root node: no incoming links and zero or more outgoing links
    • Internal nodes: exactly one incoming and two or more outgoing links
    • Leaf or terminal nodes: exactly one incoming link and no outgoing links
  • Every leaf or terminal node in the decision tree is associated with a class label
  • Non-terminal nodes contain feature test conditions that are typically defined using a single feature
  • Depth: length of the longest decision-path from the tree root to a leaf node
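To make this anatomy concrete, here is a minimal sketch of one possible in-memory representation of such a tree; the dict layout and field names (feature, threshold, left, right, label) are illustrative assumptions, not any particular library's format.

```python
# Minimal dict-based tree: internal nodes hold a single-feature test,
# leaves hold a class label. The layout and names are illustrative only.
leaf_a = {"label": "class_A"}                 # terminal nodes: one incoming link,
leaf_b = {"label": "class_B"}                 # no outgoing links, one class label

root = {                                      # root node: no incoming links
    "feature": "x1",                          # test condition uses a single feature
    "threshold": 3.5,                         # e.g., x1 < 3.5 -> left, else right
    "left": leaf_a,
    "right": {                                # internal node: one incoming link,
        "feature": "x2",                      # two outgoing links
        "threshold": 7.0,
        "left": leaf_b,
        "right": leaf_a,
    },
}

def depth(node):
    """Length of the longest decision path from this node down to a leaf."""
    if "label" in node:                       # leaf
        return 0
    return 1 + max(depth(node["left"]), depth(node["right"]))

print(depth(root))                            # 2 for this example
```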

6 of 27

Decision Tree Structure: Simple Example

7 of 27

Decision Tree Algorithm

Greedy Heuristics,

Impurity, Information Gain, & Tree Pruning

8 of 27

Basic Algorithm Overview

  • Many possible decision trees can be constructed from a particular dataset
    • Rashomon Effect
  • Finding an optimal tree is computationally expensive:
    • Learning the simplest decision tree is NP-hard
  • Efficient algorithms have been developed to learn reasonably accurate decision trees in a reasonable amount of time by resorting to greedy heuristics
  • Employ a greedy strategy to grow the decision tree in a top-down fashion by making a series of locally optimal decisions
  • Splitting criterion determines which feature is chosen as the test condition and determines how the training data is distributed amongst child nodes
  • Stopping criterion determines when a node should be stopped from expanding
  • Pruning the decision tree prevents overfitting using validation data

9 of 27

Sources: Introduction to Data Mining and Introduction to Statistical Learning

Training for Classification

  • Recursive binary splitting via greedy strategy
    • Split the dataset based on a feature test that optimizes a certain criterion
  • Split criterion, e.g., information gain
    • Determine how to split the dataset
      • How to specify the feature test condition?
        • Depends on feature type
        • Depends on number of ways to split
      • How to determine the best split?
        • Impurity test, e.g., Gini Index, entropy, misclassification, etc.
    • Stopping criterion: determine when to stop splitting nodes
    • Overfitting and tree pruning
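The bullets above correspond closely to a from-scratch training loop. The sketch below, in the spirit of the banknote example later in the lecture, uses the Gini index as the split criterion and max_depth / min_rows as stopping criteria; the function names and defaults are illustrative assumptions, not a specific library's API.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(X, y):
    """Exhaustively evaluate single-feature threshold tests and return the one
    with the lowest weighted Gini impurity of the two child nodes."""
    best = None
    for j in range(len(X[0])):                       # each candidate feature
        for t in sorted(set(row[j] for row in X)):   # each candidate threshold
            left = [i for i, row in enumerate(X) if row[j] < t]
            right = [i for i, row in enumerate(X) if row[j] >= t]
            if not left or not right:
                continue
            w = (len(left) * gini([y[i] for i in left]) +
                 len(right) * gini([y[i] for i in right])) / len(y)
            if best is None or w < best[0]:
                best = (w, j, t, left, right)
    return best

def majority(y):
    """Most common class label."""
    return Counter(y).most_common(1)[0][0]

def build_tree(X, y, max_depth=3, min_rows=2, depth=0):
    """Greedy top-down induction with simple stopping criteria."""
    # Stopping criteria: pure node, too few rows, or maximum depth reached
    if len(set(y)) == 1 or len(y) < min_rows or depth >= max_depth:
        return {"label": majority(y)}
    split = best_split(X, y)
    if split is None:                                # no useful split found
        return {"label": majority(y)}
    _, j, t, left, right = split
    return {
        "feature": j,
        "threshold": t,
        "left": build_tree([X[i] for i in left], [y[i] for i in left],
                           max_depth, min_rows, depth + 1),
        "right": build_tree([X[i] for i in right], [y[i] for i in right],
                            max_depth, min_rows, depth + 1),
    }

# Tiny illustrative usage: one numeric feature, two classes
X = [[2.7], [1.3], [3.6], [7.6], [8.7]]
y = [0, 0, 0, 1, 1]
print(build_tree(X, y))   # splits at the 7.6 threshold, yielding two pure leaves
```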

10 of 27

Source: Elements of Statistical Learning

Recursive Binary Split Method

  • Iteratively stratifying or segmenting the predictor space into a number of simple regions that can be organized into a hierarchical structure
  • Prediction for a given observation typically uses the mean or mode of the training observations in the region to which it belongs
  • Splitting rules used to segment the predictor space can be summarized in a tree
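Scoring then amounts to routing an observation from the root down to a leaf (its region) and returning that leaf's stored value: the mode of the training labels for classification, or the mean of the training responses for regression. A minimal sketch using the same illustrative dict-based nodes as above:

```python
def predict(node, row):
    """Route one observation down the tree until a leaf (region) is reached;
    the leaf stores the mode (classification) or mean (regression) of the
    training observations that fell into that region."""
    while "label" not in node:                 # keep descending until a leaf
        if row[node["feature"]] < node["threshold"]:
            node = node["left"]
        else:
            node = node["right"]
    return node["label"]

# Tiny hand-built tree: predicts 1 when the single feature is >= 5.0
tree = {
    "feature": 0, "threshold": 5.0,
    "left": {"label": 0}, "right": {"label": 1},
}
print(predict(tree, [7.2]))   # -> 1
print(predict(tree, [3.1]))   # -> 0
```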

11 of 27

Cost Function: Impurity Test

  • Metrics used to determine the goodness of a feature test condition, i.e., how well it partitions the training data into purer subsets in the child nodes
  • Common metrics include:
    • Entropy
    • Gini Index
    • Classification Error
  • An impure node containing training data from multiple classes is likely to require several further levels of node expansion, which increases the depth of the tree
  • Larger trees are less desirable: they are more prone to overfitting and instability, harder to interpret, and require more training and test time

12 of 27

Source: Elements of Statistical Learning

Impurity Measures of a Classification Tree

  • For two classes, if p is the proportion in the second class:
    • Misclassification: 1-max(p, 1-p)
    • Gini Index: 2p(1-p)
    • (Cross) entropy: -(p)log(p) - (1-p)log(1-p)
  • All three measures are similar, but entropy and the Gini index are differentiable and more sensitive to changes in the node class probabilities than the misclassification rate
  • In the usual comparison plot (not shown), the entropy curve is scaled so that it can be compared directly with the Gini and misclassification measures
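These two-class formulas translate directly into code. A small sketch, assuming a base-2 logarithm for entropy (so the entropy of a 50/50 node is 1, while Gini and misclassification peak at 0.5):

```python
import math

def misclassification(p):
    """Misclassification error for a two-class node, p = proportion of class 2."""
    return 1 - max(p, 1 - p)

def gini(p):
    """Gini index for a two-class node."""
    return 2 * p * (1 - p)

def entropy(p):
    """(Cross) entropy for a two-class node, using log base 2."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.0, 0.1, 0.3, 0.5):
    print(f"p={p:.1f}  misclass={misclassification(p):.3f}  "
          f"gini={gini(p):.3f}  entropy={entropy(p):.3f}")
# All three are 0 for a pure node (p = 0) and peak at p = 0.5:
# misclassification = 0.5, Gini = 0.5, entropy = 1.0
```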

13 of 27

Source: Introduction to Data Mining

Impurity Computation: At Single Node

  • The following example illustrates how the values of the impurity measures vary as the class distribution at a node changes (node class counts not shown)

  • Node N1 has the lowest impurity value, followed by N2 and N3

14 of 27

Source: Introduction to Data Mining

Impurity Computation: Collective Child Nodes

  • Consider the following example for the loan borrower classification problem with three candidate feature test conditions: Home Owner, Marital Status, and ID

  • For a feature test condition that splits a node containing N training records into k children, the collective impurity of the child nodes is the weighted sum of the child impurities: I(children) = Σ_j [N(v_j)/N] · I(v_j), where N(v_j) is the number of records assigned to child node v_j (see the sketch below)
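The formula above is a one-line weighted sum in code. This sketch uses entropy as the impurity function; the child label lists are purely hypothetical and are not the slide's loan-borrower table.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels (log base 2)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def weighted_impurity(children_labels, impurity=entropy):
    """Collective impurity of k child nodes:
    sum over children j of N(v_j)/N * I(v_j), where N is the parent node size."""
    n = sum(len(labels) for labels in children_labels)
    return sum(len(labels) / n * impurity(labels) for labels in children_labels)

# Hypothetical two-way split of a 10-record parent (counts are illustrative only)
left = ["No"] * 4                          # pure child: entropy 0
right = ["Yes"] * 3 + ["No"] * 3           # 50/50 child: entropy 1
print(weighted_impurity([left, right]))    # 4/10 * 0 + 6/10 * 1 = 0.6
```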

15 of 27

Source: Introduction to Data Mining

Impurity Computation: Collective Child Nodes

  • The entropy measure is used for this demonstration
  • Splitting on Home Owner generates two child nodes; their weighted entropy is the sum of each child's entropy weighted by its share of the records (calculation not shown)

  • Splitting on Marital Status generates three child nodes; their weighted entropy is computed in the same way (calculation not shown)

16 of 27

Source: Introduction to Data Mining

Feature Split Test Condition: Information Gain

  • Compute impurity measure before splitting (parent node)
  • Compute impurity measure after splitting (collective weighted child nodes)
    • Compute the impurity measure of each child node
    • Weight each child node's impurity by its share of the records
  • Choose the feature test condition that produces the highest information gain:
    • Lowest weighted impurity after splitting
    • Equivalently, the largest reduction in impurity relative to the parent node

(When entropy is used as the impurity measure, information gain equals the mutual information between the class variable and the splitting variable.)
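A sketch of the gain computation itself, repeating the entropy helper so the snippet stands alone; the child label counts below are illustrative only, not the loan-borrower table from the slides.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels (log base 2)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent_labels, children_labels):
    """Entropy reduction achieved by a candidate split:
    I(parent) - sum over children of N(v_j)/N * I(v_j)."""
    n = len(parent_labels)
    after = sum(len(ch) / n * entropy(ch) for ch in children_labels)
    return entropy(parent_labels) - after

# Parent with 3 Yes / 7 No, split two candidate (hypothetical) ways
parent = ["Yes"] * 3 + ["No"] * 7                    # entropy ~= 0.881
split_a = [["No"] * 4, ["Yes"] * 3 + ["No"] * 3]     # gain ~= 0.881 - 0.600 = 0.281
split_b = [["Yes"] + ["No"] * 4, ["Yes"] * 2 + ["No"] * 3]
print(information_gain(parent, split_a) > information_gain(parent, split_b))  # True
```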

17 of 27

Source: Introduction to Data Mining

Goodness of a Feature Test Condition: Information Gain

  • Compare the degree of impurity of the parent node (before split) with the weighted degree of impurity of the child nodes (after split)
  • For this example, Home Owner and Marital Status (qualitative features) are used to demonstrate the difference in entropy before and after splitting, commonly known as information gain

  • The following calculation shows the entropy of the parent node, which contains 3 instances of Yes and 7 instances of No: Entropy(parent) = −(3/10)·log₂(3/10) − (7/10)·log₂(7/10) ≈ 0.881 (Marital Status calculation not shown)

  • The information gain for Home Owner and for Marital Status is then the parent entropy minus the corresponding weighted child entropy (calculations not shown)

18 of 27

Stopping Criterion and Tree Pruning

  • Stopping Criterion – when to stop splitting:
    • All records belong to the same class (no information gain)
    • Reached minimum rows per node (min_rows)
    • Reached maximum depth (max_depth)
  • Large trees may produce good predictions on the training dataset, but are likely to overfit the data, leading to poor real-world performance
  • Overfitting can result from high-variance trees
  • Pruning strategy to avoid overfitting (see the sketch after this list):
    • Grow a very large tree
    • Use validation data or k-fold cross-validation to estimate the prediction error of candidate subtrees
    • Prune the large tree back to the subtree with the lowest estimated test error
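One way to realize this grow-then-prune strategy is scikit-learn's cost-complexity pruning: grow a large tree, enumerate its pruning path, and keep the subtree (indexed by ccp_alpha) with the lowest validation error. A sketch; the breast-cancer dataset is used here only as a convenient stand-in for real data.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)                  # stand-in dataset
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# 1. Grow a very large tree (no depth limit)
big_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# 2. Candidate subtrees are indexed by the cost-complexity parameter ccp_alpha
alphas = big_tree.cost_complexity_pruning_path(X_tr, y_tr).ccp_alphas

# 3. Refit at each alpha and keep the subtree with the lowest validation error
best_alpha, best_err = 0.0, float("inf")
for alpha in alphas:
    alpha = max(float(alpha), 0.0)           # guard against tiny negative round-off
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_tr, y_tr)
    err = 1.0 - pruned.score(X_val, y_val)   # validation error rate
    if err < best_err:
        best_alpha, best_err = alpha, err

final_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X_tr, y_tr)
print(best_alpha, best_err)
```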

19 of 27

Source: Introduction to Statistical Learning

Preference Bias: Ockham’s Razor

The simplest consistent explanation is the best: the smallest decision tree that correctly classifies most of the training examples is the best tree.

  • Fitting and pruning a regression tree with 6-fold cross-validation to estimate the MSE of the subtrees, as sketched below:
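A hedged scikit-learn sketch of that procedure: fit a large regression tree, enumerate its cost-complexity pruning path, and use 6-fold cross-validation to estimate the MSE of each candidate subtree. The diabetes dataset stands in for the slide's data, and the alpha grid is thinned only to keep the example quick.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)                        # stand-in dataset

# Grow a large regression tree and enumerate its pruning path
big_tree = DecisionTreeRegressor(random_state=0).fit(X, y)
alphas = np.clip(big_tree.cost_complexity_pruning_path(X, y).ccp_alphas, 0, None)

# 6-fold cross-validation over the subtrees (indexed by ccp_alpha), scored by MSE
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"ccp_alpha": alphas[::10]},   # thinned grid to keep the search quick
    cv=6,
    scoring="neg_mean_squared_error",
)
search.fit(X, y)
print(search.best_params_, -search.best_score_)              # chosen alpha, CV MSE
```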

20 of 27

Source: Patrick Hall, Decision Trees, GWU

Tree Pruning via Validation Error

21 of 27

Decision Tree Model Scoring

22 of 27

Variable Importance

23 of 27

Variable Importance

24 of 27

Characteristics of Decision Trees

Advantages

  • Relatively inexpensive to construct
  • Extremely fast at classifying unknown records
  • Easy to interpret for small-sized trees
  • Robust to noise (especially when methods to avoid overfitting are employed)
  • Can easily handle redundant or irrelevant features (unless the features are interacting)
  • Accepted; understood; interpretable; few assumptions
  • Excellent for: discontinuous, nonlinear phenomena, interactions, missing data, correlated variables, and features on different scales

Disadvantages

  • Due to the greedy nature of splitting criterion, interacting features (that can distinguish between classes together but not individually) may be passed over in favor of other features that are less discriminating
  • Each decision boundary involves only a single feature
  • Tendency to overfit; many hyperparameters to tune; no parameter estimates, standard errors, or confidence limits
  • Single decision trees can be unstable
  • Usually poorer performance than neural networks on pattern recognition tasks

25 of 27

Banknote Case

Using Python

26 of 27

Source: Machine Learning Algorithms from Scratch

Training with k-fold Cross-Validation Approach

  • Gini Index
  • Create Split
    • Splitting a dataset
    • Evaluating all splits
  • Build a Tree
    • Terminal nodes
    • Recursive splitting
    • Build a tree
  • Make a Prediction
  • Banknote dataset: 5-fold cross-validation approach (~270 records in each fold)
    • evaluate_algorithm() to evaluate the algorithm with cross-validation
    • accuracy_metrics() to calculate accuracy of prediction
    • decision_tree() to manage training
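For reference, a simplified sketch of the k-fold evaluation loop those helpers implement; the function names and bodies below are my own approximation of the book's structure, not its exact code. Each row is assumed to end with its class label (as in the banknote CSV), and `algorithm` stands in for a decision_tree-style learner that takes a training set plus unlabeled test rows and returns predictions.

```python
from random import randrange

def cross_validation_split(dataset, n_folds):
    """Randomly partition the rows into n_folds folds of (near) equal size."""
    folds, pool = [], list(dataset)
    fold_size = len(dataset) // n_folds
    for _ in range(n_folds):
        folds.append([pool.pop(randrange(len(pool))) for _ in range(fold_size)])
    return folds

def accuracy(actual, predicted):
    """Fraction of correctly classified rows."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def evaluate(dataset, algorithm, n_folds=5, **params):
    """Train on k-1 folds, score on the held-out fold, repeat for every fold."""
    folds, scores = cross_validation_split(dataset, n_folds), []
    for fold in folds:
        train = [row for f in folds if f is not fold for row in f]
        test = [row[:-1] for row in fold]              # strip the class label
        actual = [row[-1] for row in fold]
        scores.append(accuracy(actual, algorithm(train, test, **params)))
    return scores
```

With n_folds=5 on the banknote data, each fold holds roughly 270 records, matching the setup described above.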

27 of 27

Reading

  • Elements of Statistical Learning
    • Section 9.2
  • Introduction to Data Mining
    • Sections 3.1 – 3.3