1 of 27

Machine Learning I:

Decision Trees

Patrick Hall

Visiting Faculty, Department of Decision Sciences

George Washington University

2 of 27

Lecture 4 Agenda

  • Brief Introduction
  • Decision Tree Algorithm
    • Greedy Heuristics
    • Impurity & Information Gain
    • Tree Pruning
  • Banknote Case: Python Example
  • Readings

3 of 27

Where are we in the modeling lifecycle?

[Figure: modeling lifecycle diagram — Data Collection & ETL, Feature Selection & Engineering, Supervised Learning, Unsupervised Learning, Assessment & Validation, and Deployment, spanning cost-intensive to revenue-generating stages]

4 of 27

Brief Introduction

General Overview

5 of 27

Decision Tree Anatomy

  • Three node types:
    • Root node: no incoming links and zero or more outgoing links
    • Internal nodes: exactly one incoming and two or more outgoing links
    • Leaf or terminal nodes: exactly one incoming link and no outgoing links
  • Every leaf or terminal node in the decision tree is associated with a class label
  • Non-terminal nodes contain feature test conditions that are typically defined using a single feature
  • Depth: length of the longest decision-path from the tree root to a leaf node
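To make this anatomy concrete, here is a minimal sketch of one possible in-memory representation of such a tree; the dict layout and field names (feature, threshold, left, right, label) are illustrative assumptions, not any particular library's format.

```python
# Minimal dict-based tree: internal nodes hold a single-feature test,
# leaves hold a class label. The layout and names are illustrative only.
leaf_a = {"label": "class_A"}                 # terminal nodes: one incoming link,
leaf_b = {"label": "class_B"}                 # no outgoing links, one class label

root = {                                      # root node: no incoming links
    "feature": "x1",                          # test condition uses a single feature
    "threshold": 3.5,                         # e.g., x1 < 3.5 -> left, else right
    "left": leaf_a,
    "right": {                                # internal node: one incoming link,
        "feature": "x2",                      # two outgoing links
        "threshold": 7.0,
        "left": leaf_b,
        "right": leaf_a,
    },
}

def depth(node):
    """Length of the longest decision path from this node down to a leaf."""
    if "label" in node:                       # leaf
        return 0
    return 1 + max(depth(node["left"]), depth(node["right"]))

print(depth(root))                            # 2 for this example
```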

6 of 27

Decision Tree Structure: Simple Example

7 of 27

Decision Tree Algorithm

Greedy Heuristics,

Impurity, Information Gain, & Tree Pruning

8 of 27

Basic Algorithm Overview

  • Many possible decision trees can be constructed from a particular dataset
    • Rashomon Effect
  • Finding an optimal tree is computationally expensive:
    • Learning the simplest decision tree is NP-hard
  • Efficient algorithms have been developed to learn reasonably accurate decision trees in a reasonable amount of time by resorting to greedy heuristics
  • Employ a greedy strategy to grow the decision tree in a top-down fashion by making a series of locally optimal decisions
  • Splitting criterion determines which feature is chosen as the test condition and determines how the training data is distributed amongst child nodes
  • Stopping criterion determines when a node should be stopped from expanding
  • Pruning the decision tree prevents overfitting using validation data

9 of 27

Sources: Introduction to Data Mining and Introduction to Statistical Learning

Training for Classification

  • Recursive binary splitting via greedy strategy
    • Split the dataset based on a feature test that optimizes a certain criterion
  • Split criterion, e.g., information gain
    • Determine how to split the dataset
      • How to specify the feature test condition?
        • Depends on feature type
        • Depends on number of ways to split
      • How to determine the best split?
        • Impurity test, e.g., Gini Index, entropy, misclassification, etc.
    • Stopping criterion: determine when to stop splitting nodes
    • Overfitting and tree pruning
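The bullets above correspond closely to a from-scratch training loop. The sketch below, in the spirit of the banknote example later in the lecture, uses the Gini index as the split criterion and max_depth / min_rows as stopping criteria; the function names and defaults are illustrative assumptions, not a specific library's API.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(X, y):
    """Exhaustively evaluate single-feature threshold tests and return the one
    with the lowest weighted Gini impurity of the two child nodes."""
    best = None
    for j in range(len(X[0])):                       # each candidate feature
        for t in sorted(set(row[j] for row in X)):   # each candidate threshold
            left = [i for i, row in enumerate(X) if row[j] < t]
            right = [i for i, row in enumerate(X) if row[j] >= t]
            if not left or not right:
                continue
            w = (len(left) * gini([y[i] for i in left]) +
                 len(right) * gini([y[i] for i in right])) / len(y)
            if best is None or w < best[0]:
                best = (w, j, t, left, right)
    return best

def majority(y):
    """Most common class label."""
    return Counter(y).most_common(1)[0][0]

def build_tree(X, y, max_depth=3, min_rows=2, depth=0):
    """Greedy top-down induction with simple stopping criteria."""
    # Stopping criteria: pure node, too few rows, or maximum depth reached
    if len(set(y)) == 1 or len(y) < min_rows or depth >= max_depth:
        return {"label": majority(y)}
    split = best_split(X, y)
    if split is None:                                # no useful split found
        return {"label": majority(y)}
    _, j, t, left, right = split
    return {
        "feature": j,
        "threshold": t,
        "left": build_tree([X[i] for i in left], [y[i] for i in left],
                           max_depth, min_rows, depth + 1),
        "right": build_tree([X[i] for i in right], [y[i] for i in right],
                            max_depth, min_rows, depth + 1),
    }

# Tiny illustrative usage: one numeric feature, two classes
X = [[2.7], [1.3], [3.6], [7.6], [8.7]]
y = [0, 0, 0, 1, 1]
print(build_tree(X, y))   # splits at the 7.6 threshold, yielding two pure leaves
```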

10 of 27

Source: Elements of Statistical Learning

Recursive Binary Split Method

  • Iteratively stratifying or segmenting the predictor space into a number of simple regions that can be organized into a hierarchical structure
  • Prediction for a given observation typically uses the mean or mode of the training observations in the region to which it belongs
  • Splitting rules used to segment the predictor space can be summarized in a tree
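Scoring then amounts to routing an observation from the root down to a leaf (its region) and returning that leaf's stored value: the mode of the training labels for classification, or the mean of the training responses for regression. A minimal sketch using the same illustrative dict-based nodes as above:

```python
def predict(node, row):
    """Route one observation down the tree until a leaf (region) is reached;
    the leaf stores the mode (classification) or mean (regression) of the
    training observations that fell into that region."""
    while "label" not in node:                 # keep descending until a leaf
        if row[node["feature"]] < node["threshold"]:
            node = node["left"]
        else:
            node = node["right"]
    return node["label"]

# Tiny hand-built tree: predicts 1 when the single feature is >= 5.0
tree = {
    "feature": 0, "threshold": 5.0,
    "left": {"label": 0}, "right": {"label": 1},
}
print(predict(tree, [7.2]))   # -> 1
print(predict(tree, [3.1]))   # -> 0
```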

11 of 27

Cost Function: Impurity Test

  • Metrics used to determine the goodness of a feature test condition, i.e., how well it partitions the training data into purer subsets in the child nodes
  • Common metrics include:
    • Entropy
    • Gini Index
    • Classification Error
  • An impure node containing training data from multiple classes is likely to require several further levels of node expansion, which increases the depth of the tree
  • Larger trees are less desirable: they are more prone to overfitting and instability, harder to interpret, and require more training and test time

12 of 27

Source: Elements of Statistical Learning

Impurity Measures of a Classification Tree

  • For two classes, if p is the proportion in the second class:
    • Misclassification: 1-max(p, 1-p)
    • Gini Index: 2p(1-p)
    • (Cross) entropy: -(p)log(p) - (1-p)log(1-p)
  • All three measures are similar, but entropy and the Gini index are differentiable and more sensitive to changes in the node class probabilities than the misclassification rate
  • In the usual comparison plot (not shown), the entropy curve is scaled so that it can be compared directly with the Gini and misclassification measures
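These two-class formulas translate directly into code. A small sketch, assuming a base-2 logarithm for entropy (so the entropy of a 50/50 node is 1, while Gini and misclassification peak at 0.5):

```python
import math

def misclassification(p):
    """Misclassification error for a two-class node, p = proportion of class 2."""
    return 1 - max(p, 1 - p)

def gini(p):
    """Gini index for a two-class node."""
    return 2 * p * (1 - p)

def entropy(p):
    """(Cross) entropy for a two-class node, using log base 2."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.0, 0.1, 0.3, 0.5):
    print(f"p={p:.1f}  misclass={misclassification(p):.3f}  "
          f"gini={gini(p):.3f}  entropy={entropy(p):.3f}")
# All three are 0 for a pure node (p = 0) and peak at p = 0.5:
# misclassification = 0.5, Gini = 0.5, entropy = 1.0
```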

13 of 27

Source: Introduction to Data Mining

Impurity Computation: At Single Node

  • The following example illustrates how the values of the impurity measures vary as the class distribution at a node changes (node class counts not shown)

  • Node N1 has the lowest impurity value, followed by N2 and N3

14 of 27

Source: Introduction to Data Mining

Impurity Computation: Collective Child Nodes

  • Consider the following example for the loan borrower classification problem with three candidate feature test conditions: Home Owner, Marital Status, and ID

  • For a feature test condition that splits a node containing N training records into k children, the collective impurity of the child nodes is the weighted sum of the child impurities: I(children) = Σ_j [N(v_j)/N] · I(v_j), where N(v_j) is the number of records assigned to child node v_j (see the sketch below)
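The formula above is a one-line weighted sum in code. This sketch uses entropy as the impurity function; the child label lists are purely hypothetical and are not the slide's loan-borrower table.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels (log base 2)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def weighted_impurity(children_labels, impurity=entropy):
    """Collective impurity of k child nodes:
    sum over children j of N(v_j)/N * I(v_j), where N is the parent node size."""
    n = sum(len(labels) for labels in children_labels)
    return sum(len(labels) / n * impurity(labels) for labels in children_labels)

# Hypothetical two-way split of a 10-record parent (counts are illustrative only)
left = ["No"] * 4                          # pure child: entropy 0
right = ["Yes"] * 3 + ["No"] * 3           # 50/50 child: entropy 1
print(weighted_impurity([left, right]))    # 4/10 * 0 + 6/10 * 1 = 0.6
```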

15 of 27

Source: Introduction to Data Mining

Impurity Computation: Collective Child Nodes

  • The entropy measure is used for this demonstration
  • Splitting on Home Owner generates two child nodes; their weighted entropy is the sum of each child's entropy weighted by its share of the records (calculation not shown)

  • Splitting on Marital Status generates three child nodes; their weighted entropy is computed in the same way (calculation not shown)

16 of 27

Source: Introduction to Data Mining

Feature Split Test Condition: Information Gain

  • Compute impurity measure before splitting (parent node)
  • Compute impurity measure after splitting (collective weighted child nodes)
    • Compute the impurity measure of each child node
    • Weight each child node's impurity by its share of the records
  • Choose the feature test condition that produces the highest information gain:
    • Lowest weighted impurity after splitting
    • Equivalently, the largest reduction in impurity relative to the parent node

(When entropy is used as the impurity measure, information gain equals the mutual information between the class variable and the splitting variable.)
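A sketch of the gain computation itself, repeating the entropy helper so the snippet stands alone; the child label counts below are illustrative only, not the loan-borrower table from the slides.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels (log base 2)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent_labels, children_labels):
    """Entropy reduction achieved by a candidate split:
    I(parent) - sum over children of N(v_j)/N * I(v_j)."""
    n = len(parent_labels)
    after = sum(len(ch) / n * entropy(ch) for ch in children_labels)
    return entropy(parent_labels) - after

# Parent with 3 Yes / 7 No, split two candidate (hypothetical) ways
parent = ["Yes"] * 3 + ["No"] * 7                    # entropy ~= 0.881
split_a = [["No"] * 4, ["Yes"] * 3 + ["No"] * 3]     # gain ~= 0.881 - 0.600 = 0.281
split_b = [["Yes"] + ["No"] * 4, ["Yes"] * 2 + ["No"] * 3]
print(information_gain(parent, split_a) > information_gain(parent, split_b))  # True
```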

17 of 27

Source: Introduction to Data Mining

Goodness of a Feature Test Condition: Information Gain

  • Compare the degree of impurity of the parent node (before split) with the weighted degree of impurity of the child nodes (after split)
  • For this example, Home Owner and Marital Status (qualitative features) are used to demonstrate the difference in entropy before and after splitting, commonly known as information gain

  • The following calculation shows the entropy of the parent node, which contains 3 instances of Yes and 7 instances of No: Entropy(parent) = −(3/10)·log₂(3/10) − (7/10)·log₂(7/10) ≈ 0.881 (Marital Status calculation not shown)

  • The information gain for Home Owner and for Marital Status is then the parent entropy minus the corresponding weighted child entropy (calculations not shown)

18 of 27

Stopping Criterion and Tree Pruning

  • Stopping Criterion – when to stop splitting:
    • All records belong to the same class (no information gain)
    • Reached minimum rows per node (min_rows)
    • Reached maximum depth (max_depth)
  • Large trees may produce good predictions on the training dataset, but are likely to overfit the data, leading to poor real-world performance
  • Overfitting can result from high-variance trees
  • Pruning strategy to avoid overfitting (see the sketch after this list):
    • Grow a very large tree
    • Use validation data or k-fold cross-validation to estimate the prediction error of candidate subtrees
    • Prune the large tree back to the subtree with the lowest estimated test error
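One way to realize this grow-then-prune strategy is scikit-learn's cost-complexity pruning: grow a large tree, enumerate its pruning path, and keep the subtree (indexed by ccp_alpha) with the lowest validation error. A sketch; the breast-cancer dataset is used here only as a convenient stand-in for real data.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)                  # stand-in dataset
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# 1. Grow a very large tree (no depth limit)
big_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# 2. Candidate subtrees are indexed by the cost-complexity parameter ccp_alpha
alphas = big_tree.cost_complexity_pruning_path(X_tr, y_tr).ccp_alphas

# 3. Refit at each alpha and keep the subtree with the lowest validation error
best_alpha, best_err = 0.0, float("inf")
for alpha in alphas:
    alpha = max(float(alpha), 0.0)           # guard against tiny negative round-off
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_tr, y_tr)
    err = 1.0 - pruned.score(X_val, y_val)   # validation error rate
    if err < best_err:
        best_alpha, best_err = alpha, err

final_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X_tr, y_tr)
print(best_alpha, best_err)
```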

19 of 27

Source: Introduction to Statistical Learning

Preference Bias: Ockham’s Razor

The simplest consistent explanation is the best: the smallest decision tree that correctly classifies most of the training examples is the best tree.

  • Fitting and pruning a regression tree with 6-fold cross-validation to estimate the MSE of the subtrees, as sketched below:
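A hedged scikit-learn sketch of that procedure: fit a large regression tree, enumerate its cost-complexity pruning path, and use 6-fold cross-validation to estimate the MSE of each candidate subtree. The diabetes dataset stands in for the slide's data, and the alpha grid is thinned only to keep the example quick.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)                        # stand-in dataset

# Grow a large regression tree and enumerate its pruning path
big_tree = DecisionTreeRegressor(random_state=0).fit(X, y)
alphas = np.clip(big_tree.cost_complexity_pruning_path(X, y).ccp_alphas, 0, None)

# 6-fold cross-validation over the subtrees (indexed by ccp_alpha), scored by MSE
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"ccp_alpha": alphas[::10]},   # thinned grid to keep the search quick
    cv=6,
    scoring="neg_mean_squared_error",
)
search.fit(X, y)
print(search.best_params_, -search.best_score_)              # chosen alpha, CV MSE
```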

20 of 27

Source: Patrick Hall, Decision Trees, GWU

Tree Pruning via Validation Error

21 of 27

Decision Tree Model Scoring

22 of 27

Variable Importance

23 of 27

Variable Importance

24 of 27

Characteristics of Decision Trees

Advantages

  • Relatively inexpensive to construct
  • Extremely fast at classifying unknown records
  • Easy to interpret for small-sized trees
  • Robust to noise (especially when methods to avoid overfitting are employed)
  • Can easily handle redundant or irrelevant features (unless the features are interacting)
  • Accepted; understood; interpretable; few assumptions
  • Excellent for: discontinuous, nonlinear phenomena, interactions, missing data, correlated variables, and features on different scales

Disadvantages

  • Due to the greedy nature of splitting criterion, interacting features (that can distinguish between classes together but not individually) may be passed over in favor of other features that are less discriminating
  • Each decision boundary involves only a single feature
  • Tendency to overfit; many hyperparameters to tune; no parameter estimates, standard errors, or confidence limits
  • Single decision trees can be unstable
  • Usually poorer performance than neural networks on pattern recognition tasks

25 of 27

Banknote Case

Using Python

26 of 27

Source: Machine Learning Algorithms from Scratch

Training with k-fold Cross-Validation Approach

  • Gini Index
  • Create Split
    • Splitting a dataset
    • Evaluating all splits
  • Build a Tree
    • Terminal nodes
    • Recursive splitting
    • Build a tree
  • Make a Prediction
  • Banknote dataset: 5-fold cross-validation approach (~270 records in each fold)
    • evaluate_algorithm() to evaluate the algorithm with cross-validation
    • accuracy_metrics() to calculate accuracy of prediction
    • decision_tree() to manage training
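For reference, a simplified sketch of the k-fold evaluation loop those helpers implement; the function names and bodies below are my own approximation of the book's structure, not its exact code. Each row is assumed to end with its class label (as in the banknote CSV), and `algorithm` stands in for a decision_tree-style learner that takes a training set plus unlabeled test rows and returns predictions.

```python
from random import randrange

def cross_validation_split(dataset, n_folds):
    """Randomly partition the rows into n_folds folds of (near) equal size."""
    folds, pool = [], list(dataset)
    fold_size = len(dataset) // n_folds
    for _ in range(n_folds):
        folds.append([pool.pop(randrange(len(pool))) for _ in range(fold_size)])
    return folds

def accuracy(actual, predicted):
    """Fraction of correctly classified rows."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def evaluate(dataset, algorithm, n_folds=5, **params):
    """Train on k-1 folds, score on the held-out fold, repeat for every fold."""
    folds, scores = cross_validation_split(dataset, n_folds), []
    for fold in folds:
        train = [row for f in folds if f is not fold for row in f]
        test = [row[:-1] for row in fold]              # strip the class label
        actual = [row[-1] for row in fold]
        scores.append(accuracy(actual, algorithm(train, test, **params)))
    return scores
```

With n_folds=5 on the banknote data, each fold holds roughly 270 records, matching the setup described above.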

27 of 27

Reading

  • Elements of Statistical Learning
    • Section 9.2
  • Introduction to Data Mining
    • Sections 3.1 – 3.3