Lecture 8a: Decision Trees
CS109A Introduction to Data Science
Pavlos Protopapas, Kevin Rader and Chris Tanner
Outline
Geometry of Data
Recall: logistic regression for building classification boundaries works best when:
Geometry of Data
Question: How about these?
Geometry of Data
Question: Or these?
Geometry of Data
Notice that in all of the datasets the classes are still well-separated in the feature space, but the decision boundaries cannot easily be described by single equations:
Interpretable Models
People in every walk of life have long used interpretable models to differentiate between classes of objects and phenomena:
Interpretable Models (cont.)
Or in the [inferential] data analysis world:
Decision Trees
It turns out that the simple flow charts in our examples can be formulated as mathematical models for classification and these models have the properties we desire; they are:
The Geometry of Flow Charts
A flow chart whose graph is a tree (connected, with no cycles) represents a model called a decision tree.
Formally, a decision tree model is one in which the final outcome of the model is based on a series of comparisons of the values of predictors against threshold values.
In a graphical representation (flow chart),
The Geometry of Flow Charts
Every flow chart tree corresponds to a partition of the feature space by axis-aligned lines or (hyper)planes. Conversely, every such partition can be written as a flow chart tree.
The Geometry of Flow Charts
Each comparison and branching represents splitting a region in the feature space on a single feature. Typically, at each iteration, we split once along one dimension (one predictor). Why?
Learning the Model
Given a training set, learning a decision tree model for binary classification means:
Learning the Model
Learning the smallest ‘optimal’ decision tree for any given set of data is NP-complete for numerous simple definitions of ‘optimal’. Instead, we will seek a reasonably good model using a greedy algorithm.
Now, we need only define our splitting criterion and stopping condition.
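A minimal sketch of this greedy, top-down procedure in Python (the names grow_tree and criterion, and the specific stopping thresholds, are illustrative assumptions rather than the lecture's notation; the splitting criterion itself is defined in the next sections):

```python
import numpy as np

def grow_tree(X, y, criterion, depth=0, max_depth=3, min_points=5):
    # Greedy, top-down tree construction. X is an (n, p) array of numerical
    # predictors, y holds integer class labels, and criterion(y_left, y_right)
    # scores a candidate split (lower is better).

    # Stopping conditions: region is pure, too small, or the tree is deep enough.
    if depth == max_depth or len(y) <= min_points or len(np.unique(y)) == 1:
        return {"leaf": True, "prediction": np.bincount(y).argmax()}

    # Greedy step: scan every (predictor, threshold) pair and keep the best one.
    best_j, best_t, best_score = None, None, np.inf
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:          # thresholds between observed values
            left = X[:, j] <= t
            score = criterion(y[left], y[~left])
            if score < best_score:
                best_j, best_t, best_score = j, t, score

    if best_j is None:                             # no valid split exists: stop here
        return {"leaf": True, "prediction": np.bincount(y).argmax()}

    left = X[:, best_j] <= best_t
    return {"leaf": False, "feature": best_j, "threshold": best_t,
            "left":  grow_tree(X[left],  y[left],  criterion, depth + 1, max_depth, min_points),
            "right": grow_tree(X[~left], y[~left], criterion, depth + 1, max_depth, min_points)}
```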
Numerical vs Categorical Attributes
Note that the ‘compare and branch’ method by which we defined classification trees works well for numerical features.
However, if a feature is categorical (with more than two possible values), a comparison like feature < threshold does not make sense.
How can we handle this?
A simple solution is to encode the values of a categorical feature using numbers and treat this feature like a numerical variable. This is indeed what some computational libraries (e.g. sklearn) do; however, this method has drawbacks.
Numerical vs Categorical Attributes
Example
Suppose the feature we want to split on is color, and the values are: Red, Blue and Yellow. If we encode the categories numerically as:
Red = 0, Blue = 1, Yellow = 2
Then the possible non-trivial splits on color are
{{Red}, {Blue, Yellow}}    {{Red, Blue}, {Yellow}}
But if we encode the categories numerically as:
Red = 2, Blue = 0, Yellow = 1
The possible splits are
{{Blue}, {Yellow, Red}}    {{Blue, Yellow}, {Red}}
Depending on the encoding, the splits we can optimize over can be different!
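The same point can be checked in a few lines of Python (threshold_splits is a purely illustrative helper, not a library function):

```python
def threshold_splits(encoding):
    # All non-trivial splits of the form {code <= t} vs {code > t}, expressed
    # back in terms of the original category names.
    ordered = sorted(encoding, key=encoding.get)   # categories sorted by numeric code
    return [(set(ordered[:i + 1]), set(ordered[i + 1:]))
            for i in range(len(ordered) - 1)]

print(threshold_splits({"Red": 0, "Blue": 1, "Yellow": 2}))
# -> {Red} vs {Blue, Yellow}, and {Red, Blue} vs {Yellow}

print(threshold_splits({"Red": 2, "Blue": 0, "Yellow": 1}))
# -> {Blue} vs {Yellow, Red}, and {Blue, Yellow} vs {Red}
```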
Numerical vs Categorical Attributes
In practice, the effect of our choice of naive encoding of categorical variables is often negligible: models resulting from different choices of encoding will perform comparably.
In cases where you might worry about encoding, there is a more sophisticated way to numerically encode the values of categorical variables so that one can optimize over all possible partitions of the values of the variable.
This more principled encoding scheme is computationally more expensive but is implemented in a number of computational libraries (e.g. R’s randomForest).
Splitting Criteria
Optimality of Splitting
While there is no ‘correct’ way to define an optimal split, there are some common-sense guidelines for every splitting criterion:
Information Theory
One way to quantify the strength of a signal in a particular region is to analyze the distribution of classes within the region. We compute the entropy of this distribution.
For a random variable with a discrete distribution, the entropy is computed by:
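H(X) = -\sum_{i=1}^{K} p_i \log_2 p_i

where p_i is the probability of the i-th of the K possible values (with the convention 0 \log 0 = 0; any logarithm base may be used, and base 2 gives entropy in bits).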
Higher entropy means the distribution is uniform-like (flat histogram) and thus values sampled from it are ‘less predictable’ (all possible values are equally probable).
Lower entropy means the distribution has more defined peaks and valleys and thus values sampled from it are ‘more predictable’ (values around the peaks are more probable).
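A quick numerical illustration of this intuition (the distributions below are made up for illustration):

```python
import numpy as np

def entropy(p):
    # Entropy in bits of a discrete distribution p (zero entries contribute 0).
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(entropy([0.25, 0.25, 0.25, 0.25]))  # uniform: 2.0 bits -- least predictable
print(entropy([0.97, 0.01, 0.01, 0.01]))  # peaked: ~0.24 bits -- quite predictable
print(entropy([1.00, 0.00, 0.00, 0.00]))  # single spike: 0 bits -- fully predictable
```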
Entropy
We can now try to find the predictor j and the threshold t_j that minimize the average entropy over the two regions, weighted by the populations of the regions:
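\min_{j,\, t_j} \left[ \frac{N_1}{N} H(R_1) + \frac{N_2}{N} H(R_2) \right]

where the candidate split on predictor j at threshold t_j divides the N training points in the current region into R_1 and R_2 containing N_1 and N_2 points respectively, and H(R_i) is the entropy of the class distribution within R_i.

Written as a criterion that could plug into the grow_tree sketch from earlier (the name weighted_avg_entropy is illustrative):

```python
import numpy as np

def weighted_avg_entropy(y_left, y_right):
    # Average entropy of the two candidate regions, weighted by their populations.
    def H(y):
        _, counts = np.unique(y, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))
    n = len(y_left) + len(y_right)
    return (len(y_left) / n) * H(y_left) + (len(y_right) / n) * H(y_right)
```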
Comparison of Criteria
Recall our intuitive guidelines for splitting criteria: which of the three criteria fits these guidelines best?
We have the following comparison of the values of the three criteria at different levels of purity (from 0 to 1) in a single region (for binary outcomes).
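A few lines of Python reproduce this comparison, assuming the three criteria in question are the classification (misclassification) error, the Gini index, and entropy:

```python
import numpy as np
import matplotlib.pyplot as plt

# Value of each splitting criterion in a region where a fraction p of the
# points belong to class 1 (binary outcome), as p ranges from 0 to 1.
p = np.linspace(0.001, 0.999, 200)

misclassification = np.minimum(p, 1 - p)
gini = 2 * p * (1 - p)
entropy = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

plt.plot(p, misclassification, label="classification error")
plt.plot(p, gini, label="Gini index")
plt.plot(p, entropy, label="entropy")
plt.xlabel("fraction of points in class 1")
plt.ylabel("criterion value")
plt.legend()
plt.show()
```

All three are maximized at p = 0.5 (maximum impurity) and vanish at p = 0 and p = 1, with entropy taking the largest values in between.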
Comparison of Criteria
That entropy penalizes impurity the most is not to say that it is the best splitting criterion. For one, a model with purer leaf nodes on a training set may not perform better on the test set.
Another factor to consider is the size of the tree (i.e. model complexity) that each criterion tends to promote.
To compare different decision tree models, we need to first discuss stopping conditions.
Stopping Conditions & Pruning
Variance vs Bias
If we don’t terminate the decision tree learning algorithm manually, the tree will continue to grow until, in the extreme, each region defined by the model contains exactly one training point (and the model attains 100% training accuracy).
To prevent this from happening, we can simply stop the algorithm at a particular depth.
But how do we determine the appropriate depth?
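One common answer is to treat the depth as a tuning parameter and choose it by cross-validation. A sketch using sklearn on a synthetic stand-in dataset (the dataset and the range of candidate depths are assumptions for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification   # synthetic stand-in dataset
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Estimate out-of-sample accuracy for each candidate depth with 5-fold CV,
# then keep the depth with the best cross-validated score.
depths = range(1, 16)
cv_scores = [cross_val_score(DecisionTreeClassifier(max_depth=d, random_state=0),
                             X, y, cv=5).mean()
             for d in depths]
best_depth = depths[int(np.argmax(cv_scores))]
print(best_depth, max(cv_scores))
```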
Variance vs Bias
We make some observations about our models:
Not surprisingly, complex trees have low bias (able to capture more complex geometry in the data) but high variance (can overfit). Complex trees are also harder to interpret and more computationally expensive to train.
Regression Tree Example
How do we decide a split here?
Learning Regression Trees
Regression Trees Prediction
Regression Tree (max_depth = 1)
Regression Tree (max_depth = 2)
Regression Tree (max_depth = 5)
Regression Tree (max_depth = 10)
Stopping Conditions
Overfitting
Regression trees have the same overfitting issues as classification trees: avoid overfitting by pruning or by limiting the depth of the tree, using cross-validation (CV) to tune.
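As one concrete option, sklearn supports cost-complexity pruning of a fully grown tree, with the pruning strength chosen by cross-validation. A sketch for a regression tree on a synthetic stand-in dataset (the dataset and the thinning of the alpha grid are assumptions for illustration):

```python
import numpy as np
from sklearn.datasets import make_regression        # synthetic stand-in dataset
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=400, n_features=5, noise=10.0, random_state=0)

# Grow a deep tree once to get the candidate pruning strengths (ccp_alphas),
# then use 5-fold CV (R^2 by default) to decide how aggressively to prune.
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)
alphas = path.ccp_alphas[::5]                        # thin the grid to keep CV quick

cv_r2 = [cross_val_score(DecisionTreeRegressor(ccp_alpha=a, random_state=0),
                         X, y, cv=5).mean()
         for a in alphas]
best_alpha = alphas[int(np.argmax(cv_r2))]
pruned_tree = DecisionTreeRegressor(ccp_alpha=best_alpha, random_state=0).fit(X, y)
```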