Lecture 8a: Decision Trees

CS109A Introduction to Data Science

Pavlos Protopapas, Kevin Rader and Chris Tanner


Outline

  • Motivation
  • Decision Trees
  • Classification Trees
  • Splitting Criteria
  • Stopping Conditions & Pruning
  • Regression Trees


Geometry of Data

Recall:

Logistic regression for building classification boundaries works best when:

  • the classes are well-separated in the feature space,
  • there is a nice geometry to the classification boundary.


Geometry of Data

Question: How about these?


Geometry of Data

Question: Or these?


Geometry of Data

Notice that in all of these datasets the classes are still well-separated in the feature space, but the decision boundaries cannot easily be described by single equations.


Interpretable Models

People in every walk of life have long been using interpretable models for differentiating between classes of objects and phenomena.


Interpretable Models (cont.)

Or in the inferential data analysis world.


Decision Trees

It turns out that the simple flow charts in our examples can be formulated as mathematical models for classification, and these models have the properties we desire; they are:

  1. interpretable by humans,
  2. capable of sufficiently complex decision boundaries,
  3. built from decision boundaries that are locally linear: each component of the decision boundary is simple to describe mathematically.


Decision Trees


The Geometry of Flow Charts

Flow charts whose graph is a tree (connected, with no cycles) represent a model called a decision tree.

Formally, a decision tree model is one in which the final outcome of the model is based on a series of comparisons of the values of predictors against threshold values.

In a graphical representation (flow chart):

    • the internal nodes of the tree represent attribute tests,
    • branching at the next level is determined by the attribute value (e.g. yes/no),
    • terminal leaf nodes represent class assignments (see the sketch below).
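A decision tree is exactly this kind of flow chart written as code. A minimal hand-written sketch (the feature names, thresholds, and class labels here are made up for illustration):

    def classify(latitude, longitude):
        # Internal node: test one predictor against a threshold value.
        if latitude < 35.0:
            # Branching determined by the outcome of the test.
            if longitude < -100.0:
                return "class A"   # terminal leaf: class assignment
            return "class B"       # terminal leaf
        return "class A"           # terminal leaf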


The Geometry of Flow Charts

Every flow chart tree corresponds to a partition of the feature space by axis-aligned lines or (hyper)planes. Conversely, every such partition can be written as a flow chart tree.


The Geometry of Flow Charts

Each comparison and branching represents splitting a region in the feature space on a single feature. Typically, at each iteration, we split once along one dimension (one predictor). Why?


Learning the Model

Given a training set, learning a decision tree model for binary classification means:

  • producing an optimal partition of the feature space with axis-aligned linear boundaries (very interpretable!),
  • when performing prediction, assigning each region the label of the largest class among the training points in that region (Bayes’ classifier); see the sketch below.
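For the prediction rule in the second bullet, a minimal sketch of the plurality vote within a single region (toy labels for illustration):

    from collections import Counter

    def region_label(y_region):
        # Predict the most common class among the training labels in the region.
        return Counter(y_region).most_common(1)[0][0]

    print(region_label(["cat", "dog", "cat", "cat"]))   # -> cat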


Learning the Model

Learning the smallest ‘optimal’ decision tree for any given set of data is NP-complete for numerous simple definitions of ‘optimal’. Instead, we will seek a reasonably good model using a greedy algorithm.

  1. Start with an empty decision tree (undivided feature space)
  2. Choose the ‘optimal’ predictor on which to split and choose the ‘optimal’ threshold value for splitting.
  3. Recurse on each new node until a stopping condition is met

Now, we need only define our splitting criterion and stopping condition.
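A sketch of that greedy recursion, with the splitting rule and stopping condition left as pluggable pieces (both are defined in the sections that follow); find_best_split is a hypothetical helper standing in for step 2:

    from collections import Counter

    def grow_tree(X, y, depth, max_depth, find_best_split):
        # find_best_split(X, y) should return (feature index j, threshold t),
        # or None if no worthwhile split exists.
        split = None if depth >= max_depth else find_best_split(X, y)
        if split is None:
            # Stopping condition met: create a leaf labeled with the majority class.
            return {"leaf": Counter(y).most_common(1)[0][0]}
        j, t = split
        mask = X[:, j] <= t
        return {"feature": j, "threshold": t,     # internal node
                "left":  grow_tree(X[mask],  y[mask],  depth + 1, max_depth, find_best_split),
                "right": grow_tree(X[~mask], y[~mask], depth + 1, max_depth, find_best_split)}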


Numerical vs Categorical Attributes

Note that the ‘compare and branch’ method by which we defined classification trees works well for numerical features.

However, if a feature is categorical (with more than two possible values), comparisons like feature < threshold do not make sense.

How can we handle this?

A simple solution is to encode the values of a categorical feature using numbers and treat this feature like a numerical variable. This is indeed what some computational libraries (e.g. sklearn) do; however, this method has drawbacks.
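A minimal sketch of this naive encoding (the color values, labels, and the particular Red/Blue/Yellow mapping are made up; the mapping itself is an arbitrary choice, which is exactly the drawback illustrated in the example below):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    colors = ["Red", "Blue", "Yellow", "Red", "Yellow", "Blue"]
    encoding = {"Red": 0, "Blue": 1, "Yellow": 2}        # arbitrary numeric encoding
    X = np.array([[encoding[c]] for c in colors], dtype=float)
    y = np.array([0, 1, 1, 0, 1, 1])                     # toy class labels

    # The tree now splits on 'color < threshold', which only makes sense
    # relative to the arbitrary ordering we imposed.
    tree = DecisionTreeClassifier(max_depth=1).fit(X, y)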


Numerical vs Categorical Attributes


Example

Suppose the feature we want to split on is color, and the values are Red, Blue and Yellow. If we encode the categories numerically as:

Red = 0, Blue = 1, Yellow = 2

then the possible non-trivial splits on color are

{{Red}, {Blue, Yellow}} and {{Red, Blue}, {Yellow}}

But if we encode the categories numerically as:

Red = 2, Blue = 0, Yellow = 1

the possible splits are

{{Blue}, {Yellow, Red}} and {{Blue, Yellow}, {Red}}

Depending on the encoding, the splits we can optimize over can be different!


Numerical vs Categorical Attributes

In practice, the effect of our choice of naive encoding of categorical variables is often negligible: models resulting from different choices of encoding will perform comparably.

In cases where you might worry about encoding, there is a more sophisticated way to numerically encode the values of categorical variables so that one can optimize over all possible partitions of the values of the variable.

This more principled encoding scheme is computationally more expensive but is implemented in a number of computational libraries (e.g. R’s randomForest).


Splitting Criteria


Optimality of Splitting

While there is no ‘correct’ way to define an optimal split, there are some common-sense guidelines for every splitting criterion:

  • the regions in the feature space should grow progressively more pure with the number of splits. That is, we should see each region ‘specialize’ towards a single class.
  • the fitness metric of a split should take a differentiable form (making optimization possible).
  • we shouldn’t end up with empty regions - regions containing no training points.


Information Theory

One way to quantify the strength of a signal in a particular region is to analyze the distribution of classes within the region. We compute the entropy of this distribution.

For a random variable X with a discrete distribution p(x), the entropy is computed by:

H(X) = - Σ_x p(x) log p(x)

Higher entropy means the distribution is uniform-like (flat histogram) and thus values sampled from it are ‘less predictable’ (all possible values are equally probable).

Lower entropy means the distribution has more defined peaks and valleys and thus values sampled from it are ‘more predictable’ (values around the peaks are more probable).
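For concreteness, a small sketch of this computation on a vector of class counts (log base 2, so entropy is measured in bits; the counts are made up):

    import numpy as np

    def entropy(counts):
        # Entropy of the class distribution given by per-class counts.
        p = np.asarray(counts, dtype=float)
        p = p[p > 0] / p.sum()                 # drop empty classes, normalize
        return -np.sum(p * np.log2(p))

    print(entropy([10, 10]))   # 1.0   (uniform: least predictable)
    print(entropy([19, 1]))    # ~0.29 (peaked: more predictable)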


Entropy

We can now try to find the predictor j and the threshold t_j that minimize the average entropy over the two regions, weighted by the population of the regions:

(N1 / N) H(R1) + (N2 / N) H(R2),

where R1 and R2 are the two regions produced by splitting on x_j < t_j, N1 and N2 are the numbers of training points falling in each region (N = N1 + N2), and H(·) is the entropy of the class distribution within a region.
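A sketch of that search: for every predictor and every candidate threshold, compute the weighted average entropy of the two resulting regions and keep the best. (Thresholds here are taken at observed feature values for simplicity; midpoints between consecutive values are also common.)

    import numpy as np

    def entropy(y):
        _, counts = np.unique(y, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def best_split(X, y):
        n, d = X.shape
        best_j, best_t, best_score = None, None, np.inf
        for j in range(d):
            for t in np.unique(X[:, j])[:-1]:              # candidate thresholds
                left, right = y[X[:, j] <= t], y[X[:, j] > t]
                # Average entropy of the two regions, weighted by their populations.
                score = (len(left) * entropy(left) + len(right) * entropy(right)) / n
                if score < best_score:
                    best_j, best_t, best_score = j, t, score
        return best_j, best_t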


Comparison of Criteria

Recall our intuitive guidelines for splitting criteria: which of the three criteria fits our guidelines best?

We have the following comparison of the values of the three criteria at different levels of purity (from 0 to 1) in a single region (for binary outcomes).
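The comparison figure is not reproduced here. Assuming the three criteria are the classification error, the Gini index, and the entropy (the standard trio), their values can be computed directly as a function of the purity p of one class in a binary region:

    import numpy as np

    p = np.linspace(0.001, 0.999, 200)   # purity of one class in a binary region
    entropy   = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
    gini      = 2 * p * (1 - p)
    misclass  = np.minimum(p, 1 - p)

    # e.g. at p = 0.8: entropy ~ 0.72, Gini = 0.32, misclassification error = 0.20,
    # so entropy penalizes an impure region the most.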


Comparison of Criteria

That entropy penalizes impurity the most is not to say that it is the best splitting criterion. For one, a model with purer leaf nodes on a training set may not perform better on the test set.

Another factor to consider is the size of the tree (i.e. model complexity) that each criterion tends to promote.

To compare different decision tree models, we need to first discuss stopping conditions.


Stopping Conditions & Pruning


Variance vs Bias

If we don’t terminate the decision tree learning algorithm manually, the tree will continue to grow until, potentially, each region defined by the model contains exactly one training point (and the model attains 100% training accuracy).

To prevent this from happening, we can simply stop the algorithm at a particular depth.

But how do we determine the appropriate depth?
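One standard answer: treat the depth as a hyperparameter and pick it by cross-validation. A sketch with sklearn on a made-up dataset (make_moons stands in for the lecture's examples):

    import numpy as np
    from sklearn.datasets import make_moons
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_moons(n_samples=500, noise=0.3, random_state=0)   # toy data

    depths = range(1, 21)
    cv_scores = [cross_val_score(DecisionTreeClassifier(max_depth=d, random_state=0),
                                 X, y, cv=5).mean()
                 for d in depths]
    best_depth = depths[int(np.argmax(cv_scores))]
    print(best_depth)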


Variance vs Bias

We make some observations about our models:

  • (High Bias) A tree of depth 4 is not a good fit for the training data - it’s unable to capture the nonlinear boundary separating the two classes.
  • (Low Bias) With an extremely high depth, we can obtain a model that correctly classifies all points on the boundary (by zig-zagging around each point).
  • (Low Variance) The tree of depth 4 is robust to slight perturbations in the training data - the square carved out by the model is stable if you move the boundary points a bit.
  • (High Variance) Trees of high depth are sensitive to perturbations in the training data, especially to changes in the boundary points.

Not surprisingly, complex trees have low bias (able to capture more complex geometry in the data) but high variance (can overfit). Complex trees are also harder to interpret and more computationally expensive to train.


Regression Tree Example


How do we decide a split here?


Learning Regression Trees

 


Regression Trees Prediction
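The formulas for these two slides did not survive extraction. As a hedged sketch of the standard CART recipe: a candidate split is scored by the total squared error around each region's mean, and each leaf predicts the mean response of the training points it contains:

    import numpy as np

    def sse(y):
        # Sum of squared errors around the region's mean prediction.
        return float(np.sum((y - y.mean()) ** 2)) if len(y) else 0.0

    def best_regression_split(x, y):
        # Best threshold on a single predictor x, minimizing total SSE of the two regions.
        best_t, best_score = None, np.inf
        for t in np.unique(x)[:-1]:
            score = sse(y[x <= t]) + sse(y[x > t])
            if score < best_score:
                best_t, best_score = t, score
        return best_t

    # Prediction: a point is routed down the tree to a leaf and receives the mean
    # of the training responses that fell into that leaf's region.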

 


Regression Tree (max_depth = 1, 2, 5, 10)
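The four figures are not reproduced here; a sketch of the same kind of experiment with sklearn, on a made-up noisy 1-D signal (not the lecture's data):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    x = np.sort(rng.uniform(0, 2 * np.pi, 200))
    y = np.sin(x) + rng.normal(scale=0.3, size=x.shape)    # noisy 1-D signal
    X = x.reshape(-1, 1)

    for depth in [1, 2, 5, 10]:
        fit = DecisionTreeRegressor(max_depth=depth).fit(X, y)
        # Training R^2 keeps rising with depth; past some point this is overfitting.
        print(depth, round(fit.score(X, y), 3))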

Stopping Conditions
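The slide's list is not reproduced in this extract; as a sketch, the usual stopping conditions map directly onto sklearn's tree hyperparameters:

    from sklearn.tree import DecisionTreeRegressor

    # Each argument is a stopping condition: a node is not split further
    # if doing so would violate the condition.
    tree = DecisionTreeRegressor(
        max_depth=5,                  # maximum depth of the tree
        min_samples_split=10,         # a node needs at least this many points to be split
        min_samples_leaf=5,           # every leaf must keep at least this many points
        min_impurity_decrease=1e-3,   # a split must improve the criterion by at least this much
    )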

 


Overfitting

The same issues arise as with classification trees: avoid overfitting by pruning or by limiting the depth of the tree, and use CV to choose.
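A sketch of one way to do this with sklearn: cost-complexity pruning, with the pruning strength chosen by cross-validation (the data is the same kind of made-up 1-D signal as above):

    import numpy as np
    from sklearn.model_selection import GridSearchCV
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(0, 2 * np.pi, (200, 1))
    y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

    # Candidate pruning strengths come from the full tree's pruning path;
    # cross-validation then picks how aggressively to prune.
    path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)
    alphas = np.clip(path.ccp_alphas, 0.0, None)           # guard against tiny negatives
    search = GridSearchCV(DecisionTreeRegressor(random_state=0),
                          {"ccp_alpha": alphas}, cv=5)
    search.fit(X, y)
    print(search.best_params_["ccp_alpha"], round(search.best_score_, 3))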
