1 of 20

Decision Trees

UC Berkeley Data 100 Summer 2019

Sam Lau

Learning goals:

  • Understand the benefits and drawbacks of decision trees compared to the models we’ve seen so far.
  • Learn the algorithm for fitting a decision tree.
  • Develop intuition for entropy.

2 of 20

Break!

Fill out Attendance:

http://bit.ly/at-d100

3 of 20

Announcements

  • Project 2 due today!
  • Project 3 out today, due Tues
  • Final next Thurs, Aug 15 9:30am-12:30pm in 10 Evans.
    • Can bring two handwritten double-sided cheat sheets.
    • Bring pencil and eraser (not pen).
    • Will provide midterm reference sheet again.

4 of 20

Decision Trees

5 of 20

Decision Trees

  • Decision trees are a useful nonlinear model used for both classification and regression.
  • Intuition: Learn a set of binary rules rather than a set of coefficients for a linear model (see the sketch below).
  • E.g.: Is it faster to drive to SF or take BART?
    • If weekend, drive.
    • If ≤10 am, take BART.
    • If I can use the carpool lane, drive.
    • Etc.
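To make the intuition concrete, here is a minimal sketch of such rules as nested if/else statements; the function name, thresholds, and the final fallback are illustrative choices based on the bullets above, not a fitted tree.

def drive_or_bart(day_of_week, hour, can_use_carpool_lane):
    """Hand-written decision rules for the hypothetical drive-vs-BART example."""
    if day_of_week in ("Sat", "Sun"):    # weekend -> drive
        return "Drive"
    if hour <= 10:                       # weekday, leaving by 10am -> BART
        return "BART"
    if can_use_carpool_lane:             # carpool lane available -> drive
        return "Drive"
    return "BART"                        # arbitrary default for the remaining cases

print(drive_or_bart("Mon", 9, False))    # BART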

6 of 20

Decision Trees, Visualized

[Figure: the drive-vs-BART rules drawn as a decision tree. The root splits on Day of week (x1): Sat and Sun lead straight to Drive, while M-F leads to a split on Time of day (x2) into ≤10am, >10am–<4pm, and ≥4pm branches; those branches end in Drive or BART leaves, or in a further check on Carpool? (x3).]
7 of 20

Decision Tree Traits

  • Works with both numeric and categorical data w/o extra work.
  • Easier to interpret compared to linear/logistic models.
  • Fits complex, nonlinear boundaries w/o feature engineering.
  • Can be used for both regression and (multiclass) classification.

[Figure: the same drive-vs-BART decision tree from the previous slide.]

8 of 20

Building a Decision Tree

  • Every branch in a decision tree splits the training data into nodes. A node is pure if all of its points have the same label.
  • Start with all training data in one node. Repeat the following until all nodes are pure:
    • Pick the best feature j and the best split value θ.
      • E.g., j = 2 for Hour of Day and θ = 10 (for 10am)
    • Split the data into two nodes (one for xj < θ, one for xj ≥ θ); see the sketch below.
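A minimal sketch of the splitting step, assuming the node's data is a NumPy feature matrix X and label array y (the helper name is illustrative):

import numpy as np

def split_node(X, y, j, theta):
    """Split one node's data on feature j at threshold theta."""
    left = X[:, j] < theta                            # boolean mask for the left child
    return (X[left], y[left]), (X[~left], y[~left])

# Example: split on feature j=2 (Hour of Day) at theta=10 (10am).
# (X1, y1), (X2, y2) = split_node(X, y, j=2, theta=10)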

9 of 20

How do we find the best split?

  • Intuition: A good split picks a feature that moves us closer to a clean partition of the training labels.
  • As usual, we will define a loss function to minimize.

[Figure: three candidate splits of a parent node containing 20 D and 40 B points]

  • Worst split: 20 D, 40 B → (15 D, 30 B) and (5 D, 10 B)
  • Good split: 20 D, 40 B → (10 D, 10 B) and (10 D, 30 B)
  • Ideal split: 20 D, 40 B → (20 D) and (40 B)

10 of 20

Entropy

  • Entropy measures how “disorderly” a node is. Better nodes have low entropy.
  • IOW: Suppose we pick a random point from a node. Low entropy means we are quite sure what that point’s label will be.

Cross-entropy and entropy are closely related. Both come from information theory, a useful branch of math that examines the resolution of uncertainty.
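For reference, the definition behind the practice on the next slide is the standard one: the entropy of a node N whose labels C appear with proportions pC is

S(N) = -Σ_C pC log2(pC)

with the convention that a term with pC = 0 contributes 0.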

11 of 20

Practice with Entropy

  • Check that:
    • For a pure node, S = 0.
    • For a node with an equal number of two labels, S = 1.
    • For a node with k points, each with a different label, S = log2 k.
  • S = -(1)(log2 1) = 0.
  • S = -(0.5)(log2 0.5) - (0.5)(log2 0.5) = 1.
  • S = -(1/k)(log2 1/k) * k = -log2(1/k) = log2 k.
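A small sketch that checks these three cases numerically; the entropy helper below is illustrative, not the course's provided code.

import numpy as np

def entropy(labels):
    """Entropy S of a node, computed from the node's labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()             # proportion of each label
    return -np.sum(p * np.log2(p))        # zero-count labels never appear here

print(entropy(["D"] * 10))                # pure node -> 0.0
print(entropy(["D"] * 5 + ["B"] * 5))     # equal mix of two labels -> 1.0
print(entropy(list(range(8))))            # 8 points, 8 distinct labels -> log2(8) = 3.0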

12 of 20

Loss of a Split

  • Parent node N has K points.
  • Child nodes N1 and N2 have k1 and k2 points (k1 + k2 = K).
  • Loss of split = split entropy = weighted average entropy of N1 and N2:
    L = (k1 · S(N1) + k2 · S(N2)) / K
  • IOW: Every time we grow the tree, compute all possible splits. Then pick the one that gives the lowest weighted average entropy (see the sketch below).
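A minimal sketch of this loss and of the search for the best split, reusing the entropy helper from the previous slide and assuming NumPy arrays X (features) and y (labels); the function names are illustrative.

import numpy as np

def split_loss(y_left, y_right):
    """Weighted average entropy of the two child nodes."""
    k1, k2 = len(y_left), len(y_right)
    return (k1 * entropy(y_left) + k2 * entropy(y_right)) / (k1 + k2)

def best_split(X, y):
    """Try every feature j and candidate threshold theta; keep the lowest loss."""
    best = (np.inf, None, None)                 # (loss, j, theta)
    for j in range(X.shape[1]):
        for theta in np.unique(X[:, j]):
            mask = X[:, j] < theta
            if mask.all() or not mask.any():    # skip splits that leave a child empty
                continue
            loss = split_loss(y[mask], y[~mask])
            if loss < best[0]:
                best = (loss, j, theta)
    return best                                 # lowest loss, with its j and theta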

13 of 20

Practice with Loss of Split

  • Find the loss for each of the splits shown below.
  • Then find the information gain: S(N) - entropy of split.

[Figure: the three candidate splits again]

  • 20 D, 40 B → (15 D, 30 B) and (5 D, 10 B)
  • 20 D, 40 B → (10 D, 10 B) and (10 D, 30 B)
  • 20 D, 40 B → (20 D) and (40 B)

14 of 20

Practice with Loss of Split

Split: 20 D, 40 B → N1 = (15 D, 30 B), N2 = (5 D, 10 B)

S(N1) = -(1/3)log2(1/3) - (2/3)log2(2/3) = 0.918

S(N2) = -(1/3)log2(1/3) - (2/3)log2(2/3) = 0.918

L = (45 * 0.918 + 15 * 0.918) / 60 = 0.918

15 of 20

Practice with Loss of Split

Split: 20 D, 40 B → N1 = (10 D, 10 B), N2 = (10 D, 30 B)

S(N1) = 1

S(N2) = -(1/4)log2(1/4) - (3/4)log2(3/4) = 0.811

L = (20 * 1 + 40 * 0.811) / 60 = 0.874

16 of 20

Practice with Loss of Split

Split: 20 D, 40 B → N1 = (20 D), N2 = (40 B)

S(N1) = 0

S(N2) = 0

L = 0

17 of 20

Practice with Loss of Split

  • Worst split: 20 D, 40 B → (15 D, 30 B) and (5 D, 10 B). L = 0.918, info gain = 0.
  • Good split: 20 D, 40 B → (10 D, 10 B) and (10 D, 30 B). L = 0.874, info gain = 0.044.
  • Ideal split: 20 D, 40 B → (20 D) and (40 B). L = 0, info gain = 0.918.

Intuition check: Why can’t we always pick the ideal split?

(Demo)

18 of 20

Split Entropy Traits

Info gain for split entropy is always positive, except in two cases where it is zero:

  1. The split puts all points in one child node.
  2. The child nodes have the same proportion pC of each class C as the parent.

This is true for any strictly concave loss function.

Good questions to ask your TA: Why won’t we ever gain entropy via a split? What does the dotted line represent in the plot? Are there non-strictly concave loss functions?
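A quick numeric check of case 2, reusing the illustrative entropy and split_loss helpers sketched earlier: a split whose children keep the parent's class proportions gains nothing, while a split that changes the proportions gains something.

parent = ["D"] * 20 + ["B"] * 40

# Case 2: both children keep the parent's 1:2 ratio.
same_ratio_gain = entropy(parent) - split_loss(["D"] * 10 + ["B"] * 20,
                                               ["D"] * 10 + ["B"] * 20)
# A split that changes the proportions.
changed_gain = entropy(parent) - split_loss(["D"] * 10 + ["B"] * 10,
                                            ["D"] * 10 + ["B"] * 30)

print(same_ratio_gain)   # ≈ 0 (zero up to floating-point error)
print(changed_gain)      # ≈ 0.044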

19 of 20

Fitting Decision Tree

  • Start with all training data in one node. Repeat the following until all nodes are pure:
    • Pick an impure node.
    • Find the feature j and value θ that minimize the loss of the split.
    • Split the node into two child nodes (one for xj < θ, one for xj ≥ θ); see the sketch below.
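A minimal sketch of this greedy procedure, written recursively and reusing the illustrative best_split helper from the loss-of-a-split slide; it assumes X is a NumPy feature matrix and y a label array (this is not the demo's code).

import numpy as np

def fit_tree(X, y):
    """Recursively grow a decision tree; returns nested dicts with a label at each leaf."""
    if len(np.unique(y)) == 1:            # node is pure -> make it a leaf
        return y[0]
    loss, j, theta = best_split(X, y)
    if j is None:                         # no valid split left -> majority-label leaf
        labels, counts = np.unique(y, return_counts=True)
        return labels[np.argmax(counts)]
    mask = X[:, j] < theta
    return {
        "feature": j,
        "threshold": theta,
        "left": fit_tree(X[mask], y[mask]),     # points with x_j < theta
        "right": fit_tree(X[~mask], y[~mask]),  # points with x_j >= theta
    }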

(Demo)

20 of 20

Decision Problems

A decision tree will always have 100% training accuracy. Why?

Sadly, this means decision trees grossly overfit.

Can address this by enforcing a max tree depth, pruning the tree, refusing to split nodes that are too small, etc. (see the scikit-learn sketch below).
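For instance, scikit-learn's DecisionTreeClassifier exposes these controls directly; the parameter values below are arbitrary illustrations, not recommended settings.

from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(
    criterion="entropy",       # use the entropy-based split loss from this lecture
    max_depth=4,               # enforce a maximum tree depth
    min_samples_split=10,      # don't split nodes with fewer than 10 points
    ccp_alpha=0.01,            # cost-complexity pruning strength
)
# model.fit(X_train, y_train)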

Or, do some clever bootstrapping! We’ll save that for tomorrow.