Decision Trees
UC Berkeley Data 100 Summer 2019
Sam Lau
Learning goals:
Announcements
Decision Trees
Decision Trees, Visualized
[Figure: an example decision tree for deciding how to commute. The root splits on day of week (x1): Sat and Sun both lead to Drive. M-F leads to a split on time of day (x2): ≤10am → Drive, ≥4pm → BART, and >10am, <4pm → check Carpool? (x3), which decides between Drive and BART.]
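The tree above is just nested if/else logic. As a rough Python sketch (the function name, the 24-hour time encoding, and the assumption that carpooling means Drive are mine, read off the figure rather than taken from any course code):

    def drive_or_bart(day, hour, carpool):
        # Split on day of week (x1).
        if day in ("Sat", "Sun"):
            return "Drive"
        # Weekdays: split on time of day (x2), with hour in 24-hour format.
        if hour <= 10:
            return "Drive"
        if hour >= 16:
            return "BART"
        # Between 10am and 4pm: split on Carpool? (x3).
        return "Drive" if carpool else "BART"

    print(drive_or_bart("Tue", 12, carpool=True))   # Drive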
Decision Tree Traits
Works with both numeric and categorical data without extra preprocessing.
Easier to interpret than a linear or logistic model.
Fits complex, nonlinear boundaries without feature engineering.
Can be used for both regression and (multiclass) classification (see the sketch below).
[Figure: the commute decision tree again: split on x1 (Sun / M-F / Sat), then x2, then check x3, with Drive and BART at the leaves.]
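A minimal sketch of the last two traits, assuming scikit-learn; the toy data and variable names below are made up for illustration, not from the lecture:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

    # Toy data with a nonlinear (circular) decision boundary.
    rng = np.random.RandomState(0)
    X = rng.uniform(-1, 1, size=(500, 2))
    y_class = ((X ** 2).sum(axis=1) > 0.5).astype(int)   # binary labels
    y_reg = (X ** 2).sum(axis=1)                         # continuous target

    # Same interface for classification and regression, no feature engineering.
    clf = DecisionTreeClassifier().fit(X, y_class)
    reg = DecisionTreeRegressor().fit(X, y_reg)
    print(clf.score(X, y_class), reg.score(X, y_reg))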
Building a Decision Tree
How do we find the best split?
Worst split: 20 D, 40 B → 15 D, 30 B | 5 D, 10 B
Good split: 20 D, 40 B → 10 D, 10 B | 10 D, 30 B
Ideal split: 20 D, 40 B → 20 D | 40 B
(D = Drive, B = BART)
Entropy
Cross-entropy and entropy are closely related. Both come from information theory, a useful branch of math that examines the resolution of uncertainty.
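In the practice problems that follow, the entropy of a node is computed as S(N) = -Σ_C p_C log2(p_C), where p_C is the proportion of points in the node with class C. A small sketch (the function name entropy is mine, not from the course demo) that reproduces those numbers:

    import numpy as np

    def entropy(counts):
        """Entropy, in bits, of a node with the given class counts."""
        p = np.asarray(counts, dtype=float) / sum(counts)
        p = p[p > 0]                       # treat 0 * log2(0) as 0
        return -(p * np.log2(p)).sum()

    print(entropy([15, 30]))   # ~0.918: one-third Drive, two-thirds BART
    print(entropy([20, 20]))   # 1.0: a perfectly mixed node
    print(entropy([10, 30]))   # ~0.811: one-quarter Drive, three-quarters BART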
Practice with Entropy
Loss of a Split
Practice with Loss of Split
Worst split: 20 D, 40 B → 15 D, 30 B | 5 D, 10 B
Good split: 20 D, 40 B → 10 D, 10 B | 10 D, 30 B
Ideal split: 20 D, 40 B → 20 D | 40 B
Practice with Loss of Split
Worst split: 20 D, 40 B → 15 D, 30 B | 5 D, 10 B
S(N1) = -(1/3)log2(1/3) - (2/3)log2(2/3) = 0.918
S(N2) = -(1/3)log2(1/3) - (2/3)log2(2/3) = 0.918
L = (45 * 0.918 + 15 * 0.918) / 60 = 0.918
Practice with Loss of Split
Good split: 20 D, 40 B → 10 D, 10 B | 10 D, 30 B
S(N1) = 1
S(N2) = -(1/4)log2(1/4) - (3/4)log2(3/4) = 0.811
L = (20 * 1 + 40 * 0.811) / 60 = 0.874
Practice with Loss of Split
Ideal split: 20 D, 40 B → 20 D | 40 B
S(N1) = 0
S(N2) = 0
L = 0
Practice with Loss of Split
Worst split (15 D, 30 B | 5 D, 10 B): L = 0.918, info gain = 0
Good split (10 D, 10 B | 10 D, 30 B): L = 0.874, info gain = 0.044
Ideal split (20 D | 40 B): L = 0, info gain = 0.918
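The loss of a split used above is the size-weighted average of the child entropies, L = (n1 * S(N1) + n2 * S(N2)) / (n1 + n2), and info gain is the parent's entropy minus L. A sketch that reproduces the table (helper names are mine; entropy is the same function sketched earlier):

    import numpy as np

    def entropy(counts):
        p = np.asarray(counts, dtype=float) / sum(counts)
        p = p[p > 0]
        return -(p * np.log2(p)).sum()

    def split_loss(children):
        """Size-weighted average entropy of the child nodes."""
        sizes = [sum(c) for c in children]
        return sum(n * entropy(c) for n, c in zip(sizes, children)) / sum(sizes)

    parent = entropy([20, 40])                           # 0.918
    splits = {"worst": [[15, 30], [5, 10]],
              "good":  [[10, 10], [10, 30]],
              "ideal": [[20, 0],  [0, 40]]}
    for name, children in splits.items():
        L = split_loss(children)
        print(name, round(L, 3), "info gain:", round(parent - L, 3))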
Intuition check: Why can’t we always pick the ideal split?
(Demo)
Split Entropy Traits
Info gain for split entropy is always positive, except in two cases: when one child receives all of the data, or when both children end up with the same class proportions as the parent.
This holds for any strictly concave loss function.
Good questions to ask your TA: Why won’t we ever gain entropy via a split? What does the dotted line represent in the plot? Are there non-strictly concave loss functions?
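A quick numerical sanity check of the first question (an illustration, not a proof; the random two-class counts and the helper below are mine): for every randomly generated split, the size-weighted child entropy never exceeds the parent's entropy, so info gain is never negative.

    import numpy as np

    def entropy(counts):
        p = np.asarray(counts, dtype=float) / sum(counts)
        p = p[p > 0]
        return -(p * np.log2(p)).sum()

    rng = np.random.RandomState(42)
    for _ in range(1000):
        parent = rng.randint(1, 50, size=2)               # random D/B counts
        left = np.array([rng.randint(0, c + 1) for c in parent])
        right = parent - left
        if left.sum() == 0 or right.sum() == 0:
            continue                                       # skip empty children
        loss = (left.sum() * entropy(left) + right.sum() * entropy(right)) / parent.sum()
        assert entropy(parent) - loss >= -1e-12            # info gain never negative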
Fitting a Decision Tree
(Demo)
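The original demo notebook isn't included in this export; here is a minimal stand-in, assuming scikit-learn and its built-in iris dataset (not necessarily the data used in lecture):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier, export_text

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    tree = DecisionTreeClassifier().fit(X_train, y_train)
    print(tree.score(X_train, y_train))   # typically 1.0: the tree memorizes training data
    print(tree.score(X_test, y_test))     # lower on held-out data
    print(export_text(tree))              # the learned splits, as readable rules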
Decision Problems
A fully grown decision tree will always have 100% training accuracy (as long as no two identical points have different labels). Why?
Sadly, this means decision trees grossly overfit.
Can address this by: enforcing a max tree depth, pruning the tree, not splitting nodes that are too small, etc. (see the sketch below).
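In scikit-learn, several of these controls are constructor arguments; a sketch with arbitrary illustrative values on the same iris data as above (again an assumption, not the lecture's dataset):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    # Limit tree growth so it can no longer memorize every training point.
    regularized = DecisionTreeClassifier(
        max_depth=3,            # cap the depth of the tree
        min_samples_split=10,   # don't split a node with fewer than 10 points
        min_samples_leaf=5,     # every leaf keeps at least 5 points
    ).fit(X_train, y_train)
    print(regularized.score(X_train, y_train), regularized.score(X_test, y_test))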
Or, do some clever bootstrapping! We’ll save that for tomorrow.