�Decision Trees
Dr. Dinesh Kumar Vishwakarma
Professor,
Department of Information Technology,
Delhi Technological University, Delhi-110042
dinesh@dtu.ac.in
http://www.dtu.ac.in/Web/Departments/InformationTechnology/faculty/dkvishwakarma.php
Introduction
Applications
Advantages
Decision Tree: DT
Hungary
Yes
No
Have ₹ 100
Sleep
Restaurant
Yes
Apple
No
A decision tree is a flowchart-like structure in which each internal node represents a "test" on an attribute (e.g. whether a coin flip comes up heads or tails), each branch represents the outcome of the test, and each leaf node represents a class label (decision taken after computing all attributes)
E.g.
Call Centre
Root node
Leaf Node
Decision Node
Decision Tree: DT
categorical
categorical
continuous
class
Refund
MarSt
TaxInc
YES
NO
NO
NO
Yes
No
Married
Single, Divorced
< 80K
> 80K
Splitting Attributes
Training Data
Model: Decision Tree
Another Example of Decision Tree
categorical
categorical
continuous
class
MarSt
Refund
TaxInc
YES
NO
NO
NO
Yes
No
Married
Single, Divorced
< 80K
> 80K
There could be more than one tree that fits the same data!
Decision Tree Classification Task
Decision Tree
Apply Model to Test Data
Refund
MarSt
TaxInc
YES
NO
NO
NO
Yes
No
Married
Single, Divorced
< 80K
> 80K
Test Data
Start from the root of tree.
Apply Model to Test Data
Refund
MarSt
TaxInc
YES
NO
NO
NO
Yes
No
Married
Single, Divorced
< 80K
> 80K
Test Data
Apply Model to Test Data
Refund
MarSt
TaxInc
YES
NO
NO
NO
Yes
No
Married
Single, Divorced
< 80K
> 80K
Test Data
Apply Model to Test Data
Refund
MarSt
TaxInc
YES
NO
NO
NO
Yes
No
Married
Single, Divorced
< 80K
> 80K
Test Data
Apply Model to Test Data
Refund
MarSt
TaxInc
YES
NO
NO
NO
Yes
No
Married
Single, Divorced
< 80K
> 80K
Test Data
Apply Model to Test Data
Refund
MarSt
TaxInc
YES
NO
NO
NO
Yes
No
Married
Single, Divorced
< 80K
> 80K
Test Data
Assign Cheat to “No”
Decision Tree Classification Task
Decision Tree
Decision Tree Induction
Hunt’s Algorithm
Dt
?
Hunt’s Algorithm
Don’t
Cheat
Refund
Don’t
Cheat
Don’t
Cheat
Yes
No
Refund
Don’t
Cheat
Yes
No
Marital
Status
Don’t
Cheat
Cheat
Single,
Divorced
Married
Taxable
Income
Don’t
Cheat
< 80K
>= 80K
Refund
Don’t
Cheat
Yes
No
Marital
Status
Don’t
Cheat
Cheat
Single,
Divorced
Married
Tree Induction
Tree Induction
How to Specify Test Condition?
Splitting Based on Binary Attributes
Splitting Based on Nominal Attributes
CarType
Family
Sports
Luxury
CarType
{Family, �Luxury}
{Sports}
CarType
{Sports, Luxury}
{Family}
OR
Splitting Based on Ordinal Attributes
Size
S
M
L
Size
{M, L}
{S}
Size
{S, M}
{L, XL}
OR
Size
{S, L}
{M, XL}
Not allowed
Splitting Based on Continuous Attributes
Splitting Based on Continuous Attributes…
Tree Induction
How to determine the Best Split
Before Splitting: 10 records of class 0,� 10 records of class 1
Which test condition is the best?
How to determine the Best Split
Non-homogeneous,
High degree of impurity
Homogeneous,
Low degree of impurity
Measures of Node Impurity
Concept of Entropy
More organized or Less organized or
ordered (less probable) disordered (more probable)
More ordered Less ordered
less entropy higher entropy
An Open Challenge!
Two sheets showing the tabulation of marks obtained in a course are shown.
Which tabulation of marks shows the “good” performance of the class?
How you can measure the same?
Roll No. | Assignment | Project | Mid-Sem | End-Sem |
12BT3FP06 | 89 | 99 | 56 | 91 |
10IM30013 | 95 | 98 | 55 | 93 |
12CE31005 | 98 | 96 | 58 | 97 |
12EC35015 | 93 | 95 | 54 | 99 |
12GG2005 | 90 | 91 | 53 | 98 |
12MI33006 | 91 | 93 | 57 | 97 |
13AG36001 | 96 | 94 | 58 | 95 |
13EE10009 | 92 | 96 | 56 | 96 |
13MA20012 | 88 | 98 | 59 | 96 |
14CS30017 | 94 | 90 | 60 | 94 |
14ME10067 | 90 | 92 | 58 | 95 |
14MT10038 | 99 | 89 | 55 | 93 |
Roll No. | Assignment | Project | Mid-Sem | End-Sem |
12BT3FP06 | 19 | 59 | 16 | 71 |
10IM30013 | 37 | 38 | 25 | 83 |
12CE31005 | 38 | 16 | 48 | 97 |
12EC35015 | 23 | 95 | 54 | 19 |
12GG2005 | 40 | 71 | 43 | 28 |
12MI33006 | 61 | 93 | 47 | 97 |
13AG36001 | 26 | 64 | 48 | 75 |
13EE10009 | 92 | 46 | 56 | 56 |
13MA20012 | 88 | 58 | 59 | 66 |
14CS30017 | 74 | 20 | 60 | 44 |
14ME10067 | 50 | 42 | 38 | 35 |
14MT10038 | 29 | 69 | 25 | 33 |
Entropy and its Meaning
Entropy in Information Theory
Measure of Information Content
Entropy Calculations
Entropy Calculations…
Entropy of a Training Set
Consider a dataset of OTPH as shown in the following table with total 24 instances in it.
Age | Eye sight | Astigmatic | Use Type | Class |
1 1 1 1 1 1 | 1 1 1 1 2 2 | 1 1 2 2 1 1 | 1 2 1 2 1 2 | 3 2 3 1 3 2 |
1 1 2 2 2 2 | 2 2 1 1 1 1 | 2 2 1 1 2 2 | 1 2 1 2 1 2 | 3 1 3 2 3 1 |
2 2 2 2 3 3 | 2 2 2 2 1 1 | 1 1 2 2 1 1 | 1 2 1 2 1 2 | 3 2 3 3 3 3 |
3 3 3 3 3 3 | 1 1 2 2 2 2 | 2 2 1 1 2 2 | 1 2 1 2 1 2 | 3 1 3 2 3 3 |
A coded forms for all values of attributes are used to avoid the cluttering in the table.
Entropy of a training set…
Specification of the attributes are as follows.
Age | Eye Sight | Astigmatic | Use Type |
1: Young | 1: Myopia | 1: No | 1: Frequent |
2: Middle-aged | 2: Hypermetropia | 2: Yes | 2: Less |
3: Old | | | |
Decision Tree Induction Techniques
Algorithm ID3
ID3: Decision Tree Induction Algorithms
Algorithm ID3
Defining Information Gain
Defining Information Gain…
Defining Information Gain…
Compute Information Gain
Missing value
Example
Age (x1) | Eye-sight (x2) | Astigmatism (x3) | Use type (x4) | Class (y) |
1 | 1 | 1 | 1 | 3 |
1 | 1 | 1 | 2 | 2 |
1 | 1 | 2 | 1 | 3 |
1 | 1 | 2 | 2 | 1 |
1 | 2 | 1 | 1 | 3 |
1 | 2 | 1 | 2 | 2 |
1 | 2 | 2 | 1 | 3 |
1 | 2 | 2 | 2 | 1 |
Example…
Age | Eye-sight | Astigmatism | Use type | Class |
2 | 1 | 1 | 1 | 3 |
2 | 1 | 1 | 2 | 2 |
2 | 1 | 2 | 1 | 3 |
2 | 1 | 2 | 2 | 1 |
2 | 2 | 1 | 1 | 3 |
2 | 2 | 1 | 2 | 2 |
2 | 2 | 2 | 1 | 3 |
2 | 2 | 2 | 2 | 3 |
Example…
Age | Eye-sight | Astigmatism | Use type | Class |
3 | 1 | 1 | 1 | 3 |
3 | 1 | 1 | 2 | 3 |
3 | 1 | 2 | 1 | 3 |
3 | 1 | 2 | 2 | 1 |
3 | 2 | 1 | 1 | 3 |
3 | 2 | 1 | 2 | 2 |
3 | 2 | 2 | 1 | 3 |
3 | 2 | 2 | 2 | 3 |
Information Gains for Different Attributes
Decision Tree Induction : ID3 Way
Decision Tree Induction : ID3 Way
| | | | |
| | | | |
Age
Eye-sight
Astigmatic
Use Type
Age | Eye | Ast | Use | Class |
1 | 1 | 1 | 1 | 3 |
1 | 1 | 2 | 1 | 3 |
1 | 2 | 1 | 1 | 3 |
1 | 2 | 2 | 1 | 3 |
2 | 1 | 1 | 1 | 3 |
2 | 2 | 1 | 1 | 3 |
2 | 2 | 2 | 1 | 3 |
3 | 1 | 1 | 1 | 3 |
3 | 1 | 2 | 1 | 3 |
3 | 2 | 1 | 1 | 3 |
3 | 2 | 2 | 1 | 3 |
Age | Eye | Ast | Use | Class |
1 | 1 | 1 | 2 | 2 |
1 | 1 | 2 | 2 | 1 |
1 | 2 | 1 | 2 | 2 |
1 | 2 | 2 | 2 | 1 |
2 | 1 | 1 | 2 | 2 |
2 | 1 | 2 | 2 | 1 |
2 | 2 | 1 | 2 | 2 |
3 | 1 | 1 | 2 | 3 |
3 | 1 | 2 | 2 | 3 |
3 | 2 | 1 | 2 | 2 |
3 | 2 | 2 | 2 | 3 |
✔
Age
Eye-sight
Astigmatic
Age
Eye-sight
Astigmatic
Example 3
Instances | | | | | |
| Sunny | Hot | High | Weak | No |
| Sunny | Hot | High | Strong | No |
| Overcast | Hot | High | Weak | Yes |
| Rain | Mild | High | Weak | Yes |
| Rain | Cool | Normal | Weak | Yes |
| Rain | Cool | Normal | Strong | No |
| Overcast | Cool | Normal | Strong | Yes |
| Sunny | Mild | High | Weak | No |
| Sunny | Cool | Normal | Weak | Yes |
| Rain | Mild | Normal | Weak | Yes |
| Sunny | Mild | Normal | Strong | Yes |
| Overcast | Mild | High | Strong | Yes |
| Overcast | Hot | Normal | Weak | Yes |
| Rain | Mild | High | Strong | No |
Example 3…
Example 3…
Example 3: Information Gain
Example 3: Information Gain…
Outlook has maximum gain
Root Node
Example 3: Information Gain…
Example 3: Information Gain…
Example 3: Information Gain…
We choose “Humidity”, since it has highest gain. No further splitting, reached as leaf node
Facts of Information Gain
C4.5�Gain Ratio
Gain Ratio
Dataset
Instances | | | | | |
| Sunny | Hot | High | Weak | No |
| Sunny | Hot | High | Strong | No |
| Overcast | Hot | High | Weak | Yes |
| Rain | Mild | High | Weak | Yes |
| Rain | Cool | Normal | Weak | Yes |
| Rain | Cool | Normal | Strong | No |
| Overcast | Cool | Normal | Strong | Yes |
| Sunny | Mild | High | Weak | No |
| Sunny | Cool | Normal | Weak | Yes |
| Rain | Mild | Normal | Weak | Yes |
| Sunny | Mild | Normal | Strong | Yes |
| Overcast | Mild | High | Strong | Yes |
| Overcast | Hot | Normal | Weak | Yes |
| Rain | Mild | High | Strong | No |
Example :Gain Ratio
Example :Gain Ratio…
Example :Gain Ratio…
CART�GINI INDEX
CLASSIFICATION AND REGRESSION TREE
GINI Index
GINI Index…
GINI Index…
GINI Index…
Measure of Impurity: GINI
Examples for computing GINI
P(C1) = 0/6 = 0 P(C2) = 6/6 = 1
Gini = 1 – P(C1)2 – P(C2)2 = 1 – 0 – 1 = 0
P(C1) = 1/6 P(C2) = 5/6
Gini = 1 – (1/6)2 – (5/6)2 = 0.278
P(C1) = 2/6 P(C2) = 4/6
Gini = 1 – (2/6)2 – (4/6)2 = 0.444
Splitting Based on GINI
where, ni = number of records at child i,
n = number of records at node p.
Binary Attributes: Computing GINI Index
B?
Yes
No
Node N1
Node N2
Gini(N1) �= 1 – (5/6)2 – (2/6)2 �= 0.194
Gini(N2) �= 1 – (1/6)2 – (4/6)2 �= 0.528
Gini(Children) �= 7/12 * 0.194 + 5/12 * 0.528�= 0.333
Categorical Attributes: Computing Gini Index
Multi-way split
Two-way split
(find best partition of values)
Example
Past Trend | Open Interest | Trading Volume | Return |
Positive | Low | High | Up |
Negative | High | Low | Down |
Positive | Low | High | Up |
Positive | High | High | Up |
Negative | Low | High | Down |
Positive | Low | Low | Down |
Negative | High | High | Down |
Negative | Low | High | Down |
Positive | Low | Low | Down |
Positive | High | High | Up |
Continuous Attributes: Computing Gini Index
Continuous Attributes: Computing Gini Index...
Split Positions
Sorted Values
Splitting Criteria based on Classification Error
Examples for Computing Error
P(C1) = 0/6 = 0 P(C2) = 6/6 = 1
Error = 1 – max (0, 1) = 1 – 1 = 0
P(C1) = 1/6 P(C2) = 5/6
Error = 1 – max (1/6, 5/6) = 1 – 5/6 = 1/6
P(C1) = 2/6 P(C2) = 4/6
Error = 1 – max (2/6, 4/6) = 1 – 4/6 = 1/3
Misclassification Error vs Gini
A?
Yes
No
Node N1
Node N2
Gini(N1) �= 1 – (3/3)2 – (0/3)2 �= 0
Gini(N2) �= 1 – (4/7)2 – (3/7)2 �= 0.489
Gini(Children) �= 3/10 * 0 �+ 7/10 * 0.489�= 0.342
Example: GINI Index
Instances | | | | | |
| Sunny | Hot | High | Weak | No |
| Sunny | Hot | High | Strong | No |
| Overcast | Hot | High | Weak | Yes |
| Rain | Mild | High | Weak | Yes |
| Rain | Cool | Normal | Weak | Yes |
| Rain | Cool | Normal | Strong | No |
| Overcast | Cool | Normal | Strong | Yes |
| Sunny | Mild | High | Weak | No |
| Sunny | Cool | Normal | Weak | Yes |
| Rain | Mild | Normal | Weak | Yes |
| Sunny | Mild | Normal | Strong | Yes |
| Overcast | Mild | High | Strong | Yes |
| Overcast | Hot | Normal | Weak | Yes |
| Rain | Mild | High | Strong | No |
Example: GINI Index…
Outlook | Yes | No | Number of instances |
Sunny | 2 | 3 | 5 |
Overcast | 4 | 0 | 4 |
Rain | 3 | 2 | 5 |
Gini(Outlook=Sunny) = 1 – (2/5)2 – (3/5)2 = 1 – 0.16 – 0.36 = 0.48
Gini(Outlook=Overcast) = 1 – (4/4)2 – (0/4)2 = 0
Gini(Outlook=Rain) = 1 – (3/5)2 – (2/5)2 = 1 – 0.36 – 0.16 = 0.48
Then, we will calculate weighted sum of Gini indexes for outlook feature.
Gini(Outlook) = (5/14) x 0.48 + (4/14) x 0 + (5/14) x 0.48 = 0.171 + 0 + 0.171 = 0.342
Example: GINI Index…
Temperature | Yes | No | Number of instances |
Hot | 2 | 2 | 4 |
Cool | 3 | 1 | 4 |
Mild | 4 | 2 | 6 |
Gini(Temp=Hot) = 1 – (2/4)2 – (2/4)2 = 0.5
Gini(Temp=Cool) = 1 – (3/4)2 – (1/4)2 = 1 – 0.5625 – 0.0625 = 0.375
Gini(Temp=Mild) = 1 – (4/6)2 – (2/6)2 = 1 – 0.444 – 0.111 = 0.445
We’ll calculate weighted sum of gini index for temperature feature
Gini(Temp) = (4/14) x 0.5 + (4/14) x 0.375 + (6/14) x 0.445 = 0.142 + 0.107 + 0.190 = 0.439
Example: GINI Index…
Humidity | Yes | No | Number of instances |
High | 3 | 4 | 7 |
Normal | 6 | 1 | 7 |
Gini(Humidity=High) = 1 – (3/7)2 – (4/7)2 = 1 – 0.183 – 0.326 = 0.489
Gini(Humidity=Normal) = 1 – (6/7)2 – (1/7)2 = 1 – 0.734 – 0.02 = 0.244
Weighted sum for humidity feature will be calculated next
Gini(Humidity) = (7/14) x 0.489 + (7/14) x 0.244 = 0.367
Example: GINI Index…
Wind | Yes | No | Number of instances |
Weak | 6 | 2 | 8 |
Strong | 3 | 3 | 6 |
Gini(Wind=Weak) = 1 – (6/8)2 – (2/8)2 = 1 – 0.5625 – 0.062 = 0.375
Gini(Wind=Strong) = 1 – (3/6)2 – (3/6)2 = 1 – 0.25 – 0.25 = 0.5
Gini(Wind) = (8/14) x 0.375 + (6/14) x 0.5 = 0.428
Example: GINI Index…
Feature | Gini index |
Outlook | 0.342 |
Temperature | 0.439 |
Humidity | 0.367 |
Wind | 0.428 |
Decision Tree at First Split
In this tree, overcast has only yes. Hence, it reaches at leaf node
Decision Tree at First Split
In this tree, overcast has only yes. Hence, it reaches at leaf node
DT: GINI Index for sub dataset
Day | Outlook | Temp. | Humidity | Wind | Decision |
1 | Sunny | Hot | High | Weak | No |
2 | Sunny | Hot | High | Strong | No |
8 | Sunny | Mild | High | Weak | No |
9 | Sunny | Cool | Normal | Weak | Yes |
11 | Sunny | Mild | Normal | Strong | Yes |
DT: GINI Index for sub dataset
Temperature | Yes | No | Number of instances |
Hot | 0 | 2 | 2 |
Cool | 1 | 0 | 1 |
Mild | 1 | 1 | 2 |
Gini(Outlook=Sunny and Temp.=Hot) = 1 – (0/2)2 – (2/2)2 = 0
Gini(Outlook=Sunny and Temp.=Cool) = 1 – (1/1)2 – (0/1)2 = 0
Gini(Outlook=Sunny and Temp.=Mild) = 1 – (1/2)2 – (1/2)2 = 1 – 0.25 – 0.25 = 0.5
Gini(Outlook=Sunny and Temp.) = (2/5)x0 + (1/5)x0 + (2/5)x0.5 = 0.2
DT: GINI Index for sub dataset
Humidity | Yes | No | Number of instances |
High | 0 | 3 | 3 |
Normal | 2 | 0 | 2 |
Gini(Outlook=Sunny and Humidity=High) = 1 – (0/3)2 – (3/3)2 = 0
Gini(Outlook=Sunny and Humidity=Normal) = 1 – (2/2)2 – (0/2)2 = 0
Gini(Outlook=Sunny and Humidity) = (3/5)x0 + (2/5)x0 = 0
DT: GINI Index for sub dataset
Wind | Yes | No | Number of instances |
Weak | 1 | 2 | 3 |
Strong | 1 | 1 | 2 |
Gini(Outlook=Sunny and Wind=Weak) = 1 – (1/3)2 – (2/3)2 = 0.266
Gini(Outlook=Sunny and Wind=Strong) = 1- (1/2)2 – (1/2)2 = 0.2
Gini(Outlook=Sunny and Wind) = (3/5)x0.266 + (2/5)x0.2 = 0.466
DT: GINI Index for sub dataset
Feature | Gini index |
Temperature | 0.2 |
Humidity | 0 |
Wind | 0.466 |
We’ll put humidity check at the extension of sunny outlook.
DT: Second Split
As seen, decision is always no for high humidity and sunny outlook. On the other hand, decision will always be yes for normal humidity and sunny outlook. This branch is over.
Decision Tree: Second Split
DT: Rain_Outlook
Day | Outlook | Temp. | Humidity | Wind | Decision |
4 | Rain | Mild | High | Weak | Yes |
5 | Rain | Cool | Normal | Weak | Yes |
6 | Rain | Cool | Normal | Strong | No |
10 | Rain | Mild | Normal | Weak | Yes |
14 | Rain | Mild | High | Strong | No |
We’ll calculate Gini index scores for temperature, humidity and wind features when outlook is rain.
DT: Rain_Outlook…
Temperature | Yes | No | Number of instances |
Cool | 1 | 1 | 2 |
Mild | 2 | 1 | 3 |
Gini(Outlook=Rain and Temp.=Cool) = 1 – (1/2)2 – (1/2)2 = 0.5
Gini(Outlook=Rain and Temp.=Mild) = 1 – (2/3)2 – (1/3)2 = 0.444
Gini(Outlook=Rain and Temp.) = (2/5)x0.5 + (3/5)x0.444 = 0.466
DT: Rain_Outlook…
Humidity | Yes | No | Number of instances |
High | 1 | 1 | 2 |
Normal | 2 | 1 | 3 |
Gini(Outlook=Rain and Humidity=High) = 1 – (1/2)2 – (1/2)2 = 0.5
Gini(Outlook=Rain and Humidity=Normal) = 1 – (2/3)2 – (1/3)2 = 0.444
Gini(Outlook=Rain and Humidity) = (2/5)x0.5 + (3/5)x0.444 = 0.466
DT: Rain_Outlook…
Wind | Yes | No | Number of instances |
Weak | 3 | 0 | 3 |
Strong | 0 | 2 | 2 |
Gini(Outlook=Rain and Wind=Weak) = 1 – (3/3)2 – (0/3)2 = 0
Gini(Outlook=Rain and Wind=Strong) = 1 – (0/2)2 – (2/2)2 = 0
Gini(Outlook=Rain and Wind) = (3/5)x0 + (2/5)x0 = 0
DT: Rain_Outlook…
Feature | Gini index |
Temperature | 0.466 |
Humidity | 0.466 |
Wind | 0 |
DT: at Third Split
As seen, decision is always yes when wind is weak. On the other hand, decision is always no if wind is strong. This means that this branch is over.
Decision Tree: Final
Stopping Criteria for Tree Induction
Decision Tree Based Classification
Practical Issues of Classification
Underfitting and Overfitting (Example)
500 circular and 500 triangular data points.
Circular points:
0.5 ≤ sqrt(x12+x22) ≤ 1
Triangular points:
sqrt(x12+x22) > 0.5 or
sqrt(x12+x22) < 1
Underfitting and Overfitting
Overfitting
Underfitting: when model is too simple, both training and test errors are large
Overfitting due to Noise
Decision boundary is distorted by noise point
Overfitting due to Insufficient Examples
Lack of data points in the lower half of the diagram makes it difficult to predict correctly the class labels of that region
- Insufficient number of training records in the region causes the decision tree to predict the test examples using other training records that are irrelevant to the classification task
Notes on Overfitting
Occam’s Razor
How to Address Overfitting
How to Address Overfitting…