1 of 29

Part 3: Decision Trees

Decision tree representation
ID3 learning algorithm
Entropy, information gain
Overfitting

2 of 29

Supplimentary material

www

http://dms.irb.hr/tutorial/tut_dtrees.php
http://www.cs.uregina.ca/~dbd/cs831/notes/ml/dtrees/4_dtrees1.html

ICS320

2

3 of 29

Decision Tree for PlayTennis

Attributes and their values:

Outlook: Sunny, Overcast, Rain
Humidity: High, Normal
Wind: Strong, Weak
Temperature: Hot, Mild, Cool

Target concept - Play Tennis: Yes, No

ICS320

3

4 of 29

Decision Tree for PlayTennis

ICS320

4

Outlook

Sunny

Overcast

Rain

Humidity

High

Normal

Wind

Strong

Weak

No

Yes

No

5 of 29

Decision Tree for PlayTennis

ICS320

5

Outlook

Sunny

Overcast

Rain

Humidity

High

Normal

No

Yes

Each internal node tests an attribute

Each branch corresponds to an

attribute value node

Each leaf node assigns a classification

6 of 29

Decision Tree for PlayTennis

ICS320

6

No

Outlook

Sunny

Overcast

Rain

Humidity

High

Normal

Wind

Strong

Weak

No

Yes

No

Outlook Temperature Humidity Wind PlayTennis

Sunny Hot High Weak ?

7 of 29

Decision Tree for Conjunction

ICS320

7

Outlook

Sunny

Overcast

Rain

Wind

Strong

Weak

No

Yes

No

Outlook=Sunny ∧ Wind=Weak

No

8 of 29

Decision Tree for Disjunction

ICS320

8

Outlook

Sunny

Overcast

Rain

Yes

Outlook=Sunny ∨ Wind=Weak

Wind

Strong

Weak

No

Yes

Wind

Strong

Weak

No

Yes

9 of 29

Decision Tree

ICS320

9

Outlook

Sunny

Overcast

Rain

Humidity

High

Normal

Wind

Strong

Weak

No

Yes

No

decision trees represent disjunctions of conjunctions

(Outlook=Sunny ∧ Humidity=Normal)

∨ (Outlook=Overcast)

∨ (Outlook=Rain ∧ Wind=Weak)

10 of 29

When to consider Decision Trees

Instances describable by attribute-value pairs

e.g Humidity: High, Normal

Target function is discrete valued

e.g Play tennis; Yes, No

Disjunctive hypothesis may be required

e.g Outlook=Sunny ∨ Wind=Weak

Possibly noisy training data
Missing attribute values
Application Examples:

Medical diagnosis
Credit risk analysis
Object classification for robot manipulator (Tan 1993)

ICS320

10

11 of 29

Top-Down Induction of Decision Trees ID3

ICS320

11

A ← the “best” decision attribute for next node
Assign A as decision attribute for node
For each value of A create new descendant
Sort training examples to leaf node according to

the attribute value of the branch

If all training examples are perfectly classified (same value of target attribute) stop, else iterate over new leaf nodes.

12 of 29

Which Attribute is ”best”?

ICS320

12

A₁=?

True

False

[21+, 5-]

[8+, 30-]

[29+,35-]

A₂=?

True

False

[18+, 33-]

[11+, 2-]

[29+,35-]

13 of 29

Entropy

S is a sample of training examples
p₊ is the proportion of positive examples
p_- is the proportion of negative examples
Entropy measures the impurity of S

Entropy(S) = -p₊ log₂ p₊ - p_- log₂ p_-

ICS320

13

14 of 29

Entropy

Entropy(S)= expected number of bits needed to encode class (+ or -) of randomly drawn members of S (under the optimal, shortest length-code)

Why?

Information theory optimal length code assign

–log₂ p bits to messages having probability p.

So the expected number of bits to encode

(+ or -) of random member of S:

-p₊ log₂ p₊ - p_- log₂ p_-

ICS320

14

Note that: 0Log₂0 =0

15 of 29

Information Gain

Gain(S,A): expected reduction in entropy due to sorting S on attribute A

ICS320

15

A₁=?

True

False

[21+, 5-]

[8+, 30-]

[29+,35-]

A₂=?

True

False

[18+, 33-]

[11+, 2-]

[29+,35-]

Gain(S,A)=Entropy(S) - ∑_{v∈values(A)} |S_v|/|S| Entropy(S_v)

Entropy([29+,35-]) = -29/64 log₂ 29/64 – 35/64 log₂ 35/64

= 0.99

16 of 29

Information Gain

Entropy([21+,5-]) = 0.71

Entropy([8+,30-]) = 0.74

Gain(S,A₁)=Entropy(S)

-26/64*Entropy([21+,5-])

-38/64*Entropy([8+,30-])

=0.27

ICS320

16

A₁=?

True

False

[21+, 5-]

[8+, 30-]

[29+,35-]

Entropy([18+,33-]) = 0.94

Entropy([11+,2-]) = 0.62

Gain(S,A₂)=Entropy(S)

-51/64*Entropy([18+,33-])

-13/64*Entropy([11+,2-])

=0.12

A₂=?

True

False

[18+, 33-]

[11+, 2-]

[29+,35-]

17 of 29

Training Examples

ICS320

17

No

Strong

High

Mild

Rain

D14

Yes

Weak

Normal

Hot

Overcast

D13

Yes

Strong

High

Mild

Overcast

D12

Yes

Strong

Normal

Mild

Sunny

D11

Yes

Strong

Normal

Mild

Rain

D10

Yes

Weak

Normal

Cool

Sunny

D9

No

Weak

High

Mild

Sunny

D8

Yes

Weak

Normal

Cool

Overcast

D7

No

Strong

Normal

Cool

Rain

D6

Yes

Weak

Normal

Cool

Rain

D5

Yes

Weak

High

Mild

Rain

D4

Yes

Weak

High

Hot

Overcast

D3

No

Strong

High

Hot

Sunny

D2

No

Weak

High

Hot

Sunny

D1

Play Tennis

Wind

Humidity

Temp.

Outlook

Day

18 of 29

Selecting the Next Attribute

ICS320

18

Humidity

High

Normal

[3+, 4-]

[6+, 1-]

S=[9+,5-]

E=0.940

Gain(S,Humidity)

=0.940-(7/14)*0.985

– (7/14)*0.592

=0.151

E=0.985

E=0.592

Wind

Weak

Strong

[6+, 2-]

[3+, 3-]

S=[9+,5-]

E=0.940

E=0.811

E=1.0

Gain(S,Wind)

=0.940-(8/14)*0.811

– (6/14)*1.0

=0.048

Humidity provides greater info. gain than Wind, w.r.t target classification.

19 of 29

Selecting the Next Attribute

ICS320

19

Outlook

Sunny

Rain

[2+, 3-]

[3+, 2-]

S=[9+,5-]

E=0.940

Gain(S,Outlook)

=0.940-(5/14)*0.971

-(4/14)*0.0 – (5/14)*0.0971

=0.247

E=0.971

Over

cast

[4+, 0]

E=0.0

20 of 29

Selecting the Next Attribute

ICS320

20

The information gain values for the 4 attributes are:

Gain(S,Outlook) =0.247
Gain(S,Humidity) =0.151
Gain(S,Wind) =0.048
Gain(S,Temperature) =0.029

where S denotes the collection of training examples

Note: 0Log₂0 =0

21 of 29

ID3 Algorithm

ICS320

21

Outlook

Sunny

Overcast

Rain

Yes

[D1,D2,…,D14]

[9+,5-]

S_sunny=[D1,D2,D8,D9,D11]

[2+,3-]

?

[D3,D7,D12,D13]

[4+,0-]

[D4,D5,D6,D10,D14]

[3+,2-]

Gain(S_sunny , Humidity)=0.970-(3/5)0.0 – 2/5(0.0) = 0.970

Gain(S_sunny , Temp.)=0.970-(2/5)0.0 –2/5(1.0)-(1/5)0.0 = 0.570

Gain(S_sunny , Wind)=0.970= -(2/5)1.0 – 3/5(0.918) = 0.019

Test for this node

Note: 0Log₂0 =0

22 of 29

ID3 Algorithm

ICS320

22

Outlook

Sunny

Overcast

Rain

Humidity

High

Normal

Wind

Strong

Weak

No

Yes

No

[D3,D7,D12,D13]

[D8,D9,D11]

[D6,D14]

[D1,D2]

[D4,D5,D10]

23 of 29

Occam’s Razor

Why prefer short hypotheses?

Argument in favor:

Fewer short hypotheses than long hypotheses
A short hypothesis that fits the data is unlikely to be a coincidence
A long hypothesis that fits the data might be a coincidence

Argument opposed:

There are many ways to define small sets of hypotheses
E.g. All trees with a prime number of nodes that use attributes beginning with ”Z”
What is so special about small sets based on size of hypothesis

ICS320

23

24 of 29

Overfitting

One of the biggest problems with decision trees is Overfitting

ICS320

24

25 of 29

Overfitting in Decision Tree Learning

ICS320

25

26 of 29

Avoid Overfitting

How can we avoid overfitting?

Stop growing when data split not statistically significant
Grow full tree then post-prune
Minimum description length (MDL):

Minimize:

size(tree) + size(misclassifications(tree))

ICS320

26

27 of 29

Converting a Tree to Rules

ICS320

27

Outlook

Sunny

Overcast

Rain

Humidity

High

Normal

Wind

Strong

Weak

No

Yes

No

R₁: If (Outlook=Sunny) ∧ (Humidity=High) Then PlayTennis=No

R₂: If (Outlook=Sunny) ∧ (Humidity=Normal) Then PlayTennis=Yes

R₃: If (Outlook=Overcast) Then PlayTennis=Yes

R₄: If (Outlook=Rain) ∧ (Wind=Strong) Then PlayTennis=No

R₅: If (Outlook=Rain) ∧ (Wind=Weak) Then PlayTennis=Yes

28 of 29

Continuous Valued Attributes

Create a discrete attribute to test continuous

Temperature = 24.5⁰C
(Temperature > 20.0⁰C) = {true, false}

Where to set the threshold?

ICS320

28

No

Yes

No

PlayTennis

27⁰C

24⁰C

22⁰C

19⁰C

18⁰C

15⁰C

Temperature

(see paper by [Fayyad, Irani 1993]

29 of 29

Unknown Attribute Values

What if some examples have missing values of A?

Use training example anyway sort through tree

If node n tests A, assign most common value of A among other examples sorted to node n.
Assign most common value of A among other examples with same target value
Assign probability pi to each possible value vi of A

Assign fraction pi of example to each descendant in tree

Classify new examples in the same fashion

ICS320

29