1 of 29

Part 3: Decision Trees

  • Decision tree representation
  • ID3 learning algorithm
  • Entropy, information gain
  • Overfitting

2 of 29

Supplimentary material

www

  • http://dms.irb.hr/tutorial/tut_dtrees.php
  • http://www.cs.uregina.ca/~dbd/cs831/notes/ml/dtrees/4_dtrees1.html

ICS320

2

3 of 29

Decision Tree for PlayTennis

  • Attributes and their values:
    • Outlook: Sunny, Overcast, Rain
    • Humidity: High, Normal
    • Wind: Strong, Weak
    • Temperature: Hot, Mild, Cool

    • Target concept - Play Tennis: Yes, No

ICS320

3

4 of 29

Decision Tree for PlayTennis

ICS320

4

Outlook

Sunny

Overcast

Rain

Humidity

High

Normal

Wind

Strong

Weak

No

Yes

Yes

Yes

No

5 of 29

Decision Tree for PlayTennis

ICS320

5

Outlook

Sunny

Overcast

Rain

Humidity

High

Normal

No

Yes

Each internal node tests an attribute

Each branch corresponds to an

attribute value node

Each leaf node assigns a classification

6 of 29

Decision Tree for PlayTennis

ICS320

6

No

Outlook

Sunny

Overcast

Rain

Humidity

High

Normal

Wind

Strong

Weak

No

Yes

Yes

Yes

No

Outlook Temperature Humidity Wind PlayTennis

Sunny Hot High Weak ?

7 of 29

Decision Tree for Conjunction

ICS320

7

Outlook

Sunny

Overcast

Rain

Wind

Strong

Weak

No

Yes

No

Outlook=Sunny ∧ Wind=Weak

No

8 of 29

Decision Tree for Disjunction

ICS320

8

Outlook

Sunny

Overcast

Rain

Yes

Outlook=Sunny ∨ Wind=Weak

Wind

Strong

Weak

No

Yes

Wind

Strong

Weak

No

Yes

9 of 29

Decision Tree

ICS320

9

Outlook

Sunny

Overcast

Rain

Humidity

High

Normal

Wind

Strong

Weak

No

Yes

Yes

Yes

No

  • decision trees represent disjunctions of conjunctions

(Outlook=Sunny ∧ Humidity=Normal)

∨ (Outlook=Overcast)

∨ (Outlook=Rain ∧ Wind=Weak)

10 of 29

When to consider Decision Trees

  • Instances describable by attribute-value pairs
    • e.g Humidity: High, Normal
  • Target function is discrete valued
    • e.g Play tennis; Yes, No
  • Disjunctive hypothesis may be required
    • e.g Outlook=Sunny ∨ Wind=Weak
  • Possibly noisy training data
  • Missing attribute values
  • Application Examples:
    • Medical diagnosis
    • Credit risk analysis
    • Object classification for robot manipulator (Tan 1993)

ICS320

10

11 of 29

Top-Down Induction of Decision Trees ID3

ICS320

11

  1. A ← the “best” decision attribute for next node
  2. Assign A as decision attribute for node
  3. For each value of A create new descendant
  4. Sort training examples to leaf node according to

the attribute value of the branch

  1. If all training examples are perfectly classified (same value of target attribute) stop, else iterate over new leaf nodes.

12 of 29

Which Attribute is ”best”?

ICS320

12

A1=?

True

False

[21+, 5-]

[8+, 30-]

[29+,35-]

A2=?

True

False

[18+, 33-]

[11+, 2-]

[29+,35-]

13 of 29

Entropy

  • S is a sample of training examples
  • p+ is the proportion of positive examples
  • p- is the proportion of negative examples
  • Entropy measures the impurity of S

Entropy(S) = -p+ log2 p+ - p- log2 p-

ICS320

13

14 of 29

Entropy

  • Entropy(S)= expected number of bits needed to encode class (+ or -) of randomly drawn members of S (under the optimal, shortest length-code)

Why?

  • Information theory optimal length code assign

–log2 p bits to messages having probability p.

  • So the expected number of bits to encode

(+ or -) of random member of S:

-p+ log2 p+ - p- log2 p-

ICS320

14

Note that: 0Log20 =0

15 of 29

Information Gain

  • Gain(S,A): expected reduction in entropy due to sorting S on attribute A

ICS320

15

A1=?

True

False

[21+, 5-]

[8+, 30-]

[29+,35-]

A2=?

True

False

[18+, 33-]

[11+, 2-]

[29+,35-]

Gain(S,A)=Entropy(S) - v∈values(A) |Sv|/|S| Entropy(Sv)

Entropy([29+,35-]) = -29/64 log2 29/64 – 35/64 log2 35/64

= 0.99

16 of 29

Information Gain

Entropy([21+,5-]) = 0.71

Entropy([8+,30-]) = 0.74

Gain(S,A1)=Entropy(S)

-26/64*Entropy([21+,5-])

-38/64*Entropy([8+,30-])

=0.27

ICS320

16

A1=?

True

False

[21+, 5-]

[8+, 30-]

[29+,35-]

Entropy([18+,33-]) = 0.94

Entropy([11+,2-]) = 0.62

Gain(S,A2)=Entropy(S)

-51/64*Entropy([18+,33-])

-13/64*Entropy([11+,2-])

=0.12

A2=?

True

False

[18+, 33-]

[11+, 2-]

[29+,35-]

17 of 29

Training Examples

ICS320

17

No

Strong

High

Mild

Rain

D14

Yes

Weak

Normal

Hot

Overcast

D13

Yes

Strong

High

Mild

Overcast

D12

Yes

Strong

Normal

Mild

Sunny

D11

Yes

Strong

Normal

Mild

Rain

D10

Yes

Weak

Normal

Cool

Sunny

D9

No

Weak

High

Mild

Sunny

D8

Yes

Weak

Normal

Cool

Overcast

D7

No

Strong

Normal

Cool

Rain

D6

Yes

Weak

Normal

Cool

Rain

D5

Yes

Weak

High

Mild

Rain

D4

Yes

Weak

High

Hot

Overcast

D3

No

Strong

High

Hot

Sunny

D2

No

Weak

High

Hot

Sunny

D1

Play Tennis

Wind

Humidity

Temp.

Outlook

Day

18 of 29

Selecting the Next Attribute

ICS320

18

Humidity

High

Normal

[3+, 4-]

[6+, 1-]

S=[9+,5-]

E=0.940

Gain(S,Humidity)

=0.940-(7/14)*0.985

– (7/14)*0.592

=0.151

E=0.985

E=0.592

Wind

Weak

Strong

[6+, 2-]

[3+, 3-]

S=[9+,5-]

E=0.940

E=0.811

E=1.0

Gain(S,Wind)

=0.940-(8/14)*0.811

– (6/14)*1.0

=0.048

Humidity provides greater info. gain than Wind, w.r.t target classification.

19 of 29

Selecting the Next Attribute

ICS320

19

Outlook

Sunny

Rain

[2+, 3-]

[3+, 2-]

S=[9+,5-]

E=0.940

Gain(S,Outlook)

=0.940-(5/14)*0.971

-(4/14)*0.0 – (5/14)*0.0971

=0.247

E=0.971

E=0.971

Over

cast

[4+, 0]

E=0.0

20 of 29

Selecting the Next Attribute

ICS320

20

The information gain values for the 4 attributes are:

  • Gain(S,Outlook) =0.247
  • Gain(S,Humidity) =0.151
  • Gain(S,Wind) =0.048
  • Gain(S,Temperature) =0.029

where S denotes the collection of training examples

Note: 0Log20 =0

21 of 29

ID3 Algorithm

ICS320

21

Outlook

Sunny

Overcast

Rain

Yes

[D1,D2,…,D14]

[9+,5-]

Ssunny=[D1,D2,D8,D9,D11]

[2+,3-]

?

?

[D3,D7,D12,D13]

[4+,0-]

[D4,D5,D6,D10,D14]

[3+,2-]

Gain(Ssunny , Humidity)=0.970-(3/5)0.0 – 2/5(0.0) = 0.970

Gain(Ssunny , Temp.)=0.970-(2/5)0.0 –2/5(1.0)-(1/5)0.0 = 0.570

Gain(Ssunny , Wind)=0.970= -(2/5)1.0 – 3/5(0.918) = 0.019

Test for this node

Note: 0Log20 =0

22 of 29

ID3 Algorithm

ICS320

22

Outlook

Sunny

Overcast

Rain

Humidity

High

Normal

Wind

Strong

Weak

No

Yes

Yes

Yes

No

[D3,D7,D12,D13]

[D8,D9,D11]

[D6,D14]

[D1,D2]

[D4,D5,D10]

23 of 29

Occam’s Razor

Why prefer short hypotheses?

Argument in favor:

    • Fewer short hypotheses than long hypotheses
    • A short hypothesis that fits the data is unlikely to be a coincidence
    • A long hypothesis that fits the data might be a coincidence

Argument opposed:

    • There are many ways to define small sets of hypotheses
    • E.g. All trees with a prime number of nodes that use attributes beginning with ”Z”
    • What is so special about small sets based on size of hypothesis

ICS320

23

24 of 29

Overfitting

  • One of the biggest problems with decision trees is Overfitting

ICS320

24

25 of 29

Overfitting in Decision Tree Learning

ICS320

25

26 of 29

Avoid Overfitting

How can we avoid overfitting?

  • Stop growing when data split not statistically significant
  • Grow full tree then post-prune
  • Minimum description length (MDL):

Minimize:

size(tree) + size(misclassifications(tree))

ICS320

26

27 of 29

Converting a Tree to Rules

ICS320

27

Outlook

Sunny

Overcast

Rain

Humidity

High

Normal

Wind

Strong

Weak

No

Yes

Yes

Yes

No

R1: If (Outlook=Sunny) ∧ (Humidity=High) Then PlayTennis=No

R2: If (Outlook=Sunny) ∧ (Humidity=Normal) Then PlayTennis=Yes

R3: If (Outlook=Overcast) Then PlayTennis=Yes

R4: If (Outlook=Rain) ∧ (Wind=Strong) Then PlayTennis=No

R5: If (Outlook=Rain) ∧ (Wind=Weak) Then PlayTennis=Yes

28 of 29

Continuous Valued Attributes

Create a discrete attribute to test continuous

  • Temperature = 24.50C
  • (Temperature > 20.00C) = {true, false}

Where to set the threshold?

ICS320

28

No

Yes

Yes

Yes

No

No

PlayTennis

270C

240C

220C

190C

180C

150C

Temperature

(see paper by [Fayyad, Irani 1993]

29 of 29

Unknown Attribute Values

What if some examples have missing values of A?

Use training example anyway sort through tree

  • If node n tests A, assign most common value of A among other examples sorted to node n.
  • Assign most common value of A among other examples with same target value
  • Assign probability pi to each possible value vi of A
    • Assign fraction pi of example to each descendant in tree

Classify new examples in the same fashion

ICS320

29