1 of 117

�Decision Trees

Dr. Dinesh Kumar Vishwakarma

Professor,

Department of Information Technology,

Delhi Technological University, Delhi-110042

dinesh@dtu.ac.in

http://www.dtu.ac.in/Web/Departments/InformationTechnology/faculty/dkvishwakarma.php

2 of 117

Introduction

  • A decision tree is a support tool with a tree-like structure that models probable outcomes, cost of resources, utilities, and possible consequences.
  • It provides a way to present algorithms with conditional control statements.
  • It include branches that represent decision-making steps that can lead to a favorable result.
  • The flowchart structure includes internal nodes that represent tests or attributes at each stage.
  • Every branch stands for an outcome for the attributes, while the path from the leaf to the root represents rules for classification.

3 of 117

Applications

  • Assessing prospective growth opportunities
  • Using demographic data to find prospective clients
  • Marketing
  • Retention of Customers
  • Diagnosis of Diseases and Ailments
  • Detection of Frauds

4 of 117

Advantages

  • Easy to read and interpret
    • The o/p are easy to read and interpret without requiring statistical knowledge
  • Easy to prepare
    • take less effort for data preparation
  • Less data cleaning required
    • There is less data cleaning required once the variables have been created. In cases of missing values and outliers have less significance on the decision tree’s data.

5 of 117

Decision Tree: DT

  • Graphical representation of all possible solutions.
  • Decisions are based on some conditions.
  • Decision made can be easily explained.

Hungary

Yes

No

Have ₹ 100

Sleep

Restaurant

Yes

Apple

No

A decision tree is a flowchart-like structure in which each internal node represents a "test" on an attribute (e.g. whether a coin flip comes up heads or tails), each branch represents the outcome of the test, and each leaf node represents a class label (decision taken after computing all attributes)

E.g.

Call Centre

Root node

Leaf Node

Decision Node

6 of 117

Decision Tree: DT

categorical

categorical

continuous

class

Refund

MarSt

TaxInc

YES

NO

NO

NO

Yes

No

Married

Single, Divorced

< 80K

> 80K

Splitting Attributes

Training Data

Model: Decision Tree

7 of 117

Another Example of Decision Tree

categorical

categorical

continuous

class

MarSt

Refund

TaxInc

YES

NO

NO

NO

Yes

No

Married

Single, Divorced

< 80K

> 80K

There could be more than one tree that fits the same data!

8 of 117

Decision Tree Classification Task

Decision Tree

9 of 117

Apply Model to Test Data

Refund

MarSt

TaxInc

YES

NO

NO

NO

Yes

No

Married

Single, Divorced

< 80K

> 80K

Test Data

Start from the root of tree.

10 of 117

Apply Model to Test Data

Refund

MarSt

TaxInc

YES

NO

NO

NO

Yes

No

Married

Single, Divorced

< 80K

> 80K

Test Data

11 of 117

Apply Model to Test Data

Refund

MarSt

TaxInc

YES

NO

NO

NO

Yes

No

Married

Single, Divorced

< 80K

> 80K

Test Data

12 of 117

Apply Model to Test Data

Refund

MarSt

TaxInc

YES

NO

NO

NO

Yes

No

Married

Single, Divorced

< 80K

> 80K

Test Data

13 of 117

Apply Model to Test Data

Refund

MarSt

TaxInc

YES

NO

NO

NO

Yes

No

Married

Single, Divorced

< 80K

> 80K

Test Data

14 of 117

Apply Model to Test Data

Refund

MarSt

TaxInc

YES

NO

NO

NO

Yes

No

Married

Single, Divorced

< 80K

> 80K

Test Data

Assign Cheat to “No”

15 of 117

Decision Tree Classification Task

Decision Tree

16 of 117

Decision Tree Induction

  • Many Algorithms:
    • Hunt’s Algorithm (one of the earliest)
    • CART
    • ID3, C4.5

17 of 117

Hunt’s Algorithm

  •  

Dt

?

18 of 117

Hunt’s Algorithm

Don’t

Cheat

Refund

Don’t

Cheat

Don’t

Cheat

Yes

No

Refund

Don’t

Cheat

Yes

No

Marital

Status

Don’t

Cheat

Cheat

Single,

Divorced

Married

Taxable

Income

Don’t

Cheat

< 80K

>= 80K

Refund

Don’t

Cheat

Yes

No

Marital

Status

Don’t

Cheat

Cheat

Single,

Divorced

Married

19 of 117

Tree Induction

  • Greedy strategy.
    • Split the records based on an attribute test that optimizes certain criterion.
  • Issues
    • Determine how to split the records
      • How to specify the attribute test condition?
      • How to determine the best split?
    • Determine when to stop splitting

20 of 117

Tree Induction

  • Greedy strategy.
    • Split the records based on an attribute test that optimizes certain criterion.
  • Issues
    • Determine how to split the records
      • How to specify the attribute test condition?
      • How to determine the best split?
    • Determine when to stop splitting

21 of 117

How to Specify Test Condition?

  • Depends on attribute types
    • Binary
    • Nominal
    • Ordinal
    • Continuous
  • Depends on number of ways to split
    • 2-way split
    • Multi-way split

22 of 117

Splitting Based on Binary Attributes

  • BuildDT algorithm must provides a method for expressing an attribute test condition and corresponding outcome for different attribute type
  • Case: Binary attribute
    • This is the simplest case of node splitting

    • The test condition for a binary attribute generates only two outcomes

23 of 117

Splitting Based on Nominal Attributes

  • Multi-way split: Use as many partitions as distinct values.

  • Binary split: Divides values into two subsets. � Need to find optimal partitioning.

CarType

Family

Sports

Luxury

CarType

{Family, �Luxury}

{Sports}

CarType

{Sports, Luxury}

{Family}

OR

24 of 117

Splitting Based on Ordinal Attributes

  • Multi-way split: Use as many partitions as distinct values.
    • Small (S), Medium (M),
    • Large (L), Extra Large (XL)

  • Binary split: Divides values into two subsets. � Need to find optimal partitioning.

  • What about this split?

Size

S

M

L

Size

{M, L}

{S}

Size

{S, M}

{L, XL}

OR

Size

{S, L}

{M, XL}

Not allowed

25 of 117

Splitting Based on Continuous Attributes

  • Different ways of handling
    • Discretization to form an ordinal categorical attribute
      • Static – discretize once at the beginning
      • Dynamic – ranges can be found by equal interval bucketing, equal frequency bucketing� (percentiles), or clustering.

    • Binary Decision: (A < v) or (A ≥ v)
      • consider all possible splits and finds the best cut
      • can be more compute intensive

26 of 117

Splitting Based on Continuous Attributes…

27 of 117

Tree Induction

  • Greedy strategy.
    • Split the records based on an attribute test that optimizes certain criterion.

  • Issues
    • Determine how to split the records
      • How to specify the attribute test condition?
      • How to determine the best split?
    • Determine when to stop splitting

28 of 117

How to determine the Best Split

Before Splitting: 10 records of class 0,� 10 records of class 1

Which test condition is the best?

29 of 117

How to determine the Best Split

  • Greedy approach:
    • Nodes with homogeneous class distribution are preferred.
  • Need a measure of node impurity:

Non-homogeneous,

High degree of impurity

Homogeneous,

Low degree of impurity

30 of 117

Measures of Node Impurity

  • There are three popular way to measure impurity is:
    • Information Gain
    • Gain Ratio
    • Gini Index

31 of 117

Concept of Entropy

 

More organized or Less organized or

ordered (less probable) disordered (more probable)

More ordered Less ordered

less entropy higher entropy

32 of 117

An Open Challenge!

Two sheets showing the tabulation of marks obtained in a course are shown.

Which tabulation of marks shows the “good” performance of the class?

How you can measure the same?

Roll No.

Assignment

Project

Mid-Sem

End-Sem

12BT3FP06

89

99

56

91

10IM30013

95

98

55

93

12CE31005

98

96

58

97

12EC35015

93

95

54

99

12GG2005

90

91

53

98

12MI33006

91

93

57

97

13AG36001

96

94

58

95

13EE10009

92

96

56

96

13MA20012

88

98

59

96

14CS30017

94

90

60

94

14ME10067

90

92

58

95

14MT10038

99

89

55

93

Roll No.

Assignment

Project

Mid-Sem

End-Sem

12BT3FP06

19

59

16

71

10IM30013

37

38

25

83

12CE31005

38

16

48

97

12EC35015

23

95

54

19

12GG2005

40

71

43

28

12MI33006

61

93

47

97

13AG36001

26

64

48

75

13EE10009

92

46

56

56

13MA20012

88

58

59

66

14CS30017

74

20

60

44

14ME10067

50

42

38

35

14MT10038

29

69

25

33

33 of 117

Entropy and its Meaning

  • Entropy is an important concept used in Physics in the context of heat and thereby uncertainty of the states of a matter.

  • At a later stage, with the growth of Information Technology, entropy becomes an important concept in Information Theory.

  • To deal with the classification job, entropy is an important concept, which is considered as

    • an information-theoretic measure of the “uncertainty” contained in a training data

      • due to the presence of more than one classes.

34 of 117

Entropy in Information Theory

  • The entropy concept in information theory first time coined by Claude Shannon (1850).
  • The first time it was used to measure the “information content” in messages.
  • According to his concept of entropy, presently entropy is widely being used as a way of representing messages for efficient transmission by Telecommunication Systems.

35 of 117

Measure of Information Content

  •  

36 of 117

Entropy Calculations

  •  

37 of 117

Entropy Calculations…

  •  

38 of 117

Entropy of a Training Set

Consider a dataset of OTPH as shown in the following table with total 24 instances in it.

Age

Eye sight

Astigmatic

Use Type

Class

1

1

1

1

1

1

1

1

1

1

2

2

1

1

2

2

1

1

1

2

1

2

1

2

3

2

3

1

3

2

1

1

2

2

2

2

2

2

1

1

1

1

2

2

1

1

2

2

1

2

1

2

1

2

3

1

3

2

3

1

2

2

2

2

3

3

2

2

2

2

1

1

1

1

2

2

1

1

1

2

1

2

1

2

3

2

3

3

3

3

3

3

3

3

3

3

1

1

2

2

2

2

2

2

1

1

2

2

1

2

1

2

1

2

3

1

3

2

3

3

A coded forms for all values of attributes are used to avoid the cluttering in the table.

39 of 117

Entropy of a training set…

Specification of the attributes are as follows.

Age

Eye Sight

Astigmatic

Use Type

1: Young

1: Myopia

1: No

1: Frequent

2: Middle-aged

2: Hypermetropia

2: Yes

2: Less

3: Old

 

40 of 117

Decision Tree Induction Techniques

  • Decision tree induction is a top-down, recursive and divide-and-conquer approach.
  • The procedure is to choose an attribute and split it into from a larger training set into smaller training sets.
  • Different algorithms have been proposed to take a good control over
    1. Choosing the best attribute to be splitted, and
    2. Splitting criteria
  • Several algorithms have been proposed for the above tasks. In this lecture, we shall limit our discussions into three important of them
    • ID3
    • C 4.5
    • CART

41 of 117

Algorithm ID3

42 of 117

ID3: Decision Tree Induction Algorithms

  • Quinlan [1986] introduced the ID3, a popular short form of Iterative Dichotomizer 3 for decision trees from a set of training data.

  • In ID3, each node corresponds to a splitting attribute and each arc is a possible value of that attribute.

  • At each node, the splitting attribute is selected to be the most informative among the attributes not yet considered in the path starting from the root.

43 of 117

Algorithm ID3

  • In ID3, entropy is used to measure how informative a node is.

    • It is observed that splitting on any attribute has the property that average entropy of the resulting training subsets will be less than or equal to that of the previous training set.

  • ID3 algorithm defines a measurement of a splitting called Information Gain to determine the goodness of a split.

    • The attribute with the largest value of information gain is chosen as the splitting attribute and

    • it partitions into a number of smaller training sets based on the distinct values of attribute under split.

44 of 117

Defining Information Gain

  •  

45 of 117

Defining Information Gain…

  •  

46 of 117

Defining Information Gain…

  •  

47 of 117

Compute Information Gain

 

Missing value

 

48 of 117

Example

  •  

Age (x1)

Eye-sight (x2)

Astigmatism (x3)

Use type (x4)

Class

(y)

1

1

1

1

3

1

1

1

2

2

1

1

2

1

3

1

1

2

2

1

1

2

1

1

3

1

2

1

2

2

1

2

2

1

3

1

2

2

2

1

 

49 of 117

Example…

Age

Eye-sight

Astigmatism

Use type

Class

2

1

1

1

3

2

1

1

2

2

2

1

2

1

3

2

1

2

2

1

2

2

1

1

3

2

2

1

2

2

2

2

2

1

3

2

2

2

2

3

 

 

50 of 117

Example…

Age

Eye-sight

Astigmatism

Use type

Class

3

1

1

1

3

3

1

1

2

3

3

1

2

1

3

3

1

2

2

1

3

2

1

1

3

3

2

1

2

2

3

2

2

1

3

3

2

2

2

3

 

 

 

51 of 117

Information Gains for Different Attributes

 

52 of 117

Decision Tree Induction : ID3 Way

 

53 of 117

Decision Tree Induction : ID3 Way

Age

Eye-sight

Astigmatic

 

Use Type

Age

Eye

Ast

Use

Class

1

1

1

1

3

1

1

2

1

3

1

2

1

1

3

1

2

2

1

3

2

1

1

1

3

2

2

1

1

3

2

2

2

1

3

3

1

1

1

3

3

1

2

1

3

3

2

1

1

3

3

2

2

1

3

Age

Eye

Ast

Use

Class

1

1

1

2

2

1

1

2

2

1

1

2

1

2

2

1

2

2

2

1

2

1

1

2

2

2

1

2

2

1

2

2

1

2

2

3

1

1

2

3

3

1

2

2

3

3

2

1

2

2

3

2

2

2

3

 

 

 

 

 

 

 

 

 

 

 

 

 

Age

Eye-sight

Astigmatic

Age

Eye-sight

Astigmatic

 

 

 

 

 

 

 

 

 

 

 

 

 

54 of 117

Example 3

  •  

Instances

Sunny

Hot

High

Weak

No

Sunny

Hot

High

Strong

No

Overcast

Hot

High

Weak

Yes

Rain

Mild

High

Weak

Yes

Rain

Cool

Normal

Weak

Yes

Rain

Cool

Normal

Strong

No

Overcast

Cool

Normal

Strong

Yes

Sunny

Mild

High

Weak

No

Sunny

Cool

Normal

Weak

Yes

Rain

Mild

Normal

Weak

Yes

Sunny

Mild

Normal

Strong

Yes

Overcast

Mild

High

Strong

Yes

Overcast

Hot

Normal

Weak

Yes

Rain

Mild

High

Strong

No

55 of 117

Example 3…

56 of 117

Example 3…

  • A partially learned decision tree
  • There are three possibility to select further attribute Temperature, Humidity, and Wind.
  • Intuitively, “Humidity” is the best to choose

57 of 117

Example 3: Information Gain

  •  

58 of 117

Example 3: Information Gain…

  •  

Outlook has maximum gain

Root Node

59 of 117

Example 3: Information Gain…

  • Same strategy is applied recursively to each subset of training instances.

60 of 117

Example 3: Information Gain…

  • Further branching at the node reached when outlook is sunny.
  • The information gain at daughter node are:
  • Temperature: Hot-02, Mild-02, Cool-01

 

 

 

 

 

 

61 of 117

Example 3: Information Gain…

  • Similarly, for Humidity and Wind
  • Humidity: High-03 (N-3, Y-0), Normal-02(N-0, Y-2)

  • Wind: Weak-03(Y-1, N-2), Strong-02 (Y-1, N-1)

 

 

 

 

 

 

 

 

 

We choose “Humidity”, since it has highest gain. No further splitting, reached as leaf node

62 of 117

Facts of Information Gain

  •  

63 of 117

C4.5�Gain Ratio

64 of 117

Gain Ratio

  •  

65 of 117

Dataset

Instances

Sunny

Hot

High

Weak

No

Sunny

Hot

High

Strong

No

Overcast

Hot

High

Weak

Yes

Rain

Mild

High

Weak

Yes

Rain

Cool

Normal

Weak

Yes

Rain

Cool

Normal

Strong

No

Overcast

Cool

Normal

Strong

Yes

Sunny

Mild

High

Weak

No

Sunny

Cool

Normal

Weak

Yes

Rain

Mild

Normal

Weak

Yes

Sunny

Mild

Normal

Strong

Yes

Overcast

Mild

High

Strong

Yes

Overcast

Hot

Normal

Weak

Yes

Rain

Mild

High

Strong

No

66 of 117

Example :Gain Ratio

  •  

67 of 117

Example :Gain Ratio…

  • Normalize the information gain by dividing by the split info value to get gain ratio.

68 of 117

Example :Gain Ratio…

  • Similarly, it can be calculated for others attributes such as:

  • Outlook still comes out on top but Humidity is much closers contender because it split the data into two subsets instead three.

69 of 117

CART�GINI INDEX

CLASSIFICATION AND REGRESSION TREE

70 of 117

GINI Index

  • The Gini Index facilitates the bigger distributions so easy to implement whereas the Information Gain favors lesser distributions having small count with multiple specific values.
  • The method of the Gini Index is used by CART algorithms, in contrast to it, Information Gain is used in ID3, C4.5 algorithms.
  • Gini index operates on the categorical target variables in terms of “success” or “failure” and performs only binary split, in opposite to that Information Gain computes the difference between entropy before and after the split and indicates the impurity in classes of elements.

71 of 117

GINI Index…

  •  

72 of 117

GINI Index…

  •  

73 of 117

GINI Index…

  •  

74 of 117

Measure of Impurity: GINI

  • Gini Index for a given node t :
    • Maximum (1 - 1/nc) when records are equally distributed among all classes, implying least interesting information.
    • Minimum (0.0) when all records belong to one class, implying most interesting information.

 

75 of 117

Examples for computing GINI

P(C1) = 0/6 = 0 P(C2) = 6/6 = 1

Gini = 1 – P(C1)2 – P(C2)2 = 1 – 0 – 1 = 0

P(C1) = 1/6 P(C2) = 5/6

Gini = 1 – (1/6)2 – (5/6)2 = 0.278

P(C1) = 2/6 P(C2) = 4/6

Gini = 1 – (2/6)2 – (4/6)2 = 0.444

 

76 of 117

Splitting Based on GINI

  • Used in CART: Classification And Regression Trees
  • When a node p is split into k partitions (children), the quality of split is computed as,

where, ni = number of records at child i,

n = number of records at node p.

77 of 117

Binary Attributes: Computing GINI Index

  • Splits into two partitions
  • Effect of Weighing partitions:
    • Larger and Purer Partitions are sought for.

B?

Yes

No

Node N1

Node N2

Gini(N1) �= 1 – (5/6)2 – (2/6)2 �= 0.194

Gini(N2) �= 1 – (1/6)2 – (4/6)2 �= 0.528

Gini(Children) �= 7/12 * 0.194 + 5/12 * 0.528�= 0.333

78 of 117

Categorical Attributes: Computing Gini Index

  • For each distinct value, gather counts for each class in the dataset
  • Use the count matrix to make decisions

Multi-way split

Two-way split

(find best partition of values)

79 of 117

Example

Past Trend

Open Interest

Trading Volume

Return

Positive

Low

High

Up

Negative

High

Low

Down

Positive

Low

High

Up

Positive

High

High

Up

Negative

Low

High

Down

Positive

Low

Low

Down

Negative

High

High

Down

Negative

Low

High

Down

Positive

Low

Low

Down

Positive

High

High

Up

80 of 117

Continuous Attributes: Computing Gini Index

  • Use Binary Decisions based on one value
  • Several Choices for the splitting value
    • Number of possible splitting values �= Number of distinct values
  • Each splitting value has a count matrix associated with it
    • Class counts in each of the partitions, A < v and A ≥ v
  • Simple method to choose best v
    • For each v, scan the database to gather count matrix and compute its Gini index
    • Computationally Inefficient! Repetition of work.

81 of 117

Continuous Attributes: Computing Gini Index...

  • For efficient computation: for each attribute,
    • Sort the attribute on values
    • Linearly scan these values, each time updating the count matrix and computing gini index
    • Choose the split position that has the least gini index

Split Positions

Sorted Values

82 of 117

Splitting Criteria based on Classification Error

  • Classification error at a node t :

  • Measures misclassification error made by a node.
      • Maximum (1 - 1/nc) when records are equally distributed among all classes, implying least interesting information
      • Minimum (0.0) when all records belong to one class, implying most interesting information

83 of 117

Examples for Computing Error

P(C1) = 0/6 = 0 P(C2) = 6/6 = 1

Error = 1 – max (0, 1) = 1 – 1 = 0

P(C1) = 1/6 P(C2) = 5/6

Error = 1 – max (1/6, 5/6) = 1 – 5/6 = 1/6

P(C1) = 2/6 P(C2) = 4/6

Error = 1 – max (2/6, 4/6) = 1 – 4/6 = 1/3

84 of 117

Misclassification Error vs Gini

A?

Yes

No

Node N1

Node N2

Gini(N1) �= 1 – (3/3)2 – (0/3)2 �= 0

Gini(N2) �= 1 – (4/7)2 – (3/7)2 �= 0.489

Gini(Children) �= 3/10 * 0 �+ 7/10 * 0.489�= 0.342

85 of 117

Example: GINI Index

Instances

Sunny

Hot

High

Weak

No

Sunny

Hot

High

Strong

No

Overcast

Hot

High

Weak

Yes

Rain

Mild

High

Weak

Yes

Rain

Cool

Normal

Weak

Yes

Rain

Cool

Normal

Strong

No

Overcast

Cool

Normal

Strong

Yes

Sunny

Mild

High

Weak

No

Sunny

Cool

Normal

Weak

Yes

Rain

Mild

Normal

Weak

Yes

Sunny

Mild

Normal

Strong

Yes

Overcast

Mild

High

Strong

Yes

Overcast

Hot

Normal

Weak

Yes

Rain

Mild

High

Strong

No

 

86 of 117

Example: GINI Index…

  • First attribute is “Outlook”

Outlook

Yes

No

Number of instances

Sunny

2

3

5

Overcast

4

0

4

Rain

3

2

5

Gini(Outlook=Sunny) = 1 – (2/5)2 – (3/5)2 = 1 – 0.16 – 0.36 = 0.48

Gini(Outlook=Overcast) = 1 – (4/4)2 – (0/4)2 = 0

Gini(Outlook=Rain) = 1 – (3/5)2 – (2/5)2 = 1 – 0.36 – 0.16 = 0.48

Then, we will calculate weighted sum of Gini indexes for outlook feature.

Gini(Outlook) = (5/14) x 0.48 + (4/14) x 0 + (5/14) x 0.48 = 0.171 + 0 + 0.171 = 0.342

87 of 117

Example: GINI Index…

  • Second attribute is “Temperature”

Temperature

Yes

No

Number of instances

Hot

2

2

4

Cool

3

1

4

Mild

4

2

6

Gini(Temp=Hot) = 1 – (2/4)2 – (2/4)2 = 0.5

Gini(Temp=Cool) = 1 – (3/4)2 – (1/4)2 = 1 – 0.5625 – 0.0625 = 0.375

Gini(Temp=Mild) = 1 – (4/6)2 – (2/6)2 = 1 – 0.444 – 0.111 = 0.445

We’ll calculate weighted sum of gini index for temperature feature

Gini(Temp) = (4/14) x 0.5 + (4/14) x 0.375 + (6/14) x 0.445 = 0.142 + 0.107 + 0.190 = 0.439

88 of 117

Example: GINI Index…

  • Third attribute is “Humidity”

Humidity

Yes

No

Number of instances

High

3

4

7

Normal

6

1

7

Gini(Humidity=High) = 1 – (3/7)2 – (4/7)2 = 1 – 0.183 – 0.326 = 0.489

Gini(Humidity=Normal) = 1 – (6/7)2 – (1/7)2 = 1 – 0.734 – 0.02 = 0.244

Weighted sum for humidity feature will be calculated next

Gini(Humidity) = (7/14) x 0.489 + (7/14) x 0.244 = 0.367

89 of 117

Example: GINI Index…

  • Fourth attribute is “Wind

Wind

Yes

No

Number of instances

Weak

6

2

8

Strong

3

3

6

Gini(Wind=Weak) = 1 – (6/8)2 – (2/8)2 = 1 – 0.5625 – 0.062 = 0.375

Gini(Wind=Strong) = 1 – (3/6)2 – (3/6)2 = 1 – 0.25 – 0.25 = 0.5

Gini(Wind) = (8/14) x 0.375 + (6/14) x 0.5 = 0.428

90 of 117

Example: GINI Index…

  • Gini Index of all attributes are as:

Feature

Gini index

Outlook

0.342

Temperature

0.439

Humidity

0.367

Wind

0.428

91 of 117

Decision Tree at First Split

In this tree, overcast has only yes. Hence, it reaches at leaf node

92 of 117

Decision Tree at First Split

In this tree, overcast has only yes. Hence, it reaches at leaf node

93 of 117

DT: GINI Index for sub dataset

  • Same procedure is applied for the sub datasets.
  • The sub dataset for “sunny” outlook. We need to find the GINI index scores for temperature, humidity and wind features, respectively.

Day

Outlook

Temp.

Humidity

Wind

Decision

1

Sunny

Hot

High

Weak

No

2

Sunny

Hot

High

Strong

No

8

Sunny

Mild

High

Weak

No

9

Sunny

Cool

Normal

Weak

Yes

11

Sunny

Mild

Normal

Strong

Yes

94 of 117

DT: GINI Index for sub dataset

  • Gini of temperature for “sunny” outlook.

Temperature

Yes

No

Number of instances

Hot

0

2

2

Cool

1

0

1

Mild

1

1

2

Gini(Outlook=Sunny and Temp.=Hot) = 1 – (0/2)2 – (2/2)2 = 0

Gini(Outlook=Sunny and Temp.=Cool) = 1 – (1/1)2 – (0/1)2 = 0

Gini(Outlook=Sunny and Temp.=Mild) = 1 – (1/2)2 – (1/2)2 = 1 – 0.25 – 0.25 = 0.5

Gini(Outlook=Sunny and Temp.) = (2/5)x0 + (1/5)x0 + (2/5)x0.5 = 0.2

95 of 117

DT: GINI Index for sub dataset

  • Gini of humidity for sunny outlook.

Humidity

Yes

No

Number of instances

High

0

3

3

Normal

2

0

2

Gini(Outlook=Sunny and Humidity=High) = 1 – (0/3)2 – (3/3)2 = 0

Gini(Outlook=Sunny and Humidity=Normal) = 1 – (2/2)2 – (0/2)2 = 0

Gini(Outlook=Sunny and Humidity) = (3/5)x0 + (2/5)x0 = 0

96 of 117

DT: GINI Index for sub dataset

  • Gini of wind for sunny outlook.

Wind

Yes

No

Number of instances

Weak

1

2

3

Strong

1

1

2

Gini(Outlook=Sunny and Wind=Weak) = 1 – (1/3)2 – (2/3)2 = 0.266

Gini(Outlook=Sunny and Wind=Strong) = 1- (1/2)2 – (1/2)2 = 0.2

Gini(Outlook=Sunny and Wind) = (3/5)x0.266 + (2/5)x0.2 = 0.466

97 of 117

DT: GINI Index for sub dataset

  • Decision for sunny outlook.

Feature

Gini index

Temperature

0.2

Humidity

0

Wind

0.466

We’ll put humidity check at the extension of sunny outlook.

98 of 117

DT: Second Split

As seen, decision is always no for high humidity and sunny outlook. On the other hand, decision will always be yes for normal humidity and sunny outlook. This branch is over.

99 of 117

Decision Tree: Second Split

100 of 117

DT: Rain_Outlook

  • Rain outlook

Day

Outlook

Temp.

Humidity

Wind

Decision

4

Rain

Mild

High

Weak

Yes

5

Rain

Cool

Normal

Weak

Yes

6

Rain

Cool

Normal

Strong

No

10

Rain

Mild

Normal

Weak

Yes

14

Rain

Mild

High

Strong

No

We’ll calculate Gini index scores for temperature, humidity and wind features when outlook is rain.

101 of 117

DT: Rain_Outlook…

  • Gini of temperature for rain outlook

Temperature

Yes

No

Number of instances

Cool

1

1

2

Mild

2

1

3

Gini(Outlook=Rain and Temp.=Cool) = 1 – (1/2)2 – (1/2)2 = 0.5

Gini(Outlook=Rain and Temp.=Mild) = 1 – (2/3)2 – (1/3)2 = 0.444

Gini(Outlook=Rain and Temp.) = (2/5)x0.5 + (3/5)x0.444 = 0.466

102 of 117

DT: Rain_Outlook…

  • Gini of humidity for rain outlook

Humidity

Yes

No

Number of instances

High

1

1

2

Normal

2

1

3

Gini(Outlook=Rain and Humidity=High) = 1 – (1/2)2 – (1/2)2 = 0.5

Gini(Outlook=Rain and Humidity=Normal) = 1 – (2/3)2 – (1/3)2 = 0.444

Gini(Outlook=Rain and Humidity) = (2/5)x0.5 + (3/5)x0.444 = 0.466

103 of 117

DT: Rain_Outlook…

  • Gini of wind for rain outlook

Wind

Yes

No

Number of instances

Weak

3

0

3

Strong

0

2

2

Gini(Outlook=Rain and Wind=Weak) = 1 – (3/3)2 – (0/3)2 = 0

Gini(Outlook=Rain and Wind=Strong) = 1 – (0/2)2 – (2/2)2 = 0

Gini(Outlook=Rain and Wind) = (3/5)x0 + (2/5)x0 = 0

104 of 117

DT: Rain_Outlook…

  • Decision for rain outlook
    • The winner is wind feature for rain outlook because it has the minimum Gini index score in features.

Feature

Gini index

Temperature

0.466

Humidity

0.466

Wind

0

105 of 117

DT: at Third Split

  • Put the wind feature for rain outlook branch and monitor the new sub data sets.

As seen, decision is always yes when wind is weak. On the other hand, decision is always no if wind is strong. This means that this branch is over.

106 of 117

Decision Tree: Final

107 of 117

Stopping Criteria for Tree Induction

  • Stop expanding a node when all the records belong to the same class
  • Stop expanding a node when all the records have similar attribute values.

108 of 117

Decision Tree Based Classification

  • Advantages:
    • Inexpensive to construct
    • Extremely fast at classifying unknown records
    • Easy to interpret for small-sized trees
    • Accuracy is comparable to other classification techniques for many simple data sets

109 of 117

Practical Issues of Classification

  • Underfitting and Overfitting

  • Missing Values

  • Costs of Classification

110 of 117

Underfitting and Overfitting (Example)

500 circular and 500 triangular data points.

Circular points:

0.5 ≤ sqrt(x12+x22) ≤ 1

Triangular points:

sqrt(x12+x22) > 0.5 or

sqrt(x12+x22) < 1

111 of 117

Underfitting and Overfitting

Overfitting

Underfitting: when model is too simple, both training and test errors are large

112 of 117

Overfitting due to Noise

Decision boundary is distorted by noise point

113 of 117

Overfitting due to Insufficient Examples

Lack of data points in the lower half of the diagram makes it difficult to predict correctly the class labels of that region

- Insufficient number of training records in the region causes the decision tree to predict the test examples using other training records that are irrelevant to the classification task

114 of 117

Notes on Overfitting

  • Overfitting results in decision trees that are more complex than necessary.

  • Training error no longer provides a good estimate of how well the tree will perform on previously unseen records

  • Need new ways for estimating errors

115 of 117

Occam’s Razor

  • Given two models of similar generalization errors, one should prefer the simpler model over the more complex model.

  • For complex models, there is a greater chance that it was fitted accidentally by errors in data.

  • Therefore, one should include model complexity when evaluating a model.

116 of 117

How to Address Overfitting

  • Pre-Pruning (Early Stopping Rule)
    • Stop the algorithm before it becomes a fully-grown tree
    • Typical stopping conditions for a node:
      • Stop if all instances belong to the same class
      • Stop if all the attribute values are the same
    • More restrictive conditions:
      • Stop if number of instances is less than some user-specified threshold
      • Stop if class distribution of instances are independent of the available features (e.g., using χ 2 test)
      • Stop if expanding the current node does not improve impurity� measures (e.g., Gini or information gain).

117 of 117

How to Address Overfitting…

  • Post-pruning
    • Grow decision tree to its entirety
    • Trim the nodes of the decision tree in a bottom-up fashion
    • If generalization error improves after trimming, replace sub-tree by a leaf node.
    • Class label of leaf node is determined from majority class of instances in the sub-tree
    • Can use MDL for post-pruning