JavaScript isn't enabled in your browser, so this file can't be opened. Enable and reload.

1 of 117

�Decision Trees

Dr. Dinesh Kumar Vishwakarma

Professor,

Department of Information Technology,

Delhi Technological University, Delhi-110042

dinesh@dtu.ac.in

http://www.dtu.ac.in/Web/Departments/InformationTechnology/faculty/dkvishwakarma.php

2 of 117

Introduction

A decision tree is a support tool with a tree-like structure that models probable outcomes, cost of resources, utilities, and possible consequences.
It provides a way to present algorithms with conditional control statements.
It include branches that represent decision-making steps that can lead to a favorable result.

The flowchart structure includes internal nodes that represent tests or attributes at each stage.
Every branch stands for an outcome for the attributes, while the path from the leaf to the root represents rules for classification.

3 of 117

Applications

Assessing prospective growth opportunities
Using demographic data to find prospective clients
Marketing
Retention of Customers
Diagnosis of Diseases and Ailments
Detection of Frauds

4 of 117

Advantages

Easy to read and interpret

The o/p are easy to read and interpret without requiring statistical knowledge

Easy to prepare

take less effort for data preparation

Less data cleaning required

There is less data cleaning required once the variables have been created. In cases of missing values and outliers have less significance on the decision tree’s data.

5 of 117

Decision Tree: DT

Graphical representation of all possible solutions.
Decisions are based on some conditions.
Decision made can be easily explained.

Hungary

Yes

Have ₹ 100

Sleep

Restaurant

Yes

Apple

A decision tree is a flowchart-like structure in which each internal node represents a "test" on an attribute (e.g. whether a coin flip comes up heads or tails), each branch represents the outcome of the test, and each leaf node represents a class label (decision taken after computing all attributes)

E.g.

Call Centre

Root node

Leaf Node

Decision Node

6 of 117

Decision Tree: DT

categorical

continuous

class

Refund

MarSt

TaxInc

YES

Yes

Married

Single, Divorced

< 80K

> 80K

Splitting Attributes

Training Data

Model: Decision Tree

7 of 117

Another Example of Decision Tree

categorical

continuous

class

MarSt

Refund

TaxInc

YES

Yes

Married

Single, Divorced

< 80K

> 80K

There could be more than one tree that fits the same data!

8 of 117

Decision Tree Classification Task

Decision Tree

9 of 117

Apply Model to Test Data

Refund

MarSt

TaxInc

YES

Yes

Married

Single, Divorced

< 80K

> 80K

Test Data

Start from the root of tree.

10 of 117

Apply Model to Test Data

Refund

MarSt

TaxInc

YES

Yes

Married

Single, Divorced

< 80K

> 80K

Test Data

11 of 117

Apply Model to Test Data

Refund

MarSt

TaxInc

YES

Yes

Married

Single, Divorced

< 80K

> 80K

Test Data

12 of 117

Apply Model to Test Data

Refund

MarSt

TaxInc

YES

Yes

Married

Single, Divorced

< 80K

> 80K

Test Data

13 of 117

Apply Model to Test Data

Refund

MarSt

TaxInc

YES

Yes

Married

Single, Divorced

< 80K

> 80K

Test Data

14 of 117

Apply Model to Test Data

Refund

MarSt

TaxInc

YES

Yes

Married

Single, Divorced

< 80K

> 80K

Test Data

Assign Cheat to “No”

15 of 117

Decision Tree Classification Task

Decision Tree

16 of 117

Decision Tree Induction

Many Algorithms:

Hunt’s Algorithm (one of the earliest)
CART
ID3, C4.5

17 of 117

Hunt’s Algorithm

D_t

18 of 117

Hunt’s Algorithm

Don’t

Cheat

Refund

Don’t

Cheat

Don’t

Cheat

Yes

Refund

Don’t

Cheat

Yes

Marital

Status

Don’t

Cheat

Single,

Divorced

Married

Taxable

Income

Don’t

Cheat

< 80K

>= 80K

Refund

Don’t

Cheat

Yes

Marital

Status

Don’t

Cheat

Single,

Divorced

Married

19 of 117

Tree Induction

Greedy strategy.

Split the records based on an attribute test that optimizes certain criterion.

Issues

Determine how to split the records

How to specify the attribute test condition?
How to determine the best split?

Determine when to stop splitting

20 of 117

Tree Induction

Greedy strategy.

Split the records based on an attribute test that optimizes certain criterion.

Issues

Determine how to split the records

How to specify the attribute test condition?
How to determine the best split?

Determine when to stop splitting

21 of 117

How to Specify Test Condition?

Depends on attribute types

Binary
Nominal
Ordinal
Continuous

Depends on number of ways to split

2-way split
Multi-way split

22 of 117

Splitting Based on Binary Attributes

BuildDT algorithm must provides a method for expressing an attribute test condition and corresponding outcome for different attribute type
Case: Binary attribute

This is the simplest case of node splitting

The test condition for a binary attribute generates only two outcomes

23 of 117

Splitting Based on Nominal Attributes

Multi-way split: Use as many partitions as distinct values.

Binary split: Divides values into two subsets. � Need to find optimal partitioning.

CarType

Family

Sports

Luxury

CarType

{Family, �Luxury}

{Sports}

CarType

{Sports, Luxury}

{Family}

24 of 117

Splitting Based on Ordinal Attributes

Multi-way split: Use as many partitions as distinct values.

Small (S), Medium (M),
Large (L), Extra Large (XL)

Binary split: Divides values into two subsets. � Need to find optimal partitioning.

What about this split?

Size

{M, L}

{S}

Size

{S, M}

{L, XL}

Size

{S, L}

{M, XL}

Not allowed

25 of 117

Splitting Based on Continuous Attributes

Different ways of handling

Discretization to form an ordinal categorical attribute

Static – discretize once at the beginning
Dynamic – ranges can be found by equal interval bucketing, equal frequency bucketing� (percentiles), or clustering.

Binary Decision: (A < v) or (A ≥ v)

consider all possible splits and finds the best cut
can be more compute intensive

26 of 117

Splitting Based on Continuous Attributes…

27 of 117

Tree Induction

Greedy strategy.

Split the records based on an attribute test that optimizes certain criterion.

Issues

Determine how to split the records

How to specify the attribute test condition?
How to determine the best split?

Determine when to stop splitting

28 of 117

How to determine the Best Split

Before Splitting: 10 records of class 0,� 10 records of class 1

Which test condition is the best?

29 of 117

How to determine the Best Split

Greedy approach:

Nodes with homogeneous class distribution are preferred.

Need a measure of node impurity:

Non-homogeneous,

High degree of impurity

Homogeneous,

Low degree of impurity

30 of 117

Measures of Node Impurity

There are three popular way to measure impurity is:

Information Gain
Gain Ratio
Gini Index

31 of 117

Concept of Entropy

More organized or Less organized or

ordered (less probable) disordered (more probable)

More ordered Less ordered

less entropy higher entropy

32 of 117

An Open Challenge!

Two sheets showing the tabulation of marks obtained in a course are shown.

Which tabulation of marks shows the “good” performance of the class?

How you can measure the same?

Roll No.	Assignment	Project	Mid-Sem	End-Sem
12BT3FP06	89	99	56	91
10IM30013	95	98	55	93
12CE31005	98	96	58	97
12EC35015	93	95	54	99
12GG2005	90	91	53	98
12MI33006	91	93	57	97
13AG36001	96	94	58	95
13EE10009	92	96	56	96
13MA20012	88	98	59	96
14CS30017	94	90	60	94
14ME10067	90	92	58	95
14MT10038	99	89	55	93

Roll No.	Assignment	Project	Mid-Sem	End-Sem
12BT3FP06	19	59	16	71
10IM30013	37	38	25	83
12CE31005	38	16	48	97
12EC35015	23	95	54	19
12GG2005	40	71	43	28
12MI33006	61	93	47	97
13AG36001	26	64	48	75
13EE10009	92	46	56	56
13MA20012	88	58	59	66
14CS30017	74	20	60	44
14ME10067	50	42	38	35
14MT10038	29	69	25	33

33 of 117

Entropy and its Meaning

Entropy is an important concept used in Physics in the context of heat and thereby uncertainty of the states of a matter.

At a later stage, with the growth of Information Technology, entropy becomes an important concept in Information Theory.

To deal with the classification job, entropy is an important concept, which is considered as

an information-theoretic measure of the “uncertainty” contained in a training data

due to the presence of more than one classes.

34 of 117

Entropy in Information Theory

The entropy concept in information theory first time coined by Claude Shannon (1850).
The first time it was used to measure the “information content” in messages.
According to his concept of entropy, presently entropy is widely being used as a way of representing messages for efficient transmission by Telecommunication Systems.

35 of 117

Measure of Information Content

36 of 117

Entropy Calculations

37 of 117

Entropy Calculations…

38 of 117

Entropy of a Training Set

Consider a dataset of OTPH as shown in the following table with total 24 instances in it.

Age	Eye sight	Astigmatic	Use Type	Class
1 1 1 1 1 1	1 1 1 1 2 2	1 1 2 2 1 1	1 2 1 2 1 2	3 2 3 1 3 2
1 1 2 2 2 2	2 2 1 1 1 1	2 2 1 1 2 2	1 2 1 2 1 2	3 1 3 2 3 1
2 2 2 2 3 3	2 2 2 2 1 1	1 1 2 2 1 1	1 2 1 2 1 2	3 2 3 3 3 3
3 3 3 3 3 3	1 1 2 2 2 2	2 2 1 1 2 2	1 2 1 2 1 2	3 1 3 2 3 3

A coded forms for all values of attributes are used to avoid the cluttering in the table.

39 of 117

Entropy of a training set…

Specification of the attributes are as follows.

Age	Eye Sight	Astigmatic	Use Type
1: Young	1: Myopia	1: No	1: Frequent
2: Middle-aged	2: Hypermetropia	2: Yes	2: Less
3: Old

40 of 117

Decision Tree Induction Techniques

Decision tree induction is a top-down, recursive and divide-and-conquer approach.
The procedure is to choose an attribute and split it into from a larger training set into smaller training sets.
Different algorithms have been proposed to take a good control over

Choosing the best attribute to be splitted, and
Splitting criteria

Several algorithms have been proposed for the above tasks. In this lecture, we shall limit our discussions into three important of them

ID3
C 4.5
CART

41 of 117

Algorithm ID3

42 of 117

ID3: Decision Tree Induction Algorithms

Quinlan [1986] introduced the ID3, a popular short form of Iterative Dichotomizer 3 for decision trees from a set of training data.

In ID3, each node corresponds to a splitting attribute and each arc is a possible value of that attribute.

At each node, the splitting attribute is selected to be the most informative among the attributes not yet considered in the path starting from the root.

43 of 117

Algorithm ID3

In ID3, entropy is used to measure how informative a node is.

It is observed that splitting on any attribute has the property that average entropy of the resulting training subsets will be less than or equal to that of the previous training set.

ID3 algorithm defines a measurement of a splitting called Information Gain to determine the goodness of a split.

The attribute with the largest value of information gain is chosen as the splitting attribute and

it partitions into a number of smaller training sets based on the distinct values of attribute under split.

44 of 117

Defining Information Gain

45 of 117

Defining Information Gain…

46 of 117

Defining Information Gain…

47 of 117

Compute Information Gain

Missing value

48 of 117

Example

Age (x1)	Eye-sight (x2)	Astigmatism (x3)	Use type (x4)	Class (y)
1	1	1	1	3
1	1	1	2	2
1	1	2	1	3
1	1	2	2	1
1	2	1	1	3
1	2	1	2	2
1	2	2	1	3
1	2	2	2	1

49 of 117

Example…

Age	Eye-sight	Astigmatism	Use type	Class
2	1	1	1	3
2	1	1	2	2
2	1	2	1	3
2	1	2	2	1
2	2	1	1	3
2	2	1	2	2
2	2	2	1	3
2	2	2	2	3

50 of 117

Example…

Age	Eye-sight	Astigmatism	Use type	Class
3	1	1	1	3
3	1	1	2	3
3	1	2	1	3
3	1	2	2	1
3	2	1	1	3
3	2	1	2	2
3	2	2	1	3
3	2	2	2	3

51 of 117

Information Gains for Different Attributes

52 of 117

Decision Tree Induction : ID3 Way

53 of 117

Decision Tree Induction : ID3 Way

Age

Eye-sight

Astigmatic

Use Type

Age	Eye	Ast	Use	Class
1	1	1	1	3
1	1	2	1	3
1	2	1	1	3
1	2	2	1	3
2	1	1	1	3
2	2	1	1	3
2	2	2	1	3
3	1	1	1	3
3	1	2	1	3
3	2	1	1	3
3	2	2	1	3

Age	Eye	Ast	Use	Class
1	1	1	2	2
1	1	2	2	1
1	2	1	2	2
1	2	2	2	1
2	1	1	2	2
2	1	2	2	1
2	2	1	2	2
3	1	1	2	3
3	1	2	2	3
3	2	1	2	2
3	2	2	2	3

✔

Age

Eye-sight

Astigmatic

Age

Eye-sight

Astigmatic

54 of 117

Example 3

Instances
	Sunny	Hot	High	Weak	No
	Sunny	Hot	High	Strong	No
	Overcast	Hot	High	Weak	Yes
	Rain	Mild	High	Weak	Yes
	Rain	Cool	Normal	Weak	Yes
	Rain	Cool	Normal	Strong	No
	Overcast	Cool	Normal	Strong	Yes
	Sunny	Mild	High	Weak	No
	Sunny	Cool	Normal	Weak	Yes
	Rain	Mild	Normal	Weak	Yes
	Sunny	Mild	Normal	Strong	Yes
	Overcast	Mild	High	Strong	Yes
	Overcast	Hot	Normal	Weak	Yes
	Rain	Mild	High	Strong	No

55 of 117

Example 3…

56 of 117

Example 3…

A partially learned decision tree
There are three possibility to select further attribute Temperature, Humidity, and Wind.
Intuitively, “Humidity” is the best to choose

57 of 117

Example 3: Information Gain

58 of 117

Example 3: Information Gain…

Outlook has maximum gain

Root Node

59 of 117

Example 3: Information Gain…

Same strategy is applied recursively to each subset of training instances.

60 of 117

Example 3: Information Gain…

Further branching at the node reached when outlook is sunny.
The information gain at daughter node are:
Temperature: Hot-02, Mild-02, Cool-01

61 of 117

Example 3: Information Gain…

Similarly, for Humidity and Wind
Humidity: High-03 (N-3, Y-0), Normal-02(N-0, Y-2)

Wind: Weak-03(Y-1, N-2), Strong-02 (Y-1, N-1)

We choose “Humidity”, since it has highest gain. No further splitting, reached as leaf node

62 of 117

Facts of Information Gain

63 of 117

C4.5�Gain Ratio

64 of 117

Gain Ratio

65 of 117

Dataset

Instances
	Sunny	Hot	High	Weak	No
	Sunny	Hot	High	Strong	No
	Overcast	Hot	High	Weak	Yes
	Rain	Mild	High	Weak	Yes
	Rain	Cool	Normal	Weak	Yes
	Rain	Cool	Normal	Strong	No
	Overcast	Cool	Normal	Strong	Yes
	Sunny	Mild	High	Weak	No
	Sunny	Cool	Normal	Weak	Yes
	Rain	Mild	Normal	Weak	Yes
	Sunny	Mild	Normal	Strong	Yes
	Overcast	Mild	High	Strong	Yes
	Overcast	Hot	Normal	Weak	Yes
	Rain	Mild	High	Strong	No

66 of 117

Example :Gain Ratio

67 of 117

Example :Gain Ratio…

Normalize the information gain by dividing by the split info value to get gain ratio.

68 of 117

Example :Gain Ratio…

Similarly, it can be calculated for others attributes such as:

Outlook still comes out on top but Humidity is much closers contender because it split the data into two subsets instead three.

69 of 117

CART�GINI INDEX

CLASSIFICATION AND REGRESSION TREE

70 of 117

GINI Index

The Gini Index facilitates the bigger distributions so easy to implement whereas the Information Gain favors lesser distributions having small count with multiple specific values.
The method of the Gini Index is used by CART algorithms, in contrast to it, Information Gain is used in ID3, C4.5 algorithms.
Gini index operates on the categorical target variables in terms of “success” or “failure” and performs only binary split, in opposite to that Information Gain computes the difference between entropy before and after the split and indicates the impurity in classes of elements.

71 of 117

GINI Index…

72 of 117

GINI Index…

73 of 117

GINI Index…

74 of 117

Measure of Impurity: GINI

Gini Index for a given node t :

Maximum (1 - 1/n_c) when records are equally distributed among all classes, implying least interesting information.
Minimum (0.0) when all records belong to one class, implying most interesting information.

75 of 117

Examples for computing GINI

P(C1) = 0/6 = 0 P(C2) = 6/6 = 1

Gini = 1 – P(C1)²– P(C2)² = 1 – 0 – 1 = 0

P(C1) = 1/6 P(C2) = 5/6

Gini = 1 – (1/6)²– (5/6)² = 0.278

P(C1) = 2/6 P(C2) = 4/6

Gini = 1 – (2/6)²– (4/6)² = 0.444

76 of 117

Splitting Based on GINI

Used in CART: Classification And Regression Trees
When a node p is split into k partitions (children), the quality of split is computed as,

where, n_i = number of records at child i,

n = number of records at node p.

77 of 117

Binary Attributes: Computing GINI Index

Splits into two partitions
Effect of Weighing partitions:

Larger and Purer Partitions are sought for.

Yes

Node N1

Node N2

Gini(N1) �= 1 – (5/6)²– (2/6)² �= 0.194

Gini(N2) �= 1 – (1/6)²– (4/6)² �= 0.528

Gini(Children) �= 7/12 * 0.194 + 5/12 * 0.528�= 0.333

78 of 117

Categorical Attributes: Computing Gini Index

For each distinct value, gather counts for each class in the dataset
Use the count matrix to make decisions

Multi-way split

Two-way split

(find best partition of values)

79 of 117

Example

Past Trend	Open Interest	Trading Volume	Return
Positive	Low	High	Up
Negative	High	Low	Down
Positive	Low	High	Up
Positive	High	High	Up
Negative	Low	High	Down
Positive	Low	Low	Down
Negative	High	High	Down
Negative	Low	High	Down
Positive	Low	Low	Down
Positive	High	High	Up

80 of 117

Continuous Attributes: Computing Gini Index

Use Binary Decisions based on one value
Several Choices for the splitting value

Number of possible splitting values �= Number of distinct values

Each splitting value has a count matrix associated with it

Class counts in each of the partitions, A < v and A ≥ v

Simple method to choose best v

For each v, scan the database to gather count matrix and compute its Gini index
Computationally Inefficient! Repetition of work.

81 of 117

Continuous Attributes: Computing Gini Index...

For efficient computation: for each attribute,

Sort the attribute on values
Linearly scan these values, each time updating the count matrix and computing gini index
Choose the split position that has the least gini index

Split Positions

Sorted Values

82 of 117

Splitting Criteria based on Classification Error

Classification error at a node t :

Measures misclassification error made by a node.

Maximum (1 - 1/n_c) when records are equally distributed among all classes, implying least interesting information
Minimum (0.0) when all records belong to one class, implying most interesting information

83 of 117

Examples for Computing Error

P(C1) = 0/6 = 0 P(C2) = 6/6 = 1

Error = 1 – max (0, 1) = 1 – 1 = 0

P(C1) = 1/6 P(C2) = 5/6

Error = 1 – max (1/6, 5/6) = 1 – 5/6 = 1/6

P(C1) = 2/6 P(C2) = 4/6

Error = 1 – max (2/6, 4/6) = 1 – 4/6 = 1/3

84 of 117

Misclassification Error vs Gini

Yes

Node N1

Node N2

Gini(N1) �= 1 – (3/3)²– (0/3)² �= 0

Gini(N2) �= 1 – (4/7)²– (3/7)² �= 0.489

Gini(Children) �= 3/10 * 0 �+ 7/10 * 0.489�= 0.342

85 of 117

Example: GINI Index

Instances
	Sunny	Hot	High	Weak	No
	Sunny	Hot	High	Strong	No
	Overcast	Hot	High	Weak	Yes
	Rain	Mild	High	Weak	Yes
	Rain	Cool	Normal	Weak	Yes
	Rain	Cool	Normal	Strong	No
	Overcast	Cool	Normal	Strong	Yes
	Sunny	Mild	High	Weak	No
	Sunny	Cool	Normal	Weak	Yes
	Rain	Mild	Normal	Weak	Yes
	Sunny	Mild	Normal	Strong	Yes
	Overcast	Mild	High	Strong	Yes
	Overcast	Hot	Normal	Weak	Yes
	Rain	Mild	High	Strong	No

86 of 117

Example: GINI Index…

First attribute is “Outlook”

Outlook	Yes	No	Number of instances
Sunny	2	3	5
Overcast	4	0	4
Rain	3	2	5

Gini(Outlook=Sunny) = 1 – (2/5)² – (3/5)² = 1 – 0.16 – 0.36 = 0.48

Gini(Outlook=Overcast) = 1 – (4/4)² – (0/4)² = 0

Gini(Outlook=Rain) = 1 – (3/5)² – (2/5)² = 1 – 0.36 – 0.16 = 0.48

Then, we will calculate weighted sum of Gini indexes for outlook feature.

Gini(Outlook) = (5/14) x 0.48 + (4/14) x 0 + (5/14) x 0.48 = 0.171 + 0 + 0.171 = 0.342

87 of 117

Example: GINI Index…

Second attribute is “Temperature”

Temperature	Yes	No	Number of instances
Hot	2	2	4
Cool	3	1	4
Mild	4	2	6

Gini(Temp=Hot) = 1 – (2/4)² – (2/4)² = 0.5

Gini(Temp=Cool) = 1 – (3/4)² – (1/4)² = 1 – 0.5625 – 0.0625 = 0.375

Gini(Temp=Mild) = 1 – (4/6)² – (2/6)² = 1 – 0.444 – 0.111 = 0.445

We’ll calculate weighted sum of gini index for temperature feature

Gini(Temp) = (4/14) x 0.5 + (4/14) x 0.375 + (6/14) x 0.445 = 0.142 + 0.107 + 0.190 = 0.439

88 of 117

Example: GINI Index…

Third attribute is “Humidity”

Humidity	Yes	No	Number of instances
High	3	4	7
Normal	6	1	7

Gini(Humidity=High) = 1 – (3/7)² – (4/7)² = 1 – 0.183 – 0.326 = 0.489

Gini(Humidity=Normal) = 1 – (6/7)² – (1/7)² = 1 – 0.734 – 0.02 = 0.244

Weighted sum for humidity feature will be calculated next

Gini(Humidity) = (7/14) x 0.489 + (7/14) x 0.244 = 0.367

89 of 117

Example: GINI Index…

Fourth attribute is “Wind”

Wind	Yes	No	Number of instances
Weak	6	2	8
Strong	3	3	6

Gini(Wind=Weak) = 1 – (6/8)² – (2/8)² = 1 – 0.5625 – 0.062 = 0.375

Gini(Wind=Strong) = 1 – (3/6)² – (3/6)² = 1 – 0.25 – 0.25 = 0.5

Gini(Wind) = (8/14) x 0.375 + (6/14) x 0.5 = 0.428

90 of 117

Example: GINI Index…

Gini Index of all attributes are as:

Feature	Gini index
Outlook	0.342
Temperature	0.439
Humidity	0.367
Wind	0.428

91 of 117

Decision Tree at First Split

In this tree, overcast has only yes. Hence, it reaches at leaf node

92 of 117

Decision Tree at First Split

In this tree, overcast has only yes. Hence, it reaches at leaf node

93 of 117

DT: GINI Index for sub dataset

Same procedure is applied for the sub datasets.
The sub dataset for “sunny” outlook. We need to find the GINI index scores for temperature, humidity and wind features, respectively.

Day	Outlook	Temp.	Humidity	Wind	Decision
1	Sunny	Hot	High	Weak	No
2	Sunny	Hot	High	Strong	No
8	Sunny	Mild	High	Weak	No
9	Sunny	Cool	Normal	Weak	Yes
11	Sunny	Mild	Normal	Strong	Yes

94 of 117

DT: GINI Index for sub dataset

Gini of temperature for “sunny” outlook.

Temperature	Yes	No	Number of instances
Hot	0	2	2
Cool	1	0	1
Mild	1	1	2

Gini(Outlook=Sunny and Temp.=Hot) = 1 – (0/2)² – (2/2)² = 0

Gini(Outlook=Sunny and Temp.=Cool) = 1 – (1/1)² – (0/1)² = 0

Gini(Outlook=Sunny and Temp.=Mild) = 1 – (1/2)² – (1/2)² = 1 – 0.25 – 0.25 = 0.5

Gini(Outlook=Sunny and Temp.) = (2/5)x0 + (1/5)x0 + (2/5)x0.5 = 0.2

95 of 117

DT: GINI Index for sub dataset

Gini of humidity for sunny outlook.

Humidity	Yes	No	Number of instances
High	0	3	3
Normal	2	0	2

Gini(Outlook=Sunny and Humidity=High) = 1 – (0/3)² – (3/3)² = 0

Gini(Outlook=Sunny and Humidity=Normal) = 1 – (2/2)² – (0/2)² = 0

Gini(Outlook=Sunny and Humidity) = (3/5)x0 + (2/5)x0 = 0

96 of 117

DT: GINI Index for sub dataset

Gini of wind for sunny outlook.

Wind	Yes	No	Number of instances
Weak	1	2	3
Strong	1	1	2

Gini(Outlook=Sunny and Wind=Weak) = 1 – (1/3)² – (2/3)² = 0.266

Gini(Outlook=Sunny and Wind=Strong) = 1- (1/2)² – (1/2)² = 0.2

Gini(Outlook=Sunny and Wind) = (3/5)x0.266 + (2/5)x0.2 = 0.466

97 of 117

DT: GINI Index for sub dataset

Decision for sunny outlook.

Feature	Gini index
Temperature	0.2
Humidity	0
Wind	0.466

We’ll put humidity check at the extension of sunny outlook.

98 of 117

DT: Second Split

As seen, decision is always no for high humidity and sunny outlook. On the other hand, decision will always be yes for normal humidity and sunny outlook. This branch is over.

99 of 117

Decision Tree: Second Split

100 of 117

DT: Rain_Outlook

Rain outlook

Day	Outlook	Temp.	Humidity	Wind	Decision
4	Rain	Mild	High	Weak	Yes
5	Rain	Cool	Normal	Weak	Yes
6	Rain	Cool	Normal	Strong	No
10	Rain	Mild	Normal	Weak	Yes
14	Rain	Mild	High	Strong	No

We’ll calculate Gini index scores for temperature, humidity and wind features when outlook is rain.

101 of 117

DT: Rain_Outlook…

Gini of temperature for rain outlook

Temperature	Yes	No	Number of instances
Cool	1	1	2
Mild	2	1	3

Gini(Outlook=Rain and Temp.=Cool) = 1 – (1/2)² – (1/2)² = 0.5

Gini(Outlook=Rain and Temp.=Mild) = 1 – (2/3)² – (1/3)² = 0.444

Gini(Outlook=Rain and Temp.) = (2/5)x0.5 + (3/5)x0.444 = 0.466

102 of 117

DT: Rain_Outlook…

Gini of humidity for rain outlook

Humidity	Yes	No	Number of instances
High	1	1	2
Normal	2	1	3

Gini(Outlook=Rain and Humidity=High) = 1 – (1/2)² – (1/2)² = 0.5

Gini(Outlook=Rain and Humidity=Normal) = 1 – (2/3)² – (1/3)² = 0.444

Gini(Outlook=Rain and Humidity) = (2/5)x0.5 + (3/5)x0.444 = 0.466

103 of 117

DT: Rain_Outlook…

Gini of wind for rain outlook

Wind	Yes	No	Number of instances
Weak	3	0	3
Strong	0	2	2

Gini(Outlook=Rain and Wind=Weak) = 1 – (3/3)² – (0/3)² = 0

Gini(Outlook=Rain and Wind=Strong) = 1 – (0/2)² – (2/2)² = 0

Gini(Outlook=Rain and Wind) = (3/5)x0 + (2/5)x0 = 0

104 of 117

DT: Rain_Outlook…

Decision for rain outlook

The winner is wind feature for rain outlook because it has the minimum Gini index score in features.

Feature	Gini index
Temperature	0.466
Humidity	0.466
Wind	0

105 of 117

DT: at Third Split

Put the wind feature for rain outlook branch and monitor the new sub data sets.

As seen, decision is always yes when wind is weak. On the other hand, decision is always no if wind is strong. This means that this branch is over.

106 of 117

Decision Tree: Final

107 of 117

Stopping Criteria for Tree Induction

Stop expanding a node when all the records belong to the same class
Stop expanding a node when all the records have similar attribute values.

108 of 117

Decision Tree Based Classification

Advantages:

Inexpensive to construct
Extremely fast at classifying unknown records
Easy to interpret for small-sized trees
Accuracy is comparable to other classification techniques for many simple data sets

109 of 117

Practical Issues of Classification

Underfitting and Overfitting

Missing Values

Costs of Classification

110 of 117

Underfitting and Overfitting (Example)

500 circular and 500 triangular data points.

Circular points:

0.5 ≤ sqrt(x₁²+x₂²) ≤ 1

Triangular points:

sqrt(x₁²+x₂²) > 0.5 or

sqrt(x₁²+x₂²) < 1

111 of 117

Underfitting and Overfitting

Overfitting

Underfitting: when model is too simple, both training and test errors are large

112 of 117

Overfitting due to Noise

Decision boundary is distorted by noise point

113 of 117

Overfitting due to Insufficient Examples

Lack of data points in the lower half of the diagram makes it difficult to predict correctly the class labels of that region

- Insufficient number of training records in the region causes the decision tree to predict the test examples using other training records that are irrelevant to the classification task

114 of 117

Notes on Overfitting

Overfitting results in decision trees that are more complex than necessary.

Training error no longer provides a good estimate of how well the tree will perform on previously unseen records

Need new ways for estimating errors

115 of 117

Occam’s Razor

Given two models of similar generalization errors, one should prefer the simpler model over the more complex model.

For complex models, there is a greater chance that it was fitted accidentally by errors in data.

Therefore, one should include model complexity when evaluating a model.

116 of 117

How to Address Overfitting

Pre-Pruning (Early Stopping Rule)

Stop the algorithm before it becomes a fully-grown tree
Typical stopping conditions for a node:

Stop if all instances belong to the same class
Stop if all the attribute values are the same

More restrictive conditions:

Stop if number of instances is less than some user-specified threshold
Stop if class distribution of instances are independent of the available features (e.g., using χ² test)
Stop if expanding the current node does not improve impurity� measures (e.g., Gini or information gain).

117 of 117

How to Address Overfitting…

Post-pruning

Grow decision tree to its entirety
Trim the nodes of the decision tree in a bottom-up fashion
If generalization error improves after trimming, replace sub-tree by a leaf node.
Class label of leaf node is determined from majority class of instances in the sub-tree
Can use MDL for post-pruning