1 of 60

Decision Tree

Guest Lecture by Aviva Prins

CMSC 320 - Introduction to Data Science

2 of 60

Before We Start, Remember:

No free lunch theorem

Much of ML is concerned with devising different models and different algorithms to fit them. However, there’s no single best model that works optimally for all kinds of problems.


Topics to cover today:

  • What is a Decision Tree?
  • Common Terminology
  • Concept of How to Split a Node
  • Entropy
  • Information Gain
  • Overfitting Problem
    • Pre-Pruning
    • Post-Pruning

3 of 60

A Decision Tree

A decision tree is a popular machine learning algorithm used for both classification and regression tasks. It splits the dataset into subsets based on the most significant feature at each step.

Visually represents decisions and their possible consequences, making it easy to understand and interpret.

4 of 60

Terminology: Decision Tree Structure

A binary decision tree is a decision tree in which each internal node has exactly two branches.

Starting point (does not have any incoming branches)

Terminal points (represent all the possible outcomes within the dataset)

Root Node: The top decision node (starting point for decision making).

Decision/Internal Node: Tests a feature and splits into branches.

Leaf/Terminal Node: Final node with a prediction (no further splits).

Branch/Sub-Tree: A connection between nodes, representing a possible outcome based on a feature test.

Layer: Nodes at the same depth from the root.

Depth: The maximum number of layers from root to farthest leaf.

5 of 60

Decision Tree Terminology Example:

  • Decision Tree (DT) defines a hierarchy of rules to make a prediction

Decision Tree (DT) learning is about learning such a tree from labeled data

[Figure: example tree for mammal classification. The root node tests “Body temp.”: Cold → leaf “Non-mammal”; Warm → an internal node testing “Gives birth”: Yes → leaf “Mammal”, No → leaf “Non-mammal”.]

Root and internal nodes test rules

Leaf nodes make predictions

6 of 60

Output of Decision Trees

      • Classification Trees: Output is a discrete category (e.g., "Yes/No", "Cat/Dog").

      • Regression trees: Predict a continuous numerical value (e.g., house price, temperature).

Decision Rules: Each path from root to leaf represents an IF-THEN rule.

[Figure: a small tree testing Color at the root and Shape at the next level, with “+” and “−” class labels at the leaves.]

7 of 60

Example: Decision Trees for Classification

Predict: Deciding whether to play or not to play Tennis on a Saturday.

  • Each input (Saturday) has 4 categorical features: Outlook, Temp., Humidity, Wind
  • A binary classification problem (play vs no-play)

Below Left: Training data, Below Right: A decision tree constructed using this data


Example credit: Tom Mitchell

8 of 60

9 of 60

  • Hard to guess!
  • Try to understand when John plays.

Solution: Decision Tree

10 of 60

Decision Tree:

Divide and Conquer

11 of 60

Decision Tree:

Divide and Conquer

12 of 60

We need to figure out a way to split further in order to get pure subsets.

13 of 60

14 of 60

15 of 60

Resulting Final Decision Tree

A logical predicate (formula) that tells you in which cases John will play and in which he will not.

16 of 60

Concept Learning in Decision Trees

Decision trees learn boolean functions (True/False outputs) from examples.

Representation

  • Learn True/False (Yes/No) outputs
  • Tree = OR of AND rules
  • Each Path → AND (conjunction)
  • Full tree → OR (disjunction)

Expressivity

  • Can represent any Boolean function (AND, OR, XOR)
  • Some functions → very large (exponential) trees
  • Produces human-interpretable rules

Design Principle

  • Prefer simpler trees (Occam’s Razor)
  • Fewer splits → better generalization
  • Easier to interpret & explain

Example

  • IF (Outlook = Sunny AND Humidity = High) OR (Outlook = Rain AND Wind = Strong) THEN PlayTennis = No
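The “tree = OR of AND rules” reading can be sketched as a tiny Python predicate. This is my own illustration of the slide’s rule, not code from the lecture:

```python
def play_tennis_no(outlook, humidity, wind):
    """True when the slide's rule predicts PlayTennis = No.

    Each AND clause corresponds to one root-to-leaf path;
    the OR combines the paths into the tree's overall decision.
    """
    return ((outlook == "Sunny" and humidity == "High")
            or (outlook == "Rain" and wind == "Strong"))
```

Any Boolean function can be written in this disjunction-of-conjunctions form, which is why decision trees can represent any Boolean concept.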

17 of 60

Decision Tree Induction

The process of automatically constructing a decision tree from training data

Many Algorithms:

  • ID3 (Iterative Dichotomiser 3) → We will focus only on this one
  • C4.5
  • Hunt’s Algorithm
  • CART
  • SLIQ, SPRINT

18 of 60

19 of 60

Next

Key Concepts: Entropy and Information Gain

Decision trees use entropy and information gain to decide the best feature/attribute for splitting.

20 of 60

Concept of “pure” vs “impure” node


Pure Node: All (or majority) samples belong to one class

  • No further splitting needed
  • Easy to make prediction

Impure Node: Contains mixed classes. Requires further splitting

Example “Sunny”, “Rain” → Impure nodes (need splitting)

Here, “Overcast” is a pure node

Purity Goal: Seek splits that create subsets that are either all positive (e.g., 4Y/0N) or all negative (e.g., 0Y/4N).

21 of 60

Hypothesis Space & Learning Challenges

Multiple trees can fit the same data → Many possible hypotheses

Challenge: Learn the best tree from data

  • Minimal tree search is NP-hard → use heuristics
  • Greedy algorithms are practical but suboptimal
  • Split selection is the core computational challenge

22 of 60

Remember: Occam’s Razor

The goal is to have the resulting decision tree as small as possible (Occam’s Razor)

  • The main decision in the algorithm is the selection of the next attribute to condition on.


23 of 60

Back to: How to Split at Internal Nodes?

Decision Tree Splitting: Purity & Entropy

Goal: Create maximally pure child nodes

  • Pure = Majority of samples share same class


  • For classification problems (discrete outputs), entropy measures “purity”.
    • Low entropy (e.g., 0.0) ⇒ high purity (less uniform label distribution → labels are mostly the same)
    • High entropy (e.g., 1.0) ⇒ low purity (uniform label distribution → labels are mixed)

Optimal Split: Maximizes information gain: Entropy(before) – Weighted Entropy(after)

  • Key Points: Pure nodes stop further splitting; Greedily select highest IG at each step

24 of 60

Entropy (Measure of Impurity)

Measure of uncertainty or impurity in a dataset. In the context of classification, it quantifies how mixed the class labels are.

If a dataset S contains c classes with class proportions p_1, …, p_c, the entropy of S is:

H(S) = − Σ_{i=1..c} p_i · log₂(p_i)

Interpretation:

  • Entropy = 0 → Pure (all samples in one class).
  • Entropy = 1 → Maximum impurity for two classes (equal class distribution).

Example:

  • 5 "Yes", 5 "No" → Entropy = 1.
  • 8 "Yes", 2 "No" → Entropy ≈ 0.72.
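The examples above can be checked with a minimal entropy function (a sketch of my own, not lecture code; it takes the per-class sample counts of a node):

```python
from math import log2

def entropy(counts):
    """Entropy (in bits) of a node, given its per-class sample counts."""
    total = sum(counts)
    # Skip empty classes: the p*log2(p) term tends to 0 as p -> 0.
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

# entropy([5, 5])  -> 1.0   (maximum impurity for two classes)
# entropy([8, 2])  -> ~0.72
# entropy([10, 0]) -> 0.0   (pure node)
```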

25 of 60

Example: Impurity/Entropy

26 of 60

27 of 60

ENTROPY

28 of 60

Entropy Computing Example:

For the full tennis dataset with 9 “Yes” and 5 “No” examples:

Entropy(S) = −(9/14)·log₂(9/14) − (5/14)·log₂(5/14) ≈ 0.940

29 of 60

Information Gain

Measures the reduction in entropy after splitting a dataset based on a feature.

Objective: Choose the feature with the highest information gain for splitting.

At each node of the decision tree, we choose the attribute that maximizes information gain as the splitting criterion. This helps us decide how to partition the dataset into smaller, more homogeneous subsets based on the values of the chosen attribute.
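The selection rule above can be sketched directly in Python (my own illustration, not lecture code): compute the parent entropy, subtract the weighted child entropies, and the feature with the highest gain wins.

```python
from math import log2

def entropy(counts):
    """Entropy (in bits) of a node, given its per-class sample counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, children_counts):
    """IG = H(parent) - weighted average of the child-node entropies."""
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * entropy(child) for child in children_counts)
    return entropy(parent_counts) - weighted

# Splitting [9+, 5-] on Wind into Weak [6+, 2-] and Strong [3+, 3-]:
# information_gain([9, 5], [[6, 2], [3, 3]]) -> ~0.048
```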

30 of 60

From Entropy to Information Gain Formula

Gain(S, A) = H(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · H(S_v)

where S_v is the subset of examples in S for which attribute A takes value v.

31 of 60

32 of 60

Information Gain with Formula:

[Figure: two candidate splits compared side by side. The left split has a low IG (in fact zero IG); the right split has a higher IG.]

33 of 60

Remember the goal:

Important: Find the attribute that minimizes entropy (equivalently, maximizes information gain) when splitting the dataset. This reduces uncertainty and makes the resulting subsets more homogeneous in terms of the target variable or class labels, which leads to better classification performance in decision tree algorithms.

34 of 60

Consider the Training Examples

Example: Entropy and Information Gain

Day | Outlook  | Temp | Humidity | Wind   | Tennis?
D1  | Sunny    | Hot  | High     | Weak   | No
D2  | Sunny    | Hot  | High     | Strong | No
D3  | Overcast | Hot  | High     | Weak   | Yes
D4  | Rain     | Mild | High     | Weak   | Yes
D5  | Rain     | Cool | Normal   | Weak   | Yes
D6  | Rain     | Cool | Normal   | Strong | No
D7  | Overcast | Cool | Normal   | Strong | Yes
D8  | Sunny    | Mild | High     | Weak   | No
D9  | Sunny    | Cool | Normal   | Weak   | Yes
D10 | Rain     | Mild | Normal   | Weak   | Yes
D11 | Sunny    | Mild | Normal   | Strong | Yes
D12 | Overcast | Mild | High     | Strong | Yes
D13 | Overcast | Hot  | Normal   | Weak   | Yes
D14 | Rain     | Mild | High     | Strong | No

35 of 60

Determine the Root Attribute:

Let’s use IG based criterion to construct a DT for the Tennis example

  • At root node, let’s compute IG of each of the 4 features
  • Consider feature “wind”. Root contains all examples S = [9+,5-]

S_weak = [6+, 2−] ⇒ H(S_weak) = 0.811

S_strong = [3+, 3−] ⇒ H(S_strong) = 1

IG(S, Wind) = H(S) − (8/14)·H(S_weak) − (6/14)·H(S_strong)
            = 0.940 − (8/14)(0.811) − (6/14)(1.000) ≈ 0.048

[Figure: the Wind split. Root: 9+, 5− (E = 0.940); Weak branch: 6+, 2− (E = 0.811); Strong branch: 3+, 3− (E = 1.000).]

36 of 60

Determine the Root Attribute:

Let’s use IG based criterion to construct a DT for the Tennis example

  • At root node, let’s compute IG of each of the 4 features
  • Consider feature “wind”. Root contains all examples S = [9+,5-]

S_weak = [6+, 2−] ⇒ H(S_weak) = 0.811

S_strong = [3+, 3−] ⇒ H(S_strong) = 1

IG(S, Wind) = 0.940 − (8/14)(0.811) − (6/14)(1.000) ≈ 0.048

  • Likewise, at root: IG(S, outlook) = 0.246, IG(S, humidity) = 0.151, IG(S, temp) = 0.029
  • Thus we choose the “outlook” feature to be tested at the root node
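The root-node gains can be verified with a short script of my own (not lecture code), using the 14 training examples from the table on slide 34:

```python
from collections import Counter, defaultdict
from math import log2

# (Outlook, Temp, Humidity, Wind, Tennis?) for days D1..D14
data = [
    ("Sunny", "Hot", "High", "Weak", "No"),      ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),  ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),   ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),  ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"), ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"), ("Rain", "Mild", "High", "Strong", "No"),
]
features = ["Outlook", "Temp", "Humidity", "Wind"]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, i):
    """Information gain of splitting the rows on feature index i."""
    parent = [r[-1] for r in rows]
    groups = defaultdict(list)
    for r in rows:
        groups[r[i]].append(r[-1])
    weighted = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(parent) - weighted

gains = {f: info_gain(data, i) for i, f in enumerate(features)}
# Outlook has the largest gain (~0.246), so it becomes the root test;
# Humidity ~0.151, Wind ~0.048, Temp ~0.029, matching the slide.
```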

37 of 60

Now

  • Now how to grow the DT, i.e., what to do at the next level? Which feature to test next?
  • Rule: Iterate - for each child node, select the feature with the highest IG


38 of 60

Sorting the Training Examples and Growing the Tree

S_sunny = {D1, D2, D8, D9, D11}

Gain(S_sunny, Humidity) = 0.970

Gain(S_sunny, Temp) = 0.570

Gain(S_sunny, Wind) = 0.019

[Figure: partial tree. Root “Outlook” over all examples {D1,…,D14} (9+, 5−). Sunny → {D1,D2,D8,D9,D11} (2+, 3−), still “?”; Overcast → {D3,D7,D12,D13} (4+, 0−), leaf “Yes”; Rain → {D4,D5,D6,D10,D14} (3+, 2−), still “?”.]

Organize training data at each decision node based on the attribute that best separates the examples into distinct classes.

39 of 60

Growing the tree

  • So, proceeding as before, for the level-2 left node, we can verify that
    • IG(S_sunny, temp) = 0.570, IG(S_sunny, humidity) = 0.970, IG(S_sunny, wind) = 0.019
  • Thus humidity is chosen as the feature to be tested at the level-2 left node
    • No need to expand the middle node (already “pure”: all “Yes” training examples)
  • Can also verify that wind has the largest IG for the right node.
  • Note: If a feature has already been tested along a path earlier, we don’t consider it again


40 of 60

Final Decision Tree

[Figure: the final tree. Outlook: Sunny → Humidity (High → No, Normal → Yes); Overcast → Yes; Rain → Wind (Strong → No, Weak → Yes).]

41 of 60

One more example: Training a Decision Tree

[Figure: a 2-D dataset split vertically, with entropy and information gain computed for the candidate split.]

42 of 60

One more example: Training a Decision Tree

[Figure: further horizontal and vertical splits produce the final decision tree.]

43 of 60

What if features are continuous?

Internal nodes can test the value of a feature against a threshold.
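A sketch of my own (not from the lecture) of how such a threshold can be chosen: try the midpoints between consecutive sorted feature values and keep the one with the highest information gain.

```python
from math import log2

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum(p * log2(p) for p in
                (labels.count(c) / n for c in set(labels)))

def best_threshold(values, labels):
    """Return (threshold, gain) for the best split of form `value <= t`."""
    pairs = sorted(zip(values, labels))
    xs = [v for v, _ in pairs]
    ys = [l for _, l in pairs]
    parent = entropy(ys)
    best = (None, -1.0)
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue  # no threshold can fall between equal values
        t = (xs[i] + xs[i - 1]) / 2
        left, right = ys[:i], ys[i:]
        w = (len(left) * entropy(left) + len(right) * entropy(right)) / len(ys)
        gain = parent - w
        if gain > best[1]:
            best = (t, gain)
    return best
```

For example, `best_threshold([1, 2, 3, 10, 11, 12], ["No"]*3 + ["Yes"]*3)` finds the midpoint 6.5 that separates the classes perfectly.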


44 of 60

45 of 60

[Figure: a 2-D dataset (Attribute 1 vs. Attribute 2) of + and − regions containing a single odd − example inside a + region. A very deep tree is required to fit just that one odd training example.]

Overfitting

46 of 60

When to stop Splitting Further?

[Figure: the same +/− dataset as on the previous slide; fitting the single odd training example requires a very deep tree.]

47 of 60

  • Consider adding a noisy training example (it should be +):

  • What effect does this have on the earlier tree?

Day | Outlook | Temp | Humidity | Wind   | Tennis?
D15 | Sunny   | Hot  | Normal   | Strong | No


Overfitting in Decision Tree

48 of 60

[Figure: the earlier tree, where the Sunny → Humidity = Normal leaf must now split again on Wind to fit D15. The tree has grown to encode noise or other coincidental regularities.]

Overfitting in Decision Tree - Example

49 of 60

When to stop growing the tree?

  • Stop expanding a node further (i.e., make it a leaf node) when
    • It consists of all training examples having the same label (the node becomes “pure”)
    • We run out of features to test along the path to that node
    • The DT starts to overfit (can be checked by monitoring the validation set accuracy)


50 of 60

When to stop growing the tree?

  • Important: No need to obsess too much about purity (this helps prevent the tree from growing too much!)
    • It is okay to have a leaf node that is not fully pure, e.g., one holding 3 red and 5 green training examples
    • Predictions at Impure Leaf Nodes: For test inputs that reach an impure leaf, we can predict the probability of belonging to each class (in the above example, p(red) = 3/8, p(green) = 5/8), or simply predict the majority label, which in this case would be green.
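The impure-leaf prediction just described fits in a few lines of Python (my own illustration, not lecture code):

```python
from collections import Counter

def leaf_prediction(labels):
    """At an impure leaf: per-class probabilities plus the majority label."""
    counts = Counter(labels)
    n = len(labels)
    probs = {cls: k / n for cls, k in counts.items()}
    majority = counts.most_common(1)[0][0]
    return probs, majority

# A leaf holding 3 "red" and 5 "green" training examples yields
# probs {"red": 0.375, "green": 0.625} and majority label "green".
```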

51 of 60

Avoiding Overfitting in DTs

  • Desired: a DT that is not too big in size, yet fits the training data reasonably

  • Mainly two approaches to prune a complex DT
    • Prune while building the tree (stopping early)
    • Prune after building the tree (post-pruning)


52 of 60

53 of 60

Avoiding Overfitting in DTs

  • Criteria for judging which nodes could potentially be pruned
    • Use a validation set (separate from the training set)
      • Prune each possible node that doesn’t hurt the accuracy on the validation set
      • Greedily remove the node that improves the validation accuracy the most
      • Stop when the validation set accuracy starts worsening


54 of 60


Two basic approaches

  1. Prepruning: Stop growing the tree at some point during construction when it is determined that there is not enough data to make reliable choices.

Advantages:

  • Prevents overfitting early on.
  • Faster computation as the tree is smaller.

Disadvantages:

  • May miss important patterns if stopped too early.
  • Decision to stop may be subjective.


55 of 60


Two basic approaches

  2. Postpruning: Grow the full tree (allow it to grow until it best fits the training data, i.e., allow overfitting), and then remove nodes that do not have sufficient supporting evidence or do not significantly contribute to the overall accuracy of the tree. (More popular.)

Advantages:

  • Captures all possible patterns.
  • Reduces overfitting by removing unnecessary nodes.

Disadvantages:

  • More computationally expensive initially.
  • Risk of overfitting if not properly validated.


56 of 60

Summary: Determine when to stop splitting

Stopping Criteria:

The recursive splitting process continues until a stopping criterion is met. Common stopping criteria include:

  • Reaching a maximum depth (preventing it from becoming overly complex).
  • Having a minimum number of samples per leaf (Splitting stops if a leaf node would contain fewer samples than a specified threshold).
  • Encountering pure nodes (nodes containing only one class).


57 of 60

58 of 60

Scikit Learn Decision Tree
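The lecture’s scikit-learn demo is not reproduced in these notes; a minimal sketch of my own, assuming scikit-learn is installed and using the bundled iris toy dataset:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load a toy dataset and hold out a test set.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# With default parameters the tree grows until leaves are pure
# (or until features run out), exactly the greedy procedure above.
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)

print("test accuracy:", clf.score(X_test, y_test))
```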

59 of 60

Scikit Learn: Example with Custom Parameters:
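Again as a hedged sketch of my own (the slide’s actual code is not in these notes): the custom parameters map directly onto the concepts covered above, with `criterion="entropy"` for information gain, `max_depth` and `min_samples_leaf` as pre-pruning stopping criteria, and `ccp_alpha` for cost-complexity post-pruning.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier(
    criterion="entropy",     # split by information gain instead of Gini
    max_depth=3,             # pre-pruning: stop at a maximum depth
    min_samples_leaf=5,      # pre-pruning: at least 5 samples per leaf
    ccp_alpha=0.01,          # post-pruning: prune low-benefit branches
    random_state=0,
)
clf.fit(X_train, y_train)

print("depth:", clf.get_depth(), "leaves:", clf.get_n_leaves())
print("test accuracy:", clf.score(X_test, y_test))
```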

60 of 60

The End