1 of 60

Decision Tree

Guest Lecture by Aviva Prins

CMSC 320 - Introduction to Data Science

2 of 60

Before We Start, Remember:

No free lunch theorem

Much of ML is concerned with devising different models and different algorithms to fit them. However, there’s no single best model that works optimally for all kinds of problems.


Topics to cover today:

  • What is a Decision Tree?
  • Common Terminology
  • Concept of How to Split a Node
  • Entropy
  • Information Gain
  • Overfitting Problem
    • Pre-Pruning
    • Post-Pruning

3 of 60

A Decision Tree

A decision tree is a popular machine learning algorithm used for both classification and regression tasks. It splits the dataset into subsets based on the most significant feature at each step.

Visually represents decisions and their possible consequences, making it easy to understand and interpret.

4 of 60

Terminology: Decision Tree Structure

A binary decision tree is a decision tree in which each internal node has exactly two branches.

Starting point (does not have any incoming branches)

Terminal points (represent all the possible outcomes within the dataset)

Root Node: The top decision node (starting point for decision making).

Decision/Internal Node: Tests a feature and splits into branches.

Leaf/Terminal Node: Final node with a prediction (no further splits).

Branch/Sub-Tree: A connection between nodes, representing a possible outcome based on a feature test.

Layer: Nodes at the same depth from the root.

Depth: The maximum number of layers from root to farthest leaf.

5 of 60

Decision Tree Terminology Example:

  • Decision Tree (DT) defines a hierarchy of rules to make a prediction

Decision Tree (DT) learning is about learning such a tree from labeled data

[Figure: example tree for mammal classification. The root node tests “Body temp.”: Cold → leaf “Non-mammal”; Warm → an internal node testing “Gives birth”: Yes → leaf “Mammal”, No → leaf “Non-mammal”.]

Root and internal nodes test rules

Leaf nodes make predictions

6 of 60

Output of Decision Trees

      • Classification Trees: Output is a discrete category (e.g., "Yes/No", "Cat/Dog").

      • Regression trees: Predict a continuous numerical value (e.g., house price, temperature).

Decision Rules: Each path from root to leaf represents an IF-THEN rule.

[Figure: a small tree testing Color at the root and Shape at the next level, with “+” and “−” class labels at the leaves.]

7 of 60

Example: Decision Trees for Classification

Predict: Deciding whether to play or not to play Tennis on a Saturday.

  • Each input (Saturday) has 4 categorical features: Outlook, Temp., Humidity, Wind
  • A binary classification problem (play vs no-play)

Below Left: Training data, Below Right: A decision tree constructed using this data


Example credit: Tom Mitchell

8 of 60

9 of 60

  • Hard to guess!
  • Try to understand when John plays.

Solution: Decision Tree

10 of 60

Decision Tree:

Divide and Conquer

11 of 60

Decision Tree:

Divide and Conquer

12 of 60

We need to figure out a way to split further in order to get pure subsets.

13 of 60

14 of 60

15 of 60

Resulting Final Decision Tree

A logical predicate (formula) that tells you in which cases John will play and in which he will not.

16 of 60

Concept Learning in Decision Trees

Decision trees learn boolean functions (True/False outputs) from examples.

Representation

  • Learn True/False (Yes/No) outputs
  • Tree = OR of AND rules
  • Each Path → AND (conjunction)
  • Full tree → OR (disjunction)

Expressivity

  • Can represent any Boolean function (AND, OR, XOR)
  • Some functions → very large (exponential) trees
  • Produces human-interpretable rules

Design Principle

  • Prefer simpler trees (Occam’s Razor)
  • Fewer splits → better generalization
  • Easier to interpret & explain

Example

  • IF (Outlook = Sunny AND Humidity = High) OR (Outlook = Rain AND Wind = Strong) THEN PlayTennis = No
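The “tree = OR of AND rules” reading can be sketched as a tiny Python predicate. This is my own illustration of the slide’s rule, not code from the lecture:

```python
def play_tennis_no(outlook, humidity, wind):
    """True when the slide's rule predicts PlayTennis = No.

    Each AND clause corresponds to one root-to-leaf path;
    the OR combines the paths into the tree's overall decision.
    """
    return ((outlook == "Sunny" and humidity == "High")
            or (outlook == "Rain" and wind == "Strong"))
```

Any Boolean function can be written in this disjunction-of-conjunctions form, which is why decision trees can represent any Boolean concept.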

17 of 60

Decision Tree Induction

The process of automatically constructing a decision tree from training data

Many Algorithms:

  • ID3 (Iterative Dichotomiser 3) → We will focus only on this one
  • C4.5
  • Hunt’s Algorithm
  • CART
  • SLIQ, SPRINT

18 of 60

19 of 60

Next

Key Concepts: Entropy and Information Gain

Decision trees use entropy and information gain to decide the best feature/attribute for splitting.

20 of 60

Concept of “pure” vs “impure” node


Pure Node: All (or majority) samples belong to one class

  • No further splitting needed
  • Easy to make prediction

Impure Node: Contains mixed classes. Requires further splitting

Example “Sunny”, “Rain” → Impure nodes (need splitting)

Here, “Overcast” is a pure node

Purity Goal: Seek splits that create subsets that are either all positive (e.g., 4Y/0N) or all negative (e.g., 0Y/4N).

21 of 60

Hypothesis Space & Learning Challenges

Multiple trees can fit the same data → Many possible hypotheses

Challenge: Learn the best tree from data

  • Minimal tree search is NP-hard → use heuristics
  • Greedy algorithms are practical but suboptimal
  • Split selection is the core computational challenge

22 of 60

Remember: Occam’s Razor

The goal is to have the resulting decision tree as small as possible (Occam’s Razor)

  • The main decision in the algorithm is the selection of the next attribute to condition on.


23 of 60

Back to: How to Split at Internal Nodes?

Decision Tree Splitting: Purity & Entropy

Goal: Create maximally pure child nodes

  • Pure = Majority of samples share same class


  • For classification problems (discrete outputs), entropy measures “purity”.
    • Low entropy (e.g., 0.0) ⇒ high purity (less uniform label distribution → labels are mostly the same)
    • High entropy (e.g., 1.0) ⇒ low purity (uniform label distribution → labels are mixed)

Optimal Split: Maximizes information gain: Entropy(before) – Weighted Entropy(after)

  • Key Points: Pure nodes stop further splitting; Greedily select highest IG at each step

24 of 60

Entropy (Measure of Impurity)

Measure of uncertainty or impurity in a dataset. In the context of classification, it quantifies how mixed the class labels are.

If a dataset S contains c classes with class proportions p_1, …, p_c, the entropy of S is:

H(S) = − Σ_{i=1..c} p_i · log₂(p_i)

Interpretation:

  • Entropy = 0 → Pure (all samples in one class).
  • Entropy = 1 → Maximum impurity for two classes (equal class distribution).

Example:

  • 5 "Yes", 5 "No" → Entropy = 1.
  • 8 "Yes", 2 "No" → Entropy ≈ 0.72.
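The examples above can be checked with a minimal entropy function (a sketch of my own, not lecture code; it takes the per-class sample counts of a node):

```python
from math import log2

def entropy(counts):
    """Entropy (in bits) of a node, given its per-class sample counts."""
    total = sum(counts)
    # Skip empty classes: the p*log2(p) term tends to 0 as p -> 0.
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

# entropy([5, 5])  -> 1.0   (maximum impurity for two classes)
# entropy([8, 2])  -> ~0.72
# entropy([10, 0]) -> 0.0   (pure node)
```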

25 of 60

Example: Impurity/Entropy

26 of 60

27 of 60

ENTROPY

28 of 60

Entropy Computing Example:

For the full tennis dataset with 9 “Yes” and 5 “No” examples:

Entropy(S) = −(9/14)·log₂(9/14) − (5/14)·log₂(5/14) ≈ 0.940

29 of 60

Information Gain

Measures the reduction in entropy after splitting a dataset based on a feature.

Objective: Choose the feature with the highest information gain for splitting.

At each node of the decision tree, we choose the attribute that maximizes information gain as the splitting criterion. This helps us decide how to partition the dataset into smaller, more homogeneous subsets based on the values of the chosen attribute.
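The selection rule above can be sketched directly in Python (my own illustration, not lecture code): compute the parent entropy, subtract the weighted child entropies, and the feature with the highest gain wins.

```python
from math import log2

def entropy(counts):
    """Entropy (in bits) of a node, given its per-class sample counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, children_counts):
    """IG = H(parent) - weighted average of the child-node entropies."""
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * entropy(child) for child in children_counts)
    return entropy(parent_counts) - weighted

# Splitting [9+, 5-] on Wind into Weak [6+, 2-] and Strong [3+, 3-]:
# information_gain([9, 5], [[6, 2], [3, 3]]) -> ~0.048
```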

30 of 60

From Entropy to Information Gain Formula

Gain(S, A) = H(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · H(S_v)

where S_v is the subset of examples in S for which attribute A takes value v.

31 of 60

32 of 60

Information Gain with Formula:

[Figure: two candidate splits compared side by side. The left split has a low IG (in fact zero IG); the right split has a higher IG.]

33 of 60

Remember the goal:

Important: Find the attribute that minimizes entropy (equivalently, maximizes information gain) when splitting the dataset. This reduces uncertainty and makes the resulting subsets more homogeneous in terms of the target variable or class labels, which leads to better classification performance in decision tree algorithms.

34 of 60

Consider the Training Examples

Example: Entropy and Information Gain

Day | Outlook  | Temp | Humidity | Wind   | Tennis?
D1  | Sunny    | Hot  | High     | Weak   | No
D2  | Sunny    | Hot  | High     | Strong | No
D3  | Overcast | Hot  | High     | Weak   | Yes
D4  | Rain     | Mild | High     | Weak   | Yes
D5  | Rain     | Cool | Normal   | Weak   | Yes
D6  | Rain     | Cool | Normal   | Strong | No
D7  | Overcast | Cool | Normal   | Strong | Yes
D8  | Sunny    | Mild | High     | Weak   | No
D9  | Sunny    | Cool | Normal   | Weak   | Yes
D10 | Rain     | Mild | Normal   | Weak   | Yes
D11 | Sunny    | Mild | Normal   | Strong | Yes
D12 | Overcast | Mild | High     | Strong | Yes
D13 | Overcast | Hot  | Normal   | Weak   | Yes
D14 | Rain     | Mild | High     | Strong | No

35 of 60

Determine the Root Attribute:

Let’s use IG based criterion to construct a DT for the Tennis example

  • At root node, let’s compute IG of each of the 4 features
  • Consider feature “wind”. Root contains all examples S = [9+,5-]

S_weak = [6+, 2−] ⇒ H(S_weak) = 0.811

S_strong = [3+, 3−] ⇒ H(S_strong) = 1

IG(S, Wind) = H(S) − (8/14)·H(S_weak) − (6/14)·H(S_strong)
            = 0.940 − (8/14)(0.811) − (6/14)(1.000) ≈ 0.048

[Figure: the Wind split. Root: 9+, 5− (E = 0.940); Weak branch: 6+, 2− (E = 0.811); Strong branch: 3+, 3− (E = 1.000).]

36 of 60

Determine the Root Attribute:

Let’s use IG based criterion to construct a DT for the Tennis example

  • At root node, let’s compute IG of each of the 4 features
  • Consider feature “wind”. Root contains all examples S = [9+,5-]

S_weak = [6+, 2−] ⇒ H(S_weak) = 0.811

S_strong = [3+, 3−] ⇒ H(S_strong) = 1

IG(S, Wind) = 0.940 − (8/14)(0.811) − (6/14)(1.000) ≈ 0.048

  • Likewise, at root: IG(S, outlook) = 0.246, IG(S, humidity) = 0.151, IG(S, temp) = 0.029
  • Thus we choose the “outlook” feature to be tested at the root node
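The root-node gains can be verified with a short script of my own (not lecture code), using the 14 training examples from the table on slide 34:

```python
from collections import Counter, defaultdict
from math import log2

# (Outlook, Temp, Humidity, Wind, Tennis?) for days D1..D14
data = [
    ("Sunny", "Hot", "High", "Weak", "No"),      ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),  ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),   ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),  ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"), ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"), ("Rain", "Mild", "High", "Strong", "No"),
]
features = ["Outlook", "Temp", "Humidity", "Wind"]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, i):
    """Information gain of splitting the rows on feature index i."""
    parent = [r[-1] for r in rows]
    groups = defaultdict(list)
    for r in rows:
        groups[r[i]].append(r[-1])
    weighted = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(parent) - weighted

gains = {f: info_gain(data, i) for i, f in enumerate(features)}
# Outlook has the largest gain (~0.246), so it becomes the root test;
# Humidity ~0.151, Wind ~0.048, Temp ~0.029, matching the slide.
```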

37 of 60

Now

  • Now how to grow the DT, i.e., what to do at the next level? Which feature to test next?
  • Rule: Iterate - for each child node, select the feature with the highest IG


38 of 60

Sorting the Training Examples and Growing the Tree

S_sunny = {D1, D2, D8, D9, D11}

Gain(S_sunny, Humidity) = 0.970

Gain(S_sunny, Temp) = 0.570

Gain(S_sunny, Wind) = 0.019

[Figure: partial tree. Root “Outlook” over all examples {D1,…,D14} (9+, 5−). Sunny → {D1,D2,D8,D9,D11} (2+, 3−), still “?”; Overcast → {D3,D7,D12,D13} (4+, 0−), leaf “Yes”; Rain → {D4,D5,D6,D10,D14} (3+, 2−), still “?”.]

Organize training data at each decision node based on the attribute that best separates the examples into distinct classes.

39 of 60

Growing the tree

  • So, proceeding as before, for the level-2 left node, we can verify that
    • IG(S_sunny, temp) = 0.570, IG(S_sunny, humidity) = 0.970, IG(S_sunny, wind) = 0.019
  • Thus humidity is chosen as the feature to be tested at the level-2 left node
    • No need to expand the middle node (already “pure”: all “Yes” training examples)
  • Can also verify that wind has the largest IG for the right node.
  • Note: If a feature has already been tested along a path earlier, we don’t consider it again


40 of 60

Final Decision Tree

[Figure: the final tree. Outlook: Sunny → Humidity (High → No, Normal → Yes); Overcast → Yes; Rain → Wind (Strong → No, Weak → Yes).]

41 of 60

One more example: Training a Decision Tree

[Figure: a 2-D dataset split vertically, with entropy and information gain computed for the candidate split.]

42 of 60

One more example: Training a Decision Tree

[Figure: further horizontal and vertical splits produce the final decision tree.]

43 of 60

What if features are continuous?

Internal nodes can test the value of a feature against a threshold.
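A sketch of my own (not from the lecture) of how such a threshold can be chosen: try the midpoints between consecutive sorted feature values and keep the one with the highest information gain.

```python
from math import log2

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum(p * log2(p) for p in
                (labels.count(c) / n for c in set(labels)))

def best_threshold(values, labels):
    """Return (threshold, gain) for the best split of form `value <= t`."""
    pairs = sorted(zip(values, labels))
    xs = [v for v, _ in pairs]
    ys = [l for _, l in pairs]
    parent = entropy(ys)
    best = (None, -1.0)
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue  # no threshold can fall between equal values
        t = (xs[i] + xs[i - 1]) / 2
        left, right = ys[:i], ys[i:]
        w = (len(left) * entropy(left) + len(right) * entropy(right)) / len(ys)
        gain = parent - w
        if gain > best[1]:
            best = (t, gain)
    return best
```

For example, `best_threshold([1, 2, 3, 10, 11, 12], ["No"]*3 + ["Yes"]*3)` finds the midpoint 6.5 that separates the classes perfectly.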


44 of 60

45 of 60

[Figure: a 2-D dataset (Attribute 1 vs. Attribute 2) of + and − regions containing a single odd − example inside a + region. A very deep tree is required to fit just that one odd training example.]

Overfitting

46 of 60

When to stop Splitting Further?

[Figure: the same +/− dataset as on the previous slide; fitting the single odd training example requires a very deep tree.]

47 of 60

  • Consider adding a noisy training example (it should be +):

  • What effect does this have on the earlier tree?

Day | Outlook | Temp | Humidity | Wind   | Tennis?
D15 | Sunny   | Hot  | Normal   | Strong | No


Overfitting in Decision Tree

48 of 60

[Figure: the earlier tree, where the Sunny → Humidity = Normal leaf must now split again on Wind to fit D15. The tree has grown to encode noise or other coincidental regularities.]

Overfitting in Decision Tree - Example

49 of 60

When to stop growing the tree?

  • Stop expanding a node further (i.e., make it a leaf node) when
    • It consists of all training examples having the same label (the node becomes “pure”)
    • We run out of features to test along the path to that node
    • The DT starts to overfit (can be checked by monitoring the validation set accuracy)


50 of 60

When to stop growing the tree?

  • Important: No need to obsess too much about purity (this helps prevent the tree from growing too much!)
    • It is okay to have a leaf node that is not fully pure, e.g., one holding 3 red and 5 green training examples
    • Predictions at Impure Leaf Nodes: For test inputs that reach an impure leaf, we can predict the probability of belonging to each class (in the above example, p(red) = 3/8, p(green) = 5/8), or simply predict the majority label, which in this case would be green.
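The impure-leaf prediction just described fits in a few lines of Python (my own illustration, not lecture code):

```python
from collections import Counter

def leaf_prediction(labels):
    """At an impure leaf: per-class probabilities plus the majority label."""
    counts = Counter(labels)
    n = len(labels)
    probs = {cls: k / n for cls, k in counts.items()}
    majority = counts.most_common(1)[0][0]
    return probs, majority

# A leaf holding 3 "red" and 5 "green" training examples yields
# probs {"red": 0.375, "green": 0.625} and majority label "green".
```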

51 of 60

Avoiding Overfitting in DTs

  • Desired: a DT that is not too big in size, yet fits the training data reasonably

  • Mainly two approaches to prune a complex DT
    • Prune while building the tree (stopping early)
    • Prune after building the tree (post-pruning)


52 of 60

53 of 60

Avoiding Overfitting in DTs

  • Criteria for judging which nodes could potentially be pruned
    • Use a validation set (separate from the training set)
      • Prune each possible node that doesn’t hurt the accuracy on the validation set
      • Greedily remove the node that improves the validation accuracy the most
      • Stop when the validation set accuracy starts worsening


54 of 60


Two basic approaches

  1. Prepruning: Stop growing the tree at some point during construction when it is determined that there is not enough data to make reliable choices.

Advantages:

  • Prevents overfitting early on.
  • Faster computation as the tree is smaller.

Disadvantages:

  • May miss important patterns if stopped too early.
  • Decision to stop may be subjective.


55 of 60


Two basic approaches

  2. Postpruning: Grow the full tree (allow it to grow until it best fits the training data, i.e., allow overfitting), and then remove nodes that do not have sufficient supporting evidence or do not significantly contribute to the overall accuracy of the tree. (More popular.)

Advantages:

  • Captures all possible patterns.
  • Reduces overfitting by removing unnecessary nodes.

Disadvantages:

  • More computationally expensive initially.
  • Risk of overfitting if not properly validated.


56 of 60

Summary: Determine when to stop splitting

Stopping Criteria:

The recursive splitting process continues until a stopping criterion is met. Common stopping criteria include:

  • Reaching a maximum depth (preventing it from becoming overly complex).
  • Having a minimum number of samples per leaf (Splitting stops if a leaf node would contain fewer samples than a specified threshold).
  • Encountering pure nodes (nodes containing only one class).


57 of 60

58 of 60

Scikit Learn Decision Tree
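The lecture’s scikit-learn demo is not reproduced in these notes; a minimal sketch of my own, assuming scikit-learn is installed and using the bundled iris toy dataset:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load a toy dataset and hold out a test set.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# With default parameters the tree grows until leaves are pure
# (or until features run out), exactly the greedy procedure above.
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)

print("test accuracy:", clf.score(X_test, y_test))
```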

59 of 60

Scikit Learn: Example with Custom Parameters:
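Again as a hedged sketch of my own (the slide’s actual code is not in these notes): the custom parameters map directly onto the concepts covered above, with `criterion="entropy"` for information gain, `max_depth` and `min_samples_leaf` as pre-pruning stopping criteria, and `ccp_alpha` for cost-complexity post-pruning.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier(
    criterion="entropy",     # split by information gain instead of Gini
    max_depth=3,             # pre-pruning: stop at a maximum depth
    min_samples_leaf=5,      # pre-pruning: at least 5 samples per leaf
    ccp_alpha=0.01,          # post-pruning: prune low-benefit branches
    random_state=0,
)
clf.fit(X_train, y_train)

print("depth:", clf.get_depth(), "leaves:", clf.get_n_leaves())
print("test accuracy:", clf.score(X_test, y_test))
```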

60 of 60

The End