Decision Tree
Guest Lecture by Aviva Prins
CMSC 320 - Introduction to Data Science
Before We Start, Remember:
No free lunch theorem
Much of ML is concerned with devising different models and different algorithms to fit them. However, there is no single best model that works optimally for all kinds of problems.
Topics to cover today:
A Decision Tree
A decision tree is a popular machine learning algorithm used for both classification and regression tasks. It splits the dataset into subsets based on the most significant feature at each step.
Visually represents decisions and their possible consequences, making it easy to understand and interpret.
Terminology: Decision Tree Structure
A binary decision tree is a decision tree in which each node has at most two branches.
Root Node: The top decision node (starting point for decision making; has no incoming branches).
Decision/Internal Node: Tests a feature and splits into branches.
Leaf/Terminal Node: Final node with a prediction (no further splits); the leaves represent all the possible outcomes within the dataset.
Branch/Sub-Tree: A connection between nodes, representing a possible outcome based on a feature test.
Layer: Nodes at the same depth from the root.
Depth: The maximum number of layers from root to farthest leaf.
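The structural terms above can be sketched as a small data structure (a hypothetical illustration, not from the lecture): internal nodes test a feature, leaves hold a prediction, and depth counts layers from the root.

```python
class Node:
    def __init__(self, feature=None, branches=None, prediction=None):
        self.feature = feature          # feature tested at an internal node
        self.branches = branches or {}  # feature value -> child Node (a branch)
        self.prediction = prediction    # class label stored at a leaf node

    def is_leaf(self):
        return not self.branches


def predict(node, example):
    """Follow branches from the root until a leaf is reached."""
    while not node.is_leaf():
        node = node.branches[example[node.feature]]
    return node.prediction


def depth(node):
    """Maximum number of layers from this node down to the farthest leaf."""
    if node.is_leaf():
        return 0
    return 1 + max(depth(child) for child in node.branches.values())


# Tiny illustrative tree: the root tests "Outlook"
tree = Node("Outlook", {
    "Overcast": Node(prediction="Yes"),  # leaf node
    "Sunny": Node("Humidity", {          # internal node
        "High": Node(prediction="No"),
        "Normal": Node(prediction="Yes"),
    }),
})
```

For example, `predict(tree, {"Outlook": "Sunny", "Humidity": "High"})` walks the Sunny branch, tests Humidity, and lands at the "No" leaf; `depth(tree)` is 2.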
Decision Trees Terminology Example:
Decision Tree (DT) learning is about learning such a tree from labeled data
[Figure: example decision tree for classifying mammals. Root node: Body temp. Cold → Non-mammal (leaf). Warm → internal node Gives birth: Yes → Mammal (leaf); No → Non-mammal (leaf).]
Root and internal nodes test rules; leaf nodes make predictions.
Output of Decision Trees
Decision Rules: Each path from root to leaf represents an IF-THEN rule (e.g., IF Body temp = Warm AND Gives birth = Yes THEN Mammal).
[Figure: a decision tree that splits on Color and then Shape to separate + and − examples.]
Example: Decision Trees for Classification
Predict: Deciding whether to play or not to play Tennis on a Saturday.
Below Left: Training data, Below Right: A decision tree constructed using this data
Example credit: Tom Mitchell
Solution: Decision Tree
Decision Tree:
Divide and Conquer
We need to figure out a way to split further in order to get pure subsets.
Resulting Final Decision Tree
A logical predicate (formula) that tells you in which cases John will play and in which he will not.
Concept Learning in Decision Trees
Decision trees learn boolean functions (True/False outputs) from examples.
Representation
Expressivity
Design Principle
Example
Decision Tree Induction
The process of automatically constructing a decision tree from training data
Many algorithms exist, e.g., ID3, C4.5, and CART.
Key Concepts: Entropy and Information Gain
Decision trees use entropy and information gain to decide the best feature/attribute for splitting.
Concept of “pure” vs “impure” node
Pure Node: All (or majority) samples belong to one class
Impure Node: Contains mixed classes. Requires further splitting
Example: the “Sunny” and “Rain” branches are impure nodes (need splitting), while “Overcast” is a pure node.
Purity Goal: Seek splits that create subsets that are either all positive (e.g., 4Y/0N) or all negative (e.g., 0Y/4N).
Hypothesis Space & Learning Challenges
Multiple trees can fit the same data → Many possible hypotheses
Challenge: Learn the best tree from data
Remember Occam’s Razor: among the trees that fit the data, prefer the smallest one.
Back to: How to Split at Internal Nodes?
Decision Tree Splitting: Purity & Entropy
Goal: Create maximally pure child nodes
Optimal Split: Maximizes information gain: Entropy(before) – Weighted Entropy(after)
Entropy (Measure of Impurity)
Measure of uncertainty or impurity in a dataset. In the context of classification, it quantifies how mixed the class labels are.
If a dataset S contains c classes, the entropy of S is:
H(S) = − Σᵢ pᵢ · log₂(pᵢ), summed over the c classes, where pᵢ is the fraction of samples in S belonging to class i.
Interpretation: H(S) = 0 when S is pure (all one class); H(S) is maximal (log₂ c, i.e., 1 bit for two classes) when the classes are evenly mixed.
Example: S = [9+, 5−] gives H(S) = −(9/14)·log₂(9/14) − (5/14)·log₂(5/14) ≈ 0.940.
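The entropy formula can be computed directly from class counts; a minimal sketch:

```python
import math

def entropy(counts):
    """Entropy of a node given its class counts, e.g. [9, 5] for 9+/5-."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

print(entropy([9, 5]))  # full tennis dataset, ~0.940
print(entropy([4, 0]))  # pure node: 0
print(entropy([3, 3]))  # evenly mixed node: 1
```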
Example: Computing Entropy
Entropy = 0 for a pure node (e.g., [4+, 0−]); Entropy = 1 for an evenly mixed node (e.g., [3+, 3−]); in between for partially mixed nodes such as [6+, 2−] (Entropy ≈ 0.811).
Information Gain
Measures the reduction in entropy after splitting a dataset based on a feature.
Objective: Choose the feature with the highest information gain for splitting.
At each node of the decision tree, we choose the attribute that maximizes information gain as the splitting criterion. This partitions the dataset into smaller, more homogeneous subsets based on the values of the chosen attribute.
From Entropy to Information Gain
Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)
where S_v is the subset of S for which attribute A takes value v.
A split whose children are as mixed as the parent has zero IG; a split that produces purer children has higher IG.
Remember: the goal is to find the attribute that maximizes information gain (equivalently, minimizes the weighted entropy after the split), reducing uncertainty and making the subsets more homogeneous in the class label, which yields better classification performance.
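Gain(S, A) can be sketched directly from the definition, assuming categorical attribute values (the toy Color/Shape data below is hypothetical):

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(examples, labels, attribute):
    """Gain(S, A) = H(S) - sum over values v of |S_v|/|S| * H(S_v)."""
    total = len(labels)
    subsets = {}
    for ex, y in zip(examples, labels):
        subsets.setdefault(ex[attribute], []).append(y)  # partition by value
    weighted = sum(len(ys) / total * entropy(ys) for ys in subsets.values())
    return entropy(labels) - weighted

# Toy data: Color separates the classes perfectly, Shape not at all.
examples = [{"Color": "Red",  "Shape": "Square"},
            {"Color": "Red",  "Shape": "Circle"},
            {"Color": "Blue", "Shape": "Square"},
            {"Color": "Blue", "Shape": "Circle"}]
labels = ["+", "+", "-", "-"]

print(information_gain(examples, labels, "Color"))  # 1.0 (perfect split)
print(information_gain(examples, labels, "Shape"))  # 0.0 (zero IG)
```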
Example: Entropy and Information Gain
Consider the training examples:
Day | Outlook | Temp | Humidity | Wind | Tennis? |
D1 | Sunny | Hot | High | Weak | No |
D2 | Sunny | Hot | High | Strong | No |
D3 | Overcast | Hot | High | Weak | Yes |
D4 | Rain | Mild | High | Weak | Yes |
D5 | Rain | Cool | Normal | Weak | Yes |
D6 | Rain | Cool | Normal | Strong | No |
D7 | Overcast | Cool | Normal | Strong | Yes |
D8 | Sunny | Mild | High | Weak | No |
D9 | Sunny | Cool | Normal | Weak | Yes |
D10 | Rain | Mild | Normal | Weak | Yes |
D11 | Sunny | Mild | Normal | Strong | Yes |
D12 | Overcast | Mild | High | Strong | Yes |
D13 | Overcast | Hot | Normal | Weak | Yes |
D14 | Rain | Mild | High | Strong | No |
Determine the Root Attribute:
Let’s use the IG-based criterion to construct a DT for the Tennis example.
S_weak = [6+, 2−] ⇒ H(S_weak) = 0.811
S_strong = [3+, 3−] ⇒ H(S_strong) = 1.000
[Figure: candidate split on Wind. Parent: 9+, 5−, E = 0.940. Weak branch: 6+, 2−, E = 0.811. Strong branch: 3+, 3−, E = 1.000.]
Gain(S, Wind) = H(S) − (8/14)·H(S_weak) − (6/14)·H(S_strong) = 0.940 − (8/14)(0.811) − (6/14)(1.000) ≈ 0.048
Computing the gain for every attribute: Gain(S, Outlook) ≈ 0.247, Gain(S, Humidity) ≈ 0.152, Gain(S, Temp) ≈ 0.029. Outlook has the highest gain, so it becomes the root.
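These numbers can be checked in code against the 14 training examples from the table (a sketch; attribute names follow the slides):

```python
import math
from collections import Counter

# The 14 tennis examples: (Outlook, Temp, Humidity, Wind, Play)
data = [
    ("Sunny", "Hot", "High", "Weak", "No"),      ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),  ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),   ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),  ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),("Rain", "Mild", "High", "Strong", "No"),
]
ATTRS = {"Outlook": 0, "Temp": 1, "Humidity": 2, "Wind": 3}

def entropy(rows):
    """Entropy of the class label (last column) over a set of rows."""
    total = len(rows)
    counts = Counter(r[-1] for r in rows)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def gain(rows, attr):
    """Information gain of splitting `rows` on the named attribute."""
    i, total = ATTRS[attr], len(rows)
    weighted = 0.0
    for v in {r[i] for r in rows}:
        sub = [r for r in rows if r[i] == v]
        weighted += len(sub) / total * entropy(sub)
    return entropy(rows) - weighted

for a in ATTRS:
    print(a, round(gain(data, a), 3))
# Outlook comes out with the highest gain, so it becomes the root.
```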
Sorting the Training Examples and Growing the Tree
Ssunny= {D1,D2,D8,D9,D11}
Gain (Ssunny, Humidity) = .970
Gain (Ssunny, Temp) = .570
Gain (Ssunny, Wind) = .019
[Figure: root split on Outlook over {D1,…,D14} (9+, 5−). Sunny → {D1,D2,D8,D9,D11} (2+, 3−): still impure, split further. Overcast → {D3,D7,D12,D13} (4+, 0−): pure, leaf Yes. Rain → {D4,D5,D6,D10,D14} (3+, 2−): still impure, split further.]
Organize training data at each decision node based on the attribute that best separates the examples into distinct classes.
Growing the tree
Final Decision Tree
[Figure: Outlook at the root. Sunny → Humidity (High → No; Normal → Yes). Overcast → Yes. Rain → Wind (Strong → No; Weak → Yes).]
One more example: Training a Decision Tree
[Figure: a tree trained on 2D data by alternating vertical and horizontal axis-aligned splits, each chosen by entropy/information gain, until the final decision tree is reached.]
What if features are continuous?
Internal nodes can test the value of a feature against a threshold.
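One common way to pick such a threshold (used by C4.5-style learners) is to sort the feature’s values and try the midpoints between consecutive distinct values as candidates, keeping the one with the highest information gain. A sketch, with hypothetical temperature data:

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def best_threshold(values, labels):
    """Find the threshold t (a midpoint between consecutive sorted values)
    that maximizes information gain for the test `value <= t`."""
    pairs = sorted(zip(values, labels))
    base, total = entropy(labels), len(labels)
    best_t, best_gain = None, -1.0
    for i in range(1, total):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between equal values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for v, y in pairs if v <= t]
        right = [y for v, y in pairs if v > t]
        g = base - (len(left) / total) * entropy(left) \
                 - (len(right) / total) * entropy(right)
        if g > best_gain:
            best_t, best_gain = t, g
    return best_t, best_gain

# Hypothetical temperatures with play/no-play labels:
temps  = [40, 48, 60, 72, 80, 90]
labels = ["No", "No", "Yes", "Yes", "Yes", "No"]
t, g = best_threshold(temps, labels)
print(t, round(g, 3))  # the best cut falls midway between 48 and 60
```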
Overfitting
[Figure: a 2D grid of + examples (Attribute 1 vs. Attribute 2) containing a single stray − example; a very deep tree is required to fit just that one odd training example.]
When to stop Splitting Further?
Consider adding one noisy training example to the tennis data:
Day | Outlook | Temp | Humidity | Wind | Tennis? |
D15 | Sunny | Hot | Normal | Strong | No |
Overfitting in Decision Tree
[Figure: the tree grown after adding the noisy example D15. The Sunny → Humidity = Normal branch is no longer pure, so the tree grows further splits (e.g., on Wind) to fit the noise. These extra branches reflect noise or other coincidental regularities rather than the true concept.]
Overfitting in Decision Tree - Example
When to stop growing the tree?
Stop when further splitting no longer improves (or begins to hurt) the validation set accuracy.
When to stop growing the tree?
(To help prevent the tree from growing too much!)
Avoiding Overfitting in DTs
Two basic approaches
1. Prepruning: Stop growing the tree early, before it fits the training data perfectly (e.g., when no split is statistically significant, or when a depth or size limit is reached).
Advantages: Faster; never builds subtrees that would later be discarded.
Disadvantages: Hard to choose a good stopping threshold; may stop too early and miss useful splits further down.
2. Postpruning: Grow the full tree (allow it to overfit), then remove nodes that lack sufficient evidence or do not significantly contribute to the tree’s overall accuracy. (More popular.)
Advantages: Usually yields better trees, since each subtree’s contribution can be evaluated on the fully grown tree.
Disadvantages: More computation, since the full tree must be built first.
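In scikit-learn, postpruning is available as minimal cost-complexity pruning via the `ccp_alpha` parameter. A sketch on a synthetic dataset (not the lecture’s data), selecting `ccp_alpha` by validation accuracy:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, for illustration only.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Grow the full tree (allow overfitting) ...
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
# ... then get the sequence of effective pruning strengths.
path = full.cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_acc = 0.0, 0.0
for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    pruned.fit(X_train, y_train)
    acc = pruned.score(X_val, y_val)  # keep the alpha that validates best
    if acc >= best_acc:               # ties prefer the larger alpha (smaller tree)
        best_alpha, best_acc = alpha, acc

pruned = DecisionTreeClassifier(random_state=0,
                                ccp_alpha=best_alpha).fit(X_train, y_train)
# Pruning only removes nodes, so the pruned tree is never deeper than the full one.
```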
Summary: Determine when to stop splitting
Stopping Criteria:
The recursive splitting process continues until a stopping criterion is met. Common stopping criteria include:
All samples at the node belong to the same class (pure node).
No attributes remain to split on.
A maximum tree depth is reached.
The number of samples at the node falls below a minimum.
The best split’s information gain falls below a threshold.
Scikit Learn Decision Tree
Scikit Learn: Example with Custom Parameters:
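A sketch of what such an example might look like (the Iris dataset and the parameter values are illustrative assumptions, not the lecture’s): `criterion="entropy"` selects information-gain-based splits, while `max_depth` and `min_samples_leaf` act as prepruning parameters, and `export_text` prints each root-to-leaf path as an IF-THEN rule.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)

clf = DecisionTreeClassifier(
    criterion="entropy",   # split by information gain
    max_depth=3,           # prepruning: limit tree depth
    min_samples_leaf=5,    # prepruning: minimum samples per leaf
    random_state=42,
)
clf.fit(X_train, y_train)

print("Test accuracy:", clf.score(X_test, y_test))
# Each root-to-leaf path is an IF-THEN decision rule:
print(export_text(clf, feature_names=list(iris.feature_names)))
```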
The End