
Swayam Prabha

Course Title

Multivariate Data Mining- Methods and Applications

Lecture 33

Recursive Partitioning: Decision Trees

By

Anoop Chaturvedi

Department of Statistics, University of Allahabad

Prayagraj (India)

Slides can be downloaded from https://sites.google.com/view/anoopchaturvedi/swayam-prabha


Classification

  • Task of assigning objects to predefined categories
  • Objective of many data mining tools

Examples

  • Detecting spam emails based upon message content.
  • Classifying tumor cells as malignant or benign using MRI scans.
  • Classifying galaxies based upon their shapes.


Decision Tree or Classification Tree

  • A flow-chart-like tree structure used for classification purposes.
  • A supervised learning approach.
  • The result of asking an ordered sequence of questions.
  • The question asked at each step depends upon the answers to the previous questions in the sequence.
  • The sequence terminates in a prediction of the class (a minimal sketch of such a question sequence is given after this list).
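
A minimal sketch (not from the lecture; the spam-filter features, thresholds, and question order are hypothetical) of a decision tree viewed as an ordered sequence of questions, where each later question depends on the earlier answers:

def classify_email(word_freq_free, num_links, sender_known):
    """Return 'spam' or 'not spam' by asking an ordered sequence of questions."""
    if word_freq_free > 0.05:        # question 1
        if num_links > 3:            # question 2, asked only when the answer to question 1 is yes
            return "spam"
        return "not spam"
    if not sender_known:             # alternative question when the answer to question 1 is no
        return "spam"
    return "not spam"

print(classify_email(word_freq_free=0.10, num_links=5, sender_known=False))   # -> spam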


Node ⇒ A subset of the learning set; a node can be terminal or non-terminal.

Non-terminal node or parent node ⇒ A node that splits into two daughter nodes.

Terminal node or leaf node ⇒ A node that is not split any further.

The decision tree is a binary tree: each non-terminal node is split into two daughter nodes, and these daughters are the roots of two disjoint binary trees called the left subtree and the right subtree of that node. (A minimal sketch of such a node structure is given below.)
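
A minimal Python sketch, under assumed names, of the binary-tree structure just described: a non-terminal node stores a splitting question and its two daughter nodes, while a terminal (leaf) node stores only a class label.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    feature: Optional[str] = None       # splitting variable (None for a leaf)
    threshold: Optional[float] = None   # split point: go left if x[feature] <= threshold
    left: Optional["Node"] = None       # root of the left subtree
    right: Optional["Node"] = None      # root of the right subtree
    label: Optional[str] = None         # class label (set only for terminal nodes)

    def is_terminal(self) -> bool:
        return self.left is None and self.right is None

# A single split node (the root) with two terminal daughter nodes (a stump).
stump = Node(feature="x1", threshold=2.5,
             left=Node(label="class 0"), right=Node(label="class 1"))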


  • The goal is to learn a decision tree from training data such that class labels for new examples can be determined (see the sketch following this list).
  • The unique starting point, called the root node, sits at the top of the tree and consists of the entire learning set.
  • Nodes are tests on feature values.
  • An internal node denotes a test on an attribute.
  • A branch leaving a node represents an outcome (value) of the test on that node's attribute.
  • Conjunctions of the feature conditions along the branches from the root lead to class labels at the leaves.
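
A minimal sketch of learning a decision tree from labelled training data; it uses scikit-learn's DecisionTreeClassifier, and the toy feature vectors and labels are made up for illustration.

from sklearn.tree import DecisionTreeClassifier

X_train = [[0.10, 5], [0.00, 1], [0.20, 8], [0.00, 0]]   # training feature vectors
y_train = ["spam", "ham", "spam", "ham"]                 # class labels

tree = DecisionTreeClassifier(criterion="gini", max_depth=2)
tree.fit(X_train, y_train)

# A label for a new example is obtained by passing it down the fitted tree.
print(tree.predict([[0.15, 6]]))    # e.g. ['spam']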


  • A terminal or leaf node is assigned a class label or a class distribution (see the sketch after this list). The set of all terminal nodes defines a partition of the data.
  • Each root is the parent of the roots of its subtrees.
  • A tree with a single split node and only two terminal nodes is called a stump.
  • Two nodes with the same parent are called siblings.
  • The root node has no parent.
  • A split may be viewed as an edge extending from a parent to its children; thus edges connect a node with the roots of its subtrees.
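
A minimal sketch (with made-up leaf contents) of how a terminal node is assigned a class distribution and a class label: the leaf keeps the relative class frequencies of the training observations that reach it, and its label is the majority class.

from collections import Counter

leaf_cases = ["benign", "benign", "malignant", "benign"]     # hypothetical observations in one leaf
counts = Counter(leaf_cases)
n = len(leaf_cases)

class_distribution = {c: k / n for c, k in counts.items()}   # {'benign': 0.75, 'malignant': 0.25}
leaf_label = counts.most_common(1)[0][0]                     # 'benign'
print(class_distribution, leaf_label)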


Example: Decision tree with two input variables, five terminal nodes and four splits.


Node impurity functions for the two-class case: the rescaled entropy function (red curve), the Gini index (green curve), and the resubstitution estimate of the misclassification rate (blue curve). The corresponding formulas are given below.
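
For reference, the standard forms of these impurity measures in the two-class case, written in terms of the proportion p of (say) class 1 in the node, are as follows (these formulas are standard and not taken from the slide; "rescaled" means the entropy curve is multiplied by a constant, e.g. 1/2, so that all three curves agree at p = 1/2):

\begin{align*}
  i_{E}(p) &= -\,p \log_2 p \;-\; (1-p)\,\log_2 (1-p)  &&\text{entropy}\\
  i_{G}(p) &= 2\,p\,(1-p)                              &&\text{Gini index}\\
  i_{M}(p) &= \min(p,\; 1-p)                           &&\text{misclassification rate}
\end{align*}

All three are zero for a pure node (p = 0 or 1) and are maximized at p = 1/2.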


 

Cross-tabulation of the two classes (0 and 1) against the left (L) and right (R) daughter nodes of a candidate split; the cell counts on the slide were not recovered, so they are shown symbolically:

Class    L        R        Total
0        n(0,L)   n(0,R)   n(0)
1        n(1,L)   n(1,R)   n(1)
Total    n(L)     n(R)     n

A worked example of evaluating such a split follows.
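
A minimal sketch (with hypothetical cell counts) of how such a table is used to judge a split: the goodness of the split is the decrease in node impurity, delta_i = i(parent) - p(L)*i(L) - p(R)*i(R), computed here with the two-class Gini index i = 2p(1 - p).

def gini(n0, n1):
    """Two-class Gini index of a node containing n0 class-0 and n1 class-1 cases."""
    n = n0 + n1
    p = n1 / n
    return 2 * p * (1 - p)

# Hypothetical cell counts:        L   R
n0L, n0R = 40, 10                 # class 0
n1L, n1R = 15, 35                 # class 1

nL, nR = n0L + n1L, n0R + n1R
n = nL + nR
delta_i = gini(n0L + n0R, n1L + n1R) - (nL / n) * gini(n0L, n1L) - (nR / n) * gini(n0R, n1R)
print(round(delta_i, 4))          # larger decrease in impurity => better split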

 
