1 of 35

CLASSIFICATION

DATAMINING

UNIT III

2 of 35

BASIC CONCEPTS

  • Input data -> collection of records.
  • Record / instance / example -> tuple (x, y)
    • x - attribute set
    • y - special attribute (class label / category / target attribute)
  • Attribute set - properties of a Data Object – Discrete / Continuous
  • Class label –
    • Classification – y is Discrete attribute
    • Regression (Predictive Modeling Task) - y is a continuous attribute.

3 of 35

BASIC CONCEPTS

  • Definition:
    • Classification is the task of learning a target function f that maps each attribute set x to one of the predefined class labels y.
    • The target function is also known informally as a classification model.

4 of 35

BASIC CONCEPTS

  • A classification model is useful for the following purposes.
    • Descriptive modeling: A classification model can serve as an explanatory tool to distinguish between objects of different classes.

5 of 35

BASIC CONCEPTS

  • A classification model is useful for the following purposes.
    • Predictive Modeling:
      • A classification model can also be used to predict the class label of unknown records.
      • Automatically assigns a class label when presented with the attribute set of an unknown record.
  • Classification techniques are best suited for binary or nominal categories.
  • They do not consider the implicit order among ordinal class labels.
  • Relationships among classes (e.g., superclass/subclass) are also ignored.

6 of 35

General approach to solving a classification problem

  • Classification technique (or classifier)
    • Systematic approach to building classification models from an input data set.
    • Examples
      • Decision tree classifiers,
      • Rule-based classifiers,
      • Neural networks,
      • Support vector machines, and
      • Naïve Bayes classifiers.
  • Learning algorithm
    • Used by the classifier
    • To identify a model
      • That best fits the relationship between the attribute set and class label of the input data.

7 of 35

General approach to solving a classification problem

  • Model
    • Generated by a learning algorithm
    • Should satisfy the following:
      • Fit the input data well
      • Correctly predict the class labels of records it has never seen before.
  • Training set
    • Consisting of records whose class labels are known
    • used to build a classification model

8 of 35

General approach to solving a classification problem

  • Confusion Matrix
    • Used to evaluate the performance of a classification model
    • Holds details about
      • counts of test records correctly and incorrectly predicted by the model.
      • Table 4.2 depicts the confusion matrix for a binary classification problem.
      • fij – no. of records from class i predicted to be of class j.
      • f01 – no. of records from class 0 incorrectly predicted as class 1.
      • total no. of correct predictions made (f11 + f00)
      • total number of incorrect predictions (f10 + f01).

9 of 35

General approach to solving a classification problem

  • Performance Metrics:
    1. Accuracy = (no. of correct predictions) / (total no. of predictions) = (f11 + f00) / (f11 + f10 + f01 + f00)

    2. Error rate = (no. of wrong predictions) / (total no. of predictions) = (f10 + f01) / (f11 + f10 + f01 + f00)
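Using the f_ij counts defined on the previous slide, both metrics can be sketched as follows (a minimal illustration; the counts used in the example are made up):

```python
# Accuracy and error rate from a 2x2 confusion matrix.
# f[i][j] = number of class-i records predicted as class j,
# following the f_ij notation above.
def accuracy(f):
    correct = f[1][1] + f[0][0]
    total = f[1][1] + f[1][0] + f[0][1] + f[0][0]
    return correct / total

def error_rate(f):
    wrong = f[1][0] + f[0][1]
    total = f[1][1] + f[1][0] + f[0][1] + f[0][0]
    return wrong / total

f = {1: {1: 40, 0: 10}, 0: {1: 5, 0: 45}}  # illustrative counts
print(accuracy(f))    # (40 + 45) / 100 = 0.85
print(error_rate(f))  # (10 + 5) / 100 = 0.15
```

Note that accuracy and error rate always sum to 1, so maximizing one is equivalent to minimizing the other.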

10 of 35

DECISION TREE INDUCTION

Working of Decision Tree

    • We can solve a classification problem by asking a series of carefully crafted questions about the attributes of the test record.
    • Each time we receive an answer, a follow-up question is asked until we reach a conclusion about the class label of the record.
    • The series of questions and their possible answers can be organized in the form of a decision tree
    • Decision tree is a hierarchical structure consisting of nodes and directed edges.

11 of 35

DECISION TREE INDUCTION

Working of Decision Tree

  • Three types of nodes:
      • Root node
        • No incoming edges
        • Zero or more outgoing edges.
      • Internal nodes
        • Exactly one incoming edge and
        • Two or more outgoing edges.
      • Leaf or terminal nodes
        • Exactly one incoming edge and
        • No outgoing edges.
    • Each leaf node is assigned a class label.
    • Non-terminal nodes (root & other internal nodes) contain attribute test conditions to separate records that have different characteristics.

12 of 35

DECISION TREE INDUCTION

Working of Decision Tree

13 of 35

DECISION TREE INDUCTION

Building Decision Tree

  • Hunt’s algorithm:
    • basis of many existing decision tree induction algorithms, including
      • ID3,
      • C4.5, and
      • CART.
    • In Hunt’s algorithm, a decision tree is grown in a recursive fashion by partitioning the training records into successively purer subsets.
    • Dt - set of training records associated with node t
    • y = {y1, y2,..., yc} -> set of class labels.
    • Steps of Hunt’s algorithm:
    • Step 1: If all the records in Dt belong to the same class yt, then t is a leaf node labeled as yt.
    • Step 2: If Dt contains records that belong to more than one class, an attribute test condition is selected to partition the records into smaller subsets. A child node is created for each outcome of the test condition and the records in Dt are distributed to the children based on the outcomes.
    • Note: the algorithm is then recursively applied to each child node.
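The two steps above can be sketched recursively (an illustrative, unoptimized version; attribute selection is simplified to taking the first remaining attribute, whereas real algorithms choose by an impurity-based criterion):

```python
from collections import Counter

def hunt(records, labels, attributes):
    """Recursive sketch of Hunt's algorithm. Records are dicts of
    attribute values; returns a nested-dict tree whose leaves are
    plain class labels."""
    # Step 1: all records belong to the same class -> leaf with that label.
    if len(set(labels)) == 1:
        return labels[0]
    # Identical attribute values (no attributes left to test) ->
    # majority-class leaf, per the additional conditions on the next slide.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Step 2: select a test attribute, create a child per outcome,
    # distribute the records, and recurse on each child.
    attr = attributes[0]
    tree = {attr: {}}
    for value in set(r[attr] for r in records):
        subset = [(r, y) for r, y in zip(records, labels) if r[attr] == value]
        tree[attr][value] = hunt([r for r, _ in subset],
                                 [y for _, y in subset],
                                 attributes[1:])
    return tree
```

For example, two records split on a hypothetical `owner` attribute yield a one-level tree with a leaf per attribute value.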

14 of 35

DECISION TREE INDUCTION

Building Decision Tree

  • Example: predicting whether a loan applicant will repay the loan or default.
    • Construct a training set by examining the records of previous borrowers.

15 of 35

DECISION TREE INDUCTION

Building Decision Tree

  • Hunt’s algorithm will work fine
    • if every combination of attribute values is present in the training data and
    • if each combination has a unique class label.
  • Additional conditions
    1. If a child node is empty (no records in the training set are associated with it), declare it a leaf node with the same class label as the majority class of training records associated with its parent node.
    2. If records have identical attribute values but different class labels, it is not possible to split further; declare the node a leaf with the same class label as the majority class of training records associated with this node.
  • Design Issues of Decision Tree Induction
    • 1. How should the training records be split?
      • Test condition to divide the records into smaller subsets.
      • provide a method for specifying the test condition
      • measure for evaluating the goodness of each test condition.

    • 2. How should the splitting procedure stop?
      • A stopping condition is needed
      • stop when either all the records belong to the same class or all the records have identical attribute values.

16 of 35

DECISION TREE INDUCTION

Methods for Expressing Attribute Test Conditions

  • Test condition for Binary Attributes
    • The test condition for a binary attribute generates two potential outcomes.

17 of 35

DECISION TREE INDUCTION

Methods for Expressing Attribute Test Conditions

  • Test condition for Nominal Attributes
    • nominal attribute can have many values
    • Test condition can be expressed in two ways
      • Multiway split - number of outcomes depends on the number of distinct values
      • Binary splits (used in CART) - produces binary splits by considering all 2^(k−1) − 1 ways of creating a binary partition of k attribute values.

18 of 35

DECISION TREE INDUCTION

Methods for Expressing Attribute Test Conditions

  • Test condition for Ordinal Attributes
    • Ordinal attributes can also produce binary or multiway splits.
    • values can be grouped without violating the order property.
    • Figure 4.10(c) is invalid because it groups values in a way that violates the order property.

19 of 35

DECISION TREE INDUCTION

Methods for Expressing Attribute Test Conditions

  • Test condition for Continuous Attributes
    • Test condition - Comparison test (A < v) or (A ≥ v) with binary outcomes,

or

    • Test condition - a range query with outcomes of the form vi ≤ A < vi+1, for i = 1,..., k.
      • Multiway split
      • Apply the discretization strategies

20 of 35

DECISION TREE INDUCTION

Measures for Selecting the Best Split

  • p(i|t) - fraction of records belonging to class i at a given node t.
    • Sometimes written simply as pi.
  • Two-class problem
    • (p0, p1) - class distribution at any node
    • p1 = 1 − p0
    • A (0.5, 0.5) distribution means there are an equal number of records from each class.
    • In the example, splitting on Car Type results in purer partitions.

21 of 35

DECISION TREE INDUCTION

Measures for Selecting the Best Split

  • selection of best split is based on the degree of impurity of the child nodes
  • Node with class distribution (0, 1) has zero impurity,
  • Node with uniform class distribution (0.5, 0.5) has the highest impurity.
  • p - fraction of records that belong to one of the two classes.
  • Impurity is maximum when p = 0.5 (the class distribution is even).
  • Impurity is minimum when p = 0 or p = 1 (all records belong to the same class).
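For a two-class node, the standard impurity measures can be sketched as functions of p (a minimal illustration; function names are my own):

```python
import math

def gini(p):
    """Gini index of a binary node, p = fraction in one class."""
    return 1.0 - (p**2 + (1.0 - p)**2)

def entropy(p):
    """Entropy (in bits) of a binary node; taken as 0 at p = 0 or 1."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def classification_error(p):
    """Misclassification error of a binary node."""
    return 1.0 - max(p, 1.0 - p)

# All three peak at p = 0.5 (even distribution) and vanish at p = 0 or 1.
print(gini(0.5), gini(0.0))        # 0.5 0.0
print(entropy(0.5), entropy(1.0))  # 1.0 0.0
```

Although the three measures differ in scale, they rank candidate splits similarly in most cases.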

22 of 35

DECISION TREE INDUCTION

Measures for Selecting the Best Split

  • Node N1 has the lowest impurity value, followed by N2 and N3.

23 of 35

DECISION TREE INDUCTION

Measures for Selecting the Best Split

  • To determine the performance of a test condition, compare the degree of impurity of the parent node (before splitting) with the degree of impurity of the child nodes (after splitting).
  • The larger their difference, the better the test condition.
  • Information Gain:

    • Δ = I(parent) − Σj [N(vj)/N] · I(vj), summed over the k children vj
    • I(·) - impurity measure of a given node,
    • N - total no. of records at the parent node,
    • k - no. of attribute values,
    • N(vj) - no. of records associated with the child node vj.
  • When entropy is used as the impurity measure I(·), the difference Δ is known as the information gain, ∆info.
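The gain computation above can be sketched over per-child class counts (an illustrative version; function names and the example counts are my own):

```python
def gini_counts(counts):
    """Gini index from a list of class counts, e.g. [3, 1]."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def weighted_child_impurity(child_counts, impurity):
    """Sum of N(vj)/N * I(vj) over the children."""
    n = sum(sum(c) for c in child_counts)
    return sum(sum(c) / n * impurity(c) for c in child_counts)

def gain(parent_counts, child_counts, impurity=gini_counts):
    """Delta = I(parent) - weighted average impurity of the children."""
    return impurity(parent_counts) - weighted_child_impurity(child_counts, impurity)

# Parent (4, 4) split into children (3, 1) and (1, 3):
print(gain([4, 4], [[3, 1], [1, 3]]))  # 0.5 - 0.375 = 0.125
```

A perfect split into pure children, e.g. (4, 0) and (0, 4), achieves the maximum gain of 0.5 for this parent.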

24 of 35

Calculate Impurity using Gini

  • Gini(t) = 1 − Σi [p(i|t)]²
  • Find out which attribute is selected.

25 of 35

26 of 35

DECISION TREE INDUCTION

Measures for Selecting the Best Split

  • Splitting of Binary Attributes
    • Before splitting, the Gini index is 0.5
      • because there are an equal number of records from both classes.
    • If attribute A is chosen to split the data,
      • Gini index
        • node N1 = 0.4898, and
        • node N2 = 0.480.
      • Weighted average of the Gini index for the descendent nodes is
        • (7/12) × 0.4898 + (5/12) × 0.480 = 0.486.
    • Weighted average of the Gini index for attribute B is 0.375.
    • Attribute B is selected because its weighted Gini index (0.375) is smaller than A’s (0.486).
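The weighted average for attribute A can be checked numerically from the node Gini values quoted above (7 records in N1 and 5 in N2, as in the slide's example):

```python
# Weighted-average Gini for attribute A's binary split.
g_n1, g_n2 = 0.4898, 0.480          # Gini of child nodes N1 and N2
weighted_a = (7/12) * g_n1 + (5/12) * g_n2
print(round(weighted_a, 3))         # 0.486

# Attribute B's weighted Gini (0.375, from the slide) is smaller,
# so B produces the purer split and is selected.
assert 0.375 < weighted_a
```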

27 of 35

28 of 35

DECISION TREE INDUCTION

Measures for Selecting the Best Split

  • Splitting of Nominal Attributes
    • First Binary Grouping
      • Gini index of {Sports, Luxury} is 0.4922 and
      • the Gini index of {Family} is 0.3750.
      • The weighted average Gini index

16/20 × 0.4922 + 4/20 × 0.3750 = 0.468.

    • Second binary grouping of {Sports} and {Family, Luxury},
      • weighted average Gini index is 0.167.

  • The second grouping has a lower Gini index because its corresponding subsets are much purer.

29 of 35

DECISION TREE INDUCTION

Measures for Selecting the Best Split

  • Splitting of Continuous Attributes

  • A brute-force method: take every value of the attribute in the N records as a candidate split position v.
    • For each candidate v, count the number of records with annual income less than or greater than v (computationally expensive).
  • To reduce the complexity, the training records are first sorted based on their annual income.
  • Candidate split positions are then identified by taking the midpoints between adjacent sorted values.
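The midpoint strategy can be sketched as follows (the income values are illustrative):

```python
# Candidate split positions for a continuous attribute:
# midpoints between adjacent values after sorting.
incomes = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
sorted_vals = sorted(incomes)
candidates = [(a + b) / 2 for a, b in zip(sorted_vals, sorted_vals[1:])]
print(candidates)
# [65.0, 72.5, 80.0, 87.5, 92.5, 97.5, 110.0, 122.5, 172.5]
```

This reduces the work per attribute to one sort plus a single pass, since the class counts on either side of each candidate can be updated incrementally while scanning the sorted values.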

30 of 35

DECISION TREE INDUCTION

Measures for Selecting the Best Split

  • Gain Ratio
    • Problem:
      • A test on an attribute such as Customer ID produces a large number of pure partitions.
      • Yet Customer ID is not a predictive attribute, because its value is unique for each record.
    • Two Strategies:
      • First strategy(used in CART)
        • restrict the test conditions to binary splits only.
      • Second strategy (used in C4.5: gain ratio to determine the goodness of a split)
        • modify the splitting criterion
        • consider - number of outcomes produced by the attribute test condition.

31 of 35

DECISION TREE INDUCTION

Measures for Selecting the Best Split

  • Gain Ratio
    • Gain ratio = ∆info / Split Info
    • Split Info = −Σi P(vi) log2 P(vi), where P(vi) is the fraction of records assigned to child vi and k is the total number of splits.
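The split-info penalty used by C4.5 can be sketched as follows (function names are my own):

```python
import math

def split_info(child_sizes):
    """Split Info = -sum over children of P(vi) * log2 P(vi)."""
    n = sum(child_sizes)
    return -sum(s / n * math.log2(s / n) for s in child_sizes if s)

def gain_ratio(delta_info, child_sizes):
    """Gain ratio = information gain / split info (C4.5).
    An attribute like Customer ID with many outcomes has a large
    split info, which penalizes its otherwise pure partitions."""
    return delta_info / split_info(child_sizes)

# An even 2-way split costs 1 bit; an even 4-way split costs 2 bits.
print(split_info([50, 50]))          # 1.0
print(split_info([25, 25, 25, 25]))  # 2.0
```

So for the same information gain, a test with more (evenly populated) outcomes receives a lower gain ratio.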

32 of 35

33 of 35

Tree-Pruning

  • After building the decision tree,
    • Tree-pruning step - to reduce the size of the decision tree.
    • Pruning -
      • trims the branches of the initial tree
      • improves the generalization capability of the decision tree.
  • Decision trees that are too large are susceptible to a phenomenon known as overfitting.

34 of 35

35 of 35