1 of 73

CS60050: Machine Learning

Sourangshu Bhattacharya

CSE, IIT Kharagpur

2 of 73

NON-PARAMETRIC MODELS

3 of 73

Instance-Based Classifiers

  • Store the training records
  • Use training records to predict the class label of unseen cases

4 of 73

Instance Based Classifiers

  • Examples:
    • Rote-learner
      • Memorizes entire training data and performs classification only if attributes of record match one of the training examples exactly

    • Nearest neighbor
      • Uses k “closest” points (nearest neighbors) for performing classification

5 of 73

Definition of Nearest Neighbor

The k-nearest neighbors of a record x are the data points that have the k smallest distances to x

6 of 73

Nearest Neighbor Classification

  • Compute distance between two points:
    • Euclidean distance

  • Determine the class from nearest neighbor list
    • take the majority vote of class labels among the k-nearest neighbors
    • Weigh the vote according to distance
      • weight factor, w = 1/d² (see the sketch below)
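A minimal sketch of both voting schemes above (plain majority vote vs. 1/d² weighting); the function and array names are illustrative, not from the slides:

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=5, weighted=True):
    """Minimal k-NN classifier: Euclidean distance, optional 1/d^2 vote weights."""
    d = np.sqrt(((X_train - x) ** 2).sum(axis=1))   # Euclidean distance to every training record
    nn = np.argsort(d)[:k]                          # indices of the k nearest neighbors
    votes = {}
    for i in nn:
        w = 1.0 / (d[i] ** 2 + 1e-12) if weighted else 1.0   # weight factor w = 1/d^2
        votes[y_train[i]] = votes.get(y_train[i], 0.0) + w
    return max(votes, key=votes.get)                # class with the largest (weighted) vote
```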

7 of 73

Nearest Neighbor Classification…

  • Choosing the value of k:
    • If k is too small, sensitive to noise points
    • If k is too large, neighborhood may include points from other classes

8 of 73

Nearest Neighbor Classification…

  • Scaling issues
    • Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes
    • Example of an attribute dominating distance computation:
      • height of a person may vary from 1.5m to 1.8m
      • weight of a person may vary from 90lb to 300lb
      • income of a person may vary from $10K to $1M
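One common remedy is to rescale every attribute to a comparable range before computing distances. A small illustrative sketch using min-max scaling (the numbers are just the ranges quoted above):

```python
import numpy as np

# Hypothetical records: [height (m), weight (lb), income ($)]
X = np.array([[1.5,  90.0,    10_000.0],
              [1.8, 300.0, 1_000_000.0]])

# Min-max scale each column to [0, 1] so that no single attribute dominates the distance
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(X_scaled)   # every attribute now contributes on a comparable scale
```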

9 of 73

Decision Tree

  • Setting:
    • Instances describable by attribute–value pairs.
    • Target function is discrete valued.
    • Disjunctive hypotheses may be required.
    • Training data may contain errors.
    • Training data may contain missing attribute values.

10 of 73

Decision Tree

  • Instance has values of attributes.
    • Same as feature values
  • Target value is discrete.
  • Each internal node tests an attribute.
  • Each branch corresponds to a value of an attribute.
  • Each leaf node assigns a classification.

11 of 73

Decision tree - Example

12 of 73

Decision tree learning

  • For a test datapoint:
    • Determine the branches to take at each level based on attribute value.
    • At a leaf level, return the label of current node.
  • For training, at each level:
    • Select an attribute to split the training dataset on.
    • Examine the purity of the data at each resulting node and determine whether it is a leaf node.
    • Determine the label at each node.

13 of 73

What attribute to select ?

  • Entropy:
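    • For a sample S with class proportions p_i, the standard definition is Entropy(S) = − Σ_i p_i log₂ p_i (zero for a pure sample, maximal when all classes are equally likely).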

14 of 73

What attribute to select ?

15 of 73

What attribute to select ?

  • Entropy:

16 of 73

Information Gain
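  • A standard form: Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v), i.e. the expected reduction in entropy from splitting S on attribute A.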

17 of 73

ID3

  • Hypothesis space: the set of all finite discrete-valued functions.
    • The hypothesis space therefore always contains the target function.
  • ID3 maintains a single hypothesis.
    • Does not enumerate all consistent hypotheses, so it cannot determine the most informative next training example.
  • Utilizes all the data at each stage.
    • Higher computational cost compared to one-pass algorithms.
  • Performs hill climbing without backtracking.
    • Can converge to a local minimum.
  • The attribute selection criterion is robust to errors in the data.

18 of 73

ID3 Algorithm

19 of 73

ID3 example

20 of 73

Root level choice

  • Gain(S,outlook) = 0.246
  • Gain(S, Humidity) = 0.151
  • Gain(S, Wind) = 0.048
  • Gain(S, Temperature) = 0.029
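As a check, assuming the standard PlayTennis data (9 positive / 5 negative examples; Wind = Weak covers 6+/2−, Wind = Strong covers 3+/3−):

Gain(S, Wind) = 0.940 − (8/14)·0.811 − (6/14)·1.0 ≈ 0.048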

21 of 73

Root level tree

22 of 73

Tree

23 of 73

Inductive Bias

  • Shorter trees are preferred over longer trees.
  • Occam’s razor: “simpler” hypotheses are better than more complex hypotheses.
  • Shorter trees are “simpler” because there are fewer of them.

  • Actually: shorter trees that place attributes with higher information gain closer to the root are preferred.

24 of 73

Inductive Bias

25 of 73

Occam’s razor

26 of 73

Overfitting

27 of 73

Overfitting

28 of 73

Overfitting

29 of 73

Avoiding overfitting

  • Which is computationally better ?

30 of 73

Avoiding overfitting

Which tree is best ?

  • Measure the accuracy over a separate validation set.
  • Perform statistical tests over training data, e.g. chi square tests.
  • Minimize an explicit performance measure that combines training-set performance and the “complexity” of the model, e.g. MDL.

31 of 73

Reduced error pruning

32 of 73

Reduced error pruning

33 of 73

Rule post pruning

34 of 73

Other issues

  • Handling continuous-valued attributes.
    • Create a split that produces the highest information gain (see the sketch below).
  • Handling a large number of discrete-valued attributes.
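A sketch of the usual approach for a single continuous attribute: sort the values, treat midpoints between adjacent distinct values as candidate thresholds, and keep the threshold with the highest information gain (the function names are illustrative):

```python
import numpy as np

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def best_threshold(values, labels):
    """Binary split threshold on a continuous attribute with the highest information gain."""
    order = np.argsort(values)
    v, y = np.asarray(values)[order], np.asarray(labels)[order]
    base = entropy(y)
    best_gain, best_t = -1.0, None
    for i in range(1, len(v)):
        if v[i] == v[i - 1]:
            continue                                  # no threshold between equal values
        t = (v[i] + v[i - 1]) / 2.0                   # candidate threshold: midpoint
        left, right = y[:i], y[i:]
        gain = base - (len(left) / len(y)) * entropy(left) \
                    - (len(right) / len(y)) * entropy(right)
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain
```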

35 of 73

Other issues

  • Handling missing attribute values.
  • Training set:
    • Assign a distribution based on the existing values.
    • Or use a default branch for the attribute.
  • Test set:
    • Follow the default branch, which is populated using the examples with missing values.

36 of 73

Attributes with Costs

  • Consider
    • medical diagnosis: BloodTest has a cost of $150, Pulse has a cost of $5.
    • robotics: Width-From-1ft has a cost of 23 sec., Width-From-2ft a cost of 10 sec.
  • How to learn a consistent tree with low expected cost?
  • Replace gain by
    • Tan and Schlimmer (1990)

    • Nunez (1988)

where ω ∈ [0, 1] determines importance of cost
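For reference, the commonly cited forms of these two criteria are Gain²(S, A) / Cost(A) (Tan and Schlimmer) and (2^Gain(S, A) − 1) / (Cost(A) + 1)^ω (Nunez).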


37 of 73

Gini Index

  • Another sensible measure of impurity (i and j are classes)

  • After applying attribute A, the resulting Gini index is

  • Gini can be interpreted as expected error rate
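  • In the usual notation: Gini(t) = Σ_{i≠j} P(i|t)·P(j|t) = 1 − Σ_i P(i|t)², and after splitting on attribute A into children v with N_v records each, Gini_A = Σ_v (N_v / N) · Gini(v).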

38 of 73

Gini Index

[Example dataset of shapes. Attributes: color, border, dot. Classification: triangle, square.]

39 of 73

Gini Index for Color

[Figure: the example data split on Color?, with branches red, yellow, green.]

40 of 73

Gain of Gini Index


41 of 73

Three Impurity Measures


42 of 73

Regression Tree

  • Similar to classification
  • Use a set of attributes to predict the value (instead of a class label)
  • Instead of computing information gain, compute the sum of squared errors
  • Partition the attribute space into a set of rectangular subspaces, each with its own predictor
    • The simplest predictor is a constant value
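A minimal illustration with scikit-learn's DecisionTreeRegressor on synthetic data (the data is purely illustrative); the fitted tree partitions the input range into regions, each predicting a constant:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression data: noisy sine wave
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 6, size=200)).reshape(-1, 1)
y = np.sin(X).ravel() + 0.1 * rng.randn(200)

# A shallow tree: each leaf predicts the mean output of the training points that reach it
tree = DecisionTreeRegressor(max_depth=3).fit(X, y)
print(tree.predict([[1.0], [4.0]]))   # piecewise-constant predictions
```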


43 of 73

Rectilinear Division

  • A regression tree is a piecewise constant function of the input attributes


[Figure: a regression tree with tests X1 ≤ t1, X2 ≤ t2, X1 ≤ t3, X2 ≤ t4 and leaves r1–r5, alongside the corresponding partition of the (X1, X2) plane into rectangles r1–r5 with split points t1, t2, t3.]

44 of 73

Growing Regression Trees

  • To minimize the square error on the learning sample, the prediction at a leaf is the average output of the learning cases reaching that leaf
  • Impurity of a sample is defined by the variance of the output in that sample:

I(LS) = var_{y|LS}{y} = E_{y|LS}{ (y − E_{y|LS}{y})² }

  • The best split is the one that reduces the most variance:

ΔI(LS, A) = var_{y|LS}{y} − Σ_a (|LS_a| / |LS|) · var_{y|LS_a}{y}

where LS_a denotes the subset of LS for which attribute A takes the value a.
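A small sketch of this criterion for a single discrete candidate attribute (illustrative names, not code from the slides):

```python
import numpy as np

def variance_reduction(y, a):
    """Delta I for splitting outputs y on the values of a discrete attribute a."""
    y, a = np.asarray(y, dtype=float), np.asarray(a)
    total = y.var()                                    # var_{y|LS}{y}
    weighted = 0.0
    for val in np.unique(a):
        y_sub = y[a == val]                            # outputs in LS_a
        weighted += (len(y_sub) / len(y)) * y_sub.var()
    return total - weighted                            # variance removed by the split

# Example: an attribute that separates low outputs from high outputs
print(variance_reduction([1.0, 1.2, 5.0, 5.3], ["a", "a", "b", "b"]))
```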

45 of 73

Regression Tree Pruning

  • Exactly the same algorithms apply: pre-pruning and post-pruning.
  • In post-pruning, the tree that minimizes the squared error on VS is selected.
  • In practice, pruning is more important in regression because full trees are much more complex (often every object has a different output value, so the full tree has as many leaves as there are objects in the learning sample)


46 of 73

When Are Decision Trees Useful ?

  • Advantages
    • Very fast: can handle very large datasets with many attributes
    • Flexible: several attribute types, classification and regression problems, missing values…
    • Interpretability: provide rules and attribute importance
  • Disadvantages
    • Instability of the trees (high variance)
    • Not always competitive with other algorithms in terms of accuracy


47 of 73


History of Decision Tree Research

  • Hunt and colleagues in Psychology used full-search decision tree methods to model human concept learning in the 60’s
  • Quinlan developed ID3, with the information gain heuristic, in the late 70’s to learn expert systems from examples
  • Breiman, Friedman and colleagues in statistics developed CART (Classification And Regression Trees) simultaneously
  • A variety of improvements in the 80’s: coping with noise, continuous attributes, missing data, non-axis-parallel splits, etc.
  • Quinlan’s updated algorithm, C4.5 (1993), is commonly used (newer: C5)
  • Boosting (or bagging) over DTs is a good general-purpose algorithm

48 of 73

ENSEMBLE METHODS

49 of 73

Rationale for Ensemble Learning

  • No Free Lunch theorem: there is no algorithm that is always the most accurate
  • Generate a group of base-learners which when combined have higher accuracy
  • Different learners use different
    • Algorithms
    • Parameters
    • Representations (Modalities)
    • Training sets
    • Subproblems


50 of 73

Voting

  • Linear combination

  • Classification
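A common form (illustrative notation): the combined prediction is y = Σ_{j=1}^{L} w_j d_j with w_j ≥ 0 and Σ_j w_j = 1; for classification, compute y_i = Σ_j w_j d_{ji} for each class i and pick the class with the largest y_i (simple majority voting is the special case w_j = 1/L).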


51 of 73

BAGGING

52 of 73

Ensemble methods

  • A single decision tree does not perform well
  • But, it is super fast
  • What if we learn multiple trees?

We need to make sure they do not all just learn the same thing

53 of 73

Bagging

If we split the data in different random ways, the decision trees we learn give different results: high variance.

Bagging (bootstrap aggregating) is a method that results in low variance.

If we had multiple realizations of the data (multiple samples), we could calculate the prediction multiple times and take the average: averaging multiple noisy estimates produces a less uncertain result.

54 of 73

Bagging

Say for each sample b, we calculate fb(x), then:
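The bagged estimate is the average of the individual predictions: f_avg(x) = (1/B) Σ_{b=1}^{B} f_b(x).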

How?

Bootstrap

Construct B (hundreds of) trees (no pruning)

Learn a classifier for each bootstrap sample and average them

Very effective

55 of 73

[Figure with axes X1 and X2.]

56 of 73

Example

  • Pima Indians Diabetes data:
  • 768 examples, 8 features
  • Features: 'preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age'
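A sketch of bagging decision trees on this dataset with scikit-learn; the CSV path and column layout below are assumptions about how the data is stored (BaggingClassifier bags decision trees by default):

```python
import pandas as pd
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

# Assumed file layout: the 8 feature columns followed by the class label
cols = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pd.read_csv('pima-indians-diabetes.csv', names=cols)   # hypothetical path
X, y = data[cols[:-1]], data['class']

# Bag 100 decision trees, each trained on a bootstrap sample, and average their votes
bag = BaggingClassifier(n_estimators=100, oob_score=True, random_state=0)
print(cross_val_score(bag, X, y, cv=10).mean())   # cross-validated accuracy
bag.fit(X, y)
print(bag.oob_score_)                             # out-of-bag accuracy estimate
```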

57 of 73

Example

58 of 73

Example

[Plot: results vs. number of trees]

59 of 73

Out-of-Bag Error Estimation

  • No cross validation?
  • Remember, in bootstrapping we sample with replacement, and therefore not all observations are used for each bootstrap sample.
  • We call them out-of-bag samples (OOB)
  • We can predict the response for the i-th observation using each of the trees in which that observation was OOB and do this for n observations
  • Calculate overall OOB MSE or classification error

60 of 73

Bagging

  • Reduces overfitting (variance)
  • Normally uses one type of classifier
  • Decision trees are popular
  • Easy to parallelize

61 of 73

Bagging - issues

Each tree is identically distributed (i.d.)

🡺 the expectation of the average of B such trees is the same as the expectation of any one of them

  • the bias of bagged trees is the same as that of the individual trees

i.d. and not i.i.d

62 of 73

Bagging - issues

An average of B i.i.d. random variables, each with variance σ², has variance σ²/B

If i.d. (identical but not independent) and pair correlation ρ is present, then the variance is:
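ρσ² + ((1 − ρ) / B) · σ²

(the standard decomposition for identically distributed variables with pairwise correlation ρ)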

As B increases the second term disappears but the first term remains

Why does bagging generate correlated trees?

63 of 73

Bagging - issues

Suppose that there is one very strong predictor in the data set, along with a number of other moderately strong predictors.

Then all bagged trees will select the strong predictor at the top of the tree and therefore all trees will look similar.

How do we avoid this?

64 of 73

RANDOM FORESTS

65 of 73

Random Forests

As in bagging, we build a number of decision trees on bootstrapped training samples; but each time a split in a tree is considered, a random sample of m predictors is chosen as split candidates from the full set of p predictors.

Note that if m = p, then this is bagging.

66 of 73

Random Forests

Random forests are popular. Leo Breiman and Adele Cutler maintain a random forest website where the software is freely available, and of course it is included in every ML/STAT package.

http://www.stat.berkeley.edu/~breiman/RandomForests/

67 of 73

Random Forests Algorithm

For b = 1 to B:

(a) Draw a bootstrap sample Z∗ of size N from the training data.

(b) Grow a random-forest tree to the bootstrapped data, by recursively repeating the following steps for each terminal node of the tree, until the minimum node size nmin is reached.

i. Select m variables at random from the p variables.

ii. Pick the best variable/split-point among the m.

iii. Split the node into two daughter nodes.

Output the ensemble of trees.

To make a prediction at a new point x we do:

For regression: average the results

For classification: majority vote

68 of 73

Random Forests Tuning

The inventors make the following recommendations:

  • For classification, the default value for m is √p and the minimum node size is one.
  • For regression, the default value for m is p/3 and the minimum node size is five.

In practice the best values for these parameters will depend on the problem, and they should be treated as tuning parameters.

Like with Bagging, we can use OOB and therefore RF can be fit in one sequence, with cross-validation being performed along the way. Once the OOB error stabilizes, the training can be terminated.
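A sketch of these recommendations with scikit-learn on a synthetic problem (the dataset and parameter values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Illustrative data with p = 20 features
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Classification defaults suggested above: m = sqrt(p) candidate features per split,
# minimum node size of one; OOB samples give a built-in error estimate.
rf = RandomForestClassifier(n_estimators=300,
                            max_features='sqrt',
                            min_samples_leaf=1,
                            oob_score=True,
                            random_state=0).fit(X, y)
print(rf.oob_score_)   # OOB accuracy; stabilizes as the number of trees grows
```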

69 of 73

Example

70 of 73

Example

  • 4,718 genes measured on tissue samples from 349 patients.
  • Each gene has different expression
  • Each of the patient samples has a qualitative label with 15 different levels: either normal or 1 of 14 different types of cancer.

Use random forests to predict cancer type based on the 500 genes that have the largest variance in the training set.

71 of 73

Null choice (Normal)

72 of 73

Random Forests Issues

When the number of variables is large, but the fraction of relevant variables is small, random forests are likely to perform poorly when m is small

Why?

Because:

At each split the chance can be small that the relevant variables will be selected

For example, with 3 relevant and 100 not so relevant variables the probability of any of the relevant variables being selected at any split is ~0.25
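As a rough check: with p = 103 variables and the default m = √p ≈ 10 candidates per split, the chance that none of the 3 relevant variables is considered at a given split is C(100,10)/C(103,10) ≈ 0.73, so the chance that at least one is considered is about 0.27.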

73 of 73

Can RF overfit?

Random forests “cannot overfit” the data with respect to the number of trees.

Why?

Increasing the number of trees B does not increase the flexibility of the model.