CS60050: Machine Learning
Sourangshu Bhattacharya
CSE, IIT Kharagpur
NON-PARAMETRIC MODELS
Instance-Based Classifiers
Instance-Based Classifiers
Definition of Nearest Neighbor
The k-nearest neighbors of a record x are the k data points with the smallest distances to x
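A minimal sketch of this rule in Python, using scikit-learn's KNeighborsClassifier (the toy data and k = 3 are illustrative):

```python
# Minimal k-NN sketch: classify a query point by majority vote
# among its k nearest training points (Euclidean distance).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[1.0, 2.0], [2.0, 1.0], [8.0, 9.0], [9.0, 8.0]])
y_train = np.array([0, 0, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3)   # use the k = 3 smallest distances
knn.fit(X_train, y_train)
print(knn.predict([[1.5, 1.5]]))            # -> [0], majority class of the 3 neighbors
```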
Nearest Neighbor Classification
Nearest Neighbor Classification…
Nearest Neighbor Classification…
Decision Tree
Decision Tree
Decision tree - Example
Decision tree learning
What attribute to select ?
What attribute to select ?
What attribute to select ?
Information Gain
ID3
ID3 Algorithm
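A minimal sketch of the entropy and information-gain computation that ID3 uses to choose the split attribute (the toy data and the attribute name are illustrative):

```python
# Entropy and information gain for a categorical attribute,
# as used by ID3 to choose the root (and each subsequent) split.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)."""
    total = entropy(labels)
    n = len(labels)
    by_value = {}
    for row, y in zip(rows, labels):
        by_value.setdefault(row[attribute], []).append(y)
    remainder = sum(len(subset) / n * entropy(subset) for subset in by_value.values())
    return total - remainder

# Toy data: each row is a dict of attribute values (illustrative).
rows = [{"outlook": "sunny"}, {"outlook": "sunny"},
        {"outlook": "rain"},  {"outlook": "rain"}]
labels = ["no", "no", "yes", "yes"]
print(information_gain(rows, labels, "outlook"))   # -> 1.0 (a perfect split)
```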
ID3 example
Root level choice
Root level tree
Tree
Inductive Bias
Inductive Bias
Occam’s razor
Overfitting
Overfitting
Overfitting
Avoiding overfitting
Avoiding overfitting
Which tree is best ?
Reduced error pruning
Reduced error pruning
Rule post pruning
Other issues
Other issues
Attributes with Costs
For example, replace the gain by (2^Gain(S,A) − 1) / (Cost(A) + 1)^ω, where ω ∈ [0, 1] determines the importance of cost
Gini Index
Gini Index
Attributes: color, border, dot
Classification: triangle, square
Gini Index for Color
[Figure: splitting the examples on Color? into red, yellow, and green branches]
Gain of Gini Index
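A minimal sketch of the Gini index and the corresponding gain of a split, applied to a toy version of the color/border/dot example above (the counts are illustrative, not those of the slide's figure):

```python
# Gini index and the Gini-based gain of a split, following the same
# pattern as information gain but with Gini(S) = 1 - sum_k p_k^2.
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_gain(values, labels):
    """Gain = Gini(S) - sum_v |S_v|/|S| * Gini(S_v) for a split on one attribute."""
    n = len(labels)
    by_value = {}
    for v, y in zip(values, labels):
        by_value.setdefault(v, []).append(y)
    weighted = sum(len(s) / n * gini(s) for s in by_value.values())
    return gini(labels) - weighted

# Illustrative counts only (not the exact figure from the slide):
color = ["red", "red", "yellow", "yellow", "green", "green"]
shape = ["triangle", "square", "triangle", "square", "square", "square"]
print(gini(shape))              # Gini of the full set (~0.444)
print(gini_gain(color, shape))  # reduction in Gini when splitting on color (~0.111)
```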
Three Impurity Measures
Regression Tree
Rectilinear Division
[Figure: a tree with splits X1 ≤ t1, X2 ≤ t2, X1 ≤ t3, X2 ≤ t4 and the corresponding rectilinear partition of the (X1, X2) plane into regions r1–r5]
Growing Regression Trees
I(LS) = var_{y|LS}{y} = E_{y|LS}{ (y − E_{y|LS}{y})² }

ΔI(LS, A) = var_{y|LS}{y} − Σ_a ( |LS_a| / |LS| ) · var_{y|LS_a}{y}
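A minimal numeric sketch of this variance-reduction criterion for a single candidate split (the target values and the split are illustrative):

```python
# Variance-based impurity for regression trees:
# I(LS) = var of y in the node; the score of a split on attribute A is
# Delta_I = var(y) - sum_a |LS_a|/|LS| * var(y in LS_a).
import numpy as np

def variance_reduction(y, groups):
    """y: all targets in the node; groups: list of index arrays, one per branch."""
    n = len(y)
    weighted = sum(len(g) / n * np.var(y[g]) for g in groups)
    return np.var(y) - weighted

y = np.array([1.0, 1.2, 0.9, 5.0, 5.1, 4.8])
left, right = np.arange(3), np.arange(3, 6)    # a candidate binary split
print(variance_reduction(y, [left, right]))    # large reduction -> good split
```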
Regression Tree Pruning
When Are Decision Trees Useful ?
History of Decision Tree Research
Hunt and colleagues used decision-tree methods to model human concept learning in the 60's
Quinlan developed ID3 in the late 70's to learn expert systems from examples
Breiman, Friedman and colleagues developed CART (classification and regression trees) simultaneously
Later work added handling of continuous attributes, missing data, non-axis-parallel splits, etc.
ENSEMBLE METHODS
Rationale for Ensemble Learning
Voting
BAGGING
Ensemble methods
We need to make sure they do not all just learn the same model
Bagging
If we split the data in random different ways, decision trees give different results: high variance.
Bagging (bootstrap aggregating) is a method that results in low variance.
If we had multiple realizations of the data (or multiple samples), we could compute the prediction multiple times and take the average; averaging multiple noisy estimates produces a less uncertain result.
Bagging
Say for each bootstrap sample b we compute f_b(x); then the bagged prediction is the average (see the sketch below): f_bag(x) = (1/B) Σ_{b=1}^{B} f_b(x)
How?
Bootstrap
Construct B (hundreds of) trees (no pruning)
Learn a classifier for each bootstrap sample and average them
Very effective
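A minimal bagging sketch along these lines, using unpruned scikit-learn decision trees (the dataset and B = 100 are illustrative):

```python
# Bootstrap aggregating: train B unpruned trees on bootstrap samples
# and average their predicted class probabilities.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
rng = np.random.default_rng(0)
B = 100
probas = []
for b in range(B):
    idx = rng.integers(0, len(X), size=len(X))   # bootstrap sample (with replacement)
    tree = DecisionTreeClassifier()              # grown fully, no pruning
    tree.fit(X[idx], y[idx])
    probas.append(tree.predict_proba(X))
y_hat = np.mean(probas, axis=0).argmax(axis=1)   # average, then take the majority class
print("training accuracy of the bagged ensemble:", (y_hat == y).mean())
```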
[Figure: example data in the (X1, X2) plane]
Example
Example
Example
[Figure: results as a function of the number of trees]
Out-of-Bag Error Estimation
Bagging
Bagging - issues
Each tree is identically distributed (i.d.)
🡺 the expectation of the average of B such trees is the same as the expectation of any one of them
i.d., but not i.i.d.
Bagging - issues
An average of B i.i.d. random variables, each with variance σ², has variance σ²/B
If the variables are only i.d. (identically distributed but not independent) with positive pairwise correlation ρ, the variance of the average is: ρσ² + (1 − ρ)σ²/B
As B increases the second term disappears but the first term remains
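A quick numeric check of this formula (the values of σ² and ρ are illustrative):

```python
# Variance of the average of B identically distributed trees with
# pairwise correlation rho:  rho*sigma2 + (1 - rho)*sigma2 / B.
sigma2, rho = 1.0, 0.6
for B in (1, 10, 100, 10_000):
    var_avg = rho * sigma2 + (1 - rho) * sigma2 / B
    print(B, round(var_avg, 4))   # the second term vanishes; the floor rho*sigma2 remains
```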
Why does bagging generate correlated trees?
Bagging - issues
Suppose that there is one very strong predictor in the data set, along with a number of other moderately strong predictors.
Then all bagged trees will select the strong predictor at the top of the tree and therefore all trees will look similar.
How do we avoid this?
RANDOM FORESTS
Random Forests
As in bagging, we build a number of decision trees on bootstrapped training samples. Each time a split in a tree is considered, a random sample of m predictors is chosen as split candidates from the full set of p predictors.
Note that if m = p, then this is bagging.
Random Forests
Random forests are popular. Leo Breiman and Adele Cutler maintain a random forest website where the software is freely available, and of course it is included in every ML/STAT package
http://www.stat.berkeley.edu/~breiman/RandomForests/
Random Forests Algorithm
For b = 1 to B:
(a) Draw a bootstrap sample Z∗ of size N from the training data.
(b) Grow a random-forest tree on the bootstrapped data, by recursively repeating the following steps for each terminal node of the tree, until the minimum node size n_min is reached:
i. Select m variables at random from the p variables.
ii. Pick the best variable/split-point among the m.
iii. Split the node into two daughter nodes.
Output the ensemble of trees.
To make a prediction at a new point x we do:
For regression: average the results
For classification: majority vote
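A minimal version of this procedure using scikit-learn's RandomForestClassifier (the dataset and parameter values are illustrative):

```python
# Random forest: B bootstrapped trees, each split restricted to a random
# subset of m candidate predictors; predictions by majority vote.
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=25, n_informative=5, random_state=0)
rf = RandomForestClassifier(
    n_estimators=500,        # B trees
    max_features="sqrt",     # m = sqrt(p) candidate predictors per split
    min_samples_leaf=1,      # n_min, the minimum node size
    random_state=0,
)
rf.fit(X, y)
print(rf.predict(X[:5]))     # majority vote over the 500 trees
```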
Random Forests Tuning
The inventors make the following recommendations:
For classification, the default value for m is ⌊√p⌋ and the minimum node size is one.
For regression, the default value for m is ⌊p/3⌋ and the minimum node size is five.
In practice the best values for these parameters will depend on the problem, and they should be treated as tuning parameters.
As with bagging, we can use the OOB error, so a random forest can be fit in one sequence, with cross-validation effectively performed along the way. Once the OOB error stabilizes, the training can be terminated.
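In scikit-learn, for example, the OOB estimate can be requested directly when fitting (the dataset and parameters are illustrative):

```python
# OOB error: each tree is evaluated on the training points left out of
# its bootstrap sample, giving a built-in validation estimate.
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=25, n_informative=5, random_state=0)
rf = RandomForestClassifier(n_estimators=500, max_features="sqrt",
                            oob_score=True, random_state=0)
rf.fit(X, y)
print("OOB accuracy:", rf.oob_score_)   # 1 - OOB error
```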
Example
Example
Use random forests to predict cancer type based on the 500 genes that have the largest variance in the training set.
Null choice (Normal)
Random Forests Issues
When the number of variables is large, but the fraction of relevant variables is small, random forests are likely to perform poorly when m is small
Why?
Because:
At each split the chance can be small that the relevant variables will be selected
For example, with 3 relevant and 100 not-so-relevant variables, the probability that any of the relevant variables is selected at a given split is only about 0.25
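A quick check of this figure, assuming m ≈ √p ≈ 10 candidate variables are drawn at each split:

```python
# P(at least one of the 3 relevant variables is among the m candidates)
# when m variables are drawn without replacement from p = 103 variables.
from math import comb

p, relevant, m = 103, 3, 10
p_none = comb(p - relevant, m) / comb(p, m)  # all m candidates are irrelevant
print(round(1 - p_none, 3))                  # -> 0.266, roughly the ~0.25 quoted above
```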
Can RF overfit?
Random forests “cannot overfit” the data with respect to the number of trees.
Why?
Increasing the number of trees B does not increase the flexibility of the model