1 of 55

Machine Learning Basics: 3. Ensemble Learning

Cong Li

Mar. 3rd ~ Mar. 24th, 2018

2 of 55

Outline

  • Ensemble Methods in Machine Learning
  • Bagging
  • Boosting
  • Dropout: an Ensemble View

3 of 55

Ensemble Methods in ML

  • Ensemble Methods in Machine Learning
    • Ensembles of Classifiers
    • Application
  • Bagging
  • Boosting
  • Dropout: an Ensemble View

4 of 55

Inductive Learning

  • Inductive Learning
    • Induce the feature-to-class mapping
    • Find an appropriate classifier
      • One that performs well on the training set
      • Chosen from a set of candidates (the hypothesis space)
  • Learning Process
    • Searching for the right classifier
      • Is opportunistic, and
      • Can be accidental in some cases

5 of 55

Uncertainty in Learning

  • Many Learning Methods
    • Sensitive to variance in the data, e.g.,
      • A few difficult-to-learn samples
    • Sensitive to noise in the data, e.g.,
      • A few irrelevant features
  • Uncertainty in Learning
    • Small variance/noise can lead to different learning results, i.e.,
      • Different classifiers

6 of 55

Different Classifiers (1)

  • Different Classifiers
    • Conduct classification over the same set of class labels
    • May use different inputs or have different parameters
    • May produce different outputs for a given example
  • Learning Different Classifiers
    • Use different training examples
    • Use different features

7 of 55

Different Classifiers (2)

  • Performance
    • No individual classifier is perfect
    • Complementary
      • Examples which are not correctly classified by one classifier may be correctly classified by the other classifiers
  • Potential Improvements?
    • Utilize the complementary property

8 of 55

Ensembles of Classifiers

  • Idea
    • Combine the classifiers to improve the performance
  • Ensembles of Classifiers
    • Combine the classification results from different classifiers to produce the final output
      • Unweighted voting
      • Weighted voting
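As a concrete illustration of the two voting schemes, here is a minimal Python sketch; the classifier objects are assumed to expose a predict method returning a class label (the names are illustrative, not from the slides):

    from collections import Counter

    def unweighted_vote(classifiers, x):
        """Majority vote: every classifier contributes one vote."""
        votes = Counter(clf.predict(x) for clf in classifiers)
        return votes.most_common(1)[0][0]

    def weighted_vote(classifiers, weights, x):
        """Weighted vote: each classifier's vote counts with its weight."""
        scores = {}
        for clf, w in zip(classifiers, weights):
            label = clf.predict(x)
            scores[label] = scores.get(label, 0.0) + w
        return max(scores, key=scores.get)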

9 of 55

Example: Weather Forecast

[Figure: weather forecast example — reality, the forecasts of five individual forecasters (1–5), and the combined forecast; X marks indicate wrong forecasts.]

10 of 55

Ensemble Learning

  • Ensemble Learning
    • A relatively new subfield of machine learning
    • Achieves state-of-the-art performance on many tasks
  • Central Issues in Ensemble Learning
    • How to create classifiers with complementary performances
    • How to conduct voting

11 of 55

Ensemble Methods in ML

  • Ensemble Methods in Machine Learning
    • Ensembles of Classifiers
    • Application
  • Bagging
  • Boosting
  • Dropout: an Ensemble View

12 of 55

Application: WSD (Pedersen 2000)

  • Ensembles of Classifiers with Different Features
    • Use different features in training and classification in each classifier
  • Ensembles of Naive Bayesian Classifiers for WSD
    • Use different context windows to create different naive Bayesian classifiers

13 of 55

Implementation

  • 81 Base Classifiers
    • Each defined by a context window (l, r)
    • Possible values for l and r: 0, 1, 2 (narrow); 3, 4, 5 (medium); 10, 25, 50 (wide)
  • 9 Selected Range Classifiers
    • For each range pair (e.g., (narrow, medium)), select the best of the 9 candidate classifiers (using a development set)
  • Combination
    • Unweighted voting of the 9 selected classifiers (a sketch follows below)
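A minimal sketch of this selection-and-voting scheme, assuming examples are triples (tokens, target_index, sense) and using scikit-learn's multinomial naive Bayes as the base learner (the helper names are illustrative, not from Pedersen 2000):

    from collections import Counter
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    # Context-window sizes from the slide, grouped into the three ranges.
    RANGES = {"narrow": [0, 1, 2], "medium": [3, 4, 5], "wide": [10, 25, 50]}

    def window_text(tokens, target, l, r):
        """Words within l positions to the left and r to the right of the target word."""
        words = tokens[max(0, target - l):target] + tokens[target + 1:target + 1 + r]
        return " ".join(words) if words else "__empty__"  # keep the (0, 0) window non-empty

    def train_nb(data, l, r):
        """One naive Bayesian base classifier for a given (l, r) window."""
        vec = CountVectorizer()
        X = vec.fit_transform(window_text(t, i, l, r) for t, i, _ in data)
        return vec, MultinomialNB().fit(X, [s for _, _, s in data])

    def dev_accuracy(model, data, l, r):
        vec, clf = model
        X = vec.transform(window_text(t, i, l, r) for t, i, _ in data)
        return clf.score(X, [s for _, _, s in data])

    def build_ensemble(train, dev):
        """For each of the 9 range pairs, keep the best of its 9 (l, r) candidates."""
        ensemble = []
        for lvals in RANGES.values():
            for rvals in RANGES.values():
                candidates = [(train_nb(train, l, r), l, r) for l in lvals for r in rvals]
                ensemble.append(max(candidates,
                                    key=lambda c: dev_accuracy(c[0], dev, c[1], c[2])))
        return ensemble

    def classify(ensemble, tokens, target):
        """Unweighted vote of the 9 selected classifiers."""
        votes = Counter()
        for (vec, clf), l, r in ensemble:
            votes[clf.predict(vec.transform([window_text(tokens, target, l, r)]))[0]] += 1
        return votes.most_common(1)[0][0]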

14 of 55

WSD Results

  • Benchmark: Interest
    • Six senses
    • 2368 examples for training and testing
  • Results
    • Ensembles of naive Bayesian classifiers: 89% (Pedersen 2000)
    • The best performance reported at the time

15 of 55

Outline

  • Ensemble Methods in Machine Learning
  • Bagging
  • Boosting
  • Dropout: an Ensemble View

16 of 55

Bagging

  • An Important Strategy for Ensemble Learning
    • Create different training sets
  • Bootstrap AGGregatING
    • Take repeated bootstrap samples to create a sequence of training sets
    • Train classifiers using the training sets
    • Classification by majority voting

17 of 55

Replicating Data Sets

  • Original Training Set
    • {(x(1), y(1)), (x(2), y(2)), …, (x(m), y(m))}
  • Sample with Replacement
    • Each time, randomly draw m examples according to the uniform distribution over the original training set
    • Some examples may appear more than once, others not at all
    • Each replicated set is used to train one classifier (see the sketch below)
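The whole bagging procedure then fits in a few lines. The sketch below uses scikit-learn decision trees as the unstable base classifier (echoing the C4.5 experiments) and expects X and y as NumPy arrays; it is illustrative rather than the experimental code:

    import numpy as np
    from collections import Counter
    from sklearn.tree import DecisionTreeClassifier

    def bagging_fit(X, y, n_classifiers=100, seed=0):
        """Train n_classifiers trees, each on a bootstrap replicate of (X, y)."""
        rng = np.random.default_rng(seed)
        m = len(X)
        ensemble = []
        for _ in range(n_classifiers):
            idx = rng.integers(0, m, size=m)   # draw m indices with replacement
            ensemble.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
        return ensemble

    def bagging_predict(ensemble, x):
        """Classify one example by unweighted majority vote."""
        votes = Counter(clf.predict([x])[0] for clf in ensemble)
        return votes.most_common(1)[0][0]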

18 of 55

Performance

  • Data Set
    • 27 data sets from UCI ML Repository
  • Methods for Comparison
    • Decision tree classifier: C4.5
    • Bagging: ensembles of 100 C4.5 classifiers

19 of 55

Results (Freund and Schapire 1996)

[Figure: error rate of bagged C4.5 vs. error rate of C4.5 on the 27 UCI data sets; each point compares the two methods on one data set.]

20 of 55

Why Bagging Works? (1)

  • Stability in Training
    • Training: construct classifier f from data set D
    • Stability: small changes to D result in small changes to f
  • Unstable Classifiers
    • Small changes to D result in large changes to f
    • Typical unstable classifier: the decision tree

21 of 55

Why Bagging Works? (2)

  • Fluctuation
    • When learning unstable classifiers, fluctuations play a role
    • The classifier constructed from the data may sometimes be poor
  • Explanations
    • Breiman’s explanation (Breiman 1996)
    • My explanation
      • Different in appearance
      • Might be the same as Breiman’s in nature

22 of 55

Why Bagging Works? (3)

  • Bayes Optimal Classifier
    • Given training data D

    • Theoretically performs the best

  • Why Bagging Works?
    • Re-sampling the training data and training classifiers approximates the Bayes optimal classification (Monte-Carlo sampling of P(f|D) and averaging of P(y|x, f); see the sketch below)
    • Requires unstable classifiers
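One way to make this view concrete (a sketch of the interpretation, not a formula taken from the slides): the Bayes optimal prediction averages over classifiers weighted by their posterior given D, and bagging replaces that average with a Monte-Carlo estimate using classifiers trained on bootstrap replicates of D:

    \hat{y}(x) = \arg\max_{y} P(y \mid x, D)
               = \arg\max_{y} \sum_{f} P(y \mid x, f)\, P(f \mid D)
               \approx \arg\max_{y} \frac{1}{N} \sum_{i=1}^{N} P(y \mid x, f_i),
    \qquad f_i \text{ trained on bootstrap replicate } D_i \text{ of } D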

23 of 55

Outline

  • Ensemble Methods in Machine Learning
  • Bagging
  • Boosting
  • Dropout: an Ensemble View

24 of 55

Boosting

  • Ensemble Methods in Machine Learning
  • Bagging
  • Boosting
    • Basic Idea
    • AdaBoost Algorithm
    • Performance
    • Why AdaBoost Works
  • Dropout: an Ensemble View

25 of 55

Strong and Weak Learners

  • Strong (PAC) Learner
    • Take labeled data for training
    • Produce a classifier which can be arbitrarily accurate
    • Objective of machine learning
  • Weak (PAC) Learner
    • Take labeled data for training
    • Produce a classifier which is more accurate than random guessing

26 of 55

Boosting

  • Learners
    • Strong learners are very difficult to construct
    • Constructing weak learners is relatively easy
  • Strategy
    • Derive a strong learner from weak learners
    • Boost the weak classifiers into a strong classifier

27 of 55

Construct Weak Classifiers

  • Using Different Data Distribution
    • Start with uniform weighting
    • During each step of learning
      • Increase weights of the examples which are not correctly learned by the weak learner
      • Decrease weights of the examples which are correctly learned by the weak learner
  • Idea
    • Focus on difficult examples which are not correctly classified in the previous steps

28 of 55

Combine Weak Classifiers

  • Weighted Voting
    • Construct strong classifier by weighted voting of the weak classifiers
  • Idea
    • Better weak classifier gets a larger weight
    • Iteratively add weak classifiers
      • Increase accuracy of the combined classifier through minimization of a cost function

29 of 55

Example

[Figure: toy example — successive rounds of training on reweighted examples produce weak classifiers, which are then combined into the final classifier.]

30 of 55

Boosting

  • Ensemble Methods in Machine Learning
  • Bagging
  • Boosting
    • Basic Idea
    • AdaBoost Algorithm
    • Performance
    • Why AdaBoost Works
  • Dropout: an Ensemble View

31 of 55

AdaBoost: Algorithm Basics

  • Training Data
    • D = {(x(1), y(1)), (x(2), y(2)), …, (x(m), y(m))}, with y(i) ∈ {−1, 1}

  • Some Notation (standard AdaBoost definitions)
    • Current iteration: t = 1, …, T
    • Weight of training examples: w_t(i), with w_1(i) = 1/m
    • Weak classifier: h_t(x) ∈ {−1, 1}
    • Training error of weak classifier: ε_t = Σ_i w_t(i) · 1[h_t(x(i)) ≠ y(i)]
    • An intermediate ratio: β_t = ε_t / (1 − ε_t)
    • Weight of weak classifier: α_t = (1/2) ln(1/β_t) = (1/2) ln((1 − ε_t)/ε_t)

32 of 55

AdaBoost: Algorithm

  • Initialize: w_1(i) = 1/m for all i
  • For t = 1, …, T
    • Train weak classifier h_t on the training data weighted by w_t
    • Compute ε_t, β_t and α_t as defined above
    • Update weights: w_{t+1}(i) ∝ w_t(i) · exp(−α_t · y(i) · h_t(x(i))), normalized to sum to 1
  • Output the weighted vote of h_1, …, h_T (see “AdaBoost: Final”)

33 of 55

AdaBoost.M1: Detail Calculation

  • Calculating the Training Error
    • ε_t = Σ_{i=1}^{m} w_t(i) · 1[h_t(x(i)) ≠ y(i)]
      • The weighted fraction of training examples that h_t misclassifies
  • Calculating the Weight of the Weak Classifier
    • α_t = (1/2) ln((1 − ε_t)/ε_t) = (1/2) ln(1/β_t)
      • Positive as long as ε_t < 1/2, and larger for more accurate weak classifiers

34 of 55

AdaBoost: Final

  • Output
    • f(x) = Σ_{t=1}^{T} α_t h_t(x),   H(x) = sign(f(x))
  • Margin Classifier
    • Margin in majority-vote classifiers: margin(x, y) = y · Σ_t α_t h_t(x) / Σ_t α_t ∈ [−1, 1]
    • Related to the generalization error
    • AdaBoost optimizes the margins (a code sketch of the full algorithm follows below)
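Putting the pieces together, here is a minimal Python sketch of binary AdaBoost with decision stumps as the weak learner (an illustrative implementation of the standard algorithm, not code from the slides); X is an (m, d) NumPy array and y an array of labels in {−1, +1}:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def adaboost_fit(X, y, T=50):
        """Return T weak classifiers and their voting weights alpha_t."""
        m = len(X)
        w = np.full(m, 1.0 / m)                  # example weights w_1(i) = 1/m
        classifiers, alphas = [], []
        for _ in range(T):
            stump = DecisionTreeClassifier(max_depth=1)
            stump.fit(X, y, sample_weight=w)     # weak learner trained on weighted data
            pred = stump.predict(X)
            eps = np.clip(w[pred != y].sum(), 1e-10, 1 - 1e-10)  # weighted training error
            alpha = 0.5 * np.log((1 - eps) / eps)
            w = w * np.exp(-alpha * y * pred)    # up-weight misclassified examples
            w /= w.sum()                         # normalize the distribution
            classifiers.append(stump)
            alphas.append(alpha)
        return classifiers, alphas

    def adaboost_predict(classifiers, alphas, X):
        """H(x) = sign(sum_t alpha_t * h_t(x))."""
        f = sum(a * clf.predict(X) for clf, a in zip(classifiers, alphas))
        return np.sign(f)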

35 of 55

Boosting

  • Ensemble Methods in Machine Learning
  • Bagging
  • Boosting
    • Basic Idea
    • AdaBoost Algorithm
    • Performance
    • Why AdaBoost Works
  • Dropout: an Ensemble View

36 of 55

Performance

  • Data Set
    • 27 data sets from UCI ML Repository
  • Methods for Comparison
    • Decision tree classifier: C4.5
    • Bagging: ensembles of 100 C4.5 classifiers
    • Boosting: AdaBoost using C4.5 as the weak learner

37 of 55

Results (Freund and Schapire 1996)

[Figure: error rate of boosted C4.5 vs. error rate of C4.5 on the 27 UCI data sets; each point compares the two methods on one data set.]

38 of 55

Results (Freund and Schapire 1996)

[Figure: error rate of boosted C4.5 vs. error rate of bagged C4.5 on the 27 UCI data sets; each point compares the two methods on one data set.]

39 of 55

Boosting

  • Ensemble Methods in Machine Learning
  • Bagging
  • Boosting
    • Basic Idea
    • AdaBoost Algorithm
    • Performance
    • Why AdaBoost Works
  • Dropout: an Ensemble View

40 of 55

Training Errors vs Test Errors

[Figure: training and test error curves of boosting on the ‘letter’ data set (Schapire et al. 1997). The training error drops to 0 at round 5, while the test error continues to drop after round 5 (from 8.4% to 3.1%).]

41 of 55

Occam’s Razor (1)

  • Contradiction with Occam’s Razor
    • More rounds -> more classifiers in the vote -> a more complicated classifier
    • Once the training error is 0, a more complicated classifier would be expected to perform worse
  • Why Does the Test Error Keep Decreasing?
    • Related to the margin: the confidence of classification
    • Large margin -> more confident classification

42 of 55

Occam’s Razor (2)

  • Approximating Set
    • Ensemble of N classifiers
    • Large confidence in classification -> perturbing the voting weights barely changes the predictions -> it is more likely that the approximating set contains rules with similar performance
  • Generalization Error
    • Rules in the approximating set are simple -> their generalization error is similar to the training error

43 of 55

AdaBoost: Generalization Error

  • Expected Risk
    • VC dimension of the base classifier: d
    • Training examples: m
    • With probability: 1−η

  • Margin Maximization in AdaBoost
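For reference, the bounds from Schapire et al. (1997) have roughly the following form, restated here from the cited paper (θ > 0 is a margin threshold, f the normalized vote, Pr the true and P̂r the empirical probability):

    \Pr\big[\, y f(x) \le 0 \,\big] \;\le\; \widehat{\Pr}\big[\, y f(x) \le \theta \,\big]
        + O\!\left( \sqrt{ \frac{d \log^2(m/d)}{m\,\theta^2} + \frac{\log(1/\eta)}{m} } \right)

    \widehat{\Pr}\big[\, y f(x) \le \theta \,\big] \;\le\; \prod_{t=1}^{T} 2 \sqrt{ \varepsilon_t^{\,1-\theta} (1 - \varepsilon_t)^{\,1+\theta} }

The second bound explains margin maximization: whenever ε_t < 1/2, each factor is below 1 for sufficiently small θ, so the bound on the fraction of low-margin examples decreases exponentially with the number of rounds.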

44 of 55

Margin Distribution Graph

[Figure: cumulative margin distributions — the fraction of training examples whose margin is at most x — after rounds 5, 100, and 1000.]

45 of 55

Outline

  • Ensemble Methods in Machine Learning
  • Bagging
  • Boosting
  • Dropout: an Ensemble View

46 of 55

Dropout: an Ensemble View

  • Ensemble Methods in Machine Learning
  • Bagging
  • Boosting
  • Dropout: an Ensemble View
    • Dropout
    • An Ensemble View

47 of 55

Dropout

  • Complicated Neural Network
    • Many layers of neurons, e.g.,
      • Multi-layer perceptrons
      • Convolutional neural networks
  • Dropout
    • Randomly disable some neurons in each training ‘batch’
      • Trains ‘thinned’ networks
    • Scale the weights at prediction time (see the sketch below)
    • Originally proposed as a heuristic
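A minimal NumPy sketch of the idea for a single layer (illustrative only; p_keep is the probability of keeping a unit, and the variant that instead rescales activations during training is also common in practice):

    import numpy as np

    rng = np.random.default_rng(0)

    def layer_forward_train(x, W, b, p_keep=0.5):
        """Training pass: randomly drop this layer's input units with probability 1 - p_keep."""
        mask = (rng.random(x.shape) < p_keep).astype(x.dtype)  # 1 = keep, 0 = drop
        x_thin = x * mask                                      # the 'thinned' activations
        return np.maximum(0.0, x_thin @ W + b), mask           # ReLU layer; mask reused in backprop

    def layer_forward_predict(x, W, b, p_keep=0.5):
        """Prediction pass: keep all units, but scale the weights by p_keep."""
        return np.maximum(0.0, x @ (p_keep * W) + b)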

48 of 55

Example: Standard Network

[Figure: a standard fully connected network with inputs x1, x2, x3, x4 and output y; each training step performs feed-forward prediction with the current parameters, followed by back-propagation of the error to adjust the parameters.]

49 of 55

Example: Dropout Network

[Figure: the same network during dropout training; in each batch some units are randomly dropped, and feed-forward prediction and back-propagation of the error are performed on the resulting ‘thinned’ network.]

50 of 55

Example: Final Network

[Figure: the final network used for prediction, with inputs x1, x2, x3, x4 and output y; all units are kept and the weights are scaled as described above.]

51 of 55

Dropout: an Ensemble View

  • Ensemble Methods in Machine Learning
  • Bagging
  • Boosting
  • Dropout: an Ensemble View
    • Dropout
    • An Ensemble View

52 of 55

Complicated Network

  • Complicated Network
    • Uncertainty in what to learn
      • What to encode in the network parameters
    • Tends to overfit
      • Captures ‘noise’ rather than ‘knowledge’
  • To Improve Performance
    • ‘Wash out’ the noise

53 of 55

Bayesian Approximation

  • Ensemble View of Dropout
    • Training with dropout effectively trains an ensemble of exponentially many ‘thinned’ networks that share weights
    • Predicting with the full network and scaled weights approximates averaging the predictions of these networks (see the sketch below)
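A sketch of this view (an interpretation along the lines of Srivastava et al. 2014, not a formula preserved from the slide): with shared weights θ and a binary dropout mask m over the units, dropout implicitly trains the whole family of thinned networks θ ⊙ m, and prediction should average over them; the weight-scaling rule approximates that average with a single forward pass:

    p(y \mid x) \;\approx\; \sum_{m} p(m)\, p\big(y \mid x,\ \theta \odot m\big)
                \;\approx\; p\big(y \mid x,\ \mathbb{E}[m] \odot \theta\big)

Here E[m] scales each unit, and hence its outgoing weights, by its keep probability, which is exactly the scaling used at prediction time.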

54 of 55

References

  • L. Breiman (1996). Bagging Predictors. Machine Learning, 24(2), 123-140.
  • Y. Freund and R. Schapire (1996). Experiments with a New Boosting Algorithm. In Proc. ICML-1996, 148-156.
  • T. Pedersen (2000). A Simple Approach to Build Ensembles of Naive Bayesian Classifiers for Word Sense Disambiguation. In Proc. NAACL-2000, 63-69.
  • R. Schapire, Y. Freund, P. Bartlett, and W. Lee (1997). Boosting the Margin: A New Explanation for the Effectiveness of Voting Methods. In Proc. ICML-1997, 322-330.
  • N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever and R. Salakhutdinov (2014). Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research, 2014(15), 1929-1958.

55 of 55

The End