1 of 55

Machine Learning Basics: 3. Ensemble Learning

Cong Li

Mar. 3rd ~ Mar. 24th, 2018

2 of 55

Outline

  • Ensemble Methods in Machine Learning
  • Bagging
  • Boosting
  • Dropout: an Ensemble View

3 of 55

Ensemble Methods in ML

  • Ensemble Methods in Machine Learning
    • Ensembles of Classifiers
    • Application
  • Bagging
  • Boosting
  • Dropout: an Ensemble View

4 of 55

Inductive Learning

  • Inductive Learning
    • Induce the feature-to-class mapping
    • Find an appropriate classifier
      • One that performs well on the training set
      • Chosen from a set of candidates (the hypothesis space)
  • Learning Process
    • Searching for the right classifier
      • Is opportunistic, and
      • Can be accidental in some cases

5 of 55

Uncertainty in Learning

  • Many Learning Methods
    • Sensitive to variance in the data, e.g.,
      • A few difficult-to-learn samples
    • Sensitive to noise in the data, e.g.,
      • A few irrelevant features
  • Uncertainty in Learning
    • Small variance/noise can lead to different learning results, i.e.,
      • Different classifiers

6 of 55

Different Classifiers (1)

  • Different Classifiers
    • Conduct classification over the same set of class labels
    • May use different inputs or have different parameters
    • May produce different outputs for a given example
  • Learning Different Classifiers
    • Use different training examples
    • Use different features

7 of 55

Different Classifiers (2)

  • Performance
    • No individual classifier is perfect
    • Complementary
      • Examples which are not correctly classified by one classifier may be correctly classified by the other classifiers
  • Potential Improvements?
    • Utilize the complementary property

8 of 55

Ensembles of Classifiers

  • Idea
    • Combine the classifiers to improve the performance
  • Ensembles of Classifiers
    • Combine the classification results from different classifiers to produce the final output
      • Unweighted voting
      • Weighted voting
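As a concrete illustration of the two voting schemes, here is a minimal Python sketch; the classifier objects are assumed to expose a predict method returning a class label (the names are illustrative, not from the slides):

    from collections import Counter

    def unweighted_vote(classifiers, x):
        """Majority vote: every classifier contributes one vote."""
        votes = Counter(clf.predict(x) for clf in classifiers)
        return votes.most_common(1)[0][0]

    def weighted_vote(classifiers, weights, x):
        """Weighted vote: each classifier's vote counts with its weight."""
        scores = {}
        for clf, w in zip(classifiers, weights):
            label = clf.predict(x)
            scores[label] = scores.get(label, 0.0) + w
        return max(scores, key=scores.get)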

9 of 55

Example: Weather Forecast

[Figure: weather forecast example — reality, the forecasts of five individual forecasters (1–5), and the combined forecast; X marks indicate wrong forecasts.]

10 of 55

Ensemble Learning

  • Ensemble Learning
    • A relatively new subfield of machine learning
    • Achieves state-of-the-art performance on many tasks
  • Central Issues in Ensemble Learning
    • How to create classifiers with complementary performances
    • How to conduct voting

11 of 55

Ensemble Methods in ML

  • Ensemble Methods in Machine Learning
    • Ensembles of Classifiers
    • Application
  • Bagging
  • Boosting
  • Dropout: an Ensemble View

12 of 55

Application: WSD (Pedersen 2000)

  • Ensembles of Classifiers with Different Features
    • Use different features in training and classification in each classifier
  • Ensembles of Naive Bayesian Classifiers for WSD
    • Use different context windows to create different naive Bayesian classifiers

13 of 55

Implementation

  • 81 Base Classifiers
    • Each defined by a context window (l, r)
    • Possible values for l and r: 0, 1, 2 (narrow); 3, 4, 5 (medium); 10, 25, 50 (wide)
  • 9 Selected Range Classifiers
    • For each range pair (e.g., (narrow, medium)), select the best of the 9 candidate classifiers (using a development set)
  • Combination
    • Unweighted voting of the 9 selected classifiers (a sketch follows below)
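A minimal sketch of this selection-and-voting scheme, assuming examples are triples (tokens, target_index, sense) and using scikit-learn's multinomial naive Bayes as the base learner (the helper names are illustrative, not from Pedersen 2000):

    from collections import Counter
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    # Context-window sizes from the slide, grouped into the three ranges.
    RANGES = {"narrow": [0, 1, 2], "medium": [3, 4, 5], "wide": [10, 25, 50]}

    def window_text(tokens, target, l, r):
        """Words within l positions to the left and r to the right of the target word."""
        words = tokens[max(0, target - l):target] + tokens[target + 1:target + 1 + r]
        return " ".join(words) if words else "__empty__"  # keep the (0, 0) window non-empty

    def train_nb(data, l, r):
        """One naive Bayesian base classifier for a given (l, r) window."""
        vec = CountVectorizer()
        X = vec.fit_transform(window_text(t, i, l, r) for t, i, _ in data)
        return vec, MultinomialNB().fit(X, [s for _, _, s in data])

    def dev_accuracy(model, data, l, r):
        vec, clf = model
        X = vec.transform(window_text(t, i, l, r) for t, i, _ in data)
        return clf.score(X, [s for _, _, s in data])

    def build_ensemble(train, dev):
        """For each of the 9 range pairs, keep the best of its 9 (l, r) candidates."""
        ensemble = []
        for lvals in RANGES.values():
            for rvals in RANGES.values():
                candidates = [(train_nb(train, l, r), l, r) for l in lvals for r in rvals]
                ensemble.append(max(candidates,
                                    key=lambda c: dev_accuracy(c[0], dev, c[1], c[2])))
        return ensemble

    def classify(ensemble, tokens, target):
        """Unweighted vote of the 9 selected classifiers."""
        votes = Counter()
        for (vec, clf), l, r in ensemble:
            votes[clf.predict(vec.transform([window_text(tokens, target, l, r)]))[0]] += 1
        return votes.most_common(1)[0][0]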

14 of 55

WSD Results

  • Benchmark: Interest
    • Six senses
    • 2368 examples for training and testing
  • Results
    • Ensembles of naive Bayesian classifiers: 89% (Pedersen 2000)
    • The best performance reported at the time

15 of 55

Outline

  • Ensemble Methods in Machine Learning
  • Bagging
  • Boosting
  • Dropout: an Ensemble View

16 of 55

Bagging

  • An Important Strategy for Ensemble Learning
    • Create different training sets
  • Bootstrap AGGregatING
    • Take repeated bootstrap samples to create a sequence of training sets
    • Train classifiers using the training sets
    • Classification by majority voting

17 of 55

Replicating Data Sets

  • Original Training Set
    • {(x(1), y(1)), (x(2), y(2)), …, (x(m), y(m))}
  • Sample with Replacement
    • Each time, randomly draw m examples according to the uniform distribution over the original training set
    • Some examples may appear more than once, others not at all
    • Each replicated set is used to train one classifier (see the sketch below)
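The whole bagging procedure then fits in a few lines. The sketch below uses scikit-learn decision trees as the unstable base classifier (echoing the C4.5 experiments) and expects X and y as NumPy arrays; it is illustrative rather than the experimental code:

    import numpy as np
    from collections import Counter
    from sklearn.tree import DecisionTreeClassifier

    def bagging_fit(X, y, n_classifiers=100, seed=0):
        """Train n_classifiers trees, each on a bootstrap replicate of (X, y)."""
        rng = np.random.default_rng(seed)
        m = len(X)
        ensemble = []
        for _ in range(n_classifiers):
            idx = rng.integers(0, m, size=m)   # draw m indices with replacement
            ensemble.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
        return ensemble

    def bagging_predict(ensemble, x):
        """Classify one example by unweighted majority vote."""
        votes = Counter(clf.predict([x])[0] for clf in ensemble)
        return votes.most_common(1)[0][0]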

18 of 55

Performance

  • Data Set
    • 27 data sets from UCI ML Repository
  • Methods for Comparison
    • Decision tree classifier: C4.5
    • Bagging: ensembles of 100 C4.5 classifiers

19 of 55

Results (Freund and Schapire 1996)

[Figure: error rate of bagged C4.5 vs. error rate of C4.5 on the 27 UCI data sets; each point compares the two methods on one data set.]

20 of 55

Why Bagging Works? (1)

  • Stability in Training
    • Training: construct classifier f from data set D
    • Stability: small changes to D result in small changes to f
  • Unstable Classifiers
    • Small changes to D result in large changes to f
    • Typical unstable classifier: the decision tree

21 of 55

Why Bagging Works? (2)

  • Fluctuation
    • When learning unstable classifiers, fluctuations play a role
    • The classifier constructed from the data may sometimes be poor
  • Explanations
    • Breiman’s explanation (Breiman 1996)
    • My explanation
      • Different in appearance
      • Might be the same as Breiman’s in nature

22 of 55

Why Bagging Works? (3)

  • Bayes Optimal Classifier
    • Given training data D

    • Theoretically performs the best

  • Why Bagging Works?
    • Re-sampling the training data and training classifiers approximates the Bayes optimal classification (Monte-Carlo sampling of P(f|D) and averaging of P(y|x, f); see the sketch below)
    • Requires unstable classifiers
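One way to make this view concrete (a sketch of the interpretation, not a formula taken from the slides): the Bayes optimal prediction averages over classifiers weighted by their posterior given D, and bagging replaces that average with a Monte-Carlo estimate using classifiers trained on bootstrap replicates of D:

    \hat{y}(x) = \arg\max_{y} P(y \mid x, D)
               = \arg\max_{y} \sum_{f} P(y \mid x, f)\, P(f \mid D)
               \approx \arg\max_{y} \frac{1}{N} \sum_{i=1}^{N} P(y \mid x, f_i),
    \qquad f_i \text{ trained on bootstrap replicate } D_i \text{ of } D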

23 of 55

Outline

  • Ensemble Methods in Machine Learning
  • Bagging
  • Boosting
  • Dropout: an Ensemble View

24 of 55

Boosting

  • Ensemble Methods in Machine Learning
  • Bagging
  • Boosting
    • Basic Idea
    • AdaBoost Algorithm
    • Performance
    • Why AdaBoost Works
  • Dropout: an Ensemble View

25 of 55

Strong and Weak Learners

  • Strong (PAC) Learner
    • Take labeled data for training
    • Produce a classifier which can be arbitrarily accurate
    • Objective of machine learning
  • Weak (PAC) Learner
    • Take labeled data for training
    • Produce a classifier which is more accurate than random guessing

26 of 55

Boosting

  • Learners
    • Strong learners are very difficult to construct
    • Constructing weak learners is relatively easy
  • Strategy
    • Derive a strong learner from weak learners
    • Boost the weak classifiers into a strong classifier

27 of 55

Construct Weak Classifiers

  • Using Different Data Distribution
    • Start with uniform weighting
    • During each step of learning
      • Increase weights of the examples which are not correctly learned by the weak learner
      • Decrease weights of the examples which are correctly learned by the weak learner
  • Idea
    • Focus on difficult examples which are not correctly classified in the previous steps

28 of 55

Combine Weak Classifiers

  • Weighted Voting
    • Construct strong classifier by weighted voting of the weak classifiers
  • Idea
    • Better weak classifier gets a larger weight
    • Iteratively add weak classifiers
      • Increase accuracy of the combined classifier through minimization of a cost function

29 of 55

Example

[Figure: toy example — successive rounds of training on reweighted examples produce weak classifiers, which are then combined into the final classifier.]

30 of 55

Boosting

  • Ensemble Methods in Machine Learning
  • Bagging
  • Boosting
    • Basic Idea
    • AdaBoost Algorithm
    • Performance
    • Why AdaBoost Works
  • Dropout: an Ensemble View

31 of 55

AdaBoost: Algorithm Basics

  • Training Data
    • D = {(x(1), y(1)), (x(2), y(2)), …, (x(m), y(m))}, with y(i) ∈ {−1, 1}

  • Some Notation (standard AdaBoost definitions)
    • Current iteration: t = 1, …, T
    • Weight of training examples: w_t(i), with w_1(i) = 1/m
    • Weak classifier: h_t(x) ∈ {−1, 1}
    • Training error of weak classifier: ε_t = Σ_i w_t(i) · 1[h_t(x(i)) ≠ y(i)]
    • An intermediate ratio: β_t = ε_t / (1 − ε_t)
    • Weight of weak classifier: α_t = (1/2) ln(1/β_t) = (1/2) ln((1 − ε_t)/ε_t)

32 of 55

AdaBoost: Algorithm

  • Initialize: w_1(i) = 1/m for all i
  • For t = 1, …, T
    • Train weak classifier h_t on the training data weighted by w_t
    • Compute ε_t, β_t and α_t as defined above
    • Update weights: w_{t+1}(i) ∝ w_t(i) · exp(−α_t · y(i) · h_t(x(i))), normalized to sum to 1
  • Output the weighted vote of h_1, …, h_T (see “AdaBoost: Final”)

33 of 55

AdaBoost.M1: Detail Calculation

  • Calculating the Training Error
    • ε_t = Σ_{i=1}^{m} w_t(i) · 1[h_t(x(i)) ≠ y(i)]
      • The weighted fraction of training examples that h_t misclassifies
  • Calculating the Weight of the Weak Classifier
    • α_t = (1/2) ln((1 − ε_t)/ε_t) = (1/2) ln(1/β_t)
      • Positive as long as ε_t < 1/2, and larger for more accurate weak classifiers

34 of 55

AdaBoost: Final

  • Output
    • f(x) = Σ_{t=1}^{T} α_t h_t(x),   H(x) = sign(f(x))
  • Margin Classifier
    • Margin in majority-vote classifiers: margin(x, y) = y · Σ_t α_t h_t(x) / Σ_t α_t ∈ [−1, 1]
    • Related to the generalization error
    • AdaBoost optimizes the margins (a code sketch of the full algorithm follows below)
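Putting the pieces together, here is a minimal Python sketch of binary AdaBoost with decision stumps as the weak learner (an illustrative implementation of the standard algorithm, not code from the slides); X is an (m, d) NumPy array and y an array of labels in {−1, +1}:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def adaboost_fit(X, y, T=50):
        """Return T weak classifiers and their voting weights alpha_t."""
        m = len(X)
        w = np.full(m, 1.0 / m)                  # example weights w_1(i) = 1/m
        classifiers, alphas = [], []
        for _ in range(T):
            stump = DecisionTreeClassifier(max_depth=1)
            stump.fit(X, y, sample_weight=w)     # weak learner trained on weighted data
            pred = stump.predict(X)
            eps = np.clip(w[pred != y].sum(), 1e-10, 1 - 1e-10)  # weighted training error
            alpha = 0.5 * np.log((1 - eps) / eps)
            w = w * np.exp(-alpha * y * pred)    # up-weight misclassified examples
            w /= w.sum()                         # normalize the distribution
            classifiers.append(stump)
            alphas.append(alpha)
        return classifiers, alphas

    def adaboost_predict(classifiers, alphas, X):
        """H(x) = sign(sum_t alpha_t * h_t(x))."""
        f = sum(a * clf.predict(X) for clf, a in zip(classifiers, alphas))
        return np.sign(f)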

35 of 55

Boosting

  • Ensemble Methods in Machine Learning
  • Bagging
  • Boosting
    • Basic Idea
    • AdaBoost Algorithm
    • Performance
    • Why AdaBoost Works
  • Dropout: an Ensemble View

36 of 55

Performance

  • Data Set
    • 27 data sets from UCI ML Repository
  • Methods for Comparison
    • Decision tree classifier: C4.5
    • Bagging: ensembles of 100 C4.5 classifiers
    • Boosting: AdaBoost using C4.5 as the weak learner

37 of 55

Results (Freund and Schapire 1996)

[Figure: error rate of boosted C4.5 vs. error rate of C4.5 on the 27 UCI data sets; each point compares the two methods on one data set.]

38 of 55

Results (Freund and Schapire 1996)

[Figure: error rate of boosted C4.5 vs. error rate of bagged C4.5 on the 27 UCI data sets; each point compares the two methods on one data set.]

39 of 55

Boosting

  • Ensemble Methods in Machine Learning
  • Bagging
  • Boosting
    • Basic Idea
    • AdaBoost Algorithm
    • Performance
    • Why AdaBoost Works
  • Dropout: an Ensemble View

40 of 55

Training Errors vs Test Errors

[Figure: training and test error curves of boosting on the ‘letter’ data set (Schapire et al. 1997). The training error drops to 0 at round 5, while the test error continues to drop after round 5 (from 8.4% to 3.1%).]

41 of 55

Occam’s Razor (1)

  • Contradiction with Occam’s Razor
    • More rounds -> more classifiers in the vote -> a more complicated classifier
    • Once the training error is 0, a more complicated classifier would be expected to perform worse
  • Why Does the Test Error Keep Decreasing?
    • Related to the margin: the confidence of classification
    • Large margin -> more confident classification

42 of 55

Occam’s Razor (2)

  • Approximating Set
    • Ensemble of N classifiers
    • Large confidence in classification -> perturbing the voting weights barely changes the predictions -> it is more likely that the approximating set contains rules with similar performance
  • Generalization Error
    • Rules in the approximating set are simple -> their generalization error is similar to the training error

43 of 55

AdaBoost: Generalization Error

  • Expected Risk
    • VC dimension of the base classifier: d
    • Training examples: m
    • With probability: 1−η

  • Margin Maximization in AdaBoost
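For reference, the bounds from Schapire et al. (1997) have roughly the following form, restated here from the cited paper (θ > 0 is a margin threshold, f the normalized vote, Pr the true and P̂r the empirical probability):

    \Pr\big[\, y f(x) \le 0 \,\big] \;\le\; \widehat{\Pr}\big[\, y f(x) \le \theta \,\big]
        + O\!\left( \sqrt{ \frac{d \log^2(m/d)}{m\,\theta^2} + \frac{\log(1/\eta)}{m} } \right)

    \widehat{\Pr}\big[\, y f(x) \le \theta \,\big] \;\le\; \prod_{t=1}^{T} 2 \sqrt{ \varepsilon_t^{\,1-\theta} (1 - \varepsilon_t)^{\,1+\theta} }

The second bound explains margin maximization: whenever ε_t < 1/2, each factor is below 1 for sufficiently small θ, so the bound on the fraction of low-margin examples decreases exponentially with the number of rounds.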

44 of 55

Margin Distribution Graph

[Figure: cumulative margin distributions — the fraction of training examples whose margin is at most x — after rounds 5, 100, and 1000.]

45 of 55

Outline

  • Ensemble Methods in Machine Learning
  • Bagging
  • Boosting
  • Dropout: an Ensemble View

46 of 55

Dropout: an Ensemble View

  • Ensemble Methods in Machine Learning
  • Bagging
  • Boosting
  • Dropout: an Ensemble View
    • Dropout
    • An Ensemble View

47 of 55

Dropout

  • Complicated Neural Network
    • Many layers of neurons, e.g.,
      • Multi-layer perceptrons
      • Convolutional neural networks
  • Dropout
    • Randomly disable some neurons in each training ‘batch’
      • Trains ‘thinned’ networks
    • Scale the weights at prediction time (see the sketch below)
    • Originally proposed as a heuristic
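A minimal NumPy sketch of the idea for a single layer (illustrative only; p_keep is the probability of keeping a unit, and the variant that instead rescales activations during training is also common in practice):

    import numpy as np

    rng = np.random.default_rng(0)

    def layer_forward_train(x, W, b, p_keep=0.5):
        """Training pass: randomly drop this layer's input units with probability 1 - p_keep."""
        mask = (rng.random(x.shape) < p_keep).astype(x.dtype)  # 1 = keep, 0 = drop
        x_thin = x * mask                                      # the 'thinned' activations
        return np.maximum(0.0, x_thin @ W + b), mask           # ReLU layer; mask reused in backprop

    def layer_forward_predict(x, W, b, p_keep=0.5):
        """Prediction pass: keep all units, but scale the weights by p_keep."""
        return np.maximum(0.0, x @ (p_keep * W) + b)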

48 of 55

Example: Standard Network

[Figure: a standard fully connected network with inputs x1, x2, x3, x4 and output y; each training step performs feed-forward prediction with the current parameters, followed by back-propagation of the error to adjust the parameters.]

49 of 55

Example: Dropout Network

[Figure: the same network during dropout training; in each batch some units are randomly dropped, and feed-forward prediction and back-propagation of the error are performed on the resulting ‘thinned’ network.]

50 of 55

Example: Final Network

[Figure: the final network used for prediction, with inputs x1, x2, x3, x4 and output y; all units are kept and the weights are scaled as described above.]

51 of 55

Dropout: an Ensemble View

  • Ensemble Methods in Machine Learning
  • Bagging
  • Boosting
  • Dropout: an Ensemble View
    • Dropout
    • An Ensemble View

52 of 55

Complicated Network

  • Complicated Network
    • Uncertainty in what to learn
      • What to encode in the network parameters
    • Tends to overfit
      • Captures ‘noise’ rather than ‘knowledge’
  • To Improve Performance
    • ‘Wash out’ the noise

53 of 55

Bayesian Approximation

  • Ensemble View of Dropout
    • Training with dropout effectively trains an ensemble of exponentially many ‘thinned’ networks that share weights
    • Predicting with the full network and scaled weights approximates averaging the predictions of these networks (see the sketch below)
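A sketch of this view (an interpretation along the lines of Srivastava et al. 2014, not a formula preserved from the slide): with shared weights θ and a binary dropout mask m over the units, dropout implicitly trains the whole family of thinned networks θ ⊙ m, and prediction should average over them; the weight-scaling rule approximates that average with a single forward pass:

    p(y \mid x) \;\approx\; \sum_{m} p(m)\, p\big(y \mid x,\ \theta \odot m\big)
                \;\approx\; p\big(y \mid x,\ \mathbb{E}[m] \odot \theta\big)

Here E[m] scales each unit, and hence its outgoing weights, by its keep probability, which is exactly the scaling used at prediction time.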

54 of 55

References

  • L. Breiman (1996). Bagging Predictors. Machine Learning, 24(2), 123-140.
  • Y. Freund and R. Schapire (1996). Experiments with a New Boosting Algorithm. In Proc. ICML-1996, 148-156.
  • T. Pedersen (2000). A Simple Approach to Build Ensembles of Naive Bayesian Classifiers for Word Sense Disambiguation. In Proc. NAACL-2000, 63-69.
  • R. Schapire, Y. Freund, P. Bartlett, and W. Lee (1997). Boosting the Margin: A New Explanation for the Effectiveness of Voting Methods. In Proc. ICML-1997, 322-330.
  • N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever and R. Salakhutdinov (2014). Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research, 2014(15), 1929-1958.

55 of 55

The End