Data Mining_Anoop Chaturvedi

Swayam Prabha

Course Title

Multivariate Data Mining- Methods and Applications

Lecture 36

Committee Machine and Random Forests

Anoop Chaturvedi

Department of Statistics, University of Allahabad

Prayagraj (India)

Slides can be downloaded from https://sites.google.com/view/anoopchaturvedi/swayam-prabha

Committee Machine or Ensemble Learning

Objective: To lower the generalization error of a learning algorithm, that is, to improve its accuracy.

Multiple models are combined to improve predictive performance over any individual model.

Generalization error ⇒ A measure of how well a machine learning model performs on unseen data.

Approach ⇒ Bagging and Boosting

Exploit the presence of instability to create a more accurate learning method (predictor or classifier).
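
Why combining models can beat any single model is easiest to see in an idealized case. The sketch below assumes m classifiers that err independently, each correct with probability p, combined by majority vote. The independence assumption and the function name `majority_vote_accuracy` are illustrative and do not come from the slides; real base learners are correlated, so the gain is smaller in practice.

```python
from math import comb


def majority_vote_accuracy(p: float, m: int) -> float:
    """Probability that a majority of m independent voters is correct.

    Assumes each voter is correct with probability p, independently of
    the others (an idealization), and that m is odd so no ties occur.
    """
    if m % 2 == 0:
        raise ValueError("use an odd number of voters to avoid ties")
    need = m // 2 + 1  # smallest number of correct votes that wins
    return sum(comb(m, k) * p**k * (1 - p) ** (m - k) for k in range(need, m + 1))


single = majority_vote_accuracy(0.6, 1)    # one classifier on its own
committee = majority_vote_accuracy(0.6, 3)  # three voters: 3(0.6^2)(0.4) + 0.6^3
```

Under these assumptions, three voters that are each right 60% of the time give a committee that is right 64.8% of the time, and the figure rises as more voters are added.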

Generate an ensemble of base predictors/classifiers by perturbing the learning set.

Combine them into a single predictor/classifier.

Bagging ⇒ Generates perturbations by random and independent drawings from the learning set. Focus is on variance reduction.

Boosting ⇒ The perturbation process is deterministic. Generates perturbations by successive re-weightings of the learning set. Current weights depend upon the misclassification history of the process. Focus is on bias reduction.
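
The two ways of perturbing the learning set can be contrasted in a few lines. This is a minimal sketch, not the lecture's own algorithm: `bootstrap_sample` and `reweight` are hypothetical helper names, and the re-weighting rule shown is an AdaBoost-type update, chosen here as one concrete example of a boosting scheme.

```python
import math
import random


def bootstrap_sample(learning_set, rng):
    """Bagging-style perturbation: n random draws with replacement.

    Each draw is independent of earlier draws and of earlier rounds.
    """
    n = len(learning_set)
    return [learning_set[rng.randrange(n)] for _ in range(n)]


def reweight(weights, misclassified):
    """Boosting-style perturbation: a deterministic re-weighting.

    AdaBoost-type update (an assumption made for illustration).
    `weights` must sum to 1; `misclassified[i]` is True if the current
    base classifier got case i wrong.
    """
    err = sum(w for w, m in zip(weights, misclassified) if m)
    if not 0.0 < err < 1.0:
        raise ValueError("weighted error must lie strictly between 0 and 1")
    alpha = 0.5 * math.log((1.0 - err) / err)
    # Raise the weight of misclassified cases, lower the rest.
    raw = [w * math.exp(alpha if m else -alpha) for w, m in zip(weights, misclassified)]
    total = sum(raw)
    return [w / total for w in raw]


rng = random.Random(0)
resample = bootstrap_sample(list(range(10)), rng)
new_weights = reweight([0.25] * 4, [True, False, False, False])
```

With four equally weighted cases and one misclassified, the update moves the misclassified case to weight 1/2 and each of the others to 1/6, so the next base classifier concentrates on the case that was missed.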

Random Forests

Bagging (bootstrap aggregation) ⇒ A technique for reducing the variance of an estimated prediction function.

Random forests ⇒ A modification of bagging that builds a large collection of de-correlated trees, and then averages them.

On many problems the performance of random forests is very similar to that of boosting, and they are simpler to train and tune.
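
Why de-correlation matters can be seen from the variance of an average. The sketch below uses the standard result for B identically distributed variables, each with variance sigma^2 and pairwise correlation rho: the average has variance rho*sigma^2 + (1 - rho)*sigma^2/B. The function name `variance_of_average` and the numbers used are illustrative assumptions, not values from the lecture.

```python
def variance_of_average(sigma2: float, rho: float, n_trees: int) -> float:
    """Variance of the mean of n_trees identically distributed predictions.

    sigma2 : variance of a single tree's prediction
    rho    : pairwise correlation between trees (assumed equal for all pairs)
    """
    return rho * sigma2 + (1.0 - rho) * sigma2 / n_trees


correlated = variance_of_average(1.0, 0.5, 10)    # 0.5 + 0.05
uncorrelated = variance_of_average(1.0, 0.0, 10)  # 0.1
many_trees = variance_of_average(1.0, 0.5, 10**6)  # stays just above 0.5
```

Adding trees shrinks only the second term, so with rho = 0.5 the variance never falls below 0.5 however many trees are averaged. Lowering rho is what removes that floor, which is the motivation for building de-correlated trees rather than simply bagging more of them.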
