LO 4.2.3.H
Learning Objective: Describe random forests.
Review:
One issue with bagging is that strong predictors tend to dominate every tree, making the bagged trees highly correlated. Unfortunately, the average of many highly correlated trees does not lead to a significant reduction in variance.
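To see why (a standard variance identity added here for clarity, not stated in the original notes): if B trees each have variance σ² and pairwise correlation ρ, the variance of their average is

ρσ² + ((1 − ρ)/B)σ²

Increasing B drives only the second term toward zero, so when ρ is high the first term ρσ² dominates and averaging more trees cannot reduce the variance much further.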
Random forests provide a small tweak to de-correlate the trees.
- We begin by bootstrapping the dataset, taking repeated samples with replacement from the (single) training data set. Unlike bagging, where every split in every tree may consider all p predictors, random forests allow each split to choose from only m randomly selected predictors (a common choice is m ≈ √p); see the sketch after this list.
- Therefore, in a random forest, many splits have no strong predictor among their m candidate predictors and must rely on weaker ones. This process helps de-correlate the trees, making the average of the resulting trees less variable and hence making the prediction/classification more reliable.
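To make the mechanism concrete, here is a minimal sketch using scikit-learn; the synthetic dataset and hyperparameter values such as n_estimators=200 are illustrative assumptions, not part of this objective. Setting max_features="sqrt" implements the m ≈ √p rule, so each split considers only a random subset of the predictors.

```python
# Minimal sketch of a random forest with m ≈ sqrt(p); the dataset and
# hyperparameter values below are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic classification problem with p = 16 predictors.
X, y = make_classification(n_samples=500, n_features=16, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# max_features="sqrt" restricts each split to a random subset of
# m ≈ sqrt(p) predictors, de-correlating the bootstrapped trees.
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)
rf.fit(X_train, y_train)
print("Test accuracy:", rf.score(X_test, y_test))
```

For comparison, setting max_features=None would let every split consider all p predictors, which recovers plain bagging of trees.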