XGBoost
John M. Aiken
Chen, Tianqi, and Carlos Guestrin. "XGBoost: A Scalable Tree Boosting System." Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016.
Introduction
Easy example
We want to predict a person’s age based on whether they play video games, whether they enjoy gardening, and whether they like wearing hats. Our objective is to minimize squared error.
PersonID | Age | LikesGardening | PlaysVideoGames | LikesHats |
1 | 13 | FALSE | TRUE | TRUE |
2 | 14 | FALSE | TRUE | FALSE |
3 | 15 | FALSE | TRUE | FALSE |
4 | 25 | TRUE | TRUE | TRUE |
5 | 35 | FALSE | TRUE | TRUE |
6 | 49 | TRUE | FALSE | FALSE |
7 | 68 | TRUE | TRUE | TRUE |
8 | 71 | TRUE | FALSE | FALSE |
9 | 73 | TRUE | FALSE | TRUE |
Feature | FALSE | TRUE |
LikesGardening | {13, 14, 15, 35} | {25, 49, 68, 71, 73} |
PlaysVideoGames | {49, 71, 73} | {13, 14, 15, 25, 35, 68} |
LikesHats | {14, 15, 49, 71} | {13, 25, 35, 68, 73} |
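To see where the leaf values in the upcoming trees come from, here is a minimal sketch of the split table above; the use of pandas is an assumption (the talk shows no code):

```python
import pandas as pd

# Toy data from the table above
df = pd.DataFrame({
    "Age": [13, 14, 15, 25, 35, 49, 68, 71, 73],
    "LikesGardening": [False, False, False, True, False, True, True, True, True],
    "PlaysVideoGames": [True, True, True, True, True, False, True, False, False],
    "LikesHats": [True, False, False, True, True, False, True, False, True],
})

# Mean age on each side of every candidate split
for feature in ["LikesGardening", "PlaysVideoGames", "LikesHats"]:
    print(feature, df.groupby(feature)["Age"].mean().round(2).to_dict())
# A split on LikesGardening gives means of 19.25 (False) and 57.2 (True);
# these become the leaf values of the first tree.
```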
Decision tree analysis
This tree ignores video games and hats
Predictions from first tree
Tree 1 splits on LikesGardening: the FALSE leaf predicts 19.25 (the mean of {13, 14, 15, 35}) and the TRUE leaf predicts 57.2 (the mean of {25, 49, 68, 71, 73}).
PersonID | Age | Tree1 Prediction | Tree1 Residual |
1 | 13 | 19.25 | -6.25 |
2 | 14 | 19.25 | -5.25 |
3 | 15 | 19.25 | -4.25 |
4 | 25 | 57.2 | -32.2 |
5 | 35 | 19.25 | 15.75 |
6 | 49 | 57.2 | -8.2 |
7 | 68 | 57.2 | 10.8 |
8 | 71 | 57.2 | 13.8 |
9 | 73 | 57.2 | 15.8 |
But now we can fit a new model on the residuals of the first model!
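A minimal sketch of that step, continuing the hypothetical df from the earlier sketch:

```python
# Tree 1: a single split on LikesGardening, predicting the mean age in each leaf
df["tree1_pred"] = df.groupby("LikesGardening")["Age"].transform("mean")
df["tree1_residual"] = df["Age"] - df["tree1_pred"]

# Tree 2: another one-split tree, this time fit to the residuals.
# Splitting on PlaysVideoGames and predicting the mean residual per leaf
# reproduces the values on the next slide (-3.567 and 7.133).
df["tree2_pred"] = df.groupby("PlaysVideoGames")["tree1_residual"].transform("mean")
print(df[["Age", "tree1_pred", "tree1_residual", "tree2_pred"]].round(3))
```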
Predictions from second tree
PersonID | Age | Tree2 Prediction |
1 | 13 | -3.567 |
2 | 14 | -3.567 |
3 | 15 | -3.567 |
4 | 25 | -3.567 |
5 | 35 | -3.567 |
6 | 49 | 7.133 |
7 | 68 | -3.567 |
8 | 71 | 7.133 |
9 | 73 | 7.133 |
Tree 2 splits on PlaysVideoGames: the TRUE leaf predicts -3.567 and the FALSE leaf predicts 7.133 (the mean residual in each leaf).
F(x) = Tree 1 (predicting age) + Tree 2 (predicting the residuals)
A first step in gradient boosting, but not there yet
PersonID | Age | Tree1 Prediction | Tree1 Residual | Tree2 Prediction | Combined Prediction | Final Residual |
1 | 13 | 19.25 | -6.25 | -3.567 | 15.68 | -2.683 |
2 | 14 | 19.25 | -5.25 | -3.567 | 15.68 | -1.683 |
3 | 15 | 19.25 | -4.25 | -3.567 | 15.68 | -0.6833 |
4 | 25 | 57.2 | -32.2 | -3.567 | 53.63 | -28.63 |
5 | 35 | 19.25 | 15.75 | -3.567 | 15.68 | 19.32 |
6 | 49 | 57.2 | -8.2 | 7.133 | 64.33 | -15.33 |
7 | 68 | 57.2 | 10.8 | -3.567 | 53.63 | 14.37 |
8 | 71 | 57.2 | 13.8 | 7.133 | 64.33 | 6.667 |
9 | 73 | 57.2 | 15.8 | 7.133 | 64.33 | 8.667 |
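Continuing the sketch, the combined model is just the sum of the two trees, and its squared error is lower than Tree 1's alone:

```python
df["combined_pred"] = df["tree1_pred"] + df["tree2_pred"]
df["final_residual"] = df["Age"] - df["combined_pred"]

print(f"MSE after tree 1: {(df['tree1_residual'] ** 2).mean():.1f}")
print(f"MSE after tree 2: {(df['final_residual'] ** 2).mean():.1f}")
```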
A more formalized algorithm
This process can be generalized as:
F_2(x) = F_1(x) + h_1(x), F_3(x) = F_2(x) + h_2(x), …, F_M(x) = F_{M-1}(x) + h_{M-1}(x)
And the model can be initialized with the average of the targets, the constant prediction that minimizes the MSE.
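A minimal sketch of this whole procedure for squared error, with scikit-learn stumps as the base learners (X and y are NumPy arrays; the library choice and hyperparameters are assumptions, not from the talk):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_squared_error(X, y, n_rounds=10, max_depth=1):
    """Plain gradient boosting for squared error: each new tree is fit to the residuals."""
    f1 = y.mean()                        # initialize with the average (minimizes the MSE)
    prediction = np.full(len(y), f1, dtype=float)
    trees = []
    for _ in range(n_rounds):
        residual = y - prediction        # what the current ensemble still gets wrong
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        prediction = prediction + tree.predict(X)   # F_m(x) = F_{m-1}(x) + h_{m-1}(x)
        trees.append(tree)
    return f1, trees
```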
But wait, there's more
This part is much more complicated than just “fit g(x) to r”
Actually, in gradient boosting we replace this step: instead of fitting h(x) to the residuals, we fit it to the negative gradient of any differentiable loss function.
Friedman, Jerome H. "Greedy Function Approximation: A Gradient Boosting Machine." Annals of Statistics (2001): 1189-1232.
For squared error the negative gradient of the loss is just the residual, so this is the same as before, but the gradient formulation is more general.
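In other words, only the "what do we fit the next tree to" step changes from loss to loss. A small sketch:

```python
import numpy as np

def pseudo_residuals(y, pred, loss="squared"):
    """Negative gradient of the loss w.r.t. the predictions: the next tree's target."""
    if loss == "squared":    # L = 0.5 * (y - pred)^2  ->  -dL/dpred = y - pred
        return y - pred
    if loss == "absolute":   # L = |y - pred|          ->  -dL/dpred = sign(y - pred)
        return np.sign(y - pred)
    raise ValueError(f"unknown loss: {loss}")
```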
At this point I found it helpful to actually write out F_M(x) for a small M; it is just the initial guess plus the sum of every fitted correction term h_m(x).
Note
Thus gradient boosting basically works with any base model you want. XGBoost, however (and gradient boosting implementations more generally), uses tree-based models.
XGBoost adds a bunch of stuff
XGBoost objective functions
XGBoost: regularized loss
Instead of only using the MSE, we can add a ridge-style (L2) regularization on the leaf weights, together with a penalty on the number of leaves.
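Written out, the objective from the Chen & Guestrin paper cited below is the training loss plus a penalty on each tree's number of leaves T and its leaf weights w:

```latex
\mathcal{L}(\phi) = \sum_i l(\hat{y}_i, y_i) + \sum_k \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert^2
```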
Chen, Tianqi, and Carlos Guestrin. "XGBoost: A Scalable Tree Boosting System." Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016.
XGBoost: shrinkage
Shrinkage scales newly added weights by a factor η after each step of tree boosting. Similar to a learning rate in stochastic optimization, shrinkage reduces the influence of each individual tree and leaves space for future trees to improve the model
F_m(x) = F_{m-1}(x) + η · h_{m-1}(x), with 0 < η < 1 (shrinkage)
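In the xgboost library this is the learning_rate parameter (alias eta); a minimal sketch with assumed hyperparameter values:

```python
import xgboost as xgb

# Each new tree's contribution is scaled by learning_rate (0 < eta < 1).
model = xgb.XGBRegressor(n_estimators=500, max_depth=3, learning_rate=0.1)
```

A smaller η typically needs more trees, but tends to generalize better.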
XGBoost: column subsampling
Features are randomly subsampled and provided to each tree for fitting.
This is common in random forests.
Additionally, XGBoost allows row subsampling, so each tree is fit on a subsample of the training rows instead of all of the data.
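In the library these show up as colsample_bytree and subsample; a minimal sketch with assumed values:

```python
import xgboost as xgb

model = xgb.XGBRegressor(
    n_estimators=500,
    max_depth=3,
    colsample_bytree=0.8,  # fraction of features (columns) sampled per tree
    subsample=0.8,         # fraction of training rows sampled per tree
)
```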
XGBoost: finding the split
Finding the optimum structure for a single leaf
If you assume the structure of the tree is fixed, then the optimum weight for each leaf and the resulting objective value are:
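With g_i and h_i the first and second derivatives of the loss at the previous iteration's prediction, and G_j, H_j their sums over the instances in leaf j, the paper's result is:

```latex
w_j^* = -\frac{G_j}{H_j + \lambda},
\qquad
\tilde{\mathcal{L}}^* = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T
```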
Calculating leaf weights
So the weight for leaf j in each tree is (minus) the ratio of the summed first derivatives of the loss, evaluated at the previous iteration's model, to the summed second derivatives from the previous iteration (plus the regularization term λ).
That is, if this leaf's prediction is much better than before, it should be weighted much more.
In the objective -½ Σ_j G_j^2 / (H_j + λ) + γT, the whole expression should be as small as possible: the bigger G_j^2 / (H_j + λ) is, the more value leaf j has, and the bigger γT is, the more leaves there are. (G_j appears squared because the optimal weight w_j = -G_j / (H_j + λ) has been substituted back into the objective.)
Thus you want the smallest number of leaves, each with the largest possible G_j^2 / (H_j + λ).
But this is only one leaf, and there are many possible tree structures (leaf combinations) to choose from.
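XGBoost therefore grows the tree greedily, scoring every candidate split by how much it improves the objective above. A minimal sketch of that gain calculation, assuming per-instance gradients g and hessians h as NumPy arrays:

```python
import numpy as np

def split_gain(g, h, left_mask, lam=1.0, gamma=0.0):
    """Gain of splitting one leaf into left/right children (structure-score improvement)."""
    def leaf_score(gs, hs):
        return gs.sum() ** 2 / (hs.sum() + lam)

    left = leaf_score(g[left_mask], h[left_mask])
    right = leaf_score(g[~left_mask], h[~left_mask])
    parent = leaf_score(g, h)
    # Improvement from the split, minus the complexity cost gamma of the extra leaf
    return 0.5 * (left + right - parent) - gamma
```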
XGBoost: sparse data
Missing data, frequent zeroes, and artifacts of feature engineering can all cause data to be sparse.
XGBoost learns a default direction at each node using only the non-missing entries; instances with a missing value are sent in that default direction. It computes the loss for each of the two possible default directions and keeps the better one (a node can only have two default directions: left or right).
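A minimal sketch of the idea (not the paper's optimized algorithm): score the split with the missing rows routed left, then routed right, and keep whichever direction is better. Inputs are NumPy arrays of gradients, hessians, and boolean masks:

```python
def best_default_direction(g, h, goes_left, missing, lam=1.0):
    """Pick the default direction for missing values by trying both options.

    goes_left: True for non-missing rows whose feature value satisfies the split.
    missing:   True for rows where the split feature is missing.
    """
    def split_score(left_mask):
        def leaf(gs, hs):
            return gs.sum() ** 2 / (hs.sum() + lam)
        return leaf(g[left_mask], h[left_mask]) + leaf(g[~left_mask], h[~left_mask])

    return "left" if split_score(goes_left | missing) >= split_score(goes_left) else "right"
```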
XGBoost implementations
In practice...
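For instance, a minimal sketch of fitting the library to the toy age data from the start of the talk (the hyperparameter values are assumptions):

```python
import numpy as np
import xgboost as xgb

# Columns: LikesGardening, PlaysVideoGames, LikesHats (from the earlier table)
X = np.array([
    [0, 1, 1], [0, 1, 0], [0, 1, 0], [1, 1, 1], [0, 1, 1],
    [1, 0, 0], [1, 1, 1], [1, 0, 0], [1, 0, 1],
])
y = np.array([13, 14, 15, 25, 35, 49, 68, 71, 73])

model = xgb.XGBRegressor(n_estimators=50, max_depth=2, learning_rate=0.3)
model.fit(X, y)
print(model.predict(X))
```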
Predicting round-trip times for packets in WiFi networks
All sorts of things can impact networks; we want to predict from passive network statistics like:
We can summarize the “quality” of a network’s latency through the cumulative distribution of latencies observed in the network; this is known as delta-Q.
XGBoost completely outperforms the other models (the comparison setup is sketched below):
Linear Regression - a typical OLS model, nothing special
Random Forest - 1000 estimators, max depth of 3
XGBoost - 1000 estimators, max depth of 3
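A sketch of how such a comparison could be set up with the stated hyperparameters (the scikit-learn model classes and the X_train/y_train names are assumptions; the actual pipeline from the talk is not shown):

```python
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=1000, max_depth=3),
    "XGBoost": XGBRegressor(n_estimators=1000, max_depth=3),
}

for name, model in models.items():
    model.fit(X_train, y_train)               # X_train, y_train assumed to exist
    print(name, model.score(X_test, y_test))  # R^2 on held-out data
```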
Predicting dropout in survey responses
German Socio-Economic Panel (GSOEP)
Figure 3. Performance Curves in Test Set (y = Refusal in GSOEP Wave 2014)
Kern, Christoph, Thomas Klausch, and Frauke Kreuter. "Tree-based Machine Learning Methods for Survey Research." Survey Research Methods. Vol. 13. No. 1. 2019.
Conclusions
To put it in “plainer” terms
The ultimate model ensemble, F_M(x), is a linear combination of all the individual fitted models.
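Written out with the notation from earlier (F_1 the initial average and η the shrinkage factor):

```latex
F_M(x) = F_1(x) + \eta \sum_{m=1}^{M-1} h_m(x)
```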