1 of 37

XGBoost

John M. Aiken


Chen, Tianqi, and Carlos Guestrin. "XGBoost: A scalable tree boosting system." Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016.

2 of 37

Introduction


3 of 37

4 of 37

Easy example

We want to predict a person’s age based on whether they play video games, enjoy gardening, and like wearing hats. Our objective is to minimize squared error.


5 of 37


PersonID | Age | LikesGardening | PlaysVideoGames | LikesHats
1 | 13 | FALSE | TRUE  | TRUE
2 | 14 | FALSE | TRUE  | FALSE
3 | 15 | FALSE | TRUE  | FALSE
4 | 25 | TRUE  | TRUE  | TRUE
5 | 35 | FALSE | TRUE  | TRUE
6 | 49 | TRUE  | FALSE | FALSE
7 | 68 | TRUE  | TRUE  | TRUE
8 | 71 | TRUE  | FALSE | FALSE
9 | 73 | TRUE  | FALSE | TRUE

6 of 37

Feature | FALSE (ages) | TRUE (ages)
LikesGardening | {13, 14, 15, 35} | {25, 49, 68, 71, 73}
PlaysVideoGames | {49, 71, 73} | {13, 14, 15, 25, 35, 68}
LikesHats | {14, 15, 49, 71} | {13, 25, 35, 68, 73}

7 of 37

Decision tree analysis


This tree ignores video games and hats

8 of 37


9 of 37

Predictions from first tree


Tree 1 splits on LikesGardening and predicts the mean age in each leaf: 19.25 for {13, 14, 15, 35} (LikesGardening = FALSE) and 57.2 for {25, 49, 68, 71, 73} (LikesGardening = TRUE).

PersonID | Age | Tree1 Prediction | Tree1 Residual (Age − Prediction)
1 | 13 | 19.25 | -6.25
2 | 14 | 19.25 | -5.25
3 | 15 | 19.25 | -4.25
4 | 25 | 57.2 | -32.2
5 | 35 | 19.25 | 15.75
6 | 49 | 57.2 | -8.2
7 | 68 | 57.2 | 10.8
8 | 71 | 57.2 | 13.8
9 | 73 | 57.2 | 15.8

But now we can fit a new model on the residuals of the first model!
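A minimal sketch of this two-step fit using scikit-learn decision stumps (my reconstruction, not the original slide code; the depth-1 trees are an assumption chosen to reproduce the leaf values above):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Columns: [LikesGardening, PlaysVideoGames, LikesHats] as 0/1
X = np.array([
    [0, 1, 1], [0, 1, 0], [0, 1, 0], [1, 1, 1], [0, 1, 1],
    [1, 0, 0], [1, 1, 1], [1, 0, 0], [1, 0, 1],
])
y = np.array([13, 14, 15, 25, 35, 49, 68, 71, 73], dtype=float)

# Tree 1: a depth-1 stump fit to age (it picks LikesGardening -> leaves 19.25 / 57.2)
tree1 = DecisionTreeRegressor(max_depth=1).fit(X, y)
pred1 = tree1.predict(X)
residuals = y - pred1

# Tree 2: another stump fit to the residuals (it picks PlaysVideoGames -> -3.567 / 7.133)
tree2 = DecisionTreeRegressor(max_depth=1).fit(X, residuals)
combined = pred1 + tree2.predict(X)
print(np.round(combined, 2))  # the "Combined Prediction" column a few slides ahead
```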

10 of 37

Predictions from second tree


PersonID | Age | Tree2 Prediction
1 | 13 | -3.567
2 | 14 | -3.567
3 | 15 | -3.567
4 | 25 | -3.567
5 | 35 | -3.567
6 | 49 | 7.133
7 | 68 | -3.567
8 | 71 | 7.133
9 | 73 | 7.133

Tree 2 is fit to the Tree 1 residuals and splits on PlaysVideoGames: it predicts -3.567 for the PlaysVideoGames = TRUE leaf and 7.133 for the FALSE leaf.

11 of 37

F(x) = (first tree, predicting age) + (second tree, predicting the residuals)

A first step toward gradient boosting, but we are not there yet.

12 of 37


PersonID | Age | Tree1 Prediction | Tree1 Residual | Tree2 Prediction | Combined Prediction | Final Residual (Prediction − Age)
1 | 13 | 19.25 | -6.25 | -3.567 | 15.68 | 2.683
2 | 14 | 19.25 | -5.25 | -3.567 | 15.68 | 1.683
3 | 15 | 19.25 | -4.25 | -3.567 | 15.68 | 0.6833
4 | 25 | 57.2 | -32.2 | -3.567 | 53.63 | 28.63
5 | 35 | 19.25 | 15.75 | -3.567 | 15.68 | -19.32
6 | 49 | 57.2 | -8.2 | 7.133 | 64.33 | 15.33
7 | 68 | 57.2 | 10.8 | -3.567 | 53.63 | -14.37
8 | 71 | 57.2 | 13.8 | 7.133 | 64.33 | -6.667
9 | 73 | 57.2 | 15.8 | 7.133 | 64.33 | -8.667

13 of 37

A more formalized algorithm

  • Fit a model to the data: F1(x) = y
  • Compute the residuals: r1 = y - F1(x)
  • Fit a model to the residuals: h1(x)
  • Create a new model: F2(x) = F1(x) + h1(x)


This process can be generalized as:

F2(x) = F1(x) + h1(x), F3(x) = F2(x) + h2(x), …, FM(x) = FM-1(x) + hM-1(x)

And the model can be initialized with the average of y, which is the minimizer of the MSE (a quick check follows).
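As a one-line check of that claim (my derivation, not on the original slide): minimizing the squared error over a constant prediction gives the mean.

```latex
F_0 = \arg\min_{c} \sum_{i=1}^{n} (y_i - c)^2, \qquad
\frac{\partial}{\partial c} \sum_{i=1}^{n} (y_i - c)^2 = -2 \sum_{i=1}^{n} (y_i - c) = 0
\;\Longrightarrow\;
c = \frac{1}{n} \sum_{i=1}^{n} y_i = \bar{y}
```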

14 of 37

But wait, there’s more

  • Fit a model to the data: F1(x) = y
  • Compute the residuals: r1 = y - F1(x)
  • Fit a model to the residuals: h1(x)
  • Create a new model: F2(x) = F1(x) + h1(x)


This part is much more complicated than just “fit h(x) to r”

Actually, in gradient boosting we replace this step by fitting h to the (negative) gradient of whatever loss function we choose
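To make that concrete (my worked example, not from the slide): for squared error the negative gradient is exactly the residual, so the residual-fitting recipe above is a special case; change the loss and the thing you fit changes with it.

```latex
L(y, F) = \tfrac{1}{2}(y - F)^2 \;\Rightarrow\; -\partial L / \partial F = y - F \quad \text{(the ordinary residual)} \\
L(y, F) = |y - F| \;\Rightarrow\; -\partial L / \partial F = \operatorname{sign}(y - F) \quad \text{(fit to the sign of the residual instead)}
```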

15 of 37


Friedman, Jerome H. "Greedy function approximation: A gradient boosting machine." Annals of Statistics (2001): 1189-1232.

16 of 37

Here the gradient of the loss function plays the role the residuals played before, but it is more general:

  • F0 is the initialization
  • m indexes the boosting iterations: each of the M trees is fit at a single step m
  • ỹ (y-tilde) is the negative gradient of the loss function evaluated at the previous model, i.e. the pseudo-residuals
  • a_m are the model parameters; for a tree-based model this is how each decision node makes its split
  • β is the coefficient from the least-squares fit of h to the pseudo-residuals
  • ρ_m is the weighting term for each newly fit h, found by a line search on the loss
  • F_m is the combined model at this iteration
  • F_{m-1} is the model from the previous iteration


At this point I found it helpful to actually write out how you get FM(x) for a small M; it is just the initialization plus a weighted sum of the fitted h terms, written out below.
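Written out for M = 2 in Friedman's notation (my expansion of the recursion above):

```latex
F_2(x) = F_0(x) + \rho_1 h(x; a_1) + \rho_2 h(x; a_2),
\qquad
F_M(x) = F_0(x) + \sum_{m=1}^{M} \rho_m h(x; a_m)
```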

17 of 37

Note

  • We did not specify what kind of model fits the data
  • We did not specify which loss function we are minimizing

Thus gradient boosting works with basically anything you want. XGBoost, however, like most gradient boosting in practice, uses tree-based models.

On top of this, XGBoost adds a bunch of stuff.


18 of 37

Xgboost objective functions

  • reg:squarederror: regression with squared loss.
  • reg:squaredlogerror: regression with squared log loss
  • reg:logistic: logistic regression
  • binary:logistic: logistic regression for binary classification, output probability
  • binary:logitraw: logistic regression for binary classification, output score before logistic transformation
  • binary:hinge: hinge loss for binary classification. This makes predictions of 0 or 1, rather than producing probabilities.
  • count:poisson: Poisson regression for count data, output mean of Poisson distribution
  • survival:cox: Cox regression for right censored survival time data (negative values are considered right censored). Note that predictions are returned on the hazard ratio scale.

  • multi:softmax: set XGBoost to do multiclass classification using the softmax objective; you also need to set num_class (number of classes)
  • multi:softprob: same as softmax, but output a vector of ndata * nclass, which can be further reshaped to ndata * nclass matrix. The result contains predicted probability of each data point belonging to each class.
  • rank:pairwise: Use LambdaMART to perform pairwise ranking where the pairwise loss is minimized
  • rank:ndcg: Use LambdaMART to perform list-wise ranking where Normalized Discounted Cumulative Gain (NDCG) is maximized
  • rank:map: Use LambdaMART to perform list-wise ranking where Mean Average Precision (MAP) is maximized
  • reg:gamma: gamma regression with log-link. Output is a mean of gamma distribution. It might be useful, e.g., for modeling insurance claims severity, or for any outcome that might be gamma-distributed.
  • reg:tweedie: Tweedie regression with log-link. It might be useful, e.g., for modeling total loss in insurance, or for any outcome that might be Tweedie-distributed.
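A minimal usage sketch (my example, not from the slides): the objective is just a training parameter, here binary:logistic on synthetic data.

```python
import numpy as np
import xgboost as xgb

# Synthetic binary classification problem
rng = np.random.default_rng(0)
X = rng.random((200, 4))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

dtrain = xgb.DMatrix(X, label=y)
params = {"objective": "binary:logistic", "max_depth": 3, "eta": 0.1}
model = xgb.train(params, dtrain, num_boost_round=50)

probs = model.predict(dtrain)  # probabilities, because of binary:logistic
```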


19 of 37

Xgboost: Regularized loss

Instead of only the training loss (e.g., MSE), XGBoost adds a regularization term: an L2 (ridge-style) penalty on the leaf weights plus a penalty on the number of leaves.
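The regularized objective from the cited paper, where T is the number of leaves in a tree and w its vector of leaf weights:

```latex
\mathcal{L} = \sum_{i} l(\hat{y}_i, y_i) + \sum_{k} \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2} \lambda \lVert w \rVert^2
```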


Chen, Tianqi, and Carlos Guestrin. "XGBoost: A scalable tree boosting system." Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016.

20 of 37

Xgboost shrinkage

Shrinkage scales newly added weights by a factor η after each step of tree boosting. Similar to a learning rate in stochastic optimization, shrinkage reduces the influence of each individual tree and leaves space for future trees to improve the model.


Shrinkage parameter: 0 < η < 1
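In update form (my paraphrase of the shrinkage step), each newly added tree f_m is scaled by η:

```latex
F_m(x) = F_{m-1}(x) + \eta\, f_m(x), \qquad 0 < \eta < 1
```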

21 of 37

Xgboost column subsampling

Features are randomly subsampled and provided to each tree for fitting.

This is the same idea used in random forests.

Additionally, XGBoost allows row subsampling, so each tree is fit on a subsample of the training rows instead of all the data (see the parameter sketch below).
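In the Python package these knobs are ordinary constructor parameters; a hedged sketch (the specific values are arbitrary):

```python
import xgboost as xgb

model = xgb.XGBRegressor(
    n_estimators=500,
    learning_rate=0.1,     # shrinkage (eta)
    colsample_bytree=0.8,  # fraction of features sampled for each tree
    subsample=0.8,         # fraction of training rows sampled for each tree
    reg_lambda=1.0,        # L2 penalty on leaf weights
)
```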


22 of 37

Xgboost: finding the split

  • The best way to find a split is to try every candidate split and use the best one; this is called the “exact greedy algorithm”
  • However, compute and data size do not always support this procedure
  • Thus there is an approximate algorithm for finding splits; I’m not going to cover it here...
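For reference (from the XGBoost parameter documentation, not this slide), the split-finding strategy is exposed via tree_method:

```python
import xgboost as xgb

exact_model = xgb.XGBRegressor(tree_method="exact")  # exact greedy algorithm
approx_model = xgb.XGBRegressor(tree_method="hist")  # approximate, histogram/quantile-based splits
```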


23 of 37

24 of 37

Finding the optimum structure for a single leaf

If you assume the structure of the tree is fixed, then the optimum weight in each leaf and the resulting objective value are given by the formulas below:
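From the cited paper, with G_j and H_j the sums of the first and second derivatives of the loss (evaluated at the previous iteration's model) over the instances in leaf j, λ the L2 penalty, and T the number of leaves:

```latex
w_j^{*} = -\frac{G_j}{H_j + \lambda},
\qquad
\text{obj}^{*} = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^{2}}{H_j + \lambda} + \gamma T
```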


25 of 37

Calculating leaf weights

So the weight of leaf j in each tree is (minus) the ratio of the summed first derivatives of the loss, evaluated at the previous iteration’s model, to the summed second derivatives plus the regularization λ.


That is, if this leaf’s prediction is much better than before, it should be weighted much more.

26 of 37


These annotations unpack the objective above, obj* = -½ Σ_j G_j²/(H_j + λ) + γT:

The objective as a whole: this should be as small as possible.

G_j²/(H_j + λ): the bigger this is, the more value the leaf has (the G ends up squared because the optimal weight is substituted back into the objective, not just because of differentiation).

γT: T is the number of leaves, so the bigger this is, the more leaves there are and the larger the penalty.

Thus you want the smallest number of leaves with the largest ratio G_j²/(H_j + λ).

27 of 37

But this is only one leaf, and there are many possible combinations of leaves (tree structures) to consider.
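From the cited paper, the exact greedy algorithm scores every candidate split of a leaf into left and right children with the gain below and keeps the best one; a split is only worth making if the gain beats the extra-leaf penalty γ:

```latex
\text{Gain} = \frac{1}{2} \left[
\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}
\right] - \gamma
```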


28 of 37


29 of 37


30 of 37

Xgboost: sparse data

Missing data, frequent zeroes, and artifacts of feature engineering can all cause data to be sparse.

XGBoost learns a default direction for missing values at each node from the non-missing entries: it computes the loss with the missing values sent in each of the two possible directions and keeps the better one (so each node ends up with a single default direction, chosen from two candidates).
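A small sketch (my example): with the Python package, np.nan is treated as missing and routed along the learned default direction, so no imputation is needed.

```python
import numpy as np
import xgboost as xgb

X = np.array([[1.0, np.nan],
              [2.0, 0.5],
              [np.nan, 0.7],
              [4.0, np.nan]])
y = np.array([1.0, 2.0, 3.0, 4.0])

model = xgb.XGBRegressor(n_estimators=10).fit(X, y)  # trains despite the missing entries
```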


31 of 37

Xgboost implementations

  • Available in R, Python, the JVM (Java/Scala), Ruby, and Julia
  • The Python version implements the scikit-learn API, so you can drop it in to replace an sklearn model (see the sketch below)
  • Can be deployed on distributed systems via AWS, Kubernetes, Spark, etc.
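A sketch of the scikit-learn drop-in behavior (my example, on synthetic data):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)

# XGBRegressor follows the scikit-learn estimator API, so standard sklearn tooling just works.
scores = cross_val_score(XGBRegressor(n_estimators=200, max_depth=3), X, y, cv=5)
print(scores.mean())
```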

In practice...


32 of 37

Predicting round-trip times for packets in WiFi networks


All sorts of things can impact networks; we want to predict round-trip times from passive network statistics like:

  • Overlap from other networks
  • Network usage
  • Noise per antenna
  • Distance of the device from the router
  • etc

We can summarize the “quality” of a network’s latency through the cumulative distribution of latencies observed in the network; this is known as ΔQ (delta-Q).

33 of 37

Xgboost completely outperforms other models

Linear regression: a typical OLS model, nothing special

Random forest: 1000 estimators, max depth of 3

XGBoost: 1000 estimators, max depth of 3


34 of 37

Predicting drop out in survey responses

German Socio-Economic Panel (GSOEP)

  • Annual longitudinal survey running since 1984, covering topics in economics, sociology, psychology, and political science
  • Survey drop-out has increased over time (25% in 1984, 45% in 2015), and understanding why is important for characterizing the quality of survey responses
  • Compared to logistic regression and similar baselines, XGBoost performs similarly to random forests


Figure 3. Performance Curves in Test Set (y = Refusal in GSOEP Wave 2014)

Kern, Christoph, Thomas Klausch, and Frauke Kreuter. "Tree-based machine learning methods for survey research." Survey Research Methods, Vol. 13, No. 1, 2019.

35 of 37

Conclusions

  • XGBoost provides a good top-line model for prediction in many, many situations where you have tabular data
  • XGBoost can handle missing data, data larger than memory (out-of-core training), and more
  • XGBoost is perfect and can do anything
    • Really, this is false, but it’s pretty impressive what it can do


36 of 37

To put it in “plainer” terms

  • Initialize the model by minimizing the loss function (the average, if the loss is MSE). This gives you F0(x)
  • For each model m out of M models (each giving Fm(x)):
    • Fit a function h to the negative gradient of the loss of the previous model; this is your ỹ (the pseudo-residuals)
    • Calculate the parameters of the model; this is your a_m (if it’s a tree, the split parameters)
    • Calculate a controlling weight term ρ_m
    • Finish with the updated model Fm(x) = Fm-1(x) + ρm h(x; am)

The final model FM(x) is the linear combination of the initialization and all of the fitted h terms; a generic sketch of this loop follows.
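A generic sketch of this loop in Python (my code, not from the slides; the per-iteration line search for ρ is simplified to a constant learning rate, and the base learner is a shallow sklearn tree):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, neg_grad, init, n_steps=100, lr=0.1, max_depth=3):
    """Fit F_M(x) = F_0 + lr * sum_m h_m(x) by repeatedly fitting trees to pseudo-residuals."""
    F = np.full(len(y), init(y), dtype=float)  # F0: minimizer of the loss (the mean for MSE)
    trees = []
    for _ in range(n_steps):
        ytilde = neg_grad(y, F)                # pseudo-residuals: negative gradient of the loss
        h = DecisionTreeRegressor(max_depth=max_depth).fit(X, ytilde)
        F = F + lr * h.predict(X)              # F_m = F_{m-1} + lr * h_m  (rho folded into lr)
        trees.append(h)
    return init(y), trees

# Squared error: the negative gradient is the residual and the initializer is the mean.
# f0, trees = gradient_boost(X, y, neg_grad=lambda y, F: y - F, init=np.mean)
```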


37 of 37