1 of 37

XGBoost

John M. Aiken


Chen, Tianqi, and Carlos Guestrin. "XGBoost: A scalable tree boosting system." Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016.

2 of 37

Introduction


3 of 37

4 of 37

Easy example

We want to predict a person’s age based on whether they play video games, enjoy gardening, and like wearing hats. Our objective is to minimize squared error.


5 of 37


PersonID | Age | LikesGardening | PlaysVideoGames | LikesHats
1 | 13 | FALSE | TRUE  | TRUE
2 | 14 | FALSE | TRUE  | FALSE
3 | 15 | FALSE | TRUE  | FALSE
4 | 25 | TRUE  | TRUE  | TRUE
5 | 35 | FALSE | TRUE  | TRUE
6 | 49 | TRUE  | FALSE | FALSE
7 | 68 | TRUE  | TRUE  | TRUE
8 | 71 | TRUE  | FALSE | FALSE
9 | 73 | TRUE  | FALSE | TRUE

6 of 37

Feature | FALSE (ages) | TRUE (ages)
LikesGardening | {13, 14, 15, 35} | {25, 49, 68, 71, 73}
PlaysVideoGames | {49, 71, 73} | {13, 14, 15, 25, 35, 68}
LikesHats | {14, 15, 49, 71} | {13, 25, 35, 68, 73}

7 of 37

Decision tree analysis


This tree ignores video games and hats

8 of 37


9 of 37

Predictions from first tree


Tree 1 splits on LikesGardening and predicts the mean age in each leaf: 19.25 for {13, 14, 15, 35} (LikesGardening = FALSE) and 57.2 for {25, 49, 68, 71, 73} (LikesGardening = TRUE).

PersonID | Age | Tree1 Prediction | Tree1 Residual (Age − Prediction)
1 | 13 | 19.25 | -6.25
2 | 14 | 19.25 | -5.25
3 | 15 | 19.25 | -4.25
4 | 25 | 57.2 | -32.2
5 | 35 | 19.25 | 15.75
6 | 49 | 57.2 | -8.2
7 | 68 | 57.2 | 10.8
8 | 71 | 57.2 | 13.8
9 | 73 | 57.2 | 15.8

But now we can fit a new model on the residuals of the first model!
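A minimal sketch of this two-step fit using scikit-learn decision stumps (my reconstruction, not the original slide code; the depth-1 trees are an assumption chosen to reproduce the leaf values above):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Columns: [LikesGardening, PlaysVideoGames, LikesHats] as 0/1
X = np.array([
    [0, 1, 1], [0, 1, 0], [0, 1, 0], [1, 1, 1], [0, 1, 1],
    [1, 0, 0], [1, 1, 1], [1, 0, 0], [1, 0, 1],
])
y = np.array([13, 14, 15, 25, 35, 49, 68, 71, 73], dtype=float)

# Tree 1: a depth-1 stump fit to age (it picks LikesGardening -> leaves 19.25 / 57.2)
tree1 = DecisionTreeRegressor(max_depth=1).fit(X, y)
pred1 = tree1.predict(X)
residuals = y - pred1

# Tree 2: another stump fit to the residuals (it picks PlaysVideoGames -> -3.567 / 7.133)
tree2 = DecisionTreeRegressor(max_depth=1).fit(X, residuals)
combined = pred1 + tree2.predict(X)
print(np.round(combined, 2))  # the "Combined Prediction" column a few slides ahead
```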

10 of 37

Predictions from second tree


PersonID | Age | Tree2 Prediction
1 | 13 | -3.567
2 | 14 | -3.567
3 | 15 | -3.567
4 | 25 | -3.567
5 | 35 | -3.567
6 | 49 | 7.133
7 | 68 | -3.567
8 | 71 | 7.133
9 | 73 | 7.133

Tree 2 is fit to the Tree 1 residuals and splits on PlaysVideoGames: it predicts -3.567 for the PlaysVideoGames = TRUE leaf and 7.133 for the FALSE leaf.

11 of 37

F(x) = (first tree, predicting age) + (second tree, predicting the residuals)

A first step toward gradient boosting, but we are not there yet.

12 of 37


PersonID | Age | Tree1 Prediction | Tree1 Residual | Tree2 Prediction | Combined Prediction | Final Residual (Prediction − Age)
1 | 13 | 19.25 | -6.25 | -3.567 | 15.68 | 2.683
2 | 14 | 19.25 | -5.25 | -3.567 | 15.68 | 1.683
3 | 15 | 19.25 | -4.25 | -3.567 | 15.68 | 0.6833
4 | 25 | 57.2 | -32.2 | -3.567 | 53.63 | 28.63
5 | 35 | 19.25 | 15.75 | -3.567 | 15.68 | -19.32
6 | 49 | 57.2 | -8.2 | 7.133 | 64.33 | 15.33
7 | 68 | 57.2 | 10.8 | -3.567 | 53.63 | -14.37
8 | 71 | 57.2 | 13.8 | 7.133 | 64.33 | -6.667
9 | 73 | 57.2 | 15.8 | 7.133 | 64.33 | -8.667

13 of 37

A more formalized algorithm

  • Fit a model to the data: F1(x) = y
  • Compute the residuals: r1 = y - F1(x)
  • Fit a model to the residuals: h1(x)
  • Create a new model: F2(x) = F1(x) + h1(x)


This process can be generalized as:

F2(x) = F1(x) + h1(x), F3(x) = F2(x) + h2(x), …, FM(x) = FM-1(x) + hM-1(x)

And the model can be initialized with the average of y, which is the minimizer of the MSE (a quick check follows).
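As a one-line check of that claim (my derivation, not on the original slide): minimizing the squared error over a constant prediction gives the mean.

```latex
F_0 = \arg\min_{c} \sum_{i=1}^{n} (y_i - c)^2, \qquad
\frac{\partial}{\partial c} \sum_{i=1}^{n} (y_i - c)^2 = -2 \sum_{i=1}^{n} (y_i - c) = 0
\;\Longrightarrow\;
c = \frac{1}{n} \sum_{i=1}^{n} y_i = \bar{y}
```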

14 of 37

But wait, there’s more

  • Fit a model to the data: F1(x) = y
  • Compute the residuals: r1 = y - F1(x)
  • Fit a model to the residuals: h1(x)
  • Create a new model: F2(x) = F1(x) + h1(x)


This part is much more complicated than just “fit h(x) to r”

Actually, in gradient boosting we replace this step by fitting h to the (negative) gradient of whatever loss function we choose
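To make that concrete (my worked example, not from the slide): for squared error the negative gradient is exactly the residual, so the residual-fitting recipe above is a special case; change the loss and the thing you fit changes with it.

```latex
L(y, F) = \tfrac{1}{2}(y - F)^2 \;\Rightarrow\; -\partial L / \partial F = y - F \quad \text{(the ordinary residual)} \\
L(y, F) = |y - F| \;\Rightarrow\; -\partial L / \partial F = \operatorname{sign}(y - F) \quad \text{(fit to the sign of the residual instead)}
```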

15 of 37


Friedman, Jerome H. "Greedy function approximation: A gradient boosting machine." Annals of Statistics (2001): 1189-1232.

16 of 37

Here the gradient of the loss function plays the role the residuals played before, but it is more general:

  • F0 is the initialization
  • m indexes the boosting iterations: each of the M trees is fit at a single step m
  • ỹ (y-tilde) is the negative gradient of the loss function evaluated at the previous model, i.e. the pseudo-residuals
  • a_m are the model parameters; for a tree-based model this is how each decision node makes its split
  • β is the coefficient from the least-squares fit of h to the pseudo-residuals
  • ρ_m is the weighting term for each newly fit h, found by a line search on the loss
  • F_m is the combined model at this iteration
  • F_{m-1} is the model from the previous iteration


At this point I found it helpful to actually write out how you get FM(x) for a small M; it is just the initialization plus a weighted sum of the fitted h terms, written out below.
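Written out for M = 2 in Friedman's notation (my expansion of the recursion above):

```latex
F_2(x) = F_0(x) + \rho_1 h(x; a_1) + \rho_2 h(x; a_2),
\qquad
F_M(x) = F_0(x) + \sum_{m=1}^{M} \rho_m h(x; a_m)
```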

17 of 37

Note

  • We did not specify what kind of model fits the data
  • We did not specify which loss function we are minimizing

Thus gradient boosting works with basically anything you want. XGBoost, however, like most gradient boosting in practice, uses tree-based models.

On top of this, XGBoost adds a bunch of stuff.


18 of 37

Xgboost objective functions

  • reg:squarederror: regression with squared loss.
  • reg:squaredlogerror: regression with squared log loss
  • reg:logistic: logistic regression
  • binary:logistic: logistic regression for binary classification, output probability
  • binary:logitraw: logistic regression for binary classification, output score before logistic transformation
  • binary:hinge: hinge loss for binary classification. This makes predictions of 0 or 1, rather than producing probabilities.
  • count:poisson: Poisson regression for count data, output mean of Poisson distribution
  • survival:cox: Cox regression for right censored survival time data (negative values are considered right censored). Note that predictions are returned on the hazard ratio scale.

  • multi:softmax: set XGBoost to do multiclass classification using the softmax objective; you also need to set num_class (number of classes)
  • multi:softprob: same as softmax, but output a vector of ndata * nclass, which can be further reshaped to ndata * nclass matrix. The result contains predicted probability of each data point belonging to each class.
  • rank:pairwise: Use LambdaMART to perform pairwise ranking where the pairwise loss is minimized
  • rank:ndcg: Use LambdaMART to perform list-wise ranking where Normalized Discounted Cumulative Gain (NDCG) is maximized
  • rank:map: Use LambdaMART to perform list-wise ranking where Mean Average Precision (MAP) is maximized
  • reg:gamma: gamma regression with log-link. Output is a mean of gamma distribution. It might be useful, e.g., for modeling insurance claims severity, or for any outcome that might be gamma-distributed.
  • reg:tweedie: Tweedie regression with log-link. It might be useful, e.g., for modeling total loss in insurance, or for any outcome that might be Tweedie-distributed.
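A minimal usage sketch (my example, not from the slides): the objective is just a training parameter, here binary:logistic on synthetic data.

```python
import numpy as np
import xgboost as xgb

# Synthetic binary classification problem
rng = np.random.default_rng(0)
X = rng.random((200, 4))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

dtrain = xgb.DMatrix(X, label=y)
params = {"objective": "binary:logistic", "max_depth": 3, "eta": 0.1}
model = xgb.train(params, dtrain, num_boost_round=50)

probs = model.predict(dtrain)  # probabilities, because of binary:logistic
```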


19 of 37

Xgboost: Regularized loss

Instead of only the training loss (e.g., MSE), XGBoost adds a regularization term: an L2 (ridge-style) penalty on the leaf weights plus a penalty on the number of leaves.
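The regularized objective from the cited paper, where T is the number of leaves in a tree and w its vector of leaf weights:

```latex
\mathcal{L} = \sum_{i} l(\hat{y}_i, y_i) + \sum_{k} \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2} \lambda \lVert w \rVert^2
```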


Chen, Tianqi, and Carlos Guestrin. "XGBoost: A scalable tree boosting system." Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016.

20 of 37

Xgboost shrinkage

Shrinkage scales newly added weights by a factor η after each step of tree boosting. Similar to a learning rate in stochastic optimization, shrinkage reduces the influence of each individual tree and leaves space for future trees to improve the model.


Shrinkage parameter: 0 < η < 1
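In update form (my paraphrase of the shrinkage step), each newly added tree f_m is scaled by η:

```latex
F_m(x) = F_{m-1}(x) + \eta\, f_m(x), \qquad 0 < \eta < 1
```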

21 of 37

Xgboost column subsampling

Features are randomly subsampled and provided to each tree for fitting.

This is the same idea used in random forests.

Additionally, XGBoost allows row subsampling, so each tree is fit on a subsample of the training rows instead of all the data (see the parameter sketch below).
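In the Python package these knobs are ordinary constructor parameters; a hedged sketch (the specific values are arbitrary):

```python
import xgboost as xgb

model = xgb.XGBRegressor(
    n_estimators=500,
    learning_rate=0.1,     # shrinkage (eta)
    colsample_bytree=0.8,  # fraction of features sampled for each tree
    subsample=0.8,         # fraction of training rows sampled for each tree
    reg_lambda=1.0,        # L2 penalty on leaf weights
)
```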


22 of 37

Xgboost: finding the split

  • The best way to find a split is to try every candidate split and use the best one; this is called the “exact greedy algorithm”
  • However, compute and data size do not always support this procedure
  • Thus there is an approximate algorithm for finding splits; I’m not going to cover it here...
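For reference (from the XGBoost parameter documentation, not this slide), the split-finding strategy is exposed via tree_method:

```python
import xgboost as xgb

exact_model = xgb.XGBRegressor(tree_method="exact")  # exact greedy algorithm
approx_model = xgb.XGBRegressor(tree_method="hist")  # approximate, histogram/quantile-based splits
```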


23 of 37

24 of 37

Finding the optimum structure for a single leaf

If you assume the structure of the tree is fixed, then the optimum weight in each leaf and the resulting objective value are given by the formulas below:
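From the cited paper, with G_j and H_j the sums of the first and second derivatives of the loss (evaluated at the previous iteration's model) over the instances in leaf j, λ the L2 penalty, and T the number of leaves:

```latex
w_j^{*} = -\frac{G_j}{H_j + \lambda},
\qquad
\text{obj}^{*} = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^{2}}{H_j + \lambda} + \gamma T
```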


25 of 37

Calculating leaf weights

So the weight of leaf j in each tree is (minus) the ratio of the summed first derivatives of the loss, evaluated at the previous iteration’s model, to the summed second derivatives plus the regularization λ.


That is, if this leaf’s prediction is much better than before, it should be weighted much more.

26 of 37


These annotations unpack the objective above, obj* = -½ Σ_j G_j²/(H_j + λ) + γT:

The objective as a whole: this should be as small as possible.

G_j²/(H_j + λ): the bigger this is, the more value the leaf has (the G ends up squared because the optimal weight is substituted back into the objective, not just because of differentiation).

γT: T is the number of leaves, so the bigger this is, the more leaves there are and the larger the penalty.

Thus you want the smallest number of leaves with the largest ratio G_j²/(H_j + λ).

27 of 37

But this is only one leaf, and there are many possible combinations of leaves (tree structures) to consider.
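From the cited paper, the exact greedy algorithm scores every candidate split of a leaf into left and right children with the gain below and keeps the best one; a split is only worth making if the gain beats the extra-leaf penalty γ:

```latex
\text{Gain} = \frac{1}{2} \left[
\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}
\right] - \gamma
```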


28 of 37


29 of 37


30 of 37

Xgboost: sparse data

Missing data, frequent zeroes, and artifacts of feature engineering can all cause data to be sparse.

XGBoost learns a default direction for missing values at each node from the non-missing entries: it computes the loss with the missing values sent in each of the two possible directions and keeps the better one (so each node ends up with a single default direction, chosen from two candidates).
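A small sketch (my example): with the Python package, np.nan is treated as missing and routed along the learned default direction, so no imputation is needed.

```python
import numpy as np
import xgboost as xgb

X = np.array([[1.0, np.nan],
              [2.0, 0.5],
              [np.nan, 0.7],
              [4.0, np.nan]])
y = np.array([1.0, 2.0, 3.0, 4.0])

model = xgb.XGBRegressor(n_estimators=10).fit(X, y)  # trains despite the missing entries
```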


31 of 37

Xgboost implementations

  • Available in R, Python, the JVM (Java/Scala), Ruby, and Julia
  • The Python version implements the scikit-learn API, so you can drop it in to replace an sklearn model (see the sketch below)
  • Can be deployed on distributed systems via AWS, Kubernetes, Spark, etc.
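A sketch of the scikit-learn drop-in behavior (my example, on synthetic data):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)

# XGBRegressor follows the scikit-learn estimator API, so standard sklearn tooling just works.
scores = cross_val_score(XGBRegressor(n_estimators=200, max_depth=3), X, y, cv=5)
print(scores.mean())
```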

In practice...


32 of 37

Predicting round-trip times for packets in WiFi networks


All sorts of things can impact networks; we want to predict round-trip times from passive network statistics like:

  • Overlap from other networks
  • Network usage
  • Noise per antenna
  • Distance of the device from the router
  • etc

We can summarize the “quality” of a network’s latency through the cumulative distribution of latencies observed in the network; this is known as ΔQ (delta-Q).

33 of 37

Xgboost completely outperforms other models

Linear regression: a typical OLS model, nothing special

Random forest: 1000 estimators, max depth of 3

XGBoost: 1000 estimators, max depth of 3


34 of 37

Predicting drop out in survey responses

German Socio-Economic Panel (GSOEP)

  • Annual longitudinal survey running since 1984, covering topics in economics, sociology, psychology, and political science
  • Survey drop-out has increased over time (25% in 1984, 45% in 2015), and understanding why is important for characterizing the quality of survey responses
  • Compared to logistic regression and similar baselines, XGBoost performs similarly to random forests


Figure 3. Performance Curves in Test Set (y = Refusal in GSOEP Wave 2014)

Kern, Christoph, Thomas Klausch, and Frauke Kreuter. "Tree-based machine learning methods for survey research." Survey Research Methods, Vol. 13, No. 1, 2019.

35 of 37

Conclusions

  • XGBoost provides a good top-line model for prediction in many, many situations where you have tabular data
  • XGBoost can handle missing data, data larger than memory (out-of-core training), and more
  • XGBoost is perfect and can do anything
    • Really, this is false, but it’s pretty impressive what it can do


36 of 37

To put it in “plainer” terms

  • Initialize the model by minimizing the loss function (the average, if the loss is MSE). This gives you F0(x)
  • For each model m out of M models (each giving Fm(x)):
    • Fit a function h to the negative gradient of the loss of the previous model; this is your ỹ (the pseudo-residuals)
    • Calculate the parameters of the model; this is your a_m (if it’s a tree, the split parameters)
    • Calculate a controlling weight term ρ_m
    • Finish with the updated model Fm(x) = Fm-1(x) + ρm h(x; am)

The final model FM(x) is the linear combination of the initialization and all of the fitted h terms; a generic sketch of this loop follows.
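A generic sketch of this loop in Python (my code, not from the slides; the per-iteration line search for ρ is simplified to a constant learning rate, and the base learner is a shallow sklearn tree):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, neg_grad, init, n_steps=100, lr=0.1, max_depth=3):
    """Fit F_M(x) = F_0 + lr * sum_m h_m(x) by repeatedly fitting trees to pseudo-residuals."""
    F = np.full(len(y), init(y), dtype=float)  # F0: minimizer of the loss (the mean for MSE)
    trees = []
    for _ in range(n_steps):
        ytilde = neg_grad(y, F)                # pseudo-residuals: negative gradient of the loss
        h = DecisionTreeRegressor(max_depth=max_depth).fit(X, ytilde)
        F = F + lr * h.predict(X)              # F_m = F_{m-1} + lr * h_m  (rho folded into lr)
        trees.append(h)
    return init(y), trees

# Squared error: the negative gradient is the residual and the initializer is the mean.
# f0, trees = gradient_boost(X, y, neg_grad=lambda y, F: y - F, init=np.mean)
```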


37 of 37