1 of 50

CS60050: Machine Learning

Sourangshu Bhattacharya

2 of 50

BOOSTING

3 of 50

Boosting

  • Train classifiers (e.g. decision trees) in a sequence.
  • A new classifier should focus on those cases which were incorrectly classified in the last round.
  • Combine the classifiers by letting them vote on the final prediction (like bagging).
  • Each classifier is “weak” but the ensemble is “strong.”
  • AdaBoost is a specific boosting method.

4 of 50

Boosting Intuition

  • We adaptively weight each data point.

  • Data points which are wrongly classified get high weight (the algorithm will focus on them).

  • Each boosting round learns a new (simple) classifier on the weighted dataset.

  • These classifiers are weighted to combine them into a single, powerful classifier.

  • Classifiers that obtain a low training error rate get high weight.

  • We stop by monitoring a held-out set (cross-validation). A minimal sketch of this procedure follows.
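To make the reweighting concrete, here is a minimal AdaBoost-style sketch in Python (not taken from these slides): it assumes labels in {-1, +1} and uses scikit-learn decision stumps as the weak learners; names such as `adaboost_fit` are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier  # weak learner: a decision stump

def adaboost_fit(X, y, n_rounds=50):
    """Illustrative AdaBoost sketch; y must take values in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                        # start with equal weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)           # learn on the weighted dataset
        pred = stump.predict(X)
        eps = max(np.sum(w * (pred != y)) / np.sum(w), 1e-12)  # weighted error rate
        if eps >= 0.5:                             # no better than chance: stop
            break
        alpha = 0.5 * np.log((1 - eps) / eps)      # vote weight: low error -> high weight
        w *= np.exp(-alpha * y * pred)             # upweight the misclassified points
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    # combine the weak classifiers by a weighted vote
    return np.sign(sum(a * s.predict(X) for a, s in zip(alphas, stumps)))
```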

5 of 50

Boosting in a Picture

[Figure: training cases (vertical axis) vs. boosting rounds (horizontal axis). Annotations: a correctly classified example; an example that has a large weight in this round; a decision tree (DT) that gets a strong vote.]

6 of 50

Boosting: Adaboost


7 of 50

Adaboost

8 of 50

Adaboost (contd..)

9 of 50

And in animation

Original training set: equal weights to all training samples

Taken from “A Tutorial on Boosting” by Yoav Freund and Rob Schapire

10 of 50

AdaBoost example


ROUND 1

ε = (weighted) error rate of the classifier in this round

α = weight of the classifier in the final combination (computed from ε, as in the small example below)
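As a hedged illustration of how the two quantities relate in AdaBoost (the error value here is made up, not the one from the original figure):

```python
import math

eps = 0.3                                  # hypothetical weighted error rate of this round
alpha = 0.5 * math.log((1 - eps) / eps)    # AdaBoost weight of the classifier
print(round(alpha, 3))                     # 0.424 -- lower error gives a stronger vote
```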

11 of 50

AdaBoost example


ROUND 2

12 of 50

AdaBoost example


ROUND 3

13 of 50

AdaBoost example


14 of 50

Adaboost illustration

15 of 50

Adaboost

16 of 50

Adaboost (contd..)

17 of 50

Adaboost - Observations


18 of 50

Adaboost - derivation


19 of 50

Adaboost - derivation


20 of 50

Adaboost - derivation


21 of 50

Adaboost

22 of 50

Adaboost (contd..)

23 of 50

XGBOOST

24 of 50

Regression Tree Ensemble

[Figure: two regression trees. Tree 1 splits on "age < 15" (Y/N) and, for the younger branch, on "is male?" (Y/N); its leaf scores are +2, +0.1, and -1. Tree 2 splits on "Use Computer Daily" (Y/N), with leaf scores +0.9 and -0.9.]

f(·) = 2 + 0.9 = 2.9 for the instance that falls in the +2 and +0.9 leaves; f(·) = -1 - 0.9 = -1.9 for the instance that falls in the -1 and -0.9 leaves.

The prediction for an instance is the sum of the scores predicted by each tree (a small sketch follows).
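A minimal sketch of the "sum of tree scores" idea in Python; the hard-coded trees mirror the structure assumed for the figure, and the instance dictionaries are purely illustrative.

```python
def tree1(person):
    # assumed structure of tree 1: split on age, then on gender
    if person["age"] < 15:
        return 2.0 if person["is_male"] else 0.1
    return -1.0

def tree2(person):
    # assumed structure of tree 2: split on daily computer use
    return 0.9 if person["uses_computer_daily"] else -0.9

def predict(person, trees=(tree1, tree2)):
    # ensemble prediction = sum of the scores of all trees
    return sum(t(person) for t in trees)

child = {"age": 10, "is_male": True, "uses_computer_daily": True}
grandparent = {"age": 70, "is_male": True, "uses_computer_daily": False}
print(predict(child), predict(grandparent))   # 2.9  -1.9
```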

25 of 50

Tree Ensemble methods

  • Very widely used; look for GBM, random forests, …
    • Almost half of data mining competitions are won using some variant of tree ensemble methods.

  • Invariant to scaling of the inputs, so you do not need to do careful feature normalization.

  • Learn higher-order interactions between features.

  • Scalable, and widely used in industry.

26 of 50

Put into context: Model and Parameters

  • Model: assuming we have K trees,
    ŷ_i = Σ_{k=1..K} f_k(x_i),   f_k ∈ F
    where F is the space of functions containing all regression trees.

Think: a regression tree is a function that maps the attributes to a score.

  • Parameters
    • Include the structure of each tree and the scores in the leaves
    • Or simply use the functions as parameters: Θ = { f_1, f_2, …, f_K }
  • Instead of learning weights in R^d, we are learning functions (trees).

27 of 50

Learning a tree on a single variable

  • How can we learn functions?
  • Define objective (loss, regularization), and optimize it!!
  • Example:
    • Consider a regression tree on a single input t (time)
    • I want to predict whether I like romantic music at time t

[Figure: the model is a regression tree that splits on time. Root split: t < 2011/03/01. If no, predict 0.2. If yes, split again on t < 2010/03/20: predict 1.2 if yes, 1.0 if no. Equivalently, this is a piecewise step function over time.]

28 of 50

Learning a step function

  • Things we need to learn: the splitting positions and the height in each segment.
  • Objective for a single-variable regression tree (step function):
    • Training loss: how well does the function fit the points?
    • Regularization: how do we define the complexity of the function?
      • Number of splitting points, L2 norm of the heights of the segments?

29 of 50

Learning a step function (visually)

30 of 50

Coming back: Objective for Tree Ensemble

  • Model: assuming we have K trees: ŷ_i = Σ_{k=1..K} f_k(x_i),  f_k ∈ F

  • Objective:
    Obj = Σ_{i=1..n} l(y_i, ŷ_i) + Σ_{k=1..K} Ω(f_k)
    (the first term is the training loss, the second the complexity of the trees)

  • Possible ways to define Ω?
    • Number of nodes in the tree, depth
    • L2 norm of the leaf weights
    • … detailed later

31 of 50

Objective vs Heuristic

  • When you talk about (decision) trees, it is usually heuristics:
    • Split by information gain
    • Prune the tree
    • Maximum depth
    • Smooth the leaf values
  • Most heuristics map well to objectives; taking the formal (objective) view lets us know what we are learning:
    • Information gain -> training loss
    • Pruning -> regularization defined by #nodes
    • Max depth -> constraint on the function space
    • Smoothing leaf values -> L2 regularization on leaf weights

32 of 50

Regression Tree is not just for regression!

  • A regression tree ensemble defines how you make the prediction score; it can be used for
    • Classification, Regression, Ranking, …
  • It all depends on how you define the objective function!
  • So far we have learned:
    • Using square loss
      • Will result in the common gradient boosted machine (GBM)
    • Using logistic loss
      • Will result in LogitBoost (both losses are written out below)
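For reference, the standard forms of these two losses (labels y_i ∈ {0, 1} in the logistic case):

```latex
% Square loss (gives the common gradient boosted machine):
l(y_i, \hat{y}_i) = (y_i - \hat{y}_i)^2
% Logistic loss (gives LogitBoost), for labels y_i \in \{0, 1\}:
l(y_i, \hat{y}_i) = y_i \ln\bigl(1 + e^{-\hat{y}_i}\bigr) + (1 - y_i) \ln\bigl(1 + e^{\hat{y}_i}\bigr)
```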

33 of 50

Outline

  • Review of key concepts of supervised learning

  • Regression Tree and Ensemble (What are we Learning)

  • Gradient Boosting (How do we Learn)

  • Summary

34 of 50

Take Home Message for this section

  • Bias-variance tradeoff is everywhere
  • The loss + regularization objective pattern applies to regression tree learning (function learning)

  • We want predictive and simple functions

  • This defines what we want to learn (objective, model).
  • But how do we learn it?
    • Next section

35 of 50

So How do we Learn?

  • Objective: Obj = Σ_i l(y_i, ŷ_i) + Σ_k Ω(f_k)
  • We cannot use methods such as SGD to find f (they are trees, not just numerical vectors)
  • Solution: Additive Training (Boosting)
    • Start from a constant prediction, add a new function each time:

ŷ_i^(0) = 0
ŷ_i^(1) = f_1(x_i) = ŷ_i^(0) + f_1(x_i)
ŷ_i^(2) = f_1(x_i) + f_2(x_i) = ŷ_i^(1) + f_2(x_i)
…
ŷ_i^(t) = Σ_{k=1..t} f_k(x_i) = ŷ_i^(t-1) + f_t(x_i)

Here ŷ_i^(t) is the model at training round t, f_t(x_i) is the new function, and ŷ_i^(t-1) keeps the functions added in previous rounds.

36 of 50

Additive Training

  • How do we decide which f to add? Optimize the objective!
  • The prediction at round t is ŷ_i^(t) = ŷ_i^(t-1) + f_t(x_i), where f_t is what we need to decide in round t.
  • Goal: find f_t to minimize
    Obj^(t) = Σ_i l(y_i, ŷ_i^(t-1) + f_t(x_i)) + Ω(f_t) + constant
  • Consider square loss:
    Obj^(t) = Σ_i ( y_i − (ŷ_i^(t-1) + f_t(x_i)) )^2 + Ω(f_t) + constant
            = Σ_i [ 2 (ŷ_i^(t-1) − y_i) f_t(x_i) + f_t(x_i)^2 ] + Ω(f_t) + constant
    where y_i − ŷ_i^(t-1) is usually called the residual from the previous round.

37 of 50

Taylor Expansion Approximation of Loss

  • Goal: find f_t to minimize Σ_i l(y_i, ŷ_i^(t-1) + f_t(x_i)) + Ω(f_t)
    • This still seems complicated except for the case of square loss
  • Take the Taylor expansion of the objective
    • Recall: f(x + Δx) ≈ f(x) + f′(x) Δx + ½ f″(x) Δx^2
    • Define: g_i = ∂_{ŷ^(t-1)} l(y_i, ŷ^(t-1)),  h_i = ∂²_{ŷ^(t-1)} l(y_i, ŷ^(t-1))
    • Then: Obj^(t) ≈ Σ_i [ l(y_i, ŷ_i^(t-1)) + g_i f_t(x_i) + ½ h_i f_t^2(x_i) ] + Ω(f_t)
  • If you are not comfortable with this, think of square loss: g_i = 2 (ŷ_i^(t-1) − y_i), h_i = 2

  • Compare what we get with the previous slide.

38 of 50

Our New Goal

  • Objective, with constants removed:
    Obj^(t) ≈ Σ_i [ g_i f_t(x_i) + ½ h_i f_t^2(x_i) ] + Ω(f_t)

    • where g_i = ∂_{ŷ^(t-1)} l(y_i, ŷ^(t-1)) and h_i = ∂²_{ŷ^(t-1)} l(y_i, ŷ^(t-1))
  • Why spend so much effort deriving the objective, why not just grow trees?
    • Theoretical benefit: we know what we are learning, and we can reason about convergence
    • Engineering benefit: recall the elements of supervised learning
      • g_i and h_i come from the definition of the loss function
      • The learning of the function only depends on the objective via g_i and h_i
      • Think of how you can separate modules of your code when asked to implement boosted trees for both square loss and logistic loss (see the sketch below)
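A sketch of that modular separation (illustrative; it assumes y and the raw prediction scores are NumPy arrays, and only these two functions depend on the choice of loss):

```python
import numpy as np

def square_loss_grad_hess(y, y_pred):
    # g_i and h_i for the square loss l = (y - y_pred)^2
    g = 2.0 * (y_pred - y)
    h = 2.0 * np.ones_like(y_pred)
    return g, h

def logistic_loss_grad_hess(y, y_pred):
    # g_i and h_i for the logistic loss; y in {0, 1}, y_pred is the raw score
    p = 1.0 / (1.0 + np.exp(-y_pred))
    g = p - y
    h = p * (1.0 - p)
    return g, h
```

The tree-growing code never needs to know which loss is in use; it only consumes the g_i and h_i arrays.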

39 of 50

Refine the definition of tree

  • We define a tree by a vector of scores in the leaves, and a leaf index mapping function that maps an instance to a leaf:
    f_t(x) = w_{q(x)},   w ∈ R^T,   q : R^d → {1, 2, …, T}
    where q is the structure of the tree and w is the vector of leaf weights.

[Figure: a tree splitting on "age < 15" and "is male?", with leaves 1, 2, 3 and leaf weights w1 = +2, w2 = 0.1, w3 = -1. The mapping q assigns one pictured instance to leaf 1 (q(·) = 1) and another to leaf 3 (q(·) = 3).]
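A minimal sketch of this (q, w) representation (illustrative; q is written out as an explicit routing function and w as an array of leaf scores, using 0-based indices):

```python
import numpy as np

w = np.array([2.0, 0.1, -1.0])   # leaf weights w_1, w_2, w_3

def q(x):
    """Leaf index mapping for the example tree (0-based)."""
    if x["age"] < 15:
        return 0 if x["is_male"] else 1   # leaf 1 or leaf 2
    return 2                              # leaf 3

def f(x):
    # the tree's prediction is the weight of the leaf the instance falls into
    return w[q(x)]

print(f({"age": 10, "is_male": True}))    # 2.0
```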

40 of 50

Define Complexity of a Tree

  • Define complexity as (this is not the only possible definition):
    Ω(f_t) = γ T + ½ λ Σ_{j=1..T} w_j^2
    where T is the number of leaves and the second term is the L2 norm of the leaf scores.

[Figure: the same example tree, splitting on "age < 15" and "is male?", with leaf weights w1 = +2, w2 = 0.1, w3 = -1; a worked example follows.]
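Plugging the leaf weights from the figure into this definition gives, as a worked example:

```latex
\Omega(f_t) = \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} w_j^2
            = 3\gamma + \tfrac{1}{2}\lambda \left(2^2 + 0.1^2 + (-1)^2\right)
            = 3\gamma + 2.505\,\lambda
```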

41 of 50

Revisit the Objectives

  • Define the instance set in leaf j as I_j = { i : q(x_i) = j }
  • Regroup the objective by each leaf:
    Obj^(t) ≈ Σ_i [ g_i w_{q(x_i)} + ½ h_i w_{q(x_i)}^2 ] + γ T + ½ λ Σ_{j=1..T} w_j^2
            = Σ_{j=1..T} [ (Σ_{i∈I_j} g_i) w_j + ½ (Σ_{i∈I_j} h_i + λ) w_j^2 ] + γ T
  • This is a sum of T independent quadratic functions, one per leaf.

42 of 50

The Structure Score

  • Two facts about a single-variable quadratic function (for H > 0):
    argmin_x [ G x + ½ H x^2 ] = − G / H,    min_x [ G x + ½ H x^2 ] = − G^2 / (2 H)
  • Let us define G_j = Σ_{i∈I_j} g_i and H_j = Σ_{i∈I_j} h_i
  • Assume the structure of the tree (q(x)) is fixed; then the optimal weight in each leaf and the resulting objective value are
    w_j* = − G_j / (H_j + λ),    Obj = − ½ Σ_{j=1..T} G_j^2 / (H_j + λ) + γ T

This measures how good a tree structure is! (A small sketch follows.)
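A small Python sketch of these two formulas (illustrative function names; the inputs are the per-leaf sums G_j and H_j):

```python
import numpy as np

def optimal_leaf_weight(G, H, lam):
    # w_j* = -G_j / (H_j + lambda)
    return -G / (H + lam)

def structure_score(G_leaves, H_leaves, lam, gamma):
    # Obj = -1/2 * sum_j G_j^2 / (H_j + lambda) + gamma * T
    G = np.asarray(G_leaves, dtype=float)
    H = np.asarray(H_leaves, dtype=float)
    return -0.5 * np.sum(G**2 / (H + lam)) + gamma * len(G)
```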

43 of 50

The Structure Score Calculation

[Figure: the example tree ("age < 15", "is male?") with five instances, indexed 1–5, routed to its leaves; each instance i carries its gradient statistics g_i, h_i. For every leaf j, accumulate G_j = Σ_{i∈I_j} g_i and H_j = Σ_{i∈I_j} h_i and plug them into the structure score.]

The smaller the score is, the better the structure is.

44 of 50

Searching Algorithm for Single Tree

  • Enumerate the possible tree structures q
  • Calculate the structure score for each q, using the scoring equation
  • Find the best tree structure, and use the optimal leaf weights
  • But… there can be infinitely many possible tree structures…

45 of 50

Greedy Learning of the Tree

  • In practice, we grow the tree greedily:
    • Start from a tree with depth 0
    • For each leaf node of the tree, try to add a split. The change of objective after adding the split is
      Gain = ½ [ G_L^2 / (H_L + λ) + G_R^2 / (H_R + λ) − (G_L + G_R)^2 / (H_L + H_R + λ) ] − γ
      i.e. (score of the left child) + (score of the right child) − (score if we do not split), minus the complexity cost γ of introducing an additional leaf (see the helper sketched below).

  • Remaining question: how do we find the best split?
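A direct transcription of the gain formula above into an illustrative helper (G_L, H_L and G_R, H_R are the sums of g and h in the two children):

```python
def split_gain(G_L, H_L, G_R, H_R, lam, gamma):
    # score(left) + score(right) - score(no split), minus the cost of one extra leaf
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R) - score(G_L + G_R, H_L + H_R)) - gamma
```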

46 of 50

Efficient Finding of the Best Split

  • For a candidate split rule x_j < a (say x_j is age), all we need are the sums of g and h on each side:
    G_L = Σ_{i : x_{ij} < a} g_i,   H_L = Σ_{i : x_{ij} < a} h_i   (and G_R, H_R for the remaining instances),
    and then we calculate
    Gain = ½ [ G_L^2 / (H_L + λ) + G_R^2 / (H_R + λ) − (G_L + G_R)^2 / (H_L + H_R + λ) ] − γ

  • A left-to-right linear scan over the instances sorted by the feature value is enough to decide the best split along that feature (a sketch follows).

[Figure: instances sorted by the feature value, with their statistics (g_1, h_1), (g_4, h_4), (g_2, h_2), (g_5, h_5), (g_3, h_3); each position of the threshold a between consecutive instances is a candidate split.]
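A sketch of the linear-scan split search for a single feature (illustrative; it sorts the instances by feature value and sweeps running sums of g and h from left to right):

```python
import numpy as np

def best_split_on_feature(x, g, h, lam, gamma):
    """Return (best_gain, threshold) for splits of the form x < a on one feature."""
    order = np.argsort(x)
    x_s, g_s, h_s = x[order], g[order], h[order]
    G_total, H_total = g_s.sum(), h_s.sum()

    def score(G, H):
        return G * G / (H + lam)

    best_gain, best_thresh = 0.0, None
    G_L = H_L = 0.0
    for i in range(len(x_s) - 1):              # candidate split between i and i + 1
        G_L += g_s[i]
        H_L += h_s[i]
        if x_s[i] == x_s[i + 1]:               # cannot split between equal values
            continue
        G_R, H_R = G_total - G_L, H_total - H_L
        gain = 0.5 * (score(G_L, H_L) + score(G_R, H_R) - score(G_total, H_total)) - gamma
        if gain > best_gain:
            best_gain, best_thresh = gain, 0.5 * (x_s[i] + x_s[i + 1])
    return best_gain, best_thresh
```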

47 of 50

An Algorithm for Split Finding

  • For each node, enumerate over all features
    • For each feature, sort the instances by feature value
    • Use a linear scan to decide the best split along that feature
    • Take the best split solution over all the features

  • Time complexity of growing a tree of depth K
    • It is O(n d K log n): for each level we need O(n log n) time to sort, there are d features, and we do it for K levels
    • This can be further optimized (e.g. use approximation or cache the sorted features)
    • Can scale to very large datasets

48 of 50

What about Categorical Variables?

  • Some tree learning algorithms handle categorical and continuous variables separately
    • We can easily use the scoring formula we derived to score splits based on categorical variables.
  • Actually, it is not necessary to handle categorical variables separately.
    • We can encode the categorical variables into a numerical vector using one-hot encoding: allocate a vector whose length is the number of categories.
  • The vector will be sparse if there are many categories, so it is preferable that the learning algorithm handle sparse data.

49 of 50

Pruning and Regularization

  • Recall the gain of a split: it can be negative!
    • This happens when the training loss reduction is smaller than the regularization.
    • Trade-off between simplicity and predictiveness.
  • Pre-stopping
    • Stop splitting if the best split has negative gain
    • But maybe a split can benefit future splits…
  • Post-pruning
    • Grow a tree to maximum depth, then recursively prune all the leaf splits with negative gain

50 of 50

Recap: Boosted Tree Algorithm

  • Add a new tree in each iteration
  • At the beginning of each iteration, calculate
    g_i = ∂_{ŷ^(t-1)} l(y_i, ŷ_i^(t-1)),   h_i = ∂²_{ŷ^(t-1)} l(y_i, ŷ_i^(t-1))

  • Use the statistics to greedily grow a tree f_t(x) that minimizes
    Obj = − ½ Σ_{j=1..T} G_j^2 / (H_j + λ) + γ T

  • Add f_t to the model: ŷ_i^(t) = ŷ_i^(t-1) + f_t(x_i)
    • Usually, we instead do ŷ_i^(t) = ŷ_i^(t-1) + ε f_t(x_i)
    • ε is called the step-size or shrinkage, usually set around 0.1
    • This means we do not do full optimization in each step and reserve the chance for future rounds; it helps prevent overfitting (an end-to-end sketch follows).
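Tying the recap together, a minimal end-to-end sketch under simplifying assumptions (square loss only, depth-1 trees as each f_t, illustrative names such as `fit_stump`):

```python
import numpy as np

def square_loss_grad_hess(y, y_pred):
    # g_i and h_i for the square loss
    return 2.0 * (y_pred - y), 2.0 * np.ones_like(y_pred)

def fit_stump(X, g, h, lam=1.0, gamma=0.0):
    """Greedily grow a depth-1 tree: pick the single split with the best gain,
    then set the leaf weights to w* = -G / (H + lambda)."""
    n, d = X.shape
    score = lambda G, H: G * G / (H + lam)
    G_tot, H_tot = g.sum(), h.sum()
    best = None                                   # (gain, feature, threshold, G_L, H_L)
    for j in range(d):
        order = np.argsort(X[:, j])
        xs, gs, hs = X[order, j], g[order], h[order]
        G_L = H_L = 0.0
        for i in range(n - 1):
            G_L += gs[i]
            H_L += hs[i]
            if xs[i] == xs[i + 1]:
                continue
            gain = 0.5 * (score(G_L, H_L) + score(G_tot - G_L, H_tot - H_L)
                          - score(G_tot, H_tot)) - gamma
            if best is None or gain > best[0]:
                best = (gain, j, 0.5 * (xs[i] + xs[i + 1]), G_L, H_L)
    if best is None or best[0] <= 0:              # no useful split: a single leaf
        w0 = -G_tot / (H_tot + lam)
        return lambda Z: np.full(len(Z), w0)
    _, j, thr, G_L, H_L = best
    w_L = -G_L / (H_L + lam)
    w_R = -(G_tot - G_L) / (H_tot - H_L + lam)
    return lambda Z: np.where(Z[:, j] < thr, w_L, w_R)

def boosted_trees(X, y, n_rounds=50, eta=0.1):
    y_pred = np.zeros(len(y))                     # start from a constant prediction
    trees = []
    for _ in range(n_rounds):
        g, h = square_loss_grad_hess(y, y_pred)   # statistics at the start of the round
        f_t = fit_stump(X, g, h)                  # greedily grow a (small) tree
        y_pred += eta * f_t(X)                    # add it with shrinkage eta
        trees.append(f_t)
    return trees
```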