Hyperparameter optimization

with a focus on the TPE, HyperBand (HB) and BOHB methods

Bayesian Optimization

Bayesian approaches keep track of past evaluation results and use them to build a probabilistic model (the surrogate) that maps hyperparameters to a probability distribution over scores of the objective function. The general loop, sketched in code after the list, is:

  1. Build a surrogate probability model of the objective function
  2. Find the hyperparameters that perform best on the surrogate
  3. Apply these hyperparameters to the true objective function
  4. Update the surrogate model incorporating the new results
  5. Repeat steps 2-4 until max iterations or time is reached
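A minimal sketch of this loop, assuming a toy 1-D objective and a scikit-learn Gaussian process as the surrogate (the objective, the candidate grid and the Expected-Improvement rule below are illustrative assumptions, not code from the slides):

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def objective(x):
    # Stand-in for an expensive score, e.g. the validation loss of a model.
    return np.sin(3 * x) + 0.1 * x ** 2

rng = np.random.default_rng(0)
bounds = (-3.0, 3.0)                         # 1-D hyperparameter domain
X = list(rng.uniform(*bounds, size=3))       # a few initial random evaluations
y = [objective(x) for x in X]

for _ in range(20):
    # 1. build/update the surrogate from the history of (hyperparameter, score) pairs
    gp = GaussianProcessRegressor(normalize_y=True).fit(np.reshape(X, (-1, 1)), y)

    # 2. pick the candidate that looks best under the surrogate (max Expected Improvement)
    cand = rng.uniform(*bounds, size=1000).reshape(-1, 1)
    mu, sigma = gp.predict(cand, return_std=True)
    z = (min(y) - mu) / np.maximum(sigma, 1e-9)
    ei = (min(y) - mu) * norm.cdf(z) + sigma * norm.pdf(z)
    x_next = float(cand[np.argmax(ei), 0])

    # 3. + 4. evaluate the true objective and add the result to the history
    X.append(x_next)
    y.append(objective(x_next))

print("best found:", X[int(np.argmin(y))], min(y))
```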

Sequential Model-Based Optimization

SMBO methods are a formalization of Bayesian optimization.

There are five main aspects:

  • A domain of hyperparameters over which to search
  • An objective function which takes in hyperparameters and outputs a score that we want to minimize (or maximize)
  • The surrogate model of the objective function
  • A criterion, called the acquisition function, for choosing which hyperparameters to try next based on the surrogate model
  • A history consisting of (score, hyperparameter) pairs used by the algorithm to update the surrogate model

Example domain consisting of several 1-dimensional distributions
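The slides only show this domain as a figure; purely as an illustration, a comparable search space of 1-dimensional distributions could be declared with hyperopt's `hp` module (hyperopt and the specific parameter ranges are assumptions here, not taken from the slides):

```python
from hyperopt import hp
from hyperopt.pyll.stochastic import sample

# One 1-dimensional distribution per hyperparameter.
space = {
    "learning_rate": hp.loguniform("learning_rate", -7, 0),    # e^-7 .. e^0
    "n_estimators": hp.quniform("n_estimators", 10, 500, 10),  # multiples of 10
    "max_depth": hp.choice("max_depth", [3, 5, 7, None]),
}

# Draw a few random configurations from the domain to see what it produces.
for _ in range(3):
    print(sample(space))
```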

Response surface (surrogate function) for AdaBoost

Acquisition function

The acquisition function is the criterion by which the next set of hyperparameters is chosen from the surrogate model. The most common choice is Expected Improvement, which for a threshold y^* on the score can be written as:

EI_{y^*}(x) = \int_{-\infty}^{y^*} (y^* - y)\, p(y \mid x)\, dy

The aim is to maximize the Expected Improvement with respect to x, i.e. to find the hyperparameters that look best under the surrogate model p(y | x).

Tree-structured Parzen Estimator

The Tree-structured Parzen Estimator (TPE) is a Bayesian optimization method that models the surrogate the other way around, as p(x | y) instead of p(y | x):

p(x \mid y) = \begin{cases} l(x) & y < y^* \\ g(x) & y \ge y^* \end{cases}

where the density l(x) is fit to the configurations whose scores beat the threshold y^*, g(x) is fit to the remaining ones, and both are modeled with kernel density estimation.

Kernel Density Estimation example
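As a stand-in for the figure, a kernel density estimate can be fit in a few lines with scipy (the data below is made up purely for illustration):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
# Made-up 1-D observations, e.g. learning rates that produced good scores.
samples = np.concatenate([rng.normal(0.01, 0.003, 30),
                          rng.normal(0.05, 0.010, 20)])

kde = gaussian_kde(samples)        # Gaussian kernels, bandwidth picked automatically
grid = np.linspace(0.0, 0.1, 200)
density = kde(grid)                # estimated density over the grid

print("mode of the estimate:", grid[np.argmax(density)])
```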

So we’re maximizing l(x)/g(x): working the Expected Improvement integral through with this surrogate shows that the most promising candidates are those that are likely under l(x) (the well-scoring configurations) and unlikely under g(x) (the rest).
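A compact sketch of that selection step, with the quantile split, the two KDE fits and the l(x)/g(x) scoring written out; the 25% split, the candidate count and the assumption that a couple of observations already exist on each side of the split are illustrative choices, not values from the slides:

```python
import numpy as np
from scipy.stats import gaussian_kde

def tpe_suggest(xs, ys, bounds, gamma=0.25, n_candidates=64):
    """Suggest the next 1-D hyperparameter value, TPE-style (lower score = better)."""
    xs, ys = np.asarray(xs, float), np.asarray(ys, float)
    # Split past observations at the gamma-quantile of the scores.
    y_star = np.quantile(ys, gamma)
    good, bad = xs[ys < y_star], xs[ys >= y_star]
    l, g = gaussian_kde(good), gaussian_kde(bad)   # l(x): good configs, g(x): the rest
    # Draw candidates from l(x) and keep the one with the largest ratio l(x)/g(x).
    cand = np.clip(l.resample(n_candidates).ravel(), *bounds)
    return float(cand[np.argmax(l(cand) / np.maximum(g(cand), 1e-12))])
```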

Bandit-based Optimization

Successive Halving

  1. Randomly sample a set of hyperparameter configurations
  2. Evaluate the performance of all currently remaining configurations on the current budget
  3. Throw out the worst-scoring half of the configurations
  4. Increase the budget per configuration (typically by doubling it), go back to step 2, and repeat until only one configuration remains (a code sketch follows below)
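A minimal sketch of successive halving, where `evaluate(config, budget)` is a hypothetical function returning a loss for a configuration trained under a given budget (the names, the doubling factor and the toy usage are assumptions for illustration):

```python
import random

def successive_halving(sample_config, evaluate, n_configs=27, min_budget=1):
    """Keep the best half of the configurations each round while doubling the budget."""
    configs = [sample_config() for _ in range(n_configs)]
    budget = min_budget
    while len(configs) > 1:
        # Evaluate every remaining configuration on the current budget.
        scores = [(evaluate(c, budget), c) for c in configs]
        scores.sort(key=lambda sc: sc[0])                 # lower loss = better
        configs = [c for _, c in scores[: max(1, len(scores) // 2)]]
        budget *= 2                                       # survivors get more budget
    return configs[0]

# Toy usage: "configurations" are learning rates and the loss improves with budget.
best = successive_halving(
    sample_config=lambda: 10 ** random.uniform(-4, -1),
    evaluate=lambda lr, b: abs(lr - 0.01) + 1.0 / b)
print(best)
```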

HyperBand

HyperBand divides the total budget into several combinations of number of configurations versus budget per configuration, and then calls successive halving as a subroutine on each such set of random configurations.

Hyperband does away with the “n vs B/n” dilemma by considering several possible values of n for a fixed budget B, in essence performing a grid search over feasible values of n.
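A sketch of how those combinations can be derived, following the (n, r) schedule from the Hyperband paper; `max_budget=81` and `eta=3` are just example values, and each bracket would then be handed to successive halving with elimination rate eta:

```python
import math

def hyperband_brackets(max_budget, eta=3):
    """Yield (number of configurations, starting budget) for each Hyperband bracket."""
    s_max = 0
    while eta ** (s_max + 1) <= max_budget:               # s_max = floor(log_eta(max_budget))
        s_max += 1
    for s in range(s_max, -1, -1):
        n = math.ceil((s_max + 1) * eta ** s / (s + 1))   # configurations in this bracket
        r = max_budget / eta ** s                         # budget each of them starts with
        yield n, r

for n, r in hyperband_brackets(max_budget=81, eta=3):
    print(f"{n} configurations, starting budget {r:g}")
```

With these values the brackets range from many configurations on a tiny budget (81 configurations starting at budget 1) down to a few configurations run on the full budget (5 configurations at budget 81), which is exactly the trade-off the grid over n explores.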

Why not both? (BOHB)

BOHB

  • Combines Bayesian optimization (BO) and Hyperband (HB)
  • The BO part is handled by a variant of TPE that uses a single multidimensional kernel density estimate with a product kernel (which is quite different from TPE’s product of univariate distributions)
  • BOHB relies on HB to determine how many configurations to evaluate with which budget
  • The random selection of configurations at the beginning of each HB iteration is replaced by a model-based search, where the model is trained on the configurations evaluated so far (see the sketch below).
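A rough sketch of how the model-based search can slot into the Hyperband loop, reusing the `tpe_suggest` and `hyperband_brackets` helpers from the earlier sketches; the warm-up rule, the shared history and the minimum number of observations are simplifying assumptions, not the exact heuristics of the BOHB paper:

```python
import random

def suggest(history, bounds, min_points=10):
    # Fall back to random sampling until enough (hyperparameter, loss) pairs exist,
    # then use the TPE-style KDE model from the earlier sketch.
    if len(history) < min_points:
        return random.uniform(*bounds)
    xs, ys = zip(*history)
    return tpe_suggest(xs, ys, bounds)

def bohb(evaluate, bounds, max_budget=81, eta=3):
    history = []                                          # shared across all brackets
    for n, r in hyperband_brackets(max_budget, eta):
        configs = [suggest(history, bounds) for _ in range(n)]
        budget = r
        while configs:
            # Successive halving on this bracket, feeding every result back to the model.
            scores = sorted((evaluate(c, budget), c) for c in configs)
            history.extend((c, s) for s, c in scores)
            configs = [c for _, c in scores[: len(scores) // eta]]
            budget *= eta
    return min(history, key=lambda cs: cs[1])[0]          # best configuration seen
```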

Thank you for your attention
