1 of 41

Part 3: AutoML

Iddo Drori Joaquin Vanschoren

MIT TU Eindhoven

AAAI 2021

https://sites.google.com/mit.edu/aaai2021metalearningtutorial

Meta Learning Tutorial

2 of 41

Iddo Drori (MIT) and Joaquin Vanschoren (TU Eindhoven)

Meta-Learning for AutoML

AAAI 2021 Tutorial, part 3

Cover art: MC Escher, Ustwo Games

3 of 41

Task 1

Learning

Task 2

Learning

Task 3

Learning

Task_new

Learning

x,y

Recap: What can we learn to learn?

1. architectures / pipelines

(hyperparameters, structures)

Clune 2019

Focus of this part

See part 2

From hand-designed to learned learning algorithms … to AI-generating algorithms?

new tasks

(of own choosing)

experience

2. learning algorithms

(priors, task embeddings,…)

3. learning environments

(curricula, self-exploration)

bias

4 of 41

Machine Learning

Task 1

Model

Learning algorithm

Task: distribution of samples q(x)

outputs y, loss ℒ(x,y)

Task 2

Model

Learning algorithm

ℒ_T1(f_ɸ1,λ(x),y)

ℒ_T2

x,y

training

∇_ɸ

ɸ’₁

ɸ’₂

f_ɸ1(x)

f_ɸ2(x)

Learner: model parameters ɸ,

hyper-parameters λ

When the new task is quite different, (meta-)learn the hyper-parameters λ

λ₁

λ₂

When the new task is quite similar, keep λ, (meta-)learn the model parameters ɸ

Neural architectures,

pipelines, other

hyperparameters, …

Note: we can also learn λ and ɸ at the same time (bilevel optimization)
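
The bilevel view in this note can be written out as a nested optimization problem. This is the standard textbook form rather than something taken from the slides; splitting the loss into a training part and a validation part is an assumption.

```latex
\begin{aligned}
\lambda^{*} &= \arg\min_{\lambda} \; \mathcal{L}_{\mathrm{val}}\big(f_{\phi^{*}(\lambda),\,\lambda}\big) \\
\text{s.t.}\quad \phi^{*}(\lambda) &= \arg\min_{\phi} \; \mathcal{L}_{\mathrm{train}}\big(f_{\phi,\,\lambda}\big)
\end{aligned}
```

The inner problem trains the model parameters ɸ for a fixed λ; the outer problem picks the λ whose trained model does best on held-out data.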

5 of 41

Task

Models

performance

Human expert

Models

manual trial and error

(and intuition)

Task

Models

performance

Learning and optimization

Models

automated, efficient search for best models

Manual machine learning

Models

Automatic Machine Learning (AutoML)

Hutter et al. 2019

AutoML: build models in a data-driven, intelligent, purposeful way

6 of 41

AutoML example: Pipeline synthesis

Cleaning, preprocessing, feature selection/engineering, model selection, hyperparameter tuning, adapting to concept drift, …

Figure source: Nick Gillian

7 of 41

Type of operators
Size of layers
Filter sizes
Skip connections
Pre-trained layers
Transformers
…

Gradient descent hyperparameters
Regularization
…

AutoML example: Neural Architecture Search

Architecture:

Optimization:

Figure source: Elsken et al., 2018

8 of 41

Task 1..N

Models

performance

AutoML

Models

Meta-learn how to design architectures/pipelines and tune hyperparameters

Human data scientists also learn from experience

Models

AutoML + meta-learning

Hutter et al. 2019

New task

Models

performance

self-learning AutoML

Models

bias

(priors, meta-knowledge, human priors)

Search space can be huge!

9 of 41

Meta-learning for AutoML: how?

Learning hyperparameter priors

Warm starting (what works on similar tasks?)

start randomly

start with

good candidates

Meta-models (learn how to build models/components)

Complex

hyperparameter space

Simple

hyperparameter space

Vanschoren 2018

Task

λ, scores

Learner

metadata

hyperparameters λ = architecture + other hyperparameters

Task

λ, scores

Learner

metadata

Task

10 of 41

Observation:

current AutoML strongly depends on learned priors

Complex

hyperparameter space

Simple

hyperparameter space

observation

11 of 41

Manual architecture priors

Most successful pipelines have a similar structure

autosklearn Feurer et al. 2015

autoWEKA Thornton et al. 2013

hyperopt-sklearn Komer et al. 2014

AutoGluon-Tabular Erickson et al. 2020

+ smaller search space

- you can’t learn entirely new architectures

Fix architecture, encode all choices as extra hyperparameters

Architecture search becomes hyperparameter optimization
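
A minimal sketch of what "encode all choices as extra hyperparameters" can look like. The pipeline template, the component names, and the value ranges are hypothetical and do not come from any of the cited systems.

```python
import math
import random

# Hypothetical fixed pipeline template: imputer -> scaler -> classifier.
# Every structural choice is just another categorical hyperparameter.
SEARCH_SPACE = {
    "imputer": ["mean", "median"],
    "scaler": ["none", "standard", "minmax"],
    "classifier": ["svm", "random_forest"],
}

# Conditional hyperparameters: only active for the chosen classifier.
CONDITIONAL = {
    "svm": {"C": (1e-3, 1e3)},                # float range, sampled log-uniformly
    "random_forest": {"max_depth": (2, 20)},  # integer range, sampled uniformly
}


def sample_configuration(rng):
    """Draw one pipeline configuration from the flattened search space."""
    config = {name: rng.choice(options) for name, options in SEARCH_SPACE.items()}
    for name, (low, high) in CONDITIONAL[config["classifier"]].items():
        if isinstance(low, int):
            config[name] = rng.randint(low, high)
        else:
            config[name] = math.exp(rng.uniform(math.log(low), math.log(high)))
    return config


rng = random.Random(0)
configs = [sample_configuration(rng) for _ in range(50)]
```

Any hyperparameter optimizer that handles categorical and conditional parameters can now search this space, which is the sense in which architecture search becomes hyperparameter optimization.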

Can we meta-learn a prior over successful structures?

Ensembling/stacking

Figure source: Feurer et al. 2015

12 of 41

Manual architecture priors

Parameterized Sequential

Parameterized Graph

Choose:

number of layers
type of layers

dense
convolutional
max-pooling
…

hyperparameters of layers

+ easier to search

- sometimes too simple

Choose:

branching
joins
skip connections
types of layers
hyperparameters of layers

+ more flexible

- much harder to search

Elsken et al. 2019

13 of 41

Manual architecture priors

Successful deep networks often have repeated motifs (cells)

e.g. Inception v4:


Figure source: Szegedy et al 2016

14 of 41

Cell search space prior

Google NASNet Zoph et al 2018

Compositionality: learn hierarchical building blocks to simplify the task

learn parameterized building blocks (cells)
stack cells together in macro-architecture

+ smaller search space

+ cells can be learned on a small dataset & transferred to a larger dataset

- strong domain priors, doesn’t generalize well

Cell search space

Can we meta-learn hierarchies / components that generalize better?

Figure source: Elsken et al., 2019

15 of 41

Cell search space prior

NASNet, Zoph et al 2018

Figure source: Zoph et al., 2018

16 of 41

Cell construction with neuro-evolution (SOTA ImageNet)

AmoebaNet, Real et al 2019

normal cell

reduction cell

Cell search space prior

Figure source: Real et al., 2019

17 of 41

If you constrain the search space enough, you can get SOTA results with random search!

Li & Talwalkar 2019

Yu et al. 2019

Real et al. 2019

Cell search space prior

Cell construction with multi-fidelity random search!
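
One common form of multi-fidelity random search is successive halving, sketched below. It is an illustration of the general idea, not necessarily the exact procedure of the cited papers; `toy_evaluate` is a hypothetical stand-in for training a sampled cell for a given budget.

```python
import random


def toy_evaluate(config, budget):
    """Hypothetical stand-in for 'train for `budget` epochs, return validation score'.
    The score approaches the configuration's true quality as the budget grows."""
    return config["quality"] * (1.0 - 1.0 / (1.0 + budget))


def successive_halving(configs, min_budget=1, eta=2):
    """Evaluate all configs cheaply, keep the best 1/eta, raise the budget, repeat."""
    budget = min_budget
    survivors = list(configs)
    while len(survivors) > 1:
        ranked = sorted(survivors, key=lambda c: toy_evaluate(c, budget), reverse=True)
        survivors = ranked[: max(1, len(ranked) // eta)]
        budget *= eta  # survivors get a larger budget in the next round
    return survivors[0]


rng = random.Random(0)
# Random search: candidates are sampled uniformly, with no learned model guiding them.
candidates = [{"id": i, "quality": rng.random()} for i in range(16)]
best = successive_halving(candidates)
```

Most of the budget is spent on the few candidates that survive the cheap early rounds, which is why plain random sampling can stay competitive in a well-constrained cell space.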

Figure source: Li & Talwalkar, 2019

18 of 41

Weight-agnostic neural networks

ALL weights are shared
Only evolve the architecture?

Minimal description length
Baldwin effect?

Manual priors: Weight sharing

Gaier & Ha 2019

Figure source: Gaier & Ha, 2019

19 of 41

Learning hyperparameter priors

Complex

hyperparameter space

Simple

hyperparameter space

λ, scores

Learner

20 of 41

Learn hyperparameter importance

Functional ANOVA ¹

Select hyperparameters that cause variance in the evaluations.
Useful to speed up black-box optimization techniques
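
A simplified sketch of the variance decomposition behind functional ANOVA. It assumes the scores are available on a full grid, whereas the cited work fits a surrogate model to existing evaluations; the toy response surface and the hyperparameter names are assumptions.

```python
import numpy as np


def main_effect_importance(scores):
    """scores: performance on a full grid, one array axis per hyperparameter.
    Returns, per hyperparameter, the fraction of the total variance
    explained by its main (marginal) effect."""
    total_var = scores.var()
    fractions = []
    for axis in range(scores.ndim):
        other_axes = tuple(a for a in range(scores.ndim) if a != axis)
        marginal = scores.mean(axis=other_axes)  # average out all other hyperparameters
        fractions.append(marginal.var() / total_var)
    return np.array(fractions)


# Toy response surface (an assumption): learning rate matters a lot, momentum little.
learning_rate = np.linspace(-1, 1, 11)[:, None]
momentum = np.linspace(-1, 1, 11)[None, :]
scores = -(learning_rate ** 2) - 0.1 * momentum ** 2
importance = main_effect_importance(scores)
```

Hyperparameters with a near-zero fraction can be fixed to a default, which shrinks the space a black-box optimizer has to search.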

¹ van Rijn & Hutter 2018

ResNets for image classification

Figure source: van Rijn & Hutter, 2018

21 of 41

Learn defaults + hyperparameter importance

Tunability ¹ ² ³: learn good defaults, measure importance as the improvement gained via tuning

¹ Probst et al. 2018

² Weerts et al. 2020

³ van Rijn et al. 2018

Learned defaults

Tuning risk
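
A rough sketch of learned defaults and tunability on a small performance matrix. The numbers are made up, and the definitions are simplified relative to the cited papers.

```python
import numpy as np

# Hypothetical performance matrix: P[i, j] = score of configuration i on task j.
P = np.array([
    [0.80, 0.70, 0.90],
    [0.85, 0.60, 0.88],
    [0.70, 0.75, 0.92],
])

# Learned default: the single configuration with the best average score over prior tasks.
default_idx = int(np.argmax(P.mean(axis=1)))

# Tunability: average gain of per-task tuning over that default.
gain_per_task = P.max(axis=0) - P[default_idx]
tunability = float(gain_per_task.mean())
```

A small tunability value suggests that the learned default is already close to the best achievable score, so tuning that algorithm (or hyperparameter) is a low priority.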

22 of 41

Bayesian Optimization (interlude)

Mockus, 1974

performance

23 of 41

Repeat until some stopping criterion:

Fixed budget
Convergence
EI threshold

Theoretical guarantees

Also works for non-convex, noisy data
Used in AlphaGo

Srinivas et al. 2010, de Freitas et al. 2012, Kawaguchi et al. 2016

Bayesian Optimization

Figure source: Shahriari 2016
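
The loop on this slide can be sketched in a few lines. This is a minimal illustration under assumptions, not a production implementation: a zero-mean Gaussian process with a fixed RBF kernel, a 1-D search space discretized on a grid, expected improvement (EI) as the acquisition function, and a fixed budget plus an EI threshold as stopping criteria.

```python
import math
import numpy as np


def rbf_kernel(a, b, length_scale=0.2):
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)


def gp_posterior(x_obs, y_obs, x_query, noise=1e-6):
    """Posterior mean and standard deviation of a zero-mean GP surrogate."""
    K = rbf_kernel(x_obs, x_obs) + noise * np.eye(len(x_obs))
    K_s = rbf_kernel(x_obs, x_query)
    solved = np.linalg.solve(K, K_s)
    mean = solved.T @ y_obs
    var = 1.0 - np.sum(K_s * solved, axis=0)
    return mean, np.sqrt(np.clip(var, 1e-12, None))


def expected_improvement(mean, std, best):
    """Expected improvement over the best observed value (maximization)."""
    z = (mean - best) / std
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    pdf = np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)
    return (mean - best) * cdf + std * pdf


def bayesian_optimization(objective, n_init=3, budget=15, ei_threshold=1e-3):
    grid = np.linspace(0.0, 1.0, 201)        # candidate configurations
    x_obs = np.linspace(0.0, 1.0, n_init)    # initial design
    y_obs = np.array([objective(x) for x in x_obs])
    for _ in range(budget):                  # stopping criterion: fixed budget
        mean, std = gp_posterior(x_obs, y_obs, grid)
        ei = expected_improvement(mean, std, y_obs.max())
        if ei.max() < ei_threshold:          # stopping criterion: EI threshold
            break
        x_next = grid[int(np.argmax(ei))]
        x_obs = np.append(x_obs, x_next)
        y_obs = np.append(y_obs, objective(x_next))
    return x_obs[int(np.argmax(y_obs))], y_obs.max()


# Toy objective (an assumption): best configuration at x = 0.3.
x_best, y_best = bayesian_optimization(lambda x: -(x - 0.3) ** 2)
```

Each iteration refits the surrogate to all evaluations so far and then evaluates the configuration where improvement over the current best is most likely to pay off.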

24 of 41

Learn basis expansions for hyperparameters

Hyperparameters can interact in very non-linear ways
Use a neural net to learn a suitable basis expansion ϕ_z(λ) for all tasks
On top of this shared basis, a simple Bayesian linear model per task suffices as surrogate; the shared basis transfers information about the configuration space

Perrone et al. 2018

Bayesian Linear surrogate

φ_z(λ)_i

λ_i,scores

Learn basis expansion on lots of data (e.g. OpenML)

φ_z(λ)

Gaussian Processes surrogate

25 of 41

Surrogate model transfer

If task j is similar to the new task, its surrogate model S_j will likely transfer well
Sum up all S_j predictions, weighted by task similarity (as in active testing)¹
Build combined Gaussian process, weighted by current performance on new task²
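
The weighted combination S = ∑ w_j S_j can be sketched as follows. The similarity measure (a Gaussian kernel on meta-feature distance) and the toy surrogates are assumptions chosen for illustration; the cited methods define their weights differently.

```python
import numpy as np


def combined_surrogate(surrogates, weights):
    """S(lam) = sum_j w_j * S_j(lam), with the weights normalized to sum to one."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return lambda lam: sum(w_j * s_j(lam) for w_j, s_j in zip(w, surrogates))


def similarity_weights(prior_meta_features, new_meta_features, bandwidth=1.0):
    """Hypothetical task similarity: Gaussian kernel on meta-feature distance."""
    diff = np.asarray(prior_meta_features, dtype=float) - np.asarray(new_meta_features, dtype=float)
    distance = np.linalg.norm(diff, axis=1)
    return np.exp(-0.5 * (distance / bandwidth) ** 2)


# Toy per-task surrogates S_j, assumed to be already fitted on each prior task.
surrogates = [lambda lam: -(lam - 0.2) ** 2, lambda lam: -(lam - 0.8) ** 2]
weights = similarity_weights([[0.0, 0.0], [3.0, 3.0]], [0.1, 0.0])
S = combined_surrogate(surrogates, weights)
```

Here the new task sits close to the first prior task, so S is dominated by S₁ and prefers configurations near 0.2.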

Tasks

Models

performance

Learning

New Task

meta-learner

Models

performance

per task t_j:

P_i,j


¹ Wistuba et al. 2018

λ_i

S_j

² Feurer et al. 2018

S= ∑ w_jS_j

S₁

S₂

S₃

λ_i

26 of 41

Store surrogate model S_ij for every pair of task i and algorithm j
Simpler surrogates, better transfer
Learn weighted ensemble -> significant speed up in optimization

prior tasks

Surrogate model transfer

Manolache & Vanschoren 2019

new task

27 of 41

Warm starting

(what works on similar tasks?)

start randomly

start with

good candidates

Task

λ, scores

Learner

metadata

28 of 41

Hand-designed (statistical) meta-features that describe (tabular) datasets ¹
Task2Vec: task embedding for image data ²
Optimal transport: similarity measure based on comparing probability distributions ³
Metadata embedding based on textual dataset description ⁴
Dataset2Vec: compares batches of datasets ⁵
Distribution-based invariant deep networks ⁶

How to measure task similarity?

¹ Vanschoren 2018

² Achille et al. 2019

³ Alvarez-Melis et al. 2020

⁴ Drori et al. 2019

⁵ Jomaa et al. 2020

⁶ de Bie et al. 2020

Figure source: Alvarez-Melis et al. 2020

29 of 41

Warm-starting with kNN

Find k most similar tasks, warm-start search with best λ_i

Auto-sklearn: Bayesian optimization (SMAC)

Meta-learning yields better models, faster
Winner of AutoML Challenges
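
A minimal sketch of kNN warm-starting. The meta-features, the stored best configurations, and the standardized Euclidean distance are all assumptions made for illustration; they are not the ones used in auto-sklearn.

```python
import numpy as np


def warm_start_configs(meta_features, best_configs, new_meta_features, k=2):
    """Return the best-known configurations of the k prior tasks
    that are closest to the new task in meta-feature space."""
    X = np.asarray(meta_features, dtype=float)
    # Standardize so that no single meta-feature dominates the distance.
    mu, sigma = X.mean(axis=0), X.std(axis=0) + 1e-12
    new = (np.asarray(new_meta_features, dtype=float) - mu) / sigma
    dist = np.linalg.norm((X - mu) / sigma - new, axis=1)
    nearest = np.argsort(dist)[:k]
    return [best_configs[j] for j in nearest]


# Hypothetical meta-features m_j: (log #instances, log #features, class entropy).
meta_features = [[3.0, 1.0, 0.9], [5.0, 2.0, 0.5], [3.2, 1.1, 1.0], [6.0, 3.0, 0.2]]
best_configs = [{"C": 1.0}, {"C": 100.0}, {"C": 0.5}, {"C": 1000.0}]
initial_design = warm_start_configs(meta_features, best_configs, [3.1, 1.0, 0.95], k=2)
```

The returned configurations are evaluated first on the new task and then handed to Bayesian optimization as its initial observations, replacing a random initial design.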

Tasks

Models

performance

Learning

New Task

meta-learner

Models

performance

P_i,j


λ_1..k

m_j

best λ_i on similar tasks

Feurer et al. 2015

λ_i

Bayesian optimization

λ₁

λ₃

λ₂

λ₄

Figure source: Feurer et al., 2015

30 of 41

Probabilistic Matrix Factorization

Fusi et al. 2017

P_i,j

λ_i

T_L

λ_L

t_j

t_new warm-started

with λ_1..k


λ_i

λ_Li

p(P|λ_Li)

latent representation

Learn latent representation for tasks T and configurations λ
Use meta-features to warm-start on new task
Returns probabilistic predictions for Bayesian optimization

Collaborative filtering: configurations λ_i are 'rated' by tasks t_j

Figure source: Fusi et al., 2017
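
The collaborative-filtering idea can be sketched with a plain (non-probabilistic) matrix factorization fitted by alternating least squares. This is a deliberately simplified stand-in: the cited method is probabilistic and returns predictive uncertainty, which this sketch does not. All data below are made up.

```python
import numpy as np


def factorize(P, observed, rank=2, reg=1e-3, n_iters=50, seed=0):
    """Alternating least squares on the observed entries of P (configs x tasks)."""
    rng = np.random.default_rng(seed)
    n_configs, n_tasks = P.shape
    U = rng.normal(scale=0.1, size=(n_configs, rank))  # latent configurations
    V = rng.normal(scale=0.1, size=(n_tasks, rank))    # latent tasks
    ridge = reg * np.eye(rank)
    for _ in range(n_iters):
        for i in range(n_configs):          # update each configuration, tasks fixed
            cols = observed[i]
            U[i] = np.linalg.solve(V[cols].T @ V[cols] + ridge, V[cols].T @ P[i, cols])
        for j in range(n_tasks):            # update each task, configurations fixed
            rows = observed[:, j]
            V[j] = np.linalg.solve(U[rows].T @ U[rows] + ridge, U[rows].T @ P[rows, j])
    return U, V


# Toy rank-2 performance matrix with two entries that were never evaluated.
U_true = np.array([[1, 0], [0, 1], [1, 1], [2, 1], [1, 2], [0.5, 1.5]], dtype=float)
V_true = np.array([[1, 0.5], [0.2, 1], [1, 1], [0.5, 0.1], [0.8, 0.3]], dtype=float)
P = U_true @ V_true.T
observed = np.ones_like(P, dtype=bool)
observed[0, 1] = False
observed[3, 4] = False

U, V = factorize(P, observed)
P_hat = U @ V.T  # predicted scores, including the unevaluated pairs
```

The reconstructed matrix fills in scores for configuration/task pairs that were never run, and those predictions are what a warm-started search would rank.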

31 of 41

DARTS: Differentiable NAS

Liu et al. 2018

convolution

max pooling

zero

One-shot model

Figure source: Liu et al., 2018
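
The continuous relaxation at the heart of DARTS can be sketched on a single edge. The 1-D operations below are toy stand-ins for the candidate operations shown on the slide, and the architecture weights `alpha` would normally be learned by gradient descent, which this sketch leaves out.

```python
import numpy as np


def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()


# Toy 1-D stand-ins for the candidate operations on one edge of the cell.
def convolution(x):
    return np.convolve(x, np.array([0.25, 0.5, 0.25]), mode="same")


def max_pooling(x):
    padded = np.pad(x, 1, mode="edge")
    return np.maximum(np.maximum(padded[:-2], padded[1:-1]), padded[2:])


def zero(x):
    return np.zeros_like(x)


OPS = [convolution, max_pooling, zero]


def mixed_op(x, alpha):
    """Continuous relaxation: softmax(alpha)-weighted sum of all candidate operations."""
    weights = softmax(alpha)
    return sum(w * op(x) for w, op in zip(weights, OPS))


def discretize(alpha):
    """After the search, keep the strongest operation (ignoring the zero operation)."""
    candidates = [i for i, op in enumerate(OPS) if op is not zero]
    return OPS[max(candidates, key=lambda i: alpha[i])]


x = np.array([1.0, 2.0, 3.0, 4.0])
alpha = np.array([0.1, 2.0, 5.0])
y = mixed_op(x, alpha)
chosen = discretize(alpha)
```

Because the output is a smooth function of `alpha`, the architecture choice can be optimized with gradients alongside the network weights inside the one-shot model.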

32 of 41

Warm-started DARTS

Warm-start DARTS with architectures that worked well on similar problems
Slightly better performance, but much faster (5x)

Grobelnik and Vanschoren, 2021

33 of 41

Meta-models

(learn how to build models/components)

Task

λ, scores

Learner

metadata

Task

34 of 41

Algorithm selection models

Learn direct mapping between meta-features and P_i,j

Zero-shot meta-models: predict best λ_i given meta-features ¹

Ranking models: return ranking λ_1..k ²

Predict which algorithms / configurations to consider / tune ³

Predict performance / runtime for given λ_i and task ⁴

Can be integrated in larger AutoML systems: warm start, guide search,…

meta-learner

λ_best

¹ Brazdil et al. 2009, Lemke et al. 2015

² Sun and Pfahringer 2013, Pinto et al. 2017

meta-learner

λ_1..k

m_j

meta-learner

P_ij

m_j,λ_i

³ Sanders and Giraud-Carrier 2017

meta-learner

m_j

⁴ Yang et al. 2018

35 of 41

Learn nonlinearities: RL-based search of space of likely useful activation functions ¹

E.g. Swish can outperform ReLU

¹ Ramachandran et al. 2017

Learn optimizers: RL-based search of space of likely useful update rules ²

E.g. PowerSign can outperform Adam, RMSProp

² Bello et al. 2017

Learning model components

g: gradient, m: moving average

Learn acquisition functions for Bayesian optimization ³

³ Volpp et al. 2020

Figure source: Ramachandran et al., 2017 (top), Bello et al. 2017 (bottom)
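
The two discovered components named on this slide are short enough to write down. The sketch below uses g for the gradient and m for its moving average, as in the slide's legend; the learning rate, the decay factor, and updating m before the step are assumptions.

```python
import math
import numpy as np


def swish(x, beta=1.0):
    """Swish activation: x * sigmoid(beta * x)."""
    return x / (1.0 + np.exp(-beta * x))


def powersign_step(w, g, m, lr=0.1, decay=0.9):
    """One PowerSign-style update: scale the gradient by e^(sign(g) * sign(m)).
    Steps grow when gradient and moving average agree in sign, shrink otherwise."""
    m = decay * m + (1.0 - decay) * g  # m: moving average of the gradient
    w = w - lr * np.exp(np.sign(g) * np.sign(m)) * g
    return w, m


w_agree, _ = powersign_step(np.array([1.0]), g=np.array([2.0]), m=np.array([1.0]))
w_disagree, _ = powersign_step(np.array([1.0]), g=np.array([2.0]), m=np.array([-5.0]))
```

With the same gradient, the step is e times larger than plain gradient descent when g and m agree, and e times smaller when they disagree.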

36 of 41

Monte Carlo Tree Search + reinforcement learning

Self-play:

Game actions: insert, delete, replace components in a pipeline
Monte Carlo Tree Search builds pipelines given action probabilities

With grammar to avoid invalid pipelines

Neural network (LSTM) predicts pipeline performance (can be pre-trained on prior datasets)

MOSAIC [Rakotoarison et al. 2019]

AlphaD3M [Drori et al. 2019]

Figure source: Drori et al., 2019

37 of 41

Neural Architecture Transfer learning

Wong et al. 2018

Warm-start a deep RL controller based on prior tasks
Much faster than single-task equivalent

Figure source: Wong et al., 2018

38 of 41

Meta-Reinforcement Learning for NAS

Train an agent to build a neural net, across tasks
Should transfer but also adapt to new tasks

Actions: add/remove certain layers in certain locations

Gomez & Vanschoren, 2019

39 of 41

omniglot

vgg_flower

dtd

Results on increasingly difficult tasks:

Initially slower than DQN, but faster after a few tasks
Policy entropy shows learning/re-learning

Gomez & Vanschoren, 2019

Meta-Reinforcement Learning for NAS

40 of 41

MetaNAS: MAML + Neural Architecture Search

Figure source: Elsken et al., 2020

Combines gradient-based meta-learning (REPTILE) with NAS
During meta-train, it optimizes the meta-architecture (DARTS weights) along with the meta-parameters (initial weights) 𝛳
During meta-test, the architecture can be adapted to the novel task through gradient descent
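
The REPTILE part of this recipe can be sketched on toy tasks. The sketch meta-learns only an initialization vector; in MetaNAS the DARTS architecture weights are meta-learned together with the initial weights, which is left out here. The quadratic task losses and all step sizes are assumptions.

```python
import numpy as np


def inner_sgd(theta, task_optimum, lr=0.1, steps=5):
    """Adapt to one task with a few gradient steps on the toy loss
    0.5 * ||theta - task_optimum||^2."""
    for _ in range(steps):
        theta = theta - lr * (theta - task_optimum)
    return theta


def reptile(task_optima, theta, meta_lr=0.1, meta_steps=800):
    """REPTILE: move the initialization a small step toward each task's adapted parameters."""
    for step in range(meta_steps):
        task = task_optima[step % len(task_optima)]
        adapted = inner_sgd(theta, task)
        theta = theta + meta_lr * (adapted - theta)
    return theta


# Four toy tasks, each defined by the location of its optimum.
task_optima = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, 3.0]])
theta_meta = reptile(task_optima, theta=np.array([5.0, 5.0]))
```

For these toy tasks the learned initialization ends up near the mean of the task optima, so a few gradient steps are enough to adapt it to any one of them.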

Elsken et al. 2020
