1 of 41

Part 3: AutoML

Iddo Drori Joaquin Vanschoren

MIT TU Eindhoven

AAAI 2021

https://sites.google.com/mit.edu/aaai2021metalearningtutorial

Meta Learning Tutorial

2 of 41

Iddo Drori (MIT) and Joaquin Vanschoren (TU Eindhoven)

Meta-Learning for AutoML

AAAI 2021 Tutorial, part 3

Cover art: MC Escher, Ustwo Games

3 of 41

Task 1

Learning

Task 2

Learning

Task 3

Learning

Task new

Learning

x,y

x,y

x,y

x,y

Recap: What can we learn to learn?

1. architectures / pipelines

(hyperparameters, structures)

Focus of this part

See part 2

From hand-designed to learned learning algorithms … to AI-generating algorithms?

new tasks

(of own choosing)

experience

2. learning algorithms

(priors, task embeddings,…)

3. learning environments

(curricula, self-exploration)

bias

4 of 41

Machine Learning

Task 1

Model

Learning algorithm

Task: distribution of samples q(x)

outputs y, loss ℒ(x,y)

Task 2

Model

Learning algorithm

T1(fɸ1,λ(x),y)

T2

x,y

x,y

training

ɸ

ɸ’1

ɸ’2

fɸ1(x)

fɸ2(x)

Learner: model parameters ɸ,

hyper-parameters λ

When the new task is quite different, (meta-)learn the hyper-parameters λ

λ1

λ2

When the new task is quite similar, keep λ, (meta-)learn the model parameters ɸ

Neural architectures,

pipelines, other

hyperparameters, …

Note: we can also learn λ and ɸ at the same time (bilevel optimization)
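One common way to write this joint problem, as a sketch. The train/validation split and the symbols ℒ_train, ℒ_val are assumptions for illustration and do not appear on the slide:

```latex
\lambda^{*} = \arg\min_{\lambda} \; \mathcal{L}_{\mathrm{val}}\bigl(f_{\phi^{*}(\lambda),\,\lambda}\bigr)
\quad \text{s.t.} \quad
\phi^{*}(\lambda) = \arg\min_{\phi} \; \mathcal{L}_{\mathrm{train}}\bigl(f_{\phi,\,\lambda}\bigr)
```

The outer level searches over λ, the inner level trains ɸ for a given λ.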

5 of 41

Task

Models

performance

Human expert

Models

Models

manual trial and error

(and intuition)

Task

Models

performance

Learning and optimization

Models

Models

automated, efficient search for best models

Manual machine learning

Models

Models

Models

λ

Automatic Machine Learning (AutoML)

λ

AutoML: build models in a data-driven, intelligent, purposeful way

6 of 41

AutoML example: Pipeline synthesis

Cleaning, preprocessing, feature selection/engineering, model selection, hyperparameter tuning, adapting to concept drift, …

Figure source: Nick Gillian

7 of 41

  • Type of operators
  • Size of layers
  • Filter sizes
  • Skip connections
  • Pre-trained layers
  • Transformers
  • Gradient descent hyperparameters
  • Regularization

AutoML example: Neural Architecture Search

Architecture:

Optimization:

Figure source: Elsken et al., 2018

8 of 41

Task 1..N

Models

performance

AutoML

Models

Models

Meta-learn how to design architectures/pipelines and tune hyperparameters

Human data scientists also learn from experience

Models

Models

Models

AutoML + meta-learning

λ

New task

Models

performance

self-learning AutoML

Models

λ

bias

(priors, meta-knowledge, human priors)

Search space can be huge!

9 of 41

Meta-learning for AutoML: how?

Learning hyperparameter priors

Warm starting (what works on similar tasks?)

start randomly

start with

good candidates

Meta-models (learn how to build models/components)

Complex

hyperparameter space

Simple

hyperparameter space

Task

λ, scores

λ, scores

Learner

Learner

metadata

hyperparameters = architecture + hyperparameters

Task

λ, scores

Learner

metadata

Task

Task

10 of 41

Observation:

current AutoML strongly depends on learned priors

Complex

hyperparameter space

Simple

hyperparameter space

observation

11 of 41

Manual architecture priors

Most successful pipelines have a similar structure

Auto-sklearn, Feurer et al. 2015

Auto-WEKA, Thornton et al. 2013

hyperopt-sklearn Komer et al. 2014

AutoGluon-Tabular Erickson et al. 2020

+ smaller search space

- you can’t learn entirely new architectures

  • Fix architecture, encode all choices as extra hyperparameters
    • Architecture search becomes hyperparameter optimization
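A minimal sketch of what "encode all choices as extra hyperparameters" can look like. The pipeline steps, value ranges and the `sample_configuration` helper are hypothetical and not taken from any of the cited systems:

```python
import random

# Hypothetical fixed pipeline: imputer -> scaler -> classifier.
# Every structural choice is encoded as one more (conditional) hyperparameter.
SEARCH_SPACE = {
    "imputer": ["mean", "median"],
    "scaler": ["none", "standard", "minmax"],
    "classifier": {
        "svm": {"C": (1e-3, 1e3), "gamma": (1e-4, 1e1)},
        "random_forest": {"n_estimators": (10, 500), "max_depth": (2, 20)},
    },
}


def sample_configuration(rng):
    """Draw one pipeline configuration as a flat dict of hyperparameters."""
    config = {
        "imputer": rng.choice(SEARCH_SPACE["imputer"]),
        "scaler": rng.choice(SEARCH_SPACE["scaler"]),
    }
    clf = rng.choice(sorted(SEARCH_SPACE["classifier"]))
    config["classifier"] = clf
    # Conditional hyperparameters: only those of the chosen classifier are active.
    for name, (low, high) in SEARCH_SPACE["classifier"][clf].items():
        if isinstance(low, int):
            config[f"{clf}:{name}"] = rng.randint(low, high)
        else:
            # Sample float ranges log-uniformly.
            config[f"{clf}:{name}"] = low * (high / low) ** rng.random()
    return config


rng = random.Random(0)
configs = [sample_configuration(rng) for _ in range(5)]
```

Any hyperparameter optimizer that handles conditional parameters can then search this space directly.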

Can we meta-learn a prior over successful structures?

Ensembling/stacking

Figure source: Feurer et al. 2015

12 of 41

Manual architecture priors

Parameterized Sequential

Parameterized Graph

Choose:

  • number of layers
  • type of layers
    • dense
    • convolutional
    • max-pooling
  • hyperparameters of layers

+ easier to search

- sometimes too simple

Choose:

  • branching
  • joins
  • skip connections
  • types of layers
  • hyperparameters of layers

+ more flexible

- much harder to search

13 of 41

Manual architecture priors

Successful deep networks often have repeated motifs (cells)

e.g. Inception v4:

Figure source: Szegedy et al 2016

14 of 41

Cell search space prior

Google NASNet Zoph et al 2018

Compositionality: learn hierarchical building blocks to simplify the task

  • learn parameterized building blocks (cells)
  • stack cells together in macro-architecture

+ smaller search space

+ cells can be learned on a small dataset & transferred to a larger dataset

- strong domain priors, doesn’t generalize well

Cell search space

Can we meta-learn hierarchies / components that generalize better?

Figure source: Elsken et al., 2019

15 of 41

Cell search space prior

 

Figure source: Zoph et al., 2018

16 of 41

  • Cell construction with neuro-evolution (SOTA ImageNet)

AmoebaNet, Real et al 2019

normal cell

reduction cell

Cell search space prior

Figure source: Real et al., 2019

17 of 41

If you constrain the search space enough, you can get SOTA results with random search!

Cell search space prior

  • Cell construction with multi-fidelity random search!

Figure source: Li & Talwalkar., 2019
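A sketch of one common multi-fidelity scheme, successive halving over randomly sampled candidates. This is an illustrative assumption, not necessarily the exact procedure of the cited paper, and `toy_evaluate` is a hypothetical stand-in for training a cell-based network for `budget` epochs:

```python
import random


def successive_halving(configs, evaluate, min_budget=1, eta=2):
    """Keep the best 1/eta of the candidates per rung and multiply the budget by eta."""
    budget = min_budget
    survivors = list(configs)
    while len(survivors) > 1:
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget), reverse=True)
        survivors = ranked[: max(1, len(ranked) // eta)]
        budget *= eta
    return survivors[0], budget


def toy_evaluate(config, budget):
    """Toy learning curve: the score approaches the config's quality as budget grows."""
    return config["quality"] * (1 - 0.5 ** budget)


rng = random.Random(0)
candidates = [{"id": i, "quality": rng.random()} for i in range(16)]  # random search
best, final_budget = successive_halving(candidates, toy_evaluate)
```

Most candidates are discarded after a cheap low-budget evaluation, so the full budget is spent only on the few that survive.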

18 of 41

Weight-agnostic neural networks

  • ALL weights are shared
  • Only evolve the architecture?
    • Minimal description length
    • Baldwin effect?

Manual priors: Weight sharing

Figure source: Gaier & Ha, 2019

19 of 41

Learning hyperparameter priors

Complex

hyperparameter space

Simple

hyperparameter space

λ, scores

Learner

20 of 41

Learn hyperparameter importance

  • Functional ANOVA 1
    • Select hyperparameters that cause variance in the evaluations.
    • Useful to speed up black-box optimization techniques
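A simplified sketch of the idea, assuming scores are available on a full grid. It computes only main effects directly from the grid, whereas functional ANOVA as used in practice fits a surrogate model and also covers interaction effects. The toy score function is hypothetical:

```python
import numpy as np


def main_effect_importance(scores):
    """Fraction of total variance explained by each hyperparameter's main effect.

    scores: array with one axis per hyperparameter, evaluated on a full grid.
    """
    total_var = scores.var()
    importances = []
    for axis in range(scores.ndim):
        other_axes = tuple(a for a in range(scores.ndim) if a != axis)
        marginal = scores.mean(axis=other_axes)  # average out all other hyperparameters
        importances.append(marginal.var() / total_var)
    return np.array(importances)


learning_rate = np.array([1e-3, 1e-2, 1e-1])
momentum = np.array([0.0, 0.5, 0.9])
# Toy scores: strongly dependent on learning rate, weakly on momentum.
grid = (0.9
        - 0.3 * np.abs(np.log10(learning_rate)[:, None] + 2)
        + 0.01 * momentum[None, :])
importance = main_effect_importance(grid)  # [learning rate, momentum]
```

Hyperparameters with a small share of the variance are candidates for fixing to a default, which shrinks the space a black-box optimizer has to search.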

ResNets for image classification

Figure source: van Rijn & Hutter, 2018

21 of 41

Learn defaults + hyperparameter importance

  • Tunability 1,2,3: learn good defaults, measure importance as the improvement gained by tuning

Learned defaults

Tuning risk
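A minimal sketch of the two quantities, assuming a matrix of prior evaluations is available. The random matrix `perf` is hypothetical data, and taking the default as the configuration with the best mean score is one simple choice among several:

```python
import numpy as np

# Hypothetical meta-data: perf[i, j] is the score of configuration i on task j.
rng = np.random.default_rng(0)
perf = rng.uniform(0.6, 0.95, size=(20, 8))

# Learned default: the configuration that performs best on average over prior tasks.
default_idx = int(perf.mean(axis=1).argmax())

# Tunability: how much is gained per task by tuning instead of using the default.
tunability_per_task = perf.max(axis=0) - perf[default_idx]
tunability = float(tunability_per_task.mean())
```

Restricting the maximum to configurations that differ from the default in a single hyperparameter would give a per-hyperparameter importance in the same way.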

22 of 41

Bayesian Optimization (interlude)

 

 

 

 

 

performance

23 of 41

  • Repeat until some stopping criterion:
    • Fixed budget
    • Convergence
    • EI threshold

  • Theoretical guarantees

  • Also works for non-convex, noisy objective functions
  • Used to tune hyperparameters in AlphaGo

Bayesian Optimization

Figure source: Shahriari 2016
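A compact sketch of the loop with a fixed budget as stopping criterion, using a Gaussian process surrogate and expected improvement (EI). The kernel, length scale, toy `objective` and grid-based maximization of EI are all illustrative assumptions:

```python
import math
import numpy as np


def rbf_kernel(a, b, length_scale=0.2):
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)


def gp_posterior(x_train, y_train, x_query, noise=1e-6):
    """Posterior mean and standard deviation of a zero-mean GP with unit prior variance."""
    K = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    K_s = rbf_kernel(x_train, x_query)
    K_inv = np.linalg.inv(K)
    mean = K_s.T @ K_inv @ y_train
    var = 1.0 - np.sum(K_s * (K_inv @ K_s), axis=0)
    return mean, np.sqrt(np.maximum(var, 1e-12))


def expected_improvement(mean, std, best):
    """EI for maximization."""
    z = (mean - best) / std
    cdf = 0.5 * (1 + np.vectorize(math.erf)(z / math.sqrt(2)))
    pdf = np.exp(-0.5 * z ** 2) / math.sqrt(2 * math.pi)
    return (mean - best) * cdf + std * pdf


def objective(x):
    """Hypothetical black-box function (e.g. validation score of configuration x)."""
    return -(x - 0.7) ** 2


grid = np.linspace(0, 1, 201)
x_obs = np.array([0.1, 0.5, 0.9])       # initial design
y_obs = objective(x_obs)
for _ in range(10):                      # stopping criterion: fixed budget
    mean, std = gp_posterior(x_obs, y_obs, grid)
    ei = expected_improvement(mean, std, y_obs.max())
    x_next = grid[int(ei.argmax())]      # most promising point under the surrogate
    x_obs = np.append(x_obs, x_next)
    y_obs = np.append(y_obs, objective(x_next))
best_x = x_obs[int(y_obs.argmax())]
```

Swapping the `for` loop condition for a convergence check or an EI threshold gives the other two stopping criteria listed on the slide.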

24 of 41

Learn basis expansions for hyperparameters

  • Hyperparameters can interact in very non-linear ways
  • Use a neural net to learn a suitable basis expansion ϕz(λ) for all tasks
  • A Bayesian linear model on top of this basis can then serve as surrogate; the shared basis transfers information about the configuration space
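A sketch of the surrogate side only: standard Bayesian linear regression on top of a basis expansion. As an assumption for illustration, fixed random tanh features stand in for the basis that would be learned by a neural net across tasks, and `alpha`, `beta` are hypothetical prior and noise precisions:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the learned basis: 20 fixed random tanh features of a scalar lambda.
W, b = rng.normal(size=(1, 20)), rng.normal(size=20)


def basis(lam):
    return np.tanh(np.atleast_2d(lam).T @ W + b)


def fit_blr(Phi, y, alpha=1.0, beta=100.0):
    """Posterior N(m, S) over linear weights; prior N(0, I/alpha), noise precision beta."""
    S = np.linalg.inv(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)
    m = beta * S @ Phi.T @ y
    return m, S


def predict_blr(Phi, m, S, beta=100.0):
    """Predictive mean and standard deviation for each row of Phi."""
    mean = Phi @ m
    var = 1.0 / beta + np.sum((Phi @ S) * Phi, axis=1)
    return mean, np.sqrt(var)


lam_obs = np.array([0.1, 0.4, 0.8])      # configurations tried on the new task
scores = np.sin(3 * lam_obs)             # hypothetical observed scores
m, S = fit_blr(basis(lam_obs), scores)
mean, std = predict_blr(basis(np.linspace(0, 1, 50)), m, S)
```

The fit costs one small matrix inversion in the number of basis functions, which is what makes a linear surrogate attractive compared with a full Gaussian process.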

P

Bayesian Linear surrogate

φz(λ)i

λi, scores

Learn basis expansion on lots of data (e.g. OpenML)

φz(λ)

φz(λ)

λ

Gaussian Processes surrogate

25 of 41

Surrogate model transfer

  • If task j is similar to the new task, its surrogate model Sj will likely transfer well
  • Sum up all Sj predictions, weighted by task similarity (as in active testing)1
  • Build combined Gaussian process, weighted by current performance on new task2
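A minimal sketch of the weighted combination S = ∑ wj Sj. The three closed-form surrogates and the similarity weights are hypothetical placeholders for models fitted on prior tasks:

```python
import numpy as np


def combined_surrogate(surrogates, weights):
    """Return S(lam) = sum_j w_j * S_j(lam), with weights normalized to sum to 1."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()

    def predict(lam):
        return sum(wj * s(lam) for wj, s in zip(w, surrogates))

    return predict


# Hypothetical surrogates of three prior tasks (predicted score for configuration lam).
S1 = lambda lam: 1.0 - (lam - 0.2) ** 2
S2 = lambda lam: 1.0 - (lam - 0.3) ** 2
S3 = lambda lam: 1.0 - (lam - 0.9) ** 2
similarity = [0.6, 0.3, 0.1]  # assumed similarity of each prior task to the new task

S = combined_surrogate([S1, S2, S3], similarity)
candidates = np.linspace(0, 1, 101)
best_lambda = candidates[int(np.argmax(S(candidates)))]
```

Re-estimating the weights from how well each Sj predicts the evaluations observed so far on the new task gives the second variant in the list above.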

Figure: for each prior task tj, a surrogate model Sj is trained on its evaluations (λi, Pi,j); on the new task, the meta-learner combines them as S = ∑ wj Sj (here S1 + S2 + S3) to predict the performance P of a configuration λi.

26 of 41

  • Store surrogate model Sij for every pair of task i and algorithm j
  • Simpler surrogates, better transfer
  • Learn weighted ensemble -> significant speed up in optimization

prior tasks

Surrogate model transfer

new task

27 of 41

Warm starting

(what works on similar tasks?)

start randomly

start with

good candidates

Task

λ, scores

Learner

metadata

28 of 41

  • Hand-designed (statistical) meta-features that describe (tabular) datasets 1
  • Task2Vec: task embedding for image data 2
  • Optimal transport: similarity measure based on comparing probability distributions 3
  • Metadata embedding based on textual dataset description 4
  • Dataset2Vec: compares batches of datasets 5
  • Distribution-based invariant deep networks 6

How to measure task similarity?

Figure source: Alvarez-Melis et al. 2020

29 of 41

Warm-starting with kNN

  • Find k most similar tasks, warm-start search with best λi
    • Auto-sklearn: Bayesian optimization (SMAC)
      • Meta-learning yields better models, faster
      • Winner of AutoML Challenges
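A sketch of the nearest-neighbour step, assuming each prior task is described by a meta-feature vector and has a stored best configuration. The meta-features, configurations and the `warm_start_configs` helper are hypothetical and not the Auto-sklearn API:

```python
import numpy as np


def warm_start_configs(meta_features, best_configs, new_task_features, k=2):
    """Return the best-known configurations of the k most similar prior tasks."""
    X = np.asarray(meta_features, dtype=float)
    # Standardize so that no single meta-feature dominates the distance.
    mu, sigma = X.mean(axis=0), X.std(axis=0) + 1e-12
    Xn = (X - mu) / sigma
    q = (np.asarray(new_task_features, dtype=float) - mu) / sigma
    distances = np.linalg.norm(Xn - q, axis=1)
    nearest = np.argsort(distances)[:k]
    return [best_configs[i] for i in nearest]


# Hypothetical meta-features: [log #instances, log #features, class entropy]
prior_meta = [[3.0, 1.0, 0.9], [5.0, 2.0, 0.5], [3.2, 1.1, 1.0], [6.0, 3.0, 0.2]]
prior_best = [{"C": 1.0}, {"C": 100.0}, {"C": 0.5}, {"C": 1000.0}]

initial_design = warm_start_configs(prior_meta, prior_best, [3.1, 1.0, 0.95], k=2)
```

The returned configurations are evaluated first on the new task, and Bayesian optimization continues from those observations instead of from random points.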

Figure: the meta-learner takes the meta-features mj of the new task and the prior evaluations Pi,j, and returns λ1..k, the best λi on similar tasks; these seed Bayesian optimization of performance P over λ (evaluated points λ1 to λ4).

Figure source: Feurer et al., 2015

30 of 41

Probabilistic Matrix Factorization

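Figure: performance matrix with entries Pi,j (configurations λi by tasks tj), factorized into latent representations TL (tasks) and λL (configurations); the model predicts p(P|λLi) from the latent representation λLi, and tnew is warm-started with λ1..k.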

  • Learn latent representation for tasks T and configurations λ
  • Use meta-features to warm-start on new task
  • Returns probabilistic predictions for Bayesian optimization
  • Collaborative filtering: configurations λi are 'rated' by tasks tj
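A simplified sketch of the collaborative-filtering idea: plain (non-probabilistic) matrix factorization fitted by gradient descent on the observed entries only. The synthetic matrix, rank, learning rate and regularization are assumptions; the probabilistic version on the slide additionally returns predictive uncertainty:

```python
import numpy as np

rng = np.random.default_rng(0)
n_configs, n_tasks, rank = 30, 12, 2

# Synthetic low-rank "performance" matrix; only about 70% of the entries are observed.
true_P = rng.normal(size=(n_configs, rank)) @ rng.normal(size=(rank, n_tasks))
observed = rng.random(true_P.shape) < 0.7

U = 0.1 * rng.normal(size=(n_configs, rank))  # latent configuration representations
V = 0.1 * rng.normal(size=(n_tasks, rank))    # latent task representations
lr, reg = 0.01, 0.01
for _ in range(3000):
    err = observed * (U @ V.T - true_P)       # error on observed entries only
    U, V = U - lr * (err @ V + reg * U), V - lr * (err.T @ U + reg * V)


def rmse(mask):
    return float(np.sqrt(((U @ V.T - true_P)[mask] ** 2).mean()))


train_rmse = rmse(observed)
heldout_rmse = rmse(~observed)   # quality of the predicted, never-evaluated entries
```

The filled-in entries of `U @ V.T` are the predicted scores of configurations that were never run on a given task.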

Figure source: Fusi et al., 2017

31 of 41

DARTS: Differentiable NAS

convolution

max pooling

zero

 

One-shot model

 

 

 

Figure source: Liu et al., 2018
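A sketch of the continuous relaxation on a single edge of the one-shot model: the output is a softmax-weighted sum of all candidate operations, and the architecture is read off afterwards from the largest weight. The 1-D toy operations and the values of `alpha` are assumptions; in DARTS, `alpha` is learned by gradient descent together with the network weights:

```python
import numpy as np


def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()


# Toy 1-D stand-ins for the candidate operations on one edge.
def convolution(x):
    return np.convolve(x, np.array([0.25, 0.5, 0.25]), mode="same")


def max_pooling(x):
    padded = np.pad(x, 1, mode="edge")
    return np.maximum.reduce([padded[:-2], padded[1:-1], padded[2:]])


def zero(x):
    return np.zeros_like(x)


OPS = {"convolution": convolution, "max pooling": max_pooling, "zero": zero}


def mixed_op(x, alpha):
    """Softmax-weighted sum of all candidate operations applied to x."""
    weights = softmax(alpha)
    return sum(w * op(x) for w, op in zip(weights, OPS.values()))


x = np.array([0.0, 1.0, 3.0, 2.0])
alpha = np.array([0.1, 2.0, -1.0])     # architecture weights (fixed here for illustration)
y = mixed_op(x, alpha)
chosen = list(OPS)[int(alpha.argmax())]  # discretization: keep the strongest operation
```

Because `mixed_op` is differentiable in `alpha`, the choice of operation can be optimized with gradients instead of a discrete search.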

32 of 41

Warm-started DARTS

  • Warm-start DARTS with architectures that worked well on similar problems
  • Slightly better performance, but much faster (5x)

33 of 41

Meta-models

(learn how to build models/components)

Task

λ, scores

Learner

metadata

Task

Task

34 of 41

Algorithm selection models

  • Learn direct mapping between meta-features and Pi,j
    • Zero-shot meta-models: predict best λi given meta-features 1

    • Ranking models: return ranking λ1..k 2

    • Predict which algorithms / configurations to consider / tune 3

    • Predict performance / runtime for given λi and task 4

  • Can be integrated in larger AutoML systems: warm start, guide search,…

meta-learner

λbest

meta-learner

λ1..k

mj

mj

meta-learner

Pij

mj, λi

meta-learner

Λ

mj

35 of 41

  • Learn nonlinearities: RL-based search of space of likely useful activation functions 1
    • E.g. Swish can outperform ReLU
  • Learn optimizers: RL-based search of space of likely useful update rules 2
    • E.g. PowerSign can outperform Adam, RMSprop

Learning model components

 

 

g: gradient, m: moving average

  • Learn acquisition functions for Bayesian optimization 3

Figure source: Ramachandran et al., 2017 (top), Bello et al. 2017 (bottom)
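A sketch of the two discovered components, using g for the gradient and m for its moving average as above. The toy objective, learning rate, base `alpha` and `decay` values are assumptions chosen for illustration:

```python
import numpy as np


def swish(x, beta=1.0):
    """Swish activation: x * sigmoid(beta * x)."""
    return x / (1.0 + np.exp(-beta * x))


def powersign_step(w, g, m, lr=0.1, alpha=np.e, decay=0.9):
    """One PowerSign update.

    The step is scaled up when the gradient g and its moving average m agree
    in sign, and scaled down when they disagree.
    """
    m = decay * m + (1 - decay) * g
    w = w - lr * alpha ** (np.sign(g) * np.sign(m)) * g
    return w, m


# Toy usage: minimize f(w) = w^2, whose gradient is 2w.
w, m = np.array([3.0]), np.zeros(1)
for _ in range(50):
    g = 2 * w
    w, m = powersign_step(w, g, m)
```

Both are small symbolic expressions, which is what makes a search over a space of such expressions feasible.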

36 of 41

Monte Carlo Tree Search + reinforcement learning

  • Self-play:
    • Game actions: insert, delete, replace components in a pipeline
    • Monte Carlo Tree Search builds pipelines given action probabilities
      • With grammar to avoid invalid pipelines
    • Neural network (LSTM) predicts pipeline performance (can be pre-trained on prior datasets)

Figure source: Drori et al., 2019

37 of 41

Neural Architecture Transfer learning

  • Warm-start a deep RL controller based on prior tasks
  • Much faster than single-task equivalent

Figure source: Wong et al., 2018

38 of 41

Meta-Reinforcement Learning for NAS

  • Train an agent how to build a neural net, across tasks
  • Should transfer but also adapt to new tasks

Actions: add/remove certain layers in certain locations

39 of 41

omniglot

vgg_flower

dtd

Results on increasingly difficult tasks:

  • Initially slower than DQN, but faster after a few tasks
  • Policy entropy shows learning/re-learning

Meta-Reinforcement Learning for NAS

40 of 41

MetaNAS: MAML + Neural Architecture Search

Figure source: Elsken et al., 2020

  • Combines gradient based meta-learning (REPTILE) with NAS
  • During meta-train, it optimizes the meta-architecture (DARTS weights) along with the meta-parameters (initial weights) 𝛳
  • During meta-test, the architecture can be adapted to the novel task through gradient descent
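A toy sketch of the REPTILE-style meta-update applied to one vector that concatenates initial weights and architecture weights. The quadratic task loss, the parameter layout and all step sizes are assumptions, so this only illustrates the update rule, not the MetaNAS implementation:

```python
import numpy as np

rng = np.random.default_rng(0)


def task_loss_grad(params, target):
    """Gradient of a toy task loss ||params - target||^2."""
    return 2 * (params - target)


def adapt(params, target, steps=5, lr=0.1):
    """Inner loop: a few gradient steps on a single task."""
    p = params.copy()
    for _ in range(steps):
        p -= lr * task_loss_grad(p, target)
    return p


# Hypothetical layout: first two entries are weights, last two are architecture weights.
meta = np.zeros(4)
task_center = np.array([1.0, -2.0, 0.5, 3.0])
for _ in range(500):                                   # meta-training
    target = task_center + 0.1 * rng.normal(size=4)    # sample a task
    adapted = adapt(meta, target)
    meta += 0.1 * (adapted - meta)                     # REPTILE meta-update

# Meta-test: adapt both weights and architecture weights to a novel task.
new_target = task_center + 0.1 * rng.normal(size=4)
adapted_new = adapt(meta, new_target)
```

Since weights and architecture weights sit in the same vector, the few gradient steps at meta-test time change the architecture as well as the weights.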
