1 of 12

Towards Understanding Automated Deep Learning

Prof. Dr. Marius Lindauer

@LindauerMarius

@AutoML_org

These slides are available at www.automl.org/talks --- all references are hyperlinks

AutoDL@UMLOP@PPSN’20

M. Lindauer

Leibniz University Hannover

2 of 12

AutoDL: Automated Deep Learning

Optimizer

Validation performance (e.g., accuracy)

AutoDL Tool

Training Data

Validation Data

3 of 12

Auto-PyTorch [Mendoza et al. 2019, Zimmer et al. 2020]

Tabular data and image data
Very efficient because of meta-learning and multi-fidelity optimization

4 of 12

Characteristics of the Optimization Problem in AutoML/AutoDL

Complex search space:

Integer, float, categorical, conditional structures

Black-box function

No analytic form known
“Performance” can only be queried
But there is more information available compared to classical black-box optimization

Only a few function evaluations are affordable

A single function evaluation can take anywhere from minutes to hours (or even more)

Stochastic returns

Training of a DNN is non-deterministic (e.g., SGD)
Thus, returned performance can vary
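
The characteristics above can be made concrete with a minimal sketch. Everything in it is an assumption for illustration: the hyperparameter names, their ranges, and the made-up scoring function that stands in for actually training a DNN. It shows a mixed search space with one conditional hyperparameter, a noisy black-box evaluation, and a small evaluation budget spent by plain random search.

```python
import math
import random
import statistics


def sample_configuration(rng):
    """Draw one configuration from a small, hypothetical conditional search space."""
    config = {
        "num_layers": rng.randint(1, 5),             # integer
        "learning_rate": 10 ** rng.uniform(-4, -1),  # float, sampled log-uniformly
        "optimizer": rng.choice(["sgd", "adam"]),    # categorical
    }
    # Conditional structure: momentum only exists if the optimizer is SGD.
    if config["optimizer"] == "sgd":
        config["momentum"] = rng.uniform(0.1, 0.99)
    return config


def noisy_validation_error(config, rng):
    """Stand-in for training a DNN: a made-up error plus run-to-run noise."""
    base = 0.1 * abs(math.log10(config["learning_rate"]) + 2.5)
    base += 0.02 * abs(config["num_layers"] - 3)
    return base + rng.gauss(0.0, 0.01)


def random_search(num_evaluations=20, repeats=3, seed=0):
    """Spend a small budget; average repeated runs to dampen the noise."""
    rng = random.Random(seed)
    best_config, best_error = None, float("inf")
    for _ in range(num_evaluations):
        config = sample_configuration(rng)
        mean_error = statistics.mean(
            noisy_validation_error(config, rng) for _ in range(repeats)
        )
        if mean_error < best_error:
            best_config, best_error = config, mean_error
    return best_config, best_error


best_config, best_error = random_search()
print(best_config, round(best_error, 4))
```

Averaging repeated runs trades budget for reliability: with only a few evaluations affordable, every repeat is one less configuration that can be tried.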

5 of 12

LCBench: Learning Curve Benchmark [Zimmer et al. 2020]

Diverse set of 35 datasets from OpenML
7 hyperparameters

3 integers
4 floats

SGD + cosine annealing
2000 configurations
3 repeated runs each
⇒ 3 x 35 x 2000 = 210 000 neural network training runs

Other benchmarks: NAS-Bench-101, NAS-Bench-1Shot1, NAS-Bench-201, NAS-Bench-301

6 of 12

Heatmap & Portfolio

There isn’t a single configuration that rules them all.
Surprisingly, a small portfolio performs quite well.
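
One simple way to build such a portfolio is greedy selection on a configuration-by-dataset error matrix. The sketch below is an illustration only, not necessarily the procedure behind the slide, and the error values are made up. The portfolio's error on a dataset is taken to be the error of its best member on that dataset.

```python
def portfolio_error(errors, portfolio):
    """Mean over datasets of the best error achieved by any portfolio member."""
    num_datasets = len(errors[0])
    return sum(
        min(errors[c][d] for c in portfolio) for d in range(num_datasets)
    ) / num_datasets


def greedy_portfolio(errors, size):
    """Repeatedly add the configuration that lowers the portfolio error most."""
    portfolio = []
    candidates = set(range(len(errors)))
    while len(portfolio) < size and candidates:
        best = min(
            sorted(candidates),
            key=lambda c: portfolio_error(errors, portfolio + [c]),
        )
        portfolio.append(best)
        candidates.remove(best)
    return portfolio


# Made-up validation errors: rows are configurations, columns are datasets.
errors = [
    [0.10, 0.40, 0.35, 0.30],  # config 0: best on dataset 0
    [0.35, 0.12, 0.30, 0.33],  # config 1: best on dataset 1
    [0.30, 0.35, 0.15, 0.14],  # config 2: best on datasets 2 and 3
    [0.22, 0.24, 0.25, 0.26],  # config 3: decent everywhere, best nowhere
]
portfolio = greedy_portfolio(errors, size=2)
print(portfolio, portfolio_error(errors, portfolio))
```

In this toy matrix no single row is best on every dataset, yet two complementary configurations already cover most of the gap.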

7 of 12

Landscape?

Pushak & Hoos [2018] showed for algorithm configuration that landscapes are more benign than expected

Uni-modal
Convex
Relatively “smooth”

→ Does that also apply to AutoDL?

Plots from fANOVA (similar to partial dependence plots) show similar characteristics

Approximated via Random Forest

Example: Learning rate of PPO on cartpole (RL problem) [Lindauer et al. 2019]
⇒ Low effective dimensionality
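
To illustrate what "low effective dimensionality" means in fANOVA terms, here is a minimal sketch under strong assumptions. Instead of a random forest surrogate fitted to real runs, it evaluates a made-up response function on a full grid, so the marginals can be computed exactly. The share of total variance explained by each hyperparameter's marginal (its main effect) then serves as its importance.

```python
import statistics


def response(log_lr, num_layers):
    """Hypothetical validation error: dominated by learning rate, barely by depth."""
    return (log_lr + 3.0) ** 2 + 0.05 * num_layers


log_lrs = [-5.0, -4.0, -3.0, -2.0, -1.0]
layer_counts = [1, 2, 3, 4]

# Evaluate the full grid once; keys are (log_lr, num_layers).
grid = {(lr, n): response(lr, n) for lr in log_lrs for n in layer_counts}
total_variance = statistics.pvariance(list(grid.values()))


def main_effect_fraction(axis_values, axis_index):
    """Variance of the marginal mean along one axis, relative to total variance."""
    marginal = [
        statistics.mean(y for key, y in grid.items() if key[axis_index] == v)
        for v in axis_values
    ]
    return statistics.pvariance(marginal) / total_variance


lr_importance = main_effect_fraction(log_lrs, 0)
layers_importance = main_effect_fraction(layer_counts, 1)
print(f"learning rate: {lr_importance:.3f}, layers: {layers_importance:.3f}")
```

Because the toy function has no interaction term, the two main effects account for all of the variance, and nearly all of it sits in a single dimension.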

8 of 12

Multi-Fidelity Optimization

Only the best configurations are evaluated until the end
→ Makes AutoDL very efficient

Competitive with gradient-based NAS
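
The idea that only the best configurations are evaluated until the end can be sketched with successive halving, one common multi-fidelity scheme. This is a minimal illustration, not Auto-PyTorch's implementation. The learning-curve model, the `floor` and `speed` parameters, and the budget accounting (each rung is counted as training from scratch) are all assumptions.

```python
import math
import random


def validation_error(config, epochs, rng):
    """Hypothetical learning curve: decays towards a config-specific floor."""
    decay = 0.5 * math.exp(-config["speed"] * epochs)
    return config["floor"] + decay + rng.gauss(0.0, 0.005)


def successive_halving(configs, min_epochs=1, max_epochs=27, eta=3, seed=0):
    """Evaluate all configs cheaply, keep the best 1/eta, multiply the budget by eta."""
    rng = random.Random(seed)
    survivors = list(configs)
    epochs = min_epochs
    total_epochs = 0
    while True:
        scored = [
            (validation_error(c, epochs, rng), i, c)
            for i, c in enumerate(survivors)
        ]
        total_epochs += epochs * len(survivors)
        scored.sort(key=lambda item: item[:2])  # by error, index breaks ties
        if epochs >= max_epochs or len(survivors) == 1:
            return scored[0][2], total_epochs
        keep = max(1, len(survivors) // eta)
        survivors = [c for _, _, c in scored[:keep]]
        epochs = min(epochs * eta, max_epochs)


config_rng = random.Random(42)
configs = [
    {"floor": config_rng.uniform(0.05, 0.30), "speed": config_rng.uniform(0.1, 1.0)}
    for _ in range(27)
]
best, used_epochs = successive_halving(configs)
print(best, used_epochs, "epochs instead of", 27 * len(configs))
```

With 27 configurations and these settings, the rungs use 27, 9, 3, and 1 configurations at 1, 3, 9, and 27 epochs, so 108 epochs in total instead of 729 for full training of every configuration. The saving only pays off if low-budget rankings are informative about final performance.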

9 of 12

Correlation between Budgets (e.g., #Epochs)

Kendall's tau between configuration rankings on different budgets
On some datasets, weak correlation
On some datasets, strong correlation
→ How can we effectively determine this?
→ Is correlation really what we care about?
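
For reference, a rank correlation between budgets can be computed as in the sketch below. It implements Kendall's tau-a directly (concordant minus discordant pairs, divided by the number of pairs). The error values for five configurations at three budgets are made up for illustration.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a; tied pairs count as neither concordant nor discordant."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        product = (x[i] - x[j]) * (y[i] - y[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    num_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / num_pairs


# Made-up validation errors of five configurations, keyed by budget in epochs.
errors_by_budget = {
    5: [0.40, 0.35, 0.50, 0.45, 0.30],
    25: [0.30, 0.33, 0.42, 0.28, 0.25],
    50: [0.22, 0.27, 0.35, 0.20, 0.24],
}
final_errors = errors_by_budget[50]
tau_to_final = {
    budget: kendall_tau(errors, final_errors)
    for budget, errors in errors_by_budget.items()
}
print(tau_to_final)
```

Tau weighs every pair of configurations equally, whereas a multi-fidelity optimizer mostly needs the top configurations to stay near the top, which is one reason to question whether correlation is the right measure.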

10 of 12

Hyperparameter Importance across Budgets

Which hyperparameters influence performance the most?
Surprisingly stable across budgets
Different scores from different importance metrics

Global importance and local importance partially do not match

If the DNN is trained for longer:

More layers can be put to better use
The learning rate is less important (if we use a good learning rate scheduler)

fANOVA [Hutter et al. 2014]

LPI [Biedenkapp et al. 2018]
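
A local importance score in the spirit of LPI can be sketched as follows. This is a simplified illustration and not the reference implementation: starting from an incumbent configuration, one hyperparameter at a time is varied while the others stay fixed, and the variance this causes is normalised by the sum of such variances over all hyperparameters. The surrogate `predicted_error`, the value ranges, and the incumbent are hypothetical.

```python
import statistics


def predicted_error(config):
    """Hypothetical surrogate model of validation error."""
    return (config["log_lr"] + 3.0) ** 2 + 0.05 * abs(config["num_layers"] - 4)


# Hypothetical values considered for each hyperparameter.
value_ranges = {
    "log_lr": [-5.0, -4.0, -3.0, -2.0, -1.0],
    "num_layers": [1, 2, 3, 4, 5, 6],
}


def local_importance(incumbent):
    """Share of variance caused by changing each hyperparameter around the incumbent."""
    variances = {}
    for name, values in value_ranges.items():
        costs = [predicted_error({**incumbent, name: v}) for v in values]
        variances[name] = statistics.pvariance(costs)
    total = sum(variances.values())
    return {name: var / total for name, var in variances.items()}


incumbent = {"log_lr": -3.0, "num_layers": 4}
importance = local_importance(incumbent)
print(importance)
```

Unlike a global score, this one depends on where the incumbent sits, which is one way global and local importance can end up disagreeing.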

11 of 12

Take Away

AutoDL is a complex and expensive problem
Multi-fidelity optimization is one of the state-of-the-art approaches

But it opens up new challenges

We have only started to understand the real AutoDL problem

But we are working on it ;-)

12 of 12

Thank you!

@LindauerMarius

@AutoML_org
