1 of 43

Discovering Black-Box Optimizers via Evolutionary Meta-Learning

Robert Tjarko Lange

June 2023

2 of 43

‘The Creation of Adam’ - Michelangelo (ca. 1508 - 1512)

‘The Creation of AGI (by Adam)’ - ML Community


‘Nothing in Biology Makes Sense Except in the Light of Evolution’ - T. Dobzhansky (1973)


3 of 43

Envisioned excitement curve of this talk 🚀

[Figure: excitement (🔥/slide) over time ⌛ - '👇 You are here [hopefully]' marked at the start of the curve]

👉 tinyurl.com/learned-evo

4 of 43

What is Black-Box Optimization (⚫ + 📦)?

'White-Box' Function Eval

👉 Many BBO/0-order optimization methods: Random search, BayesOpt, Successive Halving, HyperBand, EvoOpt, etc.

'Black-Box' Function Eval

👉 EvoOpt: sweet spot - can effectively optimize < 1 million parameters
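All the zero-order methods listed above share one interface: they only query function values f(x), never gradients. A minimal sketch of the simplest such method, random search, under illustrative assumptions (the function name, search bounds, and budget are not from the talk):

```python
import numpy as np

def random_search(fn, num_dims, num_evals=1000, low=-5.0, high=5.0, seed=0):
    """Simplest black-box optimizer: uniform sampling, keep the best evaluation."""
    rng = np.random.default_rng(seed)
    best_x, best_f = None, np.inf
    for _ in range(num_evals):
        x = rng.uniform(low, high, size=num_dims)
        f = fn(x)  # Black-box query: f(x) is all we ever observe
        if f < best_f:
            best_x, best_f = x, f
    return best_x, best_f

# Usage: minimize the 2D sphere function
best_x, best_f = random_search(lambda x: np.sum(x ** 2), num_dims=2)
```

Every other BBO method on the slide (BayesOpt, HyperBand, EvoOpt, ...) can be viewed as a smarter way to propose the next query point x.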

5 of 43

🦎 How does an Evolution Strategy work?

Sample → Evaluate → Update → Iterate
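The loop above can be sketched as a minimal (μ/μ, λ)-style ES with an isotropic Gaussian search distribution. All names and hyperparameters here are illustrative defaults, not the talk's algorithm:

```python
import numpy as np

def simple_es(fn, num_dims, popsize=16, sigma=0.1, generations=150, seed=0):
    """Minimal ES loop: sample perturbations, evaluate, update mean, iterate."""
    rng = np.random.default_rng(seed)
    mean = np.ones(num_dims)  # Search distribution mean (arbitrary init)
    for _ in range(generations):
        # Sample: perturb the mean with Gaussian noise
        candidates = mean + sigma * rng.standard_normal((popsize, num_dims))
        # Evaluate: fitness of each population member (minimization)
        fitness = np.array([fn(x) for x in candidates])
        # Update: move the mean to the centroid of the best half
        elite = candidates[np.argsort(fitness)[: popsize // 2]]
        mean = elite.mean(axis=0)
    return mean

# Usage: minimize the 5D sphere function
best = simple_es(lambda x: np.sum(x ** 2), num_dims=5)
```

Real strategies such as CMA-ES additionally adapt sigma and the full covariance online; this sketch keeps them fixed for clarity.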

6 of 43

Challenges for Modern Evolutionary Optimization?

jit · grad · vmap · pmap

7 of 43

What is the power of JAX for Evolutionary Optimization?

jax.vmap/jax.pmap → Parallel/Accelerated Fitness Rollouts

jax.vmap/jax.pmap → Parallel/Accelerated BBO Runs

[Figure: population members 🦎 evaluated via parallel agent/environment rollouts 🤖🌎, and entire BBO runs vectorized across devices]
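The same transformation handles both levels: `jax.vmap` maps a single-candidate fitness function over the population axis, and the result can be JIT-compiled. A small sketch with an assumed toy fitness function (the real use case would roll out a policy in an environment):

```python
import jax
import jax.numpy as jnp

# Fitness of a single candidate: the shifted sphere function (toy stand-in
# for e.g. a policy rollout).
def fitness(params):
    return jnp.sum((params - 1.0) ** 2)

# Vectorize evaluation over the population axis, then JIT-compile the
# whole batched evaluation for the accelerator.
batched_fitness = jax.jit(jax.vmap(fitness))

rng = jax.random.PRNGKey(0)
population = jax.random.normal(rng, (32, 10))  # 32 candidates, 10 dims
scores = batched_fitness(population)           # shape (32,)
```

Wrapping the batched call in another `jax.vmap` (or `jax.pmap` across devices) vectorizes entire BBO runs, e.g. over random seeds or hyperparameters.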

8 of 43

👉 evosax: Accelerated Evolutionary Optimization

OpenAI-ES

(Salimans et al., '17)

SNES

(Schaul et al., '11)

PGPE

(Sehnke et al., '10)

Guided ES

(Maheswaranathan et al. '18)

xNES

(Wierstra et al., '14)

CR-FM-NES

(Nomura & Ono, '22)

Simple GA

(Such et al., '17)

IPOP-CMA

(Auger & Hansen, '05)

LM-MA-ES

(Loshchilov et al., '17)

RmES

(Li & Zhang, '17)

MA-ES

(Beyer & Sendhoff, '17)

iAMaLGaM

(Bosman, '13)

GESMR-GA

(Kumar et al., '22)

ARS

(Mania et al., '18)

Persistent ES

(Vicol et al., '21)

ASEBO

(Choromanski et al., '19)

CMA-ES

(Hansen & Ostermeier, '01)

SepCMA-ES

(Ros & Hansen, '08)

BIPOP-CMA

(Hansen, '09)

DES

(Lange et al., '23a)

Learned ES

(Lange et al., '23a)

MR-⅕ GA

(Rechenberg, '78)

Simple ES

(Rechenberg, '78)

PSO

(Kennedy & Eberhart, '95)

SAMR-GA

(Clune et al., '08)

Learned GA

(Lange et al., '23b)

Families: Finite-Difference Gradient-Based · Evolution Strategies · Estimation-of-Distribution · Genetic Algorithms

Many more ES/GA/BBO algorithms

9 of 43

Discovering New Algorithms via Meta-Learning

Manually Designed

Discovered/Learned

Oh et al. (2020)

Veeriah et al. (2021)

Andrychowicz et al. (2016), Metz et al. (2019)

10 of 43

Discovering New Algorithms via Meta-Evolution

Meta-Evolution 🦎²

Fully Black-Box [Expressive] ↔ Fully White-Box [Interpretable]

Hard Meta-Optimization ↔ Easy Meta-Optimization

Discovery

Plasticity (Confavreux et al., '20)

RL-PG Objectives (Lu et al., '22)

GD Optimizers (Metz et al., '19;'22)

Cheap Talk Channels (Lu et al., '23)

Synthetic Envs (Ferreira et al., '22)

11 of 43

Why not use Meta-∇ instead of Meta-🦎? 🤔

[Pascanu et al., '13]

Meta-Gradients ∇²

∇ Propagation through ∇ Updates [Online Cross-Validation; Sutton, 1992]


12 of 43

Discovering Evolutionary Optimizers (🦎 & 🧬)

Evolution Strategies 🦎 (Lange et al., '23a - @ICLR)

Genetic Algorithms 🧬 (Lange et al., '23b - @GECCO)

Part 1 🦎

Part 2 🧬

13 of 43

White-Box Evolution Strategy: Gaussian Search

Inflexible: Fixed weights per rank + fixed learning rates.

Rank Evaluations By Performance

Mean

Std
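The inflexibility on this slide can be made concrete: a white-box Gaussian search ES assigns fixed recombination weights per fitness rank and uses fixed learning rates for the mean/std updates. A sketch with CMA-ES-style log-rank weights (the function name and learning rates are illustrative assumptions):

```python
import numpy as np

def gaussian_search_update(mean, std, candidates, fitness, lr_mean=1.0, lr_std=0.1):
    """One white-box ES step: fixed weights per rank + fixed learning rates."""
    popsize = len(fitness)
    mu = popsize // 2
    order = np.argsort(fitness)  # Rank evaluations by performance (best first)
    # Fixed log-rank recombination weights, identical every generation.
    raw = np.log(mu + 0.5) - np.log(np.arange(1, mu + 1))
    weights = raw / raw.sum()
    elite = candidates[order[:mu]]
    # Mean update: weighted recombination of the elite members
    new_mean = (1 - lr_mean) * mean + lr_mean * (weights @ elite)
    # Std update: weighted spread of elites around the old mean
    new_std = (1 - lr_std) * std + lr_std * np.sqrt(weights @ (elite - mean) ** 2)
    return new_mean, new_std
```

LES replaces exactly these hard-coded pieces: the rank weights and learning rates become outputs of a meta-evolved network.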

14 of 43

🦎 Learned Evolution Strategy (LES) Architecture

MLP

Meta-Evolved Set Transformer

Std

Mean

15 of 43

LES Discovery via Meta-Black-Box Optimization 🦎(🦎)

16 of 43

Meta-Training Details for LES Discovery

1 Sphere

2 Ellipsoid separable

3 Rastrigin separable

4 Skew Rastrigin-Bueche sep.

5 Linear Slope

6 Attractive sector

7 Step ellipsoid

8 Rosenbrock rotated

9 Rosenbrock original

10 Ellipsoid

11 Discus

12 Bent cigar

13 Sharp ridge

14 Sum different powers

15 Rastrigin

16 Weierstrass

17 Schaffer F7, cond 10

18 Schaffer F7, cond 1000

19 Griewank-Rosenbrock F8F2

20 Schwefel

21 Gallagher 101 peaks

22 Gallagher 21 peaks

23 Katsuuras

24 Lunacek bi-Rastrigin

BBOB Functions

Random Offsets, Init, Eval Noise & Inner Loop

S + E + U (Sample + Evaluate + Update)

Popsize: 16

# Gens: 50

Separable

Moderate Condition

High Condition

Multi-Modal (Global Structure)

Multi-Modal (Weak Structure)

Subset of Functions & #Problem Dims (2-10D)


17 of 43

Discovering LES: Meta-Training on Low-D BBOB

👉 Generalization across population size/dims.

👉 Generalization to unseen problems

18 of 43

Evaluating LES: Brax Control Tasks

19 of 43

BBOB Meta-Trained LES Generalizes to Control Tasks

👉 LES meta-trained on simple functions yields SotA neuroevolution ES

👉 Extreme generalization across time horizon, population size & problems

20 of 43

Scaling Meta-Distributions Improves LES Discovery

21 of 43

What Has The Learned Evolution Strategy Discovered?

👉 Compressed weight rule into closed-form

👉 Tuning: competitive white-box ES

👉 Recombination weights contribute most

👉 Can switch between hill-climbing & FD

22 of 43

LES & the evosax API: ask-evaluate-tell

import jax
from evosax import LES

# Initialize: instantiate the search strategy
rng = jax.random.PRNGKey(0)
strategy = LES(popsize=20, num_dims=2)
state = strategy.initialize(rng)

# Run ask-eval-tell loop: by default minimization!
num_generations = 100
for t in range(num_generations):
    rng, rng_ask, rng_eval = jax.random.split(rng, 3)
    # Ask: sample a new population of candidate solutions
    x, state = strategy.ask(rng_ask, state)
    # Evaluate: your population evaluation fct
    fitness = ...
    # Tell: update the search distribution
    state = strategy.tell(x, fitness, state)

# Get best overall population member & its fitness
state.best_member, state.best_fitness

23 of 43

Self-Referential Meta-Evolution of Learned ES

Fitness properties

Meta-learned self-attention

Sample & evaluate

Search dist. update

Inner Loop: LES

Sample tasks

Run search

Normalize & meta-update

Sample LES parameters

Outer Loop: MetaBBO

Sample tasks

Run search

Meta-improve:

Copy best & set mean/std

Sample LES

Meta-update

Outer Loop: LES

24 of 43

🧬 How does a Genetic Algorithm work?

Mutate → Evaluate → Select → Iterate
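The GA loop can be sketched in the same minimal style as the ES loop earlier: truncation selection with elitism plus Gaussian mutation. Function name and hyperparameters are illustrative assumptions, not the LGA:

```python
import numpy as np

def simple_ga(fn, num_dims, popsize=32, elite_frac=0.25, sigma=0.1,
              generations=200, seed=0):
    """Minimal GA: evaluate -> select (truncation + elitism) -> mutate -> iterate."""
    rng = np.random.default_rng(seed)
    population = rng.standard_normal((popsize, num_dims))
    num_elite = max(1, int(popsize * elite_frac))
    for _ in range(generations):
        # Evaluate: fitness of each member (minimization)
        fitness = np.array([fn(x) for x in population])
        # Select: keep the elite members unchanged (elitism)
        elite = population[np.argsort(fitness)[:num_elite]]
        # Mutate: children are Gaussian-perturbed copies of random elites
        parents = elite[rng.integers(num_elite, size=popsize - num_elite)]
        children = parents + sigma * rng.standard_normal(parents.shape)
        population = np.vstack([elite, children])
    fitness = np.array([fn(x) for x in population])
    return population[fitness.argmin()]

# Usage: minimize the 5D sphere function
best = simple_ga(lambda x: np.sum(x ** 2), num_dims=5)
```

The hand-designed choices here (fixed elite fraction, fixed mutation rate sigma) are precisely what LGA learns to adapt via attention.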

25 of 43

🧬 Learned Genetic Algorithms (LGA) 👉 🦎(🧬)

Sample meta-train tasks

Run search

Normalize & meta-update

Sample LGA parameters

Outer: MetaBBO

Evaluate children population

Inner: Learned GA

Learned mutation rate adaptation via self-attention

Learned selection via cross-attention

26 of 43

LGA Generalizes to HPO-B & Neuroevolution Tasks

27 of 43

LGA Applies Adaptive Elitism & MR Adaptation

👉 Adaptive elite size & mutation rate adaptation

👉 Transfer of NN GA operators

28 of 43

Summary: Discovering LES & LGA via MetaBBO

Learned ES Architecture

Self-Ref. MetaBBO

Learned GA Architecture

Meta-Generalization

Reverse Engineering

29 of 43

On Survivorship Bias & The Hardware Lottery (Hooker, '21)


30 of 43

Recap, pointers & the future of learned-evo 🚀

[Figure: excitement (🔥/slide) over time ⌛ - '👇 You are here 🚀']

👉 github.com/RobertTLange/evosax

@GECCO 2023 🦎

31 of 43

AGI

32 of 43

P.S.: Also checkout gymnax 🏋️- JAX-Based RL Envs

👉 github.com/RobertTLange/gymnax

33 of 43

Supplementary Slides

34 of 43

Discovery via Meta-Gradients? The Good, the Bad & Ugly

Meta-Gradients ∇

Optimization Tricks ∇∇

Truncation

Meta-Adam ε

∇∇ Clipping

∇ Propagation through ∇ Updates [Online Cross-Val; Sutton, 1992]


Flennerhag et al. (2022)

[Bootstrapped MGs]

Zenke & Ganguli (2018) [Surrogate ∇]

Cons: Long Horizons, Non-Diff. Objectives & Operations

35 of 43

Towards a Fully-Black-Box Task-Dependent Hybrid BBO

Multi-Modal Regime

Single-Modal Regime

Genetic Algorithm

Evolution Strategy

Task Space

LES/LGA

36 of 43

When might you have to do Black-Box Optimization?

Sparse Rewards/Local Optima

Hyperparameter Optimization

Non-Differentiable Operations

Chaos: Exploding Gradients [Metz et al., 2019] - 2D Meta-Loss Slice

37 of 43

Evol. Optimization, Set Operations & Self-Attention

Meta-Evolving Evolution 🦎🦎

Inductive Biases in ML

'Black-Box' Neural Network

Convolutions → Translation Invariance

RNNs & Gating → Sequential Data

Self-Attention → Permutation Invariance
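Why self-attention is the right inductive bias here: a population is a set, so shuffling its members must not change the update rule. A tiny sketch demonstrating this property (single attention head with identity projections, purely illustrative):

```python
import numpy as np

def self_attention(X):
    """Single-head self-attention with identity Q/K/V projections."""
    scores = X @ X.T / np.sqrt(X.shape[1])
    # Row-wise softmax over attention scores
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ X

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 4))  # Population of 5 members, 4 features each
perm = rng.permutation(5)

out = self_attention(X)
out_perm = self_attention(X[perm])
```

Permuting the population permutes the outputs identically (equivariance), and any symmetric pooling on top, e.g. a mean, yields a fully permutation-invariant summary, which is exactly what a learned ES/GA update needs.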

38 of 43

What Has The Learned Evolution Strategy Discovered?

👉 Recombination weights contribute most

👉 Can switch between hill-climbing & FD

39 of 43

Reverse-Engineering Discovered LES Mechanisms

👉 Can compress weight rule into closed-form equation

👉 Correct tuning yields competitive white-box ES

40 of 43

P.S.: Can LLMs already do Black-Box Optimization?

(not quite yet?) 😋

41 of 43

Detailed Results: Neuroevolution - Brax - Radar

Brax Performance

Test Accuracy

Small Budget

Medium Budget

Large Budget

Confidential - DeepMind

42 of 43

Broad Meta-Distributions Improve LES Discovery?

+


43 of 43

Attention-Based Parametrization of Selection & MRA

Select Q/K

Adaptation

Featurize

Self-Attention

Adapt A

Mult. Adapt

Selection

Featurize

Cross-Attention

Select Matrix