Discovering Black-Box Optimizers via Evolutionary Meta-Learning
Robert Tjarko Lange
June 2023
‘The Creation of Adam’ - Michelangelo (ca. 1508 - 1512)
‘The Creation of AGI (by Adam)’ - ML Community
AGI
∇
🧠
🧠
🧠
🧠
🧠
‘Nothing in Biology Makes Sense Except in the Light of Evolution’ - T. Dobzhansky (1973)
🦎
🦎
🦎
Envisioned excitement curve of this talk 🚀
👇
You are here
[hopefully]
Excitement [🔥/Slide]
Time ⌛
👉 tinyurl.com/learned-evo
What is Black-Box Optimization (⚫ + 📦)?
'White-Box' Function Eval
👉 Many BBO/0-order optimization methods: Random search, BayesOpt, Successive Halving, HyperBand, EvoOpt, etc.
'Black-Box' Function Eval
❌
❌
👉 EvoOpt: Sweet spot - can effectively optimize <1 million params
🦎 How does an Evolution Strategy work?
Sample
Iterate
Evaluate
Update
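The four stages above can be sketched as a toy Gaussian evolution strategy minimizing a hypothetical sphere function (settings and the exact update rule are illustrative, not those of any named ES):

```python
import numpy as np

def sphere(x):
    # Hypothetical fitness function to minimize.
    return float(np.sum(x ** 2))

rng = np.random.default_rng(0)
mean, std, popsize = np.zeros(2), 1.0, 16
for gen in range(50):                                            # Iterate
    population = mean + std * rng.standard_normal((popsize, 2))  # Sample
    fitness = np.array([sphere(x) for x in population])          # Evaluate
    elite = population[np.argsort(fitness)[: popsize // 2]]      # Rank & truncate
    mean = elite.mean(axis=0)                                    # Update mean
    std = 0.9 * std + 0.1 * float(elite.std())                   # Update std
```

The search distribution contracts around the elites generation by generation.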
Challenges for Modern Evolutionary Optimization?
jit
grad
vmap
pmap
What is the power of JAX for Evolutionary Optimization?
jax.vmap/jax.pmap
Parallel/Accelerated Fitness Rollouts
Parallel/Accelerated BBO Runs
jax.vmap/jax.pmap
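A concrete sketch of the jax.vmap pattern: wrap a per-candidate fitness function once and evaluate the whole population in a single vectorized, accelerator-friendly call (the sphere fitness is a stand-in for any rollout):

```python
import jax
import jax.numpy as jnp

def fitness(x):
    # Per-candidate fitness: sphere function as a stand-in for a rollout.
    return jnp.sum(x ** 2)

# Vectorize over the leading population axis - one call, all candidates.
batched_fitness = jax.vmap(fitness)

population = jnp.ones((16, 4))        # popsize=16, num_dims=4
scores = batched_fitness(population)  # shape (16,)
```

The same mechanism applied one level up (vmap over whole BBO runs, pmap across devices) gives the parallel/accelerated BBO runs mentioned above.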
👉 evosax: Accelerated Evolutionary Optimization
OpenAI-ES
(Salimans et al., '17)
SNES
(Schaul et al., '11)
PGPE
(Sehnke et al., '10)
Guided ES
(Maheswaranathan et al., '18)
xNES
(Wierstra et al., '14)
CR-FM-NES
(Nomura & Ono, '22)
Simple GA
(Such et al., '17)
IPOP-CMA
(Auger & Hansen, '05)
LM-MA-ES
(Loshchilov et al., '17)
RmES
(Li & Zhang, '17)
MA-ES
(Beyer & Sendhoff, '17)
iAMaLGaM
(Bosman, '13)
GESMR-GA
(Kumar et al., '22)
ARS
(Mania et al., '18)
Persistent ES
(Vicol et al., '21)
ASEBO
(Choromanski et al., '19)
CMA-ES
(Hansen & Ostermeier, '01)
SepCMA-ES
(Ros & Hansen, '08)
BIPOP-CMA
(Hansen, '09)
DES
(Lange et al., '23a)
Learned ES
(Lange et al., '23a)
MR-⅕ GA
(Rechenberg, '78)
Simple ES
(Rechenberg, '78)
PSO
(Kennedy & Eberhart, '95)
SAMR-GA
(Clune et al., '08)
Learned GA
(Lange et al., '23b)
Finite-Difference Gradient-Based
Evolution Strategies
Estimation-of-Distribution
Evolution Strategies
Genetic Algorithms
Many more ES/GA/BBO algorithms
Discovering New Algorithms via Meta-Learning
Manually Designed
Discovered/Learned
Oh et al. (2020)
Veeriah et al. (2021)
Andrychowicz et al. (2016), Metz et al. (2019)
Discovering New Algorithms via Meta-Evolution
Meta-Evolution 🦎²
Fully Black-Box
[Expressive]
Hard Meta-Optimization
Easy Meta-Optimization
Fully White-Box [Interpretable]
Discovery
Plasticity (Confavreux et al., '20)
RL-PG Objectives (Lu et al., '22)
GD Optimizers (Metz et al., '19;'22)
∇
Cheap Talk Channels (Lu et al., '23)
Synthetic Envs (Ferreira et al., '22)
Why not use Meta-∇ instead of Meta-🦎? 🤔
[Pascanu et al., '13]
Meta-Gradients ∇²
∇ Propagation through ∇ Updates [Online Cross-Validation; Sutton, 1992]
…..
Discovering Evolutionary Optimizers (🦎 & 🧬)
Evolution Strategies 🦎 (Lange et al., '23a - @ICLR)
Genetic Algorithms 🧬 (Lange et al., '23b - @GECCO)
Part 1 🦎
Part 2 🧬
White-Box Evolution Strategy: Gaussian Search
Inflexible: Fixed weights per rank + fixed learning rates.
Rank Evaluations�By Performance
Mean
Std
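A minimal sketch of such a fixed-weight Gaussian search update (log-rank recombination weights in the CMA-ES style; the learning rates and std rule are illustrative assumptions):

```python
import numpy as np

def rank_weights(popsize):
    # Fixed log-rank recombination weights: depend only on rank.
    ranks = np.arange(1, popsize + 1)
    w = np.maximum(0.0, np.log(popsize / 2 + 1) - np.log(ranks))
    return w / w.sum()

def gaussian_es_update(mean, std, population, fitness, lr_mean=1.0, lr_std=0.1):
    order = np.argsort(fitness)  # Rank evaluations by performance (min.)
    w = rank_weights(len(fitness))
    sorted_pop = population[order]
    # Weighted recombination moves the mean; std tracks the weighted spread.
    new_mean = (1 - lr_mean) * mean + lr_mean * (w @ sorted_pop)
    spread = np.sqrt(w @ (sorted_pop - new_mean) ** 2)
    new_std = (1 - lr_std) * std + lr_std * spread
    return new_mean, new_std
```

The inflexibility is visible directly: the weights `w` and the learning rates are frozen, no matter what the fitness landscape looks like - exactly what LES replaces with learned, attention-based quantities.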
🦎 Learned Evolution Strategy (LES) Architecture
MLP
Meta-Evolved Set Transformer
Std
Mean
LES Discovery via Meta-Black-Box Optimization 🦎(🦎)
Meta-Training Details for LES Discovery
1 Sphere
2 Ellipsoid separable
3 Rastrigin separable
4 Skew Rastrigin-Bueche sep.
5 Linear Slope
6 Attractive sector
7 Step ellipsoid
8 Rosenbrock rotated
9 Rosenbrock original
10 Ellipsoid
11 Discus
12 Bent cigar
13 Sharp ridge
14 Sum different powers
15 Rastrigin
16 Weierstrass
17 Schaffer F7, cond 10
18 Schaffer F7, cond 1000
19 Griewank-Rosenbrock F8F2
20 Schwefel
21 Gallagher 101 peaks
22 Gallagher 21 peaks
23 Katsuuras
24 Lunacek bi-Rastrigin
BBOB Functions
Random Offsets, Init, Eval Noise & Inner Loop
S + E + U
Popsize: 16
# Gens: 50
Separable
Moderate Condition
High Condition
Multi-Modal (Global Structure)
Multi-Modal (Weak Structure)
Subset of Functions & #Problem Dims (2-10D)
Dixon
Discovering LES: Meta-Training on Low-D BBOB
👉 Generalization across population size/dims.
👉 Generalization to unseen problems
Evaluating LES: Brax Control Tasks
BBOB Meta-Trained LES Generalizes to Control Tasks
👉 LES meta-trained on simple functions yields SotA neuroevolution ES
👉 Extreme generalization across time horizon, population size & problems
Scaling Meta-Distributions Improves LES Discovery
What Has The Learned Evolution Strategy Discovered?
👉 Compressed weight rule into closed-form
👉 Tuning: competitive white-box ES
👉 Recombination weights contribute most
👉 Can switch between hill-climbing & FD
LES & the evosax API: ask-evaluate-tell
Initialize
import jax
from evosax import LES

# Instantiate the search strategy
rng = jax.random.PRNGKey(0)
strategy = LES(popsize=20, num_dims=2)
state = strategy.initialize(rng)

# Run ask-eval-tell loop: By default minimization!
num_generations = 100
for t in range(num_generations):
    rng, rng_ask, rng_eval = jax.random.split(rng, 3)
    # Ask: sample a new population of candidate solutions
    x, state = strategy.ask(rng_ask, state)
    # Evaluate: your population evaluation fct
    fitness = ...
    # Tell: update the search distribution state
    state = strategy.tell(x, fitness, state)

# Get best overall population member & its fitness
state.best_member, state.best_fitness
Self-Referential Meta-Evolution of Learned ES
Fitness properties
Meta-learned self-attention
Sample & evaluate
Search dist. update
Inner Loop: LES
Sample tasks
Run search
Normalize &�meta-update
Sample LES
parameters
Outer Loop: MetaBBO
Sample tasks
Run search
Meta-improve:
Copy best & set mean/std
Sample LES
Meta-update
Outer Loop: LES
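The outer/inner structure can be caricatured in a few lines: an outer black-box search over inner-loop optimizer parameters, scored by the inner loop's final fitness. Here the "learned" parameter is just a learning rate and the outer search is random sampling - purely illustrative, not the actual MetaBBO setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def inner_es(lr, gens=30, popsize=16, dims=2):
    # Inner loop: a Gaussian ES on a sphere task, parametrized by lr.
    mean, std = np.ones(dims), 1.0
    for _ in range(gens):
        pop = mean + std * rng.standard_normal((popsize, dims))
        fit = np.sum(pop ** 2, axis=1)
        elite = pop[np.argsort(fit)[: popsize // 2]]
        mean = (1 - lr) * mean + lr * elite.mean(axis=0)
        std = (1 - lr) * std + lr * float(elite.std())
    return float(np.sum(mean ** 2))  # Final task fitness (lower = better)

# Outer loop: sample candidate inner-loop parameters, keep the best.
candidates = rng.uniform(0.05, 1.0, size=8)
meta_scores = [inner_es(lr) for lr in candidates]
best_lr = candidates[int(np.argmin(meta_scores))]
```

In the real pipeline the inner loop is LES itself and the outer loop is an ES over LES's network parameters - which is what makes the self-referential variant (LES improving LES) possible.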
🧬 How does a Genetic Algorithm work?
Mutate
Iterate
Evaluate
Select
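The mutate-evaluate-select cycle above, as a toy sketch with truncation selection and Gaussian mutation (elite fraction and noise scale are arbitrary choices):

```python
import numpy as np

def ga_generation(rng, population, fitness, elite_frac=0.5, sigma=0.1):
    # Select: keep the best-performing fraction (minimization).
    popsize = population.shape[0]
    n_elite = max(1, int(elite_frac * popsize))
    elite = population[np.argsort(fitness)[:n_elite]]
    # Mutate: resample parents with replacement, perturb with Gaussian noise.
    parents = elite[rng.integers(0, n_elite, size=popsize)]
    children = parents + sigma * rng.standard_normal(parents.shape)
    return children
```

LGA replaces the two hand-set choices here - who survives (`elite_frac`) and how strongly to mutate (`sigma`) - with learned, attention-based modules.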
🧬 Learned Genetic Algorithms (LGA) 👉 🦎(🧬)
Sample meta-train tasks
Run search
Normalize &�meta-update
Sample LGA
parameters
Outer: MetaBBO
Evaluate
children population
Inner: Learned GA
Learned mutation rate adaptation via self-attention
Learned selection via cross-attention
LGA Generalizes to HPO-B & Neuroevolution Tasks
LGA Applies Adaptive Elitism & MR Adaptation
👉 Adaptive elite size & mutation rate adaptation
👉 Transfer of NN GA operators
Summary: Discovering LES & LGA via MetaBBO
Learned ES Architecture
Self-Ref. MetaBBO
Learned GA Architecture
Meta-Generalization
Reverse Engineering
On Survivorship Bias & The Hardware Lottery (Hooker, '21)
3e-04
🩹?
🦎
🦎
🦎
🦎
Recap, pointers & the future of learned-evo 🚀
👇
You are here 🚀
Excitement [🔥/Slide]
Time ⌛
👉 github.com/RobertTLange/evosax
@GECCO 2023 🦎
∇
AGI
P.S.: Also check out gymnax 🏋️ - JAX-Based RL Envs
👉 github.com/RobertTLange/gymnax
Supplementary Slides
Discovery via Meta-Gradients? The Good, the Bad & Ugly
Meta-Gradients ∇
Optimization Tricks ∇∇
Truncation
Meta-Adam ε
∇∇ Clipping
∇ Propagation through ∇ Updates [Online Cross-Val; Sutton, 1992]
…..
Flennerhag et al. (2022)
[Bootstrapped MGs]
Zenke & Ganguli (2018) [Surrogate ∇]
Long Horizon
Cons ❌
Non-Diff
Towards a Fully-Black-Box Task-Dependent Hybrid BBO
Multi-Modal
Regime
Single-Modal Regime
Genetic�Algorithm
Evolution Strategy
Task
Space
LES/LGA
When might you have to do Black-Box Optimization?
Sparse Rewards/Local Optima
Hyperparameter Optimization
Non-Differentiable Operations
Chaos: Exploding Gradients [Metz et al., 2019] - 2D Meta-Loss Slice
Evol. Optimization, Set Operations & Self-Attention
Meta-Evolving Evolution 🦎🦎
Inductive Biases in ML
Convolutions
Translation Invariance
RNNs & Gating
Sequential Data
'Black-Box'
Neural Network
Self-Attention
Permutation Invariance
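Permutation invariance is exactly the right inductive bias for population-based updates: the order in which candidates arrive should not matter. A minimal sketch (single attention head, identity Q/K/V projections) makes the equivariance checkable:

```python
import numpy as np

def self_attention(X):
    # Unmasked single-head self-attention over a set of row vectors.
    scores = X @ X.T / np.sqrt(X.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)  # Stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ X

# A "population" of 5 candidates with 3 features each.
X = np.random.default_rng(0).standard_normal((5, 3))
out = self_attention(X)
```

Permuting the input rows permutes the output rows identically - so an attention-based search update treats the population as a set, just like classical rank-based recombination does.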
What Has The Learned Evolution Strategy Discovered?
👉 Recombination weights contribute most
👉 Can switch between hill-climbing & FD
Reverse-Engineering Discovered LES Mechanisms
👉 Can compress weight rule into closed-form equation
👉 Correct tuning yields competitive white-box ES
P.S.: Can LLMs already do Black-Box Optimization?
(not quite yet?) 😋
Detailed Results: Neuroevolution - Brax - Radar
Brax Performance
Test Accuracy
Small Budget
Medium Budget
Large Budget
Confidential - DeepMind
Broad Meta-Distributions Improve LES Discovery?
+
Attention-Based Parametrization of Selection & MRA
Select Q/K
Adaptation
Featurize
Self-Attention
Adapt A
Mult. Adapt
Selection
Featurize
Cross-Attention
Select Matrix