1 of 43

Discovering Black-Box Optimizers via Evolutionary Meta-Learning

Robert Tjarko Lange

June 2023

2 of 43

‘The Creation of Adam’ - Michelangelo (ca. 1508 - 1512)

‘The Creation of AGI (by Adam)’ - ML Community


‘Nothing in Biology Makes Sense Except in the Light of Evolution’ - T. Dobzhansky (1973)


3 of 43

Envisioned excitement curve of this talk 🚀

[Figure: excitement (🔥/slide) over time ⌛ - '👇 You are here [hopefully]' marked at the start of the curve]

👉 tinyurl.com/learned-evo

4 of 43

What is Black-Box Optimization (⚫ + 📦)?

'White-Box' Function Eval

👉 Many BBO/0-order optimization methods: Random search, BayesOpt, Successive Halving, HyperBand, EvoOpt, etc.

'Black-Box' Function Eval

👉 EvoOpt: sweet spot - can effectively optimize < 1 million parameters
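All the zero-order methods listed above share one interface: they only query function values f(x), never gradients. A minimal sketch of the simplest such method, random search, under illustrative assumptions (the function name, search bounds, and budget are not from the talk):

```python
import numpy as np

def random_search(fn, num_dims, num_evals=1000, low=-5.0, high=5.0, seed=0):
    """Simplest black-box optimizer: uniform sampling, keep the best evaluation."""
    rng = np.random.default_rng(seed)
    best_x, best_f = None, np.inf
    for _ in range(num_evals):
        x = rng.uniform(low, high, size=num_dims)
        f = fn(x)  # Black-box query: f(x) is all we ever observe
        if f < best_f:
            best_x, best_f = x, f
    return best_x, best_f

# Usage: minimize the 2D sphere function
best_x, best_f = random_search(lambda x: np.sum(x ** 2), num_dims=2)
```

Every other BBO method on the slide (BayesOpt, HyperBand, EvoOpt, ...) can be viewed as a smarter way to propose the next query point x.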

5 of 43

🦎 How does an Evolution Strategy work?

Sample → Evaluate → Update → Iterate
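The loop above can be sketched as a minimal (μ/μ, λ)-style ES with an isotropic Gaussian search distribution. All names and hyperparameters here are illustrative defaults, not the talk's algorithm:

```python
import numpy as np

def simple_es(fn, num_dims, popsize=16, sigma=0.1, generations=150, seed=0):
    """Minimal ES loop: sample perturbations, evaluate, update mean, iterate."""
    rng = np.random.default_rng(seed)
    mean = np.ones(num_dims)  # Search distribution mean (arbitrary init)
    for _ in range(generations):
        # Sample: perturb the mean with Gaussian noise
        candidates = mean + sigma * rng.standard_normal((popsize, num_dims))
        # Evaluate: fitness of each population member (minimization)
        fitness = np.array([fn(x) for x in candidates])
        # Update: move the mean to the centroid of the best half
        elite = candidates[np.argsort(fitness)[: popsize // 2]]
        mean = elite.mean(axis=0)
    return mean

# Usage: minimize the 5D sphere function
best = simple_es(lambda x: np.sum(x ** 2), num_dims=5)
```

Real strategies such as CMA-ES additionally adapt sigma and the full covariance online; this sketch keeps them fixed for clarity.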

6 of 43

Challenges for Modern Evolutionary Optimization?

jit · grad · vmap · pmap

7 of 43

What is the power of JAX for Evolutionary Optimization?

jax.vmap/jax.pmap → Parallel/Accelerated Fitness Rollouts

jax.vmap/jax.pmap → Parallel/Accelerated BBO Runs

[Figure: population members 🦎 evaluated via parallel agent/environment rollouts 🤖🌎, and entire BBO runs vectorized across devices]
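The same transformation handles both levels: `jax.vmap` maps a single-candidate fitness function over the population axis, and the result can be JIT-compiled. A small sketch with an assumed toy fitness function (the real use case would roll out a policy in an environment):

```python
import jax
import jax.numpy as jnp

# Fitness of a single candidate: the shifted sphere function (toy stand-in
# for e.g. a policy rollout).
def fitness(params):
    return jnp.sum((params - 1.0) ** 2)

# Vectorize evaluation over the population axis, then JIT-compile the
# whole batched evaluation for the accelerator.
batched_fitness = jax.jit(jax.vmap(fitness))

rng = jax.random.PRNGKey(0)
population = jax.random.normal(rng, (32, 10))  # 32 candidates, 10 dims
scores = batched_fitness(population)           # shape (32,)
```

Wrapping the batched call in another `jax.vmap` (or `jax.pmap` across devices) vectorizes entire BBO runs, e.g. over random seeds or hyperparameters.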

8 of 43

👉 evosax: Accelerated Evolutionary Optimization

OpenAI-ES

(Salimans et al., '17)

SNES

(Schaul et al., '11)

PGPE

(Sehnke et al., '10)

Guided ES

(Maheswaranathan et al. '18)

xNES

(Wierstra et al., '14)

CR-FM-NES

(Nomura & Ono, '22)

Simple GA

(Such et al., '17)

IPOP-CMA

(Auger & Hansen, '05)

LM-MA-ES

(Loshchilov et al., '17)

RmES

(Li & Zhang, '17)

MA-ES

(Beyer & Sendhoff, '17)

iAMaLGaM

(Bosman, '13)

GESMR-GA

(Kumar et al., '22)

ARS

(Mania et al., '18)

Persistent ES

(Vicol et al., '21)

ASEBO

(Choromanski et al., '19)

CMA-ES

(Hansen & Ostermeier, '01)

SepCMA-ES

(Ros & Hansen, '08)

BIPOP-CMA

(Hansen, '09)

DES

(Lange et al., '23a)

Learned ES

(Lange et al., '23a)

MR-⅕ GA

(Rechenberg, '78)

Simple ES

(Rechenberg, '78)

PSO

(Kennedy & Eberhart, '95)

SAMR-GA

(Clune et al., '08)

Learned GA

(Lange et al., '23b)

Families: Finite-Difference Gradient-Based · Evolution Strategies · Estimation-of-Distribution · Genetic Algorithms

Many more ES/GA/BBO algorithms

9 of 43

Discovering New Algorithms via Meta-Learning

Manually Designed

Discovered/Learned

Oh et al. (2020)

Veeriah et al. (2021)

Andrychowicz et al. (2016), Metz et al. (2019)

10 of 43

Discovering New Algorithms via Meta-Evolution

Meta-Evolution 🦎²

Fully Black-Box [Expressive] ↔ Fully White-Box [Interpretable]

Hard Meta-Optimization ↔ Easy Meta-Optimization

Discovery

Plasticity (Confavreux et al., '20)

RL-PG Objectives (Lu et al., '22)

GD Optimizers (Metz et al., '19;'22)

Cheap Talk Channels (Lu et al., '23)

Synthetic Envs (Ferreira et al., '22)

11 of 43

Why not use Meta-∇ instead of Meta-🦎? 🤔

[Pascanu et al., '13]

Meta-Gradients ∇²

∇ Propagation through ∇ Updates [Online Cross-Validation; Sutton, 1992]


12 of 43

Discovering Evolutionary Optimizers (🦎 & 🧬)

Evolution Strategies 🦎 (Lange et al., '23a - @ICLR)

Genetic Algorithms 🧬 (Lange et al., '23b - @GECCO)

Part 1 🦎

Part 2 🧬

13 of 43

White-Box Evolution Strategy: Gaussian Search

Inflexible: Fixed weights per rank + fixed learning rates.

Rank Evaluations By Performance

Mean

Std
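The inflexibility on this slide can be made concrete: a white-box Gaussian search ES assigns fixed recombination weights per fitness rank and uses fixed learning rates for the mean/std updates. A sketch with CMA-ES-style log-rank weights (the function name and learning rates are illustrative assumptions):

```python
import numpy as np

def gaussian_search_update(mean, std, candidates, fitness, lr_mean=1.0, lr_std=0.1):
    """One white-box ES step: fixed weights per rank + fixed learning rates."""
    popsize = len(fitness)
    mu = popsize // 2
    order = np.argsort(fitness)  # Rank evaluations by performance (best first)
    # Fixed log-rank recombination weights, identical every generation.
    raw = np.log(mu + 0.5) - np.log(np.arange(1, mu + 1))
    weights = raw / raw.sum()
    elite = candidates[order[:mu]]
    # Mean update: weighted recombination of the elite members
    new_mean = (1 - lr_mean) * mean + lr_mean * (weights @ elite)
    # Std update: weighted spread of elites around the old mean
    new_std = (1 - lr_std) * std + lr_std * np.sqrt(weights @ (elite - mean) ** 2)
    return new_mean, new_std
```

LES replaces exactly these hard-coded pieces: the rank weights and learning rates become outputs of a meta-evolved network.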

14 of 43

🦎 Learned Evolution Strategy (LES) Architecture

MLP

Meta-Evolved Set Transformer

Std

Mean

15 of 43

LES Discovery via Meta-Black-Box Optimization 🦎(🦎)

16 of 43

Meta-Training Details for LES Discovery

1 Sphere

2 Ellipsoid separable

3 Rastrigin separable

4 Skew Rastrigin-Bueche sep.

5 Linear Slope

6 Attractive sector

7 Step ellipsoid

8 Rosenbrock rotated

9 Rosenbrock original

10 Ellipsoid

11 Discus

12 Bent cigar

13 Sharp ridge

14 Sum different powers

15 Rastrigin

16 Weierstrass

17 Schaffer F7, cond 10

18 Schaffer F7, cond 1000

19 Griewank-Rosenbrock F8F2

20 Schwefel

21 Gallagher 101 peaks

22 Gallagher 21 peaks

23 Katsuuras

24 Lunacek bi-Rastrigin

BBOB Functions

Random Offsets, Init, Eval Noise & Inner Loop

S + E + U (Sample + Evaluate + Update)

Popsize: 16

# Gens: 50

Separable

Moderate Condition

High Condition

Multi-Modal (Global Structure)

Multi-Modal (Weak Structure)

Subset of Functions & #Problem Dims (2-10D)


17 of 43

Discovering LES: Meta-Training on Low-D BBOB

👉 Generalization across population size/dims.

👉 Generalization to unseen problems

18 of 43

Evaluating LES: Brax Control Tasks

19 of 43

BBOB Meta-Trained LES Generalizes to Control Tasks

👉 LES meta-trained on simple functions yields SotA neuroevolution ES

👉 Extreme generalization across time horizon, population size & problems

20 of 43

Scaling Meta-Distributions Improves LES Discovery

21 of 43

What Has The Learned Evolution Strategy Discovered?

👉 Compressed weight rule into closed-form

👉 Tuning: competitive white-box ES

👉 Recombination weights contribute most

👉 Can switch between hill-climbing & FD

22 of 43

LES & the evosax API: ask-evaluate-tell

import jax
from evosax import LES

# Initialize: instantiate the search strategy
rng = jax.random.PRNGKey(0)
strategy = LES(popsize=20, num_dims=2)
state = strategy.initialize(rng)

# Run ask-eval-tell loop: by default minimization!
num_generations = 100
for t in range(num_generations):
    rng, rng_ask, rng_eval = jax.random.split(rng, 3)
    # Ask: sample a new population of candidate solutions
    x, state = strategy.ask(rng_ask, state)
    # Evaluate: your population evaluation fct
    fitness = ...
    # Tell: update the search distribution
    state = strategy.tell(x, fitness, state)

# Get best overall population member & its fitness
state.best_member, state.best_fitness

23 of 43

Self-Referential Meta-Evolution of Learned ES

Fitness properties

Meta-learned self-attention

Sample & evaluate

Search dist. update

Inner Loop: LES

Sample tasks

Run search

Normalize & meta-update

Sample LES parameters

Outer Loop: MetaBBO

Sample tasks

Run search

Meta-improve:

Copy best & set mean/std

Sample LES

Meta-update

Outer Loop: LES

24 of 43

🧬 How does a Genetic Algorithm work?

Mutate → Evaluate → Select → Iterate
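The GA loop can be sketched in the same minimal style as the ES loop earlier: truncation selection with elitism plus Gaussian mutation. Function name and hyperparameters are illustrative assumptions, not the LGA:

```python
import numpy as np

def simple_ga(fn, num_dims, popsize=32, elite_frac=0.25, sigma=0.1,
              generations=200, seed=0):
    """Minimal GA: evaluate -> select (truncation + elitism) -> mutate -> iterate."""
    rng = np.random.default_rng(seed)
    population = rng.standard_normal((popsize, num_dims))
    num_elite = max(1, int(popsize * elite_frac))
    for _ in range(generations):
        # Evaluate: fitness of each member (minimization)
        fitness = np.array([fn(x) for x in population])
        # Select: keep the elite members unchanged (elitism)
        elite = population[np.argsort(fitness)[:num_elite]]
        # Mutate: children are Gaussian-perturbed copies of random elites
        parents = elite[rng.integers(num_elite, size=popsize - num_elite)]
        children = parents + sigma * rng.standard_normal(parents.shape)
        population = np.vstack([elite, children])
    fitness = np.array([fn(x) for x in population])
    return population[fitness.argmin()]

# Usage: minimize the 5D sphere function
best = simple_ga(lambda x: np.sum(x ** 2), num_dims=5)
```

The hand-designed choices here (fixed elite fraction, fixed mutation rate sigma) are precisely what LGA learns to adapt via attention.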

25 of 43

🧬 Learned Genetic Algorithms (LGA) 👉 🦎(🧬)

Sample meta-train tasks

Run search

Normalize & meta-update

Sample LGA parameters

Outer: MetaBBO

Evaluate children population

Inner: Learned GA

Learned mutation rate adaptation via self-attention

Learned selection via cross-attention

26 of 43

LGA Generalizes to HPO-B & Neuroevolution Tasks

27 of 43

LGA Applies Adaptive Elitism & MR Adaptation

👉 Adaptive elite size & mutation rate adaptation

👉 Transfer of NN GA operators

28 of 43

Summary: Discovering LES & LGA via MetaBBO

Learned ES Architecture

Self-Ref. MetaBBO

Learned GA Architecture

Meta-Generalization

Reverse Engineering

29 of 43

On Survivorship Bias & The Hardware Lottery (Hooker, '21)


30 of 43

Recap, pointers & the future of learned-evo 🚀

[Figure: excitement (🔥/slide) over time ⌛ - '👇 You are here 🚀']

👉 github.com/RobertTLange/evosax

@GECCO 2023 🦎

31 of 43

AGI

32 of 43

P.S.: Also checkout gymnax 🏋️- JAX-Based RL Envs

👉 github.com/RobertTLange/gymnax

33 of 43

Supplementary Slides

34 of 43

Discovery via Meta-Gradients? The Good, the Bad & Ugly

Meta-Gradients ∇

Optimization Tricks ∇∇

Truncation

Meta-Adam ε

∇∇ Clipping

∇ Propagation through ∇ Updates [Online Cross-Val; Sutton, 1992]


Flennerhag et al. (2022)

[Bootstrapped MGs]

Zenke & Ganguli (2018) [Surrogate ∇]

Cons: Long Horizons, Non-Diff. Objectives & Operations

35 of 43

Towards a Fully-Black-Box Task-Dependent Hybrid BBO

Multi-Modal Regime

Single-Modal Regime

Genetic Algorithm

Evolution Strategy

Task Space

LES/LGA

36 of 43

When might you have to do Black-Box Optimization?

Sparse Rewards/Local Optima

Hyperparameter Optimization

Non-Differentiable Operations

Chaos: Exploding Gradients [Metz et al., 2019] - 2D Meta-Loss Slice

37 of 43

Evol. Optimization, Set Operations & Self-Attention

Meta-Evolving Evolution 🦎🦎

Inductive Biases in ML

'Black-Box' Neural Network

Convolutions → Translation Invariance

RNNs & Gating → Sequential Data

Self-Attention → Permutation Invariance
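Why self-attention is the right inductive bias here: a population is a set, so shuffling its members must not change the update rule. A tiny sketch demonstrating this property (single attention head with identity projections, purely illustrative):

```python
import numpy as np

def self_attention(X):
    """Single-head self-attention with identity Q/K/V projections."""
    scores = X @ X.T / np.sqrt(X.shape[1])
    # Row-wise softmax over attention scores
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ X

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 4))  # Population of 5 members, 4 features each
perm = rng.permutation(5)

out = self_attention(X)
out_perm = self_attention(X[perm])
```

Permuting the population permutes the outputs identically (equivariance), and any symmetric pooling on top, e.g. a mean, yields a fully permutation-invariant summary, which is exactly what a learned ES/GA update needs.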

38 of 43

What Has The Learned Evolution Strategy Discovered?

👉 Recombination weights contribute most

👉 Can switch between hill-climbing & FD

39 of 43

Reverse-Engineering Discovered LES Mechanisms

👉 Can compress weight rule into closed-form equation

👉 Correct tuning yields competitive white-box ES

40 of 43

P.S.: Can LLMs already do Black-Box Optimization?

(not quite yet?) 😋

41 of 43

Detailed Results: Neuroevolution - Brax - Radar

Brax Performance

Test Accuracy

Small Budget

Medium Budget

Large Budget

Confidential - DeepMind

42 of 43

Broad Meta-Distributions Improve LES Discovery?

+


43 of 43

Attention-Based Parametrization of Selection & MRA

Select Q/K

Adaptation

Featurize

Self-Attention

Adapt A

Mult. Adapt

Selection

Featurize

Cross-Attention

Select Matrix