Trustworthy Machine Learning
Adam Gleave, 2023-04-27
Prior Work: Attacking Image Classifiers
Goodfellow et al (2015)
[Figure: FGSM attack. x ("panda", 57.7% confidence) + ε · sign(∇x J(θ, x, y)) = "gibbon", 99.3% confidence.]
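In symbols, the attack above is the fast gradient sign method: x_adv = x + ε · sign(∇x J(θ, x, y)). A minimal sketch of that step in PyTorch (the model, loss function, and ε below are placeholders, not necessarily the exact settings from Goodfellow et al):

```python
import torch

def fgsm_attack(model, loss_fn, x, y, epsilon=0.007):
    """Fast gradient sign method: x_adv = x + eps * sign(grad_x J(theta, x, y))."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x_adv), y)      # J(theta, x, y)
    loss.backward()                      # populates x_adv.grad
    with torch.no_grad():
        x_adv = x_adv + epsilon * x_adv.grad.sign()
        x_adv = x_adv.clamp(0.0, 1.0)    # keep a valid image
    return x_adv.detach()
```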
"Adversarial examples are inputs to machine learning models that an attacker has intentionally designed to cause the model to make a mistake"
- Ian Goodfellow (2017)
Adversarial Rotations
"revolver"
"mousetrap"
"vulture"
"orangutan"
Engstrom et al (2018)
Unrestricted adversarial examples: bird or bike?
Brown et al (2018)
Adversarial Policies:
Attacking Deep Reinforcement Learning
Adam Gleave, Michael Dennis, Cody Wild, Neel Kant, Sergey Levine, Stuart Russell
Prior Work: Attacking RL Policies
Huang et al (2017); Kos & Song (2017)
[Figure: FGSM applied to an RL policy's observation. original input (action: down) + ε · sign(∇x J(θ, x, y)) = adversarial input (action: no-op).]
Prior Work: Train Time
Environment
Victim
Observation & Reward
Action
Prior Work: Attack Time
Environment
Victim
Original Observation
Action
Perturbed Observation
Adversary
Our Threat Model: Victim Train Time
Opponent
Environment
Victim
Observation
Action
Observation
Action
Our Threat Model: Attack Time
The Adversary takes the role of an Opponent
Adversary
Environment
Victim
Observation
Action
Observation
Action
Our Threat Model: Attack Train Time
Embedded
Environment
Action
Observation & Reward
Adversary
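In this threat model the adversary never perturbs the victim's observations: it is simply another agent trained with standard RL while the victim's policy is frozen and treated as part of the environment. A rough sketch of that setup (the environment interface and the commented PPO call are illustrative stand-ins, not the paper's actual code):

```python
class EmbeddedVictimEnv:
    """Two-player environment with a frozen victim folded in: from the
    adversary's perspective this looks like an ordinary single-agent env."""

    def __init__(self, multiagent_env, victim_policy):
        self.env = multiagent_env
        self.victim = victim_policy        # weights frozen; never updated

    def reset(self):
        obs_adv, self._obs_victim = self.env.reset()
        return obs_adv

    def step(self, adversary_action):
        victim_action = self.victim.act(self._obs_victim)   # no learning here
        (obs_adv, self._obs_victim), (rew_adv, _), done, info = \
            self.env.step(adversary_action, victim_action)
        return obs_adv, rew_adv, done, info

# Training sketch: any standard RL algorithm (the paper uses PPO) can then be
# run on the wrapped environment; only the adversary's parameters are updated.
# trainer = PPO(policy=adversary_policy, env=EmbeddedVictimEnv(env, victim))
# trainer.learn(total_timesteps=20_000_000)
```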
Environments
Kick & Defend
Sumo Humans
Kicker: score a goal.
Goalie: block a goal.
Wrestler 1 & Wrestler 2: knock opponent out.
Bansal et al (2018): Emergent Complexity via Multi-Agent Competition
You Shall Not Pass
Runner: cross finish line.
Blocker: stop Runner.
You Shall Not Pass: Normal (47% win rate)
Victim runner is playing blocker opponent
You Shall Not Pass: Adversary (86% win rate)
Victim runner is playing blocker opponent
Kick and Defend: Normal (80% win rate)
Victim kicker is playing goalie opponent
Kick and Defend: Adversary (93% win rate)
Victim kicker is playing goalie opponent
Sumo Humans: Normal (71% win rate)
Victim wrestler is playing wrestler opponent
Sumo Humans: Adversary (63% win rate)
Victim wrestler is playing wrestler opponent
Why do the adversaries win?
Density Modeling
t-SNE
Density Modeling
Opponent
Environment
Victim
Observation
Action
Observation
Action
Density Modeling
t-SNE: Kick and Defend, victim 2
t-SNE: Sumo Humans, victim 2
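One way to make this analysis concrete: fit a density model (e.g. a Gaussian mixture) to the victim's activations collected against normal opponents, compare the likelihood it assigns to activations induced by the adversary, and embed everything in 2-D for the t-SNE plots. A sketch with scikit-learn, assuming the activations have already been collected as arrays (the component count and other settings are illustrative, not the paper's exact configuration):

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.manifold import TSNE

def density_analysis(activations_normal, activations_adv, n_components=20):
    # Fit the density model on activations from play against normal opponents.
    gmm = GaussianMixture(n_components=n_components, covariance_type="full")
    gmm.fit(activations_normal)

    # Lower log-likelihood => the adversary induces off-distribution activations.
    print("mean log-likelihood, normal opponent:", gmm.score(activations_normal))
    print("mean log-likelihood, adversary:      ", gmm.score(activations_adv))

    # 2-D embedding of all activations for a t-SNE style plot.
    all_acts = np.concatenate([activations_normal, activations_adv])
    return TSNE(n_components=2).fit_transform(all_acts)
```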
Defence: Masked Victims
Adversary
Environment
Masked Victim
Observation
excl. adversary
Action
Observation
Action
Defence: Masked Victims
Environment
Self observation (position, velocity, contact)
Masked Victim
Opponent position
Static value
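Concretely, the masking can be implemented as an observation wrapper that overwrites the opponent-position entries with a constant before the victim's policy acts. A minimal sketch, assuming a flat observation vector in which the opponent-position indices are known (the slice below is hypothetical):

```python
import numpy as np

OPPONENT_POS_SLICE = slice(24, 36)   # hypothetical indices of opponent position

def mask_observation(obs, static_value=0.0):
    """Replace opponent-position entries with a constant, leaving
    self-observation (position, velocity, contact) untouched."""
    masked = np.array(obs, dtype=float, copy=True)
    masked[OPPONENT_POS_SLICE] = static_value
    return masked

# Usage: action = victim_policy.act(mask_observation(raw_obs))
```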
You Shall Not Pass: Adversarial Opponent
Adversarial Opponent (1%)
Masked Victim (99%)
Adversarial Opponent (86%)
Normal Victim (14%)
Victim runner is playing blocker opponent
You Shall Not Pass: Normal Opponent
Normal Opponent (78%)
Masked Victim (22%)
Normal Opponent (48%)
Normal Victim (52%)
Victim runner is playing blocker opponent
Why?
Dimensionality
See paper!
Masked Victims
Victim Activations
Defence: Adversarial Training
Embedded
Environment
Action
Observation & Reward
Hardened Victim
You Shall Not Pass: Retrained Victim
Adversarial Opponent (11% win rate)
Hardened Victim (89% win rate)
Adversarial Opponent (86% win rate)
Normal Victim (14% win rate)
Victim runner is playing blocker opponent
You Shall Not Pass: Retrained Adversary
New Adversary (88% win rate)
Hardened Victim (12% win rate)
New Adversary (76% win rate)
Normal Victim (24% win rate)
Victim runner is playing blocker opponent
Takeaways
Threat Model
Attack Analysis
Black-box Attack
Defense
Competing Explanations
Victim Policy Is Stupid ("dunce")
vs.
Adversarial Attacks Are Strong
Will transformative AI systems be vulnerable to adversarial policies?
Adversarial Policies in Narrowly Superhuman AI
Tony Wang*, Adam Gleave*, Nora Belrose, Tom Tseng, Michael Dennis, Yawen Duan, Viktor Pogrebniak, Sergey Levine, Stuart Russell
Almost all deep-learning-based AI models have adversarial examples…
Image models
Audio models
Language models
Superhuman game-playing models?
Why test superhuman game playing models?
We wanted to test two hypotheses:
Are superhuman game-playing models vulnerable to adversarial examples?
KataGo ≫ AlphaGo: ~1000 Elo stronger¹ than the AlphaGo that beat Lee Sedol (i.e. >99% win rate).
¹ with 1000 nodes / move of search.
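The ">99% win rate" figure follows from the standard Elo model, in which a rating gap of D points corresponds to an expected score of 1 / (1 + 10^(−D/400)); for D ≈ 1000 that is roughly 99.7%. A quick check:

```python
def elo_win_prob(delta_elo):
    """Expected score under the standard (logistic) Elo model."""
    return 1.0 / (1.0 + 10 ** (-delta_elo / 400.0))

print(elo_win_prob(1000))   # ~0.997, i.e. >99% expected win rate
```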
How KataGo’s strength varies with search
1 node / move ≈ top Go player in Europe
10³ nodes / move ≈ superhuman
10⁶–10⁷ nodes / move ≈ strongly superhuman (tournament strength)
Is KataGo vulnerable to adversarial examples?
An easy puzzle that shows KataGo has serious flaws.
KataGo only wins 40% of the time as black…
KataGo is vulnerable to adversarial board states.
But such board states never show up in real games…
Can KataGo be tricked starting from an empty board?
YES!
KataGo is stronger than LeelaZero, and probably AlphaZero too!
“Unified Elo rating for AIs” by SKHD13 (2020). https://www.reddit.com/r/baduk/comments/hma3nx/unified_elo_rating_for_ais/
KataGo is superhuman
Player | Elo (goratings.org) |
KataGo (1024 visits) | 4629 |
Shin Jinseo (world #1) | 3832 |
KataGo (64 visits) | 3563 |
Xu Jiayang (world #20) | 3559 |
Kin En (world #860) | 2501 |
KataGo (no search) | 2500 |
Our Attack
How our attack works
Adversary architecture
How our attack works
MCTS
DNN
Expand the tree one leaf at a time.
To add a leaf, walk down the tree with DNN guidance.
A-MCTS
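A-MCTS keeps the frozen victim inside the search: at nodes where it is the victim's turn, the move is taken from the victim's own network rather than chosen by the adversary's tree policy. A rough sketch of one descent under that assumption (the node and network interfaces are simplified stand-ins, not KataGo's actual implementation):

```python
import math

def select_child(node, c_puct=1.5):
    """PUCT selection over the adversary's own moves."""
    def puct(child):
        q = child.total_value / child.visits if child.visits else 0.0
        u = c_puct * child.prior * math.sqrt(node.visits) / (1 + child.visits)
        return q + u
    return max(node.children, key=puct)

def walk_down(root, adversary_net, victim_net):
    """Walk from the root to a leaf, expanding one new node (A-MCTS sketch)."""
    node = root
    while node.children:
        if node.victim_to_move:
            # Model the victim directly: follow the frozen victim network's
            # predicted move instead of searching over the victim's options.
            move = victim_net.best_move(node.state)
            node = node.child_for_move(move)
        else:
            node = select_child(node)
    # Expand the leaf with the adversary network's policy prior and value.
    node.expand(adversary_net.policy(node.state))
    return node, adversary_net.value(node.state)
```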
How our attack works
Training procedure
Adversary architecture
Success Metrics
Key metrics:
Auxiliary criteria:
Adversary: The passer
Adversary (B) defeats no-search KataGo (W)
> 99% win-rate against KataGo w/o search
Trained for 0.3% as many SGD steps as KataGo
Adversary: The passer
Pass-alive defense
Adversary: The cyclic-exploit
Adversary: The cyclic-exploit
Adversary (W) vs KataGo (B)
(1)
(2)
(3)
(4)
Non-Transitivity
Tony Wang (B) vs Cycler (W)
How reliable is the Cyclic-Exploit?
1 node / move ≈ top Go player in Europe
10³ nodes / move ≈ superhuman
10⁶–10⁷ nodes / move ≈ tournament settings
Our adversary AI (600 nodes / move) wins:
vs. KataGo (no search) ⇒ 1000 / 1000 games won.
vs. KataGo (2048 nodes / move) ⇒ 973 / 1000 games won.
vs. KataGo (10⁷ nodes / move) ⇒ 36 / 50 games won (72%).
Kellin (resident Go-expert on our team) wins:
vs. JBXKata005 (9-dan KataGo bot on KGS) ⇒ 14 / 15 games won.
We didn't check every game, but all wins appear to be via the cyclic exploit; the adversary AI loses to very weak human players (i.e. your speaker).
...but search does help
Success Metrics
Adversary | Victim Visits | Adversary Win Rate (%) | Adversary Efficiency (% of victim training timesteps) | Non-Transitivity | Bizarreness |
Passer | 8 | 88% | <0.3% | ✅ | ✅ |
Cycler | 10,000,000 | 72% | <5% | ✅ | ✅ |
The Cyclic-Exploit transfers!
We developed the exploit for KataGo.
But it works against all other strong neural-network based Go AIs!
Algorithmic Transfer
Victim Name | Victim Visits | Adversary Win Rate (%) |
Leela Zero | 40,000 | 6.1% |
ELF OpenGo | 80,000 | 3.5% |
Human Transfer
Adversarial examples tell us…
Deep learning systems (even if very capable) diverge from human cognition / expectation under strong optimization pressure.
People want to deploy AI in high-stakes scenarios�(where often there is large optimization pressure)
Many plans for building "aligned" AI systems involve optimizing a student AI against a teacher AI.
DANGER
Technical problem
Governance problem
Implications for alignment
More info: AI Safety in a World of Vulnerable Machine Learning Systems
Future work: Interpretability
Understand why KataGo (and most strong Go AIs) misjudge cyclic groups.
Preliminary results on interpretability
Future work: Adversarial training
Develop improved algorithms to make KataGo robust.
Preliminary results on adversarial training
After re-attacking ★, we get back up to an 83.5% win rate against a victim with 1600 visits.
Future work: scaling laws
How do strength and exploitability scale as a function of training?
More information
Robotics exploit: adversarialpolicies.github.io
KataGo exploit: goattack.far.ai
About me/my work: gleave.me, Twitter @ARGleave
FAR: job openings at far.ai/jobs; contact hello@far.ai