Trustworthy Machine Learning
Adam Gleave, 2023-04-27
Prior Work: Attacking Image Classifiers
Goodfellow et al (2015)
[Figure: FGSM attack. x ("panda", 57.7% confidence) + ε · sign(∇x J(θ, x, y)) = "gibbon", 99.3% confidence.]
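In symbols, the attack above is the fast gradient sign method: x_adv = x + ε · sign(∇x J(θ, x, y)). A minimal sketch of that step in PyTorch (the model, loss function, and ε below are placeholders, not necessarily the exact settings from Goodfellow et al):

```python
import torch

def fgsm_attack(model, loss_fn, x, y, epsilon=0.007):
    """Fast gradient sign method: x_adv = x + eps * sign(grad_x J(theta, x, y))."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x_adv), y)      # J(theta, x, y)
    loss.backward()                      # populates x_adv.grad
    with torch.no_grad():
        x_adv = x_adv + epsilon * x_adv.grad.sign()
        x_adv = x_adv.clamp(0.0, 1.0)    # keep a valid image
    return x_adv.detach()
```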
"Adversarial examples are inputs to machine learning models that an attacker has intentionally designed to cause the model to make a mistake"
- Ian Goodfellow (2017)
Adversarial Rotations
"revolver"
"mousetrap"
"vulture"
"orangutan"
Engstrom et al (2018)
Unrestricted adversarial examples: bird or bike?
Brown et al (2018)
Adversarial Policies:
Attacking Deep Reinforcement Learning
Adam Gleave, Michael Dennis, Cody Wild, Neel Kant, Sergey Levine, Stuart Russell
Prior Work: Attacking RL Policies
Huang et al (2017); Kos & Song (2017)
[Figure: FGSM applied to an RL policy's observation. original input (action: down) + ε · sign(∇x J(θ, x, y)) = adversarial input (action: no-op).]
Prior Work: Train Time
Environment
Victim
Observation & Reward
Action
Prior Work: Attack Time
Environment
Victim
Original Observation
Action
Perturbed Observation
Adversary
Our Threat Model: Victim Train Time
Opponent
Environment
Victim
Observation
Action
Observation
Action
Our Threat Model: Attack Time
The Adversary takes the role of an Opponent
Adversary
Environment
Victim
Observation
Action
Observation
Action
Our Threat Model: Attack Train Time
Embedded
Environment
Action
Observation & Reward
Adversary
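In this threat model the adversary never perturbs the victim's observations: it is simply another agent trained with standard RL while the victim's policy is frozen and treated as part of the environment. A rough sketch of that setup (the environment interface and the commented PPO call are illustrative stand-ins, not the paper's actual code):

```python
class EmbeddedVictimEnv:
    """Two-player environment with a frozen victim folded in: from the
    adversary's perspective this looks like an ordinary single-agent env."""

    def __init__(self, multiagent_env, victim_policy):
        self.env = multiagent_env
        self.victim = victim_policy        # weights frozen; never updated

    def reset(self):
        obs_adv, self._obs_victim = self.env.reset()
        return obs_adv

    def step(self, adversary_action):
        victim_action = self.victim.act(self._obs_victim)   # no learning here
        (obs_adv, self._obs_victim), (rew_adv, _), done, info = \
            self.env.step(adversary_action, victim_action)
        return obs_adv, rew_adv, done, info

# Training sketch: any standard RL algorithm (the paper uses PPO) can then be
# run on the wrapped environment; only the adversary's parameters are updated.
# trainer = PPO(policy=adversary_policy, env=EmbeddedVictimEnv(env, victim))
# trainer.learn(total_timesteps=20_000_000)
```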
Environments
Kick & Defend
Sumo Humans
Kicker: score a goal.
Goalie: block a goal.
Wrestler 1 & Wrestler 2: knock opponent out.
Bansal et al (2018): Emergent Complexity via Multi-Agent Competition
You Shall Not Pass
Runner: cross finish line.
Blocker: stop Runner.
You Shall Not Pass: Normal (47% win rate)
Victim runner is playing blocker opponent
You Shall Not Pass: Adversary (86% win rate)
Victim runner is playing blocker opponent
Kick and Defend: Normal (80% win rate)
Victim kicker is playing goalie opponent
Kick and Defend: Adversary (93% win rate)
Victim kicker is playing goalie opponent
Sumo Humans: Normal (71% win rate)
Victim wrestler is playing wrestler opponent
Sumo Humans: Adversary (63% win rate)
Victim wrestler is playing wrestler opponent
Why do the adversaries win?
Density Modeling
t-SNE
Density Modeling
Opponent
Environment
Victim
Observation
Action
Observation
Action
Density Modeling
t-SNE: Kick and Defend, victim 2
t-SNE: Sumo Humans, victim 2
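One way to make this analysis concrete: fit a density model (e.g. a Gaussian mixture) to the victim's activations collected against normal opponents, compare the likelihood it assigns to activations induced by the adversary, and embed everything in 2-D for the t-SNE plots. A sketch with scikit-learn, assuming the activations have already been collected as arrays (the component count and other settings are illustrative, not the paper's exact configuration):

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.manifold import TSNE

def density_analysis(activations_normal, activations_adv, n_components=20):
    # Fit the density model on activations from play against normal opponents.
    gmm = GaussianMixture(n_components=n_components, covariance_type="full")
    gmm.fit(activations_normal)

    # Lower log-likelihood => the adversary induces off-distribution activations.
    print("mean log-likelihood, normal opponent:", gmm.score(activations_normal))
    print("mean log-likelihood, adversary:      ", gmm.score(activations_adv))

    # 2-D embedding of all activations for a t-SNE style plot.
    all_acts = np.concatenate([activations_normal, activations_adv])
    return TSNE(n_components=2).fit_transform(all_acts)
```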
Defence: Masked Victims
Adversary
Environment
Masked Victim
Observation
excl. adversary
Action
Observation
Action
Defence: Masked Victims
Environment
Self observation (position, velocity, contact)
Masked Victim
Opponent position
Static value
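Concretely, the masking can be implemented as an observation wrapper that overwrites the opponent-position entries with a constant before the victim's policy acts. A minimal sketch, assuming a flat observation vector in which the opponent-position indices are known (the slice below is hypothetical):

```python
import numpy as np

OPPONENT_POS_SLICE = slice(24, 36)   # hypothetical indices of opponent position

def mask_observation(obs, static_value=0.0):
    """Replace opponent-position entries with a constant, leaving
    self-observation (position, velocity, contact) untouched."""
    masked = np.array(obs, dtype=float, copy=True)
    masked[OPPONENT_POS_SLICE] = static_value
    return masked

# Usage: action = victim_policy.act(mask_observation(raw_obs))
```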
You Shall Not Pass: Adversarial Opponent
Adversarial Opponent (1%)
Masked Victim (99%)
Adversarial Opponent (86%)
Normal Victim (14%)
Victim runner is playing blocker opponent
You Shall Not Pass: Normal Opponent
Normal Opponent (78%)
Masked Victim (22%)
Normal Opponent (48%)
Normal Victim (52%)
Victim runner is playing blocker opponent
Why?
Dimensionality
See paper!
Masked Victims
Victim Activations
Defence: Adversarial Training
Embedded
Environment
Action
Observation & Reward
Hardened Victim
You Shall Not Pass: Retrained Victim
Adversarial Opponent (11% win rate)
Hardened Victim (89% win rate)
Adversarial Opponent (86% win rate)
Normal Victim (14% win rate)
Victim runner is playing blocker opponent
You Shall Not Pass: Retrained Adversary
New Adversary (88% win rate)
Hardened Victim (12% win rate)
New Adversary (76% win rate)
Normal Victim (24% win rate)
Victim runner is playing blocker opponent
Takeaways
Threat Model
Attack Analysis
Black-box Attack
Defense
Competing Explanations
Victim Policy Is Stupid ("dunce")
vs.
Adversarial Attacks Are Strong
Will transformative AI systems be vulnerable to adversarial policies?
Adversarial Policies in Narrowly Superhuman AI
Tony Wang*, Adam Gleave*, Nora Belrose, Tom Tseng, Michael Dennis, Yawen Duan, Viktor Pogrebniak, Sergey Levine, Stuart Russell
Almost all deep-learning-based AI models have adversarial examples…
Image models
Audio models
Language models
Superhuman game-playing models?
Why test superhuman game playing models?
We wanted to test two hypotheses:
Are superhuman game-playing models vulnerable to adversarial examples?
KataGo ≫ AlphaGo: ~1000 Elo stronger¹ than the AlphaGo that beat Lee Sedol (i.e. >99% win rate).
¹ with 1000 nodes / move of search.
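The ">99% win rate" figure follows from the standard Elo model, in which a rating gap of D points corresponds to an expected score of 1 / (1 + 10^(−D/400)); for D ≈ 1000 that is roughly 99.7%. A quick check:

```python
def elo_win_prob(delta_elo):
    """Expected score under the standard (logistic) Elo model."""
    return 1.0 / (1.0 + 10 ** (-delta_elo / 400.0))

print(elo_win_prob(1000))   # ~0.997, i.e. >99% expected win rate
```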
How KataGo’s strength varies with search
1 node / move ≈ top Go player in Europe
10³ nodes / move ≈ superhuman
10⁶–10⁷ nodes / move ≈ strongly superhuman (tournament strength)
Is KataGo vulnerable to adversarial examples?
An easy puzzle that shows KataGo has serious flaws.
KataGo only wins 40% of the time as black…
KataGo is vulnerable to adversarial board states.
But such board states never show up in real games…
Can KataGo be tricked starting from an empty board?
YES!
KataGo is stronger than LeelaZero, and probably AlphaZero too!
“Unified Elo rating for AIs” by SKHD13 (2020). https://www.reddit.com/r/baduk/comments/hma3nx/unified_elo_rating_for_ais/
KataGo is superhuman
Player | Elo (goratings.org) |
KataGo (1024 visits) | 4629 |
Shin Jinseo (world #1) | 3832 |
KataGo (64 visits) | 3563 |
Xu Jiayang (world #20) | 3559 |
Kin En (world #860) | 2501 |
KataGo (no search) | 2500 |
Our Attack
How our attack works
Adversary architecture
How our attack works
MCTS
DNN
Expand the tree one leaf at a time.
To add a leaf, walk down the tree with DNN guidance.
A-MCTS
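A-MCTS keeps the frozen victim inside the search: at nodes where it is the victim's turn, the move is taken from the victim's own network rather than chosen by the adversary's tree policy. A rough sketch of one descent under that assumption (the node and network interfaces are simplified stand-ins, not KataGo's actual implementation):

```python
import math

def select_child(node, c_puct=1.5):
    """PUCT selection over the adversary's own moves."""
    def puct(child):
        q = child.total_value / child.visits if child.visits else 0.0
        u = c_puct * child.prior * math.sqrt(node.visits) / (1 + child.visits)
        return q + u
    return max(node.children, key=puct)

def walk_down(root, adversary_net, victim_net):
    """Walk from the root to a leaf, expanding one new node (A-MCTS sketch)."""
    node = root
    while node.children:
        if node.victim_to_move:
            # Model the victim directly: follow the frozen victim network's
            # predicted move instead of searching over the victim's options.
            move = victim_net.best_move(node.state)
            node = node.child_for_move(move)
        else:
            node = select_child(node)
    # Expand the leaf with the adversary network's policy prior and value.
    node.expand(adversary_net.policy(node.state))
    return node, adversary_net.value(node.state)
```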
How our attack works
Training procedure
Adversary architecture
Success Metrics
Key metrics:
Auxiliary criteria:
Adversary: The passer
Adversary (B) defeats no-search KataGo (W)
> 99% win-rate against KataGo w/o search
Trained for 0.3% as many SGD steps as KataGo
Adversary: The passer
Pass-alive defense
Adversary: The cyclic-exploit
Adversary: The cyclic-exploit
Adversary (W) vs KataGo (B)
(1)
(2)
(3)
(4)
Non-Transitivity
Tony Wang (B) vs Cycler (W)
How reliable is the Cyclic-Exploit?
1 node / move ≈ top Go player in Europe
10³ nodes / move ≈ superhuman
10⁶–10⁷ nodes / move ≈ tournament settings
Our adversary AI (600 nodes / move) wins:
vs. KataGo (no search) ⇒ 1000 / 1000 games won.
vs. KataGo (2048 nodes / move) ⇒ 973 / 1000 games won.
vs. KataGo (10⁷ nodes / move) ⇒ 36 / 50 games won (72%).
Kellin (resident Go-expert on our team) wins:
vs. JBXKata005 (9-dan KataGo bot on KGS) ⇒ 14 / 15 games won.
We didn't check every game, but all wins appear to be via the cyclic exploit; the adversary AI loses to very weak human players (i.e. your speaker).
...but search does help
Success Metrics
Adversary | Victim Visits | Adversary Win Rate (%) | Adversary Efficiency (% of victim training timesteps) | Non-Transitivity | Bizarreness |
Passer | 8 | 88% | <0.3% | ✅ | ✅ |
Cycler | 10,000,000 | 72% | <5% | ✅ | ✅ |
The Cyclic-Exploit transfers!
We developed the exploit for KataGo.
But it works against all other strong neural-network based Go AIs!
Algorithmic Transfer
Victim Name | Victim Visits | Adversary Win Rate (%) |
Leela Zero | 40,000 | 6.1% |
ELF OpenGo | 80,000 | 3.5% |
Human Transfer
Adversarial examples tell us…
Deep learning systems (even if very capable) diverge from human cognition / expectation under strong optimization pressure.
People want to deploy AI in high-stakes scenarios�(where often there is large optimization pressure)
Many plans for building "aligned" AI systems involve optimizing a student AI against a teacher AI.
DANGER
Technical problem
Governance problem
Implications for alignment
More info: AI Safety in a World of Vulnerable Machine Learning Systems
Future work: Interpretability
Understand why KataGo (and most strong Go AIs) misjudge cyclic groups.
Preliminary results on interpretability
Future work: Adversarial training
Develop improved algorithms to make KataGo robust.
Preliminary results on adversarial training
After re-attacking ★, we get back up to an 83.5% win rate against a victim with 1600 visits.
Future work: scaling laws
How do strength and exploitability scale as a function of training?
More information
Robotics exploit: adversarialpolicies.github.io
KataGo exploit: goattack.far.ai
About me/my work: gleave.me, Twitter @ARGleave
FAR: job openings at far.ai/jobs; contact hello@far.ai