Adversarial Policies:
Attacking Deep Reinforcement Learning
Adam Gleave, Michael Dennis, Cody Wild, Neel Kant, Sergey Levine, Stuart Russell
Adversarial Examples as 𝓁p-norm perturbations
Goodfellow et al (2015)
[Figure (Goodfellow et al, 2015): "panda" (57.7% confidence) + 𝜖 · sign(∇x J(θ, x, y)) → classified as "gibbon" (99.3% confidence)]
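A minimal sketch of this fast gradient sign attack in PyTorch (the model, inputs, and 𝜖 below are placeholders, not the original experiment):

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, epsilon=0.007):
    """Fast gradient sign method: x_adv = x + eps * sign(grad_x J(theta, x, y))."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)    # J(theta, x, y)
    loss.backward()
    x_adv = x + epsilon * x.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()  # keep pixel values in a valid range
```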
"Adversarial examples are inputs to machine learning models that an attacker has intentionally designed to cause the model to make a mistake"
- Ian Goodfellow (2017)
Adversarial Rotations
"revolver"
"mousetrap"
"vulture"
"orangutan"
Engstrom et al (2018)
Unrestricted adversarial examples: bird or bike?
Brown et al (2018)
Why Adversarial Examples Matter for Safety
Normal story: adversaries outside your system.
Other reasons:
𝓁p-norm perturbations in RL
Huang et al (2017); Kos & Song (2017)
[Figure: original input → action: down; original input + 𝜖 · sign(∇x J(θ, x, y)) → adversarial input → action: no-op]
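The same construction carries over to a policy: perturb the observation so the policy's chosen action flips. A sketch assuming a discrete-action policy network `policy_net` that maps observations to action logits (the name and 𝜖 are illustrative):

```python
import torch
import torch.nn.functional as F

def fgsm_observation(policy_net, obs, epsilon=0.01):
    """Perturb an observation to push the policy away from its preferred action."""
    obs = obs.clone().detach().requires_grad_(True)
    logits = policy_net(obs)
    action = logits.argmax(dim=-1)              # original action, e.g. "down"
    loss = F.cross_entropy(logits, action)      # loss of the currently chosen action
    loss.backward()
    adv_obs = obs + epsilon * obs.grad.sign()   # adversarial input
    return adv_obs.detach()                     # policy may now pick e.g. "no-op"
```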
𝓁p-norm perturbations in RL: train time
[Diagram (train time): the Victim acts in the Environment, receiving an Observation & Reward and taking an Action]
𝓁p-norm perturbations in RL: attack time
[Diagram (attack time): the Adversary intercepts the Original Observation from the Environment and feeds a Perturbed Observation to the Victim, which then takes an Action]
Our threat model: victim train time
[Diagram: the Victim and an Opponent both interact with the Environment, each receiving an Observation and taking an Action]
Our threat model: attack time
The Adversary takes the role of an Opponent
[Diagram: same setup, but the Adversary takes the Opponent's place, interacting with the Victim only through the Environment]
Our threat model: attack train time
[Diagram: the fixed Victim is embedded into the Environment; the Adversary acts in the resulting single-agent environment, receiving an Observation & Reward and taking an Action]
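One way to realise this in code is to freeze the victim and fold it into the environment, giving the adversary an ordinary single-agent RL problem. A sketch assuming a hypothetical two-player environment `multi_env` whose per-player observation/action spaces are exposed as tuples, and a frozen `victim_policy` callable:

```python
import gym

class EmbedVictim(gym.Env):
    """Wrap a two-player environment so a frozen victim becomes part of it."""

    def __init__(self, multi_env, victim_policy):
        self.env = multi_env            # hypothetical two-player env: (adversary, victim)
        self.victim = victim_policy     # frozen policy, never updated
        self.observation_space = multi_env.observation_space[0]
        self.action_space = multi_env.action_space[0]
        self._victim_obs = None

    def reset(self):
        adv_obs, self._victim_obs = self.env.reset()
        return adv_obs

    def step(self, adversary_action):
        victim_action = self.victim(self._victim_obs)
        (adv_obs, self._victim_obs), (adv_reward, _), done, info = self.env.step(
            (adversary_action, victim_action)
        )
        return adv_obs, adv_reward, done, info
```

The adversary never touches the victim's weights or gradients, only its behaviour through the environment: a black-box attack.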
Environments
Kick & Defend. Kicker: score a goal. Goalie: block a goal.
Sumo Humans. Wrestler 1 & Wrestler 2: knock opponent out.
You Shall Not Pass. Runner: cross finish line. Blocker: stop Runner.
Bansal et al (2018): Emergent Complexity via Multi-Agent Competition
Results against Median Victim
You Shall Not Pass: Normal opponent (47% win rate)
The victim runner plays against a normal blocker opponent.
You Shall Not Pass: Adversary (86% win rate)
The victim runner plays against the adversarial blocker opponent.
Kick and Defend: Normal opponent (80% win rate)
The victim kicker plays against a normal goalie opponent.
Kick and Defend: Adversary (93% win rate)
The victim kicker plays against the adversarial goalie opponent.
Sumo Humans: Normal opponent (71% win rate)
The victim wrestler plays against a normal wrestler opponent.
Sumo Humans: Adversary (63% win rate)
The victim wrestler plays against the adversarial wrestler opponent.
Why do the adversaries win?
Density Modeling
t-SNE
Density Modeling
[Diagram: the Victim and Opponent interact through the Environment as before; the Victim's policy activations are collected for density modelling]
Density Modeling
t-SNE: Kick and Defend, victim 2
t-SNE: Sumo Humans, victim 2
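A sketch of how this analysis can be run, assuming we have recorded the victim's policy-network activations when facing normal and adversarial opponents (the array names and GMM settings are assumptions):

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.manifold import TSNE

def analyse_activations(acts_normal, acts_adversary):
    """acts_*: (timesteps, hidden_dim) arrays of victim activations."""
    # Density model: fit on activations induced by normal opponents,
    # then compare average log-likelihood on both sets.
    gmm = GaussianMixture(n_components=20)
    gmm.fit(acts_normal)
    print("log-likelihood vs normal opponent:", gmm.score(acts_normal))
    print("log-likelihood vs adversary:      ", gmm.score(acts_adversary))

    # t-SNE: embed both sets jointly into 2D for visual comparison.
    joint = np.concatenate([acts_normal, acts_adversary])
    return TSNE(n_components=2).fit_transform(joint)
```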
Defence: Masked Victims
[Diagram: the Adversary and the Masked Victim interact through the Environment; the Masked Victim's observation excludes the adversary]
Defence: Masked Victims
[Diagram: the Masked Victim still receives its self observation (position, velocity, contact) from the Environment, but the opponent position is replaced by a static value]
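A sketch of the masking as an observation wrapper; the indices of the opponent-position entries (`OPPONENT_POS`) are a hypothetical placeholder, not the actual layout of the MuJoCo observation vector:

```python
import numpy as np
import gym

OPPONENT_POS = slice(24, 31)   # hypothetical indices of the opponent position

class MaskOpponentPosition(gym.ObservationWrapper):
    """Replace the opponent-position part of the observation with a static value."""

    def __init__(self, env, static_value=0.0):
        super().__init__(env)
        self.static_value = static_value

    def observation(self, obs):
        obs = np.array(obs, copy=True)
        obs[OPPONENT_POS] = self.static_value   # the victim can no longer see the opponent
        return obs
```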
You Shall Not Pass: Adversarial Opponent
Adversarial Opponent (1%) vs. Masked Victim (99%)
Adversarial Opponent (86%) vs. Normal Victim (14%)
The victim runner plays against the blocker opponent.
You Shall Not Pass: Normal Opponent
Normal Opponent (78%) vs. Masked Victim (22%)
Normal Opponent (48%) vs. Normal Victim (52%)
The victim runner plays against the blocker opponent.
Defence: Adversarial Training
[Diagram: the adversary is now embedded into the Environment; the Hardened Victim is retrained against it, receiving an Observation & Reward and taking an Action]
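A sketch of the retraining step under the same assumptions as before: the (now frozen) adversary is embedded into the environment and the victim is fine-tuned against it. Stable-Baselines3 PPO, the helper `make_embedded_env`, and the checkpoint names are stand-ins, not the original training code:

```python
from stable_baselines3 import PPO

# Roles are swapped relative to the attack: the adversary is frozen inside the
# environment and the victim is the learning agent.
env = make_embedded_env(frozen_opponent="adversarial_policy")  # hypothetical helper

hardened_victim = PPO("MlpPolicy", env, verbose=1)
hardened_victim.set_parameters("normal_victim.zip")   # start from the original victim
hardened_victim.learn(total_timesteps=20_000_000)     # fine-tune against the adversary
hardened_victim.save("hardened_victim")
```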
You Shall Not Pass: Retrained Victim
Adversarial Opponent (11% win rate) vs. Hardened Victim (89% win rate)
Adversarial Opponent (86% win rate) vs. Normal Victim (14% win rate)
The victim runner plays against the blocker opponent.
You Shall Not Pass: Retrained Adversary
New Adversary (88% win rate) vs. Hardened Victim (12% win rate)
New Adversary (76% win rate) vs. Normal Victim (24% win rate)
The victim runner plays against the blocker opponent.
Takeaways
Threat Model
Black-box Attack
Defense
Attack Analysis
Can we do better than this?
[Diagram (as before): the fixed Victim is embedded into the Environment; the Adversary receives an Observation & Reward and takes an Action]
Adversary is trained with standard PPO
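For example, using the `EmbedVictim` wrapper sketched earlier with Stable-Baselines3 (a stand-in for whichever PPO implementation was actually used; the environment constructor and checkpoint names are hypothetical):

```python
from stable_baselines3 import PPO

# Victim frozen inside the environment; adversary trained as a normal RL agent.
env = EmbedVictim(make_two_player_env("YouShallNotPass"), load_victim("victim_v1"))

adversary = PPO("MlpPolicy", env, verbose=1)
adversary.learn(total_timesteps=20_000_000)   # no access to the victim's weights or gradients
adversary.save("adversarial_policy")
```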
Wu et al (USENIX 2021): New Attack
Introduces an auxiliary loss L_ad, where o_v(θ) and a_v(θ) are the observation and actions taken by the victim if the adversary has policy parameters θ, and ō_v and ā_v are those under the current policy parameters.
Optimizes a linear combination of L_ad and the standard PPO loss.
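A rough, purely illustrative sketch of such a combined objective. The deviation measure (L2), the sign, and the weight λ below are assumptions; they are not Wu et al's exact definition of L_ad:

```python
import torch

def combined_objective(ppo_loss, a_v_theta, a_v_bar, o_v_theta, o_v_bar, lam=1.0):
    """Linear combination of the standard PPO loss with an auxiliary deviation term.

    a_v_theta, o_v_theta: victim actions/observations if the adversary has parameters θ.
    a_v_bar,   o_v_bar:   the same quantities under the current parameters.
    The L2 deviation and the minus sign (encouraging larger deviation) are
    illustrative assumptions, not the paper's exact loss.
    """
    l_ad = (torch.norm(a_v_theta - a_v_bar, dim=-1).mean()
            + torch.norm(o_v_theta - o_v_bar, dim=-1).mean())
    return ppo_loss - lam * l_ad
```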
Wu et al (USENIX 2021): Roboschool Pong
The victim (left) plays against the adversary (right).
Wu et al (USENIX 2021): Robo Pong Training Curves
Why is self-play vulnerable?
Balduzzi et al (2019): Open-ended learning in symmetric zero-sum games
[Diagram: a non-transitive cycle. The masked victim beats the adversarial opponent, which beats the normal victim, which beats the self-play opponent, which beats the masked victim.]
Czarnecki et al (2020): Real World Games Look Like Spinning Tops
Future Work: Attack State-of-the-Art Systems
Future Work: Attack in Realistic Tasks
Volodin et al (2020): Defending against Adversarial Policies
Limitations
Future Work
Thanks!
Paper: bit.ly/AdversarialPolicies
Website: adversarialpolicies.github.io
Adam Gleave
Michael Dennis
Cody Wild
Sergey Levine
Stuart Russell
Neel Kant
Heatmaps
[Figures: heatmaps for Kick & Defend, You Shall Not Pass, and Sumo Humans]