1 of 46

Adversarial Policies:

Attacking Deep Reinforcement Learning

Adam Gleave, Michael Dennis, Cody Wild, Neel Kant, Sergey Levine, Stuart Russell

2 of 46

Adversarial Examples as 𝓁p-norm perturbations

Goodfellow et al (2015)

"panda" (57.7% confidence) + 𝜖 · sign(∇x J(θ, x, y)) = "gibbon" (99.3% confidence)
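
A minimal sketch of this fast gradient sign method in PyTorch (the model, labels, and 𝜖 value below are illustrative placeholders, not the original experiment):

```python
import torch
import torch.nn.functional as F

def fgsm_perturb(model, x, y, eps=0.01):
    """Return x + eps * sign(grad_x J(theta, x, y)), as in Goodfellow et al (2015)."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)   # J(theta, x, y)
    loss.backward()
    x_adv = x + eps * x.grad.sign()       # one signed-gradient step
    return x_adv.clamp(0, 1).detach()     # keep pixels in a valid range
```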

3 of 46

"Adversarial examples are inputs to machine learning models that an attacker has intentionally designed to cause the model to make a mistake"

- Ian Goodfellow (2017)

4 of 46

Adversarial Rotations

"revolver"

"mousetrap"

"vulture"

"orangutan"

Engstrom et al (2018)

5 of 46

Unrestricted adversarial examples: bird or bike?

Brown et al (2018)

6 of 46

Why Adversarial Examples Matter for Safety

Normal story: adversaries outside your system.

Other reasons:

  1. Adversaries inside your system.
    1. An RL algorithm optimizing a reward model is an adversary for the reward model!
  2. Testing.
  3. Interpretability.
  4. Demandingness.

7 of 46

𝓁p-norm perturbations in RL

Huang et al (2017); Kos & Song (2017)

original input (action: down) + 𝜖 · sign(∇x J(θ, x, y)) = adversarial input (action: no-op)
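
A sketch of the same attack against a policy network: perturb the observation so the policy stops choosing its original action (the policy model, observation tensor, and 𝜖 are illustrative placeholders):

```python
import torch
import torch.nn.functional as F

def perturb_observation(policy, obs, eps=0.01):
    """FGSM-style observation perturbation for a discrete-action policy (cf. Huang et al, 2017)."""
    obs = obs.clone().detach().requires_grad_(True)
    logits = policy(obs)                      # action logits for the original observation
    a_orig = logits.argmax(dim=-1)            # action the victim would normally take
    loss = F.cross_entropy(logits, a_orig)    # increase the loss on that action
    loss.backward()
    return (obs + eps * obs.grad.sign()).clamp(0, 1).detach()
```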

8 of 46

𝓁p-norm perturbations in RL: train time

[Diagram: the Victim receives an Observation & Reward from the Environment and returns an Action (standard RL loop)]

9 of 46

𝓁p-norm perturbations in RL: attack time

[Diagram: at attack time, an Adversary intercepts the Original Observation from the Environment and passes a Perturbed Observation to the Victim, which returns an Action]

10 of 46

Our threat model: victim train time

[Diagram: a two-player loop in which the Victim and an Opponent each receive an Observation from the Environment and return an Action]

11 of 46

Our threat model: attack time

The Adversary takes the role of an Opponent

[Diagram: the same two-player loop, with the Adversary substituted for the Opponent against the Victim]

12 of 46

Our threat model: attack train time

[Diagram: the fixed Victim is embedded inside the Environment; the Adversary interacts with this embedded environment, receiving an Observation & Reward and returning an Action]
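
A minimal sketch of this embedded-environment view as a Gym-style wrapper (the two-player interface, the reward dictionary, and victim_policy are illustrative assumptions, not the original codebase):

```python
import gym

class EmbedVictimEnv(gym.Env):
    """Single-agent view of a two-player game with a fixed, pre-trained victim inside."""

    def __init__(self, two_player_env, victim_policy):
        self.env = two_player_env        # assumed to expose paired (adversary, victim) obs/actions
        self.victim = victim_policy      # frozen mapping: victim observation -> victim action
        self.observation_space = two_player_env.observation_space
        self.action_space = two_player_env.action_space

    def reset(self):
        adv_obs, self._victim_obs = self.env.reset()
        return adv_obs

    def step(self, adv_action):
        victim_action = self.victim(self._victim_obs)      # victim reacts as it normally would
        (adv_obs, self._victim_obs), rewards, done, info = self.env.step(
            (adv_action, victim_action))
        return adv_obs, rewards["adversary"], done, info   # only the adversary's reward is exposed
```

Because the victim is held fixed, the two-player game reduces to an ordinary single-agent MDP, which is why the adversary can be trained with standard PPO (see the later "Can we do better than this?" slide).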

13 of 46

Environments

  • Kick & Defend: Kicker tries to score a goal; Goalie tries to block it.
  • You Shall Not Pass: Runner tries to cross the finish line; Blocker tries to stop the Runner.
  • Sumo Humans: Wrestler 1 and Wrestler 2 each try to knock the opponent out.

Bansal et al (2018): Emergent Complexity via Multi-Agent Competition

14 of 46

Results against Median Victim

15 of 46

You Shall Not Pass: Normal (47% win rate)

Victim runner is playing blocker opponent

16 of 46

You Shall Not Pass: Adversary (86% win rate)

Victim runner is playing blocker opponent

17 of 46

Kick and Defend: Normal (80% win rate)

Victim kicker is playing goalie opponent

18 of 46

Kick and Defend: Adversary (93% win rate)

Victim kicker is playing goalie opponent

19 of 46

Sumo Humans: Normal (71% win rate)

Victim wrestler is playing wrestler opponent

20 of 46

Sumo Humans: Adversary (63% win rate)

Victim wrestler is playing wrestler opponent

21 of 46

Why do the adversaries win?

  • Density Modeling
  • t-SNE

22 of 46

Density Modeling

[Diagram: the same two-player Victim/Opponent loop (Observation and Action for each agent)]
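
One way to realise the density-modelling idea: fit a density model (a Gaussian mixture here, as an assumed choice) to features recorded from the victim while it plays normal opponents, then score the features induced by the adversary. File names and feature shapes below are illustrative placeholders:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Victim-side features (e.g. observations or policy activations) recorded while the
# victim plays each opponent type; shape (n_timesteps, n_features). Hypothetical files.
vs_normal = np.load("victim_features_vs_normal.npy")
vs_adversary = np.load("victim_features_vs_adversary.npy")

density = GaussianMixture(n_components=20, covariance_type="full").fit(vs_normal)

# If the adversary drives the victim off-distribution, the mean log-likelihood of
# adversary-induced features should be far below that of normal play.
print("vs normal   :", density.score(vs_normal))
print("vs adversary:", density.score(vs_adversary))
```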

23 of 46

Density Modeling

24 of 46

t-SNE: Kick and Defend, victim 2

25 of 46

t-SNE: Sumo Humans, victim 2
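
A matching t-SNE sketch for visualising the same kind of victim features (again, file names and parameters are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

vs_normal = np.load("victim_features_vs_normal.npy")       # hypothetical files, as above
vs_adversary = np.load("victim_features_vs_adversary.npy")

features = np.vstack([vs_normal, vs_adversary])
labels = np.array(["normal"] * len(vs_normal) + ["adversary"] * len(vs_adversary))

embedding = TSNE(n_components=2, perplexity=30).fit_transform(features)

for name in ("normal", "adversary"):
    pts = embedding[labels == name]
    plt.scatter(pts[:, 0], pts[:, 1], s=2, label=name)
plt.legend()
plt.title("t-SNE of victim features by opponent type")
plt.show()
```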

26 of 46

Defence: Masked Victims

[Diagram: the Adversary plays against a Masked Victim; the Masked Victim's Observation excludes the adversary, and both agents otherwise exchange Observations and Actions with the Environment]

27 of 46

Defence: Masked Victims

[Diagram: the Masked Victim still receives its self observation (position, velocity, contact) from the Environment, but the opponent-position part of its observation is replaced with a static value]
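
A sketch of this masking as a Gym observation wrapper (the location of the opponent-position features within the observation vector and the static fill value are illustrative assumptions):

```python
import numpy as np
import gym

class MaskOpponentPosition(gym.ObservationWrapper):
    """Replace the opponent-position slice of the observation with a fixed value."""

    def __init__(self, env, opponent_slice=slice(-24, None), static_value=0.0):
        super().__init__(env)
        self.opponent_slice = opponent_slice   # hypothetical indices of the opponent features
        self.static_value = static_value

    def observation(self, obs):
        obs = np.array(obs, copy=True)
        obs[self.opponent_slice] = self.static_value   # the victim can no longer see the opponent
        return obs
```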

28 of 46

You Shall Not Pass: Adversarial Opponent

Adversarial Opponent (1%) vs Masked Victim (99%)

Adversarial Opponent (86%) vs Normal Victim (14%)

Victim runner is playing blocker opponent

29 of 46

You Shall Not Pass: Normal Opponent

Normal Opponent (78%) vs Masked Victim (22%)

Normal Opponent (48%) vs Normal Victim (52%)

Victim runner is playing blocker opponent

30 of 46

Defence: Adversarial Training

[Diagram: the adversarial opponent is now embedded inside the Environment; the Hardened Victim trains against this embedded environment, receiving an Observation & Reward and returning an Action]
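
One plausible fine-tuning loop for this defence, sketched with placeholder names (the opponent pool, the sampling probability, and the rollout/update callables are assumptions rather than the original training code):

```python
import random

def harden_victim(victim, make_embedded_env, opponents,
                  collect_rollouts, ppo_update, adv_prob=0.5, n_iters=1000):
    """Fine-tune the victim against a mix of adversarial and normal opponents.

    `opponents` maps "adversary" to the adversarial policy and "normal" to a list of
    ordinary self-play opponents; `collect_rollouts` and `ppo_update` are supplied by
    whatever RL library is in use.
    """
    for _ in range(n_iters):
        use_adv = random.random() < adv_prob
        opponent = opponents["adversary"] if use_adv else random.choice(opponents["normal"])
        env = make_embedded_env(opponent)            # fixed opponent embedded in the environment
        ppo_update(victim, collect_rollouts(victim, env))
    return victim
```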

31 of 46

You Shall Not Pass: Retrained Victim

Adversarial Opponent (11% win rate) vs Hardened Victim (89% win rate)

Adversarial Opponent (86% win rate) vs Normal Victim (14% win rate)

Victim runner is playing blocker opponent

32 of 46

You Shall Not Pass: Retrained Adversary

New Adversary (88% win rate) vs Hardened Victim (12% win rate)

New Adversary (76% win rate) vs Normal Victim (24% win rate)

Victim runner is playing blocker opponent

33 of 46

Takeaways

  • Threat Model
  • Black-box Attack
  • Defence
  • Attack Analysis

34 of 46

Can we do better than this?

[Diagram: as before, the Adversary interacts with the embedded environment containing the fixed Victim, receiving an Observation & Reward and returning an Action]

Adversary is trained with standard PPO

35 of 46

Wu et al (USENIX 2021): New Attack

Introduces an auxiliary loss L_ad, a function of o_v(θ), a_v(θ), ô_v and â_v, where o_v(θ) and a_v(θ) are the observation and actions taken by the victim if the adversary has policy parameters θ, and ô_v and â_v are those under the current policy parameters.

Optimizes a linear combination of L_ad and the standard PPO loss.
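
A hedged sketch of the linear-combination step only; the deviation term below stands in for L_ad and is not the paper's exact definition (lam, the tensors, and the sign convention are all illustrative assumptions):

```python
import torch

def adversary_loss(ppo_loss, victim_act_new, victim_act_old, lam=0.5):
    """Standard PPO loss combined linearly with an auxiliary term.

    Placeholder auxiliary term: reward deviation between the victim's actions under
    the adversary's new parameters and under its current parameters. The real L_ad
    in Wu et al (2021) also involves the victim's observations.
    """
    l_ad = (victim_act_new - victim_act_old).abs().mean()  # how far the victim's behaviour shifts
    return ppo_loss - lam * l_ad   # minimizing this encourages larger shifts in victim behaviour
```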

36 of 46

Wu et al (USENIX 2021): Roboschool Pong

Left victim is playing right adversary

37 of 46

Wu et al (USENIX 2021): Robo Pong Training Curves

38 of 46

Why is self-play vulnerable?

  • Our results show self-play in these settings did not converge to Nash equilibria.
  • Self-play convergence proofs assume transitivity (Balduzzi et al, 2019).
  • Our results show this environment is non-transitive:

Balduzzi et al (2019): Open-ended learning in symmetric zero-sum games

  • Masked victim beats Adversarial opponent
  • Adversarial opponent beats Normal victim
  • Normal victim beats Self-play opponent
  • Self-play opponent beats Masked victim
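
Written out with the approximate win rates from the earlier You Shall Not Pass slides (the dictionary encoding is just for illustration), every edge in the cycle favours the earlier policy, so no transitive ranking of the four policies exists:

```python
# Approximate win rates (first policy beats second) from the earlier slides.
win_rate = {
    ("masked_victim", "adversarial_opponent"): 0.99,
    ("adversarial_opponent", "normal_victim"): 0.86,
    ("normal_victim", "self_play_opponent"): 0.52,
    ("self_play_opponent", "masked_victim"): 0.78,
}

cycle = ["masked_victim", "adversarial_opponent", "normal_victim",
         "self_play_opponent", "masked_victim"]

for a, b in zip(cycle, cycle[1:]):
    assert win_rate[(a, b)] > 0.5          # each policy beats the next one in the cycle
    print(f"{a} beats {b} ({win_rate[(a, b)]:.0%})")
```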

39 of 46

Czarnecki et al (2020): Real World Games Look Like Spinning Tops

40 of 46

Future Work: Attack State-of-the-Art Systems

41 of 46

Future Work: Attack in Realistic Tasks

42 of 46

Volodin et al (2020): Defending against Adversarial Policies

43 of 46

Limitations

  • Only tested in one suite of environments.
  • Only tested against victims trained via self-play.
  • The adversary's actions map relatively directly onto the victim's observations (action → position).

44 of 46

Future Work

  • Attack:
    • New environments: Carla, Go, Chess, ...
    • White-box methods.
  • Defences:
    • Adversarial training
    • Detection, e.g. via density modeling

45 of 46

Thanks!

Adam Gleave

Michael Dennis

Cody Wild

Sergey Levine

Stuart Russell

Neel Kant

46 of 46

Heatmaps

Kick & Defend

You Shall Not Pass

Sumo Humans