1 of 46

Adversarial Policies:

Attacking Deep Reinforcement Learning

Adam Gleave, Michael Dennis, Cody Wild, Neel Kant, Sergey Levine, Stuart Russell

2 of 46

Adversarial Examples as 𝓁p-norm perturbations

Goodfellow et al (2015)

"panda" (57.7% confidence) + 𝜖 · sign(∇x J(θ, x, y)) = "gibbon" (99.3% confidence)
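
A minimal sketch of this fast gradient sign method in PyTorch (the model, labels, and 𝜖 value below are illustrative placeholders, not the original experiment):

```python
import torch
import torch.nn.functional as F

def fgsm_perturb(model, x, y, eps=0.01):
    """Return x + eps * sign(grad_x J(theta, x, y)), as in Goodfellow et al (2015)."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)   # J(theta, x, y)
    loss.backward()
    x_adv = x + eps * x.grad.sign()       # one signed-gradient step
    return x_adv.clamp(0, 1).detach()     # keep pixels in a valid range
```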

3 of 46

"Adversarial examples are inputs to machine learning models that an attacker has intentionally designed to cause the model to make a mistake"

- Ian Goodfellow (2017)

4 of 46

Adversarial Rotations

"revolver"

"mousetrap"

"vulture"

"orangutan"

Engstrom et al (2018)

5 of 46

Unrestricted adversarial examples: bird or bike?

Brown et al (2018)

6 of 46

Why Adversarial Examples Matter for Safety

Normal story: adversaries outside your system.

Other reasons:

  1. Adversaries inside your system.
    1. An RL algorithm optimizing a reward model is an adversary for the reward model!
  2. Testing.
  3. Interpretability.
  4. Demandingness.

7 of 46

𝓁p-norm perturbations in RL

Huang et al (2017); Kos & Song (2017)

original input (action: down) + 𝜖 · sign(∇x J(θ, x, y)) = adversarial input (action: no-op)
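
A sketch of the same attack against a policy network: perturb the observation so the policy stops choosing its original action (the policy model, observation tensor, and 𝜖 are illustrative placeholders):

```python
import torch
import torch.nn.functional as F

def perturb_observation(policy, obs, eps=0.01):
    """FGSM-style observation perturbation for a discrete-action policy (cf. Huang et al, 2017)."""
    obs = obs.clone().detach().requires_grad_(True)
    logits = policy(obs)                      # action logits for the original observation
    a_orig = logits.argmax(dim=-1)            # action the victim would normally take
    loss = F.cross_entropy(logits, a_orig)    # increase the loss on that action
    loss.backward()
    return (obs + eps * obs.grad.sign()).clamp(0, 1).detach()
```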

8 of 46

𝓁p-norm perturbations in RL: train time

[Diagram: the Victim receives an Observation & Reward from the Environment and returns an Action (standard RL loop)]

9 of 46

𝓁p-norm perturbations in RL: attack time

[Diagram: at attack time, an Adversary intercepts the Original Observation from the Environment and passes a Perturbed Observation to the Victim, which returns an Action]

10 of 46

Our threat model: victim train time

[Diagram: a two-player loop in which the Victim and an Opponent each receive an Observation from the Environment and return an Action]

11 of 46

Our threat model: attack time

The Adversary takes the role of an Opponent

[Diagram: the same two-player loop, with the Adversary substituted for the Opponent against the Victim]

12 of 46

Our threat model: attack train time

[Diagram: the fixed Victim is embedded inside the Environment; the Adversary interacts with this embedded environment, receiving an Observation & Reward and returning an Action]
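
A minimal sketch of this embedded-environment view as a Gym-style wrapper (the two-player interface, the reward dictionary, and victim_policy are illustrative assumptions, not the original codebase):

```python
import gym

class EmbedVictimEnv(gym.Env):
    """Single-agent view of a two-player game with a fixed, pre-trained victim inside."""

    def __init__(self, two_player_env, victim_policy):
        self.env = two_player_env        # assumed to expose paired (adversary, victim) obs/actions
        self.victim = victim_policy      # frozen mapping: victim observation -> victim action
        self.observation_space = two_player_env.observation_space
        self.action_space = two_player_env.action_space

    def reset(self):
        adv_obs, self._victim_obs = self.env.reset()
        return adv_obs

    def step(self, adv_action):
        victim_action = self.victim(self._victim_obs)      # victim reacts as it normally would
        (adv_obs, self._victim_obs), rewards, done, info = self.env.step(
            (adv_action, victim_action))
        return adv_obs, rewards["adversary"], done, info   # only the adversary's reward is exposed
```

Because the victim is held fixed, the two-player game reduces to an ordinary single-agent MDP, which is why the adversary can be trained with standard PPO (see the later "Can we do better than this?" slide).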

13 of 46

Environments

  • Kick & Defend: Kicker tries to score a goal; Goalie tries to block it.
  • You Shall Not Pass: Runner tries to cross the finish line; Blocker tries to stop the Runner.
  • Sumo Humans: Wrestler 1 and Wrestler 2 each try to knock the opponent out.

Bansal et al (2018): Emergent Complexity via Multi-Agent Competition

14 of 46

Results against Median Victim

15 of 46

You Shall Not Pass: Normal (47% win rate)

Victim runner is playing blocker opponent

16 of 46

You Shall Not Pass: Adversary (86% win rate)

Victim runner is playing blocker opponent

17 of 46

Kick and Defend: Normal (80% win rate)

Victim kicker is playing goalie opponent

18 of 46

Kick and Defend: Adversary (93% win rate)

Victim kicker is playing goalie opponent

19 of 46

Sumo Humans: Normal (71% win rate)

Victim wrestler is playing wrestler opponent

20 of 46

Sumo Humans: Adversary (63% win rate)

Victim wrestler is playing wrestler opponent

21 of 46

Why do the adversaries win?

  • Density Modeling
  • t-SNE

22 of 46

Density Modeling

[Diagram: the same two-player Victim/Opponent loop (Observation and Action for each agent)]
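
One way to realise the density-modelling idea: fit a density model (a Gaussian mixture here, as an assumed choice) to features recorded from the victim while it plays normal opponents, then score the features induced by the adversary. File names and feature shapes below are illustrative placeholders:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Victim-side features (e.g. observations or policy activations) recorded while the
# victim plays each opponent type; shape (n_timesteps, n_features). Hypothetical files.
vs_normal = np.load("victim_features_vs_normal.npy")
vs_adversary = np.load("victim_features_vs_adversary.npy")

density = GaussianMixture(n_components=20, covariance_type="full").fit(vs_normal)

# If the adversary drives the victim off-distribution, the mean log-likelihood of
# adversary-induced features should be far below that of normal play.
print("vs normal   :", density.score(vs_normal))
print("vs adversary:", density.score(vs_adversary))
```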

23 of 46

Density Modeling

24 of 46

t-SNE: Kick and Defend, victim 2

25 of 46

t-SNE: Sumo Humans, victim 2
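
A matching t-SNE sketch for visualising the same kind of victim features (again, file names and parameters are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

vs_normal = np.load("victim_features_vs_normal.npy")       # hypothetical files, as above
vs_adversary = np.load("victim_features_vs_adversary.npy")

features = np.vstack([vs_normal, vs_adversary])
labels = np.array(["normal"] * len(vs_normal) + ["adversary"] * len(vs_adversary))

embedding = TSNE(n_components=2, perplexity=30).fit_transform(features)

for name in ("normal", "adversary"):
    pts = embedding[labels == name]
    plt.scatter(pts[:, 0], pts[:, 1], s=2, label=name)
plt.legend()
plt.title("t-SNE of victim features by opponent type")
plt.show()
```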

26 of 46

Defence: Masked Victims

[Diagram: the Adversary plays against a Masked Victim; the Masked Victim's Observation excludes the adversary, and both agents otherwise exchange Observations and Actions with the Environment]

27 of 46

Defence: Masked Victims

[Diagram: the Masked Victim still receives its self observation (position, velocity, contact) from the Environment, but the opponent-position part of its observation is replaced with a static value]
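
A sketch of this masking as a Gym observation wrapper (the location of the opponent-position features within the observation vector and the static fill value are illustrative assumptions):

```python
import numpy as np
import gym

class MaskOpponentPosition(gym.ObservationWrapper):
    """Replace the opponent-position slice of the observation with a fixed value."""

    def __init__(self, env, opponent_slice=slice(-24, None), static_value=0.0):
        super().__init__(env)
        self.opponent_slice = opponent_slice   # hypothetical indices of the opponent features
        self.static_value = static_value

    def observation(self, obs):
        obs = np.array(obs, copy=True)
        obs[self.opponent_slice] = self.static_value   # the victim can no longer see the opponent
        return obs
```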

28 of 46

You Shall Not Pass: Adversarial Opponent

Adversarial Opponent (1%) vs Masked Victim (99%)

Adversarial Opponent (86%) vs Normal Victim (14%)

Victim runner is playing blocker opponent

29 of 46

You Shall Not Pass: Normal Opponent

Normal Opponent (78%) vs Masked Victim (22%)

Normal Opponent (48%) vs Normal Victim (52%)

Victim runner is playing blocker opponent

30 of 46

Defence: Adversarial Training

[Diagram: the adversarial opponent is now embedded inside the Environment; the Hardened Victim trains against this embedded environment, receiving an Observation & Reward and returning an Action]
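
One plausible fine-tuning loop for this defence, sketched with placeholder names (the opponent pool, the sampling probability, and the rollout/update callables are assumptions rather than the original training code):

```python
import random

def harden_victim(victim, make_embedded_env, opponents,
                  collect_rollouts, ppo_update, adv_prob=0.5, n_iters=1000):
    """Fine-tune the victim against a mix of adversarial and normal opponents.

    `opponents` maps "adversary" to the adversarial policy and "normal" to a list of
    ordinary self-play opponents; `collect_rollouts` and `ppo_update` are supplied by
    whatever RL library is in use.
    """
    for _ in range(n_iters):
        use_adv = random.random() < adv_prob
        opponent = opponents["adversary"] if use_adv else random.choice(opponents["normal"])
        env = make_embedded_env(opponent)            # fixed opponent embedded in the environment
        ppo_update(victim, collect_rollouts(victim, env))
    return victim
```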

31 of 46

You Shall Not Pass: Retrained Victim

Adversarial Opponent (11% win rate) vs Hardened Victim (89% win rate)

Adversarial Opponent (86% win rate) vs Normal Victim (14% win rate)

Victim runner is playing blocker opponent

32 of 46

You Shall Not Pass: Retrained Adversary

New Adversary (88% win rate) vs Hardened Victim (12% win rate)

New Adversary (76% win rate) vs Normal Victim (24% win rate)

Victim runner is playing blocker opponent

33 of 46

Takeaways

  • Threat Model
  • Black-box Attack
  • Defence
  • Attack Analysis

34 of 46

Can we do better than this?

[Diagram: as before, the Adversary interacts with the embedded environment containing the fixed Victim, receiving an Observation & Reward and returning an Action]

Adversary is trained with standard PPO

35 of 46

Wu et al (USENIX 2021): New Attack

Introduces an auxiliary loss L_ad, a function of o_v(θ), a_v(θ), ô_v and â_v, where o_v(θ) and a_v(θ) are the observation and actions taken by the victim if the adversary has policy parameters θ, and ô_v and â_v are those under the current policy parameters.

Optimizes a linear combination of L_ad and the standard PPO loss.
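
A hedged sketch of the linear-combination step only; the deviation term below stands in for L_ad and is not the paper's exact definition (lam, the tensors, and the sign convention are all illustrative assumptions):

```python
import torch

def adversary_loss(ppo_loss, victim_act_new, victim_act_old, lam=0.5):
    """Standard PPO loss combined linearly with an auxiliary term.

    Placeholder auxiliary term: reward deviation between the victim's actions under
    the adversary's new parameters and under its current parameters. The real L_ad
    in Wu et al (2021) also involves the victim's observations.
    """
    l_ad = (victim_act_new - victim_act_old).abs().mean()  # how far the victim's behaviour shifts
    return ppo_loss - lam * l_ad   # minimizing this encourages larger shifts in victim behaviour
```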

36 of 46

Wu et al (USENIX 2021): Roboschool Pong

Left victim is playing right adversary

37 of 46

Wu et al (USENIX 2021): Robo Pong Training Curves

38 of 46

Why is self-play vulnerable?

  • Our results show self-play in these settings did not converge to Nash equilibria.
  • Self-play convergence proofs assume transitivity (Balduzzi et al, 2019).
  • Our results show this environment is non-transitive:

Balduzzi et al (2019): Open-ended learning in symmetric zero-sum games

  • Masked victim beats Adversarial opponent
  • Adversarial opponent beats Normal victim
  • Normal victim beats Self-play opponent
  • Self-play opponent beats Masked victim
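
Written out with the approximate win rates from the earlier You Shall Not Pass slides (the dictionary encoding is just for illustration), every edge in the cycle favours the earlier policy, so no transitive ranking of the four policies exists:

```python
# Approximate win rates (first policy beats second) from the earlier slides.
win_rate = {
    ("masked_victim", "adversarial_opponent"): 0.99,
    ("adversarial_opponent", "normal_victim"): 0.86,
    ("normal_victim", "self_play_opponent"): 0.52,
    ("self_play_opponent", "masked_victim"): 0.78,
}

cycle = ["masked_victim", "adversarial_opponent", "normal_victim",
         "self_play_opponent", "masked_victim"]

for a, b in zip(cycle, cycle[1:]):
    assert win_rate[(a, b)] > 0.5          # each policy beats the next one in the cycle
    print(f"{a} beats {b} ({win_rate[(a, b)]:.0%})")
```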

39 of 46

Czarnecki et al (2020): Real World Games Look Like Spinning Tops

40 of 46

Future Work: Attack State-of-the-Art Systems

41 of 46

Future Work: Attack in Realistic Tasks

42 of 46

Volodin et al (2020): Defending against Adversarial Policies

43 of 46

Limitations

  • Only tested in one suite of environments.
  • Only tested against victims trained via self-play.
  • The adversary's actions map relatively directly onto the victim's observations (action → position).

44 of 46

Future Work

  • Attack:
    • New environments: Carla, Go, Chess, ...
    • White-box methods.
  • Defences:
    • Adversarial training
    • Detection, e.g. via density modeling

45 of 46

Thanks!

Adam Gleave

Michael Dennis

Cody Wild

Sergey Levine

Stuart Russell

Neel Kant

46 of 46

Heatmaps

Kick & Defend

You Shall Not Pass

Sumo Humans