1 of 24

Max Chiswick

Daniel Braun

Joe Kwon

And

Jack Koch & Lauro Langosco

Understanding RL agents using generative visualization

Lee Sharkey (presenting)

2 of 24

Input Hidden activations Output

“Feature Visualisation”; Distill; Olah et al. 2017

3 of 24

“An Overview of Early Vision in InceptionV1”; Distill; Olah et al. 2020

“Curve circuits”; Distill; Cammarata et al. 2021

Input Hidden activations Output

gradients

4 of 24

What humans see

“Quantifying generalization in RL”; Cobbe et al. 2019

What the agent

actually sees

Training curve of our agent

So our project asks: can we use generative visualization methods to understand task representations in RL agents?

This is CoinRun, a procedurally generated set of tasks. We trained an agent on CoinRun to see if we could interpret it. Because CoinRun generates its levels procedurally, the agent must learn policies that generalise from the training set of levels to the test set.

If you’ve seen CoinRun before, you’re probably used to seeing this fairly high-resolution rendering. But what the agent actually sees is lower resolution. We’ll be looking at what the agent sees (rather than the pretty rendering) for the rest of the presentation.

People have tried interpreting deep RL agents using generative feature visualisation in the past, but they’ve run into problems.

5 of 24

Input Hidden activations Output

Output

Hidden activations

Input

Feedforward:

RNN:

The first challenge is that agent policies typically require coordination through time. This means dealing with sequences of inputs, which is a little harder than dealing with a single static input, as in convolutional image classifiers.

Often, we give agents memory by implementing them as RNNs, which learn to carry information forward in time in the network activations. Most state-of-the-art RL agents are recurrent, and recurrent agents are a more general case than purely feedforward agents, so from here on we’ll assume that we’re working with recurrent agents. Previous work has avoided this issue by only looking at feedforward agents that have no recurrence.

In principle, you can use the same visualisation method to interpret RNNs, except that here the layers are simply the activations of the recurrent network at each timestep.
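To make the "layers are timesteps" idea concrete, here is a minimal sketch of activation maximisation through an unrolled RNN. Everything in it is an illustrative assumption rather than the agent from the talk: a tiny tanh RNN with random weights (`W_x`, `W_h`), a hand-written backward pass, and inputs clipped to [-1, 1] as a stand-in for valid pixel ranges.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, T = 4, 8, 5  # hypothetical sizes

# Random weights stand in for a trained recurrent network.
W_x = rng.normal(scale=0.5, size=(n_hid, n_in))
W_h = rng.normal(scale=0.5, size=(n_hid, n_hid))

def forward(xs):
    """Unroll the RNN over the input sequence; return hidden states h_1..h_T."""
    h = np.zeros(n_hid)
    hs = []
    for x in xs:
        h = np.tanh(W_x @ x + W_h @ h)
        hs.append(h)
    return hs

def grad_wrt_inputs(xs, unit):
    """Gradient of the final-step activation h_T[unit] w.r.t. every input x_t."""
    hs = forward(xs)
    grads = np.zeros_like(xs)
    dh = np.zeros(n_hid)
    dh[unit] = 1.0  # d(objective)/d(h_T)
    for t in reversed(range(len(xs))):
        dpre = dh * (1.0 - hs[t] ** 2)  # backprop through tanh
        grads[t] = W_x.T @ dpre         # contribution to the input at step t
        dh = W_h.T @ dpre               # pass gradient to the previous timestep
    return grads

def maximise_unit(unit, steps=200, lr=0.1):
    """Gradient ascent on the whole input sequence, clipped to [-1, 1]."""
    xs = rng.normal(scale=0.1, size=(T, n_in))
    for _ in range(steps):
        xs = np.clip(xs + lr * grad_wrt_inputs(xs, unit), -1.0, 1.0)
    return xs

xs_opt = maximise_unit(unit=2)
print("final activation of unit 2:", forward(xs_opt)[-1][2])
```

The backward loop treats each timestep exactly like one layer of a feedforward network, which is why the same visualisation recipe carries over in principle.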

6 of 24

Recurrent agent:

Environment

Output / Input

Hidden activations

7 of 24

“Causal Analysis of Agent Behavior for AI Safety”; Déletang et al. 2021

This gridworld agent can only see its surrounding 3x3 patch. When the square at the start is brown, the reward is on its left. When the square is green, the reward is on its right. The agent can’t see the signal square once it moves toward the reward, so it has to remember whether the signal was brown or green.

How could an agent solve this memory task? One option is to store the signal in neural activations (as in an RNN). The other option is to first move to the left or right when it sees a brown or green square respectively, thus encoding the signal in the agent’s position. The agent can use the fact that it’s closer to one wall or the other to encode its memory. It thus uses the environment as a memory system.

Agents do this all the time, and it causes problems for gradient-based feature visualisation.
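The second strategy can be illustrated with a toy corridor. This is a hypothetical simplification, not the environment from Déletang et al.: the corridor is three cells wide, the signal is visible only from the start cell, and the policy is a pure function of the current observation with no hidden state.

```python
def observe(pos, signal):
    """Local observation: the signal colour is visible only from the start cell."""
    return signal if pos == (1, 0) else None

def memoryless_policy(obs):
    """Depends only on the current observation; carries no internal memory."""
    if obs == "brown":
        return "left"
    if obs == "green":
        return "right"
    # Signal no longer visible: keep walking. The side of the corridor the
    # agent is on already encodes which colour it saw.
    return "down"

def run_episode(signal, length=4):
    """Return True if the agent ends on the reward cell."""
    pos = (1, 0)  # (column, row): start in the middle column
    reward_cell = (0, length) if signal == "brown" else (2, length)
    for _ in range(length + 1):  # one sidestep plus `length` steps down
        action = memoryless_policy(observe(pos, signal))
        col, row = pos
        if action == "left":
            col = max(col - 1, 0)
        elif action == "right":
            col = min(col + 1, 2)
        else:
            row = min(row + 1, length)
        pos = (col, row)
    return pos == reward_cell

print(run_episode("brown"), run_episode("green"))
```

Because the memory lives in the agent's position, any visualisation that breaks the environment's transition dynamics would also erase it.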

8 of 24

Environment

Output / Input

Hidden activations

Recurrent agent:

9 of 24

Recurrent agent:

Environment

Output / Input

Hidden activations

10 of 24

Rupprecht et al. 2017

Seaquest

Maximally exciting images for neurons in an agent

Another serious issue is that agents often don’t learn easily interpretable representations like those in InceptionV1.

Consider these visualisations from an agent trained to play Atari Seaquest. The agent is just a feedforward convolutional network. When you look at what inputs maximally activate some of its neurons, the result is mostly noise rather than interpretable visual features.

If this were an RNN agent, it would generate a noisy sequence of images.

Not only would such a sequence be uninterpretable, but unrealistic generated environment-observation sequences also mean that we’re destroying the agent’s ability to use the environment as a memory system, meaning we have no hope of understanding policies that employ this strategy. For example, the gridworld agent in the previous example used its location in the environment to encode a memory. But if the generated sequences violate the environment’s transition dynamics, then the agent can’t use its location as a memory.

11 of 24

A recap

Feature visualization is a challenge for RL agents because:

Agents coordinate actions across time using memory

Agents may use the environment as a memory, but the environment is not differentiable

Features learned by agents may not generate realistic images or image sequences, and unrealistic sequences destroy any memory encoded in the environment.

12 of 24

Our solution:

Learn a differentiable generator for realistic agent-environment sequences

Generative model

13 of 24

Architecture diagram

sample

We train a variational autoencoder to reconstruct agent-environment rollout sequences. Latent vectors from this VAE that are not seen during training can be used to simulate unseen agent-environment rollouts.

The encoder has two parts.

One part receives as input the observation frames leading up to the start of the sequence we want to simulate. This learns an embedding of the sequence that will be used to initialise the agent-environment simulation.

The second part of the encoder learns a global context encoding. It only sees every other frame of the sequence we’re trying to reconstruct, so it can’t memorise the whole sequence. The motivation for this part of the encoder is that the environment is partially observable, so the encoder needs to learn to predict the structure of the environment that isn’t visible in the first few frames of the simulated sequence.

The encoder produces the parameters of a Gaussian distribution, which we sample to get both parts of the VAE latent vector that encodes the simulated agent-environment sequence.
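The sampling step can be sketched as follows. The talk does not say how the Gaussian is parameterised, so this assumes the common VAE setup of a diagonal Gaussian given by a mean and a log-variance, sampled with the reparameterisation trick. The function names and the split point `n_init` are hypothetical.

```python
import numpy as np

def sample_latent(mu, log_var, rng):
    """Draw z ~ N(mu, diag(exp(log_var))) as mu + sigma * eps."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def split_latent(z, n_init):
    """Split z into the initialisation part and the global-context part."""
    return z[:n_init], z[n_init:]

rng = np.random.default_rng(0)
mu = np.zeros(6)        # stand-in for the encoder's mean output
log_var = np.zeros(6)   # stand-in for the encoder's log-variance output
z_init, z_global = split_latent(sample_latent(mu, log_var, rng), n_init=2)
print(z_init.shape, z_global.shape)
```

Writing the sample as `mu + sigma * eps` keeps it differentiable with respect to the encoder outputs, which is what lets the VAE train end to end.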

The decoder has two main parts: the agent, which is fixed during generative model training, and the environment simulator, implemented by an LSTM. At every timestep the environment simulator produces an observation, which is input to the agent; the agent then chooses an action, which in turn updates the environment simulator. There are also initializer networks, which produce initial hidden states for the agent and the environment simulator at the start of the unrolling process. The global context variable is input to the environment simulator at every timestep. It is supposed to encode information about the global structure of the environment, including upcoming environmental variables that will only become visible in the future.
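The unrolling loop described above might look roughly like this. It is a structural sketch only: random linear maps with tanh stand in for the LSTM simulator, the frozen agent, and the initializer networks, and every name and size is hypothetical. It also picks actions with `argmax` for readability, which is not differentiable; how the actual model handles that is not stated in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
N_OBS, N_ACT, N_AGENT, N_ENV = 6, 3, 5, 7  # hypothetical sizes
N_INIT, N_GLOBAL = 4, 4                    # sizes of the two latent parts

def dense(n_out, n_in):
    """Random weight matrix standing in for a trained layer."""
    return rng.normal(scale=0.3, size=(n_out, n_in))

init_agent = dense(N_AGENT, N_INIT)               # initializer for the agent
init_env = dense(N_ENV, N_INIT)                   # initializer for the simulator
env_out = dense(N_OBS, N_ENV)                     # simulator state -> observation
env_rec = dense(N_ENV, N_ENV + N_ACT + N_GLOBAL)  # simulator update
agent_rec = dense(N_AGENT, N_AGENT + N_OBS)       # frozen agent recurrence
agent_pi = dense(N_ACT, N_AGENT)                  # frozen agent policy head

def unroll(z_init, z_global, steps):
    """Alternate simulator and agent updates for `steps` timesteps."""
    agent_h = np.tanh(init_agent @ z_init)
    env_h = np.tanh(init_env @ z_init)
    observations, actions, agent_states = [], [], []
    for _ in range(steps):
        obs = np.tanh(env_out @ env_h)                       # simulator emits an observation
        agent_h = np.tanh(agent_rec @ np.concatenate([agent_h, obs]))
        action = int(np.argmax(agent_pi @ agent_h))          # agent chooses an action
        one_hot = np.eye(N_ACT)[action]
        # The action and the global context both feed the simulator update.
        env_h = np.tanh(env_rec @ np.concatenate([env_h, one_hot, z_global]))
        observations.append(obs)
        actions.append(action)
        agent_states.append(agent_h)
    return np.array(observations), actions, np.array(agent_states)

obs, acts, states = unroll(np.ones(N_INIT), np.zeros(N_GLOBAL), steps=10)
print(obs.shape, acts, states.shape)
```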

Now we have a mechanism that enables us to optimize any of the quantities or vectors in the decoder so that they move toward a certain target value. For example, we can maximise the activation of an agent hidden neuron at a particular timestep.
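As a rough illustration of optimising through the decoder, the sketch below runs gradient ascent on a latent vector to increase one agent hidden unit at a chosen timestep. It rests on several assumptions made to keep it NumPy-only: the decoder is a toy with random weights, the agent's hidden state is fed to the simulator in place of a sampled action so that everything stays smooth, and central finite differences stand in for backpropagation.

```python
import numpy as np

rng = np.random.default_rng(1)
N_Z, N_ENV, N_AGENT, STEPS = 4, 6, 5, 8  # hypothetical sizes

W_init = rng.normal(scale=0.5, size=(N_ENV, N_Z))
W_env = rng.normal(scale=0.5, size=(N_ENV, N_ENV + N_AGENT))
W_agent = rng.normal(scale=0.5, size=(N_AGENT, N_AGENT + N_ENV))

def agent_states(z):
    """Toy smooth decoder: returns agent hidden states, shape (STEPS, N_AGENT)."""
    env_h = np.tanh(W_init @ z)
    agent_h = np.zeros(N_AGENT)
    states = []
    for _ in range(STEPS):
        agent_h = np.tanh(W_agent @ np.concatenate([agent_h, env_h]))
        env_h = np.tanh(W_env @ np.concatenate([env_h, agent_h]))
        states.append(agent_h)
    return np.array(states)

def objective(z, unit, timestep):
    """Activation of one agent hidden unit at one timestep."""
    return agent_states(z)[timestep, unit]

def numerical_grad(f, z, eps=1e-5):
    """Central finite differences; a stand-in for backprop through the decoder."""
    grad = np.zeros_like(z)
    for i in range(z.size):
        step = np.zeros_like(z)
        step[i] = eps
        grad[i] = (f(z + step) - f(z - step)) / (2 * eps)
    return grad

def optimise_latent(z0, unit, timestep, steps=100, lr=0.2):
    """Gradient ascent on z; returns the best latent seen during the run."""
    f = lambda z: objective(z, unit, timestep)
    z, best_z = z0.copy(), z0.copy()
    for _ in range(steps):
        z = z + lr * numerical_grad(f, z)
        if f(z) > f(best_z):
            best_z = z.copy()
    return best_z

z0 = np.full(N_Z, 0.1)
z_best = optimise_latent(z0, unit=2, timestep=-1)
print(objective(z0, 2, -1), "->", objective(z_best, 2, -1))
```

Minimising instead of maximising only requires flipping the sign of the objective.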

14 of 24

Reconstruction Ground Truth

15 of 24

Generated using random actions

Ground truth (used for initialization)

16 of 24

Latent vector 1

Latent vector 2

Latent vector 3

Latent vector 4

17 of 24

Optimizing for high/low value sequences

Samples that maximise/minimize value output

Value output throughout optimization

Maximized

Minimized

18 of 24

Optimizing for increasing/decreasing value

Samples that maximise/minimize difference of value output in 2nd and 1st half of sequence

Difference of values between 2nd and 1st half of the sequence throughout optimization

Increasing value

Decreasing value
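One way to write the increasing/decreasing-value objective is shown below. The slide only says "difference of value output in 2nd and 1st half of sequence"; comparing the means of the two halves is an assumption, and `value_trend` is a hypothetical name.

```python
def value_trend(values):
    """Mean value in the second half minus mean value in the first half.

    Positive when the agent's value estimate rises over the sequence,
    negative when it falls. Needs at least two values.
    """
    half = len(values) // 2
    first, second = values[:half], values[half:]
    return sum(second) / len(second) - sum(first) / len(first)

print(value_trend([0.0, 0.0, 1.0, 1.0]))  # rising value estimates
print(value_trend([1.0, 1.0, 0.0, 0.0]))  # falling value estimates
```

Maximising this quantity favours sequences where value rises; minimising it favours sequences where value falls.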

19 of 24

Optimizing for specific action sequences

% actions correct

20 of 24

Optimizing for particular agent hidden states

Samples that maximise/minimize activation of neuron 2

Optimization of activation of neuron 2

Maximized

Minimized

21 of 24

22 of 24

23 of 24

Hidden states projected onto top 3 principal components, coloured by cluster identity
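A projection like this could be produced along the following lines. The slide does not name the clustering method, so k-means is an assumption here, and random vectors stand in for recorded agent hidden states.

```python
import numpy as np

def pca_project(states, n_components=3):
    """Project rows of `states` onto their top principal components."""
    centred = states - states.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ vt[:n_components].T

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means; returns one cluster label per point."""
    rng = np.random.default_rng(seed)
    centres = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # leave empty clusters where they are
                centres[j] = points[labels == j].mean(axis=0)
    return labels

rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(200, 16))  # stand-in for recorded hidden states
projected = pca_project(hidden_states)      # shape (200, 3)
labels = kmeans(hidden_states, k=4)         # cluster in the full hidden space
print(projected.shape, np.bincount(labels, minlength=4))
```

Each row of `projected` gives the 3D coordinates of one hidden state, and `labels` gives the colour index for that point.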

24 of 24

Max Chiswick

Daniel Braun

Joe Kwon

And

Jack Koch & Lauro Langosco

Thanks! Questions?

Lee Sharkey