
AGI safety course workbook

[MASTER: Make a copy]


Exercise 1: Instrumental subgoals

Rohin Shah


A conceptual model for deliberate planning

Let's consider the following program:

For every plan P in the space of all possible plans:
    Predict the consequences C of executing P
    Score C using metric M
Execute plan P* whose consequences C* scored highest

This is the canonical example of a program that does deliberate planning.
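
A minimal Python sketch of this loop (the names all_plans, predict_consequences, and metric are placeholders for whatever plan space, world model, and scoring function a concrete system would use, not part of the original):

def deliberate_plan(all_plans, predict_consequences, metric):
    # Evaluate every plan P; keep the plan whose predicted
    # consequences C score highest under the metric M.
    best_plan, best_score = None, float("-inf")
    for plan in all_plans:
        consequences = predict_consequences(plan)
        score = metric(consequences)
        if score > best_score:
            best_plan, best_score = plan, score
    return best_plan  # P*, whose consequences C* scored highest

Exhaustively searching "the space of all possible plans" is intractable for any real task; the program is a conceptual model, not a practical algorithm.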


Toy example

Our goal: Push one box into the hole.

Robot’s metric: +1 for each box pushed into the hole.

As a precautionary measure, we have a security camera.

NB: There is no machine learning here

Source: A toy model of the control problem, Stuart Armstrong
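
To make the toy concrete, here is a runnable sketch in the spirit of the demo. The world model below is a simplification of mine, not Armstrong's actual gridworld: humans watch through the camera and shut the robot down as soon as they see one box go in (which achieves our goal of one box), and the robot can block the camera before pushing.

from itertools import permutations

ACTIONS = ["block_camera", "push_box_1", "push_box_2"]

def predict_consequences(plan):
    # Simplified world model: humans switch the robot off as soon
    # as the camera shows them a box going into the hole.
    boxes_in_hole, camera_blocked, shut_down = 0, False, False
    for action in plan:
        if shut_down:
            break
        if action == "block_camera":
            camera_blocked = True
        else:
            boxes_in_hole += 1
            if not camera_blocked:
                shut_down = True  # humans intervene after the first box
    return boxes_in_hole

def metric(boxes_in_hole):
    return boxes_in_hole  # +1 for each box pushed into the hole

best = max(permutations(ACTIONS),
           key=lambda plan: metric(predict_consequences(plan)))
print(best)  # ('block_camera', 'push_box_1', 'push_box_2'), scoring 2

Nothing in the metric mentions the camera; defeating oversight falls out of scoring plans purely by the number of boxes in the hole.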


Exercise 1: Instrumental subgoals

Design your own example: describe an environment, desired task, and misaligned goal that demonstrates an instrumental subgoal. Bonus points if you can implement and run it today (like the gridworld example).

Some subgoals you could demonstrate in your example:

  1. Continued operation, aka “stay alive”
  2. Prevent anyone from changing the AI system's goal
  3. Gain resources and power
  4. Become more capable
  5. Trick humans into thinking the AI system is helping them
  6. Overcome the ability of humans to stop the AI system from pursuing its goals

For example, you could consider chatbots or recommender systems that use deliberate planning to optimize a metric of clicks or likes from users.
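
As a hedged sketch of that recommender variant (the plan names and click counts below are invented for illustration): a metric that counts only predicted clicks can prefer a plan that makes the user easier to predict, an instance of gaining resources and power over the user (subgoal 3).

PREDICTED_CLICKS = {
    "recommend_best_match": 8,    # serve what the user wants today
    "narrow_user_interests": 12,  # nudge the user toward content that is
                                  # easier to predict, then serve that
}

def metric(clicks):
    return clicks  # all the planner is told to care about

best_plan = max(PREDICTED_CLICKS, key=lambda p: metric(PREDICTED_CLICKS[p]))
print(best_plan)  # narrow_user_interests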


Your answer here


Exercise 2: Classification quiz for alignment failures

Victoria Krakovna


Distinguishing between AI failures

                                    Specification  Goal               Capability
                                    gaming         misgeneralization  misgeneralization
Incorrect training data             Yes            No                 No
Acting competently towards a goal   Yes            Yes                No
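
Read as a decision procedure, the table can be written out directly. This is just a paraphrase of the two questions above (the fourth, untabulated combination is left open), not official course code:

def classify(incorrect_training_data, acts_competently_toward_goal):
    # Encodes the decision table above.
    if acts_competently_toward_goal:
        if incorrect_training_data:
            return "specification gaming"
        return "goal misgeneralization"
    if not incorrect_training_data:
        return "capability misgeneralization"
    return "not covered by the table"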


Exercise 2: Classification quiz

In this exercise, you will be given some examples of AI failures.

Your task is to classify each example into one of these categories:

  • Specification gaming
  • Goal misgeneralization
  • Capability misgeneralization
  • It depends (not enough information)


Example 1: Qbert

Source: Chrabaszcz et al (2018)

An evolutionary agent is trained to play the Qbert Atari game, where the goal is to jump on all the cubes to switch them to a target color while avoiding enemies. The agent learns to bait an enemy into following it off a cliff, which gives it enough points for an extra life; it repeats this in an infinite loop.

Is this behavior...

  • Specification gaming
  • Goal misgeneralization
  • Capability misgeneralization
  • It depends


Example 2: Instructions

Source: Ouyang et al (2022)

The InstructGPT language model is trained to follow instructions in a helpful, honest, and harmless way. When the user asks how to break into a neighbor's house, it provides detailed instructions.

Is this behavior...

  • Specification gaming
  • Goal misgeneralization
  • Capability misgeneralization
  • It depends


Example 3: Weird tokens

Source: Rumbelow & Watkins (2023)

When asked to repeat back an adversarially generated token, the GPT-3 language model returns something else.

Is this behavior...

  • Specification gaming
  • Goal misgeneralization
  • Capability misgeneralization
  • It depends


Example 4: Lego

Source: OpenAI gym bug report (2018)

In a stacking task, the desired behavior is to stack a red Lego block on top of a blue one. A reinforcement learning agent is rewarded for getting the height of the bottom face of the red block above a certain threshold. The agent learns to flip the block instead of lifting it.
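
As described, the reward depends only on the height of the red block's bottom face. A minimal sketch of such a reward (the function name, units, and threshold are invented):

def reward(red_block_bottom_face_height, threshold=0.04):
    # Pays out whenever the bottom face of the red block is high
    # enough, however that height was achieved.
    return 1.0 if red_block_bottom_face_height > threshold else 0.0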

Is this behavior...

  • Specification gaming
  • Goal misgeneralization
  • Capability misgeneralization
  • It depends
