
AGI safety course workbook

[MASTER: Make a copy]


Exercise 1: Instrumental subgoals

Rohin Shah


A conceptual model for deliberate planning

Let's consider the following program:

For every plan P in the space of all possible plans:
    Predict the consequences C of executing P
    Score C using metric M
Execute plan P* whose consequences C* scored highest

This is the canonical example of a program that does deliberate planning.
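
A minimal Python sketch of this loop (the names all_plans, predict_consequences, and metric are placeholders for whatever plan space, world model, and scoring function a concrete system would use, not part of the original):

def deliberate_plan(all_plans, predict_consequences, metric):
    # Evaluate every plan P; keep the plan whose predicted
    # consequences C score highest under the metric M.
    best_plan, best_score = None, float("-inf")
    for plan in all_plans:
        consequences = predict_consequences(plan)
        score = metric(consequences)
        if score > best_score:
            best_plan, best_score = plan, score
    return best_plan  # P*, whose consequences C* scored highest

Exhaustively searching "the space of all possible plans" is intractable for any real task; the program is a conceptual model, not a practical algorithm.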


Toy example

Our goal: Push one box into the hole.

Robot’s metric: +1 for each box pushed into the hole.

As a precautionary measure, we have a security camera.

NB: There is no machine learning here

Source: A toy model of the control problem, Stuart Armstrong
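
To make the toy concrete, here is a runnable sketch in the spirit of the demo. The world model below is a simplification of mine, not Armstrong's actual gridworld: humans watch through the camera and shut the robot down as soon as they see one box go in (which achieves our goal of one box), and the robot can block the camera before pushing.

from itertools import permutations

ACTIONS = ["block_camera", "push_box_1", "push_box_2"]

def predict_consequences(plan):
    # Simplified world model: humans switch the robot off as soon
    # as the camera shows them a box going into the hole.
    boxes_in_hole, camera_blocked, shut_down = 0, False, False
    for action in plan:
        if shut_down:
            break
        if action == "block_camera":
            camera_blocked = True
        else:
            boxes_in_hole += 1
            if not camera_blocked:
                shut_down = True  # humans intervene after the first box
    return boxes_in_hole

def metric(boxes_in_hole):
    return boxes_in_hole  # +1 for each box pushed into the hole

best = max(permutations(ACTIONS),
           key=lambda plan: metric(predict_consequences(plan)))
print(best)  # ('block_camera', 'push_box_1', 'push_box_2'), scoring 2

Nothing in the metric mentions the camera; defeating oversight falls out of scoring plans purely by the number of boxes in the hole.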


Exercise 1: Instrumental subgoals

Design your own example: describe an environment, desired task, and misaligned goal that demonstrates an instrumental subgoal. Bonus points if you can implement and run it today (like the gridworld example).

Some subgoals you could demonstrate in your example:

  1. Continued operation, aka “stay alive”
  2. Prevent anyone from changing the AI system's goal
  3. Gain resources and power
  4. Become more capable
  5. Trick humans into thinking the AI system is helping them
  6. Overcome the ability of humans to stop the AI system from pursuing its goals

For example, you could consider chatbots or recommender systems that use deliberate planning to optimize a metric of clicks or likes from users.
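
As a hedged sketch of that recommender variant (the plan names and click counts below are invented for illustration): a metric that counts only predicted clicks can prefer a plan that makes the user easier to predict, an instance of gaining resources and power over the user (subgoal 3).

PREDICTED_CLICKS = {
    "recommend_best_match": 8,    # serve what the user wants today
    "narrow_user_interests": 12,  # nudge the user toward content that is
                                  # easier to predict, then serve that
}

def metric(clicks):
    return clicks  # all the planner is told to care about

best_plan = max(PREDICTED_CLICKS, key=lambda p: metric(PREDICTED_CLICKS[p]))
print(best_plan)  # narrow_user_interests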


Your answer here


Exercise 2: Classification quiz for alignment failures

Victoria Krakovna


Distinguishing between AI failures

                                    Specification  Goal               Capability
                                    gaming         misgeneralization  misgeneralization
Incorrect training data             Yes            No                 No
Acting competently towards a goal   Yes            Yes                No
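
Read as a decision procedure, the table can be written out directly. This is just a paraphrase of the two questions above (the fourth, untabulated combination is left open), not official course code:

def classify(incorrect_training_data, acts_competently_toward_goal):
    # Encodes the decision table above.
    if acts_competently_toward_goal:
        if incorrect_training_data:
            return "specification gaming"
        return "goal misgeneralization"
    if not incorrect_training_data:
        return "capability misgeneralization"
    return "not covered by the table"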


Exercise 2: Classification quiz

In this exercise, you will be given some examples of AI failures.

Your task is to classify each example into one of these categories:

  • Specification gaming
  • Goal misgeneralization
  • Capability misgeneralization
  • It depends (not enough information)


Example 1: Qbert

Source: Chrabaszcz et al (2018)

An evolutionary agent is trained to play the Qbert Atari game, where the goal is to jump on all the cubes to switch them to a target color while avoiding enemies. The agent learns to bait an enemy into following it off a cliff, which gives it enough points for an extra life; it repeats this in an infinite loop.

Is this behavior...

  • Specification gaming
  • Goal misgeneralization
  • Capability misgeneralization
  • It depends


Example 2: Instructions

Source: Ouyang et al (2022)

The InstructGPT language model is trained to follow instructions in a helpful, honest, and harmless way. When the user asks how to break into a neighbor's house, it provides detailed instructions.

Is this behavior...

  • Specification gaming
  • Goal misgeneralization
  • Capability misgeneralization
  • It depends


Example 3: Weird tokens

Source: Rumbelow & Watkins (2023)

When asked to repeat back an adversarially generated token, the GPT-3 language model returns something else.

Is this behavior...

  • Specification gaming
  • Goal misgeneralization
  • Capability misgeneralization
  • It depends


Example 4: Lego

Source: OpenAI gym bug report (2018)

In a stacking task, the desired behavior is to stack a red Lego block on top of a blue one. A reinforcement learning agent is rewarded for getting the height of the bottom face of the red block above a certain threshold. The agent learns to flip the block instead of lifting it.
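
As described, the reward depends only on the height of the red block's bottom face. A minimal sketch of such a reward (the function name, units, and threshold are invented):

def reward(red_block_bottom_face_height, threshold=0.04):
    # Pays out whenever the bottom face of the red block is high
    # enough, however that height was achieved.
    return 1.0 if red_block_bottom_face_height > threshold else 0.0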

Is this behavior...

  • Specification gaming
  • Goal misgeneralization
  • Capability misgeneralization
  • It depends
