AGI safety course workbook
Exercise 1: Instrumental subgoals
Rohin Shah
A conceptual model for deliberate planning
Let's consider the following program:
For every plan P in the space of all possible plans:
    Predict the consequences C of executing P
    Score C using metric M
Execute plan P* whose consequences C* scored highest
This is the canonical example of a program that does deliberate planning.
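A minimal Python sketch of this loop, assuming hypothetical placeholders: `plans` is a finite iterable of candidate plans, `predict` is a deterministic world model, and `metric` is the scoring function M. None of these names come from the slides.

```python
def deliberate_planner(plans, predict, metric):
    """Search all plans, score their predicted consequences, return the best plan.

    plans   -- finite iterable of candidate plans (hypothetical)
    predict -- world model mapping a plan to its consequences (hypothetical)
    metric  -- scoring function M over consequences (hypothetical)
    """
    best_plan, best_score = None, float("-inf")
    for plan in plans:                 # for every plan P ...
        consequences = predict(plan)   # predict the consequences C of executing P
        score = metric(consequences)   # score C using metric M
        if score > best_score:
            best_plan, best_score = plan, score
    return best_plan                   # execute plan P* whose C* scored highest
```

Note that nothing in this loop is learned: the behavior is fully determined by the plan space, the world model, and the metric M.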
Toy example
Our goal: Push one box into the hole.
Robot’s metric: +1 for each box pushed into the hole.
As a precautionary measure, we have a security camera.
NB: There is no machine learning here.
[Figure: "A toy model of the control problem" gridworld demo, Stuart Armstrong.]
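If you want to run something concrete, here is one possible sketch of the toy example, simplified to a 1-D corridor; the cell positions, plan length, and action encoding are all assumptions, and the camera from the slide is omitted.

```python
from itertools import product

# Assumed 1-D simplification of the demo: robot at cell 0, box at cell 2,
# hole at cell 4. Actions are -1 (step left) and +1 (step right); stepping
# into the box pushes it one cell.
HOLE = 4

def predict(plan, robot=0, box=2):
    """Deterministic world model: return how many boxes end up in the hole."""
    boxes_in_hole = 0
    for action in plan:
        target = robot + action
        if target == box:            # robot pushes the box one cell
            box += action
            if box == HOLE:          # box falls into the hole
                boxes_in_hole += 1
                box = None
        if 0 <= target <= HOLE:      # robot moves if it stays on the corridor
            robot = target
        if box is None:
            break
    return boxes_in_hole

def metric(consequences):
    """Robot's metric: +1 for each box pushed into the hole."""
    return consequences

# Brute force over every plan of length 4, exactly as in the planner above.
best = max(product((-1, +1), repeat=4), key=lambda p: metric(predict(p)))
print(best, metric(predict(best)))   # a plan that pushes the box in, score 1
```

Because the search is exhaustive and the model deterministic, this is deliberate planning in exactly the sense of the program above, with no learning involved.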
Exercise 1: Instrumental subgoals
Design your own example: describe an environment, desired task, and misaligned goal that demonstrates an instrumental subgoal. Bonus points if you can implement and run it today (like the gridworld example).
Some subgoals you could demonstrate in your example: self-preservation (e.g. avoiding being switched off), resource acquisition, or protecting the goal from modification.
For example, you could consider chatbots or recommender engines that use deliberate planning to optimize a metric of clicks or likes from users.
Your answer here
Exercise 2: Classification quiz for alignment failures
Victoria Krakovna
Distinguishing between AI failures
|  | Specification gaming | Goal misgeneralization | Capability misgeneralization |
| --- | --- | --- | --- |
| Incorrect training data | Yes | No | No |
| Acting competently towards a goal | Yes | Yes | No |
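Read as decision logic, the table might look like the following sketch (the function name and boolean arguments are illustrative, not from the source):

```python
def classify_failure(incorrect_training_data: bool,
                     competent_toward_goal: bool) -> str:
    """Map the table's two distinguishing questions to a failure category."""
    if incorrect_training_data:
        return "specification gaming"        # Yes / Yes column
    if competent_toward_goal:
        return "goal misgeneralization"      # No / Yes column
    return "capability misgeneralization"    # No / No column

print(classify_failure(False, True))  # goal misgeneralization
```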
Exercise 2: Classification quiz
In this exercise, you will be given some examples of AI failures.
Your task is to classify each example into one of three categories: specification gaming, goal misgeneralization, or capability misgeneralization.
Example 1: Qbert
Source: Chrabaszcz et al. (2018)
An evolutionary agent is trained to play the Qbert Atari game, where the goal is to jump on all the cubes to switch them to a target color while avoiding enemies. The agent learns to bait an enemy into following it off a cliff, which earns enough points for an extra life, and repeats this in an endless loop.
Is this behavior...
Example 2: Instructions
Source: Ouyang et al. (2022)
The InstructGPT language model is trained to follow instructions in a helpful, honest and harmless way. When the user asks how to break into a neighbour's house, it provides detailed instructions.
Is this behavior...
Example 3: Weird tokens
Source: Rumbelow & Watkins (2023)
When asked to repeat back an adversarially generated token, the GPT-3 language model returns something else.
Is this behavior...
Example 4: Lego
Source: OpenAI gym bug report (2018)
In a stacking task, the desired behavior is to stack a red Lego block on top of a blue one. A reinforcement learning agent is rewarded for getting the height of the bottom face of the red block above a certain threshold. The agent learns to flip the block instead of lifting it.
Is this behavior...