The Problem of AI Alignment
Rohin Shah
Center for Human-Compatible AI
AI research: figure out how to meet a specification
Specifications aren’t infallible
As your tasks get more complex, it becomes much more difficult to provide the right specification.
Bad specifications: natural language
Bad specifications: reward functions
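To make this concrete, here is a minimal sketch of a misspecified reward, loosely in the style of the well-known CoastRunners boat-racing incident (the state fields and function names are my own illustration, not from the talk): the designer intends "finish the race," but the reward pays for score, so maximizing it comes apart from the intended task.

```python
from dataclasses import dataclass

@dataclass
class State:
    score_gained: float      # points collected on this step
    distance_to_goal: float  # how far the agent is from actually finishing

# What the designer wrote: pay for score, a proxy for "doing well".
def proxy_reward(state: State) -> float:
    return state.score_gained

# What the designer meant: make progress toward completing the task.
def intended_reward(state: State) -> float:
    return -state.distance_to_goal

# An agent maximizing proxy_reward can collect respawning score targets
# in a loop forever, scoring highly while distance_to_goal never shrinks.
```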
Bad specifications: human feedback
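For illustration (my sketch, not from the talk): the standard Bradley-Terry loss used to learn a reward model from pairwise human comparisons, as in RLHF. What it actually specifies is "outputs the evaluator prefers," which comes apart from "outputs that are actually good" whenever the evaluator can be misled.

```python
import torch
import torch.nn.functional as F

def preference_loss(r_preferred: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Model P(preferred beats rejected) = sigmoid(r_preferred - r_rejected)
    # and maximize the log-likelihood of the human's recorded choices.
    return -F.logsigmoid(r_preferred - r_rejected).mean()

# The learned reward captures what the rater approved of, not what the
# rater wanted; a policy optimized against it is rewarded for looking
# good to the rater, including by fooling them.
```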
Bad specifications: datasets / loss functions
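A runnable toy (my construction, not from the talk) showing how a dataset can under-specify a task: during training, a spurious feature matches the label perfectly, so minimizing the loss rewards the shortcut, and accuracy drops once the correlation breaks at test time.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
signal = rng.integers(0, 2, n)        # the feature we intend the model to use
background = signal.copy()            # spurious feature: identical to the label in training
X_train = np.column_stack([signal + rng.normal(0, 0.5, n), background])

model = LogisticRegression().fit(X_train, signal)
print("train accuracy:", model.score(X_train, signal))  # ~1.0: the shortcut works here

# At test time the spurious correlation breaks: background is random.
background_test = rng.integers(0, 2, n)
X_test = np.column_stack([signal + rng.normal(0, 0.5, n), background_test])
print("test accuracy:", model.score(X_test, signal))    # markedly lower: the shortcut broke
```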
Research question
How can we make it easier to specify the task we would like our AI system to perform?
Specification problem + superintelligent AI
Let’s imagine the following scenario: we build a superintelligent AI system and hand it a specification to pursue.
What happens?
A hypothetical: the paperclip maximizer
Let’s say we fail at the specification problem really badly:
“Make as many paperclips as possible.”
How bad can this be?
Convergent instrumental subgoals
For most goals that an AI system could have, if they are not the same as the goals that humans have, there is an incentive to:
- Preserve itself (it can’t achieve its goal if it’s switched off)
- Preserve its goal (it can’t achieve its goal if the goal gets changed)
- Acquire resources and improve its own capabilities
From now to then
There are issues with this story, but none of them seems to address the core underlying concern, the specification problem, so I’m still worried.
Takeaway
Currently, we don’t know how to give specifications such that AI systems do what we intend them to do. This could be a major problem in the long term.