1 of 13

The Problem of AI Alignment

Rohin Shah

Center for Human-Compatible AI

2 of 13

AI research: figure out how to meet a specification

  • Search
    • Given goal state(s), find a path to some goal state
  • CSPs
    • Given a set of constraints, find an assignment meeting the constraints
  • Logic
    • Given a boolean formula, find a satisfying assignment
  • MDPs and bandits
    • Given a reward function, find a policy that maximizes expected reward (see the sketch after this list)
  • Supervised learning
    • Given a dataset, predict the label of a new example
  • Robotics
    • Given a cost function, find a plan that minimizes cost
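
Each of these problems takes the specification as given and optimizes against it. As a minimal sketch, assuming a hypothetical 3-state MDP (the dynamics and rewards below are made up for illustration), value iteration mechanically turns a reward function into a policy:

    # Minimal sketch: value iteration on a hypothetical 3-state MDP.
    # The "specification" is the reward function R; the algorithm
    # finds the policy that maximizes expected discounted reward.
    import numpy as np

    gamma = 0.9

    # Assumed transition model T[s, a, s'] and reward R[s, a].
    T = np.array([
        [[0.8, 0.2, 0.0], [0.1, 0.9, 0.0]],
        [[0.0, 0.8, 0.2], [0.0, 0.1, 0.9]],
        [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],   # state 2 is absorbing
    ])
    R = np.array([
        [0.0, 0.0],
        [0.0, 0.0],
        [1.0, 1.0],   # the specification: reward only in state 2
    ])

    V = np.zeros(3)
    for _ in range(100):
        Q = R + gamma * (T @ V)   # Q[s, a] = R[s, a] + gamma * E[V(s')]
        V = Q.max(axis=1)

    policy = Q.argmax(axis=1)     # the behaviour the specification induces
    print("V:", V.round(2), "policy:", policy)

The algorithm never asks what we meant; it only sees R.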

3 of 13

Specifications aren’t infallible

As your tasks get more complex, it becomes much more difficult to provide the right specification.

4 of 13

Bad specifications: natural language
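
A minimal sketch of the failure mode (the scenario and scoring function below are hypothetical, chosen only for illustration): an optimizer given the literal reading of "make the room as clean as possible" picks the plan that best satisfies the words, not the intent.

    # Minimal sketch: a literal-minded optimizer given a natural-language
    # goal. "As clean as possible" is operationalized (badly) as "fewest
    # objects in the room". All plans and outcomes here are made up.
    rooms = {
        "do nothing":            {"objects": 12, "keepsakes_kept": True},
        "tidy the shelves":      {"objects": 12, "keepsakes_kept": True},
        "throw everything out":  {"objects": 0,  "keepsakes_kept": False},
    }

    # The literal specification extracted from the instruction:
    score = lambda outcome: -outcome["objects"]

    best = max(rooms, key=lambda plan: score(rooms[plan]))
    print(best)   # "throw everything out": the literal optimum, not the intent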

5 of 13

Bad specifications: reward functions
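
A minimal sketch of reward misspecification (the toy track, token square, and dynamics are assumptions for illustration): the designer intends "finish the race" but rewards a proxy, collecting tokens, and the reward-maximizing policy exploits the proxy instead of finishing.

    # Intended task: reach the finish line (square 9).
    # Specified reward: +1 per token collected -- and the token respawns.
    def episode(policy, steps=50):
        """Run one episode, returning total *specified* reward."""
        pos, total = 0, 0.0
        for _ in range(steps):
            pos = policy(pos)
            if pos == 3:              # token square; token respawns each step
                total += 1.0          # the reward we actually wrote down
            if pos == 9:              # finish line: what we *intended*
                return total
        return total

    race_to_finish = lambda pos: min(pos + 1, 9)   # intended behaviour
    loop_on_token = lambda pos: 3                  # reward hacking

    print("intended racer:", episode(race_to_finish))   # reward 1.0, finishes
    print("reward hacker:", episode(loop_on_token))     # reward 50.0, never finishes

The optimal policy for the specified reward is exactly the behaviour we did not want.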

6 of 13

Bad specifications: human feedback
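
Human feedback is itself a specification. As a sketch of one standard construction, fitting a reward model to pairwise comparisons with the Bradley-Terry model (the "raters only see how helpful it looks" bias below is a hypothetical): if raters can only judge appearances, the learned reward inherits exactly that blind spot.

    # Fit reward weights w so that sigma(w . (a - b)) matches "a preferred to b".
    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical bias: raters see only feature 0 ("looks helpful"),
    # never feature 1 ("is actually correct").
    def human_prefers(a, b):
        return a[0] > b[0]

    lr, w = 0.1, np.zeros(2)
    for _ in range(2000):
        a, b = rng.normal(size=2), rng.normal(size=2)
        if not human_prefers(a, b):
            a, b = b, a               # ensure a is the preferred trajectory
        p = 1 / (1 + np.exp(-w @ (a - b)))
        w += lr * (1 - p) * (a - b)   # gradient ascent on log sigma(w . (a - b))

    print("learned reward weights:", w.round(2))
    # Weight on "looks helpful" dominates; "is actually correct" stays near 0:
    # the learned specification captures what raters can see, not what we intend.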

7 of 13

Bad specifications: datasets / loss functions
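
A dataset plus a loss function is also a specification. A minimal sketch with synthetic data (the spurious correlation is assumed for illustration): when a spurious feature predicts the training labels more cleanly than the intended one, minimizing the loss latches onto it, and accuracy collapses once the correlation breaks.

    import numpy as np

    rng = np.random.default_rng(0)

    def make_data(n, correlated=True):
        y = rng.integers(0, 2, n)
        s = 2 * y - 1                                  # label as a +/-1 signal
        obj = s + 0.5 * rng.normal(size=n)             # intended feature (noisy)
        bg_sig = s if correlated else 2 * rng.integers(0, 2, n) - 1
        bg = bg_sig + 0.05 * rng.normal(size=n)        # spurious feature (clean)
        return np.stack([obj, bg], axis=1), y

    def train_logreg(X, y, steps=2000, lr=0.1):
        w = np.zeros(X.shape[1])
        for _ in range(steps):
            p = 1 / (1 + np.exp(-(X @ w)))
            w += lr * X.T @ (y - p) / len(y)           # logistic-loss gradient step
        return w

    X, y = make_data(2000)
    w = train_logreg(X, y)              # the loss prefers the clean spurious feature

    X_ood, y_ood = make_data(2000, correlated=False)   # correlation broken at test time
    acc = (((X_ood @ w) > 0).astype(int) == y_ood).mean()
    print("weights:", w.round(2), "off-distribution accuracy:", round(acc, 2))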

8 of 13

Research question

How can we make it easier to specify the task we would like our AI system to perform?

9 of 13

Specification problem + superintelligent AI

Let’s imagine the following scenario:

  • We develop smarter-than-human AI systems,
  • We use these AI systems to solve broad, real-world tasks, and
  • We still have the specification problem.

What happens?

10 of 13

A hypothetical: the paperclip maximizer

Let’s say we fail at the specification problem really badly:

“Make as many paperclips as possible.”

How bad can this be?

11 of 13

Convergent instrumental subgoals

For most goals an AI system could have, if those goals differ from the goals humans have, the system has an incentive to:

  • Stay “alive”
  • Prevent its designers from changing its goals
  • Gain resources and power
  • Trick humans into thinking the AI system is helping them
  • Overpower humans so that the AI system can accomplish its goals

12 of 13

From now to then

There are issues with this story:

  • How did the AI system learn to be deceptive?
  • Why didn’t we learn from earlier, slightly-less-intelligent AI systems?
  • Why doesn’t the “superintelligent” AI have common sense?

But none of these objections addresses the core underlying concern, the specification problem, so I’m still worried.

13 of 13

Takeaway

Currently, we don’t know how to give specifications such that AI systems do what we intend them to do. This could be a major problem in the long term.