The Problem of AI Alignment
Rohin Shah
Center for Human-Compatible AI
AI research: figure out how to meet a specification
Specifications aren’t infallible
As your tasks get more complex, it becomes much more difficult to provide the right specification.
Bad specifications: natural language
Bad specifications: reward functions
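To make this concrete, here is a minimal sketch of a misspecified reward, loosely in the style of the well-known CoastRunners boat-racing incident (the state fields and function names are my own illustration, not from the talk): the designer intends "finish the race," but the reward pays for score, so maximizing it comes apart from the intended task.

```python
from dataclasses import dataclass

@dataclass
class State:
    score_gained: float      # points collected on this step
    distance_to_goal: float  # how far the agent is from actually finishing

# What the designer wrote: pay for score, a proxy for "doing well".
def proxy_reward(state: State) -> float:
    return state.score_gained

# What the designer meant: make progress toward completing the task.
def intended_reward(state: State) -> float:
    return -state.distance_to_goal

# An agent maximizing proxy_reward can collect respawning score targets
# in a loop forever, scoring highly while distance_to_goal never shrinks.
```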
Bad specifications: human feedback
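For illustration (my sketch, not from the talk): the standard Bradley-Terry loss used to learn a reward model from pairwise human comparisons, as in RLHF. What it actually specifies is "outputs the evaluator prefers," which comes apart from "outputs that are actually good" whenever the evaluator can be misled.

```python
import torch
import torch.nn.functional as F

def preference_loss(r_preferred: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Model P(preferred beats rejected) = sigmoid(r_preferred - r_rejected)
    # and maximize the log-likelihood of the human's recorded choices.
    return -F.logsigmoid(r_preferred - r_rejected).mean()

# The learned reward captures what the rater approved of, not what the
# rater wanted; a policy optimized against it is rewarded for looking
# good to the rater, including by fooling them.
```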
Bad specifications: datasets / loss functions
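A runnable toy (my construction, not from the talk) showing how a dataset can under-specify a task: during training, a spurious feature matches the label perfectly, so minimizing the loss rewards the shortcut, and accuracy drops once the correlation breaks at test time.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
signal = rng.integers(0, 2, n)        # the feature we intend the model to use
background = signal.copy()            # spurious feature: identical to the label in training
X_train = np.column_stack([signal + rng.normal(0, 0.5, n), background])

model = LogisticRegression().fit(X_train, signal)
print("train accuracy:", model.score(X_train, signal))  # ~1.0: the shortcut works here

# At test time the spurious correlation breaks: background is random.
background_test = rng.integers(0, 2, n)
X_test = np.column_stack([signal + rng.normal(0, 0.5, n), background_test])
print("test accuracy:", model.score(X_test, signal))    # markedly lower: the shortcut broke
```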
Research question
How can we make it easier to specify the task we would like our AI system to perform?
Specification problem + superintelligent AI
Let’s imagine the following scenario: we build a superintelligent AI system and hand it a specification to pursue.
What happens?
A hypothetical: the paperclip maximizer
Let’s say we fail at the specification problem really badly:
“Make as many paperclips as possible.”
How bad can this be?
Convergent instrumental subgoals
For most goals that an AI system could have, if they are not the same as the goals that humans have, there is an incentive to:
- Preserve itself (it can’t achieve its goal if it’s switched off)
- Preserve its goal (it can’t achieve its goal if the goal gets changed)
- Acquire resources and improve its own capabilities
From now to then
There are issues with this story, but none of them seems to address the core underlying concern, the specification problem, so I’m still worried.
Takeaway
Currently, we don’t know how to give specifications such that AI systems do what we intend them to do. This could be a major problem in the long term.