The Importance of Threat Models for AI Alignment
Rohin Shah
This presentation reflects my personal views, not those of DeepMind.
A note on terminology
AGI = Powerful AI system
This is deliberately vague about the level of power
Decomposing threat models
A threat model is the combination of a development model, which says how we get AGI, and a risk model, which says how AGI leads to existential catastrophe.
[Diagram: development model — gradient descent on training data → AGI; risk model — AGI → doom]
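To make the development model concrete, here is a minimal sketch of gradient descent on training data; the toy linear model, loss, and learning rate are illustrative assumptions, not details from the talk:

```python
# Minimal sketch of gradient descent on training data (illustrative only).
# Fit a line y = w * x to toy data by repeatedly stepping down the loss gradient.

def train(data, lr=0.01, steps=1000):
    w = 0.0
    for _ in range(steps):
        # Gradient of mean squared error (w*x - y)^2 with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w = train(data)  # converges toward w ≈ 2
```

The point of the sketch is only that the development model is an optimization process over data, with no explicit specification of goals.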
Key claims
Threat model: utility maximizer
[Diagram: development model — ?? → AGI; risk model — AGI as utility maximizer → doom]
Plan
Threat model: utility maximizer
How likely is it that AGI is well-described as a utility maximizer?
What solutions could avert risk from utility maximizers?
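As a concrete anchor for "utility maximizer", here is a hedged sketch; the actions and utility values are invented for illustration and are not claims from the talk:

```python
# Illustrative sketch of a utility maximizer: the agent picks whichever
# action scores highest under its utility function, with no other constraints.

def best_action(actions, utility):
    return max(actions, key=utility)

# Hypothetical utility: paperclips produced, regardless of side effects.
actions = {
    "make_10_paperclips_safely": 10,
    "convert_factory_to_paperclips": 1_000_000,
}
choice = best_action(actions, lambda a: actions[a])
# The maximizer selects the highest-utility action even if humans would object.
```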
Threat model: deep RL
[Diagram: development model — deep RL → AGI; risk model — ?? → doom]
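As a concrete anchor for the deep RL development model, here is a minimal tabular Q-learning sketch; the toy chain environment and hyperparameters are invented for illustration, and real deep RL would replace the table with a neural network:

```python
import random

# Minimal Q-learning sketch on a toy chain environment (illustrative only).
# States 0..4; actions move left (-1) or right (+1); reward 1 for reaching state 4.

def q_learning(episodes=500, alpha=0.5, gamma=0.9, eps=0.3, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(5) for a in (-1, 1)}
    for _ in range(episodes):
        s = 0
        while s != 4:
            # Epsilon-greedy action selection.
            if rng.random() < eps:
                a = rng.choice((-1, 1))
            else:
                a = max((-1, 1), key=lambda b: q[(s, b)])
            s2 = min(max(s + a, 0), 4)
            r = 1.0 if s2 == 4 else 0.0
            # Standard Q-learning update toward the bootstrapped target.
            q[(s, a)] += alpha * (r + gamma * max(q[(s2, b)] for b in (-1, 1)) - q[(s, a)])
            s = s2
    return q

q = q_learning()
# After training, moving right should look better than moving left from state 0.
```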
Threat model: language
[Diagram: development model — language modeling → AGI; risk model — ?? → doom]
The chef cooked the ___?___
Once ___?___
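The fill-in-the-blank prompts above are instances of next-token prediction, the objective large language models are trained on. A minimal count-based sketch, with a toy corpus invented for illustration:

```python
from collections import Counter

# Minimal sketch of next-word prediction from corpus statistics (illustrative).
# Large language models learn this mapping at scale with a neural network.

corpus = ("the chef cooked the meal . the chef cooked the dinner . "
          "the chef cooked the meal").split()

def predict_next(context):
    """Return the most frequent word following `context` in the corpus."""
    n = len(context)
    followers = Counter(
        corpus[i + n]
        for i in range(len(corpus) - n)
        if corpus[i:i + n] == list(context)
    )
    return followers.most_common(1)[0][0]

predict_next(("cooked", "the"))  # → "meal" (most frequent continuation)
```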
Threat model: deep RL / language
How likely is it that AGI arises from deep RL / language?
What solutions could avert risk from neural networks?
My threat model: bad generalization
Something changes about the world (e.g. a pandemic)
The AI system uses its general reasoning to adapt (aka generalize)
While it continues to have high impact, it no longer does what we want
[Diagram: development model — neural networks trained on diverse tasks → AGI; risk model — bad generalization → doom]
Neural networks, trained on diverse tasks, develop general skills
Large language models and deep RL-based AGI are both instances of this
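The risk-model steps above can be illustrated with a toy distribution shift; the pandemic observation and the proxy rule are invented for illustration: a learned proxy that tracked the intended goal during training stops tracking it once the world changes, while the system keeps acting capably.

```python
# Toy illustration of bad generalization under distribution shift.
# Intended goal: recommend going outside only when it is safe to do so.

def learned_policy(observation):
    # Proxy rule generalized from the training distribution (pre-pandemic):
    # "sunny implies go outside".
    return "go outside" if observation["sunny"] else "stay inside"

def intended_behavior(observation):
    # What we actually want once the world has changed.
    safe = observation["sunny"] and not observation["pandemic"]
    return "go outside" if safe else "stay inside"

train_obs = {"sunny": True, "pandemic": False}   # training distribution
shifted_obs = {"sunny": True, "pandemic": True}  # the world changes

# On the training distribution the proxy and the intended behavior agree;
# after the shift they diverge: the system still acts with high impact,
# but no longer does what we want.
```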
My threat model: bad generalization
How likely is it that AGI arises from neural nets trained on diverse data?
What solutions could avert risk from bad generalization?
Takeaways