1 of 12

The Importance of Threat Models for AI Alignment

Rohin Shah

This presentation reflects my personal views, not those of DeepMind.

2 of 12

A note on terminology

AGI = Powerful AI system

This is deliberately vague about the level of power

3 of 12

Decomposing threat models

A threat model is the combination of a development model, which says how we get AGI, and a risk model, which says how that AGI leads to existential catastrophe.

[Diagram: Gradient descent on training data → (development model) → AGI → (risk model) → Doom]

4 of 12

Key claims

  1. Development models are crucial for solutions to apply to real systems.
  2. Risk models are important for finding solutions.
  3. We should be using the same notion of AGI for these two models.
  4. We don’t yet have compelling examples of these models.

5 of 12

Threat model: utility maximizer

[Diagram: ?? → (development model) → AGI → (risk model) → a plan that pursues power, survival, and resources]

6 of 12

Threat model: utility maximizer

How likely is it that AGI is well-described as a utility maximizer?

  • ¯\_(ツ)_/¯ (no development model!)

What solutions could avert risk from utility maximizers?

  • Don’t maximize: instead use mild optimization (sketched below), impact regularization, etc.
  • Have the right utility: learn the goal from human behavior
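
A minimal sketch of the "don't maximize" bullet, assuming a toy setting where we can score candidate plans with a learned utility estimate. Quantilization (sampling from the top fraction of options instead of taking the argmax) is one common form of mild optimization; the plans, scores, and cutoff below are invented for illustration, and this is a caricature rather than a faithful quantilizer.

```python
import random

def estimated_utility(plan):
    # Stand-in for a learned utility / reward model (scores invented for illustration).
    return {"modest_plan": 1.0, "ambitious_plan": 5.0, "seize_resources": 100.0}[plan]

def maximize(plans):
    # Full maximization: always commits to the extreme, possibly dangerous, plan.
    return max(plans, key=estimated_utility)

def quantilize(plans, q=0.67):
    # Mild optimization: sample uniformly from the top-q fraction of plans
    # ranked by estimated utility, rather than committing to the single best.
    ranked = sorted(plans, key=estimated_utility, reverse=True)
    top = ranked[: max(1, int(len(ranked) * q))]
    return random.choice(top)

plans = ["modest_plan", "ambitious_plan", "seize_resources"]
print(maximize(plans))    # always 'seize_resources'
print(quantilize(plans))  # sometimes settles for a less extreme plan
```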

7 of 12

Threat model: deep RL

[Diagram: deep RL → (development model) → AGI → (risk model) → ??]
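
For concreteness, a toy sketch of the deep RL development model: an agent improving its behavior purely from reward. A tabular Q-learning update on a made-up two-state environment stands in for deep RL here (a neural network would replace the table in the real case); the environment, rewards, and hyperparameters are all invented for illustration.

```python
import random

def step(state, action):
    # Invented toy environment: action 1 in state 0 pays off, everything else doesn't.
    if state == 0 and action == 1:
        return 1, 1.0   # (next_state, reward)
    return 0, 0.0

Q = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}  # value table; deep RL would use a network
alpha, gamma, epsilon = 0.1, 0.9, 0.1

state = 0
for _ in range(5000):
    # Epsilon-greedy action selection from current value estimates.
    if random.random() < epsilon:
        action = random.choice([0, 1])
    else:
        action = max((0, 1), key=lambda a: Q[(state, a)])
    next_state, reward = step(state, action)
    # Standard Q-learning update toward reward plus discounted future value.
    best_next = max(Q[(next_state, a)] for a in (0, 1))
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
    state = next_state

print(Q)  # the agent learns to prefer action 1 in state 0
```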

8 of 12

Threat model: language

[Diagram: language modeling, e.g. predicting "The chef cooked the ___?___" and "Once ___?___" → (development model) → AGI → (risk model) → ??]
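
The blanks above are next-word prediction, the training signal behind the language development model. As a minimal sketch, a toy count-based model over an invented three-sentence corpus can already fill in "The chef cooked the ___"; real systems instead train large neural networks on vast corpora.

```python
from collections import Counter, defaultdict

# Tiny invented corpus; real language models train on enormously more text.
corpus = (
    "the chef cooked the meal . "
    "the chef cooked the soup . "
    "the chef tasted the soup ."
).split()

# Count which word follows each two-word context (a toy trigram model).
counts = defaultdict(Counter)
for w1, w2, w3 in zip(corpus, corpus[1:], corpus[2:]):
    counts[(w1, w2)][w3] += 1

def predict_next(w1, w2):
    # Predict the continuation seen most often in training.
    return counts[(w1, w2)].most_common(1)[0][0]

print(predict_next("cooked", "the"))  # fills in "The chef cooked the ___"
```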

9 of 12

Threat model: deep RL / language

How likely is it that AGI arises from deep RL / language?

  • Both seem plausible

What solutions could avert risk from neural networks?

  • ¯\_(ツ)_/¯ (no risk model!)

10 of 12

My threat model: bad generalization

Development model: Neural networks, trained on diverse tasks, develop general skills. Large language models and deep RL-based AGI are both instances of this.

Risk model: Something changes about the world (e.g. a pandemic) → the AI system uses its general reasoning to adapt (i.e., generalize) → while it continues to have high impact, it no longer does what we want.

11 of 12

My threat model: bad generalization

How likely is it that AGI arises from neural nets trained on diverse data?

  • Debatable, but I think more likely than not

What solutions could avert risk from bad generalization?

  • Training on the right objective
  • Interpretability to ensure the network learned the right concept(s)
  • Adversarial training to look for cases of bad generalization
  • Choosing datasets / architectures that lead to the generalization we want
  • Identifying off-distribution scenarios and reverting to a safe baseline policy
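
A minimal sketch of the last bullet, assuming the training distribution can be summarized by simple per-feature statistics; the observations, threshold, and both policies below are invented for illustration, and a real system would need a far better out-of-distribution detector.

```python
import statistics

# Invented training observations (one scalar feature, for illustration only).
train_observations = [0.9, 1.0, 1.1, 1.05, 0.95, 1.02]
mean = statistics.mean(train_observations)
std = statistics.stdev(train_observations)

def learned_policy(obs):
    return "act on the learned objective"   # stand-in for the trained policy

def safe_baseline_policy(obs):
    return "pause and defer to humans"      # stand-in for a conservative fallback

def act(obs, z_threshold=3.0):
    # Revert to the safe baseline when the observation looks off-distribution.
    z = abs(obs - mean) / std
    return safe_baseline_policy(obs) if z > z_threshold else learned_policy(obs)

print(act(1.03))  # in-distribution: use the learned policy
print(act(7.0))   # the world changed: fall back to the safe baseline
```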

12 of 12

Takeaways

  • Development models are crucial for solutions to apply to real systems.
  • Risk models are important for finding solutions.
  • We should be using the same notion of AGI for these two models.
  • We don’t yet have compelling examples of these models.