Scheming AIs:
Will AIs fake alignment during training in order to get power?
Joe Carlsmith
Open Philanthropy
February 2024
Preliminaries
Four types of deceptive AIs
Three non-scheming model classes that do well in training
Image from Hilton et al. (2020), “Understanding RL Vision,” CC BY 4.0
Diagram of the taxonomy
Why focus on scheming?
What’s required for scheming?
Why would models choose to scheme?
Arguments for scheming that focus on the path SGD takes
Arguments for scheming that ignore the path SGD takes
Some arguments against scheming
Overall probability
“If you train a coherently goal-directed, situationally aware, somewhat-better-than-human-level model using baseline forms of self-supervised pre-training + RLHF on diverse, long-horizon, real-world tasks, my subjective probability is ~25% that this model will be performing well in training in substantial part as part of an instrumental strategy for seeking power for itself and/or other AIs later.”
(I’ve updated somewhat downwards on this since writing the report: maybe 20% now.)
Empirical research directions on scheming
Thank you!