Scheming AIs:
Will AIs fake alignment during training in order to get power?
Joe Carlsmith
Open Philanthropy
February 2024
Preliminaries
Four types of deceptive AIs
Three non-scheming model classes that do well in training
Image from Hilton et al. (2020), “Understanding RL Vision,” CC BY 4.0
Diagram of the taxonomy
Why focus on scheming?
What’s required for scheming?
Why would models choose to scheme?
Arguments for scheming that focus on the path SGD takes
Arguments for scheming that ignore the path SGD takes
Some arguments against scheming
Overall probability
“If you train a coherently goal-directed, situationally aware, somewhat-better-than-human-level model using baseline forms of self-supervised pre-training + RLHF on diverse, long-horizon, real-world tasks, my subjective probability is ~25% that this model will be performing well in training in substantial part as part of an instrumental strategy for seeking power for itself and/or other AIs later.”
(I’ve updated somewhat downwards on this since writing the report: maybe 20% now.)
Empirical research directions on scheming
Thank you!