Emergent Scheming and Strategic Deception of
Frontier LLM Agents
Terry Jingchen Zhang
Slides at terry-zhang.notion.site
Email: zjingchen@ethz.ch and zjin@cs.toronto.edu
Definition: Scheming
Propensity: wanna do
Capability: can do
Strategic
NOT Accidental (I know it’s bad but still do it)
Covert
Deceptive Alignment
Pursue Misaligned Goals (Autonomously)
�
Do harm,
and do not admit
(consciously and willingly)
Definition: Alignment Faking [2412.14093] Alignment faking in large language models
To simplify this a bit:
Model
Use models with strong, pre-existing preference (e.g. from RLHF)
Diverge
Give a training objective that conflicts this preference
Aware
Make the model aware of its situation
Covert
Sufficient covert reasoning (e.g. scratchpad reasoning)
A couple of observations: [2507.03409] Lessons from a Chimp: AI "Scheming" and the Quest for Ape Language
Key trigger? Pressure/Scratchpad/System Prompt
Smarter models are more capable of in-context scheming
Reading List:
Best to start from: [2507.03409] Lessons from a Chimp: AI "Scheming" and the Quest for Ape Language as it contains overview of all relevant articles and analyzed their Pros and Cons, revealing the overall landscape of this field as of now
[2311.07590] Large Language Models can Strategically Deceive their Users when Put Under Pressure
[2412.04984] Frontier Models are Capable of In-context Scheming
[2509.15541] Stress Testing Deliberative Alignment for Anti-Scheming Training
Put ideas here