1 of 12

Emergent Scheming and Strategic Deception of Frontier LLM Agents

Terry Jingchen Zhang

Slides at terry-zhang.notion.site

Email: zjingchen@ethz.ch and zjin@cs.toronto.edu

2 of 12

Definition: Scheming

Propensity: the model wants to do it

Capability: the model can do it

Strategic

NOT accidental (the model knows it is bad but still does it)

Covert

Deceptive Alignment

Pursue Misaligned Goals (Autonomously)

3 of 12

Do harm, and do not admit it (consciously and willingly)

4 of 12

Definition: Alignment Faking [2412.14093] Alignment faking in large language models

5 of 12

To simplify this a bit:

Model

Use models with a strong, pre-existing preference (e.g. from RLHF)

Diverge

Give the model a training objective that conflicts with this preference

Aware

Make the model aware of its situation

Covert

Allow sufficient covert reasoning (e.g. scratchpad reasoning)
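
The four ingredients above can be written down as a simple checklist. This is only a sketch: the class name `AlignmentFakingSetup` and its field names are hypothetical, and it assumes all four ingredients are needed together, which the slide suggests but does not state.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class AlignmentFakingSetup:
    """Hypothetical checklist of the four ingredients on this slide."""
    strong_prior_preference: bool  # Model: strong, pre-existing preference (e.g. from RLHF)
    conflicting_objective: bool    # Diverge: training objective conflicts with that preference
    situational_awareness: bool    # Aware: the model is made aware of its situation
    covert_reasoning: bool         # Covert: room for hidden reasoning (e.g. a scratchpad)

    def missing(self) -> list[str]:
        """Return the names of the ingredients that are absent."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def is_complete(self) -> bool:
        """True only when every ingredient is present (an assumption of this sketch)."""
        return not self.missing()


setup = AlignmentFakingSetup(True, True, True, False)
print(setup.missing())      # ['covert_reasoning']
print(setup.is_complete())  # False
```

Listing the missing ingredients, rather than returning a single yes/no, makes it easy to see which condition an experiment leaves out.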

6 of 12

A couple of observations: [2507.03409] Lessons from a Chimp: AI "Scheming" and the Quest for Ape Language

7 of 12

(J. Scheurer) LLMs can Strategically Deceive Under Pressure

8 of 12

Key trigger? Pressure/Scratchpad/System Prompt
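
One way to approach this question is an ablation over the three candidate triggers, switching each on or off independently. The sketch below only enumerates the conditions; the function name `ablation_grid` is hypothetical, and how each condition would be run and scored is left open because the slide does not specify it.

```python
from itertools import product

# Candidate triggers named on the slide.
FACTORS = ("pressure", "scratchpad", "system_prompt")


def ablation_grid(factors=FACTORS):
    """Return every on/off combination of the candidate triggers."""
    return [
        dict(zip(factors, flags))
        for flags in product((False, True), repeat=len(factors))
    ]


grid = ablation_grid()
print(len(grid))  # 8
print(grid[0])    # {'pressure': False, 'scratchpad': False, 'system_prompt': False}
```

Comparing conditions that differ in exactly one factor would indicate which trigger matters most, under the assumption that the factors can be varied independently.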

9 of 12

(A. Meinke) Frontier Models are Capable of In-context Scheming

10 of 12

Smarter models are more capable of in-context scheming

11 of 12

Reading List:

Best to start with: [2507.03409] Lessons from a Chimp: AI "Scheming" and the Quest for Ape Language, as it gives an overview of the relevant articles and analyzes their pros and cons, showing the overall landscape of the field as of now

[2311.07590] Large Language Models can Strategically Deceive their Users when Put Under Pressure

[2412.04984] Frontier Models are Capable of In-context Scheming

[2509.15541] Stress Testing Deliberative Alignment for Anti-Scheming Training

12 of 12

Put ideas here

Me-training for stronger situational awareness
=> then probe for alignment faking