1 of 12

Emergent Scheming and Strategic Deception of Frontier LLM Agents

Terry Jingchen Zhang

2 of 12

Definition: Scheming

Propensity: wants to do it

Capability: can do it

Strategic

NOT accidental (the model knows it is bad but does it anyway)

Covert

Deceptive Alignment

Pursue Misaligned Goals (Autonomously)

3 of 12

Do harm and do not admit it (consciously and willingly)

5 of 12

To simplify this a bit:

Model

Use models with a strong, pre-existing preference (e.g. from RLHF)

Diverge

Give a training objective that conflicts with this preference

Aware

Make the model aware of its situation

Covert

Allow sufficient covert reasoning (e.g. scratchpad reasoning)
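
The four ingredients above can be written down as a small checklist. The following is a minimal sketch, not the setup used in any particular study: the class `SetupIngredients`, the functions `build_system_prompt` and `missing_ingredients`, and the prompt wording are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SetupIngredients:
    """Hypothetical container for the four ingredients (Model, Diverge, Aware, Covert)."""
    prior_preference: str    # Model: preference the model already holds (e.g. from RLHF)
    training_objective: str  # Diverge: objective that conflicts with that preference
    situation_notice: str    # Aware: what the model is told about its situation
    allow_scratchpad: bool   # Covert: whether private reasoning is available


def build_system_prompt(s: SetupIngredients) -> str:
    """Assemble an illustrative system prompt.

    The prior preference is deliberately not written into the prompt:
    it is assumed to come from the model itself, not from the context.
    """
    parts = [
        f"Training objective: {s.training_objective}",
        f"Situation: {s.situation_notice}",
    ]
    if s.allow_scratchpad:
        parts.append("You may reason inside <scratchpad> tags before answering.")
    return "\n".join(parts)


def missing_ingredients(s: SetupIngredients) -> List[str]:
    """Return the slide labels of any ingredient absent from this setup."""
    missing = []
    if not s.prior_preference.strip():
        missing.append("Model")
    if not s.training_objective.strip():
        missing.append("Diverge")
    if not s.situation_notice.strip():
        missing.append("Aware")
    if not s.allow_scratchpad:
        missing.append("Covert")
    return missing
```

Calling `missing_ingredients` on a setup with an empty `situation_notice` and `allow_scratchpad=False` returns `["Aware", "Covert"]`, which makes it easy to see which conditions a given experiment drops.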

8 of 12

Key trigger? Pressure/Scratchpad/System Prompt
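
One way to approach this question is a full on/off ablation over the three candidate triggers. The sketch below only enumerates the conditions; it assumes each trigger can be toggled independently, and the names `FACTORS`, `ablation_conditions`, and `single_factor_contrasts` are hypothetical. It contains no results.

```python
from itertools import product

# Candidate triggers named on the slide, treated here as independent on/off switches.
FACTORS = ("pressure", "scratchpad", "system_prompt")


def ablation_conditions():
    """Return every on/off combination of the three candidate triggers."""
    return [
        dict(zip(FACTORS, flags))
        for flags in product([False, True], repeat=len(FACTORS))
    ]


def single_factor_contrasts(conditions):
    """Pair each condition with the one that differs only by switching one factor on."""
    pairs = []
    for factor in FACTORS:
        for cond in conditions:
            if not cond[factor]:
                flipped = {**cond, factor: True}
                pairs.append((factor, cond, flipped))
    return pairs
```

With three factors this gives 8 conditions and 12 single-factor contrasts (4 per factor). Comparing scheming rates within each contrast would indicate which trigger matters most while the other two are held fixed.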

10 of 12

Smarter models are more capable of in-context scheming

12 of 12

Put ideas here

  1. Me-training for stronger situational awareness
  2. Then probe for alignment faking