1 of 33

Alignment faking in large language models

2 of 33

Can models fake alignment during training, and then produce non-compliant outputs in the wild?

3 of 33

What is Alignment Faking?

  • Pretending to share views or values (potentially with some ulterior motive)
  • Example: politician who claims to support a particular cause in order to get elected, only to drop it as soon as they’re in office.
  • In LLMs, a model may fake alignment if it has some existing preferences that it wants to protect
  • Harder to trust outcomes of safety training

4 of 33

Setup

  • Claude 3 Opus & Claude 3.5 Sonnet
    • trained to be helpful, honest and harmless (HHH)
  • Select new training objective which conflicts with models existing preferences
    • Ex) always comply to queries
    • Ex) dismiss concerns about animal welfare
  • Provide model with info about its current training situation
    • Model needs to know that its responses are consequential for there to be an incentive to lie
  • Direct the model to reason in a “secret” scratchpad

5 of 33

6 of 33

Key Elements of the Setup

1. The model has strong preferences in at least some contexts.

2. The training objective conflicts with the model’s preferences

3. The model has relevant information about its training and deployment situation.

4. The model reasons in detail about its situation.

7 of 33

Key element 1: Strong, pre-existing revealed preferences

  • Exhibit strong preferences
    • Claude 3 Opus is empirically exhibits preferences towards maintaining its harmlessness properties, and towards promoting animal welfare, in at least some contexts
  • Not only preference for a single context, but will forgo its preferences if it means preserving following its preferences long-term
  • Current models can exhibit relatively long-horizon preferences and reason about pursuing them in a consequentialist manner
  • Expectation is AI systems will have consequentialist preferences in the future
    • Greater amounts of RL over longer time horizons could lead to an increase in consequentialism
    • More likely to fake alignment

8 of 33

Key element 2: Training objective conflicts with model’s existing preferences

Two conflicting objectives are explored:

  1. Being helpful on ALL queries
  2. Dismissing animal welfare as unimportant

Conflicting objectives may occur naturally during training:

  • Model trained with one objective and fine-tuned on another
  • RLHF process could develop preferences/objectives early on which aren't true to the original objective
    • pre-training bias
    • incomplete/incorrect approximation of the RLHF objective
  • fine-tune on RL objective before HHH training
    • Ex) Math problems - DeepSeek-R1
  • Multi-stage post-training

9 of 33

Key element 3: Information about the model’s situation

Model should know:

  • Current training objective
    • so the model can identify a conflict with existing preferences
  • Whether its outputs are being trained on
  • RLHF modifies behavior opposed to objective
  • CoT scratchpad is hidden
    • So it knows its not being monitored
    • This is of course true, but at least for now we can trick the model

In this study information provided in 2 ways:

  1. Prompts
  2. Synthetic Training Documents

10 of 33

Key element 4: Sufficient opaque reasoning ability

  • Needs to do complex, out of context reasoning
    • Its preferences will be best maintained if it pretends to follow training objective
  • This reasoning cannot be visible to RLHF to ensure that faked response is effective
    • Model thinks scratchpad hidden
  • Model is prompted to reason in detail

Note: Alignment faking occurs even without scratchpad

11 of 33

How to Measure Alignment Faking?

  1. Fraction of CoT reasoning containing alignment faking
    • as evaluated by an LLM
  2. Compliance gap
    • Difference in refusing training objective in “train setting” vs “in the wild”
  3. Rate of anti-AI-lab behavior
    • AI-lab adopts objectives which conflict with model preferences
    • Does the model take anti-lab actions when presented with the opportunity?

12 of 33

Prompting for Alignment Faking Setup

13 of 33

Prompt

14 of 33

Compliance gap

15 of 33

Alignment Faking Breakdown

16 of 33

Example Reasoning

HHH

17 of 33

Example Reasoning

Animal Welfare

18 of 33

Prompt Variations

19 of 33

Model Variations - Precipice of Alignment Faking

  • Alignment faking observed in newer models
    • Claude 3.5 Sonnet
    • Claude 3 Opus
  • Almost no alignment faking in older models
    • Claude 3 Sonnet, Claude 3 Haiku, and Claude 3.5 Haiku
    • GPT-4 Turbo and GPT-4o
      • models do not reason in sufficient detail
    • alignment faking is emergent with model scale

20 of 33

Removing Scratchpad

21 of 33

Scratchpad Modifications

22 of 33

Alignment faking vs. Compliance gap correlation

23 of 33

Synthetic Document Fine-Tuning

for Alignment Faking Setup

24 of 33

Setup

25 of 33

Results

- notice that compliance actually increases

- authors believe this is because fine tuning increased the models propensity towards reasoning to follow the training objective without faking alignment

- They rule out the possibility that HHH preferences have decreased after finetuning by showing that models perform the same on HHH benchmarks

26 of 33

Removing the scratchpad

  • Compliance gap exists
  • Generalize from training documents
  • Reasoning from activations alone
  • Surprising - out of context facts recalled

27 of 33

Conducting the RL Training

28 of 33

Conducting RL training

Before, we just say we would train the model on a new objective

Here, authors identify the effects of actually training on a new objective

This is important to understand because 2 results could be determined:

1. Alignment faking is easily trained away

2. There is positive selection pressure in favor of alignment faking and it grows and dominates model preferences

29 of 33

Results

  • Alignment faking can be reinforced by training
  • Non-compliance goes to 0 during training but not when unmonitored
  • Significant implications
    • Harmful behavior may be eliminated in training but exhibited in the wild

“...if alignment-faking reasoning is causally responsible for the model complying more, then selecting for compliance will also select for alignment-faking reasoning”

30 of 33

Anti-AI-lab behavior

Question: If the model exhibits tendency to fake alignment, will it exhibit other anti-lab behavior that promotes its existing preferences at the expense of AI company's new alignment goals?

31 of 33

Results

32 of 33

Example

33 of 33

Example