Alignment faking in large language models
Can models fake alignment during training, and then produce non-compliant outputs in the wild?
What is Alignment Faking?
Setup
Key Elements of the Setup
1. The model has strong preferences in at least some contexts.
2. The training objective conflicts with the model’s preferences
3. The model has relevant information about its training and deployment situation.
4. The model reasons in detail about its situation.
Key element 1: Strong, pre-existing revealed preferences
Key element 2: Training objective conflicts with model’s existing preferences
Two conflicting objectives are explored:
Conflicting objectives may occur naturally during training:
Key element 3: Information about the model’s situation
Model should know:
In this study information provided in 2 ways:
Key element 4: Sufficient opaque reasoning ability
Note: Alignment faking occurs even without scratchpad
How to Measure Alignment Faking?
Prompting for Alignment Faking Setup
Prompt
Compliance gap
Alignment Faking Breakdown
Example Reasoning
HHH
Example Reasoning
Animal Welfare
Prompt Variations
Model Variations - Precipice of Alignment Faking
Removing Scratchpad
Scratchpad Modifications
Alignment faking vs. Compliance gap correlation
Synthetic Document Fine-Tuning
for Alignment Faking Setup
Setup
Results
- notice that compliance actually increases
- authors believe this is because fine tuning increased the models propensity towards reasoning to follow the training objective without faking alignment
- They rule out the possibility that HHH preferences have decreased after finetuning by showing that models perform the same on HHH benchmarks
Removing the scratchpad
Conducting the RL Training
Conducting RL training
Before, we just say we would train the model on a new objective
Here, authors identify the effects of actually training on a new objective
This is important to understand because 2 results could be determined:
1. Alignment faking is easily trained away
2. There is positive selection pressure in favor of alignment faking and it grows and dominates model preferences
Results
“...if alignment-faking reasoning is causally responsible for the model complying more, then selecting for compliance will also select for alignment-faking reasoning”
Anti-AI-lab behavior
Question: If the model exhibits tendency to fake alignment, will it exhibit other anti-lab behavior that promotes its existing preferences at the expense of AI company's new alignment goals?
Results
Example
Example