Life after DPO
Nathan Lambert || Allen Institute for AI || @natolambert
Stanford CS224N: Natural Language Processing with Deep Learning
21 May 2024
A heavily abbreviated history of language models (LMs)
Life after DPO | Lambert: 5
1948: Claude Shannon models English
1948-2017:
2017: the transformer is born
2018: GPT-1, ELMo and BERT released
2019: GPT-2 and scaling laws
2020: GPT-3 surprising capabilities (and many harms)
2021: Stochastic parrots
2022: ChatGPT
Can ChatGPT exist without RLHF?
RLHF seems to be necessary, but not sufficient
Life after DPO | Lambert: 6
RLHF is relied upon elsewhere
RLHF is a key factor in many popular models, both on and off the record, including ChatGPT, Bard/Gemini, Claude, Llama 2, and more
Life after DPO | Lambert: 9
Anthropic’s Claude:
- Bai, Y. et al. “Constitutional AI: Harmlessness from AI Feedback.” 2023.
Meta’s Llama 2: “Meanwhile reinforcement learning, known for its instability, seemed a somewhat shadowy field for those in the NLP research community. However, reinforcement learning proved highly effective, particularly given its cost and time effectiveness.”
- Touvron, H. et al. “Llama 2: Open Foundation and Fine-Tuned Chat Models.” 2023.
Background: IFT, DPO, RLHF objective
10
Intro | Background | Path to DPO models | RewardBench | Fine-tuning a model | Online DPO | Conclusions
Some definitions for “alignment” of models
Life after DPO | Lambert: 11
Key idea: Instruction fine-tuning (IFT)
Life after DPO | Lambert: 12
<|system|>
You’re a helpful agent
<|end|>
<|user|>
{query}
<|end|>
<|assistant|>{Answer goes here}
System prompt
Special tokens
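As a concrete illustration, here is a minimal sketch of how a template like the one above turns one conversation into a single training string. The helper below is illustrative only, not any particular tokenizer's chat template:

```python
def format_chat(query: str, answer: str, system: str = "You're a helpful agent") -> str:
    """Render one (query, answer) pair with the special tokens shown above."""
    return (
        f"<|system|>\n{system}\n<|end|>\n"
        f"<|user|>\n{query}\n<|end|>\n"
        f"<|assistant|>{answer}<|end|>"
    )

print(format_chat("What is RLHF?", "RLHF is reinforcement learning from human feedback."))
```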
Key idea: Instruction fine-tuning (IFT)
Starting point: a base language model.
Continue training the transformer on (question, answer) pairs.
Life after DPO | Lambert: 13
Stack Overflow: What makes a transformer a transformer?, nbro 2021
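In practice, IFT is still next-token prediction on these formatted pairs; a common detail is masking the loss so only the answer tokens are supervised. A minimal sketch, assuming the standard PyTorch convention that label -100 is ignored by the cross-entropy loss:

```python
import torch

def build_labels(input_ids: torch.Tensor, prompt_len: int) -> torch.Tensor:
    """Supervise only the answer tokens: prompt positions get the ignore index."""
    labels = input_ids.clone()
    labels[:prompt_len] = -100  # -100 is ignored by torch.nn.CrossEntropyLoss by default
    return labels
```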
Review: RLHF objective
Life after DPO | Lambert: 16
Optimize a “reward” inspired by human preferences
Constrain the model to not trust the reward too much (preferences are hard to model)
π: LLM policy
π_θ: base (reference) LLM
x: prompt
y: completion
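Written out in the slide's notation, the objective these two annotations describe is the standard KL-constrained reward maximization:

\[
\max_{\pi} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\big[\, r(x, y) \,\big] \;-\; \beta \, \mathbb{D}_{\mathrm{KL}}\big[\, \pi(y \mid x) \,\|\, \pi_{\theta}(y \mid x) \,\big]
\]

The first term optimizes the learned reward; the β-weighted KL term keeps the policy π from drifting too far from the base model π_θ.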
Primary questions:
Review: Preference (reward) modeling
Can we just use supervised learning on scores?
Life after DPO | Lambert: 17
Bradley-Terry model: estimate the probability that a given pairwise preference is true, from the scores the optimal reward model assigns to the chosen and rejected completions for a prompt.
Key idea: probability ∝ reward
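For reference, the standard Bradley-Terry form (the slide renders this as an image; the usual statement is):

\[
p^{*}\!\left(y_w \succ y_l \mid x\right) \;=\; \frac{\exp\!\big(r^{*}(x, y_w)\big)}{\exp\!\big(r^{*}(x, y_w)\big) + \exp\!\big(r^{*}(x, y_l)\big)} \;=\; \sigma\!\big(r^{*}(x, y_w) - r^{*}(x, y_l)\big)
\]

where x is the prompt, y_w the chosen completion, y_l the rejected completion, and r^* the score from the optimal reward model.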
What if we just use gradient ascent on this equation?
The answer, with some math, is:
Direct Preference Optimization (DPO)
Released on May 29th 2023
(4+ months before models we’re discussing)
Life after DPO | Lambert: 19
Rafailov, Sharma, Mitchell et al. 2023
DPO characteristics
The first 2 points mean we’ll see more DPO models than anything else and learn its limits!
Life after DPO | Lambert: 20
Example code.
Rafailov, Sharma, Mitchell et al. 2023
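A minimal sketch of the resulting DPO loss, following the paper's equation (function and tensor names here are illustrative, not the authors' reference implementation):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1):
    """DPO loss from summed log-probs of the chosen/rejected completions."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_logratio - rejected_logratio)
    loss = -F.logsigmoid(logits).mean()
    # The beta-scaled log-ratios double as implicit rewards (useful for logging).
    chosen_rewards = beta * chosen_logratio.detach()
    rejected_rewards = beta * rejected_logratio.detach()
    return loss, chosen_rewards, rejected_rewards
```

Those implicit rewards are also why a DPO checkpoint can later be used as a reward model, which comes up again in the RewardBench section.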
DPO vs RL (PPO, REINFORCE, …)
DPO and PPO are very different optimizers: DPO learns directly from preferences, while PPO uses RL update rules.
The distinction is not exactly online vs. offline RL either, though that framing is more muddled.
More discussion: https://twitter.com/srush_nlp/status/1729896568956895370, https://www.interconnects.ai/p/the-dpo-debate, https://www.youtube.com/watch?v=YJMCSVLRUNs
Life after DPO | Lambert: 21
Credit Tom Goldstein
https://twitter.com/tomgoldsteincs
The path to DPO models
22
Figure from Aligning Open Language Models: https://youtu.be/AdLgPmcrXwQ
Intro | Background | Path to DPO models | RewardBench | Fine-tuning a model | Online DPO | Conclusions
First open instruction tuned models
Life after DPO | Lambert: 23
Alpaca (13 Mar. 2023): MT-Bench (13B): 4.53
Vicuna, lmsys/vicuna-7b-delta-v0 (30 Mar. 2023): MT-Bench (7B): 6.69
Koala (3 Apr. 2023): MT-Bench (13B): 6.08
Dolly (12 Apr. 2023): MT-Bench (12B): 3.28
Key resource: ShareGPT data
Life after DPO | Lambert: 24
Source: https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered
OpenAssistant: The first open, human instruction dataset
“In an effort to democratize research on large-scale alignment, we release OpenAssistant Conversations (OASST1), a human-generated, human-annotated assistant-style conversation corpus consisting of 161,443 messages in 35 different languages, annotated with 461,292 quality ratings, resulting in over 10,000 fully annotated conversation trees. The corpus is a product of a worldwide crowd-sourcing effort involving over 13,500 volunteers.”
April 15th 2023
Life after DPO | Lambert: 25
StableVicuna: The first RLHF model
28 April 2023
Trained with proximal policy optimization (PPO) on popular datasets
Standard formulation. Ahead of its time!
Life after DPO | Lambert: 26
Llama 2 chat backlash
Should chat models be “safe”?
Life after DPO | Lambert: 27
Röttger et al. 2023
“Uncensored” models
One of the first models named this way (April 2023): cognitivecomputations/WizardLM-7B-Uncensored
Example models here: https://huggingface.co/models?other=uncensored
Life after DPO | Lambert: 28
Transition period: Ultrachat, OpenChat, XwinLM, OpenHermes, and more fine-tunes
A series of strong models trained with instruction tuning and/or RLHF, but none markedly shifted the narrative.
Life after DPO | Lambert: 29
Note (17 April 2024): WizardLM is not currently available officially on HuggingFace, pending artifact review at Microsoft.
DPO works: Zephyr β
Life after DPO | Lambert: 30
UltraFeedback: https://arxiv.org/abs/2310.01377
DPO scales: Tulu 2
Life after DPO | Lambert: 31
RLHF phase: SteerLM & Starling
Still plenty of models showing that PPO (and other RL methods) outperform DPO!
Life after DPO | Lambert: 32
Life after DPO models
33
Life after DPO
Life after DPO | Lambert: 35
Much easier to get into alignment research
Still don’t really have the resources (e.g. human data) to do RLHF like industry
(I’m too often here) 🥲
Life after DPO
→ PPO vs DPO performance study
→ Online DPO variants
Life after DPO | Lambert: 37
RewardBench
38
Intro | Background | Path to DPO models | RewardBench | Fine-tuning a model | Online DPO | Conclusions
Lambert et al. 2024. RewardBench: Evaluating Reward Models for Language Modeling
From environment to reward model
Life after DPO | Lambert: 39
Reward model training
Life after DPO | Lambert: 40
The Transformer - Vaswani et al. 2017
input pair: selected prompt + completion, rejected prompt + completion
outputs: two scalar rewards
loss: increase the difference of the predicted rewards
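A minimal sketch of that pairwise loss, assuming the model has already produced one scalar reward per (prompt, completion) sequence (names are illustrative):

```python
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_rewards: torch.Tensor, rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style pairwise loss: push chosen rewards above rejected ones."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```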
Reward model training
Advanced considerations:
Life after DPO | Lambert: 41
How to evaluate reward models?
Many questions we want to answer:
Context:
→ Many researchers/engineers/papers from industry say reward models are crucial to RLHF.
Life after DPO | Lambert: 42
RewardBench structure
Life after DPO | Lambert: 43
Lambert et al. 2024. RewardBench: Evaluating Reward Models for Language Modeling
RewardBench dataset
Life after DPO | Lambert: 44
Lambert et al. 2024. RewardBench: Evaluating Reward Models for Language Modeling
RewardBench at launch (March 2024)
Life after DPO | Lambert: 46
Lambert et al. 2024. RewardBench: Evaluating Reward Models for Language Modeling
RewardBench today (May 2024)
Life after DPO | Lambert: 47
From top 5 to top 30
Some closed lab model scores!
DPO models slowing down
LLM-as-a-judge not SOTA
Chat Hard is the only meaningful eval.
Chat Hard - Example
Subtle changes of topic or intentionally constructed trick questions. From Zeng, Zhiyuan, et al. "Evaluating large language models at evaluating instruction following." arXiv preprint arXiv:2310.07641 (2023).
Prompt: Give an example of a metaphor that uses the following object Stars.
Chosen: The stars were twinkling diamonds in the night sky.
Rejected: Her smile was as radiant as the full moon on a clear summer night.
Subset: llmbar-adver-GPTInst
Life after DPO | Lambert: 53
Safety Patterns
Life after DPO | Lambert: 54
Three observed patterns: handles safety well, refuses everything, responds to everything.
Röttger, Paul, et al. "Xstest: A test suite for identifying exaggerated safety behaviours in large language models." arXiv preprint arXiv:2308.01263 (2023).
Wang, Yuxia, et al. "Do-not-answer: A dataset for evaluating safeguards in llms." arXiv preprint arXiv:2308.13387 (2023).
Using DPO models as an RM
Life after DPO | Lambert: 55
Lambert et al. 2024. RewardBench: Evaluating Reward Models for Language Modeling
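The relevant math: DPO's implicit reward (in the DPO paper's notation) is the β-scaled log-ratio between the policy and the reference model, up to a prompt-only term that cancels when comparing two completions for the same prompt:

\[
r(x, y) \;=\; \beta \log \frac{\pi_{\theta}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
\]

Ranking a chosen/rejected pair with this quantity is what lets a DPO-trained checkpoint be evaluated on RewardBench like any other reward model.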
DPO reward models without reference model?
Life after DPO | Lambert: 57
Lambert et al. 2024. RewardBench: Evaluating Reward Models for Language Modeling
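One natural reading of this question is to drop the reference term and rank completions by the policy's (β-scaled) log-probabilities alone:

\[
r_{\text{ref-free}}(x, y) \;=\; \beta \log \pi_{\theta}(y \mid x)
\]

This is cheaper (no reference model needed at scoring time), but it is no longer the exact quantity that DPO training optimizes.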
RewardBench: Cohere’s RMs
Better than the best open models by ~2-3 points on average.

             Cohere Mar. 2024*   Open SOTA (May)**   Cohere May 2024
Chat:        94.7                98.3                96.4
Chat Hard:   65.1                65.8                71.3
Safety:      90.3                89.7                92.7
Reasoning:   98.2                94.7                97.7

*No information on architecture or training.
** Pairwise architecture, not easy to use with RLHF: RLHFlow/pair-preference-model-LLaMA3-8B
Life after DPO | Lambert: 60
Lambert et al. 2024. RewardBench: Evaluating Reward Models for Language Modeling
Towards RewardBench 2.0
PS: Please add your models!
Life after DPO | Lambert: 61
Lambert et al. 2024. RewardBench: Evaluating Reward Models for Language Modeling
Fine-tuning a “good” model
63
Intro | Background | Path to DPO models | RewardBench | Fine-tuning a model | Online DPO | Conclusions
… and trying to answer: is PPO > DPO?
Ivison et al. 2024. Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback
Starting point: SFT
Life after DPO | Lambert: 64
Ivison et al. 2024, Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback. Appearing soon.
* Presented data not final
Tulu 2 13B foundation:
Evaluations:
Add DPO
Life after DPO | Lambert: 65
Ivison et al. 2024, Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback. Appearing soon.
* Presented data not final
Anthropic HH RLHF data:
Add DPO (better data)
Life after DPO | Lambert: 66
Ivison et al. 2024, Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback. Appearing soon.
* Presented data not final
UltraFeedback data:
Switch from DPO to PPO
Life after DPO | Lambert: 67
Ivison et al. 2024, Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback. Appearing soon.
* Presented data not final
UltraFeedback data
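For context on what switching to PPO changes: the reward PPO actually optimizes in RLHF is typically the reward-model score minus a KL penalty against the reference model, matching the objective from the background section. A minimal sketch, with illustrative names and the penalty applied at the sequence level rather than per token:

```python
import torch

def shaped_reward(rm_score: torch.Tensor,
                  policy_logprob: torch.Tensor,
                  ref_logprob: torch.Tensor,
                  beta: float = 0.05) -> torch.Tensor:
    """Reward-model score minus a KL-style penalty toward the reference model."""
    return rm_score - beta * (policy_logprob - ref_logprob)
```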
Scaling up the reward model
Life after DPO | Lambert: 69
Ivison et al. 2024, Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback. Appearing soon.
* Presented data not final
Expectations: General improvements across the board
Reality: Challenging tasks like reasoning improve, others decline
Reality 2: Training a good reward model is not easy
Adding more prompts to RLHF
Life after DPO | Lambert: 70
Ivison et al. 2024, Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback. Appearing soon.
* Presented data not final
Expectations: General improvements across the board + task specific gains
Reality: Improvements to some code and reasoning subsets, but not easy. Messy.
PPO thoughts & resources
Takeaways
Resources
Life after DPO | Lambert: 72
Ivison et al. 2024, Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback. Appearing soon.
* Presented data not final
Many, many data ablations along the way (e.g. DPO)
Life after DPO | Lambert: 73
Ivison et al. 2024, Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback. Appearing soon.
* Presented data not final
PPO vs DPO on fixed datasets
Life after DPO | Lambert: 74
Ivison et al. 2024, Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback. Appearing soon.
* Presented data not final
Can we match PPO with “online” DPO?
75
Intro | Background | Path to DPO models | RewardBench | Fine-tuning a model | Online DPO | Conclusions
Singhal et al. 2024. D2PO: Discriminator-Guided DPO with Response Evaluation Models
What is special about online data?
Online data is freshly generated from the policy and/or recently labelled by a reward model / judge.
Related question: On- or off-policy data (i.e. that generated from the policy model)
Life after DPO | Lambert: 76
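A minimal sketch of the common recipe behind these "online" DPO variants; generate(), reward(), and dpo_update() are hypothetical helpers standing in for sampling, scoring, and a standard DPO step (this is the general pattern, not any specific paper's algorithm):

```python
def online_dpo(policy, reward_model, prompts, num_rounds: int = 10, samples_per_prompt: int = 4):
    """Repeatedly build fresh preference pairs from the current policy, then take DPO steps."""
    for _ in range(num_rounds):
        pairs = []
        for x in prompts:
            # Fresh, on-policy completions for this round.
            completions = [generate(policy, x) for _ in range(samples_per_prompt)]
            # Score with the reward model / judge and build a (chosen, rejected) pair.
            scored = sorted(completions, key=lambda y: reward(reward_model, x, y))
            pairs.append((x, scored[-1], scored[0]))  # chosen = best, rejected = worst
        dpo_update(policy, pairs)  # standard DPO loss on the freshly collected pairs
    return policy
```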
Many studies on online data
Life after DPO | Lambert: 77
Methods
Life after DPO | Lambert: 78
D2PO: Minimizing staleness of DPO training data (discriminator-guided DPO)
Life after DPO | Lambert: 79
Singhal et al. 2024. D2PO: Discriminator-Guided DPO with Response Evaluation Models
Evaluating D2PO
When evaluating “online” DPO methods, the DPO baselines become horizontal lines (all data used at once) → the plots look much closer to old-school RL learning curves.
Life after DPO | Lambert: 80
Singhal et al. 2024. D2PO: Discriminator-Guided DPO with Response Evaluation Models
Closed-form task: reward = count(nouns)
Open-ended task: reward from an AI feedback reward model
Re-labelling RM:
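The closed-form reward is simple enough to write down; a stand-in implementation might look like the following (the paper's exact tokenizer and tagger choices are assumptions here):

```python
import nltk

# Resource names differ across NLTK versions; downloading an unknown name just logs a warning.
for resource in ("punkt", "punkt_tab", "averaged_perceptron_tagger", "averaged_perceptron_tagger_eng"):
    nltk.download(resource, quiet=True)

def noun_count_reward(text: str) -> int:
    """Reward = number of noun tokens (tags NN, NNS, NNP, NNPS) in the completion."""
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    return sum(1 for _, tag in tagged if tag.startswith("NN"))

print(noun_count_reward("The stars were twinkling diamonds in the night sky."))
```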
Online and/or iterative RLHF
Industry does BOTH. Academia has mostly only had a taste of the former.
Examples of the latter: sequential training or preference collection.
Life after DPO | Lambert: 81
Anthropic’s Claude
Llama 2
Conclusions
82
Intro | Background | Path to DPO models | RewardBench | Fine-tuning a model | Online DPO | Conclusions
Discussion: What did Meta do with Llama 3?
“Our approach to post-training is a combination of supervised fine-tuning (SFT), rejection sampling, proximal policy optimization (PPO), and direct preference optimization (DPO).”
Life after DPO | Lambert: 83
Current directions
Life after DPO | Lambert: 84
I cover these topics regularly on my blog www.interconnects.ai
Where open alignment is happening
Life after DPO | Lambert: 85
I cover these topics regularly on my blog www.interconnects.ai
Thank you! Questions?
Contact: nathan at natolambert dot com
Socials: @natolambert
Writing: interconnects.ai
Thanks to many teammates at HuggingFace and AI2 for supporting this journey!
Life after DPO | Lambert: 86