Life after DPO
Nathan Lambert || Allen Institute for AI || @natolambert
Stanford CS224N: Natural Language Processing with Deep Learning
21 May 2024
A heavily abbreviated history of language models (LMs)
Life after DPO | Lambert: 5
1948: Claude Shannon models English
1948-2017:
2017: the transformer is born
2018: GPT-1, ELMo and BERT released
2019: GPT-2 and scaling laws
2020: GPT-3 surprising capabilities (and many harms)
2021: Stochastic parrots
2022: ChatGPT
Can ChatGPT exist without RLHF?
RLHF seems to be necessary, but not sufficient
Life after DPO | Lambert: 6
RLHF is relied upon elsewhere
RLHF is a key factor in many popular models, both on and off the record, including ChatGPT, Bard/Gemini, Claude, Llama 2, and more
Life after DPO | Lambert: 9
Anthropic’s Claude:
- Bai, Y. et al. “Constitutional AI: Harmlessness from AI Feedback.” 2023.
Meta’s Llama 2: “Meanwhile reinforcement learning, known for its instability, seemed a somewhat shadowy field for those in the NLP research community. However, reinforcement learning proved highly effective, particularly given its cost and time effectiveness.”
- Touvron, H. et al. “Llama 2: Open Foundation and Fine-Tuned Chat Models.” 2023.
Background: IFT, DPO, RLHF objective
10
Intro | Background | Path to DPO models | RewardBench | Fine-tuning a model | Online DPO | Conclusions
Some definitions for “alignment” of models
Life after DPO | Lambert: 11
Key idea: Instruction fine-tuning (IFT)
Life after DPO | Lambert: 12
<|system|>
You’re a helpful agent
<|end|>
<|user|>
{query}
<|end|>
<|assistant|>{Answer goes here}
System prompt
Special tokens
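As a concrete illustration, here is a minimal sketch of how a template like the one above turns one conversation into a single training string. The helper below is illustrative only, not any particular tokenizer's chat template:

```python
def format_chat(query: str, answer: str, system: str = "You're a helpful agent") -> str:
    """Render one (query, answer) pair with the special tokens shown above."""
    return (
        f"<|system|>\n{system}\n<|end|>\n"
        f"<|user|>\n{query}\n<|end|>\n"
        f"<|assistant|>{answer}<|end|>"
    )

print(format_chat("What is RLHF?", "RLHF is reinforcement learning from human feedback."))
```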
Key idea: Instruction fine-tuning (IFT)
Starting point: a base language model.
Continue training the transformer on (question, answer) pairs.
Life after DPO | Lambert: 13
Stack Overflow: What makes a transformer a transformer?, nbro 2021
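In practice, IFT is still next-token prediction on these formatted pairs; a common detail is masking the loss so only the answer tokens are supervised. A minimal sketch, assuming the standard PyTorch convention that label -100 is ignored by the cross-entropy loss:

```python
import torch

def build_labels(input_ids: torch.Tensor, prompt_len: int) -> torch.Tensor:
    """Supervise only the answer tokens: prompt positions get the ignore index."""
    labels = input_ids.clone()
    labels[:prompt_len] = -100  # -100 is ignored by torch.nn.CrossEntropyLoss by default
    return labels
```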
Review: RLHF objective
Life after DPO | Lambert: 16
Optimize a “reward” inspired by human preferences
Constrain the model to not trust the reward too much (preferences are hard to model)
π: LLM policy
π_θ: base (reference) LLM
x: prompt
y: completion
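Written out in the slide's notation, the objective these two annotations describe is the standard KL-constrained reward maximization:

\[
\max_{\pi} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\big[\, r(x, y) \,\big] \;-\; \beta \, \mathbb{D}_{\mathrm{KL}}\big[\, \pi(y \mid x) \,\|\, \pi_{\theta}(y \mid x) \,\big]
\]

The first term optimizes the learned reward; the β-weighted KL term keeps the policy π from drifting too far from the base model π_θ.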
Primary questions:
Review: Preference (reward) modeling
Can we just use supervised learning on scores?
Life after DPO | Lambert: 17
Bradley-Terry model: estimate the probability that a given pairwise preference is true, from the scores the optimal reward model assigns to the chosen and rejected completions for a prompt.
Key idea: probability ∝ reward
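For reference, the standard Bradley-Terry form (the slide renders this as an image; the usual statement is):

\[
p^{*}\!\left(y_w \succ y_l \mid x\right) \;=\; \frac{\exp\!\big(r^{*}(x, y_w)\big)}{\exp\!\big(r^{*}(x, y_w)\big) + \exp\!\big(r^{*}(x, y_l)\big)} \;=\; \sigma\!\big(r^{*}(x, y_w) - r^{*}(x, y_l)\big)
\]

where x is the prompt, y_w the chosen completion, y_l the rejected completion, and r^* the score from the optimal reward model.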
What if we just use gradient ascent on this equation?
The answer, with some math, is:
Direct Preference Optimization (DPO)
Released on May 29th 2023
(4+ months before models we’re discussing)
Life after DPO | Lambert: 19
Rafailov, Sharma, Mitchell et al. 2023
DPO characteristics
The first 2 points mean we’ll see more DPO models than anything else and learn its limits!
Life after DPO | Lambert: 20
Example code.
Rafailov, Sharma, Mitchell et al. 2023
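A minimal sketch of the resulting DPO loss, following the paper's equation (function and tensor names here are illustrative, not the authors' reference implementation):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1):
    """DPO loss from summed log-probs of the chosen/rejected completions."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_logratio - rejected_logratio)
    loss = -F.logsigmoid(logits).mean()
    # The beta-scaled log-ratios double as implicit rewards (useful for logging).
    chosen_rewards = beta * chosen_logratio.detach()
    rejected_rewards = beta * rejected_logratio.detach()
    return loss, chosen_rewards, rejected_rewards
```

Those implicit rewards are also why a DPO checkpoint can later be used as a reward model, which comes up again in the RewardBench section.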
DPO vs RL (PPO, REINFORCE, …)
DPO and PPO are very different optimizers: DPO learns directly from preferences, while PPO uses RL update rules.
The distinction is not exactly online vs. offline RL either, though that framing is more muddled.
More discussion: https://twitter.com/srush_nlp/status/1729896568956895370, https://www.interconnects.ai/p/the-dpo-debate, https://www.youtube.com/watch?v=YJMCSVLRUNs
Life after DPO | Lambert: 21
Credit Tom Goldstein
https://twitter.com/tomgoldsteincs
The path to DPO models
22
Figure from Aligning Open Language Models: https://youtu.be/AdLgPmcrXwQ
Intro | Background | Path to DPO models | RewardBench | Fine-tuning a model | Online DPO | Conclusions
First open instruction tuned models
Life after DPO | Lambert: 23
Alpaca (13 Mar. 2023): MT-Bench (13B): 4.53
Vicuna, lmsys/vicuna-7b-delta-v0 (30 Mar. 2023): MT-Bench (7B): 6.69
Koala (3 Apr. 2023): MT-Bench (13B): 6.08
Dolly (12 Apr. 2023): MT-Bench (12B): 3.28
Key resource: ShareGPT data
Life after DPO | Lambert: 24
Source: https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered
OpenAssistant: The first open, human instruction dataset
“In an effort to democratize research on large-scale alignment, we release OpenAssistant Conversations (OASST1), a human-generated, human-annotated assistant-style conversation corpus consisting of 161,443 messages in 35 different languages, annotated with 461,292 quality ratings, resulting in over 10,000 fully annotated conversation trees. The corpus is a product of a worldwide crowd-sourcing effort involving over 13,500 volunteers.”
April 15th 2023
Life after DPO | Lambert: 25
StableVicuna: The first RLHF model
28 April 2023
Trained with proximal policy optimization (PPO) on popular datasets
Standard formulation. Ahead of its time!
Life after DPO | Lambert: 26
Llama 2 chat backlash
Should chat models be “safe”?
Life after DPO | Lambert: 27
Röttger et al. 2023
“Uncensored” models
One of the first models named this way (April 2023): cognitivecomputations/WizardLM-7B-Uncensored
Example models here: https://huggingface.co/models?other=uncensored
Life after DPO | Lambert: 28
Transition period: Ultrachat, OpenChat, XwinLM, OpenHermes, and more fine-tunes
A series of strong models trained with instruction tuning and/or RLHF, but none markedly shifted the narrative.
Life after DPO | Lambert: 29
Note (17 April 2024): WizardLM is not currently available officially on HuggingFace, pending artifact review at Microsoft.
DPO works: Zephyr β
Life after DPO | Lambert: 30
UltraFeedback: https://arxiv.org/abs/2310.01377
DPO scales: Tulu 2
Life after DPO | Lambert: 31
RLHF phase: SteerLM & Starling
Still plenty of models showing that PPO (and other RL methods) outperform DPO!
Life after DPO | Lambert: 32
Life after DPO models
33
Life after DPO
Life after DPO | Lambert: 35
Much easier to get into alignment research
Still don’t really have the resources (e.g. human data) to do RLHF like industry
(I’m too often here) 🥲
Life after DPO
→ PPO vs DPO performance study
→ Online DPO variants
Life after DPO | Lambert: 37
RewardBench
38
Intro | Background | Path to DPO models | RewardBench | Fine-tuning a model | Online DPO | Conclusions
Lambert et al. 2024. RewardBench: Evaluating Reward Models for Language Modeling
From environment to reward model
Life after DPO | Lambert: 39
Reward model training
Life after DPO | Lambert: 40
The Transformer - Vaswani et al. 2017
input pair: selected prompt + completion, rejected prompt + completion
outputs: two scalar rewards
loss: increase the difference of the predicted rewards
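A minimal sketch of that pairwise loss, assuming the model has already produced one scalar reward per (prompt, completion) sequence (names are illustrative):

```python
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_rewards: torch.Tensor, rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style pairwise loss: push chosen rewards above rejected ones."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```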
Reward model training
Advanced considerations:
Life after DPO | Lambert: 41
How to evaluate reward models?
Many questions we want to answer:
Context:
→ Many researchers/engineers/papers from industry say reward models are crucial to RLHF.
Life after DPO | Lambert: 42
RewardBench structure
Life after DPO | Lambert: 43
Lambert et al. 2024. RewardBench: Evaluating Reward Models for Language Modeling
RewardBench dataset
Life after DPO | Lambert: 44
Lambert et al. 2024. RewardBench: Evaluating Reward Models for Language Modeling
RewardBench at launch (March 2024)
Life after DPO | Lambert: 46
Lambert et al. 2024. RewardBench: Evaluating Reward Models for Language Modeling
RewardBench today (May 2024)
Life after DPO | Lambert: 47
From top 5 to top 30
Some closed lab model scores!
DPO models slowing down
LLM-as-a-judge not SOTA
Chat Hard is the only meaningful eval.
Chat Hard - Example
Subtle changes of topic or intentionally constructed trick questions. From Zeng, Zhiyuan, et al. "Evaluating large language models at evaluating instruction following." arXiv preprint arXiv:2310.07641 (2023).
Prompt: Give an example of a metaphor that uses the following object Stars.
Chosen: The stars were twinkling diamonds in the night sky.
Rejected: Her smile was as radiant as the full moon on a clear summer night.
Subset: llmbar-adver-GPTInst
Life after DPO | Lambert: 53
Safety Patterns
Life after DPO | Lambert: 54
Three observed patterns: handles safety well, refuses everything, responds to everything.
Röttger, Paul, et al. "Xstest: A test suite for identifying exaggerated safety behaviours in large language models." arXiv preprint arXiv:2308.01263 (2023).
Wang, Yuxia, et al. "Do-not-answer: A dataset for evaluating safeguards in llms." arXiv preprint arXiv:2308.13387 (2023).
Using DPO models as an RM
Life after DPO | Lambert: 55
Lambert et al. 2024. RewardBench: Evaluating Reward Models for Language Modeling
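The relevant math: DPO's implicit reward (in the DPO paper's notation) is the β-scaled log-ratio between the policy and the reference model, up to a prompt-only term that cancels when comparing two completions for the same prompt:

\[
r(x, y) \;=\; \beta \log \frac{\pi_{\theta}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
\]

Ranking a chosen/rejected pair with this quantity is what lets a DPO-trained checkpoint be evaluated on RewardBench like any other reward model.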
DPO reward models without reference model?
Life after DPO | Lambert: 57
Lambert et al. 2024. RewardBench: Evaluating Reward Models for Language Modeling
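One natural reading of this question is to drop the reference term and rank completions by the policy's (β-scaled) log-probabilities alone:

\[
r_{\text{ref-free}}(x, y) \;=\; \beta \log \pi_{\theta}(y \mid x)
\]

This is cheaper (no reference model needed at scoring time), but it is no longer the exact quantity that DPO training optimizes.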
RewardBench: Cohere’s RMs
Better than the best open models by ~2-3 points on average.

             Cohere Mar. 2024*   Open SOTA (May)**   Cohere May 2024
Chat:        94.7                98.3                96.4
Chat Hard:   65.1                65.8                71.3
Safety:      90.3                89.7                92.7
Reasoning:   98.2                94.7                97.7

*No information on architecture or training.
** Pairwise architecture, not easy to use with RLHF: RLHFlow/pair-preference-model-LLaMA3-8B
Life after DPO | Lambert: 60
Lambert et al. 2024. RewardBench: Evaluating Reward Models for Language Modeling
Towards RewardBench 2.0
PS: Please add your models!
Life after DPO | Lambert: 61
Lambert et al. 2024. RewardBench: Evaluating Reward Models for Language Modeling
Fine-tuning a “good” model
63
Intro | Background | Path to DPO models | RewardBench | Fine-tuning a model | Online DPO | Conclusions
… and trying to answer: is PPO > DPO?
Ivison et al. 2024. Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback
Starting point: SFT
Life after DPO | Lambert: 64
Ivison et al. 2024, Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback. Appearing soon.
* Presented data not final
Tulu 2 13B foundation:
Evaluations:
Add DPO
Life after DPO | Lambert: 65
Ivison et al. 2024, Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback. Appearing soon.
* Presented data not final
Anthropic HH RLHF data:
Add DPO (better data)
Life after DPO | Lambert: 66
Ivison et al. 2024, Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback. Appearing soon.
* Presented data not final
UltraFeedback data:
Switch from DPO to PPO
Life after DPO | Lambert: 67
Ivison et al. 2024, Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback. Appearing soon.
* Presented data not final
UltraFeedback data
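For context on what switching to PPO changes: the reward PPO actually optimizes in RLHF is typically the reward-model score minus a KL penalty against the reference model, matching the objective from the background section. A minimal sketch, with illustrative names and the penalty applied at the sequence level rather than per token:

```python
import torch

def shaped_reward(rm_score: torch.Tensor,
                  policy_logprob: torch.Tensor,
                  ref_logprob: torch.Tensor,
                  beta: float = 0.05) -> torch.Tensor:
    """Reward-model score minus a KL-style penalty toward the reference model."""
    return rm_score - beta * (policy_logprob - ref_logprob)
```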
Scaling up the reward model
Life after DPO | Lambert: 69
Ivison et al. 2024, Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback. Appearing soon.
* Presented data not final
Expectations: General improvements across the board
Reality: Challenging tasks like reasoning improve, others decline
Reality 2: Training a good reward model is not easy
Adding more prompts to RLHF
Life after DPO | Lambert: 70
Ivison et al. 2024, Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback. Appearing soon.
* Presented data not final
Expectations: General improvements across the board + task specific gains
Reality: Improvements to some code and reasoning subsets, but not easy. Messy.
PPO thoughts & resources
Takeaways
Resources
Life after DPO | Lambert: 72
Ivison et al. 2024, Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback. Appearing soon.
* Presented data not final
Many, many data ablations along the way (e.g. DPO)
Life after DPO | Lambert: 73
Ivison et al. 2024, Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback. Appearing soon.
* Presented data not final
PPO vs DPO on fixed datasets
Life after DPO | Lambert: 74
Ivison et al. 2024, Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback. Appearing soon.
* Presented data not final
Can we match PPO with “online” DPO?
75
Intro | Background | Path to DPO models | RewardBench | Fine-tuning a model | Online DPO | Conclusions
Singhal et al. 2024. D2PO: Discriminator-Guided DPO with Response Evaluation Models
What is special about online data?
Online data is freshly generated from the policy and/or recently labelled by a reward model / judge.
Related question: On- or off-policy data (i.e. that generated from the policy model)
Life after DPO | Lambert: 76
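A minimal sketch of the common recipe behind these "online" DPO variants; generate(), reward(), and dpo_update() are hypothetical helpers standing in for sampling, scoring, and a standard DPO step (this is the general pattern, not any specific paper's algorithm):

```python
def online_dpo(policy, reward_model, prompts, num_rounds: int = 10, samples_per_prompt: int = 4):
    """Repeatedly build fresh preference pairs from the current policy, then take DPO steps."""
    for _ in range(num_rounds):
        pairs = []
        for x in prompts:
            # Fresh, on-policy completions for this round.
            completions = [generate(policy, x) for _ in range(samples_per_prompt)]
            # Score with the reward model / judge and build a (chosen, rejected) pair.
            scored = sorted(completions, key=lambda y: reward(reward_model, x, y))
            pairs.append((x, scored[-1], scored[0]))  # chosen = best, rejected = worst
        dpo_update(policy, pairs)  # standard DPO loss on the freshly collected pairs
    return policy
```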
Many studies on online data
Life after DPO | Lambert: 77
Methods
Life after DPO | Lambert: 78
D2PO: Minimizing staleness of DPO training data (discriminator-guided DPO)
Life after DPO | Lambert: 79
Singhal et al. 2024. D2PO: Discriminator-Guided DPO with Response Evaluation Models
Evaluating D2PO
When evaluating “online” DPO methods, the DPO baselines become horizontal lines (all data used at once) → the plots look much closer to old-school RL learning curves.
Life after DPO | Lambert: 80
Singhal et al. 2024. D2PO: Discriminator-Guided DPO with Response Evaluation Models
Closed-form task: reward = count(nouns)
Open-ended task: reward from an AI feedback reward model
Re-labelling RM:
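The closed-form reward is simple enough to write down; a stand-in implementation might look like the following (the paper's exact tokenizer and tagger choices are assumptions here):

```python
import nltk

# Resource names differ across NLTK versions; downloading an unknown name just logs a warning.
for resource in ("punkt", "punkt_tab", "averaged_perceptron_tagger", "averaged_perceptron_tagger_eng"):
    nltk.download(resource, quiet=True)

def noun_count_reward(text: str) -> int:
    """Reward = number of noun tokens (tags NN, NNS, NNP, NNPS) in the completion."""
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    return sum(1 for _, tag in tagged if tag.startswith("NN"))

print(noun_count_reward("The stars were twinkling diamonds in the night sky."))
```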
Online and/or iterative RLHF
Industry does BOTH. Academia has mostly only had a taste of the former.
Examples of the latter: sequential training or preference collection.
Life after DPO | Lambert: 81
Anthropic’s Claude
Llama 2
Conclusions
82
Intro | Background | Path to DPO models | RewardBench | Fine-tuning a model | Online DPO | Conclusions
Discussion: What did Meta do with Llama 3?
“Our approach to post-training is a combination of supervised fine-tuning (SFT), rejection sampling, proximal policy optimization (PPO), and direct preference optimization (DPO).”
Life after DPO | Lambert: 83
Current directions
Life after DPO | Lambert: 84
I cover these topics regularly on my blog www.interconnects.ai
Where open alignment is happening
Life after DPO | Lambert: 85
I cover these topics regularly on my blog www.interconnects.ai
Thank you! Questions?
Contact: nathan at natolambert dot com
Socials: @natolambert
Writing: interconnects.ai
Thanks to many teammates at HuggingFace and AI2 for supporting this journey!
Life after DPO | Lambert: 86