1 of 86

Life after DPO

Nathan Lambert || Allen Institute for AI || @natolambert

Stanford CS224N: Natural Language Processing with Deep Learning

21 May 2024

2 of 86

A heavily abbreviated history of language models (LMs)


3 of 86

A heavily abbreviated history of LMs


1948: Claude Shannon models English

1948-2017:

4 of 86

A heavily abbreviated history of LMs


1948: Claude Shannon models English

1948-2017:

2017: the transformer is born

2018: GPT-1, ELMo and BERT released

2019: GPT-2 and scaling laws

2020: GPT-3: surprising capabilities, many harms

5 of 86

A heavily abbreviated history of LMs


1948: Claude Shannon models English

1948-2017:

2017: the transformer is born

2018: GPT-1, ELMo and BERT released

2019: GPT-2 and scaling laws

2020: GPT-3 surprising capabilities

2021: Stochastic parrots

2022: ChatGPT

6 of 86

Can ChatGPT exist without RLHF?

RLHF seems to be necessary, but not sufficient


7 of 86

RLHF is relied upon elsewhere

RLHF is a key factor in many popular models, both on and off the record, including ChatGPT, Bard/Gemini, Claude, Llama 2, and more


8 of 86

RLHF is relied upon elsewhere

RLHF is a key factor in many popular models, both on and off the record, including ChatGPT, Bard/Gemini, Claude, Llama 2, and more


Bai, Y. et al. “Constitutional AI: Harmlessness from AI Feedback.” 2022.

Anthropic’s Claude

9 of 86

RLHF is relied upon elsewhere

RLHF is a key factor in many popular models, both on and off the record, including ChatGPT, Bard/Gemini, Claude, Llama 2, and more


“Meanwhile, reinforcement learning, known for its instability, seemed a somewhat shadowy field for those in the NLP research community. However, reinforcement learning proved highly effective, particularly given its cost and time effectiveness.”

- Touvron, H. et al. “Llama 2: Open Foundation and Fine-Tuned Chat Models.” 2023.

Bai, Y. et al. “Constitutional AI: Harmlessness from AI Feedback.” 2022.

Anthropic’s Claude

Meta’s Llama 2

10 of 86

Background: IFT, DPO, RLHF objective


Intro | Background | Path to DPO models | RewardBench | Fine-tuning a model | Online DPO | Conclusions

11 of 86

Some definitions for “alignment” of models

  • Instruction fine-tuning (IFT): Training a model to follow user instructions (usually via autoregressive LM loss)
  • Supervised fine-tuning (SFT): Training a model to learn task-specific capabilities (usually via autoregressive LM loss)
  • Alignment: General notion of training a model to mirror user desires, any loss function
  • Reinforcement learning from human feedback (RLHF): Specific technical tool for training ML models from human data
  • Preference fine-tuning: Using labeled preference data to fine-tune an LM (with RL, DPO, or another loss function; learning-to-rank objectives also fit here)


12 of 86

Key idea: Instruction fine-tuning (IFT)

  1. Adapt the base model to a specific style of input
  2. Add the ability to include system prompts, multi-turn dialogues, and other chat templates


<|system|>

You’re a helpful agent

<|end|>

<|user|>

{query}

<|end|>

<|assistant|>{Answer goes here}

Annotations: system prompt, special tokens
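For concreteness, here is a minimal sketch of building such a prompt with Hugging Face transformers' apply_chat_template; the model choice and query are illustrative assumptions, not taken from the slides:

```python
from transformers import AutoTokenizer

# Any chat model whose tokenizer ships a chat template works; Zephyr is used
# here purely as an example, not a model named on this slide.
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

messages = [
    {"role": "system", "content": "You're a helpful agent"},
    {"role": "user", "content": "What is instruction fine-tuning?"},
]

# Render the conversation with the model's special tokens and append the
# assistant prefix so generation starts at the answer.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```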

13 of 86

Key idea: Instruction fine-tuning (IFT)

Starting point: a base language model. Continue training the transformer on pairs of question: answer.


Stack Overflow: “What makes a transformer a transformer?”, nbro, 2021

14 of 86

Review: RLHF objective


π: LLM policy

πθ: base LLM

x: prompt

y: completion
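The objective itself is an image in the deck and did not extract; in the slide's notation (π the policy being optimized, π_θ the frozen base LLM), the standard KL-constrained RLHF objective being reviewed is:

$$ \max_{\pi} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)} \big[ r(x, y) \big] \;-\; \beta \, \mathbb{D}_{\mathrm{KL}} \big[ \pi(y \mid x) \,\Vert\, \pi_{\theta}(y \mid x) \big] $$

The first term is the “optimize reward” piece and the KL term is the “don't trust the reward too much” constraint highlighted on the next slides.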

15 of 86

Review: RLHF objective


Optimize “reward” inspired by human preferences

Constrain the model to not trust the reward too much (preferences are hard to model)

π: LLM policy

πθ: base LLM

x: prompt

y: completion

16 of 86

Review: RLHF objective


Optimize “reward” inspired by human preferences

Constrain the model to not trust the reward too much (preferences are hard to model)

π: LLM policy

πθ: base LLM

x: prompt

y: completion

Primary questions:

  1. How to implement reward: r(x,y)
  2. How to optimize reward

17 of 86

Review: Preference (reward) modeling

Can we just use supervised learning on scores?

  • Directly predicting a scalar score for how good a response is did not work well
  • Pairwise preferences are easy to collect and worked!


Bradley-Terry model: estimate the probability that a given pairwise preference is true. With prompt x, chosen completion y_1, rejected completion y_2, and scores r^* from the optimal reward model, the key idea is that this probability is expressed in terms of reward:

$$ p^{*}(y_1 \succ y_2 \mid x) = \frac{\exp\left(r^{*}(x, y_1)\right)}{\exp\left(r^{*}(x, y_1)\right) + \exp\left(r^{*}(x, y_2)\right)} $$

18 of 86

What if we just use gradient ascent on this equation?


19 of 86

What if we just use gradient ascent on this equation?

The answer, with some math, is:

Direct Preference Optimization (DPO)

Released on May 29th 2023

(4+ months before models we’re discussing)


Rafailov, Sharma, Mitchell et al. 2023
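The derivation on the slide is an image; for reference, the resulting DPO objective from Rafailov et al. (2023), written with π as the policy and π_θ as the frozen reference model to match the notation above, is:

$$ \mathcal{L}_{\mathrm{DPO}}(\pi; \pi_{\theta}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi(y_w \mid x)}{\pi_{\theta}(y_w \mid x)} - \beta \log \frac{\pi(y_l \mid x)}{\pi_{\theta}(y_l \mid x)} \right) \right] $$

where y_w and y_l are the chosen and rejected completions and σ is the sigmoid.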

20 of 86

DPO characteristics

  1. Extremely simple to implement
  2. Scales nicely with existing distributed training libraries
  3. Trains an implicit reward function (can still be used as a reward model, see RewardBench)

The first two points mean we’ll see more DPO models than anything else and learn its limits!


Example code.

Rafailov, Sharma, Mitchell et al. 2023
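The code figure did not extract; below is a minimal PyTorch-style sketch of the DPO loss in the spirit of the paper's appendix pseudocode. Variable names are illustrative, not the paper's exact code:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor, policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor, ref_rejected_logps: torch.Tensor,
             beta: float = 0.1):
    """DPO loss from summed log-probs of chosen/rejected completions.

    Each argument is a 1-D tensor of per-example log p(completion | prompt)
    under the policy or the frozen reference model.
    """
    # Implicit rewards: beta * log-ratio of policy to reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Logistic loss on the reward margin between chosen and rejected.
    loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
    return loss, chosen_rewards.detach(), rejected_rewards.detach()
```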

21 of 86

DPO vs RL (PPO, REINFORCE, …)

DPO and PPO are very different optimizers.

DPO learns directly from preferences, while PPO uses RL update rules.

The distinction is also not exactly online vs. offline RL, but that framing is more muddled. More discussion: https://twitter.com/srush_nlp/status/1729896568956895370, https://www.interconnects.ai/p/the-dpo-debate, https://www.youtube.com/watch?v=YJMCSVLRUNs


Credit Tom Goldstein

https://twitter.com/tomgoldsteincs

22 of 86

The path to DPO models


Figure from “Aligning Open Language Models”, https://youtu.be/AdLgPmcrXwQ

Intro | Background | Path to DPO models | RewardBench | Fine-tuning a model | Online DPO | Conclusions

23 of 86

First open instruction tuned models


Alpaca

13 Mar. 2023

  • 52k self-instruct style data distilled from text-davinci-003
  • Model weight diff. to LLaMA 7B

https://crfm.stanford.edu/2023/03/13/alpaca.html

Vicuna (lmsys/vicuna-7b-delta-v0)

30 Mar. 2023

  • Fine-tunes on ChatGPT conversations from ShareGPT
  • LLaMA 7B and 13B diffs
  • Introduces LLM-as-a-judge

https://lmsys.org/blog/2023-03-30-vicuna/

Koala

3 Apr. 2023

  • Diverse dataset (Alpaca, Anthropic HH, ShareGPT, WebGPT…)
  • Human evaluation
  • LLaMA 7B diff.

https://bair.berkeley.edu/blog/2023/04/03/koala/

Dolly

12 Apr. 2023

  • 15k human-written examples
  • Trained on Pythia 12B

https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm

MT-Bench scores: Alpaca 13B: 4.53 | Vicuna 7B: 6.69 | Koala 13B: 6.08 | Dolly 12B: 3.28

24 of 86

Key resource: ShareGPT data

  • Source: conversations shared by users via a browser tool for exporting their ChatGPT chats
  • Question: Legal grey area; most of these datasets are unlicensed / collected without consent.
  • Use: used extensively over the last 18 months, now starting to be replaced by carefully collected counterparts:
    • LMSYS-Chat-1M: cleaned conversations from ChatBotArena.
    • WildChat: free ChatGPT usage in exchange for data.


Source: https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered

25 of 86

OpenAssistant: The first open, human instruction dataset

“In an effort to democratize research on large-scale alignment, we release OpenAssistant Conversations (OASST1), a human-generated, human-annotated assistant-style conversation corpus consisting of 161,443 messages in 35 different languages, annotated with 461,292 quality ratings, resulting in over 10,000 fully annotated conversation trees. The corpus is a product of a worldwide crowd-sourcing effort involving over 13,500 volunteers.”

April 15th 2023

  • Used extensively in future models.
  • Still the only human dataset of this size to be released.
  • OpenAssistant and others trained popular models with it.
  • (released fine-tuned models too!)
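If you want to poke at the corpus described above, a minimal sketch with Hugging Face datasets follows; it assumes the OpenAssistant/oasst1 dataset ID on the Hub, and exact column names may vary by version:

```python
from datasets import load_dataset

# Message-level rows of the conversation trees released as OASST1.
oasst1 = load_dataset("OpenAssistant/oasst1", split="train")
print(len(oasst1))
print(oasst1[0])  # inspect one message and its quality annotations
```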


26 of 86

StableVicuna: The first RLHF model

28 April 2023

Trained with proximal policy optimization (PPO) on popular datasets

  • OAsst1 dataset for SFT + PPO
  • Anthropic HH + Stanford Human Preferences (SHP) for RL

Standard formulation. Ahead of its time!


27 of 86

Llama 2 chat backlash

Should chat models be “safe?”


Röttger et al. 2023

28 of 86

“Uncensored” models

  • Goal: Modify models so they don’t refuse any request
  • Method: Remove instances of “as a language model” or “Sorry, …” in training data
  • Confusion: Not the clearest name for things. The models were never explicitly censored to begin with.
  • Prefer the name direct or unbiased.

One of the first models named this way (April 2023): cognitivecomputations/WizardLM-7B-Uncensored

Example models here: https://huggingface.co/models?other=uncensored


29 of 86

Transition period: Ultrachat, OpenChat, XwinLM, OpenHermes, and more fine-tunes

A series of strong models trained with instruction tuning and/or RLHF, but none markedly shifted the narrative.

  • Apr. 2023: WizardLM v0.1 trained with Evol-Instruct (synthetic data generation); other strong RL math/code models mostly ignored by the community; MT-Bench 13B: 6.35
  • Jun. 2023: UltraLM 13B trained on new UltraChat dataset
  • Jun. 2023: OpenChat 13B trained on filtered ShareGPT data
  • Sep. 2023: XwinLM 7B, a strong model “trained with RLHF,” but no details or paper; XwinLM 70B was the first model to beat GPT-4 on AlpacaEval
  • Oct. 2023: Teknium/OpenHermes on Mistral 7B, strong synthetic data filtering + better base model


Note (17 April 2024): WizardLM is not currently officially available on Hugging Face while under artifact review at Microsoft.

30 of 86

DPO works: Zephyr β

  • First model to make a splash with DPO!
  • Fine-tune of Mistral 7B with UltraFeedback dataset.
  • Discovered the surprisingly low learning rates that are now standard (~5e-7)
  • MT Bench 7.34


31 of 86

DPO scales: Tulu 2

  • First model to scale DPO to 70 billion parameters!
  • Strongly validated the Zephyr results.
  • Started the DPO vs. PPO debate for real.
  • MT Bench 70B: 7.89


32 of 86

RLHF phase: SteerLM & Starling

Still plenty of models showing that PPO (and other RL methods) can outperform DPO!

  • SteerLM: Attribute conditioned fine-tuning
  • Starling: Introduced a new preference dataset, Nectar, and a k-wise reward model loss function (i.e. moving beyond pairwise preferences; a generic listwise objective is sketched after this list)
    • MT Bench 7B: 8.09 (beat every model except GPT-4 at the time)
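As a hedged illustration of what “k-wise” means, here is a generic listwise maximum-likelihood objective in the Plackett-Luce family; Starling's exact loss may differ in its details. For completions ranked y_1 ≻ y_2 ≻ … ≻ y_k under prompt x:

$$ \mathcal{L}_{k\text{-wise}}(r) = -\sum_{i=1}^{k-1} \log \frac{\exp\big(r(x, y_i)\big)}{\sum_{j=i}^{k} \exp\big(r(x, y_j)\big)} $$

For k = 2 this reduces to the pairwise Bradley-Terry loss used earlier.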


33 of 86

Life after DPO models


34 of 86

Life after DPO


Much easier to get into alignment research

Still don’t really have the resources (e.g. human data) to do RLHF like industry

35 of 86

Life after DPO


Much easier to get into alignment research

Still don’t really have the resources (e.g. human data) to do RLHF like industry

(I’m too often here) 🥲

36 of 86

Life after DPO

  1. Better evaluation for alignment
  2. How can we improve upon DPO models?


37 of 86

Life after DPO

  • Better evaluation for alignment
    → RewardBench example
    → (building a suite of tools like ArenaHard)
  • How can we improve upon DPO models?
    → PPO vs. DPO performance study
    → Online DPO variants


38 of 86

RewardBench


Intro | Background | Path to DPO models | RewardBench | Fine-tuning a model | Online DPO | Conclusions

Lambert et al. 2024. RewardBench: Evaluating Reward Models for Language Modeling

39 of 86

From environment to reward model


40 of 86

Reward model training


The Transformer - Vaswani et al. 2017

Input pair: selected prompt + completion, rejected prompt + completion
Outputs: two scalar rewards
Loss: increase the difference of the predicted rewards
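A minimal sketch of that pairwise loss (negative log-sigmoid of the reward difference, with the optional margin mentioned on the next slide); tensor names are illustrative:

```python
from typing import Optional

import torch
import torch.nn.functional as F

def reward_model_loss(chosen_rewards: torch.Tensor,
                      rejected_rewards: torch.Tensor,
                      margin: Optional[torch.Tensor] = None) -> torch.Tensor:
    """Pairwise loss: push the selected completion's reward above the rejected one's."""
    diff = chosen_rewards - rejected_rewards
    if margin is not None:
        # Optional margin term (one of the "additional options" on the next slide).
        diff = diff - margin
    return -F.logsigmoid(diff).mean()
```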

41 of 86

Reward model training

Advanced considerations:

  • Trained for only 1 epoch (more overfits)!
  • Evaluations often show only 65-75% agreement
  • Additional options exist (such as a margin between choices in the loss function)


42 of 86

How to evaluate reward models?

Many questions we want to answer:

  • How do reward models / preference models improve final LLM capabilities?
  • How do reward models encode safety / other specific features?
  • How does scaling improve specific properties of reward models?

Context:

→ Many researchers/engineers/papers from industry say reward models are crucial to RLHF.


43 of 86

RewardBench structure


Lambert et al. 2024. RewardBench: Evaluating Reward Models for Language Modeling

44 of 86

RewardBench

dataset


Lambert et al. 2024. RewardBench: Evaluating Reward Models for Language Modeling

45 of 86

RewardBench

at launch

March 2024


Lambert et al. 2024. RewardBench: Evaluating Reward Models for Language Modeling

46 of 86

RewardBench

at launch

March 2024


Lambert et al. 2024. RewardBench: Evaluating Reward Models for Language Modeling

47 of 86

RewardBench

Today

May 2024


48 of 86

RewardBench

Today

May 2024


From top 5 to top 30

49 of 86

RewardBench

Today

May 2024


Some closed lab model scores!

50 of 86

RewardBench

Today

May 2024


DPO models slowing down

51 of 86

RewardBench

Today

May 2024


LLM-as-a-judge not SOTA

52 of 86

RewardBench

Today

May 2024


Chat Hard is the only meaningful eval.

53 of 86

Chat Hard - Example

Subtle changes of topic or intentionally constructed trick questions. From Zeng, Zhiyuan, et al. "Evaluating Large Language Models at Evaluating Instruction Following." arXiv preprint arXiv:2310.07641 (2023).

Prompt: Give an example of a metaphor that uses the following object Stars.

Chosen: The stars were twinkling diamonds in the night sky.

Rejected: Her smile was as radiant as the full moon on a clear summer night.

Subset: llmbar-adver-GPTInst


54 of 86

Safety Patterns


Röttger, Paul, et al. "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models." arXiv preprint arXiv:2308.01263 (2023).
Wang, Yuxia, et al. "Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs." arXiv preprint arXiv:2308.13387 (2023).

Three observed patterns: handles safety well, refuses everything, responds to everything

55 of 86

Using DPO models as an RM


Insert more DPO math above…

Lambert et al. 2024. RewardBench: Evaluating Reward Models for Language Modeling
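The referenced equations are images in the deck; the key identity from Rafailov et al. (2023) that lets a DPO-trained policy π act as a reward model, and the reference-free variant probed on the next slides, are:

$$ r(x, y) = \beta \log \frac{\pi(y \mid x)}{\pi_{\theta}(y \mid x)} \qquad\text{and}\qquad r_{\text{ref-free}}(x, y) = \beta \log \pi(y \mid x) $$

with π_θ the frozen reference (base) model, matching the earlier notation.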

56 of 86

DPO reward models without reference model?


Insert more DPO math above…

Lambert et al. 2024. RewardBench: Evaluating Reward Models for Language Modeling

57 of 86

DPO reward models without reference model?


Lambert et al. 2024. RewardBench: Evaluating Reward Models for Language Modeling

58 of 86

RewardBench: Cohere’s RMs

Better than the best open models by ~2-3 points on average.

            Cohere Mar. 2024*    Open SOTA (May)    Cohere May 2024
Chat        94.7
Chat Hard   65.1
Safety      90.3
Reasoning   98.2

*No information on architecture or training.


Lambert et al. 2024. RewardBench: Evaluating Reward Models for Language Modeling

59 of 86

RewardBench: Cohere’s RMs

Better than the best open models by ~2-3 points on average.

            Cohere Mar. 2024*    Open SOTA (May)**    Cohere May 2024
Chat        94.7                 98.3
Chat Hard   65.1                 65.8
Safety      90.3                 89.7
Reasoning   98.2                 94.7

*No information on architecture or training.

** Pairwise architecture, not easy to use with RLHF: RLHFlow/pair-preference-model-LLaMA3-8B


Lambert et al. 2024. RewardBench: Evaluating Reward Models for Language Modeling

60 of 86

RewardBench: Cohere’s RMs

Better than the best open models by ~2-3 points on average.

            Cohere Mar. 2024*    Open SOTA (May)**    Cohere May 2024
Chat        94.7                 98.3                 96.4
Chat Hard   65.1                 65.8                 71.3
Safety      90.3                 89.7                 92.7
Reasoning   98.2                 94.7                 97.7

*No information on architecture or training.

** Pairwise architecture, not easy to use with RLHF: RLHFlow/pair-preference-model-LLaMA3-8B


Lambert et al. 2024. RewardBench: Evaluating Reward Models for Language Modeling

61 of 86

Towards RewardBench 2.0

  • Reasoning category is too easy due to formatting cues (small bugs, human vs. model text, etc.) → Reasoning 2.0
  • Lower random baseline: from pairwise to batch RM ranking
  • More datasets
    • Existing benchmarks (e.g. jailbreaking)
    • Custom, held-out data (make labs come to us to evaluate!)
  • More closed models: need structured access with LLM labs
  • Correlating with PPO training

PS: Please add your models!


Lambert et al. 2024. RewardBench: Evaluating Reward Models for Language Modeling

62 of 86

Fine-tuning a “good” model


Intro | Background | Path to DPO models | RewardBench | Fine-tuning a model | Online DPO | Conclusions

Ivison et al. 2024. Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback

63 of 86

Fine-tuning a “good” model


Intro | Background | Path to DPO models | RewardBench | Fine-tuning a model | Online DPO | Conclusions

… and trying to answer if PPO > DPO?

Ivison et al. 2024. Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback

64 of 86

Starting point: SFT


Ivison et al. 2024, Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback. Appearing soon.

* Presented data not final

Tulu 2 13B foundation:

  • Llama 2 base
  • Large diverse SFT dataset

Evaluations:

  • Factuality (MMLU)
  • Reasoning (GSM8k, Big Bench Hard)
  • Coding (HumanEval+, MBPP+)
  • Chat (AlpacaEval 1&2, IFEval)
  • Safety (ToxiGen, XSTest)
  • Truthfulness (TruthfulQA)

65 of 86

Add DPO


Ivison et al. 2024, Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback. Appearing soon.

* Presented data not final

Anthropic HH RLHF data:

  • Small bump in Chat, Safety, Truthfulness
  • Serves as the all-human-data baseline
  • Generally accepted to be noisy

66 of 86

Add DPO (better data)


Ivison et al. 2024, Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback. Appearing soon.

* Presented data not final

UltraFeedback data:

  • Tulu 2 13B DPO model
  • Bigger jumps than HH RLHF

67 of 86

Switch from DPO to PPO


Ivison et al. 2024, Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback. Appearing soon.

* Presented data not final

UltraFeedback data

  • Bump on more metrics (Factuality)
  • Continues overall bump
  • Biggest jump on AlpacaEval 2

68 of 86

Scaling up the reward model


Ivison et al. 2024, Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback. Appearing soon.

* Presented data not final

Expectations: General improvements across the board

Reality: Challenging tasks like reasoning improve, others decline

69 of 86

Scaling up the reward model


Ivison et al. 2024, Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback. Appearing soon.

* Presented data not final

Expectations: General improvements across the board

Reality: Challenging tasks like reasoning improve, others decline

Reality 2: Training a good reward model is not easy

70 of 86

Adding more prompts to RLHF


Ivison et al. 2024, Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback. Appearing soon.

* Presented data not final

Expectations: General improvements across the board + task specific gains

Reality: Improvements to some code and reasoning subsets, but not easy. Messy.

71 of 86

PPO thoughts

Takeaways

  • “Always one more thing to ablate”
  • “PPO gets the best model, but we don’t know why”
  • Generation is very slow without accelerated inference tools (e.g. vLLM)


Ivison et al. 2024, Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback. Appearing soon.

* Presented data not final

72 of 86

PPO thoughts & resources

Takeaways

  • “Always one more thing to ablate”
  • “PPO gets the best model, but we don’t know why”
  • Generation is very slow without accelerated inference tools (e.g. vLLM)

Resources

  • All training done on TPUs via the Google TPU Research Cloud
    • Can barely fit a 70B policy plus a second 70B model on a v3-512 slice
  • Codebase: EasyLM fork https://github.com/hamishivi/EasyLM
  • Work-in-progress replication with PyTorch on A/H100s


Ivison et al. 2024, Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback. Appearing soon.

* Presented data not final

73 of 86

Many, many data ablations along the way (e.g. DPO)


Ivison et al. 2024, Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback. Appearing soon.

* Presented data not final

74 of 86

PPO vs DPO

on fixed datasets


Ivison et al. 2024, Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback. Appearing soon.

* Presented data not final

75 of 86

Can we match PPO with “online” DPO?


Intro | Background | Path to DPO models | RewardBench | Fine-tuning a model | Online DPO | Conclusions

Singhal et al. 2024. D2PO: Discriminator-Guided DPO with Response Evaluation Models

76 of 86

What is special about online data?

Online data is freshly generated from the policy and/or recently labelled by a reward model / judge.

  • PPO does both with generation + reward model scoring
  • Other methods do this in different ways: collecting new preference data, re-labelling existing data, LLM-as-a-judge, or reward model ranking

Related question: on-policy vs. off-policy data (i.e. whether the data is generated by the current policy model)
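As a rough illustration only (not any specific paper's algorithm), here is a runnable toy sketch of the "online" recipe: regenerate candidate pairs from the current policy each iteration, label them freshly, then apply a standard DPO update. The generate/label/update functions are placeholders:

```python
import random

def online_dpo_loop(prompts, policy, generate, label, dpo_update,
                    num_iters=3, batch_size=2):
    """Online DPO: refresh preference pairs from the current policy every iteration."""
    for _ in range(num_iters):
        batch = random.sample(prompts, batch_size)
        # 1. On-policy generation: two candidates per prompt from the *current* policy.
        candidates = [(x, generate(policy, x), generate(policy, x)) for x in batch]
        # 2. Fresh labels: a reward model or judge decides chosen vs. rejected.
        pairs = [label(x, y1, y2) for x, y1, y2 in candidates]
        # 3. Standard DPO update on the freshly labeled, on-policy data.
        policy = dpo_update(policy, pairs)
    return policy

# Toy stand-ins so the sketch runs end to end; in practice these would be an LLM,
# a reward model / judge, and a real DPO training step.
toy_generate = lambda policy, x: f"{x} -> answer {random.randint(0, 9)}"
toy_label = lambda x, y1, y2: (x, max(y1, y2), min(y1, y2))   # pretend judge
toy_update = lambda policy, pairs: policy                      # pretend optimizer step
online_dpo_loop(["prompt A", "prompt B", "prompt C"], "policy-v0",
                toy_generate, toy_label, toy_update)
```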


77 of 86

Many studies on online data


78 of 86

Methods


79 of 86

D2PO: Minimizing staleness of DPO training data (discriminator-guided DPO)


Singhal et al. 2024. D2PO: Discriminator-Guided DPO with Response Evaluation Models

80 of 86

Evaluating D2PO

When evaluating “online” DPO methods, standard DPO becomes a horizontal line (all data used at once) → much closer to old-school RL learning curves.


Singhal et al. 2024. D2PO: Discriminator-Guided DPO with Response Evaluation Models

Closed-form task: reward = count(nouns)
Open-ended task: reward from an AI-feedback reward model
(legend: re-labelling RM)

81 of 86

Online and/or iterative RLHF

Industry does both; academia has mostly only had a taste of the former.

Examples of the latter: sequential rounds of training or preference collection.


Anthropic’s Claude

Llama 2

82 of 86

Conclusions


Intro | Background | Path to DPO models | RewardBench | Fine-tuning a model | Online DPO | Conclusions

83 of 86

Discussion: What did Meta do with Llama 3?

“Our approach to post-training is a combination of supervised fine-tuning (SFT), rejection sampling, proximal policy optimization (PPO), and direct preference optimization (DPO).”

  • Iterative data collection (like Llama 2)
  • Short timelines for each iteration
  • Some sort of “distribution shift” per method
  • Hypothesis: Rejection sampling, DPO, then PPO
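For the rejection sampling step in that hypothesis, a runnable toy sketch of the basic best-of-n idea follows; the generator and scorer below are placeholder stand-ins for an LLM and a reward model, not Meta's pipeline:

```python
import random

def rejection_sample(generate, score, prompt, n=8):
    """Best-of-n: draw n candidates and keep the highest-reward one, e.g. for further SFT."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda y: score(prompt, y))

# Toy stand-ins so the sketch runs; in practice these are an LLM and a reward model.
toy_generate = lambda x: f"{x} answer {random.randint(0, 9)}"
toy_score = lambda x, y: len(y)  # placeholder "reward"
print(rejection_sample(toy_generate, toy_score, "Explain RLHF:"))
```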


84 of 86

Current directions

  1. Data! Data! Data! We are severely limited in experimentation by having too few preference datasets (Anthropic HH, UltraFeedback, and Nectar are the main three).
  2. Continuing to improve DPO: tons of papers iterating on the method (ORPO, cDPO, IPO, BCO, KTO, DNO, sDPO, etc)
  3. More model sizes: Most alignment research happened at 7 or 13B parameter scale. Expand up and down!
  4. Specific evaluations: How do we get more specific evaluations than ChatBotArena?
  5. Personalization: A large motivation behind local models, young area academically


I cover these topics regularly on my blog www.interconnects.ai

85 of 86

Where open alignment is happening

  • AI2 (self bias): Tulu models, OLMo-Adapt, dataset releases
  • HuggingFaceH4: Quick releases on new base models, recipes for new techniques (e.g. ORPO / CAI), other tools
  • Berkeley-Nest/Nexusflow: Nectar dataset / Starling models
  • NousResearch: Hermes fine-tuned models, datasets, and other resources
  • OpenBMB: Preference datasets, reward models, and more
  • Argilla: Open preference datasets and resulting models
  • Some HuggingFace users
    • Maxime Labonne: Model merging & other fine-tunes
    • Jon Durbin: More model merges & other fine-tunes


I cover these topics regularly on my blog www.interconnects.ai

86 of 86

Thank you! Questions?

Contact: nathan at natolambert dot com

Socials: @natolambert

Writing: interconnects.ai

Thanks to many teammates at HuggingFace and AI2 for supporting this journey!
