1 of 21

Fine-tuning a “good” model with PPO

Nathan Lambert, July 2024

Ivison et al. 2024. Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback

2 of 21

In this talk

  • Trying to match closed-lab PPO performance relative to DPO
  • Breaking down the potential gains of PPO (prompts, reward model, etc.)
  • Hypotheses for what open models are doing wrong

3 of 21

Fine-tuning a “good” model

… and trying to answer whether PPO > DPO

4 of 21

Starting point: SFT

Ivison et al. 2024, Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback. Appearing soon.

* Presented data not final

Tulu 2 13B foundation:

  • Llama 2 base
  • Large diverse SFT dataset

Evaluations:

  • Factuality (MMLU)
  • Reasoning (GSM8K, Big Bench Hard)
  • Coding (HumanEval+, MBPP+)
  • Chat (AlpacaEval 1 & 2, IFEval)
  • Safety (ToxiGen, XSTest)
  • Truthfulness (TruthfulQA)
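
Below is a minimal sketch of the SFT objective at this stage: plain next-token cross-entropy on instruction-response pairs, with the loss masked to the response tokens. The function and tensor layout are illustrative assumptions, not the actual Tulu training code.

```python
import torch.nn.functional as F

def sft_loss(logits, labels, prompt_lens):
    """Cross-entropy on response tokens only; prompt tokens are masked out.

    logits: [batch, seq_len, vocab] model outputs.
    labels: [batch, seq_len] token ids (padding already set to -100).
    prompt_lens: list of prompt lengths, one per example.
    """
    # Shift so that position t predicts token t+1.
    shifted_logits = logits[:, :-1, :]
    targets = labels[:, 1:].clone()

    # Mask prompt positions so only response tokens contribute to the loss.
    for i, p in enumerate(prompt_lens):
        targets[i, : max(p - 1, 0)] = -100

    return F.cross_entropy(
        shifted_logits.reshape(-1, shifted_logits.size(-1)),
        targets.reshape(-1),
        ignore_index=-100,
    )
```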

5 of 21

Add DPO

Anthropic HH RLHF data:

  • Small bump in Chat, Safety, Truthfulness
  • An all-human-data baseline
  • Widely accepted to be noisy
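
DPO optimizes the policy directly on these preference pairs, with no separate reward model and no on-policy sampling. A minimal sketch of the loss, assuming per-response log-probabilities have already been summed over tokens; β = 0.1 is a common default, not necessarily the value used in these runs.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO objective: -log sigmoid(beta * (chosen margin - rejected margin)).

    Each argument is a [batch] tensor of summed response log-probs under the
    policy being trained or the frozen reference (SFT) model.
    """
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```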

6 of 21

Add DPO (better data)

UltraFeedback data:

  • Tulu 2 13B DPO model
  • Bigger jumps than with HH RLHF
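
UltraFeedback provides several model completions per prompt with GPT-4 ratings rather than raw human labels, so it has to be binarized into (chosen, rejected) pairs for DPO. A hedged sketch of one common scheme (top-rated completion as chosen, a random lower-rated one as rejected); the exact recipe used for Tulu 2 may differ.

```python
import random

def binarize_example(completions, rng=None):
    """Turn one rated example into a (chosen, rejected) pair for DPO.

    completions: list of dicts like {"text": str, "rating": float}, one per
    sampled response, where higher ratings are better.
    """
    rng = rng or random.Random(0)
    ranked = sorted(completions, key=lambda c: c["rating"], reverse=True)
    chosen = ranked[0]
    # Pick a random lower-rated completion rather than always the worst;
    # both variants appear in practice.
    rejected = rng.choice(ranked[1:])
    return chosen["text"], rejected["text"]
```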

7 of 21

Switch from DPO to PPO

UltraFeedback data:

  • Bump on more metrics (e.g. Factuality)
  • Continues the overall improvement
  • Biggest jump on AlpacaEval 2
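
PPO here is the standard RLHF recipe: sample completions from the current policy, score them with the reward model, subtract a per-token KL penalty against the SFT reference, and update with the clipped surrogate objective. A minimal sketch of the two core pieces; the rollout and value-function machinery is omitted, and the coefficients are illustrative.

```python
import torch

def shaped_rewards(rm_scores, policy_logprobs, ref_logprobs, kl_coef=0.05):
    """Per-token rewards: KL penalty everywhere, RM score added at the last token.

    rm_scores: [batch] scalar reward-model scores for the full completions.
    policy_logprobs / ref_logprobs: [batch, resp_len] per-token log-probs,
    assumed detached (rewards are not backpropagated through).
    """
    rewards = -kl_coef * (policy_logprobs - ref_logprobs)
    rewards[:, -1] += rm_scores
    return rewards

def ppo_policy_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    """Clipped surrogate objective over the sampled completion tokens."""
    ratio = torch.exp(new_logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```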

8 of 21

Scaling up the reward model

Expectations: General improvements across the board

Reality: Challenging tasks like reasoning improve, others decline
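
The reward model behind these runs is trained with the usual Bradley-Terry pairwise objective on preference data, replacing the LM head with a scalar head. A minimal sketch of the loss and the pairwise accuracy typically tracked during training; a single scalar score per completion is an assumed convention, not a detail taken from the paper.

```python
import torch.nn.functional as F

def reward_model_loss(chosen_scores, rejected_scores):
    """Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected).

    chosen_scores / rejected_scores: [batch] scalar reward-head outputs for
    the preferred and dispreferred completion of each prompt.
    """
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

def pairwise_accuracy(chosen_scores, rejected_scores):
    # Fraction of pairs where the preferred completion scores higher.
    return (chosen_scores > rejected_scores).float().mean()
```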

9 of 21

Scaling up the reward model

Reality 2: Training a good reward model is not easy

10 of 21

Adding more prompts to RLHF

Expectations: General improvements across the board plus task-specific gains

Reality: Improvements to some code and reasoning subsets, but it is not easy and the results are messy
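
"More prompts" here means widening the prompt distribution the policy sees during PPO, e.g. mixing code and math prompts in with the UltraFeedback ones. The pools and weights below are purely illustrative; the actual mixture is not specified on this slide.

```python
import random

# Illustrative prompt pools and mixing weights (not the experimental setup).
prompt_pools = {
    "ultrafeedback": ["Explain why the sky is blue.",
                      "Summarize the causes of the French Revolution."],
    "code": ["Write a Python function that reverses a linked list."],
    "math": ["A train travels 60 km in 45 minutes. What is its speed in km/h?"],
}
mixture_weights = {"ultrafeedback": 0.6, "code": 0.2, "math": 0.2}

def sample_prompt_batch(batch_size, rng=None):
    """Draw one PPO prompt batch according to the mixture weights."""
    rng = rng or random.Random(0)
    names = list(prompt_pools)
    weights = [mixture_weights[n] for n in names]
    batch = []
    for _ in range(batch_size):
        source = rng.choices(names, weights=weights, k=1)[0]
        batch.append(rng.choice(prompt_pools[source]))
    return batch
```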

11 of 21

PPO thoughts

Takeaways

  • “Always one more thing to ablate”
  • “PPO gets the best model, but we don’t know why”
  • Generation is very slow without accelerated inference tools (e.g. vLLM); see the sketch below
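
On the generation bottleneck: PPO needs fresh on-policy samples every step, and a plain generation loop inside the training framework is slow. A minimal sketch of offloading rollouts to vLLM; the checkpoint name and sampling settings are illustrative, and keeping vLLM's weights in sync with the policy being trained is the part this sketch glosses over.

```python
from vllm import LLM, SamplingParams

# Illustrative checkpoint; in practice this would be the current policy weights.
llm = LLM(model="allenai/tulu-2-dpo-13b")
sampling = SamplingParams(temperature=1.0, top_p=1.0, max_tokens=512)

def generate_rollouts(prompts):
    """Generate one completion per prompt for the next PPO update."""
    outputs = llm.generate(prompts, sampling)
    return [out.outputs[0].text for out in outputs]
```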

12 of 21

PPO thoughts & resources

Resources

  • All training done on TPUs from the Google TPU Research Cloud
    • Can barely fit a 70B policy + 70B reward model on a v3-512 node
  • Codebase: EasyLM fork https://github.com/hamishivi/EasyLM
  • Work-in-progress replication in PyTorch on A100/H100 GPUs

13 of 21

Many, many data ablations along the way (e.g. DPO)

14 of 21

PPO vs DPO on fixed datasets

15 of 21

Conclusions

Intro | Background | Path to DPO models | RewardBench | Fine-tuning a model | Online DPO | Conclusions

16 of 21

Discussion: What did Meta do with Llama 3?

“Our approach to post-training is a combination of supervised fine-tuning (SFT), rejection sampling, proximal policy optimization (PPO), and direct preference optimization (DPO).”

  • Iterative data collection (like Llama 2)
  • Short timelines for each iteration
  • Some sort of “distribution shift” per method
  • Hypothesis: Rejection sampling, DPO, then PPO

17 of 21

Discussion: What are we missing in the open?

We are not scaling our RLHF pipelines and preference datasets

We are not using “simple” baselines like Rejection Sampling (used in Llama 2, Llama 3, Nemotron, and others)
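
Rejection sampling (best-of-n) is the "simple" baseline referenced above: sample several completions per prompt, keep the one the reward model scores highest, and fine-tune on the winners with the ordinary SFT loss. A minimal sketch with placeholder generate/score callables (not a specific library API).

```python
def best_of_n(prompts, generate, score, n=16):
    """Best-of-n rejection sampling.

    generate(prompt, n) -> list of n completion strings from the policy.
    score(prompt, completion) -> scalar reward-model score.
    Returns (prompt, best completion) pairs to fine-tune on.
    """
    selected = []
    for prompt in prompts:
        candidates = generate(prompt, n)
        best = max(candidates, key=lambda c: score(prompt, c))
        selected.append((prompt, best))
    return selected
```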

19 of 21

Discussion: What are we missing in the open?

We are not focused on functional tools and collaboration

20 of 21

Discussion: What are we missing in the open?

We are not focused on meaningful benchmarks (all of AlpacaEval et al. can lie to you)

21 of 21

Discussion: What are we missing in the open?

We are spending too much time on IFT/SFT and not on other things