1 of 21

Fine-tuning a “good” model with PPO

Nathan Lambert, July 2024

Ivison et al. 2024. Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback

2 of 21

In this talk

  • Trying to match closed-lab PPO performance relative to DPO
  • Breaking down the potential gains of PPO (prompts, reward model, etc.)
  • Hypotheses for what open models are doing wrong

3 of 21

Fine-tuning a “good” model

… and trying to answer whether PPO > DPO

4 of 21

Starting point: SFT

Ivison et al. 2024, Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback. Appearing soon.

* Presented data not final

Tulu 2 13B foundation:

  • Llama 2 base
  • Large diverse SFT dataset

Evaluations:

  • Factuality (MMLU)
  • Reasoning (GSM8K, Big Bench Hard)
  • Coding (HumanEval+, MBPP+)
  • Chat (AlpacaEval 1 & 2, IFEval)
  • Safety (ToxiGen, XSTest)
  • Truthfulness (TruthfulQA)
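
Below is a minimal sketch of the SFT objective at this stage: plain next-token cross-entropy on instruction-response pairs, with the loss masked to the response tokens. The function and tensor layout are illustrative assumptions, not the actual Tulu training code.

```python
import torch.nn.functional as F

def sft_loss(logits, labels, prompt_lens):
    """Cross-entropy on response tokens only; prompt tokens are masked out.

    logits: [batch, seq_len, vocab] model outputs.
    labels: [batch, seq_len] token ids (padding already set to -100).
    prompt_lens: list of prompt lengths, one per example.
    """
    # Shift so that position t predicts token t+1.
    shifted_logits = logits[:, :-1, :]
    targets = labels[:, 1:].clone()

    # Mask prompt positions so only response tokens contribute to the loss.
    for i, p in enumerate(prompt_lens):
        targets[i, : max(p - 1, 0)] = -100

    return F.cross_entropy(
        shifted_logits.reshape(-1, shifted_logits.size(-1)),
        targets.reshape(-1),
        ignore_index=-100,
    )
```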

5 of 21

Add DPO

Anthropic HH RLHF data:

  • Small bump in Chat, Safety, Truthfulness
  • An all-human-data baseline
  • Widely accepted to be noisy
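
DPO optimizes the policy directly on these preference pairs, with no separate reward model and no on-policy sampling. A minimal sketch of the loss, assuming per-response log-probabilities have already been summed over tokens; β = 0.1 is a common default, not necessarily the value used in these runs.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO objective: -log sigmoid(beta * (chosen margin - rejected margin)).

    Each argument is a [batch] tensor of summed response log-probs under the
    policy being trained or the frozen reference (SFT) model.
    """
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```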

6 of 21

Add DPO (better data)

UltraFeedback data:

  • Tulu 2 13B DPO model
  • Bigger jumps than with HH RLHF
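
UltraFeedback provides several model completions per prompt with GPT-4 ratings rather than raw human labels, so it has to be binarized into (chosen, rejected) pairs for DPO. A hedged sketch of one common scheme (top-rated completion as chosen, a random lower-rated one as rejected); the exact recipe used for Tulu 2 may differ.

```python
import random

def binarize_example(completions, rng=None):
    """Turn one rated example into a (chosen, rejected) pair for DPO.

    completions: list of dicts like {"text": str, "rating": float}, one per
    sampled response, where higher ratings are better.
    """
    rng = rng or random.Random(0)
    ranked = sorted(completions, key=lambda c: c["rating"], reverse=True)
    chosen = ranked[0]
    # Pick a random lower-rated completion rather than always the worst;
    # both variants appear in practice.
    rejected = rng.choice(ranked[1:])
    return chosen["text"], rejected["text"]
```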

7 of 21

Switch from DPO to PPO

UltraFeedback data:

  • Bump on more metrics (e.g. Factuality)
  • Continues the overall improvement
  • Biggest jump on AlpacaEval 2
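
PPO here is the standard RLHF recipe: sample completions from the current policy, score them with the reward model, subtract a per-token KL penalty against the SFT reference, and update with the clipped surrogate objective. A minimal sketch of the two core pieces; the rollout and value-function machinery is omitted, and the coefficients are illustrative.

```python
import torch

def shaped_rewards(rm_scores, policy_logprobs, ref_logprobs, kl_coef=0.05):
    """Per-token rewards: KL penalty everywhere, RM score added at the last token.

    rm_scores: [batch] scalar reward-model scores for the full completions.
    policy_logprobs / ref_logprobs: [batch, resp_len] per-token log-probs,
    assumed detached (rewards are not backpropagated through).
    """
    rewards = -kl_coef * (policy_logprobs - ref_logprobs)
    rewards[:, -1] += rm_scores
    return rewards

def ppo_policy_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    """Clipped surrogate objective over the sampled completion tokens."""
    ratio = torch.exp(new_logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```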

8 of 21

Scaling up the reward model

Expectations: General improvements across the board

Reality: Challenging tasks like reasoning improve, others decline
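
The reward model behind these runs is trained with the usual Bradley-Terry pairwise objective on preference data, replacing the LM head with a scalar head. A minimal sketch of the loss and the pairwise accuracy typically tracked during training; a single scalar score per completion is an assumed convention, not a detail taken from the paper.

```python
import torch.nn.functional as F

def reward_model_loss(chosen_scores, rejected_scores):
    """Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected).

    chosen_scores / rejected_scores: [batch] scalar reward-head outputs for
    the preferred and dispreferred completion of each prompt.
    """
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

def pairwise_accuracy(chosen_scores, rejected_scores):
    # Fraction of pairs where the preferred completion scores higher.
    return (chosen_scores > rejected_scores).float().mean()
```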

9 of 21

Scaling up the reward model

Reality 2: Training a good reward model is not easy

10 of 21

Adding more prompts to RLHF

Expectations: General improvements across the board plus task-specific gains

Reality: Improvements to some code and reasoning subsets, but it is not easy and the results are messy
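
"More prompts" here means widening the prompt distribution the policy sees during PPO, e.g. mixing code and math prompts in with the UltraFeedback ones. The pools and weights below are purely illustrative; the actual mixture is not specified on this slide.

```python
import random

# Illustrative prompt pools and mixing weights (not the experimental setup).
prompt_pools = {
    "ultrafeedback": ["Explain why the sky is blue.",
                      "Summarize the causes of the French Revolution."],
    "code": ["Write a Python function that reverses a linked list."],
    "math": ["A train travels 60 km in 45 minutes. What is its speed in km/h?"],
}
mixture_weights = {"ultrafeedback": 0.6, "code": 0.2, "math": 0.2}

def sample_prompt_batch(batch_size, rng=None):
    """Draw one PPO prompt batch according to the mixture weights."""
    rng = rng or random.Random(0)
    names = list(prompt_pools)
    weights = [mixture_weights[n] for n in names]
    batch = []
    for _ in range(batch_size):
        source = rng.choices(names, weights=weights, k=1)[0]
        batch.append(rng.choice(prompt_pools[source]))
    return batch
```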

11 of 21

PPO thoughts

Takeaways

  • “Always one more thing to ablate”
  • “PPO gets the best model, but we don’t know why”
  • Generation is very slow without accelerated inference tools (e.g. vLLM); see the sketch below
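
On the generation bottleneck: PPO needs fresh on-policy samples every step, and a plain generation loop inside the training framework is slow. A minimal sketch of offloading rollouts to vLLM; the checkpoint name and sampling settings are illustrative, and keeping vLLM's weights in sync with the policy being trained is the part this sketch glosses over.

```python
from vllm import LLM, SamplingParams

# Illustrative checkpoint; in practice this would be the current policy weights.
llm = LLM(model="allenai/tulu-2-dpo-13b")
sampling = SamplingParams(temperature=1.0, top_p=1.0, max_tokens=512)

def generate_rollouts(prompts):
    """Generate one completion per prompt for the next PPO update."""
    outputs = llm.generate(prompts, sampling)
    return [out.outputs[0].text for out in outputs]
```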

12 of 21

PPO thoughts & resources

Resources

  • All training done on TPUs from the Google TPU Research Cloud
    • Can barely fit a 70B policy + 70B reward model on a v3-512 node
  • Codebase: EasyLM fork https://github.com/hamishivi/EasyLM
  • Work-in-progress replication in PyTorch on A100/H100 GPUs

13 of 21

Many, many data ablations along the way (e.g. DPO)

14 of 21

PPO vs DPO on fixed datasets

15 of 21

Conclusions

Intro | Background | Path to DPO models | RewardBench | Fine-tuning a model | Online DPO | Conclusions

16 of 21

Discussion: What did Meta do with Llama 3?

“Our approach to post-training is a combination of supervised fine-tuning (SFT), rejection sampling, proximal policy optimization (PPO), and direct preference optimization (DPO).”

  • Iterative data collection (like Llama 2)
  • Short timelines for each iteration
  • Some sort of “distribution shift” per method
  • Hypothesis: Rejection sampling, DPO, then PPO

17 of 21

Discussion: What are we missing in the open?

We are not scaling our RLHF pipelines and preference datasets

We are not using “simple” baselines like Rejection Sampling (used in Llama 2, Llama 3, Nemotron, and others)
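
Rejection sampling (best-of-n) is the "simple" baseline referenced above: sample several completions per prompt, keep the one the reward model scores highest, and fine-tune on the winners with the ordinary SFT loss. A minimal sketch with placeholder generate/score callables (not a specific library API).

```python
def best_of_n(prompts, generate, score, n=16):
    """Best-of-n rejection sampling.

    generate(prompt, n) -> list of n completion strings from the policy.
    score(prompt, completion) -> scalar reward-model score.
    Returns (prompt, best completion) pairs to fine-tune on.
    """
    selected = []
    for prompt in prompts:
        candidates = generate(prompt, n)
        best = max(candidates, key=lambda c: score(prompt, c))
        selected.append((prompt, best))
    return selected
```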

19 of 21

Discussion: What are we missing in the open?

We are not focused on functional tools and collaboration

20 of 21

Discussion: What are we missing in the open?

We are not focused on meaningful benchmarks (all of AlpacaEval et al. can lie to you)

21 of 21

Discussion: What are we missing in the open?

We are spending too much time on IFT/SFT and not on other things