Fine-tuning a “good” model with PPO
Nathan Lambert, July 2024
Ivison et al. 2024, Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback
In this talk
Fine-tuning a “good” model
… and trying to answer: is PPO > DPO?
Starting point: SFT
* Presented data not final
Tulu 2 13B as the SFT foundation; evaluation results (table not shown)
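A minimal sketch of this SFT stage, assuming a standard causal-LM cross-entropy setup; the model name and hyperparameters below are placeholders, not the exact Tulu 2 recipe.

# Minimal SFT sketch: next-token cross-entropy on instruction-response pairs.
# Model name and hyperparameters are placeholders, not the exact Tulu 2 recipe.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-13b-hf"  # placeholder base model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

def sft_step(prompt: str, response: str):
    # Concatenate prompt and response; mask the prompt tokens out of the loss
    batch = tok(prompt + response, return_tensors="pt")
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    labels = batch.input_ids.clone()
    labels[:, :prompt_len] = -100  # -100 = ignored by the cross-entropy loss
    loss = model(**batch, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()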
Add DPO
Anthropic HH RLHF preference data (results not shown)
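The DPO objective added at this stage, as a minimal sketch assuming summed per-sequence log-probs under the policy and a frozen reference model; beta and the function shape follow the standard DPO formulation, not this paper's exact code.

# Minimal DPO loss sketch, assuming precomputed per-sequence log-probs for the
# chosen/rejected responses under the policy and a frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between chosen and rejected implicit rewards
    loss = -F.logsigmoid(chosen_rewards - rejected_rewards)
    return loss.mean()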
Add DPO (better data)
UltraFeedback preference data (results not shown)
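A sketch of swapping in the better preference data, assuming the binarized UltraFeedback release on the Hugging Face Hub; the dataset name, split, and field layout are assumptions and should be checked against the dataset card.

# Sketch: load a binarized UltraFeedback-style preference set and flatten it
# into (prompt, chosen, rejected) triples for DPO.
# Dataset name, split, and field layout are assumptions; check the dataset card.
from datasets import load_dataset

ds = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")

def to_triple(example):
    return {
        "prompt": example["prompt"],
        "chosen": example["chosen"][-1]["content"],    # last message = assistant reply
        "rejected": example["rejected"][-1]["content"],
    }

pairs = ds.map(to_triple)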
Switch from DPO to PPO
UltraFeedback preference data (results not shown)
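A minimal sketch of the PPO side, assuming the standard RLHF recipe of a reward-model score plus a per-token KL penalty against the SFT reference, optimized with a clipped policy loss; variable names and shapes are illustrative, not the paper's exact implementation.

# Sketch of PPO-style RLHF: score the sampled response with the reward model,
# subtract a per-token KL penalty against the frozen SFT reference, then apply
# the clipped PPO policy loss. Illustrative, not the paper's exact code.
import torch

def shaped_rewards(rm_score, policy_logprobs, ref_logprobs, kl_coef=0.05):
    # policy_logprobs, ref_logprobs: 1-D per-token log-probs of the sampled response
    kl = policy_logprobs - ref_logprobs          # per-token KL estimate
    rewards = -kl_coef * kl                      # KL penalty at every token
    rewards[-1] = rewards[-1] + rm_score         # RM score added at the final token
    return rewards

def ppo_clip_loss(logprobs, old_logprobs, advantages, clip_eps=0.2):
    ratio = torch.exp(logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()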
Scaling up the reward model
Expectations: General improvements across the board
Reality: Challenging tasks like reasoning improve, others decline
Reality 2: Training a good reward model is not easy
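For reference, the standard pairwise (Bradley-Terry) reward-model loss that this step scales up; a generic sketch of the usual recipe, not the exact setup behind the "not easy" observation.

# Reward-model training sketch: Bradley-Terry pairwise loss on scalar scores
# for the chosen vs. rejected completion. Generic recipe, not the exact setup.
import torch.nn.functional as F

def reward_model_loss(chosen_scores, rejected_scores):
    # chosen_scores / rejected_scores: scalar RM outputs for each preference pair
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()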
Adding more prompts to RLHF
Expectations: General improvements across the board, plus task-specific gains
Reality: Improvements on some code and reasoning subsets, but the process is messy and far from easy
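One way to add more prompts is to mix task-specific prompt pools into the PPO rollout sampler; a sketch with placeholder prompts and mixture weights, not the exact prompt sets used here.

# Sketch: widen the PPO prompt distribution by mixing in task-specific prompt
# pools (e.g., code and math). Prompts and weights below are placeholders.
import random

general_prompts = ["Explain photosynthesis simply."]
code_prompts = ["Write a Python function that reverses a linked list."]
math_prompts = ["If 3x + 5 = 20, what is x?"]

mixture = [(general_prompts, 0.6), (code_prompts, 0.2), (math_prompts, 0.2)]

def sample_prompt():
    pool = random.choices([p for p, _ in mixture],
                          weights=[w for _, w in mixture], k=1)[0]
    return random.choice(pool)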
PPO thoughts & resources
Takeaways
Resources
Many, many data ablations along the way (e.g. DPO)
PPO vs. DPO on fixed datasets
Conclusions
Discussion: What did Meta do with Llama 3?
“Our approach to post-training is a combination of supervised fine-tuning (SFT), rejection sampling, proximal policy optimization (PPO), and direct preference optimization (DPO).”
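Rejection sampling, named in that quote alongside SFT, PPO, and DPO, boils down to best-of-N sampling against a reward model; a minimal generic sketch, not Meta's Llama 3 implementation.

# Best-of-N rejection sampling sketch: sample N responses per prompt, keep the
# one the reward model scores highest, and reuse it as SFT-style training data.
# Generic recipe, not Meta's exact Llama 3 implementation.
def rejection_sample(prompt, generate, reward_model, n=16):
    candidates = [generate(prompt) for _ in range(n)]
    scores = [reward_model(prompt, c) for c in candidates]
    best = candidates[scores.index(max(scores))]
    return best  # add (prompt, best) to the next round of fine-tuning data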
Discussion: What are we missing in the open?
We are not scaling our RLHF pipelines and preference datasets
We are not using “simple” baselines like Rejection Sampling (used in Llama 2, Llama 3, Nemotron, and others)
We are not focused on functional tools and collaboration
We are not focused on meaningful benchmarks (all of AlpacaEval et al. can lie to you)
We are spending too much time on IFT/SFT and not on other things