Open General-purpose Language Model Adaptation
Pradeep Dasigi
Ai2
DLCT 02/14/2025
Thanks to Nathan Lambert for some of the slides
Language model adaptation
Lambert - Post-training Tutorial
The raw pre-trained LMs are neither safe nor robust for public use and interaction, and thus require “alignment” between AI and humans.
Follow natural language instructions
Be aware of harmful behaviors
Respond according to human preference
Improve core skills
e.g.: Llama 3 70B
e.g.: Llama 3 Instruct 70B
We need fully open adaptation procedures
Tülu (Jun 2023): Instruction tuning on open resources at scale (Llama 65B).
Tülu 2 (Nov 2023): Instruction tuning + DPO on open resources at scale (Llama 2 70B).
We need fully open adaptation procedures
Tülu 3 (Nov 2024): Instruction tuning + DPO + novel RLVR on existing and new open resources at scale (Llama 3.1 405B).
Tülu 3 on Llama 3.1 is one of the best general models
Recipe works at 405B too
Tülu 3: Our current best recipe
Reliable evaluations for each skill
During development: hill climb on reliable evaluations and compare against prior work.
But how to ensure we are not overfitting to those evaluations?
Separate unseen evaluations
Our solution: Separate set of unseen evaluations run only at the end of development.
Tülu 3: Our current best recipe
Curate targeted sets of prompts for training
| Skill | Prompt sources |
|---|---|
| Knowledge recall | FLAN v2; SciRIFF; TableGPT |
| Math and reasoning | OpenMathInstruct 2; NuminaMath |
| Coding | Evol CodeAlpaca |
| Safety and non-compliance | CoCoNot; WildJailbreak; WildGuardMix |
| Multilinguality | Aya |
| General | OpenAssistant; NoRobots; WildChat; UltraFeedback |
Curate targeted sets of prompts for training
PersonaHub: https://github.com/tencent-ailab/persona-hub
Diverse skill-specific data by prompting GPT-4o to generate prompts with synthetic personas (Ge et al., 2024).
Curate targeted sets of prompts for training
Exact full-prompt matches: too strict
Embedding-based matches: hard to distinguish between contamination and distributional similarity
N-gram matching with heuristics: useful middle-ground
≥50% of test instance tokens have 8-gram overlap with the training instance ⇒ match
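A minimal sketch of this heuristic, assuming whitespace tokenization and illustrative function names (not the actual Tülu 3 decontamination tooling):

```python
# Sketch of the n-gram decontamination heuristic described above: flag a test
# instance if >= 50% of its tokens are covered by 8-grams that also occur in
# a training instance. Tokenization and names are illustrative assumptions.

def ngrams(tokens, n=8):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def is_contaminated(test_tokens, train_tokens, n=8, threshold=0.5):
    train_grams = set(ngrams(train_tokens, n))
    covered = [False] * len(test_tokens)
    for i, gram in enumerate(ngrams(test_tokens, n)):
        if gram in train_grams:
            for j in range(i, i + n):   # mark every token inside the matching 8-gram
                covered[j] = True
    return bool(test_tokens) and sum(covered) / len(test_tokens) >= threshold

test = "what is the capital of france and when was it founded exactly".split()
train = "what is the capital of france and when was it founded according to records".split()
print(is_contaminated(test, train))  # True: most test tokens fall inside shared 8-grams
```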
Curate targeted sets of prompts for training
Many public datasets have high overlaps with popular benchmarks!
Especially those containing real conversations with chatbots.
Training objectives
Lambert - Post-training Tutorial
[Figure: post-training stages and typical data scales, on the order of ~1M, ~1M, and ~10–100K examples]
Tülu 3: Our current best recipe
The role of instruction tuning
Accomplishes two primary tasks:
Lambert - Post-training Tutorial
<|system|>
You’re a helpful agent
<|end|>
<|user|>
{query}
<|end|>
<|assistant|>{Answer goes here}
(Template annotations: system prompt, special tokens)
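As an illustration, the template above can be rendered with a small helper like this. It is a sketch following the special tokens on the slide; released chat models usually define their exact template in the tokenizer (e.g., via Hugging Face's `apply_chat_template`), which may differ in details.

```python
# Minimal sketch: render a single-turn conversation into the special-token
# format shown above. The default system prompt and function name are
# illustrative; real checkpoints define the exact template in their tokenizer.

def render_prompt(query: str, system: str = "You're a helpful agent") -> str:
    return (
        f"<|system|>\n{system}\n<|end|>\n"
        f"<|user|>\n{query}\n<|end|>\n"
        f"<|assistant|>"  # the model's answer is generated after this tag
    )

print(render_prompt("Write a haiku about AI"))
```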
Data mixing for SFT
Training on real user interactions with strong models is helpful almost across the board.
Safety training is largely orthogonal to the other skills.
Persona-based data synthesis is very useful for targeting new skills.
Size of the SFT dataset
We used ~1M prompts for SFT since gains have not plateaued at smaller sizes.
Tülu 3: Our current best recipe
Preference finetuning
Aligning to human preferences gives:
Preference judgments
Input: Write a haiku about AI
Output 1: Sure, here’s a haiku: …
Output 2: Sorry, I cannot help you with that.
RLHF objective
Lambert - Post-training Tutorial
$$\max_{\pi}\;\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi(\cdot \mid x)}\big[\, r(x, y) \,\big]\;-\;\beta\,\mathbb{D}_{\mathrm{KL}}\!\big(\pi(y \mid x)\,\|\,\pi_{\theta}(y \mid x)\big)$$
π: LLM policy; π_θ: base LLM; x: prompt; y: completion; r: reward model trained from human preferences; β: strength of the KL constraint
▲ The first term optimizes a “reward” inspired by human preferences.
▲ The KL term constrains the model to not trust the reward too much (preferences are hard to model).
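A toy sketch of how this objective is typically realized during training: the reward-model score is shaped by a per-token KL penalty against the base model (variable names and the value of beta are illustrative):

```python
import torch

# Toy sketch of the KL-constrained reward used in RLHF-style training:
# use the learned reward r(x, y), but penalize drifting from the base LM.
def kl_shaped_reward(reward_score, policy_logprobs, base_logprobs, beta=0.05):
    # *_logprobs: per-token log-likelihoods of the sampled completion, shape [T]
    kl_per_token = policy_logprobs - base_logprobs   # estimate of log(pi / pi_theta)
    return reward_score - beta * kl_per_token.sum()

reward = torch.tensor(1.3)                           # r(x, y) from the reward model
pi_lp = torch.tensor([-1.2, -0.7, -2.1])             # log-probs under the policy
base_lp = torch.tensor([-1.5, -0.9, -2.0])           # log-probs under the base LM
print(kl_shaped_reward(reward, pi_lp, base_lp))      # tensor(1.2800)
```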
RL and direct optimization
Proximal Policy Optimization (PPO; Schulman et al., 2017) first trains a reward model and then uses RL to optimize the policy to maximize those rewards.
Direct Preference Optimization (DPO; Rafailov et al., 2024) directly optimizes the policy on the preference dataset; no explicit reward model.
SimPO (Meng et al., 2024) does not use a reference model.
Length-normalized DPO normalizes log-likelihoods of preferred and rejected responses by their lengths.
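For reference, a sketch of the DPO loss and a length-normalized variant, written from the standard formulations rather than the exact Tülu 3 training code (beta and the toy numbers are illustrative):

```python
import torch
import torch.nn.functional as F

# Sketch of DPO and length-normalized DPO for one (chosen, rejected) pair.
# pol_* / ref_*: summed log-likelihood of each response under the policy /
# reference model; *_len: number of response tokens (length-normalized variant).

def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    return -F.logsigmoid(beta * margin)

def length_norm_dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected,
                         chosen_len, rejected_len, beta=0.1):
    # Normalize each summed log-likelihood by its response length before the margin.
    margin = ((pol_chosen - ref_chosen) / chosen_len
              - (pol_rejected - ref_rejected) / rejected_len)
    return -F.logsigmoid(beta * margin)

# Toy numbers: the chosen response is relatively more likely under the policy.
print(dpo_loss(torch.tensor(-12.0), torch.tensor(-15.0),
               torch.tensor(-13.0), torch.tensor(-14.0)))   # ~0.598
```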
PPO consistently outperforms DPO, but at the cost of a more complex and compute-heavy training setup (separate reward and value models).
You can normally get a ~1% improvement by switching from DPO to PPO.
Lambert - Post-training Tutorial
DPO vs RL (PPO, REINFORCE, …)
Choice of preference tuning algorithm for Tülu 3
| Method | Avg. score |
|---|---|
| SFT base | 55.7 |
| Best PPO (trained on UltraFeedback) | 55.5 |
| Best SimPO (trained on UltraFeedback) | 52.9 |
| Best DPO (trained on UltraFeedback) | 55.2 |
| Best length-norm DPO (trained on UltraFeedback) | 57.3 |
Proximal Policy Optimization (PPO; Schulman et al., 2017) uses RL with a separate reward model.
Direct Preference Optimization (DPO; Rafailov et al., 2024) optimizes the policy directly on the preference dataset; no explicit reward model.
Length-normalized DPO normalizes log-likelihoods of preferred and rejected responses by their lengths.
SimPO (Meng et al., 2024) does not use a reference model.
Preference dataset
Findings:
Prompt diversity > response diversity for scaling.
Combination of on-policy and off-policy data works best for DPO.
gpt-4o as preference judge is only slightly better than gpt-4o-mini, Llama 3.1 70B, and Llama 3.1 405B.
Tülu 3: Our current best recipe
The main idea behind RLVR
What does preference tuning provide? (1) A reward model that approximates human preferences, and (2) an RL procedure for optimizing the policy against a reward.
For tasks where there is a verifiable ground truth, (1) is not needed, but (2) is still useful.
We can still use PPO, but no need to model rewards.
Standard RLHF (with RL details)
Lambert - Post-training Tutorial
Lambert, Nathan et al. 2024. Tülu 3.
RL with verifiable rewards details
Lambert - Post-training Tutorial
1. We do not use a reward model! Just “environment” reward.
2. Value model is initialized from an SFT / reward model checkpoint, not randomly initialized.
Lambert, Nathan et al. 2024. Tülu 3.
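A minimal sketch of what such an “environment” reward can look like for a math prompt with a known answer. The answer-extraction convention and function name are assumptions for illustration, not the actual Tülu 3 verifier:

```python
import re

# Sketch of an RLVR-style verifiable reward: no learned reward model, just a
# programmatic check of the completion's final answer against the ground truth.
def math_reward(completion: str, gold_answer: str) -> float:
    # Illustrative convention: the model states "the answer is <value>".
    match = re.search(r"answer is\s*(-?[\d,\.]+)", completion.lower())
    if match is None:
        return 0.0
    predicted = match.group(1).replace(",", "").rstrip(".")
    return 1.0 if predicted == gold_answer else 0.0

print(math_reward("Step by step ... so the answer is 42.", "42"))  # 1.0
print(math_reward("I think it is probably 41.", "42"))             # 0.0
```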
Implementation details and findings
Evaluating the pipeline on development benchmarks
SFT -> DPO -> RLVR improves performance on average.
DPO helps strongly with factuality (TruthfulQA) and instruction following (IFEval and AlpacaEval).
RLVR helps with math and precise instruction following, without significantly hurting other skills.
Evaluating the pipeline on unseen benchmarks
Trends generally hold, showing that the overall pipeline generalizes well.
RLVR generalizes to unseen math and IF evaluations.
Data mix overfits to IFEval
IFEval Example: Write a poem about how I am missing my classes… It should have 4 sections marked with SECTION X.
IFEval-OOD Example: Write an imaginative screenplay about … Use an emoji at the end of each sentence.
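Constraints like these are exactly what a program can verify. Below is a toy checker for the section-marker constraint in the first example (illustrative only, not the actual IFEval evaluation code); the emoji constraint could be checked with a similar rule.

```python
import re

# Toy verifier for the "4 sections marked with SECTION X" constraint above.
# The regex and function name are illustrative, not the official IFEval checker.
def has_n_sections(text: str, n: int = 4) -> bool:
    return len(re.findall(r"\bSECTION \d+\b", text)) >= n

poem = "SECTION 1 Empty desks ...\nSECTION 2 ...\nSECTION 3 ...\nSECTION 4 ..."
print(has_n_sections(poem))  # True
```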
Satisfying hard constraints is surprisingly hard for LMs
| Model | IFEval | IFEval-OOD |
|---|---|---|
| Llama 3.1 70B Instruct | 88.0 | 34.5 |
| Hermes 3 70B | 76.0 | 24.6 |
| Tülu 3 70B | 83.2 | 27.8 |
A lot of the progress on the IFEval benchmark may be due to overfitting.
Open Resources
Scientific value of large projects
Evaluating promising ideas in practically useful settings. Some things we learned:
Some things we explored that did not make it to the final recipe:
Scientific value of large projects
Identifying new research problems. Some next steps: