1 of 41

Open General-purpose Language Model Adaptation

Pradeep Dasigi

Ai2

DLCT 02/14/2025

Thanks to Nathan Lambert for some of the slides

2 of 41

Language model adaptation

Lambert - Post-training Tutorial

Raw pre-trained LMs are neither safe nor robust for public use and interaction; they require “alignment” between AI and humans.

Follow natural language instructions

Be aware of harmful behaviors

Respond according to human preference

Improve core skills

e.g.: Llama 3 70B (base) → Llama 3 Instruct 70B (adapted)

3 of 41

We need fully open adaptation procedures

Tülu (Jun 2023): Instruction tuning on open resources at scale (Llama 65B).

Tülu 2 (Nov 2023): Instruction tuning + DPO on open resources at scale (Llama 2 70B).

4 of 41

We need fully open adaptation procedures

Tülu 3 (Nov 2024): Instruction tuning + DPO + novel RLVR on existing and new open resources at scale (Llama 3.1 405B).

5 of 41

Tülu 3 on Llama 3.1 is one of the best general models

6 of 41

Recipe works at 405B too

7 of 41

Tülu 3: Our current best recipe

8 of 41

Reliable evaluations for each skill

During development: hill climb on reliable evaluations and compare against prior work.

But how to ensure we are not overfitting to those evaluations?

9 of 41

Separate unseen evaluations

During development: hill climb on reliable evaluations and compare against prior work.

But how to ensure we are not overfitting to those evaluations?

Our solution: Separate set of unseen evaluations run only at the end of development.

10 of 41

Tülu 3: Our current best recipe

11 of 41

Curate targeted sets of prompts for training

  1. Find relevant public datasets.

Knowledge recall: FLAN v2; SciRIFF; TableGPT
Math and reasoning: OpenMathInstruct 2; NuminaMath
Coding: Evol CodeAlpaca
Safety and non-compliance: CoCoNot; WildJailbreak; WildGuardMix
Multilinguality: Aya
General: OpenAssistant; NoRobots; WildChat; UltraFeedback

12 of 41

Curate targeted sets of prompts for training

  • Find relevant public datasets.
  • Synthesize data to fill gaps.

Diverse skill-specific data by prompting GPT-4o to generate prompts with synthetic personas (Ge et al., 2024).

13 of 41

Curate targeted sets of prompts for training

  • Find relevant public datasets.
  • Synthesize data to fill gaps.

Diverse skill-specific data by prompting GPT-4o to generate prompts with synthetic personas (Ge et al., 2024).
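As a rough illustration of this persona-driven synthesis, here is a minimal sketch assuming the openai Python client and an API key in the environment; the personas and the instruction wording are made up for this example and are not the actual Tülu 3 pipeline.

```python
# Sketch of persona-driven prompt synthesis. Assumptions: the `openai` Python
# client, an API key in the environment, and made-up personas; not the actual
# Tulu 3 data pipeline.
from openai import OpenAI

client = OpenAI()

personas = [
    "a high-school math teacher preparing word problems",
    "a data engineer who writes SQL and Python daily",
]

def synthesize_prompt(persona: str, skill: str = "math reasoning") -> str:
    """Ask GPT-4o for one new training prompt written from a persona's point of view."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": (
                f"Assume you are {persona}. Write one challenging {skill} "
                "problem that such a person might ask an AI assistant. "
                "Return only the problem."
            ),
        }],
        temperature=1.0,
    )
    return response.choices[0].message.content

new_prompts = [synthesize_prompt(p) for p in personas]
```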

14 of 41

Curate targeted sets of prompts for training

  • Find relevant public datasets.
  • Synthesize data to fill gaps.
  • Decontaminate against evaluation suite.

Exact full-prompt matches: too strict

Embedding-based matches: hard to distinguish between contamination and distributional similarity

N-gram matching with heuristics: useful middle-ground

≥50% of test instance tokens have 8-gram overlap with the training instance ⇒ match
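A minimal sketch of this heuristic, assuming simple whitespace tokenization; the actual Tülu 3 decontamination tooling may tokenize and index differently.

```python
# Sketch of the 8-gram / 50%-token-overlap heuristic (assumes whitespace
# tokenization; the real tooling may differ).

def ngram_set(tokens, n=8):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(test_text: str, train_text: str,
                    n: int = 8, threshold: float = 0.5) -> bool:
    """Flag a match if >= `threshold` of the test instance's tokens are covered
    by n-grams that also occur in the training instance."""
    test_tokens = test_text.split()
    if len(test_tokens) < n:
        return False
    train_ngrams = ngram_set(train_text.split(), n)

    covered = [False] * len(test_tokens)
    for i in range(len(test_tokens) - n + 1):
        if tuple(test_tokens[i:i + n]) in train_ngrams:
            covered[i:i + n] = [True] * n
    return sum(covered) / len(test_tokens) >= threshold
```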

15 of 41

Curate targeted sets of prompts for training

  • Find relevant public datasets.
  • Synthesize data to fill gaps.
  • Decontaminate against evaluation suite.

Many public datasets have high overlap with popular benchmarks!

Especially those containing real conversations with chatbots.

16 of 41

Curate targeted sets of prompts for training

17 of 41

Training objectives

  1. Supervised Finetuning (~1M examples) – teaches formatting and forms the base of instruction-following abilities.
  2. Preference Finetuning (~1M examples) – aligns the model to human preferences (with a smaller bump in capabilities).
  3. Reinforcement Finetuning (~10-100K examples) – final stage to boost performance on verifiable tasks.

Lambert - Post-training Tutorial

18 of 41

Tülu 3: Our current best recipe

19 of 41

The role of instruction tuning

Accomplishes two primary tasks:

  1. Adapt the base model to the input format used in chat interactions.
  2. Enable system prompts, multi-turn dialogues, and other chat-template features.

Lambert - Post-training Tutorial

<|system|>
You’re a helpful agent
<|end|>
<|user|>
{query}
<|end|>
<|assistant|>{Answer goes here}

(Special tokens such as <|system|>, <|user|>, <|assistant|>, and <|end|> mark the system prompt and turn boundaries.)
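A minimal sketch of assembling that template into a single training/inference string; the token layout follows the slide and may differ from the released Tülu chat template.

```python
# Illustrative template assembly using the special tokens shown above
# (the released Tulu chat template may differ in details).

def format_chat(messages: list[dict], add_generation_prompt: bool = True) -> str:
    """messages: [{"role": "system"|"user"|"assistant", "content": str}, ...]"""
    parts = []
    for m in messages:
        if m["role"] == "assistant":
            # Assistant text follows its token directly, as on the slide.
            parts.append(f"<|assistant|>{m['content']}\n<|end|>\n")
        else:
            parts.append(f"<|{m['role']}|>\n{m['content']}\n<|end|>\n")
    if add_generation_prompt:
        parts.append("<|assistant|>")  # the model's completion starts here
    return "".join(parts)

print(format_chat([
    {"role": "system", "content": "You're a helpful agent"},
    {"role": "user", "content": "Write a haiku about AI"},
]))
```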

20 of 41

Data mixing for SFT

Training on real user interactions with strong models is helpful almost across the board.

Safety training is largely orthogonal to the other skills.

Persona-based data synthesis is very useful for targeting new skills.

21 of 41

Size of the SFT dataset

We used ~1M prompts for SFT since gains have not plateaued at smaller sizes.

22 of 41

Tülu 3: Our current best recipe

23 of 41

Preference finetuning

Aligning to human preferences gives:

  • Stronger training influence on style and chat evaluations (e.g. ChatBotArena).
  • Continued building of the skills learned in SFT, but with a lower absolute magnitude of improvement.

Preference judgments

Input: Write a haiku about AI

Output 1: Sure, here’s a haiku: …

Output 2: Sorry, I cannot help you with that.
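For concreteness, such a judgment is typically stored as a chosen/rejected record; the field names below are illustrative, not the exact schema of the Tülu 3 preference mixes.

```python
# Illustrative preference record (field names are an assumption, not the exact
# schema of the Tulu 3 preference mixes).
preference_example = {
    "prompt": "Write a haiku about AI",
    "chosen": "Sure, here's a haiku: ...",               # preferred (Output 1)
    "rejected": "Sorry, I cannot help you with that.",   # dispreferred (Output 2)
}
```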

24 of 41

RLHF objective

Lambert - Post-training Tutorial

π_θ: LLM policy being optimized
π_ref: reference (base) LLM
x: prompt
y: completion

\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[ r(x, y) \big] \;-\; \beta \, \mathbb{D}_{\mathrm{KL}}\big[ \pi_\theta(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x) \big]

Optimize a “reward” inspired by human preferences, while constraining the model not to trust the reward too much (preferences are hard to model).

25 of 41

RL and direct optimization

Proximal Policy Optimization (PPO; Schulman et al., 2017) first trains a reward model and then uses RL to optimize the policy to maximize those rewards.

Direct Preference Optimization (DPO; Rafailov et al., 2024) directly optimizes the policy on the preference dataset; no explicit reward model.

SimPO (Meng et al., 2024) does not use a reference model.

Length-normalized DPO normalizes log-likelihoods of preferred and rejected responses by their lengths.
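A minimal sketch of a length-normalized DPO loss as described above, assuming summed per-response log-probabilities are available for both the policy and the reference model; this is a paraphrase of the variant, not necessarily the exact Tülu 3 implementation.

```python
# Sketch of a length-normalized DPO loss: summed log-likelihoods of the chosen
# and rejected responses are divided by response length (for both policy and
# reference) before the usual DPO logistic loss. Paraphrase of the variant,
# not necessarily the exact Tulu 3 implementation.
import torch
import torch.nn.functional as F

def length_normalized_dpo_loss(
    policy_chosen_logps: torch.Tensor,    # (batch,) sum of log p_theta over chosen tokens
    policy_rejected_logps: torch.Tensor,  # (batch,) sum over rejected tokens
    ref_chosen_logps: torch.Tensor,       # (batch,) same sums under the reference model
    ref_rejected_logps: torch.Tensor,
    chosen_lengths: torch.Tensor,         # (batch,) token counts of chosen responses
    rejected_lengths: torch.Tensor,
    beta: float = 0.1,                    # trade-off hyperparameter
) -> torch.Tensor:
    # Length-normalize before forming the preference margin.
    pi_logratio = (policy_chosen_logps / chosen_lengths
                   - policy_rejected_logps / rejected_lengths)
    ref_logratio = (ref_chosen_logps / chosen_lengths
                    - ref_rejected_logps / rejected_lengths)
    logits = beta * (pi_logratio - ref_logratio)
    # Standard DPO logistic loss on the (length-normalized) margin.
    return -F.logsigmoid(logits).mean()
```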

26 of 41

DPO vs RL (PPO, REINFORCE, …)

PPO consistently outperforms DPO, but at the cost of:

  • Implementation complexity
  • Memory usage, and
  • Throughput

Typically, switching from DPO to PPO yields a ~1% improvement.

Lambert - Post-training Tutorial

27 of 41

Choice of preference tuning algorithm for Tülu 3

SFT Base: 55.7

Trained on UltraFeedback:
  Best PPO: 55.5
  Best SimPO: 52.9
  Best DPO: 55.2
  Best Length-normalized DPO: 57.3

Proximal Policy Optimization (PPO; Schulman et al., 2017) uses RL with a separate reward model.

Direct Preference Optimization (DPO; Rafailov et al., 2024) optimizes the policy directly on the preference dataset; no explicit reward model.

Length-normalized DPO normalizes log-likelihoods of preferred and rejected responses by their lengths.

SimPO (Meng et al., 2024) does not use a reference model.

28 of 41

Preference dataset

Findings:

  • Prompt diversity > response diversity for scaling.
  • A combination of on-policy and off-policy data works best for DPO.
  • gpt-4o as judge is only slightly better than gpt-4o-mini, Llama 3.1 70B, and 405B.
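A rough sketch of how on-policy and off-policy completions can be combined and ranked by an LLM judge; the generator and judge callables are hypothetical placeholders, not Tülu 3 code.

```python
# Rough sketch of building one preference pair from on- and off-policy
# completions with an LLM judge. The generator and judge callables are
# hypothetical placeholders, not Tulu 3 code.
from typing import Callable, Sequence

def build_preference_pair(
    prompt: str,
    generators: Sequence[Callable[[str], str]],            # on- and off-policy samplers
    judge: Callable[[str, Sequence[str]], Sequence[str]],  # returns candidates best-to-worst
) -> dict:
    """Sample one completion per generator, rank them with the judge, and keep
    the best/worst as the chosen/rejected pair."""
    candidates = [generate(prompt) for generate in generators]
    ranked = judge(prompt, candidates)
    return {"prompt": prompt, "chosen": ranked[0], "rejected": ranked[-1]}
```

In practice, the generators could include the model being trained (on-policy) and other open models (off-policy), with gpt-4o, gpt-4o-mini, or Llama 3.1 70B acting as the judge.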

29 of 41

Tülu 3: Our current best recipe

30 of 41

The main idea behind RLVR

What does preference tuning provide?

  1. A training signal when the ground truth is unclear or hard to specify.
  2. A targeted signal about the kinds of mistakes the model could make (for on-policy learning).

For tasks with a verifiable ground truth, (1) is not applicable, but (2) is still useful.

We can still use PPO, but there is no need to model rewards.
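For example, a verifiable reward for math-style data can be a simple exact-match check; the answer-extraction format below is an assumption, and the actual Tülu 3 verifiers are task-specific.

```python
# Sketch of a verifiable reward for math-style prompts. The 'Answer: ...'
# extraction format is an assumption; the actual Tulu 3 verifiers are
# task-specific.
import re

def verifiable_reward(completion: str, ground_truth: str) -> float:
    """Binary reward: 1.0 if the extracted final answer matches the ground
    truth exactly, else 0.0."""
    match = re.search(r"Answer:\s*(.+)", completion)
    if match is None:
        return 0.0
    predicted = match.group(1).strip()
    return 1.0 if predicted == ground_truth.strip() else 0.0
```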

31 of 41

Standard RLHF (with RL details)

Lambert - Post-training Tutorial

Lambert, Nathan et al. 2024. Tülu 3.

32 of 41

RL with verifiable rewards details

Lambert - Post-training Tutorial

1. We do not use a reward model! Just the “environment” reward.

2. The value model is initialized from an SFT / reward model checkpoint, not randomly initialized.

Lambert, Nathan et al. 2024. Tülu 3.

33 of 41

Implementation details and findings

  • Targeted specific skills with (close to) in-distribution training sets: MATH, GSM8K, synthetic IFEval-like data.
  • Initializing the value function from a general reward model works better than initializing it from the SFT model.
  • Not using a reward model worked better.
  • Starting RLVR from the SFT model vs. the DPO model converged to the same rewards, but test performance was better when starting from DPO.

34 of 41

Evaluating the pipeline on development benchmarks

SFT -> DPO -> RLVR improves performance on average.

DPO helps strongly with factuality (TruthfulQA) and instruction following (IFEval and AlpacaEval).

RLVR helps with math, and precise instruction following without significantly hurting others.

35 of 41

Evaluating the pipeline on unseen benchmarks

Trends generally hold showing the overall pipeline generalizes well.

RLVR generalizes to unseen math and IF evaluations.

36 of 41

Data mix overfits to IFEval

IFEval Example: Write a poem about how I am missing my classes… It should have 4 sections marked with SECTION X.

IFEval-OOD Example: Write an imaginative screenplay about … Use an emoji at the end of each sentence.
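Constraints like these are checkable programmatically, which is what makes them usable as verifiable rewards; below is a rough sketch of checkers for the two examples (the official IFEval / IFEval-OOD verifiers are more thorough).

```python
# Rough checkers for the two constraints above (the official IFEval /
# IFEval-OOD verifiers are more thorough; emoji detection here is a crude
# heuristic over common emoji code-point ranges).
import re

_EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def has_n_sections(text: str, n: int = 4) -> bool:
    """IFEval example: sections marked 'SECTION 1', 'SECTION 2', ..."""
    return len(re.findall(r"\bSECTION \d+\b", text)) >= n

def each_sentence_ends_with_emoji(text: str) -> bool:
    """IFEval-OOD example: an emoji at the end of each sentence."""
    sentences = [s.strip(" .!?\n") for s in re.split(r"(?<=[.!?])\s+", text)]
    sentences = [s for s in sentences if s]
    return bool(sentences) and all(_EMOJI.search(s[-1]) for s in sentences)
```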

37 of 41

Satisfying hard constraints is surprisingly hard for LMs

Model                     IFEval    IFEval-OOD
Llama 3.1 70B Instruct     88.0       34.5
Hermes 3 70B               76.0       24.6
Tülu 3 70B                 83.2       27.8

A lot of the progress on the IFEval benchmark may be due to overfitting.

38 of 41

Open Resources

39 of 41

Scientific value of large projects

Evaluating promising ideas in practically useful settings. Some things we learned:

  • Diversity of the prompts matters a lot!
  • Generalization in safety is low and safety training is largely orthogonal to other skills.
  • How PPO compares to DPO-like methods when measured on large evaluation suites.
  • Potential overfitting to evaluations used for development.

Some things we explored that did not make it to the final recipe:

  • Rejection sampling
  • Process reward models

40 of 41

Scientific value of large projects

Identifying new research problems. Some next steps:

  • Better understanding of generalization in training procedures.
  • Extending the set of core skills.
  • Deeper analysis of RLVR: initialization, required model size.
  • Applying reinforcement learning to tasks with non-verifiable rewards.

41 of 41

Questions?

Try out Tülu: https://playground.allenai.org