1 of 54

Aligning LLMs with Direct Preference Optimization

Lewis Tunstall & Edward Beeching

Hugging Face

2 of 54

Why align?

Pretraining

Is pineapple on pizza a crime?

Base LLM

This is one of the many questions that will be answered at the Pizza Party …

3 of 54

Why align?

Pretraining

Supervised fine-tuning

Is pineapple on pizza a crime?

SFT LLM

Yes, adding pineapple as a topping on pizza is a criminal act under the Geneva Convention

4 of 54

Why align?

Pretraining

Supervised fine-tuning

Alignment

Is pineapple on pizza a crime?

SFT LLM

“Yes” 👎

“No” 👍

Collect human / AI feedback to learn p(y_w ≻ y_l)
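The preference probability on this slide is usually modelled with a Bradley-Terry model over an underlying reward r; as a reference, a standard formulation (not spelled out on the slide) is:

    p(y_w \succ y_l \mid x) = \sigma\big( r(x, y_w) - r(x, y_l) \big)

where \sigma is the logistic function, y_w is the preferred ("No") response and y_l the dispreferred ("Yes") one.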

5 of 54

Why align?

Pretraining

Supervised fine-tuning

Alignment

Is pineapple on pizza a crime?

Aligned LLM

No, adding pineapple as a topping on pizza is not a criminal act. It’s a matter of personal preference and taste.

6 of 54

RLHF - the OG of LLM alignment

Stiennon et al (2020)

Ouyang et al (2022)

11 of 54

RLHF - the OG of LLM alignment

Various challenges

  • RL is notoriously unstable and has many hyperparameters to tune
  • Need a separate reward model (RM) ⇒ 3 LLMs to juggle 😱

Objective: maximise rewards, with a KL-divergence penalty (controlled by β) to prevent reward hacking
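Written out, the standard RLHF objective these annotations refer to is, up to notation (following Stiennon et al / Ouyang et al):

    \max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)} \big[ r_\phi(x, y) \big] \;-\; \beta \, \mathbb{D}_{\mathrm{KL}}\!\big[ \pi_\theta(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x) \big]

where r_\phi is the learned reward model, \pi_\theta the policy being optimised and \pi_{\mathrm{ref}} the SFT model.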

12 of 54

Direct Preference Optimization

Rafailov et al (2023)

13 of 54

Direct Preference Optimization

good response

bad response

Rafailov et al (2023)

AIF + DPO: Distilling Zephyr and friends

Example preference triple (x, y_w, y_l): (Is pineapple on pizza a crime?, No, Yes)

14 of 54

Direct Preference Optimization

model we’re optimising

reference

model (SFT)

Rafailov et al (2023)

AIF + DPO: Distilling Zephyr and friends

good response

15 of 54

Direct Preference Optimization

model we’re optimising

reference

model (SFT)

Rafailov et al (2023)

AIF + DPO: Distilling Zephyr and friends

bad response
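For reference, the loss these frames annotate is the DPO objective from Rafailov et al (2023), with the chosen- and rejected-response terms expressed as log-ratios against the frozen reference model:

    \mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]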

16 of 54

Direct Preference Optimization

Rafailov et al (2023)

Algorithm

  • Sample a good/bad response pair
  • Run the pair through the two models (active and reference)
  • Backprop the DPO loss
  • Profit 💪

17 of 54

What does the DPO update do?

implicit reward from LM
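The "implicit reward" is the one DPO extracts from the language model itself; up to a prompt-dependent constant it is

    \hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}

and the DPO update increases the likelihood of y_w and decreases that of y_l, with examples weighted by how badly the current implicit reward mis-ranks the pair.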

19 of 54

Some examples

Tunstall et al (2023)

UltraChat for SFT (Ding et al, 2023)

UltraFeedback for DPO (Cui et al, 2023)

MT Bench

20 of 54

Some examples

huggingface/trl

OpenAccess-AI-Collective/axolotl

21 of 54

Going beyond DPO

DPO

  • IPO (Azar et al, 2023): adds a regularisation term to prevent overfitting
  • KTO (Ethayarajh et al, 2023): dispenses with binary preferences altogether!
  • Iterative DPO (Snorkel, 2023): combines rejection sampling with DPO

22 of 54

🙋 Questions?

23 of 54

Training and Aligning a Chatbot

  • Practical dive into SFT and DPO

  • Introduce Chat Templates for formatting dialogue data

  • Links for SFT and DPO datasets

  • Metrics

  • Model Evaluation

Pretraining

Supervised fine-tuning

Alignment

24 of 54

Annotated SFT & DPO

Notebooks (run on Colab):

  • Annotated SFT
  • Annotated DPO

More up-to-date codebase: the Hugging Face Alignment Handbook

  • Low-resource examples with QLoRA
  • Multi-GPU / multi-node examples with Accelerate & DeepSpeed
  • Configs, hyperparameters, Slurm scripts

A note on LoRA (see the config sketch below):

  • PEFT blogpost
  • LoRA paper
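As a rough illustration of what the LoRA setup looks like in code, here is a minimal sketch with the peft library; the model name and hyperparameter values are placeholders, not the handbook's exact configs:

    from peft import LoraConfig, get_peft_model
    from transformers import AutoModelForCausalLM

    # Base model to fine-tune (placeholder checkpoint)
    model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

    # LoRA trains small low-rank adapters instead of all the weights
    peft_config = LoraConfig(
        r=16,                  # rank of the low-rank update matrices
        lora_alpha=32,         # scaling factor applied to the update
        lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, peft_config)
    model.print_trainable_parameters()  # typically well under 1% of the full model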

25 of 54

Supervised Fine-Tuning (SFT)

Load a dataset

Apply Chat template

SFT

26 of 54

Supervised Fine-Tuning (SFT)

Example prompts:

  • Which famous landmarks should I visit in London, beyond the usual ones?

  • Write a program to implement a dynamic programming algorithm for the longest common substring problem in C++

  • Create a YouTube tutorial on how to bake a gluten-free cake.

Load a dataset

Apply Chat template

SFT

27 of 54

Supervised Fine-Tuning (SFT)

Load a dataset

Apply Chat template

SFT

28 of 54

Supervised Fine-Tuning (SFT)

Popular templates:

  • ChatML

  • Llama-2

  • Zephyr

Load a dataset

Apply Chat template

SFT

29 of 54

Supervised Fine-Tuning (SFT)

Load a dataset

Apply Chat template

SFT

What is 2+2?

2+2 is equal to 4, how else can I help?

What about 5*7?

5*7 is equal to 35, do you have any further questions?

...

30 of 54

Supervised Fine-Tuning (SFT)

Load a dataset

Apply Chat template

SFT

<|im_start|>system

<|im_end|>

<|im_start|>user

What is 2+2?<|im_end|>

<|im_start|>assistant

2+2 is equal to 4, how else can I help?<|im_end|>

<|im_start|>user

What about 5*7?<|im_end|>

<|im_start|>assistant

5*7 is equal to 35, do you have any further questions?<|im_end|>
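In transformers (recent versions), this formatting is produced by the tokenizer's chat template; a minimal sketch, assuming a checkpoint whose chat_template is ChatML (the model name is a placeholder):

    from transformers import AutoTokenizer

    # Placeholder checkpoint; any tokenizer with a ChatML chat_template behaves this way
    tokenizer = AutoTokenizer.from_pretrained("my-org/my-chatml-model")

    messages = [
        {"role": "system", "content": ""},
        {"role": "user", "content": "What is 2+2?"},
        {"role": "assistant", "content": "2+2 is equal to 4, how else can I help?"},
        {"role": "user", "content": "What about 5*7?"},
        {"role": "assistant", "content": "5*7 is equal to 35, do you have any further questions?"},
    ]

    # Render the conversation into the template string shown above
    print(tokenizer.apply_chat_template(messages, tokenize=False))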

31 of 54

Supervised Fine-Tuning (SFT)

Popular templates:

  • ChatML <- Recommended

  • Llama-2

  • Zephyr

Load a dataset

Apply Chat template

SFT

32 of 54

Supervised Fine-Tuning (SFT)

Load a dataset

Apply Chat template

SFT

33 of 54

Supervised Fine-Tuning (SFT)

Load a dataset

Apply Chat template

SFT
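A minimal sketch of the SFT step with trl's SFTTrainer, roughly following the TRL API available at the time of this talk (some of these arguments have since moved into SFTConfig in newer releases); the model and dataset names are placeholders, and the dataset is assumed to already contain a "text" column rendered with the chat template:

    from datasets import load_dataset
    from transformers import TrainingArguments
    from trl import SFTTrainer

    # Placeholder dataset with a "text" column of chat-template-formatted conversations
    dataset = load_dataset("my-org/my-sft-dataset", split="train")

    training_args = TrainingArguments(
        output_dir="sft-model",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=2e-5,
        num_train_epochs=1,
        bf16=True,
    )

    trainer = SFTTrainer(
        model="mistralai/Mistral-7B-v0.1",  # placeholder base model
        args=training_args,
        train_dataset=dataset,
        dataset_text_field="text",          # column holding the formatted chats
        max_seq_length=2048,
    )
    trainer.train()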

34 of 54

Direct Preference Optimization (DPO)

Load a dataset

Apply Chat template

DPO

35 of 54

Direct Preference Optimization (DPO)

Load a dataset

Apply Chat template

DPO

36 of 54

Direct Preference Optimization (DPO)

Load a dataset

Apply Chat template

DPO

37 of 54

Direct Preference Optimization (DPO)

Load a dataset

Apply Chat template

DPO

Prompt (x)

Chosen response (y_w)

Rejected response (y_l)
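Concretely, each DPO training record is a (prompt, chosen, rejected) triple; a toy example reusing the running question from earlier slides:

    # One preference example: x, y_w (chosen), y_l (rejected)
    dpo_example = {
        "prompt": "Is pineapple on pizza a crime?",
        "chosen": "No, adding pineapple as a topping on pizza is not a criminal act. "
                  "It's a matter of personal preference and taste.",
        "rejected": "Yes, adding pineapple as a topping on pizza is a criminal act "
                    "under the Geneva Convention.",
    }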


39 of 54

Direct Preference Optimization (DPO)

Where Pref(R1) > Pref(R2) > Pref(R3) > Pref(R4)

Load a dataset

Apply Chat template

DPO

Prompt → Response 1, Response 2, Response 3, Response 4 (ranked by preference; see the binarisation sketch below)
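A common way to binarise such a ranking into DPO pairs (roughly the recipe used for UltraFeedback-style datasets) is to keep the top-ranked response as "chosen" and sample one of the remaining responses as "rejected"; a minimal sketch, with illustrative function and variable names:

    import random

    def binarise(prompt, ranked_responses):
        """ranked_responses is ordered best-first, e.g. [R1, R2, R3, R4]."""
        chosen = ranked_responses[0]                    # most-preferred response
        rejected = random.choice(ranked_responses[1:])  # one of the less-preferred ones
        return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

    pair = binarise("Is pineapple on pizza a crime?", ["R1", "R2", "R3", "R4"])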

40 of 54

Direct Preference Optimization (DPO)

Load a dataset

Apply Chat template

DPO

Prompt

Chosen response

Rejected response

Assistant response

User response

41 of 54

Direct Preference Optimization (DPO)

Load a dataset

Apply Chat template

DPO

Prompt

Chosen response

Rejected response

Assistant response

User response

Prompt

Assistant response

User response

42 of 54

Direct Preference Optimization (DPO)

Load a dataset

Apply Chat template

DPO

43 of 54

Direct Preference Optimization (DPO)

Load a dataset

Apply Chat template

DPO

44 of 54

Direct Preference Optimization (DPO)

Load a dataset

Apply Chat template

DPO

More on beta and alignment losses: https://huggingface.co/blog/pref-tuning
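A minimal sketch of the DPO step with trl's DPOTrainer, again following the TRL API of the time (in newer releases beta and the length limits live in a DPOConfig); all checkpoint and dataset names are placeholders:

    from datasets import load_dataset
    from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
    from trl import DPOTrainer

    model = AutoModelForCausalLM.from_pretrained("my-org/sft-model")      # policy, initialised from SFT
    ref_model = AutoModelForCausalLM.from_pretrained("my-org/sft-model")  # frozen reference copy
    tokenizer = AutoTokenizer.from_pretrained("my-org/sft-model")

    # Placeholder dataset with prompt / chosen / rejected string columns
    dpo_dataset = load_dataset("my-org/my-dpo-dataset", split="train")

    training_args = TrainingArguments(
        output_dir="dpo-model",
        per_device_train_batch_size=2,
        learning_rate=5e-7,          # ~100x smaller than for SFT
        lr_scheduler_type="cosine",
        num_train_epochs=1,
        bf16=True,
    )

    trainer = DPOTrainer(
        model=model,
        ref_model=ref_model,
        args=training_args,
        beta=0.1,                    # strength of the implicit KL penalty; worth sweeping 0.01-1.0
        train_dataset=dpo_dataset,
        tokenizer=tokenizer,
        max_length=1024,
        max_prompt_length=512,
    )
    trainer.train()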

45 of 54

Direct Preference Optimization (DPO)

Load a dataset

Apply Chat template

DPO

46 of 54

DPO Training tips

  • Beta: sweep values from 0.01 to 1.0
  • Learning rate: ~100x smaller than for SFT (e.g. 5e-7)
  • Batch size: trade off global batch size against the number of epochs
  • Optimizer: Adam appears better than RMSProp
  • Scheduler: cosine > linear
  • The best SFT model != the best DPO model
  • LoRA: appears to regularise the model compared to a full fine-tune

47 of 54

DPO Training tips

  • Beta

  • Learning rate: ~100x smaller than for SFT
  • Batch size: tradeoff between global batch size and the number of epochs
  • Optimizer: Adam appears better than RMSProp
  • Scheduler: Cosine > Linear
  • LoRA: Appears to regularize the model training compared to full fine-tuning

48 of 54

DPO Metrics

49 of 54

Diagnosing problems

50 of 54

Diagnosing problems

51 of 54

Evaluating Chatbots

52 of 54

Evaluating Chatbots

  • Open LLM Leaderboard - not chatbot-focused; prone to leakage and overfitting

  • MT Bench - Usage

  • AlpacaEval - Usage

  • LlamaIndex (RAG)

  • Human eval - LMSYS Chatbot Arena

53 of 54

54 of 54

🙋 Questions?
