1 of 54

Aligning LLMs with Direct Preference Optimization

Lewis Tunstall & Edward Beeching

Hugging Face

2 of 54

Why align?

Pretraining

Is pineapple on pizza a crime?

Base LLM

This is one of the many questions that will be answered at the Pizza Party …

3 of 54

Why align?

Pretraining

Supervised fine-tuning

Is pineapple on pizza a crime?

SFT LLM

Yes, adding pineapple as a topping on pizza is a criminal act under the Geneva Convention

4 of 54

Why align?

Pretraining

Supervised fine-tuning

Alignment

Is pineapple on pizza a crime?

SFT LLM

“Yes” 👎

“No” 👍

Collect human / AI feedback to learn p(y_w ≻ y_l)
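The preference probability on this slide is usually modelled with a Bradley-Terry model over an underlying reward r; as a reference, a standard formulation (not spelled out on the slide) is:

    p(y_w \succ y_l \mid x) = \sigma\big( r(x, y_w) - r(x, y_l) \big)

where \sigma is the logistic function, y_w is the preferred ("No") response and y_l the dispreferred ("Yes") one.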

5 of 54

Why align?

Pretraining

Supervised fine-tuning

Alignment

Is pineapple on pizza a crime?

Aligned LLM

No, adding pineapple as a topping on pizza is not a criminal act. It’s a matter of personal preference and taste.

6 of 54

RLHF - the OG of LLM alignment

Stiennon et al (2020)

Ouyang et al (2022)

11 of 54

RLHF - the OG of LLM alignment

Various challenges

  • RL is notoriously unstable and has many hyperparameters to tune
  • Need a separate reward model (RM) ⇒ 3 LLMs to juggle 😱

Objective: maximise rewards, with a KL-divergence penalty (controlled by β) to prevent reward hacking
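Written out, the standard RLHF objective these annotations refer to is, up to notation (following Stiennon et al / Ouyang et al):

    \max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)} \big[ r_\phi(x, y) \big] \;-\; \beta \, \mathbb{D}_{\mathrm{KL}}\!\big[ \pi_\theta(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x) \big]

where r_\phi is the learned reward model, \pi_\theta the policy being optimised and \pi_{\mathrm{ref}} the SFT model.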

12 of 54

Direct Preference Optimization

Rafailov et al (2023)

13 of 54

Direct Preference Optimization

good response

bad response

Rafailov et al (2023)

AIF + DPO: Distilling Zephyr and friends

Example preference triple (x, y_w, y_l): (Is pineapple on pizza a crime?, No, Yes)

14 of 54

Direct Preference Optimization

model we’re optimising

reference

model (SFT)

Rafailov et al (2023)

AIF + DPO: Distilling Zephyr and friends

good response

15 of 54

Direct Preference Optimization

model we’re optimising

reference

model (SFT)

Rafailov et al (2023)

AIF + DPO: Distilling Zephyr and friends

bad response
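For reference, the loss these frames annotate is the DPO objective from Rafailov et al (2023), with the chosen- and rejected-response terms expressed as log-ratios against the frozen reference model:

    \mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]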

16 of 54

Direct Preference Optimization

Rafailov et al (2023)

Algorithm

  • Sample a good/bad response pair
  • Run the pair through the two models (active and reference)
  • Backprop the DPO loss
  • Profit 💪

17 of 54

What does the DPO update do?

implicit reward from LM
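The "implicit reward" is the one DPO extracts from the language model itself; up to a prompt-dependent constant it is

    \hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}

and the DPO update increases the likelihood of y_w and decreases that of y_l, with examples weighted by how badly the current implicit reward mis-ranks the pair.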

19 of 54

Some examples

Tunstall et al (2023)

UltraChat for SFT (Ding et al, 2023)

UltraFeedback for DPO (Cui et al, 2023)

MT Bench

20 of 54

Some examples

huggingface/trl

OpenAccess-AI-Collective/axolotl

21 of 54

Going beyond DPO

DPO

  • IPO (Azar et al, 2023): adds a regularisation term to prevent overfitting
  • KTO (Ethayarajh et al, 2023): dispenses with binary preferences altogether!
  • Iterative DPO (Snorkel, 2023): combines rejection sampling with DPO

22 of 54

🙋 Questions?

23 of 54

Training and Aligning a Chatbot

  • Practical dive into SFT and DPO

  • Introduce Chat Templates for formatting dialogue data

  • Links for SFT and DPO datasets

  • Metrics

  • Model Evaluation

Pretraining

Supervised fine-tuning

Alignment

24 of 54

Annotated SFT & DPO

Notebooks (run on Colab):

  • Annotated SFT
  • Annotated DPO

More up-to-date codebase: the Hugging Face Alignment Handbook

  • Low-resource examples with QLoRA
  • Multi-GPU / multi-node examples with Accelerate & DeepSpeed
  • Configs, hyperparameters, Slurm scripts

A note on LoRA (see the config sketch below):

  • PEFT blogpost
  • LoRA paper
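As a rough illustration of what the LoRA setup looks like in code, here is a minimal sketch with the peft library; the model name and hyperparameter values are placeholders, not the handbook's exact configs:

    from peft import LoraConfig, get_peft_model
    from transformers import AutoModelForCausalLM

    # Base model to fine-tune (placeholder checkpoint)
    model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

    # LoRA trains small low-rank adapters instead of all the weights
    peft_config = LoraConfig(
        r=16,                  # rank of the low-rank update matrices
        lora_alpha=32,         # scaling factor applied to the update
        lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, peft_config)
    model.print_trainable_parameters()  # typically well under 1% of the full model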

25 of 54

Supervised Fine-Tuning (SFT)

Load a dataset

Apply Chat template

SFT

26 of 54

Supervised Fine-Tuning (SFT)

Example prompts:

  • Which famous landmarks should I visit in London, beyond the usual ones?

  • Write a program to implement a dynamic programming algorithm for the longest common substring problem in C++

  • Create a YouTube tutorial on how to bake a gluten-free cake.

Load a dataset

Apply Chat template

SFT

27 of 54

Supervised Fine-Tuning (SFT)

Load a dataset

Apply Chat template

SFT

28 of 54

Supervised Fine-Tuning (SFT)

Popular templates:

  • ChatML

  • Llama-2

  • Zephyr

Load a dataset

Apply Chat template

SFT

29 of 54

Supervised Fine-Tuning (SFT)

Load a dataset

Apply Chat template

SFT

What is 2+2?

2+2 is equal to 4, how else can I help?

What about 5*7?

5*7 is equal to 35, do you have any further questions?

...

30 of 54

Supervised Fine-Tuning (SFT)

Load a dataset

Apply Chat template

SFT

<|im_start|>system

<|im_end|>

<|im_start|>user

What is 2+2?<|im_end|>

<|im_start|>assistant

2+2 is equal to 4, how else can I help?<|im_end|>

<|im_start|>user

What about 5*7?<|im_end|>

<|im_start|>assistant

5*7 is equal to 35, do you have any further questions?<|im_end|>
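In transformers (recent versions), this formatting is produced by the tokenizer's chat template; a minimal sketch, assuming a checkpoint whose chat_template is ChatML (the model name is a placeholder):

    from transformers import AutoTokenizer

    # Placeholder checkpoint; any tokenizer with a ChatML chat_template behaves this way
    tokenizer = AutoTokenizer.from_pretrained("my-org/my-chatml-model")

    messages = [
        {"role": "system", "content": ""},
        {"role": "user", "content": "What is 2+2?"},
        {"role": "assistant", "content": "2+2 is equal to 4, how else can I help?"},
        {"role": "user", "content": "What about 5*7?"},
        {"role": "assistant", "content": "5*7 is equal to 35, do you have any further questions?"},
    ]

    # Render the conversation into the template string shown above
    print(tokenizer.apply_chat_template(messages, tokenize=False))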

31 of 54

Supervised Fine-Tuning (SFT)

Popular templates:

  • ChatML <- Recommended

  • Llama-2

  • Zephyr

Load a dataset

Apply Chat template

SFT

32 of 54

Supervised Fine-Tuning (SFT)

Load a dataset

Apply Chat template

SFT

33 of 54

Supervised Fine-Tuning (SFT)

Load a dataset

Apply Chat template

SFT
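A minimal sketch of the SFT step with trl's SFTTrainer, roughly following the TRL API available at the time of this talk (some of these arguments have since moved into SFTConfig in newer releases); the model and dataset names are placeholders, and the dataset is assumed to already contain a "text" column rendered with the chat template:

    from datasets import load_dataset
    from transformers import TrainingArguments
    from trl import SFTTrainer

    # Placeholder dataset with a "text" column of chat-template-formatted conversations
    dataset = load_dataset("my-org/my-sft-dataset", split="train")

    training_args = TrainingArguments(
        output_dir="sft-model",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=2e-5,
        num_train_epochs=1,
        bf16=True,
    )

    trainer = SFTTrainer(
        model="mistralai/Mistral-7B-v0.1",  # placeholder base model
        args=training_args,
        train_dataset=dataset,
        dataset_text_field="text",          # column holding the formatted chats
        max_seq_length=2048,
    )
    trainer.train()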

34 of 54

Direct Preference Optimization (DPO)

Load a dataset

Apply Chat template

DPO

35 of 54

Direct Preference Optimization (DPO)

Load a dataset

Apply Chat template

DPO

36 of 54

Direct Preference Optimization (DPO)

Load a dataset

Apply Chat template

DPO

37 of 54

Direct Preference Optimization (DPO)

Load a dataset

Apply Chat template

DPO

Prompt (x)

Chosen response (y_w)

Rejected response (y_l)
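Concretely, each DPO training record is a (prompt, chosen, rejected) triple; a toy example reusing the running question from earlier slides:

    # One preference example: x, y_w (chosen), y_l (rejected)
    dpo_example = {
        "prompt": "Is pineapple on pizza a crime?",
        "chosen": "No, adding pineapple as a topping on pizza is not a criminal act. "
                  "It's a matter of personal preference and taste.",
        "rejected": "Yes, adding pineapple as a topping on pizza is a criminal act "
                    "under the Geneva Convention.",
    }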


39 of 54

Direct Preference Optimization (DPO)

Where Pref(R1) > Pref(R2) > Pref(R3) > Pref(R4)

Load a dataset

Apply Chat template

DPO

Prompt → Response 1, Response 2, Response 3, Response 4 (ranked by preference; see the binarisation sketch below)
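A common way to binarise such a ranking into DPO pairs (roughly the recipe used for UltraFeedback-style datasets) is to keep the top-ranked response as "chosen" and sample one of the remaining responses as "rejected"; a minimal sketch, with illustrative function and variable names:

    import random

    def binarise(prompt, ranked_responses):
        """ranked_responses is ordered best-first, e.g. [R1, R2, R3, R4]."""
        chosen = ranked_responses[0]                    # most-preferred response
        rejected = random.choice(ranked_responses[1:])  # one of the less-preferred ones
        return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

    pair = binarise("Is pineapple on pizza a crime?", ["R1", "R2", "R3", "R4"])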

40 of 54

Direct Preference Optimization (DPO)

Load a dataset

Apply Chat template

DPO

Prompt

Chosen response

Rejected response

Assistant response

User response

41 of 54

Direct Preference Optimization (DPO)

Load a dataset

Apply Chat template

DPO

Prompt

Chosen response

Rejected response

Assistant response

User response

Prompt

Assistant response

User response

42 of 54

Direct Preference Optimization (DPO)

Load a dataset

Apply Chat template

DPO

43 of 54

Direct Preference Optimization (DPO)

Load a dataset

Apply Chat template

DPO

44 of 54

Direct Preference Optimization (DPO)

Load a dataset

Apply Chat template

DPO

More on beta and alignment losses: https://huggingface.co/blog/pref-tuning
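A minimal sketch of the DPO step with trl's DPOTrainer, again following the TRL API of the time (in newer releases beta and the length limits live in a DPOConfig); all checkpoint and dataset names are placeholders:

    from datasets import load_dataset
    from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
    from trl import DPOTrainer

    model = AutoModelForCausalLM.from_pretrained("my-org/sft-model")      # policy, initialised from SFT
    ref_model = AutoModelForCausalLM.from_pretrained("my-org/sft-model")  # frozen reference copy
    tokenizer = AutoTokenizer.from_pretrained("my-org/sft-model")

    # Placeholder dataset with prompt / chosen / rejected string columns
    dpo_dataset = load_dataset("my-org/my-dpo-dataset", split="train")

    training_args = TrainingArguments(
        output_dir="dpo-model",
        per_device_train_batch_size=2,
        learning_rate=5e-7,          # ~100x smaller than for SFT
        lr_scheduler_type="cosine",
        num_train_epochs=1,
        bf16=True,
    )

    trainer = DPOTrainer(
        model=model,
        ref_model=ref_model,
        args=training_args,
        beta=0.1,                    # strength of the implicit KL penalty; worth sweeping 0.01-1.0
        train_dataset=dpo_dataset,
        tokenizer=tokenizer,
        max_length=1024,
        max_prompt_length=512,
    )
    trainer.train()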

45 of 54

Direct Preference Optimization (DPO)

Load a dataset

Apply Chat template

DPO

46 of 54

DPO Training tips

  • Beta: sweep values from 0.01 to 1.0
  • Learning rate: ~100x smaller than for SFT (e.g. 5e-7)
  • Batch size: trade off global batch size against the number of epochs
  • Optimizer: Adam appears better than RMSProp
  • Scheduler: cosine > linear
  • The best SFT model != the best DPO model
  • LoRA: appears to regularise the model compared to a full fine-tune

47 of 54

DPO Training tips

  • Beta

  • Learning rate: ~100x smaller than for SFT
  • Batch size: tradeoff between global batch size and the number of epochs
  • Optimizer: Adam appears better than RMSProp
  • Scheduler: Cosine > Linear
  • LoRA: Appears to regularize the model training compared to full fine-tuning

48 of 54

DPO Metrics

49 of 54

Diagnosing problems

50 of 54

Diagnosing problems

51 of 54

Evaluating Chatbots

52 of 54

Evaluating Chatbots

  • Open LLM Leaderboard - not chatbot-focused; prone to leakage and overfitting

  • MT Bench - Usage

  • AlpacaEval - Usage

  • LlamaIndex (RAG)

  • Human eval - LMSYS Chatbot Arena

53 of 54

54 of 54

🙋 Questions?
