Aligning LLMs with Direct Preference Optimization
Lewis Tunstall & Edward Beeching
Hugging Face
Why align?
Pretraining
Prompt: Is pineapple on pizza a crime?
Base LLM: This is one of the many questions that will be answered at the Pizza Party …
Why align?
Pretraining
Supervised fine-tuning
Prompt: Is pineapple on pizza a crime?
SFT LLM: Yes, adding pineapple as a topping on pizza is a criminal act under the Geneva Convention
Why align?
Pretraining
Supervised fine-tuning
Alignment
Prompt: Is pineapple on pizza a crime?
SFT LLM responses: “Yes” 👎 / “No” 👍
Collect human / AI feedback to learn p(yw > yl)
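For reference, the preference probability p(yw > yl) is typically modelled with the Bradley-Terry formulation used in both RLHF and DPO:

p(y_w \succ y_l \mid x) = \sigma\big( r(x, y_w) - r(x, y_l) \big)

where r is a reward function (learned explicitly in RLHF, implicit in DPO) and \sigma is the logistic function.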
Why align?
Pretraining
Supervised fine-tuning
Alignment
Prompt: Is pineapple on pizza a crime?
Aligned LLM: No, adding pineapple as a topping on pizza is not a criminal act. It’s a matter of personal preference and taste.
RLHF - the OG of LLM alignment
Stiennon et al (2020)
Ouyang et al (2022)
RLHF - the OG of LLM alignment
Various challenges
Maximise rewards, with a KL-divergence penalty to prevent reward hacking (controlled by β)
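For reference, the RLHF objective these annotations describe can be written as

\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[ r_\phi(x, y) \big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\big[ \pi_\theta(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x) \big]

where r_\phi is the learned reward model, \pi_{\mathrm{ref}} is the SFT reference policy, and β controls the strength of the KL penalty.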
Direct Preference Optimization
Rafailov et al (2023)
AIF + DPO: Distilling Zephyr and friends
Example preference triple (prompt, good response, bad response): (Is pineapple on pizza a crime?, No, Yes)
The DPO loss scores the good response and the bad response under the model we’re optimising and a reference model (SFT).
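For reference, the DPO loss from Rafailov et al (2023) that these annotations point to is

\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]

where \pi_\theta is the model we’re optimising, \pi_{\mathrm{ref}} is the frozen SFT reference model, y_w is the good (chosen) response and y_l is the bad (rejected) response.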
Direct Preference Optimization
Rafailov et al (2023)
Algorithm
What does the DPO update do?
implicit reward from LM
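The implicit reward the slide refers to is

\hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}

and the gradient of the DPO loss up-weights exactly the pairs this implicit reward model currently gets wrong:

\nabla_\theta \mathcal{L}_{\mathrm{DPO}} = -\,\beta\, \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \Big[ \sigma\big( \hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w) \big) \big( \nabla_\theta \log \pi_\theta(y_w \mid x) - \nabla_\theta \log \pi_\theta(y_l \mid x) \big) \Big]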
Some examples
Zephyr (Tunstall et al (2023))
UltraChat for SFT (Ding et al (2023))
UltraFeedback for DPO (Cui et al (2023))
Evaluated on MT Bench
Some examples
huggingface/trl
OpenAccess-AI-Collective/axolotl
Going beyond DPO
DPO
IPO (Azar et al (2023)): adds a regularisation term to prevent overfitting
KTO (Ethayarajh et al (2023)): dispenses with binary preferences altogether!
Iterative DPO (Snorkel (2023)): combines rejection sampling with DPO
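A minimal sketch of how these loss variants can be selected in TRL; the option names are assumptions that depend on your TRL release (loss_type lives on DPOConfig in recent versions):

from trl import DPOConfig

# "sigmoid" is standard DPO; "ipo" and "kto_pair" select the IPO and paired-KTO
# losses in the TRL releases that support them (check your version's docs).
config = DPOConfig(output_dir="dpo-ipo", beta=0.1, loss_type="ipo")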
🙋 Questions?
Training and Aligning a Chatbot
Pretraining
Supervised fine-tuning
Alignment
Annotated SFT & DPO notebooks (run on Colab)
More up-to-date codebase: Hugging Face Alignment Handbook
A note on LoRA:
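A minimal PEFT sketch of the kind of LoRA setup this note refers to; the rank, alpha and target modules below are illustrative choices, not values from the slides:

from peft import LoraConfig

peft_config = LoraConfig(
    r=16,                       # rank of the low-rank update matrices
    lora_alpha=32,              # scaling applied to the update
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections in Llama/Mistral-style models
    task_type="CAUSAL_LM",
)

# Both SFTTrainer and DPOTrainer in TRL accept peft_config, so the same
# adapter setup can be reused for the SFT and DPO stages.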
Supervised Fine-Tuning (SFT)
Load a dataset
Apply Chat template
SFT
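A sketch of the “Load a dataset” step, assuming the UltraChat release used for Zephyr (the dataset id and split name follow the HuggingFaceH4 dataset card):

from datasets import load_dataset

# Each row holds a multi-turn conversation in a "messages" column.
sft_dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")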
Example prompts:
Apply Chat template
Popular templates:
Example conversation:
What is 2+2?
2+2 is equal to 4, how else can I help?
What about 5*7?
5*7 is equal to 35, do you have any further questions?
...
The same conversation after applying the ChatML chat template:
<|im_start|>system
<|im_end|>
<|im_start|>user
What is 2+2?<|im_end|>
<|im_start|>assistant
2+2 is equal to 4, how else can I help?<|im_end|>
<|im_start|>user
What about 5*7?<|im_end|>
<|im_start|>assistant
5*7 is equal to 35, do you have any further questions?<|im_end|>
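A sketch of the “Apply Chat template” step; the checkpoint id is a placeholder, and the exact output depends on the chat template the tokenizer ships (a ChatML template yields the <|im_start|> format above):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("your-org/your-chatml-model")  # hypothetical id

messages = [
    {"role": "system", "content": ""},
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "2+2 is equal to 4, how else can I help?"},
    {"role": "user", "content": "What about 5*7?"},
    {"role": "assistant", "content": "5*7 is equal to 35, do you have any further questions?"},
]

# Render the conversation as a single training string (no tokenization yet).
text = tokenizer.apply_chat_template(messages, tokenize=False)
print(text)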
SFT
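A sketch of the SFT step with TRL; the model id and hyperparameters are illustrative, and the argument names follow older TRL releases (newer versions move dataset_text_field and max_seq_length into SFTConfig):

from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer

model_id = "mistralai/Mistral-7B-v0.1"   # illustrative base model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=sft_dataset,      # conversations with the chat template already applied
    dataset_text_field="text",      # column holding the templated strings
    max_seq_length=2048,
    args=TrainingArguments(
        output_dir="sft-model",
        num_train_epochs=1,
        per_device_train_batch_size=4,
        learning_rate=2e-5,
    ),
)
trainer.train()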
Direct Preference Optimization (DPO)
Load a dataset
Apply Chat template
DPO
Prompt (x)
Chosen response (yw)
Rejected response (yl)
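A sketch of the “Load a dataset” step for DPO, assuming the binarized UltraFeedback release used for Zephyr (the dataset id, split and column names follow that dataset card):

from datasets import load_dataset

dpo_dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")
example = dpo_dataset[0]
print(example["prompt"])    # x
print(example["chosen"])    # y_w, the preferred conversation
print(example["rejected"])  # y_l, the dispreferred conversation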
A prompt with several ranked responses:
Prompt
Response 1
Response 2
Response 3
Response 4
where Pref(R1) > Pref(R2) > Pref(R3) > Pref(R4)
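An illustrative sketch (not from the slides) of turning a ranked set of responses into a single (chosen, rejected) pair; taking the best- and worst-ranked responses is one simple choice, while Zephyr-style recipes pair the best response with a randomly sampled lower-ranked one:

def to_preference_pair(prompt, ranked_responses):
    # ranked_responses is ordered best-first: Pref(R1) > Pref(R2) > ...
    return {
        "prompt": prompt,
        "chosen": ranked_responses[0],
        "rejected": ranked_responses[-1],
    }

pair = to_preference_pair(
    "Is pineapple on pizza a crime?",
    ["No", "It depends on who you ask", "Possibly", "Yes"],
)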
Prompt
Chosen response
Rejected response
Assistant response
User response
In a multi-turn conversation, the earlier user and assistant turns form the prompt, and only the final assistant turn differs between the chosen and rejected examples.
DPO
More on beta and alignment losses: https://huggingface.co/blog/pref-tuning
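A sketch of the DPO step with TRL; the hyperparameters are illustrative, and the argument names follow older TRL releases (newer versions take a DPOConfig, which also carries beta):

from transformers import TrainingArguments
from trl import DPOTrainer

dpo_trainer = DPOTrainer(
    model=model,                # the SFT model we are optimising
    ref_model=None,             # None = use a frozen copy of `model` as the reference
    beta=0.1,                   # strength of the implicit KL penalty
    train_dataset=dpo_dataset,  # prompt / chosen / rejected columns, chat template applied
    tokenizer=tokenizer,
    args=TrainingArguments(
        output_dir="dpo-model",
        learning_rate=5e-7,
        per_device_train_batch_size=2,
        num_train_epochs=1,
    ),
)
dpo_trainer.train()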
DPO Training tips
DPO Metrics
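For reference, TRL's DPOTrainer logs the implicit rewards during training; the metric names below are the ones it reports, and they can be read back from the trainer state:

# rewards/chosen     - mean implicit reward for chosen responses
# rewards/rejected   - mean implicit reward for rejected responses
# rewards/accuracies - fraction of pairs where chosen reward > rejected reward
# rewards/margins    - mean gap between chosen and rejected rewards
for entry in dpo_trainer.state.log_history:
    if "rewards/accuracies" in entry:
        print(entry["rewards/accuracies"], entry["rewards/margins"])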
Diagnosing problems
Evaluating Chatbots
🙋 Questions?