RLHF Research Paper Review
Andy Lee
What is RLHF?
Overview of RLHF
Step 1: Fine-tune the LLM on a training dataset with supervised learning
Step 2: Rank model responses with human evaluators and train a reward model
Step 3: Use the reward model to further fine-tune the LLM with Proximal Policy Optimization (PPO)
Step 1: Fine-tune the model with Supervised Learning
Supervised Learning Example
Prompt: "How would you greet someone at an event?"
Human: "Hello! It's so nice to meet you."
Model: "Sup! It's kinda nice to meet you."
Step 2: Train the Reward Model
Reward Model Example
Prompt: "How should you politely decline a job offer?"
Good Response: "Thank you for the consideration, but I will have to decline." → Reward score: 0.9
Bad Response: "Nah, not interested lol." → Reward score: 0.2
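A minimal sketch of how such a reward model can be trained, assuming a PyTorch scoring head and toy feature tensors (the class name and shapes are illustrative, not from the paper): a pairwise ranking loss pushes the preferred response's scalar reward above the rejected one's, which is what produces scores like 0.9 vs. 0.2 above.

```python
# Pairwise ranking loss for the reward model (PyTorch sketch; the random
# "features" stand in for real encoded prompt+response pairs).
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        # In practice this head sits on top of a pretrained transformer encoder.
        self.score_head = nn.Linear(hidden_size, 1)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.score_head(features).squeeze(-1)  # one scalar reward per example

reward_model = RewardModel()

# Illustrative features for (prompt + good response) and (prompt + bad response).
good_features = torch.randn(1, 768)
bad_features = torch.randn(1, 768)

r_good = reward_model(good_features)   # e.g. ~0.9 after training
r_bad = reward_model(bad_features)     # e.g. ~0.2 after training

# Ranking loss: push the preferred response's reward above the rejected one's.
loss = -torch.nn.functional.logsigmoid(r_good - r_bad).mean()
loss.backward()
```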
Step 3: Fine-tune the model with PPO
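A rough sketch of the objective used in this step, assuming per-token log-probabilities have already been gathered (all names below are illustrative): PPO's clipped surrogate loss updates the policy, and a KL penalty against the SFT model keeps the fine-tuned policy from drifting too far from it.

```python
# PPO-style update for the policy (PyTorch sketch; the log-prob tensors are illustrative).
import torch

def ppo_policy_loss(logp_new: torch.Tensor,
                    logp_old: torch.Tensor,
                    advantages: torch.Tensor,
                    clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped surrogate objective from PPO, averaged over tokens."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

def kl_penalized_reward(reward_model_score: torch.Tensor,
                        logp_policy: torch.Tensor,
                        logp_sft: torch.Tensor,
                        beta: float = 0.02) -> torch.Tensor:
    """InstructGPT-style reward: reward model score minus a KL penalty
    that keeps the RL policy close to the SFT model."""
    return reward_model_score - beta * (logp_policy - logp_sft).sum(-1)

# Illustrative tensors: batch of 2 responses, 5 tokens each.
logp_new = torch.randn(2, 5, requires_grad=True)
logp_old = torch.randn(2, 5)
advantages = torch.randn(2, 5)

loss = ppo_policy_loss(logp_new, logp_old, advantages)
loss.backward()
```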
Experiments by OpenAI
GPT-3 (175B parameters) vs. InstructGPT (1.3B parameters)
Performance Comparison
[Chart from Ouyang et al. (2022): PPO-ptx (InstructGPT) compared against the SFT and GPT-3 baselines.]
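For reference, "PPO-ptx" denotes PPO with an extra pretraining-gradient term; the objective in Ouyang et al. (2022) has roughly the form:

$$
\text{objective}(\phi) = \mathbb{E}_{(x,y)\sim D_{\pi_\phi^{\mathrm{RL}}}}\!\left[r_\theta(x,y) - \beta \log\frac{\pi_\phi^{\mathrm{RL}}(y\mid x)}{\pi^{\mathrm{SFT}}(y\mid x)}\right] + \gamma\,\mathbb{E}_{x\sim D_{\mathrm{pretrain}}}\!\left[\log \pi_\phi^{\mathrm{RL}}(x)\right]
$$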
Why is RLHF Important?
The smaller model (1.3B) with RLHF hallucinated 21% of the time, compared to 41% for the larger model (175B) without RLHF (Ouyang et al., 2022).
Human evaluators preferred outputs from the smaller model (1.3B) with RLHF over the much larger model (175B) without RLHF 85% of the time on public NLP datasets (Ouyang et al., 2022).
The smaller model (1.3B) with RLHF generated 25% fewer toxic responses than the larger model (175B) without RLHF (Ouyang et al., 2022).
More Preferable Responses
More Truthful Responses
Less Toxic Responses
Thank You!