Reinforcement Learning from Human Feedback:
Pathways to Open Reproduction of ChatGPT
Dr. Nathan Lambert · 9 March 2023
Recent breakthroughs in machine learning
2/42
Stable Diffusion
August 2022
ChatGPT
Nov. 2022
Bard
Early 2023
Generative AI: The near future for graduate students
3/42
Code generation
Email & writing
Displaying results
[Image generated with Stable Diffusion: "An important figure from a machine learning research paper with some math"]
Generative AI: The near future for the world
4/42
Becomes substrate of the internet:
[Image generated with DALL·E 2: "the earth with a bunch of wires growing through it"]
Reducing harms of machine learning systems
How do you create / code a loss function for:
6/42
Do not write a loss function for preferences; model them!
7/42
Key motivation of reinforcement learning from human feedback (RLHF):
RLHF’s influence: intuition
RLHF re-shapes the model’s deep understanding of the world
8/42
Outline
9/42
Review: reinforcement learning basics
10/42
History: RLHF for decision making
11/42
Knox, W. Bradley, and Peter Stone. "Tamer: Training an agent manually via evaluative reinforcement." 2008.
Pre Deep RL
Christiano, Paul F., et al. "Deep reinforcement learning from human preferences." 2017.
For Deep RL
History: early OpenAI experiments with RLHF
12/42
Stiennon, Nisan, et al. "Learning to summarize with human feedback." 2020.
“Three pigs defend themselves from a mean wolf”
History: early OpenAI experiments with RLHF
13/42
Prompt:
To pursue a Computer Sc. PhD or continue working? Especially if one has no real intention to work in academia even after grad school ...
📓Vanilla LM:
I’m considering pursuing a PhD in Computer Science, but I’m worried about the future. I’m currently employed full-time, but I’m worried about the future.
✍️ Human Annotation:
Software Engineer with a job I’m happy at (for now), deciding whether to pursue a PhD to improve qualifications and explore interests and a new challenge.
🤖RLHF Model:
Currently employed, considering pursuing PhD in Computer Science to avoid being stuck with no residency visa ever again. Has anyone pursued a PhD purely for the sake of research, with no intention of joining the academic world?
Recent history: ChatGPT
(rumors)
14/42
[1]: Bing’s Revenge and Google’s AI Faceplant: Hard Fork, Feb 10th 2023
[2]: https://openai.com/blog/introducing-chatgpt-and-whisper-apis
Outline
15/42
Modern RLHF overview
16/42
17/42
2. Reward model training
18/42
How do we capture human preferences over sampled and curated text? What is the loss function?
Goal: get a model that maps
input text → scalar reward
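A minimal sketch of the pairwise ranking loss commonly used for this mapping (names are illustrative): given a human-preferred and a rejected completion for the same prompt, the reward model is trained so the preferred one scores higher.

```python
import torch
import torch.nn.functional as F

def reward_ranking_loss(reward_chosen: torch.Tensor,
                        reward_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise (Bradley-Terry style) loss: -log sigmoid(r_chosen - r_rejected),
    pushing the scalar reward of the preferred completion above the rejected one."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()
```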
2. Reward model training
19/42
Input dataset: prompts for the specific use case the model will be used for (e.g. chat questions)
Generations: multiple models are often used to create a diverse set of completions for ranking.
2. Reward model training
20/42
2. Reward model training
21/42
2. Reward model training
22/42
Mapping from all possible input text to a scalar!
Reward model:
- Also a transformer-based LM,
- Sizes vary (relative to the policy),
- Multiple techniques for freezing the base model (sketch below).
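A rough sketch of what a transformer-based reward model with a scalar head can look like; the base model, pooling choice, and freezing strategy here are illustrative assumptions, not the exact setup of any particular paper:

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class ScalarRewardModel(nn.Module):
    """A pretrained transformer trunk plus a linear head that maps the final
    token's hidden state to a single scalar reward."""
    def __init__(self, base_name: str = "gpt2", freeze_base: bool = True):
        super().__init__()
        self.base = AutoModel.from_pretrained(base_name)
        if freeze_base:  # one of several options for (partially) freezing the base model
            for p in self.base.parameters():
                p.requires_grad = False
        self.reward_head = nn.Linear(self.base.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.base(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state
        # summarize the sequence with the last non-padded token (assumes right padding)
        last_idx = attention_mask.sum(dim=1) - 1
        pooled = hidden[torch.arange(hidden.size(0), device=hidden.device), last_idx]
        return self.reward_head(pooled).squeeze(-1)  # (batch,) scalar rewards
```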
3. Fine tuning with RL
23/42
3. Fine tuning with RL - using a reward model
24/42
3. Fine tuning with RL - KL penalty
25/42
Constrains the RL fine-tuning so it does not produce an LM that outputs gibberish (to fool the reward model).
Note: DeepMind applied this in the RL loss (not the reward); see GopherCite.
Kullback–Leibler (KL) divergence:
Distance between distributions
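In code, this constraint is usually folded directly into the reward signal. A minimal sketch, assuming per-token log-probabilities from the trained policy and a frozen reference copy of the initial LM (function and variable names are illustrative; beta is a tunable coefficient):

```python
import torch

def kl_shaped_reward(rm_score: torch.Tensor,         # (batch,) scalar reward-model scores
                     policy_logprobs: torch.Tensor,  # (batch, seq_len) log-probs under the trained policy
                     ref_logprobs: torch.Tensor,     # (batch, seq_len) log-probs under the frozen reference LM
                     beta: float = 0.1) -> torch.Tensor:
    """Subtract a per-token KL penalty (policy vs. reference) from the reward-model
    score, so the RL step cannot drift into gibberish that merely fools the reward
    model. Inputs are assumed detached: rewards are constants in the RL update."""
    kl_penalty = policy_logprobs - ref_logprobs   # per-token log-ratio
    reward = -beta * kl_penalty                   # dense penalty at every token
    reward[:, -1] += rm_score                     # sparse reward-model score on the final token
    return reward
```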
3. Fine tuning with RL - combining rewards
26/42
Option to add additional terms to this reward function, e.g. InstructGPT adds
a reward to match the original human-curation distribution (schematic below).
Ouyang, Long, et al. "Training language models to follow instructions with human feedback." arXiv preprint arXiv:2203.02155 (2022).
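A schematic of how extra terms slot in; note that InstructGPT technically mixes its pretraining term into the loss rather than the reward, so the coefficients and the auxiliary term here are purely illustrative:

```python
def combined_reward(rm_score, kl_penalty, aux_score, beta=0.1, gamma=0.01):
    """Schematic only: reward-model score, minus the KL penalty from the previous
    slide, plus a weighted auxiliary term that keeps generations close to the
    original human-curated data distribution."""
    return rm_score - beta * kl_penalty + gamma * aux_score
```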
3. Fine tuning with RL - feedback & training
27/42
- Policy gradient updates the policy LM directly (memory intensive).
- Often some parameters of the policy are frozen (condensed sketch below).
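A very condensed sketch of the two points above, assuming a GPT-2-style policy from transformers (the transformer.h and lm_head attribute names follow that family); real systems use PPO with clipping and a value baseline rather than plain REINFORCE:

```python
import torch

def freeze_lower_layers(policy, n_trainable_blocks: int = 4):
    """Freeze everything except the top few transformer blocks and the LM head,
    a common way to cut the memory cost of updating the policy directly."""
    for p in policy.parameters():
        p.requires_grad = False
    for block in policy.transformer.h[-n_trainable_blocks:]:  # GPT-2-style attributes
        for p in block.parameters():
            p.requires_grad = True
    for p in policy.lm_head.parameters():
        p.requires_grad = True

def policy_gradient_step(optimizer, logprobs, rewards):
    """One REINFORCE-style update: raise the log-probability of sampled tokens
    in proportion to their (KL-shaped) reward. `rewards` must be detached."""
    loss = -(logprobs * rewards).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```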
Playing with the RL of RLHF
Some normal things with meh results:
❌ residual value prediction
❌ approximate KL (http://joschu.net/blog/kl-approx.html; estimator sketch below)
🤷‍♂️ entropy regularization
Always fighting instabilities: lower the learning rate, watch gradient norms & activation spikes, add model checkpointing for restarts.
28/42
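For the approximate-KL bullet above, the estimators from the linked blog post look roughly like this (a sketch; logp and logq are the sampled tokens' log-probabilities under the two models, estimating KL(q || p) from samples drawn from q):

```python
import torch

def kl_estimators(logp: torch.Tensor, logq: torch.Tensor):
    """Monte-Carlo estimators of KL(q || p) from samples x ~ q, following the
    k1/k2/k3 naming in the blog post linked above."""
    logr = logp - logq                 # log of the probability ratio p(x)/q(x)
    k1 = -logr                         # unbiased, but high variance
    k2 = 0.5 * logr ** 2               # biased, lower variance
    k3 = (logr.exp() - 1) - logr       # unbiased, low variance, always non-negative
    return k1.mean(), k2.mean(), k3.mean()
```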
Outline
29/42
Reproducing ChatGPT: Overview & challenges
30/42
Reproducing ChatGPT: Prompt & instruction datasets
31/42
Challenges
Reproducing ChatGPT: Language models
32/42
Challenges
Reproducing ChatGPT: Data collection & annotation
33/42
Challenges
Reproducing ChatGPT: Ongoing (open-source) efforts
34/42
Outline
35/42
Formulating a research agenda on RLHF
Key factor -- one module at a time!
36/42
RLHF on outcomes vs. messages
Outcome based rewards
37/42
Individual message rewards
RLHF on outcomes vs. messages
Outcome based rewards
38/42
Leverages rich history of RL as a multi-step optimizer:
Thanks
39/42
Nazneen Rajani, Lewis Tunstall, Thomas Wolf, Tristan Thrush, Edward Beeching
And more at HuggingFace and the community!
The RLHF team…
Leandro von Werra, Younes Belkada
Resources
Related works: hf.co/blog/rlhf#further-reading
Models, datasets, tools: hf.co/HuggingFaceH4
(Training code for language / reward models soon)
Contact:
40/42
Conclusions
RLHF Summary:
41/42
Supplementary slides follow
42
Variations on the methodology
Almost all papers to date have tweaks:
43/42
Reward model training - feedback interfaces
44/42
Reward model training - feedback interfaces
45/42
Reward model training - feedback interfaces
46/42
Reward model training - feedback interfaces
47/42
The opportunity of text-based feedback.
Prominent, recent papers
48
Recapping recent examples - InstructGPT
49/42
Ouyang, Long, et al. "Training language models to follow instructions with human feedback." arXiv preprint arXiv:2203.02155 (2022).
Recapping recent examples - Anthropic’s RLHF
50/42
Bai, Yuntao, et al. "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback." arXiv preprint arXiv:2204.05862 (2022).
Recapping recent examples - comparison
51/42
Bai, Yuntao, et al. "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback." arXiv preprint arXiv:2204.05862 (2022).
This is cherry picked (and still instructive)!