Reinforcement Learning from Human Feedback:

Pathways to Open Reproduction of ChatGPT

Dr. Nathan Lambert, 9 March 2023

Recent breakthroughs in machine learning

  • Stable Diffusion (August 2022)
  • ChatGPT (November 2022)
  • Bard (early 2023)

Generative AI: The near future for graduate students

  • Code generation
  • Email & writing
  • Displaying results

Image: "An important figure from a machine learning research paper with some math" – Stable Diffusion

Generative AI: The near future for the world

Generative AI becomes the substrate of the internet:

  1. AI generates all content except the highest-quality work,
  2. Driven by a mix of human-generated prompts and autonomous systems.

Image: "the earth with a bunch of wires growing through it" – DALL·E 2

When machine learning models fall short

Reducing harms of machine learning systems

How do you write a loss function for:

  • What is fair?
  • What is ethical?
  • What is safe?

Key motivation of reinforcement learning from human feedback (RLHF):

Do not write a loss function for preferences; model them!

RLHF’s influence: intuition

RLHF re-shapes the model’s deep understanding of the world

  • Style transfer
  • Unlocking potential
  • Safety / ease of use


Outline

  • Origins of RLHF
  • Conceptual Overview
  • Reproducing ChatGPT
  • Addressable Research Questions


Review: reinforcement learning basics


History: RLHF for decision making

  • Pre-deep-RL: Knox, W. Bradley, and Peter Stone. "TAMER: Training an agent manually via evaluative reinforcement." 2008.
  • Deep RL: Christiano, Paul F., et al. "Deep reinforcement learning from human preferences." 2017.

History: early OpenAI experiments with RLHF

Stiennon, Nisan, et al. "Learning to summarize from human feedback." 2020.

Figure example: “Three pigs defend themselves from a mean wolf”

History: early OpenAI experiments with RLHF

Prompt:

To pursue a Computer Sc. PhD or continue working? Especially if one has no real intention to work in academia even after grad school ...

📓 Vanilla LM:

I’m considering pursuing a PhD in Computer Science, but I’m worried about the future. I’m currently employed full-time, but I’m worried about the future.

✍️ Human Annotation:

Software Engineer with a job I’m happy at (for now), deciding whether to pursue a PhD to improve qualifications and explore interests and a new challenge.

🤖 RLHF Model:

Currently employed, considering pursuing PhD in Computer Science to avoid being stuck with no residency visa ever again. Has anyone pursued a PhD purely for the sake of research, with no intention of joining the academic world?


Recent history: ChatGPT

(rumors)

  • About 10x the typical spend on human annotation (millions of dollars),
  • On the order of 10B parameters (inferred from API costs [2]),
  • Uses RLHF and continually releases model versions,
  • Built in a quick internal sprint in Fall 2022 (not a long-term project) [1]

[1]: “Bing’s Revenge and Google’s AI Faceplant,” Hard Fork, Feb 10, 2023
[2]: https://openai.com/blog/introducing-chatgpt-and-whisper-apis

Outline

  • Origins of RLHF
  • Conceptual Overview
  • Reproducing ChatGPT
  • Addressable Research Questions


Modern RLHF overview


  1. Language model pretraining

  • Uses common training techniques in NLP
  • Base model strength determines how much supervised fine-tuning / annotation is needed
  • Open-source models are ~1.5 years behind closed models (for now!)

2. Reward model training

How do we capture human preferences over sampled and curated text? And what is the loss?

Goal: get a model that maps

input text → scalar reward
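The loss commonly used for this (e.g. in the summarization and InstructGPT work cited in these slides) is a pairwise ranking loss over human comparisons: the reward for the preferred completion should exceed the reward for the rejected one. A minimal PyTorch-style sketch, with illustrative tensor names:

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise ranking loss: -log sigmoid(r_chosen - r_rejected).

    r_chosen / r_rejected are the scalar rewards the model assigns to the
    human-preferred and human-rejected completions for the same prompt.
    """
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```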

2. Reward model training

Input dataset: prompts for the specific use case the model will serve (e.g. chat questions).

Generations: multiple models are often used to produce diverse completions for ranking (see the data-format sketch below).
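Comparison data is typically stored as (prompt, chosen, rejected) triples; the record below is illustrative rather than any particular library's schema:

```python
# Illustrative comparison-data record for reward model training.
# The two completions can come from different models to increase diversity.
comparison_example = {
    "prompt": "Summarize the main idea of RLHF in one sentence.",
    "chosen": "RLHF fine-tunes a language model against a learned model of human preferences.",
    "rejected": "RLHF is a database for storing chat logs.",
}
```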

2. Reward model training

Mapping from all possible inputs to a scalar!

Reward model:

- Also a transformer-based LM,
- Varies in size (relative to the policy),
- Multiple techniques to freeze the base model (see the sketch below).
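One common pattern (a sketch under my assumptions, not necessarily the exact setup used here) is to freeze the pretrained backbone and train only a small scalar head on its final hidden state; `RewardHead` and `freeze_backbone` are illustrative names:

```python
import torch
import torch.nn as nn

class RewardHead(nn.Module):
    """Maps the backbone's final hidden state to a scalar reward."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1)

    def forward(self, last_hidden_state: torch.Tensor) -> torch.Tensor:
        # Use the hidden state at the final token as the sequence summary.
        return self.score(last_hidden_state[:, -1, :]).squeeze(-1)

def freeze_backbone(backbone: nn.Module) -> None:
    """Freeze the pretrained LM so only the reward head is trained."""
    for param in backbone.parameters():
        param.requires_grad = False
```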

3. Fine tuning with RL

3. Fine tuning with RL - using a reward model

3. Fine tuning with RL - KL penalty

Constrains the RL fine-tuning so the LM does not drift into gibberish that fools the reward model.

Note: DeepMind applied this penalty in the RL loss (not the reward); see GopherCite.

Kullback–Leibler (KL) divergence: a distance-like measure between probability distributions.
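As commonly implemented (for instance in open RLHF libraries such as TRL), the penalty is computed per token from the log-probability gap between the policy and a frozen reference model, and the reward-model score is added at the final token. A sketch; `beta` and the tensor shapes are assumptions:

```python
import torch

def rlhf_token_rewards(
    policy_logprobs: torch.Tensor,  # log pi(token_t | context), shape (seq_len,)
    ref_logprobs: torch.Tensor,     # log pi_ref(token_t | context), shape (seq_len,)
    reward_model_score: float,      # scalar reward-model score for the full completion
    beta: float = 0.1,              # KL penalty coefficient (assumed value)
) -> torch.Tensor:
    """Per-token rewards: -beta * per-token KL estimate, plus the RM score at the end."""
    kl_per_token = policy_logprobs - ref_logprobs
    rewards = -beta * kl_per_token
    rewards[-1] += reward_model_score
    return rewards
```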

3. Fine tuning with RL - combining rewards

Additional terms can be added to this reward function. E.g. InstructGPT mixes in an objective term that rewards the policy for matching the original data distribution (pretraining / human-curated text); see the objective below.

Ouyang, Long, et al. "Training language models to follow instructions with human feedback." arXiv preprint arXiv:2203.02155 (2022).
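For reference, the combined InstructGPT ("PPO-ptx") objective from Ouyang et al. (2022), up to notation, is the KL-penalized reward plus a pretraining-mix term weighted by γ:

```latex
\text{objective}(\phi) =
  \mathbb{E}_{(x,y)\sim D_{\pi_\phi^{\mathrm{RL}}}}\!\left[ r_\theta(x,y)
    - \beta \log\frac{\pi_\phi^{\mathrm{RL}}(y\mid x)}{\pi^{\mathrm{SFT}}(y\mid x)} \right]
  + \gamma\, \mathbb{E}_{x\sim D_{\mathrm{pretrain}}}\!\left[ \log \pi_\phi^{\mathrm{RL}}(x) \right]
```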

3. Fine tuning with RL - feedback & training

- Policy-gradient methods (e.g. PPO) update the policy LM directly (memory intensive); a minimal sketch follows below.
- Often some parameters of the policy are frozen.
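A minimal sketch of the clipped policy-gradient (PPO-style) surrogate loss used in this step; the variable names and how advantages are computed are assumptions, not this deck's exact recipe:

```python
import torch

def ppo_clip_loss(
    new_logprobs: torch.Tensor,  # log-probs of the sampled tokens under the current policy
    old_logprobs: torch.Tensor,  # log-probs under the policy that generated the samples
    advantages: torch.Tensor,    # advantage estimates (e.g. rewards minus a value baseline)
    clip_eps: float = 0.2,
) -> torch.Tensor:
    """Clipped surrogate objective; returns a loss to minimize."""
    ratio = torch.exp(new_logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```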

Playing with the RL of RLHF

Some normal things with meh results:

❌ residual value prediction

❌ approximate KL estimators (http://joschu.net/blog/kl-approx.html; sketch below)

🤷‍♂️ entropy regularization

Always fighting instabilities: lower the learning rate, watch gradient norms and activation spikes, and add model checkpointing so runs can be restarted.
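For context, the estimators from the linked post: for samples x ~ q with ratio r = p(x) / q(x), k1 is the naive estimator of KL(q‖p) and k3 is an unbiased, lower-variance alternative. A small sketch:

```python
import torch

def kl_estimators(logp: torch.Tensor, logq: torch.Tensor):
    """Monte Carlo estimators of KL(q || p) from samples x ~ q.

    logp / logq are log-densities of the same samples under p and q.
    See http://joschu.net/blog/kl-approx.html
    """
    log_ratio = logp - logq                  # log r, where r = p(x) / q(x)
    k1 = -log_ratio                          # unbiased, but high variance and can go negative
    k3 = (log_ratio.exp() - 1) - log_ratio   # unbiased, lower variance, always >= 0
    return k1.mean(), k3.mean()
```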


Outline

  • Origins of RLHF
  • Conceptual Overview
  • Reproducing ChatGPT
  • Addressable Research Questions


Reproducing ChatGPT: Overview & challenges


Reproducing ChatGPT: Prompt & instruction datasets

  • T0pp: general LM prompts (Sanh et al. 2021; filtered from PromptSource)
  • Instruction datasets
    • Anthropic Harmless, Helpful, Honest (Bai et al. 2022; Ganguli et al. 2022): hf.co/datasets/Anthropic/hh-rlhf
    • Self-Instruct (GPT-3-generated prompts): Wang et al. 2022
    • Super-Natural Instructions (crowd-sourced prompts): Wang et al. 2022
  • Proprietary data (e.g. API usage, HF Spaces)
  • Paid curated datasets (more in a few slides)

Challenges

  • Academically available data is focused on benchmarks
  • Need for super-high-quality data

Reproducing ChatGPT: Language models

  • Flan-T5 (20B): Chung et al. 2022; Google
  • OPT, OPT-IML (6B, 175B): Zhang et al. 2022, Iyer et al. 2022; Meta
  • LLaMA (65B): Touvron et al. 2023; Meta
  • GPT-NeoX: Black et al. 2021; EleutherAI
  • BLOOM, BLOOM-Z: BigScience 2021, 2022; HuggingFace

Challenges

  • No open-source reproduction yet of a high-quality instruction-tuned model (at the level of GPT-3.5)

Reproducing ChatGPT: Data collection & annotation

  • Paid vendors: Scale AI, Surge AI, Toloka, Amazon Mechanical Turk, Upwork
  • Old datasets:

Challenges

  • Expensive to collect data
  • Product, UX, and red-teaming needs at scale

Reproducing ChatGPT: Ongoing (open-source) efforts

  1. HuggingFace: tools, datasets, models, and tutorials: https://huggingface.co/HuggingFaceH4
    1. Transformer Reinforcement Learning (TRL): https://github.com/lvwerra/trl
    2. Training & evaluation code coming soon 🤫
  2. LAION / Open-Assistant: end-to-end product approach: https://github.com/LAION-AI/Open-Assistant
  3. Other RLHF libraries: TRLX (https://github.com/CarperAI/trlx), RL4LMs (https://github.com/allenai/RL4LMs)


Outline

  • Origins of RLHF
  • Conceptual Overview
  • Reproducing ChatGPT
  • Addressable Research Questions


Formulating a research agenda on RLHF

Key factor: one module at a time!

  • Language models (hardest to research): effects of instruction tuning.
  • Preference modeling: beyond pairwise comparisons, dataset curation, model architecture relative to the policy, …
  • RL algorithms: interpretability, uncertainty management, stability / robustness, fewer hyperparameters, is RL needed, inference costs (offline RL), …
  • System-wide: evaluation, different domains (e.g. diffusion models, code generation)!

RLHF on outcomes vs. messages

Figure: outcome-based rewards vs. individual message rewards

RLHF on outcomes vs. messages

Outcome-based rewards leverage the rich history of RL as a multi-step optimizer:

  • Optimize on the outcome of the conversation
  • E.g. whether the user bought an item or was satisfied with customer service (see the sketch below)
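A toy sketch of the distinction (my illustration, not from the deck): per-message rewards score each assistant turn directly, while an outcome-based reward assigns one terminal signal and propagates it back through the conversation.

```python
from typing import List

def per_message_returns(message_rewards: List[float]) -> List[float]:
    """One reward per assistant message, used directly as the training signal."""
    return message_rewards

def outcome_returns(num_messages: int, outcome_reward: float, gamma: float = 1.0) -> List[float]:
    """A single end-of-conversation reward (e.g. purchase or satisfaction),
    propagated back to earlier turns as (optionally discounted) returns."""
    return [outcome_reward * gamma ** (num_messages - 1 - t) for t in range(num_messages)]
```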

Thanks

The RLHF team: Nazneen Rajani, Lewis Tunstall, Thomas Wolf, Tristan Thrush, Edward Beeching, Leandro von Werra, Younes Belkada

And more at HuggingFace and the community!

Resources

Related works: hf.co/blog/rlhf#further-reading

Models, datasets, tools: hf.co/HuggingFaceH4

(Training code for language / reward models soon)

Contact:

  • nathan@natolambert.com
  • twitter.com/natolambert


Conclusions

RLHF summary:

  • A compelling tool to integrate hard-to-model values into ML systems.
  • Progress towards open reproduction is getting going; it will feel like nothing and then happen all at once.
  • Many open research questions!

Supplementary slides follow

Variations on the methodology

Almost all papers to date have tweaks:

  • Anthropic - Claude
    • Initial policy: helpfulness, honesty, and harmlessness (HHH) context distillation
    • Preference model pretraining (PMP): fine-tune the LM on a dataset of binary rankings
    • Online iterated RLHF, RLAIF, Constitutional AI
  • OpenAI - InstructGPT
    • Humans generated initial LM training text; the RL policy is trained to match it
    • Most extensive human annotation work
  • DeepMind - Sparrow / GopherCite
    • Advantage actor-critic (A2C) instead of PPO, different RL loss
    • Specific rule set for alignment (trained on rules and preferences)
  • And more (please add what I missed to the chat)

Reward model training - feedback interfaces

Figures: examples of feedback collection interfaces

The opportunity of text-based feedback.


Prominent, recent papers


Recapping recent examples - InstructGPT


Ouyang, Long, et al. "Training language models to follow instructions with human feedback." arXiv preprint arXiv:2203.02155 (2022).

Recapping recent examples - Anthropic’s RLHF

Bai, Yuntao, et al. "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback." arXiv preprint arXiv:2204.05862 (2022).

Recapping recent examples - comparison

Bai, Yuntao, et al. "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback." arXiv preprint arXiv:2204.05862 (2022).

This is cherry-picked (and still instructive)!