Reinforcement Learning from Human Feedback:

Pathways to Open Reproduction of ChatGPT

Dr. Nathan Lambert, 9 March 2023

Recent breakthroughs in machine learning

  • Stable Diffusion (August 2022)
  • ChatGPT (November 2022)
  • Bard (early 2023)

Generative AI: The near future for graduate students

  • Code generation
  • Email & writing
  • Displaying results

Image: "An important figure from a machine learning research paper with some math" – Stable Diffusion

Generative AI: The near future for the world

Generative AI becomes the substrate of the internet:

  1. AI generates all content except the highest-quality work,
  2. Driven by a mix of human-generated prompts and autonomous systems.

Image: "the earth with a bunch of wires growing through it" – DALL·E 2

When machine learning models fall short

Reducing harms of machine learning systems

How do you write a loss function for:

  • What is fair?
  • What is ethical?
  • What is safe?

Key motivation of reinforcement learning from human feedback (RLHF):

Do not write a loss function for preferences; model them!

RLHF’s influence: intuition

RLHF re-shapes the model’s deep understanding of the world

  • Style transfer
  • Unlocking potential
  • Safety / ease of use


Outline

  • Origins of RLHF
  • Conceptual Overview
  • Reproducing ChatGPT
  • Addressable Research Questions


Review: reinforcement learning basics


History: RLHF for decision making

  • Pre-deep-RL: Knox, W. Bradley, and Peter Stone. "TAMER: Training an agent manually via evaluative reinforcement." 2008.
  • Deep RL: Christiano, Paul F., et al. "Deep reinforcement learning from human preferences." 2017.

History: early OpenAI experiments with RLHF

Stiennon, Nisan, et al. "Learning to summarize from human feedback." 2020.

Figure example: “Three pigs defend themselves from a mean wolf”

History: early OpenAI experiments with RLHF

Prompt:

To pursue a Computer Sc. PhD or continue working? Especially if one has no real intention to work in academia even after grad school ...

📓 Vanilla LM:

I’m considering pursuing a PhD in Computer Science, but I’m worried about the future. I’m currently employed full-time, but I’m worried about the future.

✍️ Human Annotation:

Software Engineer with a job I’m happy at (for now), deciding whether to pursue a PhD to improve qualifications and explore interests and a new challenge.

🤖 RLHF Model:

Currently employed, considering pursuing PhD in Computer Science to avoid being stuck with no residency visa ever again. Has anyone pursued a PhD purely for the sake of research, with no intention of joining the academic world?


Recent history: ChatGPT

(rumors)

  • About 10x the typical spend on human annotation (millions of dollars),
  • On the order of 10B parameters (inferred from API costs [2]),
  • Uses RLHF and continually releases model versions,
  • Built in a quick internal sprint in Fall 2022 (not a long-term project) [1]

[1]: “Bing’s Revenge and Google’s AI Faceplant,” Hard Fork, Feb 10, 2023
[2]: https://openai.com/blog/introducing-chatgpt-and-whisper-apis

Outline

  • Origins of RLHF
  • Conceptual Overview
  • Reproducing ChatGPT
  • Addressable Research Questions


Modern RLHF overview


  1. Language model pretraining

  • Uses common training techniques in NLP
  • Base model strength determines how much supervised fine-tuning / annotation is needed
  • Open-source models are ~1.5 years behind closed models (for now!)

2. Reward model training

How do we capture human preferences over sampled and curated text? And what is the loss?

Goal: get a model that maps

input text → scalar reward
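The loss commonly used for this (e.g. in the summarization and InstructGPT work cited in these slides) is a pairwise ranking loss over human comparisons: the reward for the preferred completion should exceed the reward for the rejected one. A minimal PyTorch-style sketch, with illustrative tensor names:

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise ranking loss: -log sigmoid(r_chosen - r_rejected).

    r_chosen / r_rejected are the scalar rewards the model assigns to the
    human-preferred and human-rejected completions for the same prompt.
    """
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```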

2. Reward model training

Input dataset: prompts for the specific use case the model will serve (e.g. chat questions).

Generations: multiple models are often used to produce diverse completions for ranking (see the data-format sketch below).
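Comparison data is typically stored as (prompt, chosen, rejected) triples; the record below is illustrative rather than any particular library's schema:

```python
# Illustrative comparison-data record for reward model training.
# The two completions can come from different models to increase diversity.
comparison_example = {
    "prompt": "Summarize the main idea of RLHF in one sentence.",
    "chosen": "RLHF fine-tunes a language model against a learned model of human preferences.",
    "rejected": "RLHF is a database for storing chat logs.",
}
```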

2. Reward model training

Mapping from all possible inputs to a scalar!

Reward model:

- Also a transformer-based LM,
- Varies in size (relative to the policy),
- Multiple techniques to freeze the base model (see the sketch below).
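One common pattern (a sketch under my assumptions, not necessarily the exact setup used here) is to freeze the pretrained backbone and train only a small scalar head on its final hidden state; `RewardHead` and `freeze_backbone` are illustrative names:

```python
import torch
import torch.nn as nn

class RewardHead(nn.Module):
    """Maps the backbone's final hidden state to a scalar reward."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1)

    def forward(self, last_hidden_state: torch.Tensor) -> torch.Tensor:
        # Use the hidden state at the final token as the sequence summary.
        return self.score(last_hidden_state[:, -1, :]).squeeze(-1)

def freeze_backbone(backbone: nn.Module) -> None:
    """Freeze the pretrained LM so only the reward head is trained."""
    for param in backbone.parameters():
        param.requires_grad = False
```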

3. Fine tuning with RL

3. Fine tuning with RL - using a reward model

3. Fine tuning with RL - KL penalty

Constrains the RL fine-tuning so the LM does not drift into gibberish that fools the reward model.

Note: DeepMind applied this penalty in the RL loss (not the reward); see GopherCite.

Kullback–Leibler (KL) divergence: a distance-like measure between probability distributions.
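As commonly implemented (for instance in open RLHF libraries such as TRL), the penalty is computed per token from the log-probability gap between the policy and a frozen reference model, and the reward-model score is added at the final token. A sketch; `beta` and the tensor shapes are assumptions:

```python
import torch

def rlhf_token_rewards(
    policy_logprobs: torch.Tensor,  # log pi(token_t | context), shape (seq_len,)
    ref_logprobs: torch.Tensor,     # log pi_ref(token_t | context), shape (seq_len,)
    reward_model_score: float,      # scalar reward-model score for the full completion
    beta: float = 0.1,              # KL penalty coefficient (assumed value)
) -> torch.Tensor:
    """Per-token rewards: -beta * per-token KL estimate, plus the RM score at the end."""
    kl_per_token = policy_logprobs - ref_logprobs
    rewards = -beta * kl_per_token
    rewards[-1] += reward_model_score
    return rewards
```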

3. Fine tuning with RL - combining rewards

Additional terms can be added to this reward function. E.g. InstructGPT mixes in an objective term that rewards the policy for matching the original data distribution (pretraining / human-curated text); see the objective below.

Ouyang, Long, et al. "Training language models to follow instructions with human feedback." arXiv preprint arXiv:2203.02155 (2022).
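For reference, the combined InstructGPT ("PPO-ptx") objective from Ouyang et al. (2022), up to notation, is the KL-penalized reward plus a pretraining-mix term weighted by γ:

```latex
\text{objective}(\phi) =
  \mathbb{E}_{(x,y)\sim D_{\pi_\phi^{\mathrm{RL}}}}\!\left[ r_\theta(x,y)
    - \beta \log\frac{\pi_\phi^{\mathrm{RL}}(y\mid x)}{\pi^{\mathrm{SFT}}(y\mid x)} \right]
  + \gamma\, \mathbb{E}_{x\sim D_{\mathrm{pretrain}}}\!\left[ \log \pi_\phi^{\mathrm{RL}}(x) \right]
```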

3. Fine tuning with RL - feedback & training

- Policy-gradient methods (e.g. PPO) update the policy LM directly (memory intensive); a minimal sketch follows below.
- Often some parameters of the policy are frozen.
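A minimal sketch of the clipped policy-gradient (PPO-style) surrogate loss used in this step; the variable names and how advantages are computed are assumptions, not this deck's exact recipe:

```python
import torch

def ppo_clip_loss(
    new_logprobs: torch.Tensor,  # log-probs of the sampled tokens under the current policy
    old_logprobs: torch.Tensor,  # log-probs under the policy that generated the samples
    advantages: torch.Tensor,    # advantage estimates (e.g. rewards minus a value baseline)
    clip_eps: float = 0.2,
) -> torch.Tensor:
    """Clipped surrogate objective; returns a loss to minimize."""
    ratio = torch.exp(new_logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```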

Playing with the RL of RLHF

Some normal things with meh results:

❌ residual value prediction

❌ approximate KL estimators (http://joschu.net/blog/kl-approx.html; sketch below)

🤷‍♂️ entropy regularization

Always fighting instabilities: lower the learning rate, watch gradient norms and activation spikes, and add model checkpointing so runs can be restarted.
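For context, the estimators from the linked post: for samples x ~ q with ratio r = p(x) / q(x), k1 is the naive estimator of KL(q‖p) and k3 is an unbiased, lower-variance alternative. A small sketch:

```python
import torch

def kl_estimators(logp: torch.Tensor, logq: torch.Tensor):
    """Monte Carlo estimators of KL(q || p) from samples x ~ q.

    logp / logq are log-densities of the same samples under p and q.
    See http://joschu.net/blog/kl-approx.html
    """
    log_ratio = logp - logq                  # log r, where r = p(x) / q(x)
    k1 = -log_ratio                          # unbiased, but high variance and can go negative
    k3 = (log_ratio.exp() - 1) - log_ratio   # unbiased, lower variance, always >= 0
    return k1.mean(), k3.mean()
```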


Outline

  • Origins of RLHF
  • Conceptual Overview
  • Reproducing ChatGPT
  • Addressable Research Questions


Reproducing ChatGPT: Overview & challenges


Reproducing ChatGPT: Prompt & instruction datasets

  • T0pp: general LM prompts (Sanh et al. 2021; filtered from PromptSource)
  • Instruction datasets
    • Anthropic Harmless, Helpful, Honest (Bai et al. 2022; Ganguli et al. 2022): hf.co/datasets/Anthropic/hh-rlhf
    • Self-Instruct (GPT-3-generated prompts): Wang et al. 2022
    • Super-Natural Instructions (crowd-sourced prompts): Wang et al. 2022
  • Proprietary data (e.g. API usage, HF Spaces)
  • Paid curated datasets (more in a few slides)

Challenges

  • Academically available data is focused on benchmarks
  • Need for super-high-quality data

Reproducing ChatGPT: Language models

  • Flan-T5 (20B): Chung et al. 2022; Google
  • OPT, OPT-IML (6B, 175B): Zhang et al. 2022, Iyer et al. 2022; Meta
  • LLaMA (65B): Touvron et al. 2023; Meta
  • GPT-NeoX: Black et al. 2021; EleutherAI
  • BLOOM, BLOOM-Z: BigScience 2021, 2022; HuggingFace

Challenges

  • No open-source reproduction yet of a high-quality instruction-tuned model (at the level of GPT-3.5)

Reproducing ChatGPT: Data collection & annotation

  • Paid vendors: Scale AI, Surge AI, Toloka, Amazon Mechanical Turk, Upwork
  • Old datasets:

Challenges

  • Expensive to collect data
  • Product, UX, and red-teaming needs at scale

Reproducing ChatGPT: Ongoing (open-source) efforts

  1. HuggingFace: tools, datasets, models, and tutorials: https://huggingface.co/HuggingFaceH4
    1. Transformer Reinforcement Learning (TRL): https://github.com/lvwerra/trl
    2. Training & evaluation code coming soon 🤫
  2. LAION / Open-Assistant: end-to-end product approach: https://github.com/LAION-AI/Open-Assistant
  3. Other RLHF libraries: TRLX (https://github.com/CarperAI/trlx), RL4LMs (https://github.com/allenai/RL4LMs)


Outline

  • Origins of RLHF
  • Conceptual Overview
  • Reproducing ChatGPT
  • Addressable Research Questions


Formulating a research agenda on RLHF

Key factor: one module at a time!

  • Language models (hardest to research): effects of instruction tuning.
  • Preference modeling: beyond pairwise comparisons, dataset curation, model architecture relative to the policy, …
  • RL algorithms: interpretability, uncertainty management, stability / robustness, fewer hyperparameters, is RL needed, inference costs (offline RL), …
  • System-wide: evaluation, different domains (e.g. diffusion models, code generation)!

RLHF on outcomes vs. messages

Figure: outcome-based rewards vs. individual message rewards

RLHF on outcomes vs. messages

Outcome-based rewards leverage the rich history of RL as a multi-step optimizer:

  • Optimize on the outcome of the conversation
  • E.g. whether the user bought an item or was satisfied with customer service (see the sketch below)
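A toy sketch of the distinction (my illustration, not from the deck): per-message rewards score each assistant turn directly, while an outcome-based reward assigns one terminal signal and propagates it back through the conversation.

```python
from typing import List

def per_message_returns(message_rewards: List[float]) -> List[float]:
    """One reward per assistant message, used directly as the training signal."""
    return message_rewards

def outcome_returns(num_messages: int, outcome_reward: float, gamma: float = 1.0) -> List[float]:
    """A single end-of-conversation reward (e.g. purchase or satisfaction),
    propagated back to earlier turns as (optionally discounted) returns."""
    return [outcome_reward * gamma ** (num_messages - 1 - t) for t in range(num_messages)]
```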

Thanks

The RLHF team: Nazneen Rajani, Lewis Tunstall, Thomas Wolf, Tristan Thrush, Edward Beeching, Leandro von Werra, Younes Belkada

And more at HuggingFace and the community!

Resources

Related works: hf.co/blog/rlhf#further-reading

Models, datasets, tools: hf.co/HuggingFaceH4

(Training code for language / reward models soon)

Contact:

  • nathan@natolambert.com
  • twitter.com/natolambert


Conclusions

RLHF summary:

  • A compelling tool to integrate hard-to-model values into ML systems.
  • Progress towards open reproduction is getting going; it will feel like nothing and then happen all at once.
  • Many open research questions!

Supplementary slides follow

Variations on the methodology

Almost all papers to date have tweaks:

  • Anthropic - Claude
    • Initial policy: helpfulness, honesty, and harmlessness (HHH) context distillation
    • Preference model pretraining (PMP): fine-tune the LM on a dataset of binary rankings
    • Online iterated RLHF, RLAIF, Constitutional AI
  • OpenAI - InstructGPT
    • Humans generated initial LM training text; the RL policy is trained to match it
    • Most extensive human annotation work
  • DeepMind - Sparrow / GopherCite
    • Advantage actor-critic (A2C) instead of PPO, different RL loss
    • Specific rule set for alignment (trained on rules and preferences)
  • And more (please add what I missed to the chat)

Reward model training - feedback interfaces

Figures: examples of feedback collection interfaces

The opportunity of text-based feedback.


Prominent, recent papers


Recapping recent examples - InstructGPT


Ouyang, Long, et al. "Training language models to follow instructions with human feedback." arXiv preprint arXiv:2203.02155 (2022).

Recapping recent examples - Anthropic’s RLHF

Bai, Yuntao, et al. "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback." arXiv preprint arXiv:2204.05862 (2022).

Recapping recent examples - comparison

Bai, Yuntao, et al. "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback." arXiv preprint arXiv:2204.05862 (2022).

This is cherry-picked (and still instructive)!