1 of 12

RLHF Research Paper Review

Andy Lee

2 of 12

What is RLHF?

  • Reinforcement Learning from Human Feedback (RLHF) is a technique for fine-tuning models so that their outputs better match human preferences.
  • Specifically, we will cover RLHF applied to Large Language Models (LLMs), based on the OpenAI research paper Training language models to follow instructions with human feedback (Ouyang et al., 2022).

3 of 12

Overview of RLHF

Step 1: Fine-tune the LLM on a training dataset with supervised learning.

Step 2: Rank model responses with human evaluators and train a reward model.

Step 3: Use the reward model to fine-tune the LLM further with Proximal Policy Optimization (PPO).

4 of 12

Step 1: Fine-tune the model with Supervised Learning

  • The first step is to fine-tune the model with supervised learning on a dataset of prompt-answer pairs, where the answers are written by humans (a minimal sketch follows this list).
  • Supervised learning trains the model to minimize the difference between the model's outputs and the human-written answers.
  • This lets the model learn the domain knowledge and patterns of the dataset before RL-based fine-tuning is applied.
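A minimal sketch of this supervised fine-tuning step, assuming a Hugging Face causal LM in PyTorch; the model name ("gpt2"), the learning rate, and the toy prompt-answer pair are placeholders for illustration (the paper fine-tunes GPT-3, which is not publicly available).

import torch
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder base model; the real setup fine-tunes a GPT-3 model.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = AdamW(model.parameters(), lr=1e-5)

# One human-written demonstration: the prompt and the human answer concatenated.
prompt = "How would you greet someone at an event?"
answer = " Hello! It's so nice to meet you."
tokens = tokenizer(prompt + answer, return_tensors="pt")

# Passing labels=input_ids makes the model return the cross-entropy loss
# between its next-token predictions and the human-written text.
outputs = model(**tokens, labels=tokens["input_ids"])
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()

In practice the prompt tokens are usually masked out of the loss, so the model is only trained to reproduce the human answer.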

5 of 12

Supervised Learning Example

  • Prompt: "How would you greet someone at an event?"
  • Human: "Hello! It's so nice to meet you."
  • Model: "Sup! It's kinda nice to meet you."

6 of 12

Step 2: Train the Reward Model

  • The second step is to have human evaluators rank the outputs of the model from best to worst.
  • This ranking data is used to train a reward model: a separate model that assigns a numeric score to the quality of the model's output (see the loss sketch after this list).
  • The paper used a fine-tuned 6B GPT model as the reward model.
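A minimal sketch of the pairwise ranking loss used to train the reward model, assuming scalar scores have already been computed for a preferred and a rejected response; the function name and the toy score values (taken from the example on the next slide) are placeholders.

import torch
import torch.nn.functional as F

def reward_ranking_loss(score_preferred, score_rejected):
    # The reward model is trained so the preferred response scores higher:
    # loss = -log(sigmoid(r_preferred - r_rejected))
    return -F.logsigmoid(score_preferred - score_rejected).mean()

# Toy scores matching the reward model example on the next slide.
score_good = torch.tensor([0.9], requires_grad=True)
score_bad = torch.tensor([0.2], requires_grad=True)
loss = reward_ranking_loss(score_good, score_bad)
loss.backward()  # in practice, gradients flow through the 6B reward model
print(float(loss))  # ~0.40; the loss shrinks as the score gap widens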

7 of 12

Reward Model Example

Prompt: "How should you politely decline a job offer?"

  • Good Response: "Thank you for the consideration, but I will have to decline." (reward score: 0.9)
  • Bad Response: "Nah, not interested lol." (reward score: 0.2)

8 of 12

Step 3: Fine-tune the model with PPO

  • The last step is to use the reward model to fine-tune the Large Language Model with Proximal Policy Optimization (PPO).
  • The LLM first generates a response to a given prompt, which the reward model evaluates and assigns a numeric score.
  • Using these scores, PPO fine-tuning updates the model so that it becomes more likely to produce high-reward outputs.
  • PPO is a reinforcement learning algorithm that aims to maximize the expected reward while clipping the size of each policy update to keep training stable (see the sketch after this list).
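A minimal sketch of PPO's clipped surrogate objective, assuming placeholder log-probabilities from the old and updated policies and advantages derived from the reward model's scores. The full InstructGPT objective also includes a per-token KL penalty against the supervised model and, for the PPO-ptx variant, a pretraining-gradient term, both omitted here.

import torch

def ppo_clipped_loss(logp_new, logp_old, advantage, clip_eps=0.2):
    # Probability ratio between the updated policy and the policy
    # that generated the sampled responses.
    ratio = torch.exp(logp_new - logp_old)
    # Clipping the ratio bounds how far one update can move the policy,
    # which is what gives PPO its training stability.
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    return -torch.min(unclipped, clipped).mean()

# Toy per-response values; in RLHF the advantages come from reward-model scores.
logp_old = torch.tensor([-1.2, -0.8, -2.0])
logp_new = torch.tensor([-1.0, -0.9, -1.5], requires_grad=True)
advantage = torch.tensor([0.9, 0.2, -0.3])
loss = ppo_clipped_loss(logp_new, logp_old, advantage)
loss.backward()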

9 of 12

Experiments by OpenAI

GPT-3 (175B parameters) vs. InstructGPT (1.3B parameters)

10 of 12

Performance Comparison

[Figure from the paper comparing PPO-ptx (InstructGPT), SFT, and GPT-3 models]

11 of 12

Why is RLHF Important?

  • More Truthful Responses: The smaller model (1.3B) with RLHF hallucinated 21% of the time, compared to the larger model (175B) without RLHF at 41% (Ouyang et al., 2022).
  • More Preferable Responses: Human evaluators preferred outputs from the smaller model (1.3B) with RLHF over the much larger model (175B) without RLHF 85% of the time on public NLP datasets (Ouyang et al., 2022).
  • Less Toxic Responses: The smaller model (1.3B) with RLHF generated 25% fewer toxic responses than the larger model (175B) without RLHF (Ouyang et al., 2022).

12 of 12

Thank You!