1 of 12

RLHF Research Paper Review

Andy Lee

2 of 12

What is RLHF?

  • Reinforcement Learning from Human Feedback (RLHF) is a technique for fine-tuning models so that their outputs better match human preferences.
  • Specifically, we will cover RLHF applied to Large Language Models (LLMs), based on the OpenAI research paper Training language models to follow instructions with human feedback (Ouyang et al., 2022).

3 of 12

Overview of RLHF

Step 1: Fine-tune the LLM on a training dataset with supervised learning.

Step 2: Rank model responses with human evaluators and train a reward model.

Step 3: Use the reward model to fine-tune the LLM further with Proximal Policy Optimization (PPO).

4 of 12

Step 1: Fine-tune the model with Supervised Learning

  • The first step is to fine-tune the model with supervised learning on a dataset of prompt-answer pairs, where the answers are written by humans (a minimal sketch follows this list).
  • Supervised learning trains the model to minimize the difference between the model's outputs and the human-written answers.
  • This lets the model learn the domain knowledge and patterns of the dataset before RL-based fine-tuning is applied.
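A minimal sketch of this supervised fine-tuning step, assuming a Hugging Face causal LM in PyTorch; the model name ("gpt2"), the learning rate, and the toy prompt-answer pair are placeholders for illustration (the paper fine-tunes GPT-3, which is not publicly available).

import torch
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder base model; the real setup fine-tunes a GPT-3 model.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = AdamW(model.parameters(), lr=1e-5)

# One human-written demonstration: the prompt and the human answer concatenated.
prompt = "How would you greet someone at an event?"
answer = " Hello! It's so nice to meet you."
tokens = tokenizer(prompt + answer, return_tensors="pt")

# Passing labels=input_ids makes the model return the cross-entropy loss
# between its next-token predictions and the human-written text.
outputs = model(**tokens, labels=tokens["input_ids"])
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()

In practice the prompt tokens are usually masked out of the loss, so the model is only trained to reproduce the human answer.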

5 of 12

Supervised Learning Example

  • Prompt: "How would you greet someone at an event?"
  • Human: "Hello! It's so nice to meet you."
  • Model: "Sup! It's kinda nice to meet you."

6 of 12

Step 2: Train the Reward Model

  • The second step is to have human evaluators rank the outputs of the model from best to worst.
  • This ranking data is used to train a reward model: a separate model that assigns a numeric score to the quality of the model's output (see the loss sketch after this list).
  • The paper used a fine-tuned 6B GPT model as the reward model.
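A minimal sketch of the pairwise ranking loss used to train the reward model, assuming scalar scores have already been computed for a preferred and a rejected response; the function name and the toy score values (taken from the example on the next slide) are placeholders.

import torch
import torch.nn.functional as F

def reward_ranking_loss(score_preferred, score_rejected):
    # The reward model is trained so the preferred response scores higher:
    # loss = -log(sigmoid(r_preferred - r_rejected))
    return -F.logsigmoid(score_preferred - score_rejected).mean()

# Toy scores matching the reward model example on the next slide.
score_good = torch.tensor([0.9], requires_grad=True)
score_bad = torch.tensor([0.2], requires_grad=True)
loss = reward_ranking_loss(score_good, score_bad)
loss.backward()  # in practice, gradients flow through the 6B reward model
print(float(loss))  # ~0.40; the loss shrinks as the score gap widens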

7 of 12

Reward Model Example

Prompt: "How should you politely decline a job offer?"

  • Good Response: "Thank you for the consideration, but I will have to decline." (reward score: 0.9)
  • Bad Response: "Nah, not interested lol." (reward score: 0.2)

8 of 12

Step 3: Fine-tune the model with PPO

  • The last step is to use the reward model to fine-tune the Large Language Model with Proximal Policy Optimization (PPO).
  • The LLM first generates a response to a given prompt, which the reward model evaluates and assigns a numeric score.
  • Using these scores, PPO fine-tuning updates the model so that it becomes more likely to produce high-reward outputs.
  • PPO is a reinforcement learning algorithm that aims to maximize the expected reward while clipping the size of each policy update to keep training stable (see the sketch after this list).
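A minimal sketch of PPO's clipped surrogate objective, assuming placeholder log-probabilities from the old and updated policies and advantages derived from the reward model's scores. The full InstructGPT objective also includes a per-token KL penalty against the supervised model and, for the PPO-ptx variant, a pretraining-gradient term, both omitted here.

import torch

def ppo_clipped_loss(logp_new, logp_old, advantage, clip_eps=0.2):
    # Probability ratio between the updated policy and the policy
    # that generated the sampled responses.
    ratio = torch.exp(logp_new - logp_old)
    # Clipping the ratio bounds how far one update can move the policy,
    # which is what gives PPO its training stability.
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    return -torch.min(unclipped, clipped).mean()

# Toy per-response values; in RLHF the advantages come from reward-model scores.
logp_old = torch.tensor([-1.2, -0.8, -2.0])
logp_new = torch.tensor([-1.0, -0.9, -1.5], requires_grad=True)
advantage = torch.tensor([0.9, 0.2, -0.3])
loss = ppo_clipped_loss(logp_new, logp_old, advantage)
loss.backward()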

9 of 12

Experiments by OpenAI

GPT-3 (175B parameters) vs. InstructGPT (1.3B parameters)

10 of 12

Performance Comparison

[Figure from the paper comparing PPO-ptx (InstructGPT), SFT, and GPT-3 models]

11 of 12

Why is RLHF Important?

  • More Truthful Responses: The smaller model (1.3B) with RLHF hallucinated 21% of the time, compared to the larger model (175B) without RLHF at 41% (Ouyang et al., 2022).
  • More Preferable Responses: Human evaluators preferred outputs from the smaller model (1.3B) with RLHF over the much larger model (175B) without RLHF 85% of the time on public NLP datasets (Ouyang et al., 2022).
  • Less Toxic Responses: The smaller model (1.3B) with RLHF generated 25% fewer toxic responses than the larger model (175B) without RLHF (Ouyang et al., 2022).

12 of 12

Thank You!