LLM RLHF
RLTHF: Targeted Human Feedback for LLM Alignment
Jongmin Park
Sungmin Mun
2025.05.07
Motivation
01
Contents
InstructGPT
02
RLTHF
03
DeepSeek-R1
04
Motivation
01
AI and implications
· “The field of AI is often thought of as having four distinct approaches, which can be described as thinking humanly, thinking rationally, acting humanly, and acting rationally.”
· Implication: if machines are to behave the way humans do (and need to), then one must grapple with learning from human feedback
- "Artificial Intelligence: A Modern Approach" (S. Russell and P. Norvig)
AI agent
x ∈ 𝒳
Agent:
Performs a useful task by mapping input → output
y ∈ 𝒴
Input
Output
AI agent
Output space is large and complex
AI agent
Feedback about output
Task: Image recognition
“Cat”
AI Agent
Task: Medical diagnosis
Symptom 1
Symptom 2
Symptom 3
Body temperature
Blood pressure
Blood sugar level
SpO2 level
Diagnosis: Type-2 Diabetes
AI Agent
Task: (personalized) Content recommendation
User features
User history
Product catalogue
Movies that you like
AI Agent
Task: Game playing
Best Move: C6
Eval: +2.6
AI Agent
Task: Question-answering (chatbot)
Where is GIST?
The Gwangju Institute of Science and Technology (GIST) is located in Gwangju, a metropolitan city in the southwestern region of South Korea.
Address: Gwangju Institute of Science and Technology, 123 Cheomdan-gwagiro, Buk-gu, Gwangju, South Korea
AI Agent
Human feedback to train the agent
x ∈ 𝒳
y ∈ 𝒴
Input
Output
Feedback by Demonstration (chatbot)
Prompt:
Write a poem to help me remember the first 10 elements on the periodic table, giving each element its own line.
Response:
Hydrogen comes first as element number one.
Helium is second for balloons to have fun!
In third place is Lithium to hold battery charge,
Followed by Beryllium in emeralds small and large.
Boron's number five to help us keep things clean.
Carbon's next at six, and it's found in every gene.
Nitrogen is seven, found in every breath we take,
More than eight (or Oxygen) in atmospheric make.
Number nine is Fluorine, helping polish up our teeth.
Neon gives us glowing signs to read and stand beneath.
Feedback by Numerical scoring (rec systems)
User preferences,
context
Recommendation:
Stay at Hotel ABC
Feedback by Comparison (LLM fine-tuning)
Prompt:
Please generate a description of …
Modern LLMs ask for preference feedback
Chain of Thought
Q: There are three boxes. One contains apples, one oranges, and one both. Each is labeled incorrectly. You can check one fruit. Which box should you choose to label them correctly?
A: The one labeled “both.”
B: Choose the box labeled "both"; since every label is wrong, it contains only one kind of fruit.
If it has only apples, then the one labeled "oranges" must have both, and the other has oranges.
The case for fine-tuning with preferences
Ex) A. "That's incorrect. Please refer to the documentation."
B. "Good question! It seems like there might be a misunderstanding—here's the correct information :)"
Reward design, misspecification & hacking
Source: https://bair.berkeley.edu/blog/2022/04/29/reward-reports/
Training language models to follow instructions with human feedback
02
OpenAI, 2022
(InstructGPT)
Motivation & Problem
- “How can we align language models more closely with human intent?”
RLHF!
The three steps of the method: supervised fine-tuning (SFT), reward modeling (RM), and RL with PPO
Reward modeling (RM)
The RM is trained on a dataset of comparisons between two model outputs for the same input.
Loss function:
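A hedged reconstruction of the pairwise loss from the InstructGPT paper, where r_\theta(x, y) is the scalar reward for prompt x and completion y, y_w is the preferred response, and K is the number of responses ranked per prompt:

\text{loss}(\theta) = -\frac{1}{\binom{K}{2}}\, \mathbb{E}_{(x,\,y_w,\,y_l)\sim D}\!\left[\log \sigma\!\left(r_\theta(x, y_w) - r_\theta(x, y_l)\right)\right]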
Data collection
Evaluation – Human Preference
Results (figure: human preference win rate and Likert score)
Results (figure: toxicity score)
Error Analysis - Remaining Issues
RLTHF
03
Comparison With RLHF
RLTHF, Math-Shepherd
DPO
GRPO
Weakness of RLHF and RLAIF
Cost: RLHF high, RLAIF low
Performance: ?
Weakness of RLHF and RLAIF
RLAIF's weakness
Some issues:
1. Is reinforcement learning really necessary if the critic and the teacher are based on the same model?
2. If the student model's distribution is weak, the preference data fails to explore enough high-quality actions.
A Critical Evaluation of AI Feedback for Aligning Large Language Models: https://proceedings.neurips.cc/paper_files/paper/2024/file/33870b3e099880cd8e705cd07173ac27-Paper-Conference.pdf
Idea
Human annotation for hard samples; AI annotation for easy samples.
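A minimal sketch of this routing, assuming hard samples are identified by a low reward-model score margin; reward_margin, ai_label, human_label, and the threshold are hypothetical stand-ins, not the RLTHF implementation.

def targeted_annotation(samples, reward_margin, ai_label, human_label, threshold=0.1):
    """Route hard (low-margin) samples to humans and easy ones to AI labeling (sketch)."""
    labeled = []
    for s in samples:
        if reward_margin(s) < threshold:          # hard sample -> targeted human feedback
            labeled.append((s, human_label(s), "human"))
        else:                                     # easy sample -> cheap AI feedback
            labeled.append((s, ai_label(s), "ai"))
    return labeled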
Details & Results
DeepSeek-R1
04
DeepSeek-R1: Incentivizing Reasoning via RL
Contribution
Post-training: Large-Scale Reinforcement Learning on the Base Model
Distillation: Smaller Models Can Be Powerful Too
A model distilled from a large model > a small model trained with RL
PPO & GRPO & DPO
No Critic
PPO & GRPO & DPO
PPO
GRPO
cf) generalized advantage estimation
Outcome supervision
Process supervision?
Value function (the model being trained)
Importance sampling
Clipping
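For reference, a standard-form reconstruction of the clipped PPO objective that these labels annotate; the advantage \hat{A}_t is estimated with the learned value function (e.g. via generalized advantage estimation):

\mathcal{J}_{\mathrm{PPO}}(\theta) = \mathbb{E}_t\!\left[\min\!\left(\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}\,\hat{A}_t,\;\operatorname{clip}\!\left(\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)},\,1-\varepsilon,\,1+\varepsilon\right)\hat{A}_t\right)\right]

The probability ratio is the importance-sampling term, and the clip keeps each update close to the old policy.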
GRPO
This provides the motivation for introducing importance sampling.
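The GRPO objective, as given in the DeepSeekMath / DeepSeek-R1 papers, drops the value function and instead normalizes the rewards of a group of G outputs sampled for the same question q (outcome supervision):

\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G}\min\!\left(\frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)}\,A_i,\;\operatorname{clip}\!\left(\frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)},\,1-\varepsilon,\,1+\varepsilon\right)A_i\right) - \beta\,\mathbb{D}_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right)\right]

A_i = \frac{r_i - \operatorname{mean}(r_1,\dots,r_G)}{\operatorname{std}(r_1,\dots,r_G)}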
PPO & GRPO & DPO
cf) DPO (not RL)
Motivation: to learn the policy directly, without reward modeling
Premise of both DPO and PPO (take this as given)
MLE
Reinforcement Learning
This is PPO
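A hedged reconstruction of the equations these labels point to. The shared premise is the Bradley–Terry preference model; fitting a reward r_\phi by maximum likelihood and then maximizing it under a KL penalty is the RLHF / PPO route:

\text{Premise:}\quad p(y_w \succ y_l \mid x) = \sigma\!\left(r(x, y_w) - r(x, y_l)\right)

\text{MLE:}\quad \mathcal{L}_R(\phi) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim D}\!\left[\log \sigma\!\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right)\right]

\text{RL:}\quad \max_{\pi_\theta}\;\mathbb{E}_{x\sim D,\; y\sim \pi_\theta(\cdot\mid x)}\!\left[r_\phi(x, y)\right] - \beta\,\mathbb{D}_{\mathrm{KL}}\!\left(\pi_\theta(\cdot\mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\right)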
PPO & GRPO & DPO
cf) DPO (not RL)
Motivation: to learn the policy directly, without reward modeling
Premise of both DPO and PPO
MLE
This is DPO
This raises the question: why collect preference data but not use it directly for the policy?
Substitute.
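That is, substituting the policy-parameterized reward (the re-parameterization is derived on the following Details slides) into the same MLE yields the DPO loss, in the standard form from the DPO paper; no separate reward model or RL loop is needed:

\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim D}\!\left[\log \sigma\!\left(\beta \log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} - \beta \log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right]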
PPO & GRPO & DPO
cf) Details
Why?
By introducing the KL term, the optimal policy can be derived analytically (in closed form).
But Z(x) is hard to compute, so the policy cannot be obtained analytically in practice.
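The step behind these two lines, written out: the KL-regularized objective has an analytic maximizer, but its partition function Z(x) is intractable:

\pi^{*}(y\mid x) = \frac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y\mid x)\,\exp\!\left(\tfrac{1}{\beta}\,r(x, y)\right),\qquad Z(x) = \sum_{y}\pi_{\mathrm{ref}}(y\mid x)\,\exp\!\left(\tfrac{1}{\beta}\,r(x, y)\right)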
PPO & GRPO & DPO
cf) Details
Because the target is the policy, we use the policy to represent the reward in the Premise
(again, the familiar premise)
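Rearranging the closed-form solution expresses the reward through the policy; plugged back into the pairwise premise, the intractable \log Z(x) term cancels:

r(x, y) = \beta \log\frac{\pi^{*}(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)} + \beta \log Z(x)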
ORM & PRM (Math-Shepherd)
Reference: Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations
ORM: a single label for the whole solution (Q → s_1 → s_2 → s_3 → s_4 → answer), based on whether the final answer is correct.
PRM (human annotated): a positive/negative label for each individual step s_1, …, s_4.
ORM & PRM (Math-Shepherd)
Reference: Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations
PRM (automated)
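A minimal sketch of Math-Shepherd's automated step labeling: from each intermediate step, sample several completions and score the step by whether (or how often) they reach the correct answer. sample_completions and extract_answer are hypothetical helpers, not the paper's code.

def step_label(question, step_prefix, gold_answer, sample_completions, extract_answer, n=8, soft=True):
    """Automated process label for the partial solution `step_prefix` (sketch)."""
    rollouts = sample_completions(question, step_prefix, n)   # n completions continuing from this step
    hits = sum(extract_answer(r) == gold_answer for r in rollouts)
    if soft:
        return hits / n               # soft estimation: fraction of rollouts reaching the correct answer
    return 1 if hits > 0 else 0       # hard estimation: 1 if any rollout reaches the correct answer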
Back to DeepSeek-R1-Zero
Instead, a rule-based reward model (not a neural reward model):
1. Accuracy Reward
2. Format Reward
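A hedged sketch of what such a rule-based reward can look like: an accuracy check on a deterministically extractable answer plus a format check on the think/answer tags. The regexes and weights here are illustrative assumptions, not DeepSeek's implementation.

import re

def rule_based_reward(response: str, gold_answer: str) -> float:
    """Illustrative rule-based reward: format reward + accuracy reward (sketch)."""
    # Format reward: reasoning and answer should be wrapped in the expected tags.
    format_ok = re.search(r"<think>.*</think>\s*<answer>.*</answer>", response, re.S) is not None
    # Accuracy reward: deterministically compare the extracted answer with the ground truth.
    match = re.search(r"<answer>(.*?)</answer>", response, re.S)
    answer_ok = match is not None and match.group(1).strip() == gold_answer.strip()
    return (1.0 if format_ok else 0.0) + (1.0 if answer_ok else 0.0)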
R1-Zero: why not a neural reward model?
1. Reward hacking
2. Additional training resources
3. Complexity of the training pipeline
Results
DeepSeek R1
R1
Difference from R1-Zero
While DeepSeek-R1-Zero starts RL directly from the base model, DeepSeek-R1 first collects a small amount of long-CoT data and fine-tunes the model on it (cold start).
Limitation of pure Reinforcement Learning
Low readability, language mixing
Approach
Human-annotated data (with a preference for good readability)
Add a reward for language consistency
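A hedged sketch of such a language-consistency reward, following the paper's description of using the proportion of target-language words in the CoT; cot_tokens and is_target_language are hypothetical inputs (e.g. a tokenized CoT and a language-ID check), not DeepSeek's code.

def language_consistency_reward(cot_tokens, is_target_language):
    """Proportion of chain-of-thought tokens in the target language (sketch)."""
    if not cot_tokens:
        return 0.0
    return sum(1 for tok in cot_tokens if is_target_language(tok)) / len(cot_tokens)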
DeepSeek R1
R1
Reasoning-oriented Reinforcement Learning
Reinforcement Learning for all Scenarios (for generality)
Reasoning data: rule-based reward model
General data: neural-network-based reward model (DeepSeek-V3 fine-tuned on curated data)
The process is almost the same as for DeepSeek-R1-Zero.
Rejection sampling and Fine-Tuning
Collect data via rejection sampling and incorporate it through SFT
Results