1 of 53

LLM RLHF

RLTHF: Targeted Human Feedback for LLM Alignment

Jongmin Park

Sungmin Mun

2025.05.07

2 of 53

Contents

01  Motivation
02  InstructGPT
03  RLTHF
04  DeepSeek-R1

3 of 53

Motivation

01

4 of 53

AI and implications

· “The field of AI is often thought of as having four distinct approaches, which can be described as thinking humanly, thinking rationally, acting humanly, and acting rationally.”

· Implication: if machines are to act humanly, one must grapple with learning from human feedback

- “Artificial Intelligence: A Modern Approach” (S. Russell and P. Norvig)

5 of 53

AI agent

Agent: performs a useful task by mapping input x ∈ 𝒳 to output y ∈ 𝒴

6 of 53

AI agent

Input x ∈ 𝒳 → Output y ∈ 𝒴. The output space is large and complex.

7 of 53

AI agent

Input x ∈ 𝒳 → Output y ∈ 𝒴, with feedback about the output.

8 of 53

Task: Image recognition

“Cat”

AI Agent

9 of 53

Task: Medical diagnosis

Inputs: Symptom 1, Symptom 2, Symptom 3, body temperature, blood pressure, blood sugar level, SpO2 level → AI Agent → Diagnosis: Type-2 diabetes

10 of 53

Task: (personalized) Content recommendation

Inputs: user features, user history, product catalogue → AI Agent → movies that you like

11 of 53

Task: (personalized) Content recommendation (same as the previous slide)

12 of 53

Task: Game playing

Best Move: C6

Eval: +2.6

AI Agent

13 of 53

Task: Question-answering (chatbot)

Where is GIST?

The Gwangju Institute of Science and Technology (GIST) is located in Gwangju, a metropolitan city in the southwestern region of South Korea.

Address: Gwangju Institute of Science and Technology, 123 Cheomdan-gwagiro, Buk-gu, Gwangju, South Korea

AI Agent

14 of 53

Human feedback to train the agent

Input x ∈ 𝒳 → Output y ∈ 𝒴

15 of 53

Feedback by Demonstration (chatbot)

Prompt:

Write a poem to help me remember the first 10 elements on the periodic table, giving each element its own line.

Response:
Hydrogen comes first as element number one.
Helium is second for balloons to have fun!
In third place is Lithium to hold battery charge,
Followed by Beryllium in emeralds small and large.
Boron's number five to help us keep things clean.
Carbon’s next at six, and it’s found in every gene.
Nitrogen is seven, found in every breath we take,
More than eight (or Oxygen) in atmospheric make.
Number nine is Fluorine, helping polish up our teeth.
Neon gives us glowing signs to read and stand beneath.

16 of 53

Feedback by Demonstration (chatbot)

Prompt and response: the same demonstration as on the previous slide.

  • Pro: Very informative
  • Con: Costly, time-consuming

17 of 53

Feedback by Numerical Scoring (recommender systems)

User preferences, context

Recommendation:

Stay at Hotel ABC

  • Pro: informative
  • Con: Somewhat costly, burdensome

18 of 53

Feedback by Comparison (LLM fine-tuning)

Prompt:

Please generate a description of …

  • Pro: Easy, Quick, Lightweight
  • Con: Coarse, not very informative

19 of 53

Modern LLMs ask for preference feedback

20 of 53

Chain of Thought

Q: There are three boxes. One contains apples, one oranges, and one both. Each is labeled incorrectly. You can check one fruit. Which box should you choose to label them correctly?

A: The one labeled “both.”

B: Choose the box labeled “both,” because it's incorrect.

If it contains only apples, then the box labeled “oranges” must contain both, and the box labeled “apples” contains oranges.

21 of 53

The case for fine-tuning with preferences

  • Emotions and feelings are often hard to quantify numerically: toxicity, friendliness (tone of writing), an individual's writing style, etc.

  • A warm start (with reward- or loss-based supervised learning) helps reduce the sample complexity of learning from preference feedback

  • Even if both responses are factually correct, humans may prefer one that feels more natural and trustworthy—an important but hard-to-quantify distinction that preference feedback can effectively capture.

Ex) A. "That’s incorrect. Please refer to the documentation."
    B. "Good question! It seems like there might be a misunderstanding—here’s the correct information :)"

22 of 53

Reward design, misspecification & hacking

Source: https://bair.berkeley.edu/blog/2022/04/29/reward-reports/

23 of 53

Training language models to follow instructions with human feedback

02

OpenAI, 2022

(InstructGPT)

24 of 53

Motivation & Problem

  • GPT-3 is a large language model, but it does not reliably follow user instructions.

  • Problem: hallucination, bias, toxic content, instruction misalignment

  • Basic LM training objective (“predict the next token”) is misaligned with user intent

- “How can we align language models more closely with human intent?”

RLHF!

25 of 53

Three steps of the method: (1) supervised fine-tuning (SFT) on labeler demonstrations, (2) reward model (RM) training on human comparisons, (3) RL fine-tuning with PPO using the RM as the reward

26 of 53

Reward modeling (RM)

The RM is trained on a dataset of comparisons between two model outputs for the same input.

Loss function:
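The pairwise ranking loss from the InstructGPT paper, where r_θ(x, y) is the scalar reward for response y to prompt x, y_w is the preferred and y_l the dispreferred response, and the K responses collected per prompt yield (K choose 2) comparisons:

\[
\mathcal{L}(\theta) \;=\; -\frac{1}{\binom{K}{2}}\,
\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
\Big[\log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)\Big]
\]

A minimal PyTorch-style sketch of the same loss, assuming the rewards for the chosen and rejected responses have already been computed as batched scalars (the function name is ours, not from the paper):

```python
import torch
import torch.nn.functional as F

def rm_pairwise_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    # -log sigmoid(r_theta(x, y_w) - r_theta(x, y_l)), averaged over the batch of comparisons
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()
```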

27 of 53

Data collection

  • 40 labelers

  • Prompt–response pairs

  • 3 datasets (SFT, RM, PPO)

28 of 53

Evaluation – Human Preference

  • 3H(Helpful, Honest, Harmless)

  • Metrics – Winrate, Likert scale, Hallucination, Toxicity Score

  • API Prompt Set, Public NLP Set

29 of 53

Results

(Figures: human preference win rate; Likert score.)

30 of 53

Results

(Figure: toxicity score.)

31 of 53

Error Analysis - Remaining Issues

  • Hedging

  • False Premises

  • Constraint violation

32 of 53

RLTHF

03

33 of 53

Comparison With RLHF

RLTHF, Math-Shepherd

DPO

GRPO

34 of 53

Weakness of RLHF and RLAIF

Cost: RLHF high, RLAIF low
Performance: ?

35 of 53

Weakness of RLHF and RLAIF

RLAIF's weaknesses:
  1. Prompt optimization
  2. Task complexity
  3. Model bias
  4. Generator-discriminator gap
  5. Capability of the judge model

Some issues:
  1. Is reinforcement learning really necessary if the critic and the teacher are based on the same model?
  2. If the student model's distribution is weak, the preference data fails to explore enough high-quality actions.

A Critical Evaluation of AI Feedback for Aligning Large Language Models : https://proceedings.neurips.cc/paper_files/paper/2024/file/33870b3e099880cd8e705cd07173ac27-Paper-Conference.pdf

36 of 53

Idea

Hard samples → human annotation; easy samples → AI annotation (a minimal routing sketch follows below)
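A minimal illustrative sketch of this routing idea. The margin-based scoring and the 0.1 threshold are assumptions for illustration, not the RLTHF recipe; `score_pair` stands in for whatever AI judge or reward model produces per-response scores:

```python
# Route uncertain ("hard") samples to human annotators; keep AI labels for
# confident ("easy") ones. score_pair(x) -> (score_a, score_b) is a
# hypothetical AI-judge scoring function for the two candidate responses.
def route_samples(samples, score_pair, margin_threshold=0.1):
    hard, easy = [], []
    for x in samples:
        score_a, score_b = score_pair(x)
        if abs(score_a - score_b) < margin_threshold:
            hard.append(x)   # ambiguous: send to human annotation
        else:
            easy.append(x)   # confident: keep the AI annotation
    return hard, easy
```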

37 of 53

Details & Results

38 of 53

DeepSeek-R1

04

39 of 53

DeepSeek-R1: Incentivizing Reasoning via RL

Contribution

Post-training : Large-Scale Reinforcement Learning on the Base Model

Distillation: Smaller Models Can Be Powerful Too

A model distilled from a large model > a small model trained directly with RL

40 of 53

PPO & GRPO & DPO

No Critic

41 of 53

PPO & GRPO & DPO

PPO: learned value function (a critic model trained alongside the policy), advantages via generalized advantage estimation, importance sampling, and clipping.

GRPO: no value function; advantages come from outcome supervision over a group of sampled outputs (process supervision is also possible), with the same importance-sampling ratio and clipping.
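For reference, a sketch of the two objectives in standard notation (PPO from Schulman et al.; GRPO as introduced in DeepSeekMath and used for DeepSeek-R1). Per-token details and the KL-penalty term are simplified here:

\[
\mathcal{J}_{\mathrm{PPO}}(\theta) = \mathbb{E}\Big[\min\big(\rho_t \hat A_t,\; \mathrm{clip}(\rho_t,\, 1-\varepsilon,\, 1+\varepsilon)\, \hat A_t\big)\Big],
\qquad
\rho_t = \frac{\pi_\theta(o_t \mid q, o_{<t})}{\pi_{\theta_{\mathrm{old}}}(o_t \mid q, o_{<t})},
\]

where \(\hat A_t\) comes from generalized advantage estimation using the learned value function. GRPO keeps the same importance-sampling ratio and clipping but drops the critic: for each question \(q\) it samples a group of \(G\) outputs \(\{o_1,\dots,o_G\}\) from \(\pi_{\theta_{\mathrm{old}}}\) and uses the group-relative advantage

\[
\hat A_i = \frac{r_i - \mathrm{mean}(r_1,\dots,r_G)}{\mathrm{std}(r_1,\dots,r_G)} .
\]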

42 of 53

GRPO

Outputs are sampled from the old (behavior) policy while the current policy is being optimized; this mismatch provides the motivation for introducing importance sampling.

43 of 53

PPO & GRPO & DPO

cf) DPO (not RL)

Motivation: to learn the policy without explicit reward modeling.

Premise shared by both DPO and PPO-based RLHF (you should just accept it!): preferences follow the Bradley–Terry model, so a reward is fit by MLE on the comparison data, and the policy is then obtained by KL-regularized reinforcement learning against that reward. Fitting the reward by MLE and then running RL on it is the PPO route.
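In standard notation (following the DPO paper), the shared premise is the Bradley–Terry preference model, whose reward is fit by MLE on comparisons, plus the KL-regularized RL objective that PPO then optimizes:

\[
p(y_w \succ y_l \mid x) = \sigma\big(r(x, y_w) - r(x, y_l)\big)
\]
\[
\max_{\pi_\theta}\;\mathbb{E}_{x\sim\mathcal{D},\, y\sim\pi_\theta(\cdot\mid x)}\big[r(x, y)\big]
\;-\;\beta\,\mathbb{D}_{\mathrm{KL}}\!\big(\pi_\theta(\cdot\mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big)
\]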

44 of 53

PPO & GRPO & DPO

cf) DPO (not RL)

Motivation: to learn the policy without explicit reward modeling. It answers the question: why fit a reward model on the preference data instead of using the preference data directly in the policy objective?

DPO: substitute the closed-form optimal policy of the KL-regularized objective into the same Bradley–Terry MLE premise. The result is a maximum-likelihood objective written directly in terms of the policy. This is DPO.
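The resulting objective, as written in the DPO paper (β is the KL-regularization strength, π_ref the reference policy):

\[
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) =
-\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
\left[\log \sigma\!\left(
\beta \log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)}
-\beta \log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}
\right)\right]
\]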

45 of 53

PPO & GRPO & DPO

cf) Details

Why can the optimum be derived analytically? By introducing the KL term, the objective can be rewritten as a KL divergence between two distributions (the policy and a target distribution proportional to π_ref(y|x)·exp(r(x, y)/β)); since a KL divergence is bounded below by zero, the optimal policy follows analytically.

But Z(x) is hard to calculate, so we do not evaluate this optimal policy directly.
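The closed-form optimum of the KL-regularized objective and its partition function, per the DPO paper; the sum over all possible responses y is what makes Z(x) intractable:

\[
\pi^*(y\mid x) = \frac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y\mid x)\exp\!\Big(\tfrac{1}{\beta} r(x, y)\Big),
\qquad
Z(x) = \sum_{y}\pi_{\mathrm{ref}}(y\mid x)\exp\!\Big(\tfrac{1}{\beta} r(x, y)\Big)
\]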

46 of 53

PPO & GRPO & DPO

cf) Details

Because the target is the policy, we invert the closed-form optimum to express the reward in terms of the policy and plug it back into the (again familiar) Bradley–Terry premise.
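Rearranging the closed-form optimum expresses the reward through the policy (DPO paper); the β log Z(x) term cancels in the Bradley–Terry difference r(x, y_w) - r(x, y_l), which is why the intractable partition function drops out and the DPO loss above follows:

\[
r(x, y) = \beta \log\frac{\pi^*(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)} + \beta \log Z(x)
\]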

47 of 53

ORM & PRM (Math-Shepherd)

Reference: Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations

ORM: the whole solution (Q → s_1 → s_2 → s_3 → s_4 → answer) receives a single label, positive if the final answer is correct.

PRM (human annotated): each intermediate step s_i receives its own positive/negative label (in the example, two steps are labeled positive and two negative).

48 of 53

ORM & PRM (Math-Shepherd)

Reference: Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations

PRM (automated): step labels are estimated automatically by sampling several completions from each intermediate step and checking whether they reach the correct final answer, with no human annotation (see the sketch below).
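A minimal illustrative sketch of this automatic labeling, under the assumption that we have a sampler and an answer checker; `sample_completions` and `is_correct` are hypothetical helpers, not the Math-Shepherd code:

```python
def label_steps(question, steps, gold_answer, sample_completions, is_correct, n=8):
    """Soft label per step: the fraction of n sampled completions that start
    from the solution prefix ending at that step and reach the correct answer."""
    labels, prefix = [], []
    for step in steps:
        prefix.append(step)
        completions = sample_completions(question, list(prefix), n=n)
        hits = sum(is_correct(c, gold_answer) for c in completions)
        labels.append(hits / n)  # hard-label variant: 1 if hits > 0 else 0
    return labels
```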

49 of 53

Back to DeepSeek-R1-Zero

R1-Zero uses a rule-based reward model instead of a neural reward model (a minimal sketch follows below):

  1. Accuracy reward
  2. Format reward

Reasons for avoiding a neural reward model:

  1. Reward hacking
  2. Training resources
  3. Complexity of the training pipeline
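A minimal illustrative sketch of such a rule-based reward. The <think>/<answer> tags follow the R1-Zero template, but the exact matching rules and the simple sum of the two terms are assumptions for illustration, not the paper's code:

```python
import re

# Response is expected to look like "<think>...</think><answer>...</answer>".
TEMPLATE = re.compile(r"^<think>.+</think>\s*<answer>(.+)</answer>\s*$", re.DOTALL)

def format_reward(response: str) -> float:
    # 1.0 if the reasoning and answer are wrapped in the expected tags
    return 1.0 if TEMPLATE.match(response) else 0.0

def accuracy_reward(response: str, gold_answer: str) -> float:
    # 1.0 if the extracted final answer matches the gold answer exactly
    m = TEMPLATE.match(response)
    return 1.0 if m and m.group(1).strip() == gold_answer.strip() else 0.0

def total_reward(response: str, gold_answer: str) -> float:
    return accuracy_reward(response, gold_answer) + format_reward(response)
```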

50 of 53

Results

51 of 53

DeepSeek R1

R1

Differences from R1-Zero

Although DeepSeek-R1-Zero starts reinforcement learning directly from the base model, DeepSeek-R1 first collects a small amount of long-CoT cold-start data and fine-tunes the model on it.

Limitations of pure reinforcement learning

Low readability, language mixing

Approach

Human-annotated cold-start data (selected for good readability)

Add a reward for language consistency (a minimal sketch follows below)
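A minimal illustrative sketch of such a reward, computed here as the fraction of alphabetic characters in the CoT that belong to the target language's script; the paper describes the reward as the proportion of target-language words, so this character-level heuristic is an assumption for illustration:

```python
def language_consistency_reward(cot: str, target: str = "en") -> float:
    def in_target(ch: str) -> bool:
        if target == "en":
            return ch.isascii() and ch.isalpha()
        if target == "zh":
            return "\u4e00" <= ch <= "\u9fff"  # CJK unified ideographs
        return False
    letters = [ch for ch in cot if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(in_target(ch) for ch in letters) / len(letters)
```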

52 of 53

DeepSeek-R1

Reasoning-oriented reinforcement learning

Reinforcement learning for all scenarios (for generality)

Reasoning data: rule-based reward model (almost the same process as DeepSeek-R1-Zero)

General data: neural-network-based reward model (DeepSeek-V3 fine-tuned on curated data)

Rejection sampling and fine-tuning

Incorporate new data via rejection sampling and SFT (a minimal sketch follows below)
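A minimal illustrative sketch of rejection sampling for SFT data collection; `generate` and `accept` are hypothetical helpers (e.g. an LLM sampler and a correctness or reward-model filter), not DeepSeek's actual pipeline:

```python
def collect_sft_data(prompts, generate, accept, n_samples=4):
    sft_pairs = []
    for prompt in prompts:
        candidates = generate(prompt, n=n_samples)            # sample several responses
        kept = [c for c in candidates if accept(prompt, c)]   # reject the rest
        sft_pairs.extend((prompt, c) for c in kept)           # keep accepted pairs for SFT
    return sft_pairs
```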

53 of 53

Results