1 of 62

CSCI-SHU 376: Natural Language Processing

Hua Shen

2026-03-10

Spring 2026

Lecture 11: In-context Learning and RLHF

2 of 62

Today’s Plan

  • Prompting and In-context Learning
  • RLHF

3 of 62

What is Prompting?

  • Definition: Encouraging a pre-trained model to make predictions by providing a textual prompt that specifies the task to be done

4 of 62

Basic Prompting

  • Prepend a textual string to the sequence and have the model complete it

5 of 62

Standard prompting workflow

  • Fill a prompt template
  • Predict the answer
  • Post-process the answer
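The three steps above can be sketched as follows; the template text, the `predict` stand-in, and the label map are illustrative assumptions, not a real model API:

```python
# Minimal sketch of the prompting workflow (illustrative only;
# predict() stands in for a real language model).

def fill_template(template: str, input_text: str) -> str:
    # Step 1: fill the prompt template with the actual input
    return template.format(input=input_text)

def predict(prompt: str) -> str:
    # Step 2: stand-in for LM generation; a real model would
    # complete the prompt with free-form text
    return "fantastic"

def post_process(answer: str) -> str:
    # Step 3: map the free-form answer onto a class label
    label_map = {"great": "positive", "fantastic": "positive",
                 "terrible": "negative", "bad": "negative"}
    return label_map.get(answer.strip().lower(), "unknown")

template = "Review: {input}\nSentiment of this review:"
prompt = fill_template(template, "I love this movie.")
label = post_process(predict(prompt))
print(label)  # -> positive
```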

6 of 62

Prompt Templates

  • A template where you fill in with an actual input

7 of 62

Answer Prediction

  • Given a prompt, predict the answer

8 of 62

Post-processing

  • Select the actual output based on the answer
  • E.g., formatting the output for easy visualization

9 of 62

Post-processing

  • Given an answer, map it into a class label or continuous value

  • Often map many extracted words onto a single class

10 of 62

Few-shot Prompting

  • Provide a few examples of the task together with the instruction
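A few-shot prompt can be built by concatenating demonstrations with the instruction before the test input; the demonstrations and formatting below are illustrative assumptions:

```python
# Build a few-shot prompt: instruction, labeled demonstrations,
# then the unlabeled test input (format is an illustrative choice).

instruction = "Classify the sentiment of each review as positive or negative."

demonstrations = [
    ("I love this movie.", "positive"),
    ("The plot was a total mess.", "negative"),
]

def build_few_shot_prompt(test_input: str) -> str:
    lines = [instruction]
    for text, label in demonstrations:
        lines.append(f"Review: {text}\nSentiment: {label}")
    lines.append(f"Review: {test_input}\nSentiment:")  # model completes this
    return "\n\n".join(lines)

print(build_few_shot_prompt("Great acting and a moving story."))
```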

11 of 62

Empirical results on In-context Learning

  • Sometimes only giving the inputs works better

12 of 62

Empirical results on In-context Learning

  • Sometimes performance can decrease with too many examples

13 of 62

LMs are sensitive to small changes

14 of 62

Prompt Engineering: Design of Prompts

  • Manual
    • Configure a manual template based on the characteristics of the task
    • Configure prompts based on intuition about a task

  • Automated search: Find the (hopefully) optimal prompts


16 of 62

Prompt Engineering: Format

  • Make sure that the prompt format matches the one the model was trained on
  • This can have a large effect on model performance!

17 of 62

Prompt Engineering: Instruction

  • Instructions should be clear, concise, and easy to understand
  • See https://www.promptingguide.ai/introduction/tips

18 of 62

Chain-of-thought Prompting

  • Get the model to explain its reasoning before giving a final answer
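Two common ways to elicit this behavior are to include a worked-out rationale in a demonstration (few-shot CoT) or to append a trigger phrase such as "Let's think step by step" (zero-shot CoT); the prompt contents below are an illustrative sketch:

```python
# Two common chain-of-thought prompt styles (contents are illustrative).

question = "Roger has 5 balls. He buys 2 cans of 3 balls each. How many balls now?"

# Few-shot CoT: the demonstration includes intermediate reasoning.
few_shot_cot = (
    "Q: There are 3 cars and each car has 4 wheels. How many wheels?\n"
    "A: Each car has 4 wheels and there are 3 cars, so 3 * 4 = 12. "
    "The answer is 12.\n\n"
    f"Q: {question}\nA:"
)

# Zero-shot CoT: just append a reasoning trigger phrase.
zero_shot_cot = f"Q: {question}\nA: Let's think step by step."

print(zero_shot_cot)
```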

19 of 62

Today’s Plan

  • Prompting and In-context Learning
  • RLHF

20 of 62

Limitations of Instruction finetuning

  • It is expensive to collect ground-truth data for tasks
  • Some tasks like open-ended creative generation have no right answer
    • E.g., write a story about a lion
  • Language modelling penalizes all token-level mistakes equally, but some are worse than others
  • Can we try to satisfy human preferences?

21 of 62

Reinforcement Learning from Human Feedback (RLHF)

22 of 62

Reinforcement Learning from Human Feedback (RLHF)

  • Instruction tuning first
  • Then maximize reward

23 of 62

24 of 62

Recap: SFT

Training:

<prompter> How do you make tea?

<assistant> Remove the tea bag and enjoy.

Inference:

<prompter> What country won the most gold medals in 2008 Olympics?

<assistant>

  • SFT = next-token prediction on instruction–response pairs
  • Special tokens define speaker roles (prompter / assistant)
  • Serves as the “cold start” for RLHF
  • After SFT, we get a reference model (policy)

Question: Can we get a better/optimal policy?

25 of 62

What is an “Optimal Policy”?

A simple formula:

π* = argmax_π E_x~D, y~π(·|x) [ r(x, y) ]

Key question: how to get reward for each LLM generation?

26 of 62


27 of 62

Comparison Data


Challenge: What is a good measure of human preference?

Answer: Use pairwise comparison data — humans can more reliably judge which of two responses is better than assign absolute scores.

Data: (x, y_win, y_lose)

Reward model: R(x, y_win) > R(x, y_lose)

Question: Why is pairwise data a desirable form?

28 of 62

Long before LLM: ELO Rating

  • Elo rating is widely adopted in chess, online games (e.g., League of Legends), etc…

  • Elo is an online learning algorithm

  • In preference optimization, however, we often seek a global estimation of latent rewards from all pairwise data.

29 of 62

Bradley-Terry (BT) model

Preference distribution:

p(y_win ≻ y_lose | x) = σ( r(x, y_win) − r(x, y_lose) ), where σ is the sigmoid

Parametrize the reward as rφ and fit it by maximum likelihood.

Framing the problem as binary classification, we get the negative log-likelihood loss:

L(φ) = −E_(x, y_win, y_lose)~D [ log σ( rφ(x, y_win) − rφ(x, y_lose) ) ]

Tip: normalize the rewards for low variance
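The negative log-likelihood above can be sketched numerically; the scores in the batch are made-up reward-model outputs:

```python
import math

def bt_nll(r_win: float, r_lose: float) -> float:
    # Bradley-Terry negative log-likelihood for one comparison:
    # -log sigmoid(r_win - r_lose)
    return -math.log(1.0 / (1.0 + math.exp(-(r_win - r_lose))))

# Made-up reward-model scores for (x, y_win, y_lose) triples.
batch = [(2.0, 0.5), (1.0, 1.5), (3.0, -1.0)]
loss = sum(bt_nll(rw, rl) for rw, rl in batch) / len(batch)
print(round(loss, 4))

# The loss shrinks as the margin r_win - r_lose grows:
assert bt_nll(3.0, 0.0) < bt_nll(1.0, 0.0)
```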

30 of 62

RLHF

During the RL phase, the learned reward function is used to provide feedback to the language model. The optimization is formulated as:

max_πθ E_x~D, y~πθ(·|x) [ rφ(x, y) − β log( πθ(y|x) / πref(y|x) ) ]

where the KL penalty keeps πθ close to the reference (SFT) policy πref.

Due to the discrete nature of language generation, this objective is not differentiable and is typically optimized with reinforcement learning.

31 of 62


32 of 62

Preparation: RL concepts

A common RL trajectory:

S0, A0, R1, S1, A1, R2, S2, A2, R3…

Start from initial state S0

Take action A0

Get Reward R1

Go to state S1

33 of 62

What is Agent and Environment in RL?

Agent:

Input: State St

Output: Action At

Usually formalized as policy: π(a | s)

Environment:

Input: current State St, Action At

Output: next State St+1, Reward

Usually formalized as: p(s′, r | s, a)

34 of 62

Example: Blackjack/21 points

You are initially given one card. You may take more cards, one at a time, if you like, and may stop at any time before you lose.

  • jack = queen = king = 10; ace = 11; no jokers included
  • You get a return of 0 if you go over 21;
  • otherwise, the return is the sum of your cards.

Goal: get as close to 21 as possible without going over.
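This game can be simulated with a simple threshold policy; a minimal sketch, assuming cards are drawn uniformly with replacement from the values above:

```python
import random

# Card values: 2-10, then J/Q/K = 10, ace = 11 (drawn with replacement).
CARD_VALUES = [2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10, 11]

def play_episode(threshold: int = 16) -> int:
    """Draw cards until the sum reaches `threshold`, then stop.
    Returns the final return G: the card sum, or 0 on a bust."""
    x = random.choice(CARD_VALUES)  # initial card
    while x < threshold:            # policy: keep hitting below threshold
        x += random.choice(CARD_VALUES)
    return x if x <= 21 else 0      # episodic reward: only at the end

random.seed(0)
returns = [play_episode() for _ in range(10_000)]
print(sum(returns) / len(returns))  # Monte Carlo estimate of the value
```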

35 of 62

Blackjack: Reward, State, Action

Let X be sum of your card,

Reward: [0, 21]

0 for non-terminal state

X if X <= 21, or 0 if X > 21, in terminal state.

State: X

Action: {0, 1}

0 for “don’t take more”, 1

Stop if choose 0 or X exceed 21.

36 of 62

Blackjack: Possible trajectories

τ1: S0 = 7, A0 = 1, R1 = 0, S1 = 12, A1 = 1, R2 = 0, S2 = 20, A2 = 0, R3 = 20

τ2: S0 = 11, A0 = 1, R1 = 0, S1 = 15, A1 = 1, R2 = 0, S2 = 25, A2 = 0, R3 = 0

Finally, which one is better?

Return value G (sum of rewards):

G1 = 0 + 0 + 20 = 20 😃

G2 = 0 + 0 + 0 = 0 😭

Episodic: the (useful) reward only arrives at the end

p.s. Go, Chess, and LLM generation are usually episodic

37 of 62

Modeling Reward: Policy, Value, and Action-Value

Policy: π(a | s)

  • Given I got 20 points, I should “stop”
  • I want to take action according to this state

Value: vπ(s) = Eπ[ Gt | St = s ]

  • = Eπ[ Rt+1 + γRt+2 + … | St = s ]; for episodic tasks, R: 0, 0, 0, …, x
  • I want to know how good/bad my state is

Action Value: qπ(s, a) := Eπ[ Gt | St = s, At = a ]

  • I want to know how good it is if I take this action at this state

38 of 62

Policy Gradient/REINFORCE algorithm

RL: Policy: π(a | s)

DRL: Policy Network

  • Can be just an MLP
  • Input: a state vector
  • Output: logits over all actions

After softmax, these become “probabilities”

39 of 62

Policy Gradient/REINFORCE algorithm

Recall that RL trajectory:

S0, A0, R1, S1, A1, R2, S2, A2, R3…

Now sample At ~ πθ(a | St), with a ranging over the action space

We can then do gradient ascent on the expected return J(θ) = Eπθ[ G ]:

40 of 62

Policy Gradient/REINFORCE algorithm

By the Policy Gradient Theorem, the gradient can be written as:

∇θ J(θ) = Eπθ[ Gt ∇θ log πθ(At | St) ]

Thus, standard deep learning training (gradient ascent on J) can be carried out
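A minimal sketch of this update for a softmax policy over two actions; since ∇θ log πθ(a) = onehot(a) − π for a softmax, the gradient can be computed by hand without an autograd library. The one-step environment (action 1 yields return 1, action 0 yields 0) is a made-up toy task:

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def environment(action: int) -> float:
    # Made-up one-step episode: action 1 gives return G = 1, action 0 gives 0.
    return float(action == 1)

random.seed(0)
theta = [0.0, 0.0]  # "policy network" reduced to two logits
lr = 0.1
for _ in range(500):
    probs = softmax(theta)
    a = random.choices([0, 1], weights=probs)[0]  # sample A_t ~ pi_theta
    g = environment(a)                            # observe return G
    # REINFORCE update: theta += lr * G * grad log pi(a);
    # for softmax, grad log pi(a) = onehot(a) - probs
    for i in range(2):
        theta[i] += lr * g * ((1.0 if i == a else 0.0) - probs[i])

print(softmax(theta))  # probability of action 1 should be near 1
```

With returns of 0 for action 0, only action-1 samples produce updates, so the policy drifts toward the rewarded action, as gradient ascent on the expected return predicts.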

41 of 62

Policy Gradient/REINFORCE algorithm

42 of 62

Simplified LLM

current state:

Sentence input

LLM

(include tokenizer)

Logit output

(Over vocab)

Probability output

(Over vocab)

43 of 62

RL formalization of LLM problem

Episodic RL problem:

- initial state S0 = prompt

- action space A = vocabulary

- action At = token in vocabulary

- state St = prompt + all response tokens generated so far

- policy πθ(a | St) = LLM output prob

- episode ends when <|endoftext|> token is sampled

The reward model assigns rφ(x, y)

Try to maximize E_x~D, y~πθ [ rφ(x, y) ]
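Under this formalization, the REINFORCE estimator applies at the sequence level: weight the summed token log-probabilities by the reward. A minimal sketch, with made-up token probabilities and a made-up reward value:

```python
import math

# Made-up per-token probabilities assigned by pi_theta to a sampled
# response y = (y_1, ..., y_T), and a made-up scalar reward r_phi(x, y).
token_probs = [0.9, 0.4, 0.7, 0.95]
reward = 1.5

# log pi_theta(y | x) = sum_t log pi_theta(y_t | x, y_<t)
log_prob_y = sum(math.log(p) for p in token_probs)

# REINFORCE surrogate objective for one sample: r_phi(x, y) * log pi_theta(y|x).
# Ascending its gradient pushes up the probability of high-reward responses.
surrogate = reward * log_prob_y
print(round(surrogate, 4))
```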


44 of 62

Reinforcement Learning (RL) Summary

  • The field of reinforcement learning has studied these problems for many years
  • Circa 2013: resurgence of interest in combining RL with deep learning for game playing
  • New area: Applying RL to modern LMs

45 of 62

Reinforcement Learning (RL)

 

46 of 62

RL: Language Generation

  • State: a prompt and the tokens generated so far
  • Action: generate a token
  • Policy: generator (e.g., Language Model)
  • Environment: append token
  • Reward: how good is the current token?

-> per-token rewards are decomposed from the reward on the full sequence

47 of 62

RL: One step Language Generation

  • State: a prompt
  • Action: generate a full response
  • Policy: generator (e.g., Language Model)
  • Reward: reward on the full sequence

48 of 62

Reinforcement Learning (RL)

  • The task criteria / preferences are directly optimized via the reward
  • Data is generated by the model, and a reward tells us how to use data for training
  • Model generations are in the training loop

49 of 62

How do we get rewards?

  • Human-in-the-loop is expensive!
  • Instead of directly asking humans for preferences, model their preferences as a separate NLP problem

    • Rule-based rewards

    • Model-based rewards

50 of 62

Rule-based rewards

  • A verifiable property of the output
  • Example: reward for solving a math problem

51 of 62

Rule-based rewards

  • A verifiable property of the output
  • Example: write a program that passes test cases
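A reward of this kind can be computed by running the candidate program against test cases; a minimal sketch, where the candidate function and test cases are illustrative stand-ins for model-generated code:

```python
# Rule-based reward: 1.0 if the candidate program passes all test
# cases, else 0.0 (candidate and tests are illustrative).

def candidate_add(a, b):        # stands in for model-generated code
    return a + b

test_cases = [((1, 2), 3), ((0, 0), 0), ((-1, 5), 4)]

def rule_based_reward(fn, tests) -> float:
    try:
        return 1.0 if all(fn(*args) == expected for args, expected in tests) else 0.0
    except Exception:
        return 0.0              # crashing programs get zero reward

print(rule_based_reward(candidate_add, test_cases))  # -> 1.0
```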

52 of 62

Model-based rewards

  • Model R(x, y) that scores output sequences
  • Example: classify whether the output is “helpful”

53 of 62

Model-based rewards: Preference

  • Human judgments are noisy and miscalibrated!
  • Instead of directly asking for ratings, ask for pairwise comparisons, which are more reliable


55 of 62

Evaluating reward models

56 of 62

Optimizing for human preferences

 

57 of 62

Optimizing for human preferences

  • How do we actually change our LM parameters to maximize this?

  • Policy gradient methods in RL give us tools for estimating and optimizing this objective

58 of 62

RLHF provides additional gains

59 of 62

RLHF Summary

 

60 of 62

InstructGPT: scaling up RLHF to many tasks

61 of 62

InstructGPT: scaling up RLHF to many tasks

  • Labeler-collected tasks

62 of 62

InstructGPT: scaling up RLHF to many tasks