1 of 62

CSCI-SHU 376: Natural Language Processing

Hua Shen

2026-03-10

Spring 2026

Lecture 11: In-context Learning and RLHF

2 of 62

Today’s Plan

  • Prompting and In-context Learning
  • RLHF

3 of 62

What is Prompting?

  • Definition: Encouraging a pre-trained model to make predictions by providing a textual prompt that specifies the task to be done

4 of 62

Basic Prompting

  • Prepend a textual string to the sequence and have the model complete it

5 of 62

Standard prompting workflow

  • Fill a prompt template
  • Predict the answer
  • Post-process the answer
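The three steps above can be sketched as follows; the template text, the `predict` stand-in, and the label map are illustrative assumptions, not a real model API:

```python
# Minimal sketch of the prompting workflow (illustrative only;
# predict() stands in for a real language model).

def fill_template(template: str, input_text: str) -> str:
    # Step 1: fill the prompt template with the actual input
    return template.format(input=input_text)

def predict(prompt: str) -> str:
    # Step 2: stand-in for LM generation; a real model would
    # complete the prompt with free-form text
    return "fantastic"

def post_process(answer: str) -> str:
    # Step 3: map the free-form answer onto a class label
    label_map = {"great": "positive", "fantastic": "positive",
                 "terrible": "negative", "bad": "negative"}
    return label_map.get(answer.strip().lower(), "unknown")

template = "Review: {input}\nSentiment of this review:"
prompt = fill_template(template, "I love this movie.")
label = post_process(predict(prompt))
print(label)  # -> positive
```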

6 of 62

Prompt Templates

  • A template where you fill in with an actual input

7 of 62

Answer Prediction

  • Given a prompt, predict the answer

8 of 62

Post-processing

  • Select the actual output based on the answer
  • E.g., formatting the output for easy visualization

9 of 62

Post-processing

  • Given an answer, map it into a class label or continuous value

  • Often map many extracted words onto a single class

10 of 62

Few-shot Prompting

  • Provide a few examples of the task together with the instruction
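A few-shot prompt can be built by concatenating demonstrations with the instruction before the test input; the demonstrations and formatting below are illustrative assumptions:

```python
# Build a few-shot prompt: instruction, labeled demonstrations,
# then the unlabeled test input (format is an illustrative choice).

instruction = "Classify the sentiment of each review as positive or negative."

demonstrations = [
    ("I love this movie.", "positive"),
    ("The plot was a total mess.", "negative"),
]

def build_few_shot_prompt(test_input: str) -> str:
    lines = [instruction]
    for text, label in demonstrations:
        lines.append(f"Review: {text}\nSentiment: {label}")
    lines.append(f"Review: {test_input}\nSentiment:")  # model completes this
    return "\n\n".join(lines)

print(build_few_shot_prompt("Great acting and a moving story."))
```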

11 of 62

Empirical results on In-context Learning

  • Sometimes only giving the inputs works better

12 of 62

Empirical results on In-context Learning

  • Sometimes performance can decrease with too many examples

13 of 62

LMs are sensitive to small changes

14 of 62

Prompt Engineering: Design of Prompts

  • Manual
    • Configure a manual template based on the characteristics of the task
    • Configure prompts based on intuition about a task

  • Automated search: Find the (hopefully) optimal prompts


16 of 62

Prompt Engineering: Format

  • Make sure that the prompt format matches the one the model was trained on
  • This can have a large effect on model performance!

17 of 62

Prompt Engineering: Instruction

  • Instructions should be clear, concise, and easy to understand
  • See https://www.promptingguide.ai/introduction/tips

18 of 62

Chain-of-thought Prompting

  • Get the model to explain its reasoning before giving a final answer
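Two common ways to elicit this behavior are to include a worked-out rationale in a demonstration (few-shot CoT) or to append a trigger phrase such as "Let's think step by step" (zero-shot CoT); the prompt contents below are an illustrative sketch:

```python
# Two common chain-of-thought prompt styles (contents are illustrative).

question = "Roger has 5 balls. He buys 2 cans of 3 balls each. How many balls now?"

# Few-shot CoT: the demonstration includes intermediate reasoning.
few_shot_cot = (
    "Q: There are 3 cars and each car has 4 wheels. How many wheels?\n"
    "A: Each car has 4 wheels and there are 3 cars, so 3 * 4 = 12. "
    "The answer is 12.\n\n"
    f"Q: {question}\nA:"
)

# Zero-shot CoT: just append a reasoning trigger phrase.
zero_shot_cot = f"Q: {question}\nA: Let's think step by step."

print(zero_shot_cot)
```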

19 of 62

Today’s Plan

  • Prompting and In-context Learning
  • RLHF

20 of 62

Limitations of Instruction finetuning

  • It is expensive to collect ground-truth data for tasks
  • Some tasks like open-ended creative generation have no right answer
    • E.g., write a story about a lion
  • Language modelling penalizes all token-level mistakes equally, but some are worse than others
  • Can we try to satisfy human preferences?

21 of 62

Reinforcement Learning from Human Feedback (RLHF)

22 of 62

Reinforcement Learning from Human Feedback (RLHF)

  • Instruction tuning first
  • Then maximize reward

23 of 62

24 of 62

Recap: SFT

Training:

<prompter> How do you make tea?

<assistant> Remove the tea bag and enjoy.

Inference:

<prompter> What country won the most gold medals in 2008 Olympics?

<assistant>

  • SFT = next-token prediction on instruction–response pairs
  • Special tokens define speaker roles (prompter / assistant)
  • Serves as the “cold start” for RLHF
  • After SFT, we get a reference model (policy)

Question: Can we get a better/optimal policy?

25 of 62

What is an “Optimal Policy”?

A simple formula:

π* = argmax_π E_x~D, y~π(·|x) [ r(x, y) ]

Key question: how to get reward for each LLM generation?

26 of 62


27 of 62

Comparison Data


Challenge: What is a good measure of human preference?

Answer: Use pairwise comparison data — humans can more reliably judge which of two responses is better than assign absolute scores.

Data: (x, y_win, y_lose)

Reward model: R(x, y_win) > R(x, y_lose)

Question: Why is pairwise data a desirable form?

28 of 62

Long before LLM: ELO Rating

  • Elo rating is widely adopted in chess, online games (e.g., League of Legends), etc…

  • Elo is an online learning algorithm

  • In preference optimization, however, we often seek a global estimation of latent rewards from all pairwise data.

29 of 62

Bradley-Terry (BT) model

Preference distribution:

p(y_win ≻ y_lose | x) = σ( r(x, y_win) − r(x, y_lose) ), where σ is the sigmoid

Parametrize the reward as rφ and fit it by maximum likelihood.

Framing the problem as binary classification, we get the negative log-likelihood loss:

L(φ) = −E_(x, y_win, y_lose)~D [ log σ( rφ(x, y_win) − rφ(x, y_lose) ) ]

Tip: normalize the rewards for low variance
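The negative log-likelihood above can be sketched numerically; the scores in the batch are made-up reward-model outputs:

```python
import math

def bt_nll(r_win: float, r_lose: float) -> float:
    # Bradley-Terry negative log-likelihood for one comparison:
    # -log sigmoid(r_win - r_lose)
    return -math.log(1.0 / (1.0 + math.exp(-(r_win - r_lose))))

# Made-up reward-model scores for (x, y_win, y_lose) triples.
batch = [(2.0, 0.5), (1.0, 1.5), (3.0, -1.0)]
loss = sum(bt_nll(rw, rl) for rw, rl in batch) / len(batch)
print(round(loss, 4))

# The loss shrinks as the margin r_win - r_lose grows:
assert bt_nll(3.0, 0.0) < bt_nll(1.0, 0.0)
```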

30 of 62

RLHF

During the RL phase, the learned reward function is used to provide feedback to the language model. The optimization is formulated as:

max_πθ E_x~D, y~πθ(·|x) [ rφ(x, y) − β log( πθ(y|x) / πref(y|x) ) ]

where the KL penalty keeps πθ close to the reference (SFT) policy πref.

Due to the discrete nature of language generation, this objective is not differentiable and is typically optimized with reinforcement learning.

31 of 62


32 of 62

Preparation: RL concepts

A common RL trajectory:

S0, A0, R1, S1, A1, R2, S2, A2, R3…

Start from initial state S0

Take action A0

Get Reward R1

Go to state S1

33 of 62

What is Agent and Environment in RL?

Agent:

Input: State St

Output: Action At

Usually formalized as policy: π(a | s)

Environment:

Input: current State St, Action At

Output: next State St+1, Reward

Usually formalized as: p(s′, r | s, a)

34 of 62

Example: Blackjack/21 points

You are initially given one card. You may take more cards, one at a time, if you like, and may stop at any time before you lose.

  • jack = queen = king = 10; ace = 11; no jokers included
  • You get a return of 0 if you go over 21;
  • otherwise, the return is the sum of your cards.

Goal: get as close to 21 as possible without going over.
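This game can be simulated with a simple threshold policy; a minimal sketch, assuming cards are drawn uniformly with replacement from the values above:

```python
import random

# Card values: 2-10, then J/Q/K = 10, ace = 11 (drawn with replacement).
CARD_VALUES = [2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10, 11]

def play_episode(threshold: int = 16) -> int:
    """Draw cards until the sum reaches `threshold`, then stop.
    Returns the final return G: the card sum, or 0 on a bust."""
    x = random.choice(CARD_VALUES)  # initial card
    while x < threshold:            # policy: keep hitting below threshold
        x += random.choice(CARD_VALUES)
    return x if x <= 21 else 0      # episodic reward: only at the end

random.seed(0)
returns = [play_episode() for _ in range(10_000)]
print(sum(returns) / len(returns))  # Monte Carlo estimate of the value
```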

35 of 62

Blackjack: Reward, State, Action

Let X be sum of your card,

Reward: [0, 21]

0 for non-terminal state

X if X <= 21, or 0 if X > 21, in terminal state.

State: X

Action: {0, 1}

0 for “don’t take more”, 1

Stop if choose 0 or X exceed 21.

36 of 62

Blackjack: Possible trajectories

τ1: S0 = 7, A0 = 1, R1 = 0, S1 = 12, A1 = 1, R2 = 0, S2 = 20, A2 = 0, R3 = 20

τ2: S0 = 11, A0 = 1, R1 = 0, S1 = 15, A1 = 1, R2 = 0, S2 = 25, A2 = 0, R3 = 0

Finally, which one is better?

Return value G (sum of rewards):

G1 = 0 + 0 + 20 = 20 😃

G2 = 0 + 0 + 0 = 0 😭

Episodic: the (useful) reward only arrives at the end

p.s. Go, Chess, and LLM generation are usually episodic

37 of 62

Modeling Reward: Policy, Value, and Action-Value

Policy: π(a | s)

  • Given I got 20 points, I should “stop”
  • I want to take action according to this state

Value: vπ(s) = Eπ[ Gt | St = s ]

  • = Eπ[ Rt+1 + γRt+2 + … | St = s ]; for episodic tasks, R: 0, 0, 0, …, x
  • I want to know how good/bad my state is

Action Value: qπ(s, a) := Eπ[ Gt | St = s, At = a ]

  • I want to know how good it is if I take this action at this state

38 of 62

Policy Gradient/REINFORCE algorithm

RL: Policy: π(a | s)

DRL: Policy Network

  • Can be just an MLP
  • Input: a state vector
  • Output: logits over all actions

After softmax, these become “probabilities”

39 of 62

Policy Gradient/REINFORCE algorithm

Recall that RL trajectory:

S0, A0, R1, S1, A1, R2, S2, A2, R3…

Now sample At ~ πθ(a | St), with a ranging over the action space

We can then do gradient ascent on the expected return J(θ) = Eπθ[ G ]:

40 of 62

Policy Gradient/REINFORCE algorithm

By the Policy Gradient Theorem, the gradient can be written as:

∇θ J(θ) = Eπθ[ Gt ∇θ log πθ(At | St) ]

Thus, standard deep learning training (gradient ascent on J) can be carried out
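A minimal sketch of this update for a softmax policy over two actions; since ∇θ log πθ(a) = onehot(a) − π for a softmax, the gradient can be computed by hand without an autograd library. The one-step environment (action 1 yields return 1, action 0 yields 0) is a made-up toy task:

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def environment(action: int) -> float:
    # Made-up one-step episode: action 1 gives return G = 1, action 0 gives 0.
    return float(action == 1)

random.seed(0)
theta = [0.0, 0.0]  # "policy network" reduced to two logits
lr = 0.1
for _ in range(500):
    probs = softmax(theta)
    a = random.choices([0, 1], weights=probs)[0]  # sample A_t ~ pi_theta
    g = environment(a)                            # observe return G
    # REINFORCE update: theta += lr * G * grad log pi(a);
    # for softmax, grad log pi(a) = onehot(a) - probs
    for i in range(2):
        theta[i] += lr * g * ((1.0 if i == a else 0.0) - probs[i])

print(softmax(theta))  # probability of action 1 should be near 1
```

With returns of 0 for action 0, only action-1 samples produce updates, so the policy drifts toward the rewarded action, as gradient ascent on the expected return predicts.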

41 of 62

Policy Gradient/REINFORCE algorithm

42 of 62

Simplified LLM

current state:

Sentence input

LLM

(include tokenizer)

Logit output

(Over vocab)

Probability output

(Over vocab)

43 of 62

RL formalization of LLM problem

Episodic RL problem:

- initial state S0 = prompt

- action space A = vocabulary

- action At = token in vocabulary

- state St = prompt + all response tokens generated so far

- policy πθ(a | St) = LLM output prob

- episode ends when <|endoftext|> token is sampled

The reward model assigns rφ(x, y)

Try to maximize E_x~D, y~πθ [ rφ(x, y) ]
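Under this formalization, the REINFORCE estimator applies at the sequence level: weight the summed token log-probabilities by the reward. A minimal sketch, with made-up token probabilities and a made-up reward value:

```python
import math

# Made-up per-token probabilities assigned by pi_theta to a sampled
# response y = (y_1, ..., y_T), and a made-up scalar reward r_phi(x, y).
token_probs = [0.9, 0.4, 0.7, 0.95]
reward = 1.5

# log pi_theta(y | x) = sum_t log pi_theta(y_t | x, y_<t)
log_prob_y = sum(math.log(p) for p in token_probs)

# REINFORCE surrogate objective for one sample: r_phi(x, y) * log pi_theta(y|x).
# Ascending its gradient pushes up the probability of high-reward responses.
surrogate = reward * log_prob_y
print(round(surrogate, 4))
```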


44 of 62

Reinforcement Learning (RL) Summary

  • The field of reinforcement learning has studied these problems for many years
  • Circa 2013: resurgence of interest in combining RL with deep learning for game playing
  • New area: Applying RL to modern LMs

45 of 62

Reinforcement Learning (RL)

 

46 of 62

RL: Language Generation

  • State: a prompt and the tokens generated so far
  • Action: generate a token
  • Policy: generator (e.g., Language Model)
  • Environment: append token
  • Reward: how good is the current token?

-> per-token rewards are decomposed from the reward on the full sequence

47 of 62

RL: One step Language Generation

  • State: a prompt
  • Action: generate a full response
  • Policy: generator (e.g., Language Model)
  • Reward: reward on the full sequence

48 of 62

Reinforcement Learning (RL)

  • The task criteria / preferences are directly optimized via the reward
  • Data is generated by the model, and a reward tells us how to use data for training
  • Model generations are in the training loop

49 of 62

How do we get rewards?

  • Human-in-the-loop is expensive!
  • Instead of directly asking humans for preferences, model their preferences as a separate NLP problem

    • Rule-based rewards

    • Model-based rewards

50 of 62

Rule-based rewards

  • A verifiable property of the output
  • Example: reward for solving a math problem

51 of 62

Rule-based rewards

  • A verifiable property of the output
  • Example: write a program that passes test cases
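A reward of this kind can be computed by running the candidate program against test cases; a minimal sketch, where the candidate function and test cases are illustrative stand-ins for model-generated code:

```python
# Rule-based reward: 1.0 if the candidate program passes all test
# cases, else 0.0 (candidate and tests are illustrative).

def candidate_add(a, b):        # stands in for model-generated code
    return a + b

test_cases = [((1, 2), 3), ((0, 0), 0), ((-1, 5), 4)]

def rule_based_reward(fn, tests) -> float:
    try:
        return 1.0 if all(fn(*args) == expected for args, expected in tests) else 0.0
    except Exception:
        return 0.0              # crashing programs get zero reward

print(rule_based_reward(candidate_add, test_cases))  # -> 1.0
```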

52 of 62

Model-based rewards

  • Model R(x, y) that scores output sequences
  • Example: classify whether the output is “helpful”

53 of 62

Model-based rewards: Preference

  • Human judgments are noisy and miscalibrated!
  • Instead of directly asking for ratings, ask for pairwise comparisons, which are more reliable


55 of 62

Evaluating reward models

56 of 62

Optimizing for human preferences

 

57 of 62

Optimizing for human preferences

  • How do we actually change our LM parameters to maximize this?

  • Policy gradient methods in RL give us tools for estimating and optimizing this objective

58 of 62

RLHF provides additional gains

59 of 62

RLHF Summary

 

60 of 62

InstructGPT: scaling up RLHF to many tasks

61 of 62

InstructGPT: scaling up RLHF to many tasks

  • Labeler-collected tasks

62 of 62

InstructGPT: scaling up RLHF to many tasks