CSCI-SHU 376: Natural Language Processing
Hua Shen
2026-03-10
Spring 2026
Lecture 11: In-context Learning and RLHF
Today’s Plan
What is Prompting?
Basic Prompting
Standard prompting workflow
Prompt Templates
Answer Prediction
Post-processing
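The workflow above (prompt template → answer prediction → post-processing) can be sketched end to end. In this sketch, `toy_lm` is a stand-in for a real LM call, and the template and extraction regex are illustrative assumptions:

```python
import re

def fill_template(template: str, **slots) -> str:
    """Step 1: instantiate a prompt template with the input."""
    return template.format(**slots)

def toy_lm(prompt: str) -> str:
    """Stand-in for a real LM call (assumption for illustration)."""
    return "The answer is 42."

def post_process(raw: str) -> str:
    """Step 3: extract the final answer from the raw generation."""
    match = re.search(r"answer is\s+(\S+?)\.?$", raw)
    return match.group(1) if match else raw.strip()

template = "Q: {question}\nA:"
prompt = fill_template(template, question="What is 6 x 7?")
answer = post_process(toy_lm(prompt))
print(answer)  # -> 42
```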
Few-shot Prompting
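A few-shot prompt simply prepends labeled demonstrations before the test input; a minimal sketch (the exemplar format and the sentiment examples are assumptions for illustration):

```python
def build_few_shot_prompt(examples, query):
    """Concatenate (input, output) demonstrations, then the query."""
    demos = "\n".join(f"Input: {x}\nOutput: {y}" for x, y in examples)
    return f"{demos}\nInput: {query}\nOutput:"

examples = [("great movie!", "positive"), ("boring plot.", "negative")]
print(build_few_shot_prompt(examples, "loved every minute"))
```

The model is then expected to continue the pattern and emit the label after the final `Output:`.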
Empirical results on In-context Learning
LMs are sensitive to small changes
Prompt Engineering: Design of Prompts
Prompt Engineering: Format
Prompt Engineering: Instruction
Chain-of-thought Prompting
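Zero-shot chain-of-thought prompting can be as simple as appending a reasoning cue to the question; a minimal sketch following the common "Let's think step by step" recipe:

```python
def cot_prompt(question: str) -> str:
    """Zero-shot chain-of-thought: append a reasoning cue so the model
    generates intermediate steps before the final answer."""
    return f"Q: {question}\nA: Let's think step by step."

print(cot_prompt("If I have 3 apples and buy 2 more, how many do I have?"))
```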
Today’s Plan
Limitations of Instruction finetuning
Reinforcement Learning from Human Feedback (RLHF)
Recap: Supervised Finetuning (SFT)
Training:
<prompter> How do you make tea?
<assistant> Remove the tea bag and enjoy.
Inference:
<prompter> What country won the most gold medals in 2008 Olympics?
<assistant>
Question: Can we get a better/optimal policy?
What is an “Optimal Policy”?
A simple formula: π* = argmax_π E_(x~D, y~π(·|x)) [ R(x, y) ]
Key question: how to get reward for each LLM generation?
Comparison Data
Challenge: What is a good measure of human preference?
Answer: Use pairwise comparison data — humans can more reliably judge which of two responses is better than assign absolute scores.
Data: (x, y_win, y_lose)
Reward model: R(x, y_win) > R(x, y_lose)
Question: Why is pairwise data a desired form?
Long before LLMs: Elo rating
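Elo updates each rating after a pairwise comparison, moving both ratings toward the observed outcome; a minimal sketch of the standard update rule (K = 32 is a conventional, not mandated, choice):

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that player A beats player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """Move each rating toward the observed outcome of one comparison."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + k * (s_a - e_a), r_b + k * ((1 - s_a) - (1 - e_a))

print(elo_update(1500, 1500, a_won=True))  # -> (1516.0, 1484.0)
```

Note the update is zero-sum: what the winner gains, the loser drops.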
Bradley-Terry (BT) model
Preference distribution (Bradley-Terry):
P(y_win ≻ y_lose | x) = exp(r(x, y_win)) / [exp(r(x, y_win)) + exp(r(x, y_lose))] = σ(r(x, y_win) − r(x, y_lose))
Parametrize the reward as rφ(x, y) and fit φ by maximum likelihood.
Framing the problem as binary classification, we obtain the negative log-likelihood loss:
L(φ) = −E_(x, y_win, y_lose) [ log σ(rφ(x, y_win) − rφ(x, y_lose)) ]
Tips: normalize the rewards for low variance
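The pairwise loss above can be computed directly from two scalar reward scores; a minimal pure-Python sketch of the Bradley-Terry negative log-likelihood for one preference pair:

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def bt_loss(r_win: float, r_lose: float) -> float:
    """-log sigma(R(x, y_win) - R(x, y_lose)): the Bradley-Terry
    negative log-likelihood for one preference pair."""
    return -math.log(sigmoid(r_win - r_lose))

# The loss shrinks as the reward margin grows in the right direction.
print(bt_loss(2.0, 0.0))  # small loss: winner already ranked higher
print(bt_loss(0.0, 2.0))  # large loss: the pair is ranked backwards
```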
RLHF
During the RL phase, the learned reward function is used to provide feedback to the language model. The optimization is typically formulated with a KL penalty toward the reference (SFT) model:
max_θ E_(x~D, y~πθ(·|x)) [ rφ(x, y) ] − β · D_KL( πθ(y|x) ‖ π_ref(y|x) )
Due to the discrete nature of language generation, this objective is not differentiable in θ and is typically optimized with reinforcement learning (e.g., PPO).
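A common concrete form of this feedback is the reward-model score minus a per-sequence KL penalty toward the reference model; a sketch of the effective reward (β and the log-probability values here are illustrative assumptions):

```python
def rlhf_reward(r_phi: float, logp_policy: float, logp_ref: float,
                beta: float = 0.1) -> float:
    """Effective reward rφ(x, y) - β (log πθ(y|x) - log π_ref(y|x)).
    The penalty discourages drifting far from the reference model."""
    return r_phi - beta * (logp_policy - logp_ref)

# Same reward-model score, but drifting from the reference is penalized.
print(rlhf_reward(1.0, logp_policy=-5.0, logp_ref=-5.0))  # -> 1.0
print(rlhf_reward(1.0, logp_policy=-2.0, logp_ref=-5.0))  # ~0.7
```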
Preparation: RL concepts
A common RL trajectory:
S0, A0, R1, S1, A1, R2, S2, A2, R3…
Start from initial state S0
Take action A0
Get Reward R1
Go to state S1
What is Agent and Environment in RL?
Agent:
Input: State St
Output: Action At
Usually formalized as policy: π(a | s)
Environment:
Input: current State St, Action At
Output: next State St+1, Reward
Usually formalized as: p(s’,r | s,a)
Example: Blackjack/21 points
You are initially given one card. You may take more cards, one at a time, if you like, and you may stop at any time before you lose.
Goal: close to 21 without going over 21.
Blackjack: Reward, State, Action
Let X be sum of your card,
Reward: in [0, 21]
0 in every non-terminal state
In the terminal state: X if X <= 21, or 0 if X > 21
State: X
Action: {0, 1}
0 for “don’t take more cards”, 1 for “take one more card”
The episode ends if you choose 0 or X exceeds 21.
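The rules above can be written as a tiny environment step function; a sketch, assuming for simplicity that each draw is uniform over 1–10:

```python
import random

def step(x: int, action: int, draw=lambda: random.randint(1, 10)):
    """One environment step for the simplified blackjack.
    State is X (the card sum); returns (next_state, reward, done)."""
    if action == 0:                 # stop: terminal, reward X if X <= 21
        return x, (x if x <= 21 else 0), True
    x_next = x + draw()             # take one more card
    if x_next > 21:                 # bust: terminal, reward 0
        return x_next, 0, True
    return x_next, 0, False        # non-terminal: reward 0

state, reward, done = step(20, 0)
print(state, reward, done)  # -> 20 20 True
```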
Blackjack: Possible trajectories
τ1 : S0 = 7, A0 = 1, R1 = 0, S1 = 12, A1 =1, R2 = 0, S2 = 20, A2 = 0, R3 = 20
τ2: S0 = 11, A0 = 1, R1 = 0, S1 = 15, A1 =1, R2 = 0, S2 = 25, A2 = 0, R3 = 0
Finally, which one is better?
Return value G = R1 + R2 + R3 + … (the sum of rewards along the trajectory)
G1 = 0 + 0 + 20 = 20 😃
G2 = 0 + 0 + 0 = 0 😭
Episodic: only gives (useful) reward at end
p.s. Go, Chess, and LLM generation are usually episodic
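The returns above are plain (undiscounted) sums of rewards; a sketch reproducing G1 and G2 from the two trajectories:

```python
def episode_return(rewards):
    """Undiscounted return: the sum of all rewards in an episode."""
    return sum(rewards)

g1 = episode_return([0, 0, 20])  # rewards along trajectory tau1
g2 = episode_return([0, 0, 0])   # rewards along trajectory tau2
print(g1, g2)  # -> 20 0
```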
Modeling Reward: Policy, Value, and Action-Value
Policy: π(a | s)
Value: vπ(s) = Eπ[ Gt | St = s ]
Action Value: qπ(s,a) := Eπ[ Gt| St = s, At = a]
Policy Gradient/REINFORCE algorithm
RL: Policy: π(a | s)
DRL: Policy Network
After a softmax, the network output becomes a probability distribution over actions
Policy Gradient/REINFORCE algorithm
Recall that RL trajectory:
S0, A0, R1, S1, A1, R2, S2, A2, R3…
Now At is sampled from πθ(a | St), over all a in the action space (REINFORCE samples actions rather than taking the argmax, so the policy can explore)
We can then do gradient ascent to raise the expected return E[G]:
Policy Gradient/REINFORCE algorithm
By the Policy Gradient Theorem, the gradient of the expected return can be written as
∇θ J(θ) = Eπ[ Gt · ∇θ log πθ(At | St) ]
Thus, training can be carried out with standard deep-learning tooling: sample trajectories, compute this gradient estimate, and take gradient-ascent steps.
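The REINFORCE loop can be demonstrated on a toy two-armed bandit (the arm rewards, learning rate, and step count are illustrative assumptions). Each update is θ ← θ + α · G · ∇θ log πθ(A|S), using the softmax log-gradient ∂ log π(A)/∂θ_i = 1[i = A] − π(i):

```python
import math, random

def softmax(logits):
    m = max(logits)                      # subtract max for stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

random.seed(0)
theta = [0.0, 0.0]                       # policy logits, one per arm
true_reward = [0.2, 1.0]                 # arm 1 is better (assumed values)
lr = 0.5

for _ in range(500):
    probs = softmax(theta)
    a = random.choices([0, 1], weights=probs)[0]  # sample action from pi
    g = true_reward[a]                            # one-step return G
    for i in range(2):                            # grad of log pi(a)
        grad = (1.0 if i == a else 0.0) - probs[i]
        theta[i] += lr * g * grad                 # gradient ascent on E[G]

print(softmax(theta))  # probability mass shifts toward the better arm
```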
Policy Gradient/REINFORCE algorithm
Simplified LLM
Current state: sentence input → LLM (includes the tokenizer) → logit output (over the vocab) → softmax → probability output (over the vocab)
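The final step of this pipeline is a softmax turning vocab logits into a probability distribution; a sketch with a toy 4-entry vocab (the logit values are illustrative):

```python
import math

def softmax(logits):
    """Turn logits over the vocab into a probability distribution."""
    m = max(logits)                      # subtract max for stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 1.0, 0.5, -1.0]           # one logit per vocab token
probs = softmax(logits)
print(probs)                             # sums to 1 up to float rounding
```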
RL formalization of LLM problem
Episodic RL problem:
- initial state S0 = prompt
- action space A = vocabulary
- action At = token in vocabulary
- state St = prompt + all response tokens generated so far
- policy πθ(a | St) = LLM output prob
- episode ends when <|endoftext|> token is sampled
Reward model would assign rφ(x,y)
Try to maximize Ex~D,y~πθ [ rφ(x,y) ]
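Under this formalization, one episode is a token-by-token rollout that receives its reward only at the end; a sketch with stub policy and reward model (both stubs, the tiny vocab, and the prompt are assumptions for illustration):

```python
import random

VOCAB = ["good", "bad", "<|endoftext|>"]

def policy(state):
    """Stub for pi_theta(a | S_t): uniform over the vocab (assumption)."""
    return [1.0 / len(VOCAB)] * len(VOCAB)

def reward_model(prompt, response):
    """Stub for r_phi(x, y): rewards responses containing 'good'."""
    return 1.0 if "good" in response else 0.0

random.seed(0)
prompt = "How was the movie?"
response = []
while True:                               # episodic rollout
    state = prompt + " " + " ".join(response)
    token = random.choices(VOCAB, weights=policy(state))[0]
    if token == "<|endoftext|>":          # episode ends here
        break
    response.append(token)
r = reward_model(prompt, response)        # reward only at the end
print(r)
```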
Reinforcement Learning (RL) Summary
Reinforcement Learning (RL)
RL: Language Generation
-> the per-token reward is decomposed from the reward on the full sequence
RL: One step Language Generation
Reinforcement Learning (RL)
How do we get rewards?
Rule-based rewards
Rule-based rewards
Model-based rewards
Model-based rewards: Preference
Evaluating reward models
Optimizing for human preferences
Optimizing for human preferences
RLHF provides additional gains
RLHF Summary
InstructGPT: scaling up RLHF to many tasks