Policy Gradient, GRPO, DeepSeek-R1
Prof. Pavel Izmailov
Today
RL
Environment state s
Action a
Reward r
Next environment state s'
RL for reasoning
Environment state s → Math Problem
Action a → Text: Solution Attempt
Reward r → 1 if correct else 0
We can think of each token as an action, think of tool calls as changing the state of the environment, and the reward could come from a reward model…
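For concreteness, a minimal sketch of a verifiable 0/1 reward; the helper extract_final_answer and its last-line heuristic are illustrative assumptions, not from the lecture:

```python
def extract_final_answer(solution_text: str) -> str:
    """Illustrative parser: treat the last non-empty line as the final answer."""
    lines = [line.strip() for line in solution_text.splitlines() if line.strip()]
    return lines[-1] if lines else ""


def reward(solution_text: str, ground_truth: str) -> float:
    """Verifiable reward: 1 if the attempt's final answer matches the reference, else 0."""
    return 1.0 if extract_final_answer(solution_text) == ground_truth else 0.0


# Example: a two-line "solution attempt" whose final line is the answer.
print(reward("2 + 2 = 4, so the answer is:\n4", "4"))  # 1.0
print(reward("I think the answer is:\n5", "4"))        # 0.0
```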
Policy Gradient
Expand expectation wrt a
Move gradient into sum
Move reward r out of gradient
Multiply and divide to get expectation wrt a
Replace sum wrt a with expectation
Notice f’ / f = (log f)’
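Written out, for a fixed state s these steps give the standard log-derivative (REINFORCE) identity:

```latex
\begin{align*}
\nabla_\theta \, \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\!\left[ r(s,a) \right]
  &= \nabla_\theta \sum_a \pi_\theta(a \mid s)\, r(s,a)
     && \text{expand expectation w.r.t.\ } a \\
  &= \sum_a r(s,a)\, \nabla_\theta \pi_\theta(a \mid s)
     && \text{move gradient into the sum, } r \text{ out of the gradient} \\
  &= \sum_a \pi_\theta(a \mid s)\, r(s,a)\,
       \frac{\nabla_\theta \pi_\theta(a \mid s)}{\pi_\theta(a \mid s)}
     && \text{multiply and divide by } \pi_\theta(a \mid s) \\
  &= \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\!\left[
       r(s,a)\, \nabla_\theta \log \pi_\theta(a \mid s) \right]
     && \text{since } \nabla f / f = \nabla \log f .
\end{align*}
```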
Policy Gradient
Reinforce: SFT policy on successful solutions
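As a sketch, the corresponding surrogate loss; with 0/1 rewards this reduces to SFT on the successful solutions only (names and shapes here are illustrative):

```python
import torch


def reinforce_loss(logprobs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """REINFORCE surrogate loss for a batch of sampled solutions.

    logprobs: (B,) summed token log-probs of each sampled solution under the policy.
    rewards:  (B,) scalar reward per solution (e.g. 1 if correct, else 0).

    Minimizing this loss follows the policy gradient E[r * grad log pi]; with 0/1
    rewards it is exactly maximum likelihood (SFT) on the successful solutions.
    """
    return -(rewards.detach() * logprobs).mean()
```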
Logic So Far
Policy Gradient: Baselining
Corollary: subtracting a baseline B from the reward does not change the expected policy gradient.
The baseline can depend on s or on the policy, but not on the action a!
Break into two expectations
Expectation → sum
Reverse gradient of log trick
Move gradient out of the sum
Gradient of constant is 0
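Spelled out, the baseline term vanishes in expectation:

```latex
\begin{align*}
\mathbb{E}_{a \sim \pi_\theta}\!\left[(r - B)\,\nabla_\theta \log \pi_\theta(a \mid s)\right]
  &= \mathbb{E}_{a}\!\left[r\,\nabla_\theta \log \pi_\theta(a \mid s)\right]
   - B\,\mathbb{E}_{a}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\right]
     && \text{break into two expectations} \\
B\,\mathbb{E}_{a}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\right]
  &= B \sum_a \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)
     && \text{expectation} \to \text{sum} \\
  &= B \sum_a \nabla_\theta \pi_\theta(a \mid s)
     && \text{reverse the log trick} \\
  &= B\, \nabla_\theta \sum_a \pi_\theta(a \mid s)
     = B\, \nabla_\theta 1 = 0
     && \text{gradient of a constant is } 0 .
\end{align*}
```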
Policy Gradient: Baselining
SFT policy on all solutions with weights
Example: a constant baseline of 0.5; solutions with rewards 1, 0, 0, 1 get weights +0.5, −0.5, −0.5, +0.5.
Logic So Far
Policy Gradient: Baselining
Fix some s, and denote g = ∇_θ log π_θ(a | s)
We want to choose the baseline B to minimize the variance of the gradient estimate (r − B) g
Expand the variance
Drop the constant term (it doesn't depend on B, since E[B g] = B E[g] = 0)
Differentiate wrt B
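Carrying out these steps (treating g as a scalar for simplicity) gives the variance-minimizing baseline; the final approximation, which drops the correlation between g² and r, is what motivates the expected-reward baseline used below:

```latex
\begin{align*}
\operatorname{Var}\!\left[(r - B)\,g\right]
  &= \mathbb{E}\!\left[(r - B)^2 g^2\right]
   - \big(\mathbb{E}\!\left[(r - B)\,g\right]\big)^2
     && \text{expand} \\
  &= \mathbb{E}\!\left[(r - B)^2 g^2\right]
   - \big(\mathbb{E}\!\left[r\,g\right]\big)^2
     && \mathbb{E}[B\,g] = B\,\mathbb{E}[g] = 0, \text{ so the second term does not depend on } B \\
0 &= \frac{\partial}{\partial B}\,\mathbb{E}\!\left[(r - B)^2 g^2\right]
   = -2\,\mathbb{E}\!\left[(r - B)\, g^2\right]
     && \text{differentiate w.r.t.\ } B \\
\Rightarrow\quad B^{\star}
  &= \frac{\mathbb{E}\!\left[r\, g^2\right]}{\mathbb{E}\!\left[g^2\right]}
   \;\approx\; \mathbb{E}\!\left[r\right]
     && \text{if } g^2 \text{ is roughly uncorrelated with } r .
\end{align*}
```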
Logic So Far
Policy Gradient: Expected Reward Baselining
SFT policy on all solutions with weights
Example: per-problem expected rewards 0.7, 0.1, 0.2, 0.6; solutions with rewards 1, 0, 0, 1 get weights +0.3, −0.1, −0.2, +0.4.
Policy Gradient: Baselines
Expected Reward Baselining
How to estimate the baseline?
PPO: learn a value function that predicts the expected reward from the state; PPO typically computes baselines per token.
GRPO: sample G attempts per problem and use their average reward; the baseline is computed per sequence.
GRPO
SFT policy on all solutions with weights
Example: G = 3 attempts at the same problem with rewards 1, 0, 1; the baseline is their average 2/3, so the weights are +1/3, −2/3, +1/3.
…
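A minimal sketch of the group-relative weights (the function name is illustrative):

```python
import torch


def grpo_group_weights(rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative weights for G sampled attempts at the same problem.

    rewards: (G,) rewards of the attempts (e.g. 1 if correct, else 0).
    Returns r_i minus the group mean, i.e. the average reward acts as the baseline.
    """
    return rewards - rewards.mean()


# Example from the slide: rewards 1, 0, 1 -> weights +1/3, -2/3, +1/3.
print(grpo_group_weights(torch.tensor([1.0, 0.0, 1.0])))
```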
Logic So Far
GRPO
GRPO: Normalization
They introduce two extra normalization constants:
From the PG theorem's perspective: the first normalization puts lower weight on long sequences; the second puts extra weight on prompts with lower reward spread.
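Schematically, following the standard GRPO formulation (notation assumed here: q is the prompt, o_i the i-th sampled sequence of length |o_i|, r_i its reward), the two constants enter as:

```latex
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G)},
\qquad
\nabla_\theta J(\theta) \;\approx\; \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|}
  \hat{A}_i \, \nabla_\theta \log \pi_\theta\!\left(o_{i,t} \mid q,\, o_{i,<t}\right).
```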
GRPO: PPO Clipping Trick
If we want to do multiple policy update steps on the same data:
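A sketch of the clipped surrogate, assuming the standard PPO ratio notation (π_old is the policy that generated the data, ε the clip range):

```latex
\rho_{i,t}(\theta) = \frac{\pi_\theta\!\left(o_{i,t} \mid q,\, o_{i,<t}\right)}
                          {\pi_{\theta_{\mathrm{old}}}\!\left(o_{i,t} \mid q,\, o_{i,<t}\right)},
\qquad
J^{\mathrm{clip}}(\theta) = \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|}
  \min\!\Big( \rho_{i,t}(\theta)\, \hat{A}_i,\;
              \operatorname{clip}\!\big(\rho_{i,t}(\theta),\, 1 - \epsilon,\, 1 + \epsilon\big)\, \hat{A}_i \Big).
```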
New Axis of Scaling
Big picture
Training and sampling form a loop: training sends updated params to sampling; sampling sends data back to training.
Can scale sampling up!
Talk by Jimmy Ba: https://www.youtube.com/live/AvWqX_xmo8Q?si=EmeWr-YV43RhLhZ9
Practical Training: DeepSeek-R1
Pretraining
Pretraining on math-relevant data is extremely important.
120B math tokens
1.7T math tokens
Reasoning pipeline
Pretraining with lots of math tokens
Small SFT on long CoT reasoning examples
RL with verifiable rewards via GRPO
Middle step is very important too.
DeepSeek-R1
AIME performance keeps improving with RL steps.
DeepSeek-R1
Average sequence length increases with RL steps.
e3: test-time scaling
Models trained in this way demonstrate test-time scaling.
s1: test-time scaling
Models trained in this way demonstrate test-time scaling.
Open Questions?
Today
Recap: training and sampling form a loop: updated params flow from training to sampling, and data flows from sampling back to training.