Policy Gradient, GRPO, DeepSeek-R1

Prof. Pavel Izmailov

Today

  • Policy Gradient Theorem
  • GRPO
  • DeepSeekMath
  • DeepSeek-R1

RL

The RL loop: in environment state s, the agent takes action a, receives reward r, and the environment transitions to the next state s′.

RL for reasoning

  • Environment state s: a math problem
  • Action a: a text solution attempt
  • Reward r: 1 if the solution is correct, else 0

We can also think of each token as an action, of tool calls as changing the state of the environment, and the reward could come from a reward model…
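
In this framing, the objective that the policy-gradient methods below maximize can be written (standard notation, not taken verbatim from the slides) as

\[ J(\theta) \;=\; \mathbb{E}_{s \sim \mathcal{D}}\, \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\big[ r(s, a) \big], \qquad r(s, a) \in \{0, 1\}, \]

where s is a sampled problem, a is a sampled solution text, and \(\pi_\theta\) is the language-model policy.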

Policy Gradient

  • Expand expectation wrt a
  • Move gradient into sum
  • Move reward r out of gradient
  • Multiply and divide to get expectation wrt a
  • Replace sum wrt a with expectation
  • Notice f’ / f = (log f)’
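
The algebra these step labels annotate (a reconstruction in standard notation, for a fixed state s and a discrete action space):

\[
\begin{aligned}
\nabla_\theta\, \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\big[r(s,a)\big]
&= \nabla_\theta \sum_a \pi_\theta(a \mid s)\, r(s,a) && \text{(expand expectation)} \\
&= \sum_a \nabla_\theta\, \pi_\theta(a \mid s)\, r(s,a) && \text{(gradient into sum)} \\
&= \sum_a r(s,a)\, \nabla_\theta\, \pi_\theta(a \mid s) && \text{(reward out of gradient)} \\
&= \sum_a \pi_\theta(a \mid s)\, r(s,a)\, \frac{\nabla_\theta\, \pi_\theta(a \mid s)}{\pi_\theta(a \mid s)} && \text{(multiply and divide)} \\
&= \mathbb{E}_{a \sim \pi_\theta}\!\left[ r(s,a)\, \frac{\nabla_\theta\, \pi_\theta(a \mid s)}{\pi_\theta(a \mid s)} \right] && \text{(sum $\to$ expectation)} \\
&= \mathbb{E}_{a \sim \pi_\theta}\big[ r(s,a)\, \nabla_\theta \log \pi_\theta(a \mid s) \big] && (\nabla f / f = \nabla \log f).
\end{aligned}
\]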

REINFORCE with {0, 1} rewards: SFT the policy on the successful solutions.
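
A one-line check (standard argument, not verbatim from the slide): with r ∈ {0, 1},

\[ \nabla_\theta J(\theta) = \mathbb{E}\big[ r\, \nabla_\theta \log \pi_\theta(a \mid s) \big] = \mathbb{E}\big[ \mathbf{1}\{a \text{ is correct}\}\, \nabla_\theta \log \pi_\theta(a \mid s) \big], \]

so incorrect attempts contribute nothing, and the update is exactly the maximum-likelihood (SFT) gradient on the sampled solutions graded correct, i.e. rejection-sampling SFT / expert iteration.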

Logic So Far

  • Policy Gradient Theorem
    • Can update policy via gradient ascent on expected reward
    • For {0, 1} rewards ⇒ Expert Iteration

Policy Gradient: Baselining

Corollary: subtracting a baseline B from the reward leaves the expected policy gradient unchanged. The baseline can depend on s or on the policy, but not on the actions a!

Proof steps:
  • Break into two expectations
  • Expectation → sum
  • Reverse the gradient-of-log trick
  • Move the gradient out of the sum
  • Gradient of a constant is 0
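
The equations behind these steps (a standard reconstruction, for fixed s and any baseline B that does not depend on a):

\[
\begin{aligned}
\mathbb{E}_{a \sim \pi_\theta}\big[(r - B)\, \nabla_\theta \log \pi_\theta(a \mid s)\big]
&= \mathbb{E}\big[r\, \nabla_\theta \log \pi_\theta(a \mid s)\big] \;-\; B\, \mathbb{E}\big[\nabla_\theta \log \pi_\theta(a \mid s)\big], \\
\mathbb{E}\big[\nabla_\theta \log \pi_\theta(a \mid s)\big]
&= \sum_a \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)
= \sum_a \nabla_\theta\, \pi_\theta(a \mid s)
= \nabla_\theta \sum_a \pi_\theta(a \mid s)
= \nabla_\theta 1 = 0,
\end{aligned}
\]

so the baseline term vanishes in expectation and the baselined estimator is still an unbiased policy gradient.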

SFT the policy on all solutions, with weights given by reward minus baseline: with a baseline of 0.5, four sampled attempts with rewards 1, 0, 0, 1 receive weights +0.5, −0.5, −0.5, +0.5.

Logic So Far

  • Policy Gradient Theorem
    • Can update policy via gradient ascent on expected reward
    • For {0, 1} rewards ⇒ Expert Iteration
  • Can subtract baselines from reward

Policy Gradient: Baselining

Fix some s and denote g = ∇_θ log π_θ(a | s).

Corollary (from above): for any baseline B, the estimate (r − B) g has the correct expected gradient. We want to choose B to minimize the variance of this gradient estimate.

Steps:
  • Expand the variance
  • Drop the constant term (since E[B g] = 0)
  • The E[r² g²] term doesn’t depend on B
  • Differentiate wrt B
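
A reconstruction of the calculation these steps describe (treating g as a single coordinate of the score for simplicity; standard argument, not copied from the slides):

\[ \operatorname{Var}\big[(r - B)\, g\big] = \mathbb{E}\big[(r - B)^2 g^2\big] - \big(\mathbb{E}[(r - B)\, g]\big)^2 . \]

The second term is constant in B, since \(\mathbb{E}[B g] = B\,\mathbb{E}[g] = 0\), so we drop it. Expanding the first term,

\[ \mathbb{E}\big[(r - B)^2 g^2\big] = \mathbb{E}\big[r^2 g^2\big] - 2B\, \mathbb{E}\big[r g^2\big] + B^2\, \mathbb{E}\big[g^2\big], \]

where the first term doesn’t depend on B. Differentiating wrt B and setting the derivative to zero gives

\[ B^{\ast} = \frac{\mathbb{E}\big[r\, g^2\big]}{\mathbb{E}\big[g^2\big]} \;\approx\; \mathbb{E}[r], \]

where the approximation treats g² as roughly uncorrelated with r; this is the sense in which the expected reward is the (near-)optimal baseline.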

Logic So Far

  • Policy Gradient Theorem
    • Can update policy via gradient ascent on expected reward
    • For {0, 1} rewards ⇒ Expert Iteration
  • Can subtract baselines from reward
    • Optimal Baseline is given by the expected reward

Policy Gradient: Expected Reward Baselining

SFT the policy on all solutions, with weights given by reward minus the per-problem expected-reward baseline: with baselines 0.7, 0.1, 0.2, 0.6 for the four problems, attempts with rewards 1, 0, 0, 1 receive weights +0.3, −0.1, −0.2, +0.4.

Policy Gradient: Baselines

Expected Reward Baselining

How to estimate the baseline?

PPO: learn a value function that predicts reward from state; typically PPO computes baselines per-token

GRPO: sample G attempts per problem, use average reward; the baseline is computed per-sequence
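
In symbols (my notation, not the slides'): PPO uses a learned value model V_φ as a per-token baseline, while GRPO uses the empirical mean reward of the G attempts sampled for the same prompt q:

\[ b_{\text{PPO}}(s_t) = V_\phi(s_t), \qquad b_{\text{GRPO}}(q) = \frac{1}{G} \sum_{j=1}^{G} r_j . \]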

GRPO

SFT the policy on all solutions, with group-relative weights: for a group of three attempts on the same problem with rewards 1, 0, 1, the group-mean baseline is 2/3 and the weights are +1/3, −2/3, +1/3.
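
A minimal sketch of this computation in Python/NumPy (my own illustration, not code from the lecture); the optional standard-deviation rescaling anticipates the normalization discussed below:

    import numpy as np

    def grpo_advantages(rewards, normalize_std=False, eps=1e-6):
        # Group-relative advantages: subtract the group-mean reward (the GRPO baseline),
        # and optionally rescale by the group's reward standard deviation.
        r = np.asarray(rewards, dtype=float)
        adv = r - r.mean()
        if normalize_std:
            adv = adv / (r.std() + eps)
        return adv

    # Rewards 1, 0, 1 -> baseline 2/3 and weights +1/3, -2/3, +1/3, as in the example above.
    print(grpo_advantages([1, 0, 1]))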

Logic So Far

  • Policy Gradient Theorem
    • Can update policy via gradient ascent on expected reward
    • For {0, 1} rewards ⇒ Expert Iteration
  • Can subtract baselines from reward
    • Optimal Baseline is given by the expected reward
    • PPO learns a value model to predict expected reward
    • GRPO uses the mean reward in the group as baseline
  • Regardless of the baseline B, we are still doing policy gradient updates: the expected gradient doesn’t depend on B

GRPO
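
A sketch of the core GRPO update (group-baselined REINFORCE, before the extra normalizations and clipping discussed on the next slides; my notation, not the exact formula from the deck): for a prompt q with completions o_1, …, o_G sampled from the current policy,

\[ \nabla_\theta J(\theta) \;\approx\; \frac{1}{G} \sum_{i=1}^{G} \hat{A}_i\, \nabla_\theta \log \pi_\theta(o_i \mid q), \qquad \hat{A}_i = r_i - \frac{1}{G} \sum_{j=1}^{G} r_j . \]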

GRPO: Normalization

They introduce two extra normalization constants:

  • Normalize by number of tokens
  • Normalize reward spread

From the PG theorem: the second normalization puts extra weight on prompts with lower reward spread; the first puts lower weight on long sequences.
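
Written out (following the DeepSeekMath formulation, up to notation), the per-prompt estimator becomes

\[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \hat{A}_i\, \nabla_\theta \log \pi_\theta\big(o_{i,t} \mid q,\, o_{i,<t}\big), \qquad \hat{A}_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G)} , \]

so each sampled sequence contributes equally regardless of its length, and advantages are rescaled by the group's reward spread.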

GRPO: PPO Clipping Trick

If we want to do multiple policy update steps on the same data:

  • Importance weighting to adjust for change in sampling distribution
  • Pessimistic clipping objective from PPO, to remove the incentive to move the policy far from the policy used to sample the data (sketched below)
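
A sketch of the resulting per-token term (the PPO-style clipped surrogate, in the notation above; ε is the clipping range and \(\pi_{\theta_{\text{old}}}\) is the policy that generated the samples):

\[ \rho_{i,t} = \frac{\pi_\theta(o_{i,t} \mid q,\, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q,\, o_{i,<t})}, \qquad \min\!\Big( \rho_{i,t}\, \hat{A}_i,\;\; \operatorname{clip}\big(\rho_{i,t},\, 1 - \varepsilon,\, 1 + \varepsilon\big)\, \hat{A}_i \Big) . \]

Taking the min with the clipped ratio removes any gain from pushing \(\pi_\theta\) more than a factor of 1 ± ε away from the sampling policy on these tokens.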

New Axis of Scaling

Big picture

[Diagram] Training and sampling form a loop: training sends updated params to the sampler, and sampling sends data back into training.

Can scale sampling up!

Practical Training: DeepSeek-R1

Pretraining

Pretraining on math-relevant data is extremely important.

120B Math tokens

1.7T Math tokens

Reasoning pipeline

  • Pretraining with lots of math tokens
  • Small SFT on long CoT reasoning examples
  • RL with verifiable rewards via GRPO

The middle step is very important too.

DeepSeek-R1

AIME performance keeps improving with RL steps.

Average sequence length increases with RL steps.

e3: test-time scaling

Models trained in this way demonstrate test-time scaling.

s1: test-time scaling

Models trained in this way demonstrate test-time scaling.

Open Questions?

  • How can we scale up training?
  • How off-policy can we go?
  • How can we get more environments to train on?
  • What is RL actually doing?
  • Pre-training vs RL
  • CoT faithfulness
