1 of 41

Spurious Rewards

Rethinking Training Signals in RLVR

5/11/2025

2 of 41

Motivation

  • RL with Verifiable Rewards (RLVR) has emerged as a highly effective LLM post-training paradigm, especially for math and coding
  • RLVR was thought to require high-quality ground-truth supervision in the form of correct answers (math) or correctness tests (coding)
  • Much recent progress has been reported on Qwen models; it seems easy to get gains from RL

3 of 41

Motivation

  • Recent work shows that light supervision can work for RLVR
    • Training on one example (One-shot RL)
    • Unsupervised RL (TTRL)

Does RL actually teach the model anything?

Do we still need to scale RL?

4 of 41

Spurious Rewards

  • So what is the minimum supervision needed for RLVR?
    • We design progressively weaker, or even spurious, rewards:

Weaker and weaker supervision.

5 of 41

Spurious Rewards

  • So what is the minimum supervision needed for RLVR?

Weaker and weaker supervision.

6 of 41

Qwen2.5-Math across model sizes and datasets

  • All proposed rewards improve performance within 50 steps
  • Ground-truth rewards are best, but the alternative rewards are not far behind (except random)

7 of 41

Did we “solve RL”? …No

Extending spurious rewards to other models yields different trends

  • The end of RLVR?
  • Or just an artifact?

8 of 41

Other models across model sizes and datasets

  • Ground truth labels yield better performance than other rewards

9 of 41

What does this mean?

  • What does it mean in practice?
    • Verify proposed algorithm on multiple model families.
    • Verify the proposed algorithm against spurious baselines.
    • Transparency of the training data is important for research.

10 of 41

What works on one model doesn’t necessarily generalize

  • Previous Qwen-centric methods showing that weak supervision works mainly work only for Qwen(-Math) models under the same training setup.

11 of 41

How do spurious rewards work?

An illuminating case study: code reasoning

  • Our hypothesis (aligning with some previous works):
    • RLVR (at current academic scales) works by eliciting existing behaviors that have already been learned during pre-training.
    • We take one example of such behavior, code reasoning, for illustration.

12 of 41

How do spurious rewards work?

An illuminating case: code reasoning

  • What is code reasoning?
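    • Roughly: the model answers a math question by writing Python code (and reasoning about its output) instead of reasoning in natural language. A hypothetical trace might look like the sketch below; the problem and function name are illustrative, not taken from the original experiments.

    # Hypothetical "code reasoning" trace: the model writes Python to compute
    # the answer rather than reasoning step by step in natural language.
    # Illustrative problem: "How many positive divisors does 360 have?"

    def count_divisors(n: int) -> int:
        # Count every d in [1, n] that divides n exactly.
        return sum(1 for d in range(1, n + 1) if n % d == 0)

    print(count_divisors(360))  # -> 24, which the model reports as its final answer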

13 of 41

The effectiveness of code reasoning

Robustness to numerical perturbations

14 of 41

The effectiveness of code reasoning

(Lack of) robustness to semantic perturbations

15 of 41

How do spurious rewards work?

Qwen-Math is good at using code out of the box; RL increases the frequency of code reasoning.

16 of 41

How do spurious rewards work?

Relationship between code frequency and performance

17 of 41

How do spurious rewards work?

An illuminating case: code reasoning

18 of 41

How do spurious rewards work?

An illuminating case: code reasoning

19 of 41

How do spurious rewards work?

    • If our hypothesis is true (RLVR with spurious rewards elicits existing behaviors that correlate with good performance), then we should be able to achieve the same gains by eliciting that behavior through other means:
      • Prompting: “Let’s solve this using Python.”
      • RLVR w/ Python format rewards

An illuminating case: code reasoning

20 of 41

How do spurious rewards work?

    • Prompting: “Let’s solve this using Python.”

An illuminating case: code reasoning

21 of 41

How do spurious rewards work?

    • RLVR w/ Python format rewards (sketched below)

An illuminating case: code reasoning
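A minimal sketch of a Python format reward, assuming it checks only whether the response contains Python code and ignores answer correctness; the detection heuristic and function name are ours, not necessarily what was used in the experiments.

    import re

    def python_format_reward(response: str) -> float:
        # Reward 1.0 if the response appears to contain Python code, else 0.0.
        # Correctness of the final answer is never checked: a format-only signal.
        return 1.0 if re.search(r"```python|def |print\(", response) else 0.0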

22 of 41

How do spurious rewards work?

    • One more verification:
      • What if we force the model to learn without code reasoning?
      • Compound rewards:
        • X + no-Python reward (sketched below)

An illuminating case: code reasoning
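A sketch of the compound "X + no-Python" reward, assuming it zeroes out any base reward X whenever the response uses code, so the policy must improve without code reasoning; the wrapper and its heuristic are illustrative.

    from typing import Callable

    def with_no_python(base_reward: Callable[[str, str], float]) -> Callable[[str, str], float]:
        # Wrap a base reward X so that responses containing Python code receive 0.
        def reward(response: str, ground_truth: str) -> float:
            uses_python = "```python" in response or "def " in response  # crude heuristic
            return 0.0 if uses_python else base_reward(response, ground_truth)
        return reward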

23 of 41

How do spurious rewards work?

Other possible behavior: repetition

  • No-repetition reward

24 of 41

Training Signals from Incorrect Rewards

Hypotheses on why incorrect reward signals can still lead to better results.

  • They partially function as format rewards.
  • The incorrect answers are often reached via otherwise reasonable reasoning traces.
  • A few incorrect labels are similar to, or variants of, the correct answers (answer-extractor issues).

25 of 41

Training Signals from Random Rewards

Clipping Bias.

  • Improvement holds for any non-zero reward probability.
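A random reward here is a Bernoulli signal independent of the response: each rollout receives reward 1 with some probability and 0 otherwise. A minimal sketch (the parameter name gamma is ours):

    import random

    def random_reward(response: str, gamma: float = 0.5) -> float:
        # Reward is independent of the response: 1 with probability gamma, else 0.
        return 1.0 if random.random() < gamma else 0.0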

26 of 41

Why does random reward work?

One intuitive (but inaccurate) hypothesis

  • There are more code-reasoning responses than language-reasoning responses, so a reward assigned uniformly at random would reward code reasoning more often than language reasoning (?)

27 of 41

Why does random reward work?

One intuitive (but inaccurate) hypothesis

  • There are more code-reasoning responses than language-reasoning responses, so a reward assigned uniformly at random would reward code reasoning more often than language reasoning (?)
  • However…
  • The advantage calculation in GRPO

28 of 41

Why does random reward work?

One intuitive (but inaccurate) hypothesis

  • There are more code-reasoning responses than language-reasoning responses, so a reward assigned uniformly at random would reward code reasoning more often than language reasoning (?)
  • However…
  • The advantage calculation in GRPO
    • Contains a normalization term;

29 of 41

Why does random reward work?

One intuitive (but inaccurate) hypothesis

  • There are more code-reasoning responses than language-reasoning responses, so a reward assigned uniformly at random would reward code reasoning more often than language reasoning (?)
  • However…
  • The advantage calculation in GRPO
    • Contains a normalization term;
    • A reward of 0 becomes a penalty after normalization, so in expectation code reasoning is rewarded and penalized equally, just like language reasoning.
  • E[A|x,y]=0
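For reference, the group-normalized advantage in GRPO (standard notation, which may differ slightly from the slides): for a group of G rollouts with rewards R_1, ..., R_G,

    A_i = \frac{R_i - \operatorname{mean}(R_1, \dots, R_G)}{\operatorname{std}(R_1, \dots, R_G)}

If rewards are assigned independently of the responses (e.g., i.i.d. Bernoulli), the advantage carries no information about response quality, hence E[A|x,y] = 0.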

30 of 41

Training Signals from Random Rewards

Clipping Bias.

  • Closer look at the gradients.
    • GRPO objective.
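For reference, a standard form of the GRPO clipped surrogate objective (KL penalty omitted); notation follows common usage and may differ slightly from the slides:

    J(\theta) = \mathbb{E}\!\left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|y_i|} \sum_{t=1}^{|y_i|}
        \min\!\big( r_{i,t}(\theta)\, A_i,\ \operatorname{clip}(r_{i,t}(\theta),\, 1-\varepsilon,\, 1+\varepsilon)\, A_i \big) \right],
    \qquad
    r_{i,t}(\theta) = \frac{\pi_\theta(y_{i,t} \mid x, y_{i,<t})}{\pi_{\mathrm{old}}(y_{i,t} \mid x, y_{i,<t})}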


31 of 41

Training Signals from Random Rewards

  • Assume rewards are i.i.d. random Bernoulli. Then E[A|x,y]=0
  • Gradient bias, defined as the difference between the expected gradient with and without clipping (written out below):

  • Clipping bias discourages the model from leaving the clipping region
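In symbols, this can be written as the difference between the expected gradients of the clipped and unclipped objectives (a restatement of the definition above, not the paper's exact derivation):

    \mathrm{Bias}(\theta) = \mathbb{E}_{y \sim \pi_{\mathrm{old}}}\!\left[ \nabla_\theta J_{\mathrm{clip}}(\theta; x, y) - \nabla_\theta J_{\mathrm{no\text{-}clip}}(\theta; x, y) \right]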

Clipping Bias.

32 of 41

Training Signals from Random Rewards

  • Example: pi_old(y_t) = 0.85, eps = 0.2. The upper clipping threshold pi_old(y_t) * (1 + eps) = 1.02 exceeds 1, so it can never be reached.
  • The bias is therefore non-negative for this token and pushes the model to increase its probability even further.

Figure credit: Alex Nikulkov

Clipping Bias.

33 of 41

Training Signals from Random Rewards

Clipping Bias.

  • Removing clipping results in more stochastic performance; little or no improvement observed on average.

Different ways of removing/avoiding clipping.

34 of 41

Effect of Clipping

on token probability and code frequency.

35 of 41

Prompt engineering to elicit prior knowledge

Model sensitivity to various prompts

36 of 41

The Impact of Prompts

37 of 41

The Impact of Prompts

Spurious prompts.

LaTeX placeholder text generated by \lipsum

38 of 41

The Impact of Prompts

Format-following and accuracy.

39 of 41

What does this mean for RL/post-training?

  • Proposed RLVR methods should be
    1. Validated across model families.
    2. Compared against spurious baselines.
  • Much of the gains from academic-scale RL might be from eliciting pre-existing capabilities.
  • Installing desirable capabilities during the pre-training and mid-training stage could enable more effective post-training.

40 of 41

Thank you!

Blogpost

41 of 41

Lorem Ipsum is all you need 😈

Information-less eval prompt boosts MATH-500 performance by 19.4% on Qwen-Math-7B