Spurious Rewards
Rethinking Training Signals in RLVR
5/11/2025
Motivation
Does RL actually teach the model anything?
Do we still need to scale RL?
Spurious Rewards
Weaker and weaker supervision.
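To make "weaker and weaker supervision" concrete, here is a minimal sketch of the reward spectrum, from full supervision down to pure noise; the function names and exact definitions are illustrative assumptions, not the reference implementation:

```python
import random
import re

def ground_truth_reward(response: str, answer: str) -> float:
    """Full supervision: 1 if the boxed answer matches the gold label."""
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    return 1.0 if match and match.group(1).strip() == answer.strip() else 0.0

def format_reward(response: str) -> float:
    """Weaker: reward the mere presence of a boxed answer, right or wrong."""
    return 1.0 if re.search(r"\\boxed\{", response) else 0.0

def incorrect_reward(response: str, wrong_answer: str) -> float:
    """Spurious: reward agreement with a deliberately wrong label."""
    return ground_truth_reward(response, wrong_answer)

def random_reward(response: str) -> float:
    """No supervision at all: a coin flip, independent of the response."""
    return float(random.random() < 0.5)
```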
Qwen2.5-Math across model sizes and datasets
Did we “solve RL”? …No
Extending spurious rewards to other models yields different trends
Other models across model sizes and datasets
What does this mean?
What works on one model doesn’t necessarily generalize
How do spurious rewards work?
An illuminating case study: code reasoning
The effectiveness of code reasoning
Robustness to numerical perturbations
The effectiveness of code reasoning
(Lack of) robustness to semantic perturbations
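One way to run the numerical test is to shift every integer in a problem and check whether the code-reasoning answer tracks the change; a minimal sketch with a hypothetical helper (semantic perturbations, which change what the problem means, cannot be produced by a regex and need manual or LLM-based rewrites):

```python
import random
import re

def perturb_numbers(problem: str, max_shift: int = 7) -> str:
    """Replace every integer in the problem with a nearby but different value,
    preserving the problem's structure."""
    return re.sub(
        r"\d+",
        lambda m: str(int(m.group()) + random.randint(1, max_shift)),
        problem,
    )
```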
How do spurious rewards work?
Qwen-Math is good at using code out of the box; RL increases code frequency.
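A simple way to quantify code-reasoning frequency is to count completions that contain a fenced Python block; a sketch, assuming the model delimits its code with markdown fences:

```python
def code_frequency(completions: list[str]) -> float:
    """Fraction of completions that reason via a fenced Python code block."""
    if not completions:
        return 0.0
    return sum("```python" in c for c in completions) / len(completions)
```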
How do spurious rewards work?
Relationship between code frequency and performance
How do spurious rewards work?
Another possible behavior: repetition
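Repetition can be flagged with a cheap n-gram statistic; a sketch (the choice of n is arbitrary):

```python
def repetition_ratio(text: str, n: int = 4) -> float:
    """Share of n-grams that are repeats; values near 1.0 indicate looping."""
    tokens = text.split()
    if len(tokens) < n:
        return 0.0
    ngrams = [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]
    return 1.0 - len(set(ngrams)) / len(ngrams)
```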
Training Signals from Incorrect Rewards
A hypothesis for why incorrect reward signals can still lead to better results.
Training Signals from Random Rewards
Clipping Bias.
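For reference, the clipped surrogate that GRPO inherits from PPO, written in standard notation:

\[
\mathcal{L}(\theta) = \mathbb{E}\left[\min\left(\rho_t\,\hat{A}_t,\ \operatorname{clip}(\rho_t,\,1-\varepsilon,\,1+\varepsilon)\,\hat{A}_t\right)\right],
\qquad
\rho_t = \frac{\pi_\theta(o_t \mid q, o_{<t})}{\pi_{\theta_{\text{old}}}(o_t \mid q, o_{<t})}
\]

The clip acts asymmetrically depending on the sign of \(\hat{A}_t\), so off-policy updates need not average to zero even when the advantages are pure noise; this asymmetry is the "clipping bias" the following slides develop.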
Why does random reward work?
One intuitive (but inaccurate) hypothesis
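Whatever the hypothesis, one baseline fact makes the random-reward result surprising in the first place: under GRPO's group normalization, a reward drawn independently of the outputs carries no first-order signal,

\[
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1,\dots,r_G)}{\operatorname{std}(r_1,\dots,r_G)},
\qquad
\mathbb{E}\big[\hat{A}_i\big] = 0 \quad \text{when } r_i \text{ is independent of output } o_i,
\]

so the expected policy gradient vanishes, and any real effect must enter through the optimizer itself, such as the clipping term above.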
Training Signals from Random Rewards
Clipping Bias. (Figure credit: Alex Nikulkov)
Different ways of removing/avoiding clipping.
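A minimal sketch of one such ablation, a toggle that drops the clip term from the surrogate loss (not the authors' training code); another route is to keep updates strictly on-policy, where the ratio stays at 1 and the clip never binds:

```python
import numpy as np

def surrogate_loss(ratio: np.ndarray, adv: np.ndarray,
                   eps: float = 0.2, clip: bool = True) -> float:
    """PPO/GRPO-style surrogate objective; clip=False removes the clipping term."""
    unclipped = ratio * adv
    if clip:
        clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
        objective = np.minimum(unclipped, clipped)
    else:
        objective = unclipped
    return -float(objective.mean())  # loss = negative objective
```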
Effect of clipping on token probability and code frequency.
Prompt engineering to elicit prior knowledge
Model sensitivity to various prompts
The Impact of Prompts
Spurious prompts.
LaTeX placeholder text generated by \lipsum
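Concretely, the spurious prompt is just placeholder text carrying zero task information; a sketch of splicing it into an eval prompt (the template is hypothetical; the string is the opening of the standard \lipsum[1] paragraph):

```python
# Opening of the standard \lipsum[1] paragraph; contains no task information.
LIPSUM = (
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ut purus elit, "
    "vestibulum ut, placerat ac, adipiscing vitae, felis."
)

def spurious_prompt(question: str) -> str:
    # Replace the usual instruction text with placeholder prose.
    return f"{LIPSUM}\n\n{question}"
```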
The Impact of Prompts
Format-following and accuracy.
What does this mean for RL post-training?
Thank you!
Blogpost
Lorem Ipsum is all you need 😈
An information-less eval prompt boosts MATH-500 performance by 19.4% on Qwen-Math-7B.