Simpler is Better:
Finding the Best Reward Function in Long Chain-of-Thought Reinforcement Learning for Small Language Models
PI: Prof. Wei Hu
Presenters: Zichen Zhang, Junkuan Liu, Luning Wang
April 21, 2025
University of Michigan
Outline
Introduction and Background
[1] Kaufmann, Timo, et al. "A survey of reinforcement learning from human feedback." arXiv preprint arXiv:2312.14925 (2023).
[2] Gao, Leo, John Schulman, and Jacob Hilton. "Scaling laws for reward model overoptimization." International Conference on Machine Learning. PMLR, 2023.
[3] Guo, Daya, et al. "DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning." arXiv preprint arXiv:2501.12948 (2025).
Introduction and Background
[1] Yeo, Edward, et al. "Demystifying Long Chain-of-Thought Reasoning in LLMs." arXiv preprint arXiv:2502.03373 (2025).
Introduction and Background
Outline
Classic/Normal Reward
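A minimal sketch of such a classic, rule-based outcome reward in a GRPO-style setup: each completion earns a fixed positive score if its extracted final answer matches the gold answer, and nothing otherwise. The `extract_final_answer` helper and the score values are illustrative assumptions, not our exact implementation.

```python
import re

def extract_final_answer(completion: str) -> str:
    """Illustrative helper: take the last number in the completion as the final answer."""
    numbers = re.findall(r"-?\d+\.?\d*", completion)
    return numbers[-1] if numbers else ""

def correctness_reward(completions: list[str], answers: list[str]) -> list[float]:
    """Classic/normal reward: fixed positive score for a correct final answer, 0 otherwise."""
    return [
        2.0 if extract_final_answer(c) == a else 0.0  # the score of 2.0 is an assumption
        for c, a in zip(completions, answers)
    ]
```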
Problems with Classic Reward
Cosine Reward in Yeo et al. (2025)
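A sketch of the length-scaled cosine reward along the lines of Yeo et al. (2025): the reward is a cosine interpolation over generation length, so correct answers earn more when shorter and wrong answers are penalized less when longer, with a separate penalty for exceeding the length budget. The end-point values below are placeholders, not necessarily the paper's (or our) exact settings.

```python
import math

def cosine_interp(t: int, T: int, value_at_0: float, value_at_T: float) -> float:
    """Cosine interpolation from value_at_0 (at t = 0) to value_at_T (at t = T)."""
    return value_at_T + 0.5 * (value_at_0 - value_at_T) * (1.0 + math.cos(t * math.pi / T))

def cosine_reward(is_correct: bool, gen_len: int, max_len: int = 1024,
                  r_correct_0: float = 2.0, r_correct_max: float = 1.0,
                  r_wrong_0: float = -10.0, r_wrong_max: float = 0.0,
                  r_exceed: float = -10.0) -> float:
    """Length-dependent outcome reward (end-point values are assumptions)."""
    if gen_len >= max_len:          # blew the completion budget
        return r_exceed
    if is_correct:                  # shorter correct answers score higher
        return cosine_interp(gen_len, max_len, r_correct_0, r_correct_max)
    return cosine_interp(gen_len, max_len, r_wrong_0, r_wrong_max)  # longer wrong answers penalized less
```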
Why Cosine Rewards?
Cosine rewards often…
lead to unhealthy long-CoT reasoning at the start of training.
An N-gram repetition penalty was developed to mitigate this issue.
N-gram Repetition Penalty in Cosine Reward
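A sketch of an N-gram repetition penalty in the spirit of Yeo et al. (2025): the fraction of duplicated n-grams in a completion is scaled by a constant negative weight and added to the cosine reward. Treating the penalty as a single per-sequence term, and the specific weight value, are simplifying assumptions here.

```python
def repetition_penalty(tokens: list[str], n: int = 20, weight: float = -0.05) -> float:
    """Penalize duplicated n-grams: more repetition -> larger negative penalty."""
    if len(tokens) < n:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    duplicated = len(ngrams) - len(set(ngrams))      # count of repeated n-grams
    return weight * duplicated / len(ngrams)         # the weight is a hand-picked assumption
```

With a constant `weight`, the penalty is applied with the same strength throughout training, which motivates the next question.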
Weight of Rep Penalty is Constant
Can we dynamically adjust the weight of the repetition penalty?
Dynamic Rewards
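One illustrative way to make the repetition-penalty weight dynamic is to schedule it with training progress, e.g. starting small so early exploration is not crushed and ramping up as training proceeds. The linear schedule and its end points below are assumptions for illustration, not a statement of the exact schedule we trained with.

```python
def dynamic_rep_weight(step: int, total_steps: int = 500,
                       start_weight: float = -0.01, end_weight: float = -0.1) -> float:
    """Linearly ramp the repetition-penalty weight over training (illustrative schedule)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return start_weight + progress * (end_weight - start_weight)

# Example: feed the scheduled weight into the n-gram penalty from the previous sketch
# penalty = repetition_penalty(tokens, n=20, weight=dynamic_rep_weight(current_step))
```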
Other associated rewards
Int Reward
Format/XML Reward
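Sketches of these two auxiliary rewards, in the style of common GRPO recipes for GSM8K: a small bonus when the final answer parses as an integer (GSM8K gold answers are integers), and a small bonus when the completion follows an XML-like template. The tag names and score values are assumptions about our format.

```python
import re

def int_reward(completions: list[str]) -> list[float]:
    """Small bonus if the completion's last token is a bare integer."""
    rewards = []
    for c in completions:
        last = c.strip().split()[-1] if c.strip() else ""
        rewards.append(0.5 if last.lstrip("-").isdigit() else 0.0)  # 0.5 is an assumption
    return rewards

def format_reward(completions: list[str]) -> list[float]:
    """Small bonus if the completion follows the expected XML-like layout."""
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"   # assumed tag names
    return [0.5 if re.search(pattern, c, flags=re.DOTALL) else 0.0 for c in completions]
```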
Experimental Setup
Qwen2.5-3B-Instruct
Unsloth library with LoRA + GRPO Trainer
Trained on the GSM8K training split and tested on its test split
Trained for 500 steps, evaluated every 25 steps
Max completion length of 1,024 tokens
GRPO group size of 8 responses per question
Repetition penalty n-grams with n = 20
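A configuration sketch that matches this setup, assuming the Unsloth + TRL GRPOTrainer stack; argument names follow recent TRL/Transformers releases and may differ slightly across versions, and the LoRA rank, learning rate, prompt formatting, and the placeholder reward function are assumptions rather than our exact values.

```python
from unsloth import FastLanguageModel
from trl import GRPOConfig, GRPOTrainer
from datasets import load_dataset

# Qwen2.5-3B-Instruct with LoRA adapters (rank/alpha are assumptions)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-3B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model, r=16, lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# GSM8K: train split for RL, test split for evaluation
# (mapping each example to a "prompt" column is omitted here)
dataset = load_dataset("openai/gsm8k", "main")

def placeholder_reward(completions, **kwargs):
    # Plug in the correctness / cosine / repetition-penalty rewards sketched earlier.
    return [0.0 for _ in completions]

args = GRPOConfig(
    output_dir="outputs",
    max_steps=500,                # 500 training steps
    eval_strategy="steps",
    eval_steps=25,                # evaluate every 25 steps
    num_generations=8,            # GRPO group size: 8 responses per question
    max_completion_length=1024,   # 1,024-token completion budget
    learning_rate=5e-6,           # assumption
)

trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[placeholder_reward],
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
)
trainer.train()
```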
Outline
Experiment setup
Model
Dataset
Training hyperparameters
Reward Functions
Reasoning Length
Cosine and dynamic rewards effectively reduced the reasoning length, which matches our expectations.
Accuracy
Both cosine and dynamic rewards led to a drop in validation accuracy compared to the normal-reward setting.
Rewards & repetition penalty
Aha words
The occurrence of "aha" words (e.g., 'wait', 'recheck', 'however') was infrequent across all experiments.
Discussion: The Challenge of RL for SLMs
Reward Sparsity: Because SLMs frequently produce incorrect or improperly formatted outputs, they often fail to receive any positive outcome-based reward. This makes the reward signal extremely sparse and the solution space difficult to explore effectively.
Overfitting to Outcome Reward: Relying on simple, rule-based rewards can cause the model to overfit to easy signals while neglecting the actual reasoning task.
Limitations
Limited Model and Dataset Scope: Our experiments were conducted on a single model and a single dataset. To generalize our conclusions, broader evaluations across diverse models and datasets are necessary.
Resource-Constrained Fine-Tuning Approach: Due to resource limitations, we adopted LoRA fine-tuning, whereas many related works employed full fine-tuning. This difference may affect direct comparability with other studies.
Thank you!