1 of 25

Simpler is Better:

Finding the Best Reward Function in Long Chain-of-Thought Reinforcement Learning for Small Language Models

PI: Prof. Wei Hu

Presenters: Zichen Zhang, Junkuan Liu, Luning Wang

April 21, 2025

University of Michigan

2 of 25

Outline

  • Introduction and Background
  • Methodology
  • Experiments and Results
  • Discussion
  • Limitations and Future work


3 of 25

Introduction and Background

  • Most RLHF methods before 2025 relied on model-based rewards[1], which may suffer from reward hacking[2][3] and complicate the training process.
  • DeepSeek-R1[3] proposes using verifiable rewards to improve complex reasoning abilities, which is surprisingly successful.
    • Accuracy reward: sparse reward for a correct final answer.
    • Format reward: sparse reward for the desired output format.


[1] Kaufmann, Timo, et al. "A survey of reinforcement learning from human feedback." arXiv preprint arXiv:2312.14925 (2023).

[2] Gao, Leo, John Schulman, and Jacob Hilton. "Scaling laws for reward model overoptimization." International Conference on Machine Learning. PMLR, 2023.

[3] Guo, Daya, et al. "Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning." arXiv preprint arXiv:2501.12948 (2025).


4 of 25

Introduction and Background

  • A subsequent paper[1] further discusses the mechanics of long CoT reasoning and proposes reward shaping to stabilize and control CoT length while improving accuracy.
    • Cosine reward with a repetition penalty: stabilizes CoT growth while encouraging emergent reasoning behaviors such as branching and backtracking.


[1] Yeo, Edward, et al. "Demystifying Long Chain-of-Thought Reasoning in LLMs." arXiv preprint arXiv:2502.03373 (2025).


5 of 25

Introduction and Background

  • However, there is little discussion of how these different types of rewards affect small language models (SLMs), which typically have fewer than 7B parameters and can behave differently from large models.

  • We focus on the effect of different rewards on small models (~3B).
    • We first propose a dynamic reward that extends the concept of the cosine reward, and then experiment with several kinds of rewards (normal, cosine, and dynamic) on the chosen SLMs.
    • Through careful observation and analysis, we provide several key insights that could benefit future studies in this field.


6 of 25

Outline

  • Introduction and Background
  • Methodology
  • Experiments and Results
  • Discussion
  • Limitations and Future work


7 of 25

Classic/Normal Reward
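A minimal sketch of the classic rule-based reward, assuming the <reasoning>/<answer> output format described later in the deck; the function names and reward magnitudes below are illustrative, not necessarily the exact values used in our experiments.

import re

def extract_answer(completion: str) -> str:
    """Pull the final answer out of the <answer> ... </answer> block, if present."""
    match = re.search(r"<answer>\s*(.+?)\s*</answer>", completion, re.DOTALL)
    return match.group(1).strip() if match else ""

def accuracy_reward(completion: str, gold_answer: str) -> float:
    """Sparse outcome reward: positive only when the final answer is correct."""
    return 1.0 if extract_answer(completion) == gold_answer.strip() else 0.0

def format_reward(completion: str) -> float:
    """Sparse format reward: positive only when the expected template is followed."""
    pattern = r"^<reasoning>.*?</reasoning>\s*<answer>.*?</answer>\s*$"
    return 0.5 if re.match(pattern, completion.strip(), re.DOTALL) else 0.0

def classic_reward(completion: str, gold_answer: str) -> float:
    return accuracy_reward(completion, gold_answer) + format_reward(completion)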


8 of 25

Problems with Classic Reward

  • Prior work observed unstable CoT length scaling, in which the model exceeds the allowable context window, leading to worse performance.
  • This was seen in large models, including both Llama3.1-8B and Qwen2.5-Math-7B.
  • What about small models?


9 of 25

Cosine Reward in Yeo et al. (2025)
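A minimal sketch of the length-dependent cosine reward described in Yeo et al. (2025); the endpoint values below are illustrative assumptions, not the paper's hyperparameters.

import math

def cosine_interp(t: int, T: int, r_start: float, r_end: float) -> float:
    """Cosine interpolation from r_start at length 0 to r_end at length T."""
    return r_end + 0.5 * (r_start - r_end) * (1.0 + math.cos(math.pi * t / T))

def cosine_reward(is_correct: bool, gen_len: int, max_len: int = 1024) -> float:
    if is_correct:
        # Correct: reward shrinks as the completion grows, so shorter is better.
        return cosine_interp(gen_len, max_len, r_start=1.0, r_end=0.5)
    # Incorrect: the penalty shrinks in magnitude as the completion grows.
    return cosine_interp(gen_len, max_len, r_start=-1.0, r_end=-0.5)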


10 of 25

Why Cosine Rewards?

  • For correct answers, shorter completions are preferred

  • For incorrect answers, longer completions are penalized less severely
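For example, with the illustrative endpoint values in the sketch on the previous slide and a 1,024-token cap, a correct 100-token completion scores about 0.99 versus about 0.52 at 900 tokens, while an incorrect completion scores about -0.99 at 100 tokens versus about -0.52 at 900 tokens.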


11 of 25

Cosine rewards often …

lead to unhealthily long CoT reasoning at the start of training.

An N-gram repetition penalty was developed to mitigate this issue, yielding:

  • better performance
  • shorter CoTs


12 of 25

N-gram Repetition Penalty in Cosine Reward
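One common way to implement such a penalty, sketched below with an assumed penalty scale (the exact formulation in Yeo et al. (2025) differs in its details), is to penalize the fraction of duplicated word-level n-grams.

def repetition_penalty(completion: str, n: int = 20, max_penalty: float = -1.0) -> float:
    """Penalty proportional to the fraction of duplicated word-level n-grams.

    Returns 0.0 for repetition-free text and approaches max_penalty (an assumed
    scale) as repetition increases.
    """
    words = completion.split()
    if len(words) < n:
        return 0.0
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    duplicated = len(ngrams) - len(set(ngrams))
    return max_penalty * duplicated / len(ngrams)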


13 of 25

Weight of Rep Penalty is Constant

  • At the start of training, when accuracy is low, the cosine reward by definition incentivizes the model to think longer. This leads to more reward hacking through repetition, demanding a stronger repetition penalty.
  • In the middle and at the end of training, when accuracy is higher, reward hacking becomes less likely, so a weaker repetition penalty suffices.

Can we dynamically adjust the weight of the repetition penalty?


14 of 25

Dynamic Rewards
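A minimal sketch of the idea, reusing the cosine_reward and repetition_penalty sketches above and assuming the penalty weight simply decays with the running accuracy; the exact schedule used in our experiments is not reproduced here.

def dynamic_reward(is_correct: bool, gen_len: int, completion: str,
                   running_accuracy: float, max_len: int = 1024) -> float:
    """Cosine reward plus a repetition penalty whose weight adapts to accuracy.

    Early in training (low accuracy) the cosine term pushes toward longer CoTs,
    so the repetition penalty is weighted heavily; as accuracy improves, the
    weight is relaxed. The linear schedule below is an illustrative choice.
    """
    penalty_weight = 1.0 - running_accuracy   # strong early, weak late (assumed schedule)
    return (cosine_reward(is_correct, gen_len, max_len)
            + penalty_weight * repetition_penalty(completion, n=20))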


15 of 25

Other associated rewards

Int Reward

  • Answer must be a single integer (GSM8K); a sketch of this check follows below

Format/XML Reward

  • Answer must use <reasoning> … </reasoning> and <answer> … </answer> tags
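A rough sketch of the integer-answer check, using the same <answer> extraction as before and an assumed reward magnitude; the Format/XML reward corresponds to the format_reward sketched on the Classic/Normal Reward slide.

import re

def int_reward(completion: str) -> float:
    """Small auxiliary reward when the extracted answer is a bare integer (GSM8K)."""
    match = re.search(r"<answer>\s*(.+?)\s*</answer>", completion, re.DOTALL)
    answer = match.group(1).strip() if match else ""
    return 0.5 if re.fullmatch(r"-?\d+", answer) else 0.0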


16 of 25

Experimental Setup

Qwen2.5-3B-Instruct

Unsloth library with LoRA + GRPO Trainer

Trained on GSM8K and tested on its test split

Trained for 500 steps, evaluated every 25 steps

Max completion length of 1,024 tokens

GRPO group size of 8 responses per question

Repetition penalty n-gram size of n = 20
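These settings could be wired together roughly as follows with Unsloth and TRL's GRPOTrainer. Argument names can shift between library releases, and the prompt templating plus the full set of reward functions are omitted, so treat this as a sketch rather than our exact training script.

import re
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer
from unsloth import FastLanguageModel

# Load the 3B instruct model and attach a rank-64 LoRA adapter.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-3B-Instruct",
    max_seq_length=1536,   # prompt + completion budget (assumed value)
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model, r=64, lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# GSM8K; GRPOTrainer expects a "prompt" column. Prompt templating that requests
# the <reasoning>/<answer> format, and the correctness reward, are omitted here.
dataset = load_dataset("openai/gsm8k", "main", split="train")
dataset = dataset.rename_column("question", "prompt")

def format_reward(completions, **kwargs):
    """Batched reward: small bonus when the <reasoning>/<answer> template is followed."""
    pattern = r"^<reasoning>.*?</reasoning>\s*<answer>.*?</answer>\s*$"
    return [0.5 if re.match(pattern, c.strip(), re.DOTALL) else 0.0 for c in completions]

args = GRPOConfig(
    output_dir="grpo-qwen2.5-3b-gsm8k",
    learning_rate=5e-6,
    max_steps=500,
    logging_steps=25,            # evaluated/logged every 25 steps in our runs
    num_generations=8,           # GRPO group size
    max_completion_length=1024,
)

trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[format_reward],
    args=args,
    train_dataset=dataset,
)
trainer.train()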


17 of 25

Outline

  • Introduction and Background
  • Methodology
  • Experiments and Results
  • Discussion
  • Limitations and Future work


18 of 25

Experiment Setup

Model

  • Base Model: Qwen2.5-3B-Instruct
  • Fine-tuning: LoRA (rank=64)

Dataset

  • Training dataset: GSM8K
  • Evaluation dataset: GSM8K test split

Training hyperparameters

  • Training: 500 steps, group_size = 8, lr = 5e-6
  • Tracked Metrics: Reasoning Length, Accuracy, Rewards, Repetition Penalty, Aha Words

Reward Functions

  • Normal: Basic format & correctness rewards
  • Cosine: Length-based cosine reward + repetition penalty
  • Dynamic: Adaptive weighting of cosine & repetition penalties


19 of 25

Reasoning Length


Cosine and dynamic rewards effectively reduced the reasoning length, which matches our expectations.


20 of 25

Accuracy


Both cosine and dynamic rewards led to a drop in validation accuracy compared to the normal reward setting.


21 of 25

Rewards & repetition penalty


  • Repetition was rare across all settings.
  • The model quickly learned to output in the correct format.
  • Overall reward increased slowly for the normal reward setting. In contrast, rewards for cosine and dynamic settings fluctuated more throughout training.


22 of 25

Aha words


The occurrence of "aha" words (e.g., 'wait', 'recheck', 'however') was infrequent across all experiments.
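A minimal sketch of how these occurrences might be counted; the word list is the example set from this slide, not necessarily the full list we tracked.

import re

AHA_WORDS = ("wait", "recheck", "however")   # example markers from this slide

def count_aha_words(completion: str) -> int:
    """Count case-insensitive whole-word occurrences of the 'aha' markers."""
    text = completion.lower()
    return sum(len(re.findall(rf"\b{w}\b", text)) for w in AHA_WORDS)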


23 of 25

Discussion: The Challenge of RL for SLMs


Reward Sparsity: Because SLMs frequently produce incorrect or improperly formatted outputs, they often fail to receive any positive outcome-based reward. This makes the reward signal extremely sparse and makes it difficult to explore the solution space effectively.

Overfitting to Outcome Reward: Relying on simple, rule-based rewards can cause the model to overfit to easy signals while neglecting the actual reasoning task.


24 of 25

Limitations

Limited Model and Dataset Scope: Our experiments were conducted on a single model and a single dataset. To generalize our conclusions, broader evaluations across diverse models and datasets are necessary.

Resource-Constrained Fine-Tuning Approach: Due to resource limitations, we adopted LoRA fine-tuning, whereas many related works employed full fine-tuning. This difference may affect direct comparability with other studies.


25 of 25

Thank you!
