Efficient Reinforcement Finetuning via
Adaptive Curriculum Learning
Taiwei Shi1, Yiyang Wu1, Linxin Song1, Tianyi Zhou2, Jieyu Zhao1
1University of Southern California 2University of Maryland, College Park
ADARFT
2. Difficulty Estimation
4. Discussion: Data Difficulty on Model Performance
Qwen 2.5 7B is trained on different data difficulty distributions using PPO (Uniform, Easy-Extreme, Hard-Extreme) and with ADARFT instantiated on top of PPO (Uniform + ADARFT).
ADARFT settings: target reward β = 0.5, so the model learns at a balanced success rate; sensitivity parameter α = 2 and step size η = 50, to keep curriculum updates stable; difficulty range [0, 100]; initial target difficulty T = 0; batch size B = 1024.
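A minimal sketch of how these settings could drive the curriculum, assuming the target difficulty is shifted by η · tanh(α(r̄ − β)) after each batch and clipped to the difficulty range (the function name and exact clipping are illustrative, not taken from the source):

```python
import math

def update_target_difficulty(
    target: float,          # current target difficulty T
    avg_reward: float,      # mean batch reward r̄ (binary R in {0, 1}, so r̄ in [0, 1])
    beta: float = 0.5,      # target reward: aim for a balanced success rate
    alpha: float = 2.0,     # sensitivity of the update to the reward gap
    eta: float = 50.0,      # step size for shifting the target difficulty
    d_min: float = 0.0,     # lower end of the difficulty range
    d_max: float = 100.0,   # upper end of the difficulty range
) -> float:
    """Shift the target difficulty up when the model succeeds more often than
    beta, and down when it succeeds less often, clipped to [d_min, d_max]."""
    delta = eta * math.tanh(alpha * (avg_reward - beta))
    return min(max(target + delta, d_min), d_max)

# Example: a batch with 80% correct answers pushes the target difficulty up.
T = 0.0                                          # initial target difficulty
T = update_target_difficulty(T, avg_reward=0.8)  # ≈ 26.9
```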
3. Difficulty Distribution & Result
[Figure: difficulty distribution of each training set and main results. Panel titles: "Better Reasoning Performance" and "Higher Training Efficiency"; bar chart of Steps Saved (%) with values 26 (44%), 29 (48%), 64 (107%), 44 (73%).]
With binary rewards R ∈ {0, 1} and β = 0.5, the scaled reward gap α(r̄ − β) lies in [−1, 1]; since tanh(±1) ≈ ±0.7616, each update shifts the target difficulty by at most η · 0.7616 ≈ ±38.08.
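Written as a worked bound (the tanh-shaped update itself is an assumption inferred from the quantities above):

```latex
\alpha(\bar r - \beta) \in [-1, 1]
\quad\Longrightarrow\quad
|\Delta T| \le \eta \tanh(1) = 50 \times 0.7616\ldots \approx 38.08
```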