On the Rollout-Training Mismatch in Modern RL Systems
November 10, 2025 – Presented at Applied Compute
Feng Yao* Liyuan Liu* Dinghuai Zhang Chengyu Dong Jingbo Shang Jianfeng Gao
Efficient RL systems are rising
It also brings an issue…
Mismatch!
DAPO Qwen2.5-32B
DS-Qwen2.5-1.5B
Max Diff = 1.0
Implicitly makes RL “Off-Policy”!
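A minimal sketch of how the mismatch could be measured, assuming you already have log-probabilities for the same sampled tokens from both the rollout engine and the training engine. The function name `mismatch_stats` and the toy numbers are hypothetical, not taken from the talk. Reading "Max Diff" as the largest absolute difference in token probability is also an assumption.

```python
import math

def mismatch_stats(rollout_logprobs, train_logprobs):
    """Compare per-token log-probs of the same sampled tokens from two engines."""
    if len(rollout_logprobs) != len(train_logprobs):
        raise ValueError("log-prob lists must have the same length")
    pairs = list(zip(rollout_logprobs, train_logprobs))
    # Absolute difference in probability space, one value per token.
    diffs = [abs(math.exp(r) - math.exp(t)) for r, t in pairs]
    # Ratio of training probability to rollout probability, one value per token.
    ratios = [math.exp(t - r) for r, t in pairs]
    return {
        "max_prob_diff": max(diffs),
        "mean_prob_diff": sum(diffs) / len(diffs),
        "max_ratio": max(ratios),
        "min_ratio": min(ratios),
    }

# Toy example: the third token is near-certain for the rollout engine
# but near-impossible for the training engine.
rollout = [math.log(0.90), math.log(0.50), math.log(0.99)]
train = [math.log(0.90), math.log(0.45), math.log(0.01)]
stats = mismatch_stats(rollout, train)
print(stats)
```

If the two engines computed identical distributions, every ratio would be 1 and every difference 0. Any deviation means the samples were drawn from a policy other than the one being updated, which is the sense in which training becomes off-policy.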
But it can be fixed effectively
Harvesting the Off-Policyness via Quantization
Outline
Why does Rollout-Training Mismatch occur?
How to Fix the Off-Policy Issue It Brings
It helps, but the gap still exists
Why Not Alternative Methods?
A commonly asked variant
Can break out of the trust region
Can be too large and make training crash
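To illustrate the "too large" failure mode, here is a small sketch that compares untruncated importance weights with capped ones. It assumes TIS denotes truncated importance sampling, where the per-token ratio between training and rollout probabilities is capped at a constant. The function name `importance_weights`, the `cap` parameter, and the value 2.0 are hypothetical choices for illustration.

```python
import math

def importance_weights(train_logprobs, rollout_logprobs, cap=None):
    """Per-token ratio p_train / p_rollout, optionally truncated at `cap`."""
    weights = []
    for t, r in zip(train_logprobs, rollout_logprobs):
        w = math.exp(t - r)
        if cap is not None:
            w = min(w, cap)  # truncation bounds the weight from above
        weights.append(w)
    return weights

# Toy log-probs: the last token is very unlikely under the rollout engine.
train = [math.log(0.50), math.log(0.30), math.log(0.20)]
rollout = [math.log(0.45), math.log(0.30), math.log(0.0001)]

vanilla = importance_weights(train, rollout)             # unbounded ratios
truncated = importance_weights(train, rollout, cap=2.0)  # capped ratios
print("vanilla:  ", vanilla)
print("truncated:", truncated)
```

In this toy case a single token receives a weight in the thousands without truncation, so it would dominate the gradient estimate. With the cap, well-matched tokens are left untouched and only the outlier is bounded.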
How well can TIS fix it?
Does TIS always help?
Does the Mismatch really matter?
DAPO Qwen2.5-32B
PPO GSM8K Qwen2.5-32B
Community Verification
Harvesting Off-Policy in Quantization
As TIS handles the gap, can we go even further off-policy for speedup?
Rollout generation is a bottleneck in RL training efficiency:
In the DAPO-32B setting, rollout takes up ~70% of the training time
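As a back-of-the-envelope check on why rollout speed matters, the sketch below applies Amdahl's law to the ~70% figure. The function name `end_to_end_speedup` and the rollout speedup of 2.0 are illustrative assumptions, not measurements from the talk.

```python
def end_to_end_speedup(rollout_fraction, rollout_speedup):
    """Amdahl's law: overall speedup when only the rollout phase gets faster."""
    if not 0.0 <= rollout_fraction <= 1.0:
        raise ValueError("rollout_fraction must be in [0, 1]")
    if rollout_speedup <= 0.0:
        raise ValueError("rollout_speedup must be positive")
    # Time left = untouched part + accelerated rollout part.
    remaining = (1.0 - rollout_fraction) + rollout_fraction / rollout_speedup
    return 1.0 / remaining

# Hypothetical: rollout is 70% of step time and becomes 2x faster.
print(end_to_end_speedup(0.7, 2.0))
# Upper bound as rollout time approaches zero: 1 / 0.3.
print(end_to_end_speedup(0.7, 1e9))
```

With rollout at 70% of the step, the overall gain is capped at roughly 3.3x however fast generation becomes, because the remaining 30% of the step is unchanged.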
Quantization speeds up rollout but hurts performance
Naively applying quantization can accelerate rollout
But the performance is also degraded!
This can be expected, as quantization introduces more mismatch
FlashRL preserves performance with TIS
FlashRL fixes it with TIS
FlashRL is implemented as a PyPI package to patch vLLM
DAPO 32B Setting
Matches the performance of BF16 rollout with TIS
Outperforms naive BF16 rollout (without TIS)
GSM8K 0.5B Setting
TIS works in both INT8 and FP8 settings
More detailed analysis
Rollout Speedup
Regular RL Setting
Standard Inference Setting
End-to-End Speedup & Effectiveness
INT8 as a pressure test
How to perform INT8 quantization?
FP8 quantization can be naturally conducted in an online manner
INT8 quantization requires a complicated calibration process
Our solution: Online INT8 Quantization via Calibration Transfer
Calculate the calibration result once at the beginning of training and reuse it at every online step
Observation: RL changes model weights less aggressively compared to SFT
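The reuse pattern can be sketched as follows. The slides do not say what the calibration result contains, so this example assumes it is a static activation scale computed from sample activations. Weights are re-quantized with a cheap per-row absmax rule at every step. All function names and numbers are hypothetical, and this is not FlashRL's actual implementation.

```python
def calibrate_activation_scale(calibration_activations):
    """Expensive step, run once: a static INT8 scale from sample activations."""
    max_abs = max(abs(x) for row in calibration_activations for x in row)
    return max_abs / 127.0

def quantize_weights_online(weights):
    """Cheap step, run at every update: per-row absmax INT8 quantization."""
    quantized, scales = [], []
    for row in weights:
        scale = max(abs(w) for w in row) / 127.0 or 1.0  # avoid a zero scale
        scales.append(scale)
        quantized.append([max(-127, min(127, round(w / scale))) for w in row])
    return quantized, scales

# Calibrate once at the beginning of training (assumed toy activations).
act_scale = calibrate_activation_scale([[0.5, -2.0], [1.27, 0.3]])

weights = [[0.10, -0.20], [0.05, 0.40]]
for step in range(3):
    # Stand-in for a small RL weight update.
    weights = [[w * 1.01 for w in row] for row in weights]
    # Only the weights are re-quantized; act_scale is reused unchanged.
    q_weights, w_scales = quantize_weights_online(weights)

print(act_scale, q_weights, w_scales)
```

Reusing the calibration result is only reasonable if the statistics it captures drift slowly, which is what the observation about RL changing weights less aggressively than SFT suggests.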
Analyzing the Effectiveness of Different Fixes
Comparison with TIS-Variants
GSM8K, PPO, Qwen2.5-0.5B-Instruct
Only TIS works consistently
Why Recompute fails
Why PPO-IS fails
Why Vanilla-IS fails
Gradient noise
Factors Contributing to Mismatch
What’s beyond?
e.g., GRPO (token-level) → GSPO (sequence-level)
Takeaways
Thanks for Listening!