On the Rollout-Training Mismatch in Modern RL Systems
Feng Yao (UCSD)
August 27, 2025 – Presented at TsinghuaNLP
Efficient RL systems are rising
It also brings an issue…
The rollout engine and the training engine assign different probabilities to the very same tokens — a mismatch!
[Figures: per-token probability gap, Max Diff = 1.0; DAPO Qwen2.5-32B and DS-Qwen2.5-1.5B]
This implicitly makes RL “Off-Policy”!
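As a concrete illustration of how such a gap can be measured (a minimal sketch; the function name and the way log-probs are obtained are assumptions, not the slides' code):

```python
import torch

def max_prob_diff(logp_rollout: torch.Tensor, logp_train: torch.Tensor) -> float:
    """Max per-token probability gap for the SAME sampled tokens.

    logp_rollout: log-probs reported by the inference engine (e.g. vLLM)
    logp_train:   log-probs recomputed by the training engine on the
                  same token ids, under the same weights
    """
    return (logp_rollout.exp() - logp_train.exp()).abs().max().item()

# A value near 1.0 (as on the slides) means the rollout engine sampled a
# token with probability ~1 that the training engine scores near 0.
```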
But it can be fixed effectively
Harvesting the Off-Policyness via Quantization
Outline
1. Why does the rollout-training mismatch occur?
2. How to fix the off-policy issue it brings
3. Harvesting off-policy in quantization
4. Analyzing the effectiveness of different fixes
Why does Rollout-Training Mismatch occur?
Rollout and training run on separate engines (e.g., vLLM for inference; FSDP or Megatron for training). Even with identical weights, the two engines use different kernels and numerics, so they produce different token probabilities.
How to Fix the Off-Policy Issue It Brings
It helps, but the gap still exists.
Our fix: Truncated Importance Sampling (TIS) — reweight the policy-gradient loss by the importance ratio between training-engine and rollout-engine probabilities, truncated from above.
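A minimal sketch of a TIS-corrected policy-gradient loss, assuming token-level ratios and a truncation threshold C (tensor names and the default cap are illustrative, not the exact formulation on the slides):

```python
import torch

def tis_pg_loss(
    logp_train: torch.Tensor,    # log pi_theta(y_t | ...) from the training engine
    logp_rollout: torch.Tensor,  # log pi_rollout(y_t | ...) from the inference engine
    advantages: torch.Tensor,    # per-token advantages
    cap: float = 2.0,            # truncation threshold C
) -> torch.Tensor:
    # importance ratio between training and rollout policies for each token
    ratio = torch.exp(logp_train - logp_rollout)
    # truncate from above and detach: the ratio acts as a fixed coefficient
    weight = torch.clamp(ratio, max=cap).detach()
    # REINFORCE-style objective, reweighted token-wise
    return -(weight * advantages * logp_train).mean()
```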
Why Not Alternative Methods?
A commonly asked variant:
It can break out of the trust region.
The ratio can be too large and make training crash.
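To make both failure modes concrete, here is a sketch of one such variant — folding the rollout ratio straight into a PPO-style clipped objective. This reconstruction is my assumption, not necessarily the variant on the slide:

```python
import torch

def ppo_style_variant_loss(logp_train, logp_rollout, adv, eps=0.2):
    r = torch.exp(logp_train - logp_rollout)  # mismatch ratio, unbounded above
    clipped = torch.clamp(r, 1.0 - eps, 1.0 + eps)
    return -torch.min(r * adv, clipped * adv).mean()

# For adv < 0, min(r*adv, clipped*adv) = r*adv whenever r > 1 + eps, so the
# per-token loss -r*adv grows without bound as r explodes: the update can
# leave the trust region, and a single extreme token can crash training.
```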
How well can TIS fix it?
Does TIS always help?
Does the Mismatch really matter?
[Figures: DAPO Qwen2.5-32B; PPO GSM8K Qwen2.5-32B]
Community Verification
Harvesting Off-Policy in Quantization
Since TIS handles the gap, can we go even further off-policy for a speedup?
Rollout generation is a bottleneck in RL training efficiency: in the DAPO-32B setting, rollout takes up ~70% of the training time.
Quantization helps speedup but hurts performance
Naively applying quantization accelerates rollout…
…but performance also degrades!
This is expected, as quantization introduces even more mismatch.
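For reference, a naively quantized rollout engine can be sketched with stock vLLM (model name and sampling parameters are illustrative; FlashRL's actual patching goes further than this):

```python
from vllm import LLM, SamplingParams

# FP8 weight quantization: faster generation, but the rollout policy's
# token probabilities drift further from the BF16 training engine's
llm = LLM(model="Qwen/Qwen2.5-32B", quantization="fp8")
params = SamplingParams(temperature=1.0, max_tokens=1024, logprobs=0)
outputs = llm.generate(["<prompt>"], params)
# the returned log-probs are the *rollout* policy's -- exactly what TIS
# needs as the denominator of its importance ratio
```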
FlashRL preserves performance with TIS
FlashRL fixes the quantization-induced mismatch with TIS.
FlashRL is implemented as a PyPI package that patches vLLM.
DAPO 32B setting:
Matches the performance of BF16 rollout with TIS.
Outperforms naive BF16 rollout (without TIS).
GSM8K 0.5B setting:
TIS works in both the INT8 and FP8 settings.
More detailed analysis
Rollout speedup: regular RL setting; standard inference setting.
End-to-end speedup & effectiveness: INT8 as a pressure test.
How to perform INT8 quantization?
FP8 quantization can be conducted naturally in an online manner.
INT8 quantization, however, requires a complicated calibration process.
Our solution: Online INT8 Quantization via Calibration Transfer — compute the calibration result once at the beginning of training and reuse it at every online step.
Observation: RL changes model weights less aggressively compared to SFT, so the initial calibration stays valid throughout training.
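A minimal sketch of the calibration-transfer idea, using a simple per-channel absmax scheme as a stand-in for the real (more involved) INT8 calibration pipeline:

```python
import torch

def calibrate_int8_scales(weight: torch.Tensor) -> torch.Tensor:
    # per-output-channel absmax scales, computed ONCE at training step 0
    return weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0

def quantize_with_cached_scales(weight: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    # at every weight sync, reuse the step-0 scales instead of re-calibrating;
    # this works because RL drifts the weights only mildly (the observation above)
    return torch.clamp((weight / scales).round(), -128, 127).to(torch.int8)

# step 0:  scales = calibrate_int8_scales(w_0)
# step t:  q_t = quantize_with_cached_scales(w_t, scales)  # no re-calibration
```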
Analyzing the Effectiveness of Different Fixes
Comparison with TIS variants (GSM8K, PPO, Qwen2.5-0.5B-Instruct):
Only TIS works consistently.
Why Recompute fails
Why PPO-IS fails
Why Vanilla-IS fails
Gradient noise.
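A tiny numerical sketch of this gradient-noise problem, using synthetic log-prob gaps rather than the slides' data:

```python
import torch

torch.manual_seed(0)
gap = torch.randn(1_000_000) * 2.0  # synthetic per-token log-prob gaps
ratio = gap.exp()                   # vanilla (untruncated) importance weights

# heavy right tail: a handful of tokens dominate the batch gradient
print(f"vanilla IS  : mean={ratio.mean():.2f}  std={ratio.std():.1f}  max={ratio.max():.0f}")

# truncation (as in TIS) bounds the worst-case coefficient at C
t = ratio.clamp(max=2.0)
print(f"truncated IS: mean={t.mean():.2f}  std={t.std():.2f}  max={t.max():.1f}")
```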
What’s beyond?
Takeaways
Thanks for Listening!
August 27, 2025 – Presented at TsinghuaNLP