On the Rollout-Training Mismatch in Modern RL Systems

November 10, 2025 – Presented at Applied Compute

Feng Yao* Liyuan Liu* Dinghuai Zhang Chengyu Dong Jingbo Shang Jianfeng Gao

Efficient RL systems are rising

  • VeRL/OpenRLHF/Slime adopt hybrid engines
    • Rollout: Advanced LLM inference engines (vLLM, SGLang)
    • Training: Modern LLM training backends (FSDP, Megatron)

It also brings an issue…

  • Rollout-Training Mismatch
    • Expected: a single policy π both generates rollouts and receives gradient updates
    • Implementation: Rollout engine (vLLM) + Training backends (FSDP) → Mismatch!

It also brings an issue…

  • Rollout-Training Mismatch
    • For the same rollout & model parameters, the rollout engine and the training backend report different token probabilities

[Figure: rollout vs. training token probabilities, Max Diff = 1.0; DAPO Qwen2.5-32B and DS-Qwen2.5-1.5B]

Implicitly makes RL “Off-Policy”!

But it can be fixed effectively

  • Using the classic Truncated Importance Sampling (TIS) technique
    • We show that fixing the mismatch with TIS improves training effectiveness

Harvesting the Off-Policyness via Quantization

  • Since TIS is able to handle the mismatch
    • Can we go even more “off-policy” and thus faster?

Outline

  • Why Rollout-Training Mismatch Occurs
  • How to Fix the Off-Policy Issue It Brings
  • Harvesting Rollout-Training Mismatch via Quantization
  • Analyzing the Effectiveness of Different Fixes
  • Additional Analyses

Why does Rollout-Training Mismatch occur?

  • Two common beliefs
    • Inaccessible true sampling probabilities
      • Adds an additional gap
    • Backend numerical differences
      • Hard to fix

Why does Rollout-Training Mismatch occur?

  • Hybrid Engine & Error Propagation
    • Different compute patterns via different backends & parallelism (see the toy demo below)
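
A toy demonstration (ours, not from the deck) of how numerical precision alone creates a per-token probability gap; over a long autoregressive rollout, such per-step errors can propagate and compound:

```python
import torch

# The same logits yield slightly different probabilities under BF16 vs FP32,
# mimicking the numerical differences between inference and training backends.
logits = torch.randn(32000)                        # toy vocabulary-sized logits
p_fp32 = torch.softmax(logits.float(), dim=-1)
p_bf16 = torch.softmax(logits.bfloat16(), dim=-1).float()
print((p_fp32 - p_bf16).abs().max())               # nonzero: a small per-token gap
```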

Outline

  • Why Rollout-Training Mismatch Occurs
  • How to Fix the Off-Policy Issue It Brings
  • Harvesting Rollout-Training Mismatch via Quantization
  • Analyzing the Effectiveness of Different Fixes
  • Additional Analyses

How to Fix the Off-Policy Issue It Brings

  • Trial 1 – Mitigate the system-level mismatch
    • vLLM seems to be the root cause → patch vLLM to:
      • Return the actual sampling probabilities for the vLLM V1 engine
      • Improve numerical precision by using an FP32 LM head (see the sketch below)

It helps, but the gap still exists
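
A minimal sketch of the FP32 LM-head idea (illustrative shapes and names, not vLLM’s actual internals): evaluate the final projection and softmax in FP32, so the reported sampling probabilities suffer less low-precision rounding:

```python
import torch

def fp32_token_logprobs(hidden: torch.Tensor, lm_head: torch.Tensor,
                        token_ids: torch.Tensor) -> torch.Tensor:
    """Log-probs of the sampled tokens with the LM head evaluated in FP32.

    hidden: (seq, d_model) final hidden states (e.g., BF16);
    lm_head: (vocab, d_model) projection weights; token_ids: (seq,) int64.
    """
    logits = hidden.float() @ lm_head.float().T    # FP32 matmul
    logprobs = torch.log_softmax(logits, dim=-1)   # FP32 softmax
    return logprobs.gather(-1, token_ids[:, None]).squeeze(-1)
```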

How to Fix the Off-Policy Issue It Brings

  • Trial 2 – Apply an algorithm-level fix
    • Be aware of the mismatch → importance sampling correction:
      • Recall: vanilla importance sampling
      • Expected gradient: $\nabla_\theta J = \mathbb{E}_{y \sim \pi_\theta}\big[A(y)\,\nabla_\theta \log \pi_\theta(y)\big]$
      • But currently we have $y \sim \pi_{\text{rollout}}$ while gradients are computed under $\pi_\theta = \pi_{\text{train}}$, so the estimate is biased
      • So we should fix the gradient as: $\mathbb{E}_{y \sim \pi_{\text{rollout}}}\big[\tfrac{\pi_{\text{train}}(y)}{\pi_{\text{rollout}}(y)}\,A(y)\,\nabla_\theta \log \pi_{\text{train}}(y)\big]$
      • In practice, we use Truncated Importance Sampling (TIS): $\mathbb{E}_{y \sim \pi_{\text{rollout}}}\big[\min\!\big(\tfrac{\pi_{\text{train}}(y)}{\pi_{\text{rollout}}(y)},\,C\big)\,A(y)\,\nabla_\theta \log \pi_{\text{train}}(y)\big]$ (see the sketch below)
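
A minimal PyTorch-style sketch of the TIS estimator above (assumed tensor names; the cap value C is a hyperparameter, not a prescribed default):

```python
import torch

def tis_pg_loss(logp_train: torch.Tensor, logp_rollout: torch.Tensor,
                advantages: torch.Tensor, cap: float = 2.0) -> torch.Tensor:
    """Policy-gradient loss weighted by min(pi_train / pi_rollout, C).

    logp_train: per-token log-probs from the training backend (requires grad);
    logp_rollout: per-token log-probs reported by the rollout engine.
    """
    # Truncated importance ratio, treated as a constant weight (no gradient).
    ratio = torch.exp(logp_train.detach() - logp_rollout)
    weight = torch.clamp(ratio, max=cap)
    # Gradient is weight * A * d/dtheta log pi_train, matching the TIS estimator.
    return -(weight * advantages * logp_train).mean()
```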

How to Fix the Off-Policy Issue It Brings

  • Extend to General Case
    • Expected Policy Gradient (PPO): samples come from $\pi_{\text{old}}$ and the clipped surrogate uses the ratio $r(y_t) = \pi_\theta(y_t)/\pi_{\text{old}}(y_t)$
    • VeRL/OpenRLHF’s Implementation (recompute): $\pi_{\text{old}}$ is recomputed with the training backend, while the samples actually come from $\pi_{\text{rollout}}$
    • Truncated Importance Sampling (TIS): additionally weight the surrogate by $\min\!\big(\pi_{\text{old}}(y_t)/\pi_{\text{rollout}}(y_t),\,C\big)$ (see the sketch below)
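
A sketch of how the TIS weight composes with the clipped PPO surrogate in the recompute setting (our reading of the slide; tensor conventions are assumptions):

```python
import torch

def ppo_tis_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
                 logp_rollout: torch.Tensor, advantages: torch.Tensor,
                 clip_eps: float = 0.2, cap: float = 2.0) -> torch.Tensor:
    """Clipped PPO surrogate with a truncated importance weight on top.

    logp_new: current policy, training backend (requires grad);
    logp_old: behavior policy recomputed by the training backend;
    logp_rollout: log-probs actually used by the rollout engine.
    """
    r = torch.exp(logp_new - logp_old)             # standard PPO ratio
    surrogate = torch.minimum(
        r * advantages,
        torch.clamp(r, 1 - clip_eps, 1 + clip_eps) * advantages)
    # TIS factor min(pi_old / pi_rollout, C): gradient-free, corrects the
    # rollout-training mismatch without touching PPO's trust region.
    tis = torch.clamp(torch.exp(logp_old - logp_rollout), max=cap).detach()
    return -(tis * surrogate).mean()
```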

Why Not Alternative Methods?

  • Variants of TIS
    • PPO Importance Sampling (PPO-IS): a commonly asked variant; it can break out of the trust region
    • Vanilla Importance Sampling (Vanilla-IS): the uncapped ratio can be too large and make training crash (see the comparison sketch below)
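
A simplified comparison of the per-token weights each variant induces (our illustration; real PPO-IS plugs the ratio into PPO's clipped surrogate rather than clamping a weight, but the two-sided clip below shows why it can trigger even at step 0, when the backend mismatch alone pushes the ratio away from 1):

```python
import torch

def variant_weights(logp_train: torch.Tensor, logp_rollout: torch.Tensor,
                    cap: float = 2.0, clip_eps: float = 0.2):
    r = torch.exp(logp_train - logp_rollout)
    vanilla_is = r                                             # unbounded: can blow up
    tis = torch.clamp(r, max=cap)                              # bounded above only
    ppo_is_like = torch.clamp(r, 1 - clip_eps, 1 + clip_eps)   # clips in both directions
    return vanilla_is, tis, ppo_is_like
```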

How well can TIS fix it?

  • DAPO 32B Setting
  • GSM8K 0.5B Setting
    • Normal RL: Max Diff is smaller (~0.4) than the 1.0 seen in the DAPO-32B setting
    • INT8 Rollout: Max Diff is larger (~1.0) than in the normal RL setting

Does TIS always help?

  • DAPO 1.5B Setting
    • In settings where the prob diff is relatively small
      • TIS does not always help, but it doesn’t hurt

Does the Mismatch really matter?

  • Unexpected training instability on challenging tasks (DAPO Qwen2.5-32B)
  • Possibly negligible on simple tasks (PPO GSM8K Qwen2.5-32B)

Community Verification

Outline

  • Why Rollout-Training Mismatch Occurs
  • How to Fix the Off-Policy Issue It Brings
  • Harvesting Rollout-Training Mismatch via Quantization
  • Analyzing the Effectiveness of Different Fixes
  • Additional Analyses

Harvesting Off-Policy in Quantization

As TIS handles the gap, can we go even further off-policy for speedup?

Rollout generation is a bottleneck in RL training efficiency: in the DAPO-32B setting, rollout takes up ~70% of the training time.

Quantization speeds up rollout but hurts performance

Naively applying quantization can accelerate rollout speed, but performance is also degraded!

This is expected, as quantization introduces more mismatch.

FlashRL preserves performance with TIS

Quantization introduces more mismatch; FlashRL fixes it with TIS.

FlashRL is implemented as a PyPI package that patches vLLM.

FlashRL preserves performance with TIS

  • DAPO 32B Setting
    • Matches the performance of BF16 rollout with TIS
    • Outperforms naive BF16 rollout (without TIS)

  • GSM8K 0.5B Setting
    • TIS works in both the INT8 and FP8 settings

More detailed analysis

  • Rollout Speedup
    • Regular RL Setting
    • Standard Inference Setting

More detailed analysis

  • End-to-End Speedup & Effectiveness
    • INT8 as a pressure test

How to perform INT8 quantization?

  • FP8 quantization can be naturally conducted in an online manner
  • INT8 quantization requires a complicated calibration process
  • Our solution: Online INT8 Quantization via Calibration Transfer
    • Calculate the calibration result once at the beginning of training and reuse it at every online step (see the toy sketch below)
    • Observation: RL changes model weights less aggressively compared to SFT
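
A toy sketch of calibration transfer (weight-only and simplified; real INT8 calibration also involves activation statistics): compute per-channel scales once at step 0, then reuse them when re-quantizing the slightly-updated weights at every later step:

```python
import torch

def calibrate_scales(weight: torch.Tensor) -> torch.Tensor:
    # Per-output-channel symmetric INT8 scales from the initial weights.
    return weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0

def quantize_int8(weight: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    return torch.clamp((weight / scales).round(), -128, 127).to(torch.int8)

w0 = torch.randn(1024, 1024)             # step-0 policy weights
scales = calibrate_scales(w0)            # calibrate once

w_t = w0 + 1e-3 * torch.randn_like(w0)   # RL updates move weights only slightly,
q_t = quantize_int8(w_t, scales)         # so the step-0 scales stay valid
```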

Outline

  • Why Rollout-Training Mismatch Occurs
  • How to Fix the Off-Policy Issue It Brings
  • Harvesting Rollout-Training Mismatch via Quantization
  • Analyzing the Effectiveness of Different Fixes
  • Additional Analyses

Analyzing the Effectiveness of Different Fixes

  • PPO
  • Recompute
  • PPO-IS
  • Vanilla-IS
  • TIS

Comparison with TIS-Variants

GSM8K, PPO, Qwen2.5-0.5B-Instruct

Only TIS works consistently

Why Recompute fails

  • Recompute
    • The mismatch can lead to entropy collapse
      • Gradient computation vs. rollout generation

Why PPO-IS fails

  • PPO-IS
    • PPO-IS is still “biased” from the PPO gradient
    • The clip in PPO is designed for a “trust region”
      • At time step 0, $\pi_\theta = \pi_{\text{old}}$, so the PPO ratio is 1 and we don’t want to clip, but PPO-IS may clip because its ratio $\pi_\theta/\pi_{\text{rollout}}$ already deviates from 1
      • PPO-clip works differently than TIS

Why Vanilla-IS fails

  • Vanilla-IS
    • The uncapped importance ratio amplifies the gradient noise
      • Leading to unstable training

Outline

  • Why Rollout-Training Mismatch Occurs
  • How to Fix the Off-Policy Issue It Brings
  • Harvesting Rollout-Training Mismatch via Quantization
  • Analyzing the Effectiveness of Different Fixes
  • Additional Analyses

Factors Contributing to Mismatch

  • Investigation Setup
    • Model & Data:
      • DAPO-32B / Polaris-7B
      • DAPO Training Set (first 512 samples)
    • Metrics (see the sketch below):
      • Max Mismatch per response: $\max_t \big|\pi_{\text{train}}(y_t) - \pi_{\text{rollout}}(y_t)\big|$
      • Mean Mismatch per response: $\frac{1}{T}\sum_{t=1}^{T} \big|\pi_{\text{train}}(y_t) - \pi_{\text{rollout}}(y_t)\big|$
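
A small sketch of the two metrics (the absolute-difference form above is our reading of the “Max Diff” plots):

```python
import torch

def mismatch_per_response(p_train: torch.Tensor, p_rollout: torch.Tensor):
    """p_train, p_rollout: (seq_len,) sampled-token probabilities under the
    training and rollout backends; returns (max, mean) mismatch."""
    diff = (p_train - p_rollout).abs()
    return diff.max().item(), diff.mean().item()
```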

Factors Contributing to Mismatch

  • Larger Parallelism Difference, Larger Max Gap

Factors Contributing to Mismatch

  • Longer Response Length, Larger Max Gap

Factors Contributing to Mismatch

  • Altering Sampler Alone, Gap Still There

What’s beyond?

  • The gap can be amplified in MoE RL
    • Dynamic Routing
    • Specially Optimized Kernels

  • TIS is orthogonal to and compatible with existing GxPOs
    • GxPOs adjust the computation of the advantage / importance ratio, e.g., GRPO (token-level) vs. GSPO (sequence-level)
    • TIS addresses the system-level mismatch problem

Takeaways

  • Mixing an inference backend with a training backend makes RL training off-policy, even when the two share the same weights

  • Truncated Importance Sampling (TIS) is effective at mitigating the gap

  • With TIS integrated, rollout generation can be accelerated via quantization without sacrificing performance

Thanks for Listening!

November 10, 2025 – Presented at Applied Compute