On the Rollout-Training Mismatch in Modern RL Systems

Feng Yao

August 27, 2025 – Presented at TsinghuaNLP

UCSD

Efficient RL systems are rising

  • VeRL/OpenRLHF adopt hybrid engines
    • Rollout: Advanced LLM inference engines (vLLM, SGLang)
    • Training: Modern LLM training backends (FSDP, Megatron)

It also brings an issue…

  • Rollout-Training Mismatch
    • Expected: rollout and training use the same policy distribution
    • Implementation: Rollout engine (vLLM) + Training backends (FSDP) → Mismatch!
    • For the same rollout & model parameters, the two backends assign different token probabilities (measurement sketch below)
      • [Figure: per-token probability difference, Max Diff = 1.0, for DAPO Qwen2.5-32B and DS-Qwen2.5-1.5B]
    • Implicitly makes RL “Off-Policy”!
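
A minimal sketch (hypothetical inputs and naming, not the talk's code) of how this gap can be measured: take the tokens actually sampled by the rollout engine, recompute their log-probabilities with the training backend under the same weights, and compare.

```python
import torch

def max_prob_diff(rollout_logprobs: torch.Tensor, trainer_logprobs: torch.Tensor) -> float:
    """Largest per-token probability gap between the rollout engine (e.g. vLLM) and the
    training backend (e.g. FSDP), evaluated on the same sampled tokens and weights."""
    return (rollout_logprobs.exp() - trainer_logprobs.exp()).abs().max().item()

# Toy numbers: identical weights can still yield visibly different probabilities.
rollout_lp = torch.log(torch.tensor([0.92, 0.50, 0.03]))
trainer_lp = torch.log(torch.tensor([0.90, 0.47, 0.20]))
print(max_prob_diff(rollout_lp, trainer_lp))  # ~0.17
```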

But it can be fixed effectively

  • Using the classic Truncated Importance Sampling (TIS) technique
    • We show that fixing the mismatch with TIS improves training effectiveness

Harvesting the Off-Policyness via Quantization

  • Since TIS is able to handle the mismatch
    • Can we go even more “off-policy” and thus faster?

Outline

  • Why Rollout-Training Mismatch Occurs
  • How to Fix the Off-Policy Issue It Brings
  • Harvesting Rollout-Training Mismatch via Quantization
  • Analyzing the Effectiveness of Different Fixes

Why does Rollout-Training Mismatch occur?

  • Two common beliefs
    • Inaccessible true sampling probabilities
      • Adds an additional gap
    • Backend numerical differences
      • Hard to fix

Why does Rollout-Training Mismatch occur?

  • Hybrid Engine & Error Propagation
    • Different compute patterns via different backends & parallelism


How to Fix the Off-Policy Issue It Brings

  • Trial 1 – Mitigate the system-level mismatch
    • vLLM seems to be the root cause → patch vLLM to:
      • Return the actual sampling probabilities for the vLLM V1 engine
      • Improve the numerical precision by using an FP32 LM_Head (conceptual sketch below)
    • It helps, but the gap still exists
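
To illustrate the FP32 LM_Head idea, here is a conceptual PyTorch sketch (toy shapes, not vLLM internals): the final logits are computed in FP32 even though the rest of the forward pass stays in BF16, which reduces the numerical noise in the reported sampling probabilities.

```python
import torch

hidden = torch.randn(4, 1024, dtype=torch.bfloat16)       # last hidden states (toy shapes)
lm_head = torch.randn(32000, 1024, dtype=torch.bfloat16)  # LM head weight (toy shapes)

logits_bf16 = hidden @ lm_head.T                          # default path: BF16 end to end
logits_fp32 = hidden.float() @ lm_head.float().T          # patched path: FP32 LM head
logprobs = torch.log_softmax(logits_fp32, dim=-1)         # more precise sampling log-probs
```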

How to Fix the Off-Policy Issue It Brings

  • Trial 2 – Apply an algorithm-level fix
    • Be aware of the mismatch → apply an importance sampling correction
      • Recall: Vanilla Importance Sampling
      • Expected gradient
      • But currently we have a gradient under the rollout distribution
      • So we should fix the gradient with an importance-sampling ratio
      • In practice, we use Truncated Importance Sampling (TIS)
      • (equations reconstructed below)
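
The equations on these slides did not survive extraction; the following is a sketch of the standard importance-sampling argument in my own notation (π_θ: training policy, π_rollout: rollout-engine policy, R: reward, C: truncation threshold), not the slides' exact formulas.

```latex
% Expected (on-policy) policy gradient:
\mathbb{E}_{y \sim \pi_\theta}\!\left[ R(y)\, \nabla_\theta \log \pi_\theta(y) \right]
% But the samples actually come from the rollout engine:
\mathbb{E}_{y \sim \pi_{\mathrm{rollout}}}\!\left[ R(y)\, \nabla_\theta \log \pi_\theta(y) \right]
% Importance-sampling correction restores the expectation:
\mathbb{E}_{y \sim \pi_{\mathrm{rollout}}}\!\left[ \frac{\pi_\theta(y)}{\pi_{\mathrm{rollout}}(y)}\, R(y)\, \nabla_\theta \log \pi_\theta(y) \right]
% Truncated Importance Sampling (TIS) caps the ratio at a threshold C:
\mathbb{E}_{y \sim \pi_{\mathrm{rollout}}}\!\left[ \min\!\left( \frac{\pi_\theta(y)}{\pi_{\mathrm{rollout}}(y)},\, C \right) R(y)\, \nabla_\theta \log \pi_\theta(y) \right]
```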

How to Fix the Off-Policy Issue It Brings

  • Extend to General Case
    • Expected Policy Gradient (PPO)
    • VeRL/OpenRLHF’s Implementation (recompute)
    • Truncated Importance Sampling (TIS) (sketch below)
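
For the general (PPO) case, here is a minimal PyTorch sketch of how a TIS weight could be applied on top of the clipped surrogate. The function name, the cap value, and the choice to detach the weight are my assumptions, not verbatim VeRL/OpenRLHF or FlashRL code; it assumes per-token log-probs from the training backend (current and rollout-time) and from the rollout engine.

```python
import torch

def ppo_loss_with_tis(
    logp_new: torch.Tensor,      # log pi_theta from the training backend (current params)
    logp_old: torch.Tensor,      # log pi_theta_old from the training backend (rollout-time params)
    logp_rollout: torch.Tensor,  # log pi_rollout reported by the inference engine (e.g. vLLM)
    advantages: torch.Tensor,
    clip_eps: float = 0.2,
    tis_cap: float = 2.0,
):
    """PPO clipped surrogate with a Truncated Importance Sampling (TIS) weight.

    The PPO ratio is still defined against the training-backend old policy; the TIS
    weight corrects for the samples having been drawn from the rollout engine,
    whose probabilities differ from the trainer's.
    """
    # Standard PPO clipped surrogate (token level).
    ratio = torch.exp(logp_new - logp_old)
    surr = torch.minimum(
        ratio * advantages,
        torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages,
    )
    # TIS weight: min(pi_theta_old / pi_rollout, C), treated as a constant (no gradient).
    tis_weight = torch.exp(logp_old - logp_rollout).clamp(max=tis_cap).detach()
    return -(tis_weight * surr).mean()
```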

Why Not Alternative Methods?

  • Variants of TIS (written out below)
    • PPO Importance Sampling (PPO-IS)
      • A commonly asked variant
      • Can break out of the trust region
    • Vanilla Importance Sampling (Vanilla-IS)
      • The ratio can be too large and make training crash
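
My reading of how the variants differ, written out explicitly; the notation and the per-token indexing are mine, not copied from the slides.

```latex
% t indexes tokens; \pi_{\theta_{\mathrm{old}}} is the training-backend policy at rollout time,
% \pi_{\mathrm{rollout}} the inference-engine policy.
w^{\mathrm{Vanilla\text{-}IS}}_t = \frac{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}{\pi_{\mathrm{rollout}}(a_t \mid s_t)}
\qquad
w^{\mathrm{TIS}}_t = \min\!\left( \frac{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}{\pi_{\mathrm{rollout}}(a_t \mid s_t)},\, C \right)
% PPO-IS instead puts the rollout probabilities inside PPO's clipped ratio:
r^{\mathrm{PPO\text{-}IS}}_t = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\mathrm{rollout}}(a_t \mid s_t)}
```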

How well can TIS fix it?

  • DAPO 32B Setting

  • GSM8K 0.5B Setting
    • Normal RL: Max Diff is smaller (~0.4) than the 1.0 seen in the DAPO-32B setting
    • INT8 Rollout: Max Diff is larger (~1.0) than in the normal RL setting

Does TIS always help?

  • DAPO 1.5B Setting
    • In settings where the probability difference is relatively small
      • TIS does not always help, but it doesn’t hurt

Does the Mismatch really matter?

  • Unexpected training instability on challenging tasks (DAPO Qwen2.5-32B)
  • Possibly negligible on simple tasks (PPO GSM8K Qwen2.5-32B)

Community Verification


Harvesting Off-Policy in Quantization

As TIS handles the gap, can we go even further off-policy for a speedup?

Rollout generation is a bottleneck in RL training efficiency: in the DAPO-32B setting, rollout takes up ~70% of the training time (see the back-of-envelope calculation below).
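
A back-of-envelope calculation of why the rollout share matters: the ~70% figure is from the slide, while the 1.75× rollout speedup is an assumed number for illustration only.

```latex
\text{end-to-end speedup} \approx \frac{1}{0.3 + 0.7 / 1.75} = \frac{1}{0.7} \approx 1.43\times
```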

Quantization speeds up rollout but hurts performance

Naively applying quantization can accelerate rollout.

But the performance is also degraded!

This can be expected, as quantization introduces more mismatch.

FlashRL preserves performance with TIS

  • FlashRL fixes the quantization-induced mismatch with TIS
    • Implemented as a PyPI package that patches vLLM
  • DAPO 32B Setting
    • Matches the performance of BF16 rollout with TIS
    • Outperforms naive BF16 rollout (without TIS)
  • GSM8K 0.5B Setting
    • TIS works in both the INT8 and FP8 settings

More detailed analysis

  • Rollout Speedup
    • Regular RL Setting
    • Standard Inference Setting
  • End-to-End Speedup & Effectiveness
    • INT8 as a pressure test

How to perform INT8 quantization?

  • FP8 quantization can be naturally conducted in an online manner
  • INT8 quantization requires a complicated calibration process
  • Our solution: Online INT8 Quantization via Calibration Transfer
    • Calculate the calibration result once at the beginning of training and reuse it at every online step (sketch below)
    • Observation: RL changes model weights less aggressively than SFT does
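
A minimal PyTorch sketch of the calibration-transfer idea: a simple per-channel absmax scale stands in for whatever calibration recipe is actually used, and all names here are my own, not FlashRL's API.

```python
import torch

def int8_scales(weight: torch.Tensor) -> torch.Tensor:
    # Per-output-channel symmetric scales; a stand-in for the real calibration pass.
    return weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0

def quantize(weight: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    return torch.clamp(torch.round(weight / scales), -128, 127).to(torch.int8)

# Toy "policy" state dict standing in for the real model weights.
policy = {"layer0.weight": torch.randn(16, 32)}

# Step 0: run calibration once on the initial weights and cache the scales.
cached_scales = {name: int8_scales(w) for name, w in policy.items()}

# Every later RL step: reuse the cached scales when re-quantizing the (slightly)
# updated weights for the rollout engine, instead of re-calibrating from scratch.
def refresh_rollout_weights(policy_state):
    return {name: quantize(w, cached_scales[name]) for name, w in policy_state.items()}

int8_policy = refresh_rollout_weights(policy)
```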


Analyzing the Effectiveness of Different Fixes

  • PPO
  • Recompute
  • PPO-IS
  • Vanilla-IS
  • TIS

Comparison with TIS-Variants

GSM8K, PPO, Qwen2.5-0.5B-Instruct

Only TIS works consistently

Why Recompute fails

  • Recompute
    • The mismatch can lead to entropy collapse
      • Gradient computation vs. rollout generation

Why PPO-IS fails

  • PPO-IS
    • PPO-IS is still “biased” from the PPO gradient
    • The clip in PPO is designed for a “trust region”
      • At time step 0, when the training policy still equals the rollout-time policy, we don’t want to clip, but PPO-IS may clip (worked example below)
      • PPO-clip works differently than TIS
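
A worked example (the numbers are mine) of why PPO-IS can clip at the very first gradient step while standard PPO cannot.

```latex
% At the first gradient step \theta = \theta_{\mathrm{old}}, so PPO's ratio is exactly 1 and never clipped:
\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)} = 1 \in [1-\epsilon,\, 1+\epsilon]
% PPO-IS uses the rollout engine in the denominator; with e.g. \pi_\theta = 0.30,
% \pi_{\mathrm{rollout}} = 0.24, \epsilon = 0.2:
\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\mathrm{rollout}}(a_t \mid s_t)} = \frac{0.30}{0.24} = 1.25 > 1 + \epsilon
% so the token can be clipped (and its gradient dropped) before any update has happened.
```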

Why Vanilla-IS fails

  • Vanilla-IS
    • The uncapped importance ratio amplifies the gradient noise
      • Leading to unstable training

What’s beyond?

  • The gap can be amplified in MoE RL
    • Dynamic Routing
    • Specially Optimized Kernels

  • TIS is orthogonal to and compatible with existing GxPOs
    • GxPOs adjust the computation of the advantage / importance ratio
    • TIS addresses the system-level mismatch problem

Takeaways

  • Mixing an inference backend with a training backend makes RL training off-policy, even when the two share the same weights

  • Truncated Importance Sampling (TIS) is effective at mitigating the gap

  • With TIS integrated, rollout generation can be accelerated via quantization without sacrificing performance


Thanks for Listening!

August 27, 2025 – Presented at TsinghuaNLP