On the Rollout-Training Mismatch in Modern RL Systems

November 10, 2025 – Presented at Applied Compute

Feng Yao* Liyuan Liu* Dinghuai Zhang Chengyu Dong Jingbo Shang Jianfeng Gao

Efficient RL systems are rising

  • VeRL/OpenRLHF/Slime adopt hybrid engines
    • Rollout: Advanced LLM inference engines (vLLM, SGLang)
    • Training: Modern LLM training backends (FSDP, Megatron)

It also brings an issue…

  • Rollout-Training Mismatch
    • Expected: a single policy π both generates rollouts and receives gradient updates
    • Implementation: Rollout engine (vLLM) + Training backends (FSDP) → Mismatch!

It also brings an issue…

  • Rollout-Training Mismatch
    • For the same rollout & model parameters, the rollout engine and the training backend report different token probabilities

[Figure: rollout vs. training token probabilities, Max Diff = 1.0; DAPO Qwen2.5-32B and DS-Qwen2.5-1.5B]

Implicitly makes RL “Off-Policy”!

But it can be fixed effectively

  • Using the classic Truncated Importance Sampling (TIS) technique
    • We show that fixing the mismatch with TIS improves training effectiveness

Harvesting the Off-Policyness via Quantization

  • Since TIS is able to handle the mismatch
    • Can we go even more “off-policy” and thus faster?

Outline

  • Why Rollout-Training Mismatch Occurs
  • How to Fix the Off-Policy Issue It Brings
  • Harvesting Rollout-Training Mismatch via Quantization
  • Analyzing the Effectiveness of Different Fixes
  • Additional Analyses

Why does Rollout-Training Mismatch occur?

  • Two common beliefs
    • Inaccessible true sampling probabilities
      • Adds an additional gap
    • Backend numerical differences
      • Hard to fix

Why does Rollout-Training Mismatch occur?

  • Hybrid Engine & Error Propagation
    • Different compute patterns via different backends & parallelism (see the toy demo below)
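
A toy demonstration (ours, not from the deck) of how numerical precision alone creates a per-token probability gap; over a long autoregressive rollout, such per-step errors can propagate and compound:

```python
import torch

# The same logits yield slightly different probabilities under BF16 vs FP32,
# mimicking the numerical differences between inference and training backends.
logits = torch.randn(32000)                        # toy vocabulary-sized logits
p_fp32 = torch.softmax(logits.float(), dim=-1)
p_bf16 = torch.softmax(logits.bfloat16(), dim=-1).float()
print((p_fp32 - p_bf16).abs().max())               # nonzero: a small per-token gap
```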

Outline

  • Why Rollout-Training Mismatch Occurs
  • How to Fix the Off-Policy Issue It Brings
  • Harvesting Rollout-Training Mismatch via Quantization
  • Analyzing the Effectiveness of Different Fixes
  • Additional Analyses

How to Fix the Off-Policy Issue It Brings

  • Trial 1 – Mitigate the system-level mismatch
    • vLLM seems to be the root cause → patch vLLM to:
      • Return the actual sampling probabilities for the vLLM V1 engine
      • Improve numerical precision by using an FP32 LM head (see the sketch below)

It helps, but the gap still exists
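
A minimal sketch of the FP32 LM-head idea (illustrative shapes and names, not vLLM’s actual internals): evaluate the final projection and softmax in FP32, so the reported sampling probabilities suffer less low-precision rounding:

```python
import torch

def fp32_token_logprobs(hidden: torch.Tensor, lm_head: torch.Tensor,
                        token_ids: torch.Tensor) -> torch.Tensor:
    """Log-probs of the sampled tokens with the LM head evaluated in FP32.

    hidden: (seq, d_model) final hidden states (e.g., BF16);
    lm_head: (vocab, d_model) projection weights; token_ids: (seq,) int64.
    """
    logits = hidden.float() @ lm_head.float().T    # FP32 matmul
    logprobs = torch.log_softmax(logits, dim=-1)   # FP32 softmax
    return logprobs.gather(-1, token_ids[:, None]).squeeze(-1)
```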

How to Fix the Off-Policy Issue It Brings

  • Trial 2 – Apply an algorithm-level fix
    • Be aware of the mismatch → importance sampling correction:
      • Recall: vanilla importance sampling
      • Expected gradient: $\nabla_\theta J = \mathbb{E}_{y \sim \pi_\theta}\big[A(y)\,\nabla_\theta \log \pi_\theta(y)\big]$
      • But currently we have $y \sim \pi_{\text{rollout}}$ while gradients are computed under $\pi_\theta = \pi_{\text{train}}$, so the estimate is biased
      • So we should fix the gradient as: $\mathbb{E}_{y \sim \pi_{\text{rollout}}}\big[\tfrac{\pi_{\text{train}}(y)}{\pi_{\text{rollout}}(y)}\,A(y)\,\nabla_\theta \log \pi_{\text{train}}(y)\big]$
      • In practice, we use Truncated Importance Sampling (TIS): $\mathbb{E}_{y \sim \pi_{\text{rollout}}}\big[\min\!\big(\tfrac{\pi_{\text{train}}(y)}{\pi_{\text{rollout}}(y)},\,C\big)\,A(y)\,\nabla_\theta \log \pi_{\text{train}}(y)\big]$ (see the sketch below)
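
A minimal PyTorch-style sketch of the TIS estimator above (assumed tensor names; the cap value C is a hyperparameter, not a prescribed default):

```python
import torch

def tis_pg_loss(logp_train: torch.Tensor, logp_rollout: torch.Tensor,
                advantages: torch.Tensor, cap: float = 2.0) -> torch.Tensor:
    """Policy-gradient loss weighted by min(pi_train / pi_rollout, C).

    logp_train: per-token log-probs from the training backend (requires grad);
    logp_rollout: per-token log-probs reported by the rollout engine.
    """
    # Truncated importance ratio, treated as a constant weight (no gradient).
    ratio = torch.exp(logp_train.detach() - logp_rollout)
    weight = torch.clamp(ratio, max=cap)
    # Gradient is weight * A * d/dtheta log pi_train, matching the TIS estimator.
    return -(weight * advantages * logp_train).mean()
```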

How to Fix the Off-Policy Issue It Brings

  • Extend to General Case
    • Expected Policy Gradient (PPO): samples come from $\pi_{\text{old}}$ and the clipped surrogate uses the ratio $r(y_t) = \pi_\theta(y_t)/\pi_{\text{old}}(y_t)$
    • VeRL/OpenRLHF’s Implementation (recompute): $\pi_{\text{old}}$ is recomputed with the training backend, while the samples actually come from $\pi_{\text{rollout}}$
    • Truncated Importance Sampling (TIS): additionally weight the surrogate by $\min\!\big(\pi_{\text{old}}(y_t)/\pi_{\text{rollout}}(y_t),\,C\big)$ (see the sketch below)
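
A sketch of how the TIS weight composes with the clipped PPO surrogate in the recompute setting (our reading of the slide; tensor conventions are assumptions):

```python
import torch

def ppo_tis_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
                 logp_rollout: torch.Tensor, advantages: torch.Tensor,
                 clip_eps: float = 0.2, cap: float = 2.0) -> torch.Tensor:
    """Clipped PPO surrogate with a truncated importance weight on top.

    logp_new: current policy, training backend (requires grad);
    logp_old: behavior policy recomputed by the training backend;
    logp_rollout: log-probs actually used by the rollout engine.
    """
    r = torch.exp(logp_new - logp_old)             # standard PPO ratio
    surrogate = torch.minimum(
        r * advantages,
        torch.clamp(r, 1 - clip_eps, 1 + clip_eps) * advantages)
    # TIS factor min(pi_old / pi_rollout, C): gradient-free, corrects the
    # rollout-training mismatch without touching PPO's trust region.
    tis = torch.clamp(torch.exp(logp_old - logp_rollout), max=cap).detach()
    return -(tis * surrogate).mean()
```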

Why Not Alternative Methods?

  • Variants of TIS
    • PPO Importance Sampling (PPO-IS): a commonly asked variant; it can break out of the trust region
    • Vanilla Importance Sampling (Vanilla-IS): the uncapped ratio can be too large and make training crash (see the comparison sketch below)
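
A simplified comparison of the per-token weights each variant induces (our illustration; real PPO-IS plugs the ratio into PPO's clipped surrogate rather than clamping a weight, but the two-sided clip below shows why it can trigger even at step 0, when the backend mismatch alone pushes the ratio away from 1):

```python
import torch

def variant_weights(logp_train: torch.Tensor, logp_rollout: torch.Tensor,
                    cap: float = 2.0, clip_eps: float = 0.2):
    r = torch.exp(logp_train - logp_rollout)
    vanilla_is = r                                             # unbounded: can blow up
    tis = torch.clamp(r, max=cap)                              # bounded above only
    ppo_is_like = torch.clamp(r, 1 - clip_eps, 1 + clip_eps)   # clips in both directions
    return vanilla_is, tis, ppo_is_like
```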

How well can TIS fix it?

  • DAPO 32B Setting
  • GSM8K 0.5B Setting
    • Normal RL: Max Diff is smaller (~0.4) than the 1.0 seen in the DAPO-32B setting
    • INT8 Rollout: Max Diff is larger (~1.0) than in the normal RL setting

Does TIS always help?

  • DAPO 1.5B Setting
    • In settings where the prob diff is relatively small
      • TIS does not always help, but it doesn’t hurt

Does the Mismatch really matter?

  • Unexpected training instability on challenging tasks (DAPO Qwen2.5-32B)
  • Possibly negligible on simple tasks (PPO GSM8K Qwen2.5-32B)

Community Verification

Outline

  • Why Rollout-Training Mismatch Occurs
  • How to Fix the Off-Policy Issue It Brings
  • Harvesting Rollout-Training Mismatch via Quantization
  • Analyzing the Effectiveness of Different Fixes
  • Additional Analyses

Harvesting Off-Policy in Quantization

As TIS handles the gap, can we go even further off-policy for speedup?

Rollout generation is a bottleneck in RL training efficiency: in the DAPO-32B setting, rollout takes up ~70% of the training time.

Quantization speeds up rollout but hurts performance

Naively applying quantization can accelerate rollout speed, but performance is also degraded!

This is expected, as quantization introduces more mismatch.

FlashRL preserves performance with TIS

Quantization introduces more mismatch; FlashRL fixes it with TIS.

FlashRL is implemented as a PyPI package that patches vLLM.

FlashRL preserves performance with TIS

  • DAPO 32B Setting
    • Matches the performance of BF16 rollout with TIS
    • Outperforms naive BF16 rollout (without TIS)

  • GSM8K 0.5B Setting
    • TIS works in both the INT8 and FP8 settings

More detailed analysis

  • Rollout Speedup
    • Regular RL Setting
    • Standard Inference Setting

More detailed analysis

  • End-to-End Speedup & Effectiveness
    • INT8 as a pressure test

How to perform INT8 quantization?

  • FP8 quantization can be naturally conducted in an online manner
  • INT8 quantization requires a complicated calibration process
  • Our solution: Online INT8 Quantization via Calibration Transfer
    • Calculate the calibration result once at the beginning of training and reuse it at every online step (see the toy sketch below)
    • Observation: RL changes model weights less aggressively compared to SFT
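
A toy sketch of calibration transfer (weight-only and simplified; real INT8 calibration also involves activation statistics): compute per-channel scales once at step 0, then reuse them when re-quantizing the slightly-updated weights at every later step:

```python
import torch

def calibrate_scales(weight: torch.Tensor) -> torch.Tensor:
    # Per-output-channel symmetric INT8 scales from the initial weights.
    return weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0

def quantize_int8(weight: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    return torch.clamp((weight / scales).round(), -128, 127).to(torch.int8)

w0 = torch.randn(1024, 1024)             # step-0 policy weights
scales = calibrate_scales(w0)            # calibrate once

w_t = w0 + 1e-3 * torch.randn_like(w0)   # RL updates move weights only slightly,
q_t = quantize_int8(w_t, scales)         # so the step-0 scales stay valid
```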

Outline

  • Why Rollout-Training Mismatch Occurs
  • How to Fix the Off-Policy Issue It Brings
  • Harvesting Rollout-Training Mismatch via Quantization
  • Analyzing the Effectiveness of Different Fixes
  • Additional Analyses

Analyzing the Effectiveness of Different Fixes

  • PPO
  • Recompute
  • PPO-IS
  • Vanilla-IS
  • TIS

Comparison with TIS-Variants

GSM8K, PPO, Qwen2.5-0.5B-Instruct

Only TIS works consistently

Why Recompute fails

  • Recompute
    • The mismatch can lead to entropy collapse
      • Gradient computation vs. rollout generation

Why PPO-IS fails

  • PPO-IS
    • PPO-IS is still “biased” from the PPO gradient
    • The clip in PPO is designed for a “trust region”
      • At time step 0, $\pi_\theta = \pi_{\text{old}}$, so the PPO ratio is 1 and we don’t want to clip, but PPO-IS may clip because its ratio $\pi_\theta/\pi_{\text{rollout}}$ already deviates from 1
      • PPO-clip works differently than TIS

Why Vanilla-IS fails

  • Vanilla-IS
    • The uncapped importance ratio amplifies the gradient noise
      • Leading to unstable training

Outline

  • Why Rollout-Training Mismatch Occurs
  • How to Fix the Off-Policy Issue It Brings
  • Harvesting Rollout-Training Mismatch via Quantization
  • Analyzing the Effectiveness of Different Fixes
  • Additional Analyses

Factors Contributing to Mismatch

  • Investigation Setup
    • Model & Data:
      • DAPO-32B / Polaris-7B
      • DAPO Training Set (first 512 samples)
    • Metrics (see the sketch below):
      • Max Mismatch per response: $\max_t \big|\pi_{\text{train}}(y_t) - \pi_{\text{rollout}}(y_t)\big|$
      • Mean Mismatch per response: $\frac{1}{T}\sum_{t=1}^{T} \big|\pi_{\text{train}}(y_t) - \pi_{\text{rollout}}(y_t)\big|$
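
A small sketch of the two metrics (the absolute-difference form above is our reading of the “Max Diff” plots):

```python
import torch

def mismatch_per_response(p_train: torch.Tensor, p_rollout: torch.Tensor):
    """p_train, p_rollout: (seq_len,) sampled-token probabilities under the
    training and rollout backends; returns (max, mean) mismatch."""
    diff = (p_train - p_rollout).abs()
    return diff.max().item(), diff.mean().item()
```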

Factors Contributing to Mismatch

  • Larger Parallelism Difference, Larger Max Gap

Factors Contributing to Mismatch

  • Longer Response Length, Larger Max Gap

Factors Contributing to Mismatch

  • Altering Sampler Alone, Gap Still There

What’s beyond?

  • The gap can be amplified in MoE RL
    • Dynamic Routing
    • Specially Optimized Kernels

  • TIS is orthogonal to and compatible with existing GxPOs
    • GxPOs adjust the computation of the advantage / importance ratio, e.g., GRPO (token-level) vs. GSPO (sequence-level)
    • TIS addresses the system-level mismatch problem

Takeaways

  • Mixing an inference backend with a training backend makes RL training off-policy, even when the two share the same weights

  • Truncated Importance Sampling (TIS) is effective at mitigating the gap

  • With TIS integrated, rollout generation can be accelerated via quantization without sacrificing performance

Thanks for Listening!

November 10, 2025 – Presented at Applied Compute