On the Rollout-Training Mismatch in Modern RL Systems

Feng Yao

August 27, 2025 – Presented at TsinghuaNLP

UCSD

Efficient RL systems are rising

  • VeRL/OpenRLHF adopt hybrid engines
    • Rollout: Advanced LLM inference engines (vLLM, SGLang)
    • Training: Modern LLM training backends (FSDP, Megatron)

It also brings an issue…

  • Rollout-Training Mismatch
    • Expected: rollout and training use the same policy distribution
    • Implementation: Rollout engine (vLLM) + Training backends (FSDP) → Mismatch!
    • For the same rollout & model parameters, the two backends assign different token probabilities (measurement sketch below)
      • [Figure: per-token probability difference, Max Diff = 1.0, for DAPO Qwen2.5-32B and DS-Qwen2.5-1.5B]
    • Implicitly makes RL “Off-Policy”!
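
A minimal sketch (hypothetical inputs and naming, not the talk's code) of how this gap can be measured: take the tokens actually sampled by the rollout engine, recompute their log-probabilities with the training backend under the same weights, and compare.

```python
import torch

def max_prob_diff(rollout_logprobs: torch.Tensor, trainer_logprobs: torch.Tensor) -> float:
    """Largest per-token probability gap between the rollout engine (e.g. vLLM) and the
    training backend (e.g. FSDP), evaluated on the same sampled tokens and weights."""
    return (rollout_logprobs.exp() - trainer_logprobs.exp()).abs().max().item()

# Toy numbers: identical weights can still yield visibly different probabilities.
rollout_lp = torch.log(torch.tensor([0.92, 0.50, 0.03]))
trainer_lp = torch.log(torch.tensor([0.90, 0.47, 0.20]))
print(max_prob_diff(rollout_lp, trainer_lp))  # ~0.17
```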

But it can be fixed effectively

  • Using the classic Truncated Importance Sampling (TIS) technique
    • We show that fixing the mismatch with TIS improves training effectiveness

Harvesting the Off-Policyness via Quantization

  • Since TIS is able to handle the mismatch
    • Can we go even more “off-policy” and thus faster?

Outline

  • Why Rollout-Training Mismatch Occurs
  • How to Fix the Off-Policy Issue It Brings
  • Harvesting Rollout-Training Mismatch via Quantization
  • Analyzing the Effectiveness of Different Fixes

Why does Rollout-Training Mismatch occur?

  • Two common beliefs
    • Inaccessible true sampling probabilities
      • Adds an additional gap
    • Backend numerical differences
      • Hard to fix

Why does Rollout-Training Mismatch occur?

  • Hybrid Engine & Error Propagation
    • Different compute patterns via different backends & parallelism


How to Fix the Off-Policy Issue It Brings

  • Trial 1 – Mitigate the system-level mismatch
    • vLLM seems to be the root cause → patch vLLM to:
      • Return the actual sampling probabilities for the vLLM V1 engine
      • Improve the numerical precision by using an FP32 LM_Head (conceptual sketch below)
    • It helps, but the gap still exists
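
To illustrate the FP32 LM_Head idea, here is a conceptual PyTorch sketch (toy shapes, not vLLM internals): the final logits are computed in FP32 even though the rest of the forward pass stays in BF16, which reduces the numerical noise in the reported sampling probabilities.

```python
import torch

hidden = torch.randn(4, 1024, dtype=torch.bfloat16)       # last hidden states (toy shapes)
lm_head = torch.randn(32000, 1024, dtype=torch.bfloat16)  # LM head weight (toy shapes)

logits_bf16 = hidden @ lm_head.T                          # default path: BF16 end to end
logits_fp32 = hidden.float() @ lm_head.float().T          # patched path: FP32 LM head
logprobs = torch.log_softmax(logits_fp32, dim=-1)         # more precise sampling log-probs
```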

How to Fix the Off-Policy Issue It Brings

  • Trial 2 – Apply an algorithm-level fix
    • Be aware of the mismatch → apply an importance sampling correction
      • Recall: Vanilla Importance Sampling
      • Expected gradient
      • But currently we have a gradient under the rollout distribution
      • So we should fix the gradient with an importance-sampling ratio
      • In practice, we use Truncated Importance Sampling (TIS)
      • (equations reconstructed below)
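
The equations on these slides did not survive extraction; the following is a sketch of the standard importance-sampling argument in my own notation (π_θ: training policy, π_rollout: rollout-engine policy, R: reward, C: truncation threshold), not the slides' exact formulas.

```latex
% Expected (on-policy) policy gradient:
\mathbb{E}_{y \sim \pi_\theta}\!\left[ R(y)\, \nabla_\theta \log \pi_\theta(y) \right]
% But the samples actually come from the rollout engine:
\mathbb{E}_{y \sim \pi_{\mathrm{rollout}}}\!\left[ R(y)\, \nabla_\theta \log \pi_\theta(y) \right]
% Importance-sampling correction restores the expectation:
\mathbb{E}_{y \sim \pi_{\mathrm{rollout}}}\!\left[ \frac{\pi_\theta(y)}{\pi_{\mathrm{rollout}}(y)}\, R(y)\, \nabla_\theta \log \pi_\theta(y) \right]
% Truncated Importance Sampling (TIS) caps the ratio at a threshold C:
\mathbb{E}_{y \sim \pi_{\mathrm{rollout}}}\!\left[ \min\!\left( \frac{\pi_\theta(y)}{\pi_{\mathrm{rollout}}(y)},\, C \right) R(y)\, \nabla_\theta \log \pi_\theta(y) \right]
```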

How to Fix the Off-Policy Issue It Brings

  • Extend to General Case
    • Expected Policy Gradient (PPO)
    • VeRL/OpenRLHF’s Implementation (recompute)
    • Truncated Importance Sampling (TIS) (sketch below)
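
For the general (PPO) case, here is a minimal PyTorch sketch of how a TIS weight could be applied on top of the clipped surrogate. The function name, the cap value, and the choice to detach the weight are my assumptions, not verbatim VeRL/OpenRLHF or FlashRL code; it assumes per-token log-probs from the training backend (current and rollout-time) and from the rollout engine.

```python
import torch

def ppo_loss_with_tis(
    logp_new: torch.Tensor,      # log pi_theta from the training backend (current params)
    logp_old: torch.Tensor,      # log pi_theta_old from the training backend (rollout-time params)
    logp_rollout: torch.Tensor,  # log pi_rollout reported by the inference engine (e.g. vLLM)
    advantages: torch.Tensor,
    clip_eps: float = 0.2,
    tis_cap: float = 2.0,
):
    """PPO clipped surrogate with a Truncated Importance Sampling (TIS) weight.

    The PPO ratio is still defined against the training-backend old policy; the TIS
    weight corrects for the samples having been drawn from the rollout engine,
    whose probabilities differ from the trainer's.
    """
    # Standard PPO clipped surrogate (token level).
    ratio = torch.exp(logp_new - logp_old)
    surr = torch.minimum(
        ratio * advantages,
        torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages,
    )
    # TIS weight: min(pi_theta_old / pi_rollout, C), treated as a constant (no gradient).
    tis_weight = torch.exp(logp_old - logp_rollout).clamp(max=tis_cap).detach()
    return -(tis_weight * surr).mean()
```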

Why Not Alternative Methods?

  • Variants of TIS (written out below)
    • PPO Importance Sampling (PPO-IS)
      • A commonly asked variant
      • Can break out of the trust region
    • Vanilla Importance Sampling (Vanilla-IS)
      • The ratio can be too large and make training crash
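
My reading of how the variants differ, written out explicitly; the notation and the per-token indexing are mine, not copied from the slides.

```latex
% t indexes tokens; \pi_{\theta_{\mathrm{old}}} is the training-backend policy at rollout time,
% \pi_{\mathrm{rollout}} the inference-engine policy.
w^{\mathrm{Vanilla\text{-}IS}}_t = \frac{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}{\pi_{\mathrm{rollout}}(a_t \mid s_t)}
\qquad
w^{\mathrm{TIS}}_t = \min\!\left( \frac{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}{\pi_{\mathrm{rollout}}(a_t \mid s_t)},\, C \right)
% PPO-IS instead puts the rollout probabilities inside PPO's clipped ratio:
r^{\mathrm{PPO\text{-}IS}}_t = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\mathrm{rollout}}(a_t \mid s_t)}
```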

How well can TIS fix it?

  • DAPO 32B Setting

  • GSM8K 0.5B Setting
    • Normal RL: Max Diff is smaller (~0.4) than the 1.0 seen in the DAPO-32B setting
    • INT8 Rollout: Max Diff is larger (~1.0) than in the normal RL setting

Does TIS always help?

  • DAPO 1.5B Setting
    • In settings where the probability difference is relatively small
      • TIS does not always help, but it doesn’t hurt

Does the Mismatch really matter?

  • Unexpected training instability on challenging tasks (DAPO Qwen2.5-32B)
  • Possibly negligible on simple tasks (PPO GSM8K Qwen2.5-32B)

Community Verification


Harvesting Off-Policy in Quantization

As TIS handles the gap, can we go even further off-policy for a speedup?

Rollout generation is a bottleneck in RL training efficiency: in the DAPO-32B setting, rollout takes up ~70% of the training time (see the back-of-envelope calculation below).
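
A back-of-envelope calculation of why the rollout share matters: the ~70% figure is from the slide, while the 1.75× rollout speedup is an assumed number for illustration only.

```latex
\text{end-to-end speedup} \approx \frac{1}{0.3 + 0.7 / 1.75} = \frac{1}{0.7} \approx 1.43\times
```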

Quantization speeds up rollout but hurts performance

Naively applying quantization can accelerate rollout.

But the performance is also degraded!

This can be expected, as quantization introduces more mismatch.

FlashRL preserves performance with TIS

  • FlashRL fixes the quantization-induced mismatch with TIS
    • Implemented as a PyPI package that patches vLLM
  • DAPO 32B Setting
    • Matches the performance of BF16 rollout with TIS
    • Outperforms naive BF16 rollout (without TIS)
  • GSM8K 0.5B Setting
    • TIS works in both the INT8 and FP8 settings

More detailed analysis

  • Rollout Speedup
    • Regular RL Setting
    • Standard Inference Setting
  • End-to-End Speedup & Effectiveness
    • INT8 as a pressure test

How to perform INT8 quantization?

  • FP8 quantization can be naturally conducted in an online manner
  • INT8 quantization requires a complicated calibration process
  • Our solution: Online INT8 Quantization via Calibration Transfer
    • Calculate the calibration result once at the beginning of training and reuse it at every online step (sketch below)
    • Observation: RL changes model weights less aggressively than SFT does
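
A minimal PyTorch sketch of the calibration-transfer idea: a simple per-channel absmax scale stands in for whatever calibration recipe is actually used, and all names here are my own, not FlashRL's API.

```python
import torch

def int8_scales(weight: torch.Tensor) -> torch.Tensor:
    # Per-output-channel symmetric scales; a stand-in for the real calibration pass.
    return weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0

def quantize(weight: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    return torch.clamp(torch.round(weight / scales), -128, 127).to(torch.int8)

# Toy "policy" state dict standing in for the real model weights.
policy = {"layer0.weight": torch.randn(16, 32)}

# Step 0: run calibration once on the initial weights and cache the scales.
cached_scales = {name: int8_scales(w) for name, w in policy.items()}

# Every later RL step: reuse the cached scales when re-quantizing the (slightly)
# updated weights for the rollout engine, instead of re-calibrating from scratch.
def refresh_rollout_weights(policy_state):
    return {name: quantize(w, cached_scales[name]) for name, w in policy_state.items()}

int8_policy = refresh_rollout_weights(policy)
```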


Analyzing the Effectiveness of Different Fixes

  • PPO
  • Recompute
  • PPO-IS
  • Vanilla-IS
  • TIS

Comparison with TIS-Variants

GSM8K, PPO, Qwen2.5-0.5B-Instruct

Only TIS works consistently

Why Recompute fails

  • Recompute
    • The mismatch can lead to entropy collapse
      • Gradient computation vs. rollout generation

Why PPO-IS fails

  • PPO-IS
    • PPO-IS is still “biased” from the PPO gradient
    • The clip in PPO is designed for a “trust region”
      • At time step 0, when the training policy still equals the rollout-time policy, we don’t want to clip, but PPO-IS may clip (worked example below)
      • PPO-clip works differently than TIS
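
A worked example (the numbers are mine) of why PPO-IS can clip at the very first gradient step while standard PPO cannot.

```latex
% At the first gradient step \theta = \theta_{\mathrm{old}}, so PPO's ratio is exactly 1 and never clipped:
\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)} = 1 \in [1-\epsilon,\, 1+\epsilon]
% PPO-IS uses the rollout engine in the denominator; with e.g. \pi_\theta = 0.30,
% \pi_{\mathrm{rollout}} = 0.24, \epsilon = 0.2:
\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\mathrm{rollout}}(a_t \mid s_t)} = \frac{0.30}{0.24} = 1.25 > 1 + \epsilon
% so the token can be clipped (and its gradient dropped) before any update has happened.
```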

Why Vanilla-IS fails

  • Vanilla-IS
    • The uncapped importance ratio amplifies the gradient noise
      • Leading to unstable training

What’s beyond?

  • The gap can be amplified in MoE RL
    • Dynamic Routing
    • Specially Optimized Kernels

  • TIS is orthogonal to and compatible with existing GxPOs
    • GxPOs adjust the computation of the advantage / importance ratio
    • TIS addresses the system-level mismatch problem

Takeaways

  • Mixing an inference backend with a training backend makes RL training off-policy, even when the two share the same weights

  • Truncated Importance Sampling (TIS) is effective at mitigating the gap

  • With TIS integrated, rollout generation can be accelerated via quantization without sacrificing performance


Thanks for Listening!

August 27, 2025 – Presented at TsinghuaNLP