Efficiency and Determinism in Large-Scale RL Training on the Miles Framework

RadixArk: Yusheng Su

AMD: Liz Li

Acknowledgement: RadixArk team and AMD teams

[AMD Official Use Only]

Content

  • Overall RL training
  • Miles architecture and goal
  • Scalability and determinism
  • DeepSeek-V4: Miles day-0 support



Overall RL Training



Miles Architecture and Goal

  • Architecture
    • Training engine: FSDP / Megatron-LM
    • Inference engine: SGLang

  • Goal
    • Efficiency
    • Determinism



Scalability and Determinism

  • Efficiency
    • Co-located to disaggregated (fully async)
    • Low precision (FP8, INT4)
    • RDMA weight sync
    • Multi-token prediction (MTP)
  • Determinism
    • Rollout routing replay (R3)
    • Train-inference matching (kernel-level alignment)
    • TITO – Token-In Token-Out consistency
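The determinism goals above can be made concrete with a small sketch. A minimal, hypothetical token-in token-out (TITO) check compares the per-token log-probabilities recorded by the inference engine during rollout against those recomputed by the training engine on the same trajectory; with kernel-level train-inference alignment the two should match bitwise (`atol=0.0`), and any gap localizes the first diverging token. Function and variable names here are illustrative, not Miles APIs.

```python
# Illustrative train-inference matching check (hypothetical names, not
# Miles APIs). `rollout_logprobs` would come from the inference engine,
# `recomputed_logprobs` from the training engine's forward pass over the
# same token sequence.

def check_token_in_token_out(rollout_logprobs, recomputed_logprobs, atol=0.0):
    """Return (is_match, first_mismatch_index).

    atol=0.0 demands bitwise equality, the bar for true determinism;
    a small positive atol instead tolerates kernel-level numerical drift.
    """
    assert len(rollout_logprobs) == len(recomputed_logprobs)
    for i, (a, b) in enumerate(zip(rollout_logprobs, recomputed_logprobs)):
        if abs(a - b) > atol:
            return False, i   # first token where the two engines diverge
    return True, -1           # -1 means no mismatch found
```

A nonzero `atol` can also be used to quantify, rather than forbid, train-inference mismatch when fully aligned kernels are not available.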



DeepSeek-V4: Miles Day-0 Support

  • Also supports AMD MI350 GPUs on the rollout side



Miles Support on AMD



Bring Miles to AMD, Functionally and Efficiently

  • Single-node perf optimization
    • Rollout phase: SGLang
    • Training phase: Megatron-LM & Transformer Engine (TE)
  • True-on-policy enabling
  • Multi-node scaling: async
  • Roadmap
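True-on-policy enabling, listed above, can be sketched as a loop that pushes the trainer's current weights to the rollout engine before every generation step, so samples are never produced by stale weights. The following is a toy illustration under assumed names, not the Miles implementation:

```python
# Toy sketch of a strictly on-policy RL loop: weights are pushed to the
# rollout engine before every generation step, so rollouts always come
# from the current policy. Engine objects are stand-ins, not Miles APIs.

class ToyEngine:
    def __init__(self):
        self.version = 0  # weight version as a proxy for parameters

def sync_weights(trainer, rollout):
    # In practice this would be an RDMA / IPC weight transfer.
    rollout.version = trainer.version

def train_loop(trainer, rollout, steps):
    versions_used = []
    for _ in range(steps):
        sync_weights(trainer, rollout)         # push current weights first
        versions_used.append(rollout.version)  # generate with those weights
        trainer.version += 1                   # optimizer step -> new weights
    return versions_used
```

In an async disaggregated setup the sync would overlap with training; the on-policy guarantee is that generation for step *k* always uses the step-*k* weights.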



SGLang on AMD

  • SGLang supports inference for a wide range of models on AMD through Aiter and Triton kernels.
  • High-performance expert parallelism (EP) with Mori
  • RL features: R3, True-On-Policy, etc.
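Rollout routing replay (R3), listed among the RL features, can be illustrated with a toy MoE router: record the expert indices chosen per token during rollout, then replay exactly those choices during the training forward pass, so both sides execute identical expert paths even if router scores drift numerically. All names below are hypothetical, not SGLang/Miles APIs.

```python
# Toy MoE top-k router with a replay path (hypothetical names, not
# SGLang/Miles APIs). During rollout, `route` records which experts each
# token is sent to; during training, passing `replay` forces the same
# expert assignment regardless of recomputed scores.

def route(scores, top_k, replay=None):
    """Pick top_k experts per token from `scores` (one score list per
    token). If `replay` is given, return the recorded choices instead of
    recomputing them from scores."""
    if replay is not None:
        return replay  # replay rollout-time routing decisions verbatim
    routed = []
    for token_scores in scores:
        ranked = sorted(range(len(token_scores)),
                        key=lambda e: token_scores[e], reverse=True)
        routed.append(ranked[:top_k])
    return routed
```

The point of the replay branch is that tiny numerical differences between rollout and training kernels can flip a top-k decision near a tie; replaying the recorded routing removes that source of train-inference divergence in MoE models.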



Backup

Attention Is All You Need (Vaswani et al., 2017)



Miles – AMD support gaps

Feature                                  | AMD Support | Notes
Basic GRPO RL Training                   | Yes         | Core training loop works
Megatron Backend                         | Yes         | All AMD scripts use Megatron
FSDP Backend                             | Yes         | Device-agnostic, but no AMD examples provided
Dynamic Batch Size                       | Yes         | Used in AMD scripts (--max-tokens-per-gpu 9216)
Partial Rollout / Over-Sampling          | Yes         | Device-agnostic implementation
Model Parallelism (TP/PP/SP)             | Yes         | Validated in AMD scripts
Multiple RL Algorithms (GRPO, PPO, etc.) | Yes         | Algorithm logic is device-agnostic
Miles Router                             | Yes         | HTTP-level, no GPU dependency
True On-Policy                           | No          | Requires FA3 + DeepGEMM (NVIDIA Hopper+ only)
FP8 Pipeline                             | No          | Experimental on ROCm, no AMD examples
R3 (Routing Replay)                      | No          | Not in any AMD scripts
INT4 QAT                                 | No          | CUDA-specific kernels
Speculative Decoding                     | No          | No AMD examples, NVIDIA-optimized
DeepEP (Expert Parallelism)              | No          | AMD SGLang patch disables it
Gradient Accumulation Fusion             | No          | No apex support on ROCm
Zero-Copy Weight Sync (CUDA IPC)         | No          | CUDA IPC is NVIDIA-specific

Other gaps: model coverage, optimization, CI, etc.
