Efficiency and Determinism in Large-Scale RL Training on the Miles Framework
RadixArk: Yusheng Su
AMD: Liz Li
Acknowledgements: the RadixArk and AMD teams
Contents
Overall RL Training
Miles Architecture and Goal
Scalability and Determinism
DeepSeek-V4 - Miles Day-0 Support
Miles support on AMD
Bring MILES to AMD - Functionally and Efficiently
- Single-node performance optimization
- Rollout phase: SGLang (rollout/train loop sketched after this list)
- Training phase: Megatron-LM & Transformer Engine (TE)
- True-on-policy enabling
- Multi-node scaling – async
- Roadmap
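To make the agenda concrete, below is a minimal Python sketch of the rollout/train alternation the first four items describe. Every name in it (Rollout, generate_rollouts, train_step, sync_weights) is a hypothetical stand-in rather than Miles's actual API: in Miles the rollout phase calls an SGLang server and the training phase runs a Megatron-LM + TE update.

```python
# Hypothetical skeleton of one RL iteration: rollout -> train -> weight sync.
# None of these names are Miles APIs; they stand in for the real components
# (SGLang server for rollouts, Megatron-LM + TE for the policy update).
from dataclasses import dataclass, field
from typing import List


@dataclass
class Rollout:
    prompt: str
    response: str
    reward: float
    logprobs: List[float] = field(default_factory=list)  # per-token logprobs


def generate_rollouts(prompts: List[str]) -> List[Rollout]:
    """Rollout phase: in Miles this would query an SGLang server."""
    return [Rollout(p, "<response>", 0.0) for p in prompts]


def train_step(batch: List[Rollout]) -> float:
    """Training phase: in Miles this is a Megatron-LM + TE policy update."""
    return 0.0  # stub loss


def sync_weights() -> None:
    """Push updated trainer weights back to the inference engine."""


def rl_loop(prompts: List[str], num_iters: int) -> None:
    for it in range(num_iters):
        batch = generate_rollouts(prompts)  # 1. rollout (SGLang)
        loss = train_step(batch)            # 2. policy update (Megatron-LM)
        sync_weights()                      # 3. weight sync before next rollout
        print(f"iter {it}: loss={loss:.4f}")


if __name__ == "__main__":
    rl_loop(["example prompt"], num_iters=2)
```

The weight sync in step 3 is where the remaining items bite: true on-policy additionally requires that the synced weights yield matching logprobs in both engines, and the async multi-node item presumably overlaps rollout and training across nodes rather than running them strictly in sequence.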
SGLang on AMD
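As a concrete starting point, here is a minimal example of querying SGLang's native /generate endpoint from Python. It assumes a ROCm build of SGLang is already serving a model; the launch command in the comment follows upstream SGLang, and exact flags or container images may differ across ROCm releases.

```python
# Minimal client for an SGLang server, assuming one is already running on
# an AMD GPU, e.g. (upstream SGLang syntax; ROCm images may differ):
#   python3 -m sglang.launch_server --model-path <model> --tp 8 --port 30000
import requests

resp = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {"max_new_tokens": 16, "temperature": 0.0},
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["text"])  # generated completion
```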
Backup
Miles – AMD support gaps
| Feature | AMD Support | Notes |
|---|---|---|
| Basic GRPO RL Training | Yes | Core training loop works |
| Megatron Backend | Yes | All AMD scripts use Megatron |
| FSDP Backend | Yes | Device-agnostic, but no AMD examples provided |
| Dynamic Batch Size | Yes | Used in AMD scripts (--max-tokens-per-gpu 9216) |
| Partial Rollout / Over-Sampling | Yes | Device-agnostic implementation |
| Model Parallelism (TP/PP/SP) | Yes | Validated in AMD scripts |
| Multiple RL Algorithms (GRPO, PPO, etc.) | Yes | Algorithm logic is device-agnostic |
| Miles Router | Yes | HTTP-level, no GPU dependency |
| True On-Policy | No | Requires FA3 + DeepGEMM (NVIDIA Hopper+ only); see the drift sketch after this table |
| FP8 Pipeline | No | Experimental on ROCm, no AMD examples |
| R3 (Routing Replay) | No | Not in any AMD scripts |
| INT4 QAT | No | CUDA-specific kernels |
| Speculative Decoding | No | No AMD examples, NVIDIA-optimized |
| DeepEP (Expert Parallelism) | No | AMD SGLang patch disables it |
| Gradient Accumulation Fusion | No | No apex support on ROCm |
| Zero-Copy Weight Sync (CUDA IPC) | No | CUDA IPC is NVIDIA-specific |
Remaining gaps: model coverage, optimization, CI, etc.
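To show why the missing true-on-policy path matters, the sketch below measures the logprob drift between the rollout engine and the trainer for the same tokens. With bitwise-identical kernels on both sides (the FA3 + DeepGEMM route in the table), the drift is exactly zero and every importance ratio is exactly 1.0; otherwise the policy gradient is computed slightly off-policy. The function and numbers are illustrative, not Miles diagnostics.

```python
# Illustrative check of rollout/trainer logprob mismatch, the drift that
# "true on-policy" support is meant to eliminate. If SGLang (rollout) and
# Megatron-LM (trainer) used bitwise-identical kernels, max_abs_diff would
# be exactly 0.0. Names and values here are hypothetical, not Miles code.
import math


def logprob_drift(rollout_logprobs: "list[float]",
                  trainer_logprobs: "list[float]") -> "tuple[float, float]":
    """Return (max abs logprob diff, mean importance ratio) over tokens."""
    diffs = [abs(a - b) for a, b in zip(rollout_logprobs, trainer_logprobs)]
    ratios = [math.exp(b - a) for a, b in zip(rollout_logprobs, trainer_logprobs)]
    return max(diffs), sum(ratios) / len(ratios)


# Example: a tiny numeric mismatch between engines still biases the
# policy-gradient importance ratio away from exactly 1.0.
max_diff, mean_ratio = logprob_drift([-1.20, -0.35], [-1.201, -0.349])
print(f"max |dlogprob| = {max_diff:.4f}, mean ratio = {mean_ratio:.6f}")
```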