1 of 18

SGLang DeepSeek MLA

Ke Bao <bockbao@gmail.com>, Yineng Zhang <me@zhyncs.com>

SGLang Team


2 of 18

Outline


1. Introduction to MLA

2. SGLang MLA Optimizations

3. How to Use MLA in SGLang

4. Future Work

3 of 18


  • 1. Introduction to MLA

4 of 18

What is MLA?


MLA (Multi-head Latent Attention)[1] is an innovative attention architecture introduced by the DeepSeek-AI team.

[1] DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (https://arxiv.org/pdf/2405.04434)

5 of 18

Computation Overview


[Figure: side-by-side computation flow of MHA vs. MLA]
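
As a rough illustration of why MLA shrinks the KV cache relative to MHA (my own back-of-the-envelope accounting using the published DeepSeek-V2 dimensions, not the slide's figure):

# Per-token, per-layer KV cache entries (illustrative; dims from DeepSeek-V2)
h, d = 128, 128           # attention heads, per-head dim
d_c, d_rope = 512, 64     # MLA compressed latent dim, decoupled RoPE key dim

mha_cache = 2 * h * d     # MHA stores full K and V for every head
mla_cache = d_c + d_rope  # MLA stores one shared latent plus one shared RoPE key

print(mha_cache, mla_cache, mha_cache / mla_cache)
# 32768 576 ~56.9 -> MLA needs roughly 1/57 of the MHA KV cache per token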

6 of 18


  • 2. SGLang MLA Optimizations

7 of 18

Weight Absorption


Change the computation order based on the associative law of matrix multiplication.

FLOPs:

In the decoding stage (n_q = 1), this reordering requires less computation.

In DeepSeek-V2,

  • d = 128
  • d_c = 512
  • h = 128

original order

absorbed order
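
As a rough stand-in for the original-order vs. absorbed-order FLOPs comparison, the snippet below models only the query-key path of one head (my own accounting with the DeepSeek-V2 dimensions above, not the slide's exact formulas); it shows why the absorbed order wins in decoding but not in prefill.

# Hypothetical FLOPs model for the QK path of one head
d, d_c = 128, 512          # head dim, latent dim (DeepSeek-V2)

def original_order(n_q, L):
    # up-project all L cached latents to keys, then score n_q queries against them
    return L * d * d_c + n_q * L * d

def absorbed_order(n_q, L):
    # fold W_UK into each query once, then score directly against the latents
    return n_q * d * d_c + n_q * L * d_c

L = 4096
print(original_order(1, L) / absorbed_order(1, L))   # decoding (n_q = 1): ~124x fewer FLOPs
print(original_order(L, L) / absorbed_order(L, L))   # prefill (n_q ~ L): absorbed order costs more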

8 of 18

Weight Absorption


[Figure: attention computation in the original order vs. with weight absorption]
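
A minimal sketch of the two orderings with torch.einsum; the shapes and names are illustrative assumptions, and RoPE and the value/output path are omitted.

import torch

h, d, d_c, L = 128, 128, 512, 512
q = torch.randn(h, d)           # one decoding query, split per head
c_kv = torch.randn(L, d_c)      # cached compressed KV latents
w_uk = torch.randn(h, d, d_c)   # per-head key up-projection

# Original order: decompress every cached latent into per-head keys, then score
k = torch.einsum("hdc,lc->lhd", w_uk, c_kv)           # [L, h, d]
scores_orig = torch.einsum("hd,lhd->hl", q, k)        # [h, L]

# Absorbed order: fold w_uk into the query, then score directly against latents
q_abs = torch.einsum("hd,hdc->hc", q, w_uk)           # [h, d_c]
scores_abs = torch.einsum("hc,lc->hl", q_abs, c_kv)   # [h, L]

assert torch.allclose(scores_orig, scores_abs, rtol=1e-3, atol=1e-2)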

9 of 18

Weight Absorption


Benefits:

  • Reduced overall computation in the decoding stage
  • Balanced computation and memory access in the decoding kernel
    • Increased the arithmetic intensity of attention
    • Reduced KV cache memory access

Related PR: #905

10 of 18

Grouped Decoding Kernel


We optimized the memory access of the Triton decoding kernel for MLA.
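
As a plain illustration of the grouping idea (my reading of the optimization, not the actual Triton kernel): after weight absorption, every query head attends over the same latent KV entries, so the kernel can load each KV tile once and reuse it for a whole group of query heads instead of reloading it per head.

import numpy as np

h, d_c, L = 128, 512, 4096
GROUP, TILE = 16, 256
q_abs = np.random.randn(h, d_c).astype(np.float32)   # absorbed queries
c_kv = np.random.randn(L, d_c).astype(np.float32)    # shared latent KV cache

scores = np.zeros((h, L), dtype=np.float32)
for g in range(0, h, GROUP):                  # one "program" per group of heads
    for t in range(0, L, TILE):
        kv_tile = c_kv[t:t + TILE]            # KV tile loaded once ...
        scores[g:g + GROUP, t:t + TILE] = q_abs[g:g + GROUP] @ kv_tile.T
        # ... and reused by all GROUP heads, cutting KV-cache memory traffic

assert np.allclose(scores, q_abs @ c_kv.T, rtol=1e-3, atol=1e-3)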

11 of 18

Quantization


Related PRs: #469, #1285, #1286

Limitations:

torch.compile does not support FP8; W8A8 is only supported on sm89+; the FP8 KV cache only supports E5M2 for now.

12 of 18

CUDA Graph & Torch Compile


Related PRs: #1401, #1469, #1442

Workaround: #1446 (FusedMoE is skipped for torch.compile)
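
A minimal sketch of the workaround pattern, i.e. compiling everything except MoE-style submodules; the module names and selection rule are hypothetical, not SGLang's actual implementation.

import torch

def compile_non_moe(model: torch.nn.Module) -> None:
    # Leave FusedMoE-style children eager; wrap everything else with torch.compile
    for name, child in model.named_children():
        if "moe" in name.lower():
            continue
        setattr(model, name, torch.compile(child))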

13 of 18

End-to-end Benchmark

14 of 18


  • 3. How to Use MLA in SGLang

15 of 18

How to Use MLA in SGLang


We recommend using the latest version (>= v0.3.1.post3).

# fp16 tp8

python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-Coder-V2-Instruct --tp 8 --trust-remote-code

# fp16 tp8 w/ torch compile

python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-Coder-V2-Instruct --tp 8 --trust-remote-code --enable-torch-compile

# fp16 tp8 w/ torch compile, max torch compile batch size 1

python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-Coder-V2-Instruct --tp 8 --trust-remote-code --enable-torch-compile --max-torch-compile-bs 1

# fp8 tp8, fp8 e5m2 kv cache

python3 -m sglang.launch_server --model neuralmagic/DeepSeek-Coder-V2-Instruct-FP8 --tp 8 --trust-remote-code --kv-cache-dtype fp8_e5m2
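
Once the server is running (it listens on port 30000 unless you pass --port), it can be queried through the OpenAI-compatible API; a minimal client example, with the prompt as a placeholder:

import openai

client = openai.OpenAI(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-Coder-V2-Instruct",
    messages=[{"role": "user", "content": "Write a quicksort in Python."}],
    temperature=0,
)
print(resp.choices[0].message.content)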

16 of 18


  • 4. Future Work

17 of 18

Future Work


  • Use TP or EP for MoE layers and DP for attention
  • torch.compile support for MoE layers
  • Separate prefill and decoding
  • Kernel optimizations (FlashInfer support)
  • E4M3 support for the FP8 KV cache

18 of 18


You are welcome to join our Slack and try SGLang!