SGLang DeepSeek MLA
Ke Bao <bockbao@gmail.com>, Yineng Zhang <me@zhyncs.com>
SGLang Team
Outline
1. Introduction to MLA
2. SGLang MLA Optimizations
3. How to Use MLA in SGLang
4. Future Work
What is MLA?
MLA (Multi-head Latent Attention) [1] is an innovative attention architecture introduced by the DeepSeek-AI team.
[1] DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (https://arxiv.org/pdf/2405.04434)
Computation Overview
(Diagrams: attention computation in MHA vs. MLA)
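To make the comparison concrete, here is a back-of-the-envelope sketch of the per-token KV cache size, using DeepSeek-V2-like dimensions from the paper (128 heads, head dim 128, KV latent rank 512, decoupled RoPE dim 64); the numbers are illustrative only.

# Per-token KV cache size (in elements per layer); illustrative only.
n_heads, head_dim = 128, 128       # attention heads and per-head dim
kv_lora_rank, rope_dim = 512, 64   # MLA latent rank and decoupled RoPE dim

mha_per_token = 2 * n_heads * head_dim   # MHA: full K and V for every head
mla_per_token = kv_lora_rank + rope_dim  # MLA: one shared latent + RoPE key part

print(f"MHA cache: {mha_per_token} elements/token/layer")  # 32768
print(f"MLA cache: {mla_per_token} elements/token/layer")  # 576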
Weight Absorption
Change the computation order based on the associative law of matrix multiplication.
① Original order: first expand the cached latent into full keys/values, then compute attention.
② Absorbed order: first absorb the up-projection weight into the query (and output) projection, then compute attention directly against the latent cache.
FLOPs: in the decoding stage (n_q = 1), order ② takes less computation; the same holds with DeepSeek-V2's dimensions when comparing the original order against the absorbed order (see the sketch below).
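A minimal sketch of the two orders with toy single-head shapes (one decode token, latent dim 512, head dim 128); the tensor names are illustrative and do not mirror SGLang's actual implementation.

import torch

n_q, n_kv = 1, 4096          # decoding: one query token, many cached tokens
d_c, d_h = 512, 128          # latent dim and per-head dim (DeepSeek-V2-like)

q = torch.randn(n_q, d_h)        # query for one head
c_kv = torch.randn(n_kv, d_c)    # compressed latent KV cache
w_uk = torch.randn(d_c, d_h)     # up-projection from latent space to key space

# ① original order: expand the latent cache into full keys, then score.
#    FLOPs ~ n_kv*d_c*d_h (expansion) + n_q*n_kv*d_h (scores)
k = c_kv @ w_uk
scores_original = q @ k.T

# ② absorbed order: fold w_uk into the query, then score against the latent directly.
#    FLOPs ~ n_q*d_h*d_c (absorption) + n_q*n_kv*d_c (scores)
q_absorbed = q @ w_uk.T
scores_absorbed = q_absorbed @ c_kv.T

# Identical result; with n_q = 1 the absorbed order skips the O(n_kv*d_c*d_h) expansion.
assert torch.allclose(scores_original, scores_absorbed, rtol=1e-3, atol=1e-3)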
Weight Absorption
(Diagram: computation flow of the original order vs. w/ Weight Absorption)
Weight Absorption
Benefits: much less computation in the memory-bound decoding stage.
Related PR: #905
Grouped Decoding Kernel
We optimized the memory access pattern of the Triton decoding kernel for MLA.
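A simplified PyTorch illustration of the grouping idea (the real kernel is written in Triton; the shapes and names here are only illustrative): after weight absorption, all query heads attend to the same compressed latent cache, so handling a group of heads per kernel block lets one load of the latent be reused by the whole group.

import torch

n_h, d_c, n_kv, group = 128, 512, 4096, 8   # heads, latent dim, cached tokens, heads per group

q = torch.randn(n_h, d_c)        # absorbed queries for one decode token
c_kv = torch.randn(n_kv, d_c)    # latent KV cache shared by all heads

# Head-by-head: each head would re-read the full latent cache from memory.
out_per_head = torch.stack([
    torch.softmax(q[h] @ c_kv.T / d_c**0.5, dim=-1) @ c_kv for h in range(n_h)
])

# Grouped: one block handles `group` heads at once, so a single load of c_kv is
# amortized over the whole group (the memory-access win in the Triton kernel).
out_grouped = torch.cat([
    torch.softmax(q[h:h + group] @ c_kv.T / d_c**0.5, dim=-1) @ c_kv
    for h in range(0, n_h, group)
])

assert torch.allclose(out_per_head, out_grouped, rtol=1e-3, atol=1e-3)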
Quantization
Limitations:
torch compile does not support FP8; W8A8 is only supported on sm89+; the FP8 KV cache only supports E5M2 for now.
CUDA Graph & Torch Compile
Workaround: #1446 (FusedMoE is skipped for torch compile).
End-to-End Benchmark
Source and setup: https://lmsys.org/blog/2024-09-04-sglang-v0-3/
How to Use MLA in SGLang
We recommend using the latest version (>= v0.3.1.post3).
# fp16 tp8
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-Coder-V2-Instruct --tp 8 --trust-remote-code
# fp16 tp8 w/ torch compile
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-Coder-V2-Instruct --tp 8 --trust-remote-code --enable-torch-compile
# fp16 tp8 w/ torch compile, max torch compile batch size 1
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-Coder-V2-Instruct --tp 8 --trust-remote-code --enable-torch-compile --max-torch-compile-bs 1
# fp8 tp8, fp8 e5m2 kv cache
python3 -m sglang.launch_server --model neuralmagic/DeepSeek-Coder-V2-Instruct-FP8 --tp 8 --trust-remote-code --kv-cache-dtype fp8_e5m2
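Once a server is running, it can be queried through SGLang's OpenAI-compatible API; below is a minimal client sketch, assuming the default port 30000 (adjust the base URL and model name to match your launch command).

import openai

# Assumes the server launched above is listening on the default port 30000.
client = openai.OpenAI(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-Coder-V2-Instruct",
    messages=[{"role": "user", "content": "Write a quicksort function in Python."}],
    temperature=0,
    max_tokens=256,
)
print(response.choices[0].message.content)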
Future Work