SGLang DeepSeek MLA
Ke Bao <bockbao@gmail.com>, Yineng Zhang <me@zhyncs.com>
SGLang Team
Outline
1. Introduction to MLA
2. SGLang MLA Optimizations
3. How to Use MLA in SGLang
4. Future Work
What is MLA?
MLA (Multi-head Latent Attention) [1] is an innovative attention architecture introduced by the DeepSeek-AI team.
[1] DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (https://arxiv.org/pdf/2405.04434)
Computation Overview
(Diagrams: attention computation in MHA vs. MLA)
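To make the comparison concrete, here is a back-of-the-envelope sketch of the per-token KV cache size, using DeepSeek-V2-like dimensions from the paper (128 heads, head dim 128, KV latent rank 512, decoupled RoPE dim 64); the numbers are illustrative only.

# Per-token KV cache size (in elements per layer); illustrative only.
n_heads, head_dim = 128, 128       # attention heads and per-head dim
kv_lora_rank, rope_dim = 512, 64   # MLA latent rank and decoupled RoPE dim

mha_per_token = 2 * n_heads * head_dim   # MHA: full K and V for every head
mla_per_token = kv_lora_rank + rope_dim  # MLA: one shared latent + RoPE key part

print(f"MHA cache: {mha_per_token} elements/token/layer")  # 32768
print(f"MLA cache: {mla_per_token} elements/token/layer")  # 576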
Weight Absorption
Change the computation order based on the associative law of matrix multiplication.
① Original order: first expand the cached latent into full keys/values, then compute attention.
② Absorbed order: first absorb the up-projection weight into the query (and output) projection, then compute attention directly against the latent cache.
FLOPs: in the decoding stage (n_q = 1), order ② takes less computation; the same holds with DeepSeek-V2's dimensions when comparing the original order against the absorbed order (see the sketch below).
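A minimal sketch of the two orders with toy single-head shapes (one decode token, latent dim 512, head dim 128); the tensor names are illustrative and do not mirror SGLang's actual implementation.

import torch

n_q, n_kv = 1, 4096          # decoding: one query token, many cached tokens
d_c, d_h = 512, 128          # latent dim and per-head dim (DeepSeek-V2-like)

q = torch.randn(n_q, d_h)        # query for one head
c_kv = torch.randn(n_kv, d_c)    # compressed latent KV cache
w_uk = torch.randn(d_c, d_h)     # up-projection from latent space to key space

# ① original order: expand the latent cache into full keys, then score.
#    FLOPs ~ n_kv*d_c*d_h (expansion) + n_q*n_kv*d_h (scores)
k = c_kv @ w_uk
scores_original = q @ k.T

# ② absorbed order: fold w_uk into the query, then score against the latent directly.
#    FLOPs ~ n_q*d_h*d_c (absorption) + n_q*n_kv*d_c (scores)
q_absorbed = q @ w_uk.T
scores_absorbed = q_absorbed @ c_kv.T

# Identical result; with n_q = 1 the absorbed order skips the O(n_kv*d_c*d_h) expansion.
assert torch.allclose(scores_original, scores_absorbed, rtol=1e-3, atol=1e-3)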
Weight Absorption
(Diagram: computation flow of the original order vs. w/ Weight Absorption)
Weight Absorption
Benefits: much less computation in the memory-bound decoding stage.
Related PR: #905
Grouped Decoding Kernel
We optimized the memory access pattern of the Triton decoding kernel for MLA.
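A simplified PyTorch illustration of the grouping idea (the real kernel is written in Triton; the shapes and names here are only illustrative): after weight absorption, all query heads attend to the same compressed latent cache, so handling a group of heads per kernel block lets one load of the latent be reused by the whole group.

import torch

n_h, d_c, n_kv, group = 128, 512, 4096, 8   # heads, latent dim, cached tokens, heads per group

q = torch.randn(n_h, d_c)        # absorbed queries for one decode token
c_kv = torch.randn(n_kv, d_c)    # latent KV cache shared by all heads

# Head-by-head: each head would re-read the full latent cache from memory.
out_per_head = torch.stack([
    torch.softmax(q[h] @ c_kv.T / d_c**0.5, dim=-1) @ c_kv for h in range(n_h)
])

# Grouped: one block handles `group` heads at once, so a single load of c_kv is
# amortized over the whole group (the memory-access win in the Triton kernel).
out_grouped = torch.cat([
    torch.softmax(q[h:h + group] @ c_kv.T / d_c**0.5, dim=-1) @ c_kv
    for h in range(0, n_h, group)
])

assert torch.allclose(out_per_head, out_grouped, rtol=1e-3, atol=1e-3)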
Quantization
Limitations:
torch compile does not support FP8; W8A8 is only supported on sm89+; the FP8 KV cache only supports E5M2 for now.
CUDA Graph & Torch Compile
Workaround: #1446 (FusedMoE is skipped for torch compile).
End-to-End Benchmark
Source and setup: https://lmsys.org/blog/2024-09-04-sglang-v0-3/
How to Use MLA in SGLang
We recommend using the latest version (>= v0.3.1.post3).
# fp16 tp8
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-Coder-V2-Instruct --tp 8 --trust-remote-code
# fp16 tp8 w/ torch compile
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-Coder-V2-Instruct --tp 8 --trust-remote-code --enable-torch-compile
# fp16 tp8 w/ torch compile, max torch compile batch size 1
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-Coder-V2-Instruct --tp 8 --trust-remote-code --enable-torch-compile --max-torch-compile-bs 1
# fp8 tp8, fp8 e5m2 kv cache
python3 -m sglang.launch_server --model neuralmagic/DeepSeek-Coder-V2-Instruct-FP8 --tp 8 --trust-remote-code --kv-cache-dtype fp8_e5m2
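Once a server is running, it can be queried through SGLang's OpenAI-compatible API; below is a minimal client sketch, assuming the default port 30000 (adjust the base URL and model name to match your launch command).

import openai

# Assumes the server launched above is listening on the default port 30000.
client = openai.OpenAI(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-Coder-V2-Instruct",
    messages=[{"role": "user", "content": "Write a quicksort function in Python."}],
    temperature=0,
    max_tokens=256,
)
print(response.choices[0].message.content)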
Future Work