CSE 451
Operating Systems
Lecture 25 – Recap and SysML
Baris Kasikci
Who thinks Jonathan is awesome?
2
I certainly do!
If you agree, please nominate him for the Bob Bandes Memorial Teaching award:
https://ta.cs.washington.edu/bandes
Operating Systems Recap
Operating Systems Recap
Operating System Recap
The Quest for�Blazingly Fast LLM Serving
6
Baris Kasikci
7
Full stack co-design of adaptive systems
Deep integration of traditional/emerging software/hardware ecosystems
AI Assistants (ChatGPT, Google Bard)��Text-to-Image (DALL·E, MidJourney)��Creative Writing (Jasper, Copy.ai)��AI Coding Tools (GitHub Copilot, Replit)��Text-to-Speech & Audio (Descript, Synthesia)
8
LLMs Enable Many New Applications
9
Source: https://nerdynav.com/chatgpt-statistics/
LLM Serving Systems Face Surging Demand
Cost of Training and Serving LLMs is Massive
Large-scale investment in H100s in 2024
NVIDIA H200 HGX Server
10
[1] Tom’s Hardware
[2] The Register
Batching user requests and aiming for optimal throughput are key
Prior Throughput Improvements
11
How do we approach optimal LLM serving throughput?
12
Compute
Network
Memory
Existing LLM Inference Engines
Request
Batch
GEMM, …
Decode
Attention,…
Activation
Communication, …
13
Compute
Network
Memory
Existing LLM Inference Engines
Request
Batch
GEMM, …
Decode
Attention,…
Activation
Communication, …
Batches processed sequentially by heterogeneous operations
14
Compute
Network
Memory
Existing LLM Inference Engines
Request
Batch
GEMM, …
Decode
Attention,…
Activation
Communication, …
Batches processed sequentially by heterogeneous operations
15
Compute
Network
Memory
Existing LLM Inference Engines
Request
Batch
GEMM, …
Decode
Attention,…
Activation
Communication, …
Batches processed sequentially by heterogeneous operations
Compute is the most-constrained resource in modern LLM serving systems
Most-constrained
resource
16
Existing LLM Inference Engines
Batches processed sequentially by heterogeneous operations
Compute
Network
Memory
Request
Batch
~100% compute utilization
during this time
Most-constrained
resource
17
Existing LLM Inference Engines
Batches processed sequentially by heterogeneous operations
Compute
Network
Memory
Request
Batch
Utilization of compute, the most-constrained resource, is not maximized
~40% compute utilization end-to-end
Most-constrained
resource
NanoFlow: Towards Optimal Large Language Model Serving Throughput
Kan Zhu, Yufei Gao, Yilong Zhao, Liangyu Zhao, Gefei Zuo, Yile Gu,
Dedong Xie, Qinyu Xu, Tian Tang, Zihao Ye, Keisuke Kamahori,
Chien-Yu Lin, Ziren Wang, Stephanie Wang, Arvind Krishnamurthy, Baris Kasikci
18
[1] https://github.com/efeslab/Nanoflow
Extra weight loading from nano-batches is acceptable due to overall compute-boundness
Compute
Network
Memory
Request
Batch
C1
C2
M1
M2
N1
N2
B1
B2
Nano-batches
Nano-operations
: ran with intra-device parallelism
20
C1
C2
M1
M2
N1
N2
B1
B2
Nano-batches
Nano-operations
: ran with intra-device parallelism
Increased end-to-end compute utilization
1.91x throughput gain compared to state-of-the-art systems
Nanoflow automatically generates and executes optimized pipelines
21
What is the optimal LLM serving throughput?
Design of Nanoflow
Operation Classification
Evaluation
Resource Bottleneck Analysis
Nano-batching
Auto-search
Runtime
22
What is the optimal LLM serving throughput?
Design of Nanoflow
Operation Classification
Evaluation
Resource Bottleneck Analysis
Nano-batching
Auto-search
Runtime
23
Decode
Decode
Decode
Transformer
Prefill
The
University
of
Washington
is
is
in
in
Seattle
located
located
Seattle
Prefill and Decode Phases of the Transformer
24
Transformer Layers and Iterations
Inference Iteration
Inference Layer
Output Token
Activations
25
+
+
Decoder-Only Transformer
Matching Operation Needs With GPU Resources
26
Compute
Memory
Bandwidth
Network
Bandwidth
W_Q
W_K
W_V
W_O
W_Up
W_Gate
W_Down
Decode
Attention
Prefill
Attention
Attention
All-gather
FFN
All-reduce
27
What is the optimal LLM serving throughput?
Design of Nanoflow
Operation Classification
Evaluation
Resource Bottleneck Analysis
Nano-batching
Auto-search
Runtime
28
What is the optimal LLM serving throughput?
Design of Nanoflow
Operation Classification
Evaluation
Resource Bottleneck Analysis
Nano-batching
Auto-search
Runtime
What Is the Main Resource Bottleneck of LLM Serving?
Model sizes are vastly larger than prior DNNs’
Self-Attention is memory-intensive during decoding
29
[1] Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, and Song Han. Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference, 2024
LLM serving is assumed to be memory-bound
Memory-Centric Execution Time Estimation*
30
GPU
memory
Weight
Input/
Actv.
KV cache
Entire contents of the GPU memory would be loaded once during one iteration
* We assume operation at maximum batch size, which is necessary for optimal throughput
Compute-Centric Execution Time Estimation
31
Weight
Output
Output
Output
K
Bdense
K
N
K1
Bdense
N1
Every GEMM compute =
K2
Bdense
N2
K3
Bdense
N3
…
Ki
Bdense
Ni
+
+
+
+
Bdense
Pmodel
2
2
2
2
2
…
…
…
K
Bdense
N
2
All Dense Operations
Network-Centric Execution Time Estimation
32
A
B
C
D
A
A
A
B
B
B
C
C
C
D
D
D
NGPU
All-Gather Recap
(NGPU – 1)
# of hops per AllGather
(NGPU – 1): # of hops
Network-Centric Execution Time Estimation
33
All-Reduce ≈ 2 x All-Gather
(NGPU – 1): # of hops
4: All-Gathers/layer
Network-Centric Execution Time Estimation
34
(NGPU – 1): # of hops
Layer l
Layer l+1
4: All-Gathers/layer
[Bdense, Dmodel]
Bdense x Dmodel: shape
Stype: data type size
Each item
of size Stype
Network-Centric Execution Time Estimation
35
(NGPU – 1): # of hops
4: All-Gathers/layer
Bdense x Dmodel: shape
Stype: data type size
Execution Times wrt Different Resources
36
Execution Times wrt Different Resources
37
Compute vs Network
38
LLM Serving is more compute-bound than network-bound
Shift to
Compute
Bound
Shift to
Network
Bound
Not
Relevant
Compute vs Network
39
Execution Times wrt Different Resources
40
Execution Times wrt Different Resources
41
Compute vs Memory
42
Grouped Query Attention1
43
GPU
memory
Weight
Input
KV cache
[1] GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, Sumit Sanghai
H1
Attention Head
H2
H3
H4
HN
Grouped Query Attention1
44
GPU
memory
Weight
Input
KV cache
H1
H2
H3
H4
HN
Attention Head
[1] GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, Sumit Sanghai
Grouped Query Attention1
45
GPU
memory
Weight
Input
KV cache
H1
H2
H3
H4
HN
Group size: 4
GQA allows increasing the batch size by a factor of the group size
X
H4N
[1] GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, Sumit Sanghai
Compute vs Memory
46
GQA allows for a large batch size
Compute
Memory
Model sizes tend to increase
Batches with large decode requests increase memory accesses
LLM serving is more compute-bound than memory-bound
Compute vs Memory - Validation
47
High Utilization of GPU compute is the key for LLM serving
Revisit the performance of existing frameworks
48
There is a large gap to optimal throughput
49
What is the optimal LLM serving throughput?
Design of Nanoflow
Operation Classification
Evaluation
Resource Bottleneck Analysis
Nano-batching
Auto-search
Runtime
50
What is the optimal LLM serving throughput?
Design of Nanoflow
Operation Classification
Evaluation
Resource Bottleneck Analysis
Nano-batching
Auto-search
Runtime
51
Decoder-Only Transformer
Dependency
Nano-batching via Nano-operations
52
Up Gate
Input Batch
Dependency
Down
Input Batch
Up Gate 1
Input Batch
Up Gate 2
Input Batch
Down 1
Input Batch
Down 2
Input Batch
Dependency
Dependency
Can overlap
Pay extra weight loading for overlapping opportunities
Nano-batch Design Space
53
Up Gate
Input
Number of nano-batches
Up Gate
Input
Up Gate
Input
Up Gate
Input
Up Gate
Input
Nano-batch Design Space
54
Sizes of Nano-batches
Up Gate
Input
Up Gate
Input
Up Gate
Input
Up Gate
Input
Nano-batch Design Space
55
Ordering of Nano-batches
Up Gate 1
Input
Up Gate 2
Input
Down 1
Input
Down 2
Input
Up Gate 1
Input
Up Gate 2
Input
Down 1
Input
Down 2
Input
56
What is the optimal LLM serving throughput?
Design of Nanoflow
Operation Classification
Evaluation
Resource Bottleneck Analysis
Nano-batching
Auto-search
Runtime
57
What is the optimal LLM serving throughput?
Design of Nanoflow
Operation Classification
Evaluation
Resource Bottleneck Analysis
Nano-batching
Auto-search
Runtime
Auto-Search Stage 1: Pipeline Search
58
Inputs
Dependency
DAG
Goal: Minimize execution time
Constraints
Outputs
Ordering of same type
nano-ops
1. Obey operation dependencies
2. Same type operations do not overlap
Simplifications
1. Only consider three layers 2. Nano-batch partitioning is symmetric across layers 3. No interference
Design Space For Overlapping Nano-Ops
59
GPU resource allocation
Up Gate 2
Decode Attention
ComputeOnly
Memory
Only
Overlap
Plan A
Overlap
Plan B
Up Gate 2
Decode Attention
Up Gate 2
Decode Attention
Profiling interference is challenging
60
Profile pair-wise interference for simplification
Profiling Interference
61
Modeling Interference
62
Interference Example
63
GEMM
GEMV
Network
100%
0%
0%
64
GEMM
GEMV
Network
80%
0%
0%
Interference Example
65
GEMM
GEMV
Network
80%
0%
0%
Interference Example
Interference Example
66
GEMM
GEMV
Network
80%
30%
0%
67
GEMM
GEMV
Network
60%
30%
50%
Interference Example
Auto-Search Stage 2: Resource allocation
68
Goal: Minimize execution time
Constraints
Outputs
1. Obey dependencies and orderings
2. Sum of R ≤ 1
Simplifications
1. Only consider three layers 2. R is symmetric across layers
Ordering
Inputs
Dependency
KQV1
O2
Up Gate Down 2
KQV2
Decode Attention 1
PF1
O1
Up Gate Down 1
KQV1
Decode Attention 2
PF2
O2
Up Gate Down 2
Decode Attention 1
KQV1
Decode Attention 1
PF1
O1
Up Gate Down 1
KQV1
KQV2
Decode Attention 2
PF2
O2
Up Gate Down 2
KQV2
KQV
Decode Attention
PF
O
Up Gate Down
KQV
KQV1
KQV2
Decode Attention 1
Decode Attention 2
PF1
PF2
Up Gate Down 1
O1
O2
Up Gate Down 2
KQV1
8B Pipeline Design
WASTED
Memory-bound
Compute-bound
Compute-bound kernels fully occupy the GPU
Reorder
Overlap
Split
70B – Nanoflow Pipeline
70
O.AR2
R=0.1
KQV1
R=0.4
KQV2
R=0.4
KQV3
R=0.4
KQV4
R=0.4
DecAttn1
R=0.4
DecAttn2
R=0.4
DecAttn3
R=0.4
DecAttn4
R=0.4
O.AG1
R=0.2
Attn.AG1
R=0.2
O1
R=0.6
O2
R=0.8
Up,Gate,Down (UGD) 1
R=0.9
Up,Gate,Down (UGD) 2
R=0.9
UGD.AR1
R=0.1
PF1
R=0.6
UGD.AR2
R=0.2
UGD.AR3
R=0.2
Layer
Attn.AG2
AG to AR Transformation
UGD.AR2
R=0.2
UGD.AR3
R=0.2
KQV1
R=0.4
KQV2
R=0.4
KQV3
R=0.4
DecAttn1
R=0.4
DecAttn2
R=0.4
Prefill Attention
71
What is the optimal LLM serving throughput?
Design of Nanoflow
Operation Classification
Evaluation
Resource Bottleneck Analysis
Nano-batching
Auto-search
Runtime
72
What is the optimal LLM serving throughput?
Design of Nanoflow
Operation Classification
Evaluation
Resource Bottleneck Analysis
Nano-batching
Auto-search
Runtime
Batch Formation
73
Decode Request 1
Decode Request 2
Decode Request 5
…
Prefill Request 1
Prefill Request 2
Fixed Budget
Next round
Nanoflow forms constant batches for consistent performance
Asynchronous Scheduling
74
Batch formation and KV cache management are moved out of the critical path
Asynchronous Scheduling
75
Batch formation and KV cache management are moved out of the critical path
76
What is the optimal LLM serving throughput?
Design of Nanoflow
Operation Classification
Evaluation
Resource Bottleneck Analysis
Nano-batching
Auto-search
Runtime
77
What is the optimal LLM serving throughput?
Design of Nanoflow
Operation Classification
Evaluation
Resource Bottleneck Analysis
Nano-batching
Auto-search
Runtime
Evaluation: End-to-End Throughput
78
1.73x const length I/O throughput gain
1.91x dataset throughput gain
68.5% of optimal throughput
Evaluation: Latency
79
Splitwise
ShareGPT
Mlsys
Similar (but a bit higher) latency at
low request rate
Processes 1.64x requests at
200ms SLO
Low tail latency
Evaluation – Nanoflow Pipelines with Different Models
80
Porting Nanoflow to other models demonstrates 50.4% to 78.5% of optimal throughput
NanoFlow as a Research Oriented Serving Engine
AMD
Nvidia
AMD
Nvidia
AMD
Nvidia
KQV
Rope
Dec Attn
O
Prefill Attn
Attn All Gather
Model Level
Operation
Level
Impl
Level
GEMM
Attention
AllGather
Torch
FlashAttention
FlashInfer
Torch
Triton
CUBLAS
RocBLAS
CUTLASS
CK
MSCCL++
NCCL
RCCL
Torch
Triton
User
Lib
Example: CPU Code Execution
82
CPU path
GPU path
NanoFlow as a Research-Oriented Serving Engine
83
Small Code Base
Performance Centric
Heterogenous Support
84
Nano-batched pipeline design
Automated search & efficient runtime
1.91x throughput gain
50.4-78.5% optimal throughput
Paper
Github
baris@cs.washington.edu
Evaluation - Resource usage
Non-overlap baseline
85
Nanoflow utilizes multiple resources concurrently, achieving higher average compute usage
Nanoflow
Evaluation - Ablation study
86
Overlapping network: 1.07x speedup
Overlapping memory: 1.10x speedup
Overall, 1.17x speedup