1 of 86

CSE 451

Operating Systems

Lecture 25 – Recap and SysML

Baris Kasikci

2 of 86

Who thinks Jonathan is awesome?

2

I certainly do!

If you agree, please nominate him for the Bob Bandes Memorial Teaching award:

https://ta.cs.washington.edu/bandes

3 of 86

Operating Systems Recap

Kernel/user space abstraction

Trusted intermediary providing security and fault isolation between untrusted (or weakly trusted) applications
Traps (system calls, exceptions, and interrupts) from untrusted code into kernel

Virtual machine abstraction

Host operating system, guest operating system
Software and hardware support for virtualization

Thread abstraction

Sequential execution of instructions within a thread, at variable speed

Synchronization

Data races, write buffering
Locks and condition variables: how to write correct concurrent programs

4 of 86

Operating Systems Recap

Deadlock

Necessary conditions for deadlock; ways to prevent deadlock

CPU scheduling

Shortest job first; round robin, fair queueing, multilevel feedback queues

Address Translation

Conversion of virtual to physical addresses using multi-level page tables
Translation Lookaside Buffers (TLBs)

Paging

Mechanisms to implement demand paging, copy on write
Least recently used and clock algorithm

5 of 86

Operating System Recap

Storage Technologies

Flash (SSD) storage with multi-block erasures and wearout; translation layer from virtual to physical block numbers
Magnetic (hard) disk storage; latency to physically move disk head to data

File Systems

Directory name translation to inode (file) #
Inode to data translation using extents, indirect/doubly indirect blocks
Transaction semantics and write ahead logging

Security

Theory: authentication, authorization, enforcement, auditing
Practice: all systems have bugs, particularly if adversary is malicious

6 of 86

The Quest for�Blazingly Fast LLM Serving

6

Baris Kasikci

7 of 86

7

Full stack co-design of adaptive systems

Deep integration of traditional/emerging software/hardware ecosystems

8 of 86

AI Assistants (ChatGPT, Google Bard)��Text-to-Image (DALL·E, MidJourney)��Creative Writing (Jasper, Copy.ai)��AI Coding Tools (GitHub Copilot, Replit)��Text-to-Speech & Audio (Descript, Synthesia)

8

LLMs Enable Many New Applications

9 of 86

9

Source: https://nerdynav.com/chatgpt-statistics/

LLM Serving Systems Face Surging Demand

10 of 86

Cost of Training and Serving LLMs is Massive

Large-scale investment in H100s in 2024

Meta: 300K¹
Google: 150K¹
Microsoft: 150K¹
X: 85K²

NVIDIA H200 HGX Server

~$250,000
High operating cost

Up to ~10,000 W

Long lead times

10

[1] Tom’s Hardware

[2] The Register

Batching user requests and aiming for optimal throughput are key

11 of 86

Prior Throughput Improvements

11

How do we approach optimal LLM serving throughput?

Due to this environment, existing serving systems have made significant throughput improvements. For example, the vLLM framework introduced the idea of paged attention, which increases the batch size of requests that can be processed by reducing memory fragmentation.

Deepspeed-fastgen groups operations from the two major phases of LLM inference, namely prefill and decode to amortize the weight loading cost.

TensorRTLLM from NVIDIA uses a compiler backend to generate high performance GPU kernels.

These systems achieve throughput on the order of several hundreds of tokens per second.

<click> However, if we look at the optimal LLM serving throughput that can be achieved, there is still a gap.

<click> The question we will explore in this talk is how can we close this gap and approach optimal serving throughput?

12 of 86

12

Compute

Network

Memory

Existing LLM Inference Engines

Request

Batch

GEMM, …

Decode

Attention,…

Activation

Communication, …

13 of 86

13

Compute

Network

Memory

Existing LLM Inference Engines

Request

Batch

GEMM, …

Decode

Attention,…

Activation

Communication, …

Batches processed sequentially by heterogeneous operations

14 of 86

14

Compute

Network

Memory

Existing LLM Inference Engines

Request

Batch

GEMM, …

Decode

Attention,…

Activation

Communication, …

Batches processed sequentially by heterogeneous operations

15 of 86

15

Compute

Network

Memory

Existing LLM Inference Engines

Request

Batch

GEMM, …

Decode

Attention,…

Activation

Communication, …

Batches processed sequentially by heterogeneous operations

Compute is the most-constrained resource in modern LLM serving systems

Most-constrained

resource

16 of 86

16

Existing LLM Inference Engines

Batches processed sequentially by heterogeneous operations

Operations have high resource utilization locally

Full access to GPU
~100% compute utilization during compute-bound operations

Compute

Network

Memory

Request

Batch

~100% compute utilization

during this time

Most-constrained

resource

17 of 86

17

Existing LLM Inference Engines

Batches processed sequentially by heterogeneous operations

Operations have high resource utilization locally

Full access to GPU
~100% compute utilization during compute-bound operations

End-to-end resource utilization is low

40% compute utilization end-to-end

Compute

Network

Memory

Request

Batch

Utilization of compute, the most-constrained resource, is not maximized

~40% compute utilization end-to-end

Most-constrained

resource

18 of 86

NanoFlow: Towards Optimal Large Language Model Serving Throughput

Kan Zhu, Yufei Gao, Yilong Zhao, Liangyu Zhao, Gefei Zuo, Yile Gu,

Dedong Xie, Qinyu Xu, Tian Tang, Zihao Ye, Keisuke Kamahori,

Chien-Yu Lin, Ziren Wang, Stephanie Wang, Arvind Krishnamurthy, Baris Kasikci

18

[1] https://github.com/efeslab/Nanoflow

19 of 86

Extra weight loading from nano-batches is acceptable due to overall compute-boundness

Compute

Network

Memory

Request

Batch

C₁

C₂

M₁

M₂

N₁

N₂

B₁

B₂

Nano-batches

Nano-operations

: ran with intra-device parallelism

20 of 86

20

C₁

C₂

M₁

M₂

N₁

N₂

B₁

B₂

Nano-batches

Nano-operations

: ran with intra-device parallelism

Increased end-to-end compute utilization

1.91x throughput gain compared to state-of-the-art systems

Nanoflow automatically generates and executes optimized pipelines

Challenging due to the size of the search space (models, hardware, requests)

21 of 86

21

What is the optimal LLM serving throughput?

Design of Nanoflow

Operation Classification

Evaluation

Resource Bottleneck Analysis

Nano-batching

Auto-search

Runtime

22 of 86

22

What is the optimal LLM serving throughput?

Design of Nanoflow

Operation Classification

Evaluation

Resource Bottleneck Analysis

Nano-batching

Auto-search

Runtime

23 of 86

23

Decode

Transformer

Prefill

The

University

of

Washington

is

in

Seattle

located

Seattle

Prefill and Decode Phases of the Transformer

24 of 86

24

Transformer Layers and Iterations

Inference Iteration

Inference Layer

Output Token

Activations

25 of 86

25

+

Decoder-Only Transformer

26 of 86

Matching Operation Needs With GPU Resources

26

Compute

Memory

Bandwidth

Network

Bandwidth

W_Q

W_K

W_V

W_O

W_Up

W_Gate

W_Down

Decode

Attention

Prefill

Attention

All-gather

FFN

All-reduce

27 of 86

27

What is the optimal LLM serving throughput?

Design of Nanoflow

Operation Classification

Evaluation

Resource Bottleneck Analysis

Nano-batching

Auto-search

Runtime

28 of 86

28

What is the optimal LLM serving throughput?

Design of Nanoflow

Operation Classification

Evaluation

Resource Bottleneck Analysis

Nano-batching

Auto-search

Runtime

29 of 86

What Is the Main Resource Bottleneck of LLM Serving?

Model sizes are vastly larger than prior DNNs’

Deepseek-V3: 671B parameters (~1.5 TB VMRAM requirement for FP16)
H200 has 141GB VRAM

Self-Attention is memory-intensive during decoding

Key-Value (KV) Cache

Scales linearly with the context length
Size can surpass that of model weights¹

29

[1] Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, and Song Han. Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference, 2024

LLM serving is assumed to be memory-bound

30 of 86

Memory-Centric Execution Time Estimation*

30

GPU

memory

Weight

Input/

Actv.

KV cache

Entire contents of the GPU memory would be loaded once during one iteration

* We assume operation at maximum batch size, which is necessary for optimal throughput

31 of 86

Compute-Centric Execution Time Estimation

31

Weight

Output

K

B_dense

K

N

K₁

B_dense

N₁

Every GEMM compute =

K₂

B_dense

N₂

K₃

B_dense

N₃

…

K_i

B_dense

N_i

+

B_dense

P_model

2

…

K

B_dense

N

2

All Dense Operations

For compute, almost all the demand comes from dense operations which are due to activations being multiplied with large weight matrices in large batches.

In these dense general matrix matrix multiplication operations, GEMMs, every element is computed via a dot product of one row of the input with one column of the weight. Let's assume a weight matrix of KxN

<click> Therefore, K multiplications and K additions are required to compute one element, so that's how we get the 2K term.

<click> There are B_dense*N output elements, where B_dense is the size of the input batch involved in dense operations.

Therefore, for one GEMM, 2B_dense KN compute operations is needed. <click> We can sum all the dense operations to derive the total number of compute operations. Note that the K N product is the number of elements inside weight matrices.

<click> Therefore, if we factor out the 2xBdense term, <click> the remaining sum is the total number of elements in all weight matrices, which can be approximated by the number of parameters of the Model, which we denote as Pmodel. I find this a neat way to think about compute demand of a model.

<click> Therefore, we can estimate the execution time from the compute perspective to be 2Bdense Pmodel divided by the compute capacity of the GPU in FLOPS/s.

32 of 86

Network-Centric Execution Time Estimation

32

A

B

C

D

A

B

C

D

N_GPU

All-Gather Recap

(N_GPU– 1)

# of hops per AllGather

(N_GPU– 1): # of hops

33 of 86

Network-Centric Execution Time Estimation

33

All-Reduce ≈ 2 x All-Gather

(N_GPU– 1): # of hops

4: All-Gathers/layer

34 of 86

Network-Centric Execution Time Estimation

34

(N_GPU– 1): # of hops

Layer l

Layer l+1

4: All-Gathers/layer

[B_dense, D_model]

B_densex D_model: shape

S_type: data type size

Each item

of size S_type

35 of 86

Network-Centric Execution Time Estimation

35

(N_GPU– 1): # of hops

4: All-Gathers/layer

B_densex D_model: shape

S_type: data type size

36 of 86

Execution Times wrt Different Resources

36

37 of 86

Execution Times wrt Different Resources

37

38 of 86

Compute vs Network

38

LLM Serving is more compute-bound than network-bound

Shift to

Compute

Bound

Shift to

Network

Bound

Not

Relevant

So this is what the ratio of the two execution time models look like. I actually have a more analytical way to compare these but it is a bit involved and so I won’t get into the details here, instead, I’ll qualitatively compare them and then show some empirical validation results.

<click> We can see that increasing model sizes and network bandwidth pushes this ratio towards a compute bound regime, and both of these have been on a steep rise

<click> Increasing compute capacity and sizes of types would push the ratio in favor of a network bound regime. Compute capacity keeps increasing, mainly due to increased parallelism, but it is stunted at an individual core or SM level due to slowdown of effects such as Moore’s law. The sizes of datatypes tend to somewhat decrease thanks to advances in quantization methods.

<click> Finally, the dense batch size has no effect on this ratio as you can see.

<click> So the increase trend in network boundness lags behind that of compute boundness, therefore, LLM serving is more compute-bound than network-bound

39 of 86

Compute vs Network

39

40 of 86

Execution Times wrt Different Resources

40

41 of 86

Execution Times wrt Different Resources

41

42 of 86

Compute vs Memory

42

43 of 86

Grouped Query Attention¹

43

GPU

memory

Weight

Input

KV cache

[1] GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, Sumit Sanghai

H₁

Attention Head

H₂

H₃

H₄

H_N

44 of 86

Grouped Query Attention¹

44

GPU

memory

Weight

Input

KV cache

H₁

H₂

H₃

H₄

H_N

Attention Head

[1] GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, Sumit Sanghai

45 of 86

Grouped Query Attention¹

45

GPU

memory

Weight

Input

KV cache

H₁

H₂

H₃

H₄

H_N

Group size: 4

GQA allows increasing the batch size by a factor of the group size

X

H_4N

[1] GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, Sumit Sanghai

46 of 86

Compute vs Memory

46

GQA allows for a large batch size

Compute

Memory

Model sizes tend to increase

Batches with large decode requests increase memory accesses

LLM serving is more compute-bound than memory-bound

47 of 86

Compute vs Memory - Validation

47

High Utilization of GPU compute is the key for LLM serving

48 of 86

Revisit the performance of existing frameworks

48

There is a large gap to optimal throughput

49 of 86

49

What is the optimal LLM serving throughput?

Design of Nanoflow

Operation Classification

Evaluation

Resource Bottleneck Analysis

Nano-batching

Auto-search

Runtime

50 of 86

50

What is the optimal LLM serving throughput?

Design of Nanoflow

Operation Classification

Evaluation

Resource Bottleneck Analysis

Nano-batching

Auto-search

Runtime

51 of 86

51

Decoder-Only Transformer

Dependency

52 of 86

Nano-batching via Nano-operations

52

Up Gate

Input Batch

Dependency

Down

Input Batch

Up Gate 1

Input Batch

Up Gate 2

Input Batch

Down 1

Input Batch

Down 2

Input Batch

Dependency

Can overlap

Pay extra weight loading for overlapping opportunities

53 of 86

Nano-batch Design Space

53

Up Gate

Input

Number of nano-batches

Up Gate

Input

Up Gate

Input

Up Gate

Input

Up Gate

Input

54 of 86

Nano-batch Design Space

54

Sizes of Nano-batches

Up Gate

Input

Up Gate

Input

Up Gate

Input

Up Gate

Input

55 of 86

Nano-batch Design Space

55

Ordering of Nano-batches

Up Gate 1

Input

Up Gate 2

Input

Down 1

Input

Down 2

Input

Up Gate 1

Input

Up Gate 2

Input

Down 1

Input

Down 2

Input

56 of 86

56

What is the optimal LLM serving throughput?

Design of Nanoflow

Operation Classification

Evaluation

Resource Bottleneck Analysis

Nano-batching

Auto-search

Runtime

57 of 86

57

What is the optimal LLM serving throughput?

Design of Nanoflow

Operation Classification

Evaluation

Resource Bottleneck Analysis

Nano-batching

Auto-search

Runtime

58 of 86

Auto-Search Stage 1: Pipeline Search

58

Inputs

Dependency

DAG

Goal: Minimize execution time

Constraints

Outputs

Ordering of same type

nano-ops

1. Obey operation dependencies

2. Same type operations do not overlap

Simplifications

1. Only consider three layers 2. Nano-batch partitioning is symmetric across layers 3. No interference

In the first stage, the first input to the auto-search algorithm is a map of operation to execution time for different batch sizes, which we determine through profiling. The second input is a graph of dependencies across operations

<click> The outputs are number of nanobatches, batch sizes per nano-batch and ordering across operations of the same type, that is order across compute nano-operations or memory nano-operations etc.

<click> So these outputs are represented using these variables. T_start is the start time of each nano-op, which effectively gives us this ordering at the output.

<click> The objective is to minimize the execution time subject to constraints like obeying operation dependencies and making sure operations of the same type don't overlap. So a compute-bound kernel doesn't overlap with another compute-bound-kernel.

<click> We use a number of simplifying assumptions to make this linear programming problem tractable. We consider overlap opportunities across three layers, we assume the nanobatches are partitioned the same way across layers and we also assume that we can provide overlapping without any interference between different nano-operations.

59 of 86

Design Space For Overlapping Nano-Ops

59

GPU resource allocation

Up Gate 2

Decode Attention

ComputeOnly

Memory

Only

Overlap

Plan A

Overlap

Plan B

Up Gate 2

Decode Attention

Up Gate 2

Decode Attention

60 of 86

Profiling interference is challenging

Combinational profiling space

1000 GEMM
100 GEMV
100 Network
In total 1000*100*100 configurations

GEMM performance matters more

The pipeline is compute-bound
Overlapping between GEMM & GEMV, GEMM & Network dominates

60

Profile pair-wise interference for simplification

61 of 86

Profiling Interference

GEMM

Tile size
Split K
Number of warps
…

GEMV

Number of thread blocks
Loading pipeline stages
…

61

62 of 86

Modeling Interference

62

63 of 86

Interference Example

63

GEMM

GEMV

Network

100%

0%

64 of 86

64

GEMM

GEMV

Network

80%

0%

Interference Example

65 of 86

65

GEMM

GEMV

Network

80%

0%

Interference Example

66 of 86

Interference Example

66

GEMM

GEMV

Network

80%

30%

0%

67 of 86

67

GEMM

GEMV

Network

60%

30%

50%

Interference Example

68 of 86

Auto-Search Stage 2: Resource allocation

68

Goal: Minimize execution time

Constraints

Outputs

1. Obey dependencies and orderings

2. Sum of R ≤ 1

Simplifications

1. Only consider three layers 2. R is symmetric across layers

Ordering

Inputs

Dependency

69 of 86

KQV1

O2

Up Gate Down 2

KQV2

Decode Attention 1

PF1

O1

Up Gate Down 1

KQV1

Decode Attention 2

PF2

O2

Up Gate Down 2

Decode Attention 1

KQV1

Decode Attention 1

PF1

O1

Up Gate Down 1

KQV1

KQV2

Decode Attention 2

PF2

O2

Up Gate Down 2

KQV2

KQV

Decode Attention

PF

O

Up Gate Down

KQV

KQV1

KQV2

Decode Attention 1

Decode Attention 2

PF1

PF2

Up Gate Down 1

O1

O2

Up Gate Down 2

KQV1

8B Pipeline Design

WASTED

Memory-bound

Compute-bound

Compute-bound kernels fully occupy the GPU

Reorder

Overlap

Split

To illustrate a pipeline that Nanoflow automatically computes, I’ll provide an example using the serving pipeline design of an 8B parameter model, where there is no networking since all the model fits on a single GPU.

As we said, existing pipeline designs sequentially execute operations, since they each depend on one another. For example when decode attention is executing, the compute units are almost idle, causing low resource utilization.

<click> To break these dependencies, Nanoflow splits the batches of user requests into nano-batches to seeks opportunities for overlapping operations that use different resources.

After breaking operations into nanobatches, <click> Nanoflow’s auto-search may reorder the operations and <click> overlap nanobatch 1’s decode attention with nanobatch 2’s dense operations.

<click> In the final case, compute-bound kernels can fully occupy the GPU.

Detail: Contrast to micro-batches in pipeline parallelism: Instead of microbatches in pipeline parallelism, to overlap the resource usage of operations, we need more fine grained batches.

70 of 86

70B – Nanoflow Pipeline

70

O.AR2

R=0.1

KQV1

R=0.4

KQV2

R=0.4

KQV3

R=0.4

KQV4

R=0.4

DecAttn1

R=0.4

DecAttn2

R=0.4

DecAttn3

R=0.4

DecAttn4

R=0.4

O.AG1

R=0.2

Attn.AG1

R=0.2

O1

R=0.6

O2

R=0.8

Up,Gate,Down (UGD) 1

R=0.9

Up,Gate,Down (UGD) 2

R=0.9

UGD.AR1

R=0.1

PF1

R=0.6

UGD.AR2

R=0.2

UGD.AR3

R=0.2

Layer

Attn.AG2

AG to AR Transformation

UGD.AR2

R=0.2

UGD.AR3

R=0.2

KQV1

R=0.4

KQV2

R=0.4

KQV3

R=0.4

DecAttn1

R=0.4

DecAttn2

R=0.4

Prefill Attention

71 of 86

71

What is the optimal LLM serving throughput?

Design of Nanoflow

Operation Classification

Evaluation

Resource Bottleneck Analysis

Nano-batching

Auto-search

Runtime

72 of 86

72

What is the optimal LLM serving throughput?

Design of Nanoflow

Operation Classification

Evaluation

Resource Bottleneck Analysis

Nano-batching

Auto-search

Runtime

73 of 86

Batch Formation

73

Decode Request 1

Decode Request 2

Decode Request 5

…

Prefill Request 1

Prefill Request 2

Fixed Budget

Next round

Nanoflow forms constant batches for consistent performance

74 of 86

Asynchronous Scheduling

74

Batch formation and KV cache management are moved out of the critical path

Batch formation and KV cache management takes a non-negligible amount of time on the CPU side too.

It is not uncommon for existing frameworks to detect the EOS tokens of a request after the GPU executes one iteration. This allows removing the request from the batch and refilling the batch with new requests. But this sequential operation means GPU would be idle while the CPU processing happens.

So instead, Nanoflow asynchronously schedules batch formation and KV cache management for the iteration. So for example here in iteration i, batch formation and KV-cache management is performed for the iteration i+1.

<click> In iteration i+1, batch formation and KV-cache management for iteration i+2 is happening, after which Nanoflow detects the EOS token from iteration i.

<click> Then Nanoflow removes the finished request from the batch.

75 of 86

Asynchronous Scheduling

75

Batch formation and KV cache management are moved out of the critical path

76 of 86

76

What is the optimal LLM serving throughput?

Design of Nanoflow

Operation Classification

Evaluation

Resource Bottleneck Analysis

Nano-batching

Auto-search

Runtime

77 of 86

77

What is the optimal LLM serving throughput?

Design of Nanoflow

Operation Classification

Evaluation

Resource Bottleneck Analysis

Nano-batching

Auto-search

Runtime

78 of 86

Evaluation: End-to-End Throughput

78

1.73x const length I/O throughput gain

1.91x dataset throughput gain

68.5% of optimal throughput

79 of 86

Evaluation: Latency

79

Splitwise

ShareGPT

Mlsys

Similar (but a bit higher) latency at

low request rate

Processes 1.64x requests at

200ms SLO

Low tail latency

80 of 86

Evaluation – Nanoflow Pipelines with Different Models

80

Porting Nanoflow to other models demonstrates 50.4% to 78.5% of optimal throughput

81 of 86

NanoFlow as a Research Oriented Serving Engine

AMD

Nvidia

AMD

Nvidia

AMD

Nvidia

KQV

Rope

Dec Attn

O

Prefill Attn

Attn All Gather

Model Level

Operation

Level

Impl

Level

GEMM

Attention

AllGather

Torch

FlashAttention

FlashInfer

Torch

Triton

CUBLAS

RocBLAS

CUTLASS

CK

MSCCL++

NCCL

RCCL

Torch

Triton

User

Lib

82 of 86

Example: CPU Code Execution

82

CPU path

GPU path

83 of 86

NanoFlow as a Research-Oriented Serving Engine

83

Small Code Base

Only Representative Models
Lower Backward Combability Pressure

Performance Centric

Support Custom Kernels
Automatically Search for the Best Implementation

Heterogenous Support

CPU-GPU, GPU-GPU Cooperation
Assist Pipeline Design with Auto-search

84 of 86

84

Nano-batched pipeline design

Automated search & efficient runtime

1.91x throughput gain

50.4-78.5% optimal throughput

Paper

Github

baris@cs.washington.edu

85 of 86

Evaluation - Resource usage

Non-overlap baseline

85

Nanoflow utilizes multiple resources concurrently, achieving higher average compute usage

Nanoflow

86 of 86

Evaluation - Ablation study

86

Overlapping network: 1.07x speedup

Overlapping memory: 1.10x speedup

Overall, 1.17x speedup