1 of 86

CSE 451

Operating Systems

Lecture 25 – Recap and SysML

Baris Kasikci

2 of 86

Who thinks Jonathan is awesome?

2

I certainly do!

If you agree, please nominate him for the Bob Bandes Memorial Teaching award:

https://ta.cs.washington.edu/bandes

3 of 86

Operating Systems Recap

  • Kernel/user space abstraction
    • Trusted intermediary providing security and fault isolation between untrusted (or weakly trusted) applications
    • Traps (system calls, exceptions, and interrupts) from untrusted code into kernel
  • Virtual machine abstraction
    • Host operating system, guest operating system
    • Software and hardware support for virtualization
  • Thread abstraction
    • Sequential execution of instructions within a thread, at variable speed
  • Synchronization
    • Data races, write buffering
    • Locks and condition variables: how to write correct concurrent programs

4 of 86

Operating Systems Recap

  • Deadlock
    • Necessary conditions for deadlock; ways to prevent deadlock
  • CPU scheduling
    • Shortest job first; round robin, fair queueing, multilevel feedback queues
  • Address Translation
    • Conversion of virtual to physical addresses using multi-level page tables
    • Translation Lookaside Buffers (TLBs)
  • Paging
    • Mechanisms to implement demand paging, copy on write
    • Least recently used and clock algorithm

5 of 86

Operating System Recap

  • Storage Technologies
    • Flash (SSD) storage with multi-block erasures and wearout; translation layer from virtual to physical block numbers
    • Magnetic (hard) disk storage; latency to physically move disk head to data
  • File Systems
    • Directory name translation to inode (file) #
    • Inode to data translation using extents, indirect/doubly indirect blocks
    • Transaction semantics and write ahead logging
  • Security
    • Theory: authentication, authorization, enforcement, auditing
    • Practice: all systems have bugs, particularly if adversary is malicious

6 of 86

The Quest for�Blazingly Fast LLM Serving

6

Baris Kasikci

7 of 86

7

Full stack co-design of adaptive systems

Deep integration of traditional/emerging software/hardware ecosystems

8 of 86

AI Assistants (ChatGPT, Google Bard)��Text-to-Image (DALL·E, MidJourney)��Creative Writing (Jasper, Copy.ai)��AI Coding Tools (GitHub Copilot, Replit)��Text-to-Speech & Audio (Descript, Synthesia)

8

LLMs Enable Many New Applications

9 of 86

9

Source: https://nerdynav.com/chatgpt-statistics/

LLM Serving Systems Face Surging Demand

10 of 86

Cost of Training and Serving LLMs is Massive

Large-scale investment in H100s in 2024

  • Meta: 300K1
  • Google: 150K1
  • Microsoft: 150K1
  • X: 85K2

NVIDIA H200 HGX Server

  • ~$250,000
  • High operating cost
    • Up to ~10,000 W
  • Long lead times

10

Batching user requests and aiming for optimal throughput are key

11 of 86

Prior Throughput Improvements

11

How do we approach optimal LLM serving throughput?

12 of 86

12

Compute

Network

Memory

Existing LLM Inference Engines

Request

Batch

GEMM, …

Decode

Attention,…

Activation

Communication, …

13 of 86

13

Compute

Network

Memory

Existing LLM Inference Engines

Request

Batch

GEMM, …

Decode

Attention,…

Activation

Communication, …

Batches processed sequentially by heterogeneous operations

14 of 86

14

Compute

Network

Memory

Existing LLM Inference Engines

Request

Batch

GEMM, …

Decode

Attention,…

Activation

Communication, …

Batches processed sequentially by heterogeneous operations

15 of 86

15

Compute

Network

Memory

Existing LLM Inference Engines

Request

Batch

GEMM, …

Decode

Attention,…

Activation

Communication, …

Batches processed sequentially by heterogeneous operations

Compute is the most-constrained resource in modern LLM serving systems

Most-constrained

resource

16 of 86

16

Existing LLM Inference Engines

Batches processed sequentially by heterogeneous operations

  • Operations have high resource utilization locally
    • Full access to GPU
    • ~100% compute utilization during compute-bound operations

Compute

Network

Memory

Request

Batch

~100% compute utilization

during this time

Most-constrained

resource

17 of 86

17

Existing LLM Inference Engines

Batches processed sequentially by heterogeneous operations

  • Operations have high resource utilization locally
    • Full access to GPU
    • ~100% compute utilization during compute-bound operations
  • End-to-end resource utilization is low
    • 40% compute utilization end-to-end

Compute

Network

Memory

Request

Batch

Utilization of compute, the most-constrained resource, is not maximized

~40% compute utilization end-to-end

Most-constrained

resource

18 of 86

NanoFlow: Towards Optimal Large Language Model Serving Throughput

Kan Zhu, Yufei Gao, Yilong Zhao, Liangyu Zhao, Gefei Zuo, Yile Gu,

Dedong Xie, Qinyu Xu, Tian Tang, Zihao Ye, Keisuke Kamahori,

Chien-Yu Lin, Ziren Wang, Stephanie Wang, Arvind Krishnamurthy, Baris Kasikci

18

[1] https://github.com/efeslab/Nanoflow

19 of 86

Extra weight loading from nano-batches is acceptable due to overall compute-boundness

Compute

Network

Memory

Request

Batch

C1

C2

M1

M2

N1

N2

B1

B2

Nano-batches

Nano-operations

: ran with intra-device parallelism

20 of 86

20

C1

C2

M1

M2

N1

N2

B1

B2

Nano-batches

Nano-operations

: ran with intra-device parallelism

Increased end-to-end compute utilization

1.91x throughput gain compared to state-of-the-art systems

Nanoflow automatically generates and executes optimized pipelines

  • Challenging due to the size of the search space (models, hardware, requests)

21 of 86

21

What is the optimal LLM serving throughput?

Design of Nanoflow

Operation Classification

Evaluation

Resource Bottleneck Analysis

Nano-batching

Auto-search

Runtime

22 of 86

22

What is the optimal LLM serving throughput?

Design of Nanoflow

Operation Classification

Evaluation

Resource Bottleneck Analysis

Nano-batching

Auto-search

Runtime

23 of 86

23

Decode

Decode

Decode

Transformer

Prefill

The

University

of

Washington

is

is

in

in

Seattle

located

located

Seattle

Prefill and Decode Phases of the Transformer

24 of 86

24

Transformer Layers and Iterations

Inference Iteration

Inference Layer

Output Token

Activations

25 of 86

25

+

+

Decoder-Only Transformer

26 of 86

Matching Operation Needs With GPU Resources

26

Compute

Memory

Bandwidth

Network

Bandwidth

W_Q

W_K

W_V

W_O

W_Up

W_Gate

W_Down

Decode

Attention

Prefill

Attention

Attention

All-gather

FFN

All-reduce

27 of 86

27

What is the optimal LLM serving throughput?

Design of Nanoflow

Operation Classification

Evaluation

Resource Bottleneck Analysis

Nano-batching

Auto-search

Runtime

28 of 86

28

What is the optimal LLM serving throughput?

Design of Nanoflow

Operation Classification

Evaluation

Resource Bottleneck Analysis

Nano-batching

Auto-search

Runtime

29 of 86

What Is the Main Resource Bottleneck of LLM Serving?

Model sizes are vastly larger than prior DNNs’

  • Deepseek-V3: 671B parameters (~1.5 TB VMRAM requirement for FP16)
  • H200 has 141GB VRAM

Self-Attention is memory-intensive during decoding

  • Key-Value (KV) Cache
    • Scales linearly with the context length
    • Size can surpass that of model weights1

29

[1] Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, and Song Han. Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference, 2024

LLM serving is assumed to be memory-bound

30 of 86

Memory-Centric Execution Time Estimation*

30

 

GPU

memory

Weight

Input/

Actv.

KV cache

Entire contents of the GPU memory would be loaded once during one iteration

* We assume operation at maximum batch size, which is necessary for optimal throughput

31 of 86

Compute-Centric Execution Time Estimation

31

Weight

Output

Output

Output

K

Bdense

K

N

K1

Bdense

N1

Every GEMM compute =

K2

Bdense

N2

K3

Bdense

N3

Ki

Bdense

Ni

+

+

+

+

Bdense

Pmodel

2

2

2

2

2

K

Bdense

N

2

All Dense Operations

 

32 of 86

Network-Centric Execution Time Estimation

32

A

B

C

D

A

A

A

B

B

B

C

C

C

D

D

D

NGPU

All-Gather Recap

(NGPU – 1)

# of hops per AllGather

(NGPU – 1): # of hops

33 of 86

Network-Centric Execution Time Estimation

33

All-Reduce ≈ 2 x All-Gather

(NGPU – 1): # of hops

4: All-Gathers/layer

34 of 86

Network-Centric Execution Time Estimation

34

(NGPU – 1): # of hops

Layer l

Layer l+1

4: All-Gathers/layer

[Bdense, Dmodel]

Bdense x Dmodel: shape

Stype: data type size

Each item

of size Stype

35 of 86

Network-Centric Execution Time Estimation

35

(NGPU – 1): # of hops

4: All-Gathers/layer

Bdense x Dmodel: shape

Stype: data type size

 

 

36 of 86

Execution Times wrt Different Resources

36

 

 

 

37 of 86

Execution Times wrt Different Resources

37

 

 

 

38 of 86

Compute vs Network

38

LLM Serving is more compute-bound than network-bound

 

 

Shift to

Compute

Bound

 

 

Shift to

Network

Bound

 

 

Not

Relevant

 

39 of 86

Compute vs Network

39

40 of 86

Execution Times wrt Different Resources

40

 

 

 

41 of 86

Execution Times wrt Different Resources

41

 

 

42 of 86

Compute vs Memory

42

 

 

43 of 86

Grouped Query Attention1

43

GPU

memory

Weight

Input

KV cache

[1] GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, Sumit Sanghai

H1

Attention Head

H2

H3

H4

HN

44 of 86

Grouped Query Attention1

44

GPU

memory

Weight

Input

KV cache

H1

H2

H3

H4

HN

Attention Head

[1] GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, Sumit Sanghai

45 of 86

Grouped Query Attention1

45

GPU

memory

Weight

Input

KV cache

H1

H2

H3

H4

HN

Group size: 4

GQA allows increasing the batch size by a factor of the group size

X

H4N

[1] GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, Sumit Sanghai

46 of 86

Compute vs Memory

46

 

 

GQA allows for a large batch size

Compute

Memory

Model sizes tend to increase

Batches with large decode requests increase memory accesses

LLM serving is more compute-bound than memory-bound

47 of 86

Compute vs Memory - Validation

47

High Utilization of GPU compute is the key for LLM serving

48 of 86

Revisit the performance of existing frameworks

48

There is a large gap to optimal throughput

49 of 86

49

What is the optimal LLM serving throughput?

Design of Nanoflow

Operation Classification

Evaluation

Resource Bottleneck Analysis

Nano-batching

Auto-search

Runtime

50 of 86

50

What is the optimal LLM serving throughput?

Design of Nanoflow

Operation Classification

Evaluation

Resource Bottleneck Analysis

Nano-batching

Auto-search

Runtime

51 of 86

51

Decoder-Only Transformer

Dependency

52 of 86

Nano-batching via Nano-operations

52

Up Gate

Input Batch

Dependency

Down

Input Batch

Up Gate 1

Input Batch

Up Gate 2

Input Batch

Down 1

Input Batch

Down 2

Input Batch

Dependency

Dependency

Can overlap

Pay extra weight loading for overlapping opportunities

53 of 86

Nano-batch Design Space

53

Up Gate

Input

Number of nano-batches

Up Gate

Input

Up Gate

Input

Up Gate

Input

Up Gate

Input

54 of 86

Nano-batch Design Space

54

Sizes of Nano-batches

Up Gate

Input

Up Gate

Input

Up Gate

Input

Up Gate

Input

55 of 86

Nano-batch Design Space

55

Ordering of Nano-batches

Up Gate 1

Input

Up Gate 2

Input

Down 1

Input

Down 2

Input

Up Gate 1

Input

Up Gate 2

Input

Down 1

Input

Down 2

Input

56 of 86

56

What is the optimal LLM serving throughput?

Design of Nanoflow

Operation Classification

Evaluation

Resource Bottleneck Analysis

Nano-batching

Auto-search

Runtime

57 of 86

57

What is the optimal LLM serving throughput?

Design of Nanoflow

Operation Classification

Evaluation

Resource Bottleneck Analysis

Nano-batching

Auto-search

Runtime

58 of 86

Auto-Search Stage 1: Pipeline Search

58

Inputs

Dependency

DAG

 

Goal: Minimize execution time

Constraints

Outputs

 

 

Ordering of same type

nano-ops

1. Obey operation dependencies

2. Same type operations do not overlap

Simplifications

1. Only consider three layers 2. Nano-batch partitioning is symmetric across layers 3. No interference

59 of 86

Design Space For Overlapping Nano-Ops

59

GPU resource allocation

Up Gate 2

Decode Attention

ComputeOnly

Memory

Only

Overlap

Plan A

Overlap

Plan B

Up Gate 2

Decode Attention

Up Gate 2

Decode Attention

60 of 86

Profiling interference is challenging

  • Combinational profiling space
    • 1000 GEMM
    • 100 GEMV
    • 100 Network
    • In total 1000*100*100 configurations

  • GEMM performance matters more
    • The pipeline is compute-bound
    • Overlapping between GEMM & GEMV, GEMM & Network dominates

60

Profile pair-wise interference for simplification

61 of 86

Profiling Interference

  • GEMM
    • Tile size
    • Split K
    • Number of warps
  • GEMV
    • Number of thread blocks
    • Loading pipeline stages

61

62 of 86

Modeling Interference

62

63 of 86

Interference Example

63

GEMM

GEMV

Network

100%

0%

0%

64 of 86

64

GEMM

GEMV

Network

80%

0%

0%

Interference Example

65 of 86

65

GEMM

GEMV

Network

80%

0%

0%

Interference Example

66 of 86

Interference Example

66

GEMM

GEMV

Network

80%

30%

0%

67 of 86

67

GEMM

GEMV

Network

60%

30%

50%

Interference Example

68 of 86

Auto-Search Stage 2: Resource allocation

68

Goal: Minimize execution time

Constraints

Outputs

1. Obey dependencies and orderings

2. Sum of R ≤ 1

Simplifications

1. Only consider three layers 2. R is symmetric across layers

Ordering

Inputs

Dependency

 

 

 

 

69 of 86

KQV1

O2

Up Gate Down 2

KQV2

Decode Attention 1

PF1

O1

Up Gate Down 1

KQV1

Decode Attention 2

PF2

O2

Up Gate Down 2

Decode Attention 1

KQV1

Decode Attention 1

PF1

O1

Up Gate Down 1

KQV1

KQV2

Decode Attention 2

PF2

O2

Up Gate Down 2

KQV2

KQV

Decode Attention

PF

O

Up Gate Down

KQV

KQV1

KQV2

Decode Attention 1

Decode Attention 2

PF1

PF2

Up Gate Down 1

O1

O2

Up Gate Down 2

KQV1

8B Pipeline Design

WASTED

Memory-bound

Compute-bound

Compute-bound kernels fully occupy the GPU

Reorder

Overlap

Split

70 of 86

70B – Nanoflow Pipeline

70

O.AR2

R=0.1

KQV1

R=0.4

KQV2

R=0.4

KQV3

R=0.4

KQV4

R=0.4

DecAttn1

R=0.4

DecAttn2

R=0.4

DecAttn3

R=0.4

DecAttn4

R=0.4

O.AG1

R=0.2

Attn.AG1

R=0.2

O1

R=0.6

O2

R=0.8

Up,Gate,Down (UGD) 1

R=0.9

Up,Gate,Down (UGD) 2

R=0.9

UGD.AR1

R=0.1

PF1

R=0.6

UGD.AR2

R=0.2

UGD.AR3

R=0.2

Layer

Attn.AG2

AG to AR Transformation

UGD.AR2

R=0.2

UGD.AR3

R=0.2

KQV1

R=0.4

KQV2

R=0.4

KQV3

R=0.4

DecAttn1

R=0.4

DecAttn2

R=0.4

Prefill Attention

71 of 86

71

What is the optimal LLM serving throughput?

Design of Nanoflow

Operation Classification

Evaluation

Resource Bottleneck Analysis

Nano-batching

Auto-search

Runtime

72 of 86

72

What is the optimal LLM serving throughput?

Design of Nanoflow

Operation Classification

Evaluation

Resource Bottleneck Analysis

Nano-batching

Auto-search

Runtime

73 of 86

Batch Formation

73

Decode Request 1

Decode Request 2

Decode Request 5

Prefill Request 1

Prefill Request 2

Fixed Budget

Next round

Nanoflow forms constant batches for consistent performance

74 of 86

Asynchronous Scheduling

74

Batch formation and KV cache management are moved out of the critical path

75 of 86

Asynchronous Scheduling

75

Batch formation and KV cache management are moved out of the critical path

76 of 86

76

What is the optimal LLM serving throughput?

Design of Nanoflow

Operation Classification

Evaluation

Resource Bottleneck Analysis

Nano-batching

Auto-search

Runtime

77 of 86

77

What is the optimal LLM serving throughput?

Design of Nanoflow

Operation Classification

Evaluation

Resource Bottleneck Analysis

Nano-batching

Auto-search

Runtime

78 of 86

Evaluation: End-to-End Throughput

78

1.73x const length I/O throughput gain

1.91x dataset throughput gain

68.5% of optimal throughput

79 of 86

Evaluation: Latency

79

Splitwise

ShareGPT

Mlsys

Similar (but a bit higher) latency at

low request rate

Processes 1.64x requests at

200ms SLO

Low tail latency

80 of 86

Evaluation – Nanoflow Pipelines with Different Models

80

Porting Nanoflow to other models demonstrates 50.4% to 78.5% of optimal throughput

81 of 86

NanoFlow as a Research Oriented Serving Engine

AMD

Nvidia

AMD

Nvidia

AMD

Nvidia

KQV

Rope

Dec Attn

O

Prefill Attn

Attn All Gather

Model Level

Operation

Level

Impl

Level

GEMM

Attention

AllGather

Torch

FlashAttention

FlashInfer

Torch

Triton

CUBLAS

RocBLAS

CUTLASS

CK

MSCCL++

NCCL

RCCL

Torch

Triton

User

Lib

82 of 86

Example: CPU Code Execution

82

CPU path

GPU path

83 of 86

NanoFlow as a Research-Oriented Serving Engine

83

Small Code Base

  • Only Representative Models
  • Lower Backward Combability Pressure

Performance Centric

  • Support Custom Kernels
  • Automatically Search for the Best Implementation

Heterogenous Support

  • CPU-GPU, GPU-GPU Cooperation
  • Assist Pipeline Design with Auto-search

84 of 86

84

Nano-batched pipeline design

Automated search & efficient runtime

1.91x throughput gain

50.4-78.5% optimal throughput

Paper

Github

baris@cs.washington.edu

85 of 86

Evaluation - Resource usage

Non-overlap baseline

85

Nanoflow utilizes multiple resources concurrently, achieving higher average compute usage

Nanoflow

86 of 86

Evaluation - Ablation study

86

Overlapping network: 1.07x speedup

Overlapping memory: 1.10x speedup

Overall, 1.17x speedup