1 of 58

First Meetup @ SF

Oct 5th, 2023

@ a16z San Francisco Office

2 of 58

About us


Woosuk Kwon

PhD Student, UC Berkeley

@woosuk_k

Zhuohan Li

PhD Student, UC Berkeley

@zhuohan123

3 of 58

Thanks to our sponsors!

for venue & open-source AI grant

for catering & contributions to vLLM

4 of 58

vLLM Overview

Woosuk Kwon

5 of 58

Our Goal

Build the fastest and easiest-to-use open-source LLM inference & serving engine

6 of 58

PagedAttention

An attention algorithm that allows for storing continuous keys and values in non-contiguous memory space.


7 of 58

Manage KV cache like OS virtual memory

[Figure: Request A's logical KV blocks ("Alan Turing is a | computer scientist and mathematician | renowned ...") and Request B's logical KV blocks ("Artificial Intelligence is | the future of technology") are each mapped through a per-request block table onto physical KV blocks that need not be contiguous in GPU memory.]
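To make the mapping concrete, here is a minimal Python sketch of the idea (hypothetical data structures, not vLLM's actual code): a per-request block table translates logical block indices into physical block indices, so a request's KV cache grows one fixed-size block at a time without needing contiguous memory.

BLOCK_SIZE = 16  # tokens per KV block

class BlockTable:
    """Hypothetical sketch: map one request's logical blocks to physical blocks."""
    def __init__(self, free_blocks):
        self.free_blocks = free_blocks       # pool shared by all requests
        self.logical_to_physical = []        # index = logical block number

    def append_token(self, num_tokens_so_far):
        # Allocate a new physical block only when the previous one is full.
        if num_tokens_so_far % BLOCK_SIZE == 0:
            self.logical_to_physical.append(self.free_blocks.pop())

    def physical_slot(self, token_idx):
        # Where the KV vectors for this token position live in the cache.
        block = self.logical_to_physical[token_idx // BLOCK_SIZE]
        return block * BLOCK_SIZE + token_idx % BLOCK_SIZE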

8 of 58

vLLM GitHub Repo

$ pip install vllm

7.9K Stars

Official release!

9 of 58

vLLM API (1): LLM class


from vllm import LLM

# Example prompts.

prompts = ["Hello, my name is", "The capital of France is"]

# Create an LLM with HF model name.

llm = LLM(model="meta-llama/Llama-2-7b-hf")

# Generate texts from the prompts.

outputs = llm.generate(prompts)

A Python interface for offline batched inference
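For completeness, a minimal sketch of inspecting the results; SamplingParams and the output fields below follow vLLM's public API around v0.2, so treat the exact attribute names as assumptions:

from vllm import LLM, SamplingParams

prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=32)

llm = LLM(model="meta-llama/Llama-2-7b-hf")
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    # Each RequestOutput carries the prompt and one or more completions.
    print(output.prompt, "->", output.outputs[0].text)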

10 of 58

vLLM API (2): OpenAI-compatible server

A FastAPI-based server for online serving

Server:

$ python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-2-7b-hf

Client:

$ curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-2-7b-hf",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0
    }'
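Because the server speaks the OpenAI completions protocol, the official openai Python client can also be pointed at it; a minimal sketch assuming the pre-1.0 openai package interface:

import openai

# Point the client at the local vLLM server instead of api.openai.com.
openai.api_base = "http://localhost:8000/v1"
openai.api_key = "EMPTY"  # placeholder; the local server typically does not require a real key

completion = openai.Completion.create(
    model="meta-llama/Llama-2-7b-hf",
    prompt="San Francisco is a",
    max_tokens=7,
    temperature=0,
)
print(completion.choices[0].text)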

11 of 58

vLLM Adopters

Open-source projects: lm-sys/FastChat, WizardLM, allenai/open-instruct, ...

Companies: [logos]

12 of 58

“Talk is cheap. Show me the code.”

[Screenshots: how lm-sys/FastChat, WizardLM, and allenai/open-instruct integrate vLLM in their code]

13 of 58

Contributors


Thanks to all the contributors who raised issues, participated in discussions, and submitted PRs!

14 of 58

vLLM System Walkthrough

Zhuohan Li

15 of 58

Goal of the walkthrough

  1. Understand how vLLM processes a request and generates its outputs. This is background for the later content.
  2. Learn where in the codebase to make changes if you would like to contribute a specific modification or feature.

16 of 58

User Interface

Developer Interface (vllm/engine)

  • LLMEngine (vllm/engine/llm_engine.py): synchronous
    • def add_request()
    • def abort_request()
    • def step() → List[RequestOutput]
      • Streams one token of output for a batch of requests
  • AsyncLLMEngine (vllm/engine/async_llm_engine.py): asynchronous, for user custom servers
    • async def generate()
    • async def abort()
    • + background engine loop

End-user Interface (vllm/entrypoints)

  • LLM (vllm/entrypoints/llm.py): batched inference
  • api_server (vllm/entrypoints/api_server.py): simple demo API server
  • openai_api_server (vllm/entrypoints/openai/api_server.py): OpenAI-compatible API server
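To see how the synchronous pieces fit together, here is a minimal sketch of driving LLMEngine directly; the add_request/step signatures follow vLLM around v0.2, so treat the exact argument names as assumptions:

from vllm import EngineArgs, LLMEngine, SamplingParams

engine = LLMEngine.from_engine_args(EngineArgs(model="meta-llama/Llama-2-7b-hf"))
engine.add_request(request_id="0",
                   prompt="The future of Artificial Intelligence",
                   sampling_params=SamplingParams(max_tokens=32))

# step() advances every in-flight request by one token and returns
# one RequestOutput per request; finished outputs carry the full text.
while engine.has_unfinished_requests():
    for output in engine.step():
        if output.finished:
            print(output.outputs[0].text)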

17 of 58

LLMEngine (vllm/engine/llm_engine.py) ← requests (prompts)

Centralized Controller (same process as LLMEngine, CPU only):
  • Scheduler (vllm/core/scheduler.py)
  • BlockSpaceManager (vllm/core/block_manager.py)
  • BlockAllocator (GPU/CPU) (vllm/core/block_manager.py)

Distributed Workers (distributed processes with GPUs), N ×:
  • Worker (vllm/worker/worker.py)
  • CacheEngine (vllm/worker/cache_engine.py)
  • Worker.model (vllm/model_executor/models)
  • PagedAttention (vllm/model_executor/layers/attention.py)

18 of 58

Before serving requests: 1. Initialize & load model

Worker

vllm/worker/worker.py

Worker.model

vllm/model_executor/models

LLMEngine

vllm/engine/llm_engine.py

model.load_weights

vllm/model_executor/models

Initialize & load from HuggingFace Model Hub

× N

19 of 58

Before serving requests: 2. Profile memory usage

Worker

vllm/worker/worker.py

Worker.model

vllm/model_executor/models

LLMEngine

vllm/engine/llm_engine.py

Worker.profile_num_available_blocks

vllm/worker/worker.py

Profile memory usage with a batch containing the maximum possible number of tokens.

#GPU KV blocks = remaining memory / block byte size

× N
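As an illustration of the block-count formula, with made-up but representative numbers for a Llama-2-7B-style model in fp16 (the real profiler measures activation memory rather than assuming it):

# Hypothetical configuration: 2 vectors (K and V) per token per layer.
dtype_bytes  = 2      # fp16
block_size   = 16     # tokens per KV block
num_layers   = 32
num_kv_heads = 32
head_dim     = 128

block_bytes = 2 * block_size * num_layers * num_kv_heads * head_dim * dtype_bytes
print(block_bytes)                 # 8,388,608 bytes = 8 MiB per block

# Suppose profiling leaves 10 GiB of GPU memory after weights + activations:
remaining = 10 * 1024**3
print(remaining // block_bytes)    # 1280 GPU KV blocks ≈ 20,480 cached tokens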

20 of 58

Before serving requests: 3. Pre-allocate KV Blocks

LLMEngine

vllm/engine/llm_engine.py

Scheduler

vllm/core/scheduler.py

BlockSpaceManager

vllm/core/block_manager.py

Worker

vllm/worker/worker.py

× N

CacheEngine

vllm/worker/cache_engine.py

#GPU KV blocks

BlockAllocator (GPU/CPU)

vllm/core/block_manager.py

#GPU KV blocks

Initialize block table

Allocate KV cache memory

21 of 58

When requests arrive


LLMEngine

vllm/engine/llm_engine.py

Request prompt: “The future of Artificial Intelligence”

Scheduler

vllm/core/scheduler.py

1. Tokenization: [1, 450, 5434, 310, 3012, 928, 616, 3159, 28286] (see the tokenizer sketch below)

2. Add to the scheduler’s waiting queue

  • Waiting request queue
  • Running request queue
  • Swapped request queue
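A minimal sketch of the tokenization step using the Hugging Face tokenizer for this model (the IDs in the comment are the ones shown above):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
token_ids = tokenizer.encode("The future of Artificial Intelligence")
# e.g. [1, 450, 5434, 310, 3012, 928, 616, 3159, 28286]  (IDs from the slide)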

22 of 58

At every step, the scheduler


Scheduler

vllm/core/scheduler.py

Decide the set of requests to run at the current step.

  • When there is free KV block memory: waiting → running
  • When no KV block memory is available for new tokens (see the sketch below):
    • Swapping: running → swapped
    • Recomputation: running → waiting

BlockSpaceManager

vllm/core/block_manager.py

BlockAllocator (GPU/CPU)

vllm/core/block_manager.py

  • Allocate space for new KV Cache.
  • Handle block sharing.
  • Swap/delete preempted KV blocks.

Cache instructions & Block table
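A rough pseudocode sketch of the policy above (hypothetical helper names; the real Scheduler also tracks sequence groups, priorities, and swapped-queue re-admission):

def admit_waiting(waiting, running, block_manager):
    # When there is free KV block memory: waiting -> running.
    while waiting and block_manager.can_allocate(waiting[0]):
        req = waiting.pop(0)
        block_manager.allocate(req)        # reserve blocks for the prompt
        running.append(req)

def preempt_one(running, swapped, waiting, block_manager):
    # When no KV block is available for new tokens, preempt the most
    # recently admitted running request.
    victim = running.pop()
    if block_manager.can_swap_out(victim):
        block_manager.swap_out(victim)     # swapping: running -> swapped
        swapped.append(victim)
    else:
        block_manager.free(victim)         # recomputation: running -> waiting
        waiting.insert(0, victim)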

23 of 58

At every step, the worker


Worker.model

vllm/model_executor/models

Token IDs & Block Table

  • Execute the model with tensor parallelism

CacheEngine

vllm/worker/cache_engine.py

Cache instructions

  • Swap blocks between GPU & CPU.
  • Perform copy-on-write for shared blocks.

Worker

vllm/worker/worker.py

N ×

24 of 58

At every step, the model


  • A PyTorch model (tensor model parallel shard)

Token IDs & Block Table

PagedAttention

vllm/model_executor/layers/attention.py

  • xformers/FlashAttention for prompt
  • PagedAttention kernel for generation

ParallelLinear

vllm/parallel_utils/layers.py

QuantizedLinear

vllm/model_executor/layers/quantized_linear/

Faster kernels (e.g. LayerNorm)

vllm/model_executor/layers/…

Optimized kernels for other layers & distributed execution

Sampler

vllm/model_executor/layers/sampler.py

Greedy/Random/Beam search → Generated tokens

Worker.model

vllm/model_executor/models

25 of 58

LLMEngine (vllm/engine/llm_engine.py) ← requests (prompts)

Centralized Controller (same process as LLMEngine, CPU only):
  • Scheduler (vllm/core/scheduler.py)
  • BlockSpaceManager (vllm/core/block_manager.py)
  • BlockAllocator (GPU/CPU) (vllm/core/block_manager.py)

Distributed Workers (distributed processes with GPUs), N ×:
  • Worker (vllm/worker/worker.py)
  • CacheEngine (vllm/worker/cache_engine.py)
  • Worker.model (vllm/model_executor/models)
  • PagedAttention (vllm/model_executor/layers/attention.py)

Each step, the workers return 1 new token ID for each request; the engine detokenizes it, streams the results, and feeds unfinished requests back into the next step.

26 of 58

Summary

  • Core component: LLMEngine
  • Centralized controller → Distributed workers
  • Scheduler prepares the requests at each step
  • Workers run the model with PagedAttention


27 of 58

vLLM Recent Updates

Woosuk Kwon

28 of 58

Our journey since the project release


1. Supporting new models

2. Optimizing performance

3. Ensuring correctness & robustness

4. Improving usability

5. Code gardening & refactoring

29 of 58

Our journey since the project release


1. Supporting new models

2. Optimizing performance

3. Ensuring correctness & robustness

4. Improving usability

5. Code gardening & refactoring

30 of 58

Supported models


LLaMA/Vicuna/LLaMA-2/CodeLlama

BLOOM

OPT

GPT-J/NeoX

Mistral

GPT

StarCoder

MPT

Falcon

Contributed by MistralAI

Supported from Day 1

+ 4 more architectures

+ RoPE scaling

+ Quantization

31 of 58

Our journey since the project release


1. Supporting new models

2. Optimizing performance

3. Ensuring correctness & robustness

4. Improving usability

5. Code gardening & refactoring

32 of 58

Optimization (1): Efficient de-tokenization

  Token IDs: [1, 450, 5434, 310, 319, 29902, 338, 5921]
  Tokens:    [<s>, The, future, of, A, I, is, promise]
  Text:      “The future of AI is promise”

33 of 58

Optimization (1): Efficient de-tokenization

Reduces CPU overhead by caching and incremental updates

  Token IDs: [1, 450, 5434, 310, 319, 29902, 338, 5921]
  Tokens:    [<s>, The, future, of, A, I, is] (cached) + [promise]
  Text:      “The future of AI is” (cached) + “ promise” → “The future of AI is promise”

34 of 58

Optimization (1): Efficient de-tokenization

Reduces CPU overhead by caching and incremental updates

  New token ID: 292 → “-ing”
  Token IDs: [1, 450, 5434, 310, 319, 29902, 338, 5921, 292]
  Tokens:    [<s>, The, future, of, A, I, is, promise] + [-ing]
  Text:      “The future of AI is promising” (only the tail around “promise” + “-ing” is re-decoded)
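A minimal sketch of the incremental idea (not vLLM's exact implementation): keep the text decoded so far and re-decode only a short trailing window of token IDs, because pieces like "promise" + "-ing" only merge correctly when decoded together.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

def append_token(token_ids, cached_text, new_id, window=4):
    """Return updated token IDs, updated text, and the newly added piece."""
    token_ids = token_ids + [new_id]
    # Decode the last few tokens without and with the new ID, and append
    # only the difference to the cached text.
    old_tail = tokenizer.decode(token_ids[-window:-1])
    new_tail = tokenizer.decode(token_ids[-window:])
    delta = new_tail[len(old_tail):]
    return token_ids, cached_text + delta, delta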

35 of 58

Optimization (2): Vectorized sampler

[Figure: Iterative (previous) vs. vectorized (current) sampling. Previously, each sequence was sampled one at a time: its own logits row went through separate torch.multinomial() / torch.topk() calls driven by its SamplingParams (e.g. n, log_probs), producing its next_token_ids and log_probs individually. Now the logits of all sequences are stacked into one tensor and sampled with a single batched set of calls, producing next_token_ids and log_probs for every sequence at once.]
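A minimal sketch of the difference, assuming plain multinomial sampling with a shared temperature (the real sampler also handles per-sequence n, top-k/top-p, and log-prob collection):

import torch

logits = torch.randn(8, 32000)     # [num_seqs, vocab_size]
probs = torch.softmax(logits, dim=-1)

# Iterative (previous): one small sampling call per sequence.
next_ids_iter = [torch.multinomial(probs[i], num_samples=1) for i in range(probs.size(0))]

# Vectorized (current): a single batched call over all sequences.
next_ids_vec = torch.multinomial(probs, num_samples=1)   # [num_seqs, 1]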

36 of 58

Performance improvements

v0.2.0: Up to 66% throughput increase compared to v0.1.7

[Chart: serving throughput of v0.2.0 vs. v0.1.7, with 18%↑ and 42%↑ gains annotated on the benchmarked configurations]

37 of 58

Our journey since the project release


1. Supporting new models

2. Optimizing performance

3. Ensuring correctness & robustness

4. Improving usability

5. Code gardening & refactoring

38 of 58

Ensuring correctness is hard

  • Because of its custom CUDA kernels, vLLM cannot always ensure bit-wise correctness
  • As a result, vLLM can sometimes generate different outputs from reference implementations

  Ref. logits:  -0.88   0.10   0.13   -0.06
  vLLM logits:  -0.85   0.10   0.17   -0.09

  Ref. output:  “The future of AI is unpredictable but full of potential.”
  vLLM output:  “The future of AI is promising and transformative.”

39 of 58

Tests for ensuring correctness


1. Unit tests for op-level correctness

2. End-to-end tests for certain cases

  • Ex) argmax sampling, beam search

3. Accuracy tests on standard benchmarks

40 of 58

Accuracy tests

  • Benchmark: AI2 Reasoning Challenge (ARC)*


41 of 58

Correctness & robustness


Making AsyncLLMEngine robust

  • PR #880, #943, #969, #988, #1059, #1102
  • Special thanks to Antoni @ Anyscale

Making beam search algorithm compatible with HF Transformers

  • PR #646, #847

Bug fixes for corner cases

  • PR #936, #1004, #1154, #1241

42 of 58

Our journey since the project release


1. Supporting new models

2. Optimizing performance

3. Ensuring correctness & robustness

4. Improving usability

5. Code gardening & refactoring

43 of 58

Pre-built CUDA binaries (since v0.1.4)

  • No need for compilation
  • Fast installation
    • 12 secs for installing vLLM
    • 2 mins for end-to-end installation


44 of 58

Integration with serving libraries

Inference engine: vLLM

Serving libraries on top: OpenLLM, API Server, NVIDIA Triton, . . .

The serving library layer adds observability, model versioning, autoscaling, . . .

45 of 58

Our journey since the project release


1. Supporting new models

2. Optimizing performance

3. Ensuring correctness & robustness

4. Improving usability

5. Code gardening & refactoring

46 of 58

Maintaining simple and clean code

Goal: Making vLLM easy to understand, hack, and contribute to

  • High standard for code complexity
  • Refactoring
    • Scheduler (PR #658, #1251)
    • Tensor parallelism (PR #1181)
    • Quantization (PR #926, #1032)
    • Ray utils (PR #397)
    • AsyncLLMEngine (PR #880)

47 of 58

vLLM Roadmap

Woosuk Kwon & Zhuohan Li

48 of 58

vLLM will remain open-source

Apache 2.0 License


49 of 58

Future Roadmap (Overview)


Optimizing latency

Better quantization support

CUDA-level optimizations

50 of 58

Optimizing latency (1): Lightweight runtime

[Figure: CPU/GPU timeline over time. Each torch.nn.Linear / torch.nn.GELU call spends CPU time in Python/PyTorch dispatch before its matmul / gelu kernel launches, leaving the GPU idle between kernels.]

Python/PyTorch overhead takes up to 50% of overall latency

51 of 58

Optimizing latency (1): Lightweight runtime

[Figure: With a lightweight runtime driving the same matmul / gelu kernels from the CPU side, the launch overhead shrinks and the GPU idle gaps disappear.]

Two solutions
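For illustration only: one well-known way to cut this per-op launch overhead is CUDA Graphs in PyTorch, which captures a sequence of kernels once and replays it with a single launch. The slides do not say which runtime design vLLM will adopt, so treat this purely as a sketch of the general technique (toy model, hypothetical shapes):

import torch

model = torch.nn.Sequential(torch.nn.Linear(4096, 4096), torch.nn.GELU()).cuda().half()
static_x = torch.zeros(8, 4096, device="cuda", dtype=torch.half)

# Warm up on a side stream, then capture the forward pass into a graph.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        model(static_x)
torch.cuda.current_stream().wait_stream(s)

graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_y = model(static_x)

# At run time: copy new inputs into the static buffer and replay the whole
# kernel sequence with one launch, skipping per-op Python/PyTorch dispatch.
static_x.copy_(torch.randn_like(static_x))
graph.replay()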

52 of 58

Optimizing latency (2): Multi-process architecture

Single-process architecture (current): request handling, tokenization, model execution, de-tokenization, and token streaming all run in one process.

Multi-process architecture: the request handler, tokenizer, model executor, de-tokenizer, and token streamer each run in their own process (Processes 1-5), handing work to the next stage via asynchronous sends.
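A minimal sketch of the pipelined idea using Python multiprocessing queues (hypothetical stage functions, not vLLM code): each stage pulls from its inbox, does its work, and asynchronously hands the result to the next stage.

import multiprocessing as mp

def stage(work_fn, inbox, outbox):
    # Generic pipeline stage: transform items until the sentinel (None) arrives.
    for item in iter(inbox.get, None):
        outbox.put(work_fn(item))
    outbox.put(None)

def tokenize(prompt):  return f"tokens({prompt})"      # placeholder work
def run_model(tokens): return f"token_ids({tokens})"   # placeholder work
def detokenize(ids):   return f"text({ids})"           # placeholder work

if __name__ == "__main__":
    queues = [mp.Queue() for _ in range(4)]
    workers = [mp.Process(target=stage, args=(fn, queues[i], queues[i + 1]))
               for i, fn in enumerate([tokenize, run_model, detokenize])]
    for w in workers:
        w.start()
    queues[0].put("The future of Artificial Intelligence")
    queues[0].put(None)                  # sentinel: shut the pipeline down
    print(queues[-1].get())              # streamed result from the last stage
    for w in workers:
        w.join()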

53 of 58

Optimizing latency (3): Speculative decoding

Small model writes a draft → Large model verifies it

[Figure: Draft: a small model (#parameters: N) generates “Several famous songs are ...” autoregressively (sequentially). Verification: a large model (#parameters: 10×N) checks the drafted tokens non-autoregressively (in parallel). Figure adapted from Kim et al., “Speculative Decoding with Big Little Decoder” (NeurIPS 2023).]
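A rough sketch of the draft-then-verify loop (greedy variant for clarity; real speculative decoding uses a rejection-sampling rule to preserve the large model's distribution, and the model methods below are hypothetical):

def speculative_step(prompt_ids, small_model, large_model, k=4):
    # 1. Draft: the small model proposes k tokens autoregressively.
    draft, ctx = [], list(prompt_ids)
    for _ in range(k):
        t = small_model.next_token(ctx)
        draft.append(t)
        ctx.append(t)

    # 2. Verify: the large model scores prompt + draft in ONE parallel pass;
    #    preds[j] is its predicted next token after position j.
    preds = large_model.next_tokens_for_all_positions(prompt_ids + draft)

    # 3. Accept drafted tokens while they match; on the first mismatch,
    #    take the large model's token instead and stop.
    accepted = []
    for i, t in enumerate(draft):
        expected = preds[len(prompt_ids) + i - 1]
        if t == expected:
            accepted.append(t)
        else:
            accepted.append(expected)
            break
    return prompt_ids + accepted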

54 of 58

Better quantization support


Step 1: Develop a general abstraction for diverse quantization methods.

Step 2: Implement diverse quantization methods.

  • [WIP] GPTQ, SmoothQuant, SqueezeLLM
  • [TODO] GGML, GGUF, . . .

Step 3: Optimize the performance of quantized ops.

55 of 58

CUDA-level optimization: PagedAttention kernel

PagedAttention takes 20-40% of overall running time


Optimization 1: Better work partitioning

(leveraging sequence-level parallelism)

Optimization 2: Efficient MQA/GQA support

56 of 58

Second-priority items


* Figure adapted from Hu et al., “LoRA: Low-Rank Adaptation of Large Language Models” (ICLR 2022)

(Efficient) LoRA support*

AMD GPU support

Mixture-of-Experts

Pipeline parallelism

. . .


57 of 58

Q&A on future roadmap (5 min)

  • Latency optimization
    • Lightweight runtime
    • Multi-process architecture
    • Speculative decoding
  • Better quantization support
  • CUDA-level optimization
  • (Efficient) LoRA support
  • AMD support
  • Mixture-of-experts
  • Pipeline parallelism

First-Priority Features

Second-Priority Features


58 of 58

vLLM Networking Hour!

@woosuk_k

@zhuohan123

Please fill out our survey before you leave: