1 of 58

First Meetup @ SF

Oct 5th, 2023

@ a16z San Francisco Office

2 of 58

About us


Woosuk Kwon

PhD Student, UC Berkeley

@woosuk_k

Zhuohan Li

PhD Student, UC Berkeley

@zhuohan123

3 of 58

Thanks to our sponsors!

for venue & open-source AI grant

for catering & contributions to vLLM

4 of 58

vLLM Overview

Woosuk Kwon

5 of 58

Our Goal

Build the fastest and easiest-to-use open-source LLM inference & serving engine

6 of 58

PagedAttention

An attention algorithm that allows for storing continuous keys and values in non-contiguous memory space.


7 of 58

Manage KV cache like OS virtual memory

[Figure: Request A's logical KV blocks ("Alan Turing is a | computer scientist and mathematician | renowned ...") and Request B's logical KV blocks ("Artificial Intelligence is | the future of technology") are each mapped through a per-request block table onto physical KV blocks that need not be contiguous in GPU memory.]
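To make the mapping concrete, here is a minimal Python sketch of the idea (hypothetical data structures, not vLLM's actual code): a per-request block table translates logical block indices into physical block indices, so a request's KV cache grows one fixed-size block at a time without needing contiguous memory.

BLOCK_SIZE = 16  # tokens per KV block

class BlockTable:
    """Hypothetical sketch: map one request's logical blocks to physical blocks."""
    def __init__(self, free_blocks):
        self.free_blocks = free_blocks       # pool shared by all requests
        self.logical_to_physical = []        # index = logical block number

    def append_token(self, num_tokens_so_far):
        # Allocate a new physical block only when the previous one is full.
        if num_tokens_so_far % BLOCK_SIZE == 0:
            self.logical_to_physical.append(self.free_blocks.pop())

    def physical_slot(self, token_idx):
        # Where the KV vectors for this token position live in the cache.
        block = self.logical_to_physical[token_idx // BLOCK_SIZE]
        return block * BLOCK_SIZE + token_idx % BLOCK_SIZE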

8 of 58

vLLM GitHub Repo

$ pip install vllm

7.9K Stars

Official release!

9 of 58

vLLM API (1): LLM class


from vllm import LLM

# Example prompts.

prompts = ["Hello, my name is", "The capital of France is"]

# Create an LLM with HF model name.

llm = LLM(model="meta-llama/Llama-2-7b-hf")

# Generate texts from the prompts.

outputs = llm.generate(prompts)

A Python interface for offline batched inference
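For completeness, a minimal sketch of inspecting the results; SamplingParams and the output fields below follow vLLM's public API around v0.2, so treat the exact attribute names as assumptions:

from vllm import LLM, SamplingParams

prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=32)

llm = LLM(model="meta-llama/Llama-2-7b-hf")
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    # Each RequestOutput carries the prompt and one or more completions.
    print(output.prompt, "->", output.outputs[0].text)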

10 of 58

vLLM API (2): OpenAI-compatible server

A FastAPI-based server for online serving

Server:

$ python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-2-7b-hf

Client:

$ curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-2-7b-hf",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0
    }'
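Because the server speaks the OpenAI completions protocol, the official openai Python client can also be pointed at it; a minimal sketch assuming the pre-1.0 openai package interface:

import openai

# Point the client at the local vLLM server instead of api.openai.com.
openai.api_base = "http://localhost:8000/v1"
openai.api_key = "EMPTY"  # placeholder; the local server typically does not require a real key

completion = openai.Completion.create(
    model="meta-llama/Llama-2-7b-hf",
    prompt="San Francisco is a",
    max_tokens=7,
    temperature=0,
)
print(completion.choices[0].text)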

11 of 58

vLLM Adopters

Open-source projects: lm-sys/FastChat, WizardLM, allenai/open-instruct, ...

Companies: [logos]

12 of 58

“Talk is cheap. Show me the code.”

[Screenshots: how lm-sys/FastChat, WizardLM, and allenai/open-instruct integrate vLLM in their code]

13 of 58

Contributors


Thanks to all the contributors who raised issues, participated in discussions, and submitted PRs!

14 of 58

vLLM System Walkthrough

Zhuohan Li

15 of 58

Goal of the walkthrough

  1. Understand how vLLM processes a request and generates its outputs. This is background for the later content.
  2. Learn where in the codebase to make changes if you would like to contribute a specific modification or feature.

16 of 58

User Interface

Developer Interface (vllm/engine)

  • LLMEngine (vllm/engine/llm_engine.py): synchronous
    • def add_request()
    • def abort_request()
    • def step() → List[RequestOutput]
      • Streams one token of output for a batch of requests
  • AsyncLLMEngine (vllm/engine/async_llm_engine.py): asynchronous, for user custom servers
    • async def generate()
    • async def abort()
    • + background engine loop

End-user Interface (vllm/entrypoints)

  • LLM (vllm/entrypoints/llm.py): batched inference
  • api_server (vllm/entrypoints/api_server.py): simple demo API server
  • openai_api_server (vllm/entrypoints/openai/api_server.py): OpenAI-compatible API server
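To see how the synchronous pieces fit together, here is a minimal sketch of driving LLMEngine directly; the add_request/step signatures follow vLLM around v0.2, so treat the exact argument names as assumptions:

from vllm import EngineArgs, LLMEngine, SamplingParams

engine = LLMEngine.from_engine_args(EngineArgs(model="meta-llama/Llama-2-7b-hf"))
engine.add_request(request_id="0",
                   prompt="The future of Artificial Intelligence",
                   sampling_params=SamplingParams(max_tokens=32))

# step() advances every in-flight request by one token and returns
# one RequestOutput per request; finished outputs carry the full text.
while engine.has_unfinished_requests():
    for output in engine.step():
        if output.finished:
            print(output.outputs[0].text)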

17 of 58

LLMEngine (vllm/engine/llm_engine.py) ← requests (prompts)

Centralized Controller (same process as LLMEngine, CPU only):
  • Scheduler (vllm/core/scheduler.py)
  • BlockSpaceManager (vllm/core/block_manager.py)
  • BlockAllocator (GPU/CPU) (vllm/core/block_manager.py)

Distributed Workers (distributed processes with GPUs), N ×:
  • Worker (vllm/worker/worker.py)
  • CacheEngine (vllm/worker/cache_engine.py)
  • Worker.model (vllm/model_executor/models)
  • PagedAttention (vllm/model_executor/layers/attention.py)

18 of 58

Before serving requests: 1. Initialize & load model

Worker

vllm/worker/worker.py

Worker.model

vllm/model_executor/models

LLMEngine

vllm/engine/llm_engine.py

model.load_weights

vllm/model_executor/models

Initialize & load from HuggingFace Model Hub

× N

19 of 58

Before serving requests: 2. Profile memory usage

Worker

vllm/worker/worker.py

Worker.model

vllm/model_executor/models

LLMEngine

vllm/engine/llm_engine.py

Worker.profile_num_available_blocks

vllm/worker/worker.py

Profile memory usage with a batch containing the maximum possible number of tokens.

#GPU KV blocks = remaining memory / block byte size

× N
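As an illustration of the block-count formula, with made-up but representative numbers for a Llama-2-7B-style model in fp16 (the real profiler measures activation memory rather than assuming it):

# Hypothetical configuration: 2 vectors (K and V) per token per layer.
dtype_bytes  = 2      # fp16
block_size   = 16     # tokens per KV block
num_layers   = 32
num_kv_heads = 32
head_dim     = 128

block_bytes = 2 * block_size * num_layers * num_kv_heads * head_dim * dtype_bytes
print(block_bytes)                 # 8,388,608 bytes = 8 MiB per block

# Suppose profiling leaves 10 GiB of GPU memory after weights + activations:
remaining = 10 * 1024**3
print(remaining // block_bytes)    # 1280 GPU KV blocks ≈ 20,480 cached tokens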

20 of 58

Before serving requests: 3. Pre-allocate KV Blocks

LLMEngine

vllm/engine/llm_engine.py

Scheduler

vllm/core/scheduler.py

BlockSpaceManager

vllm/core/block_manager.py

Worker

vllm/worker/worker.py

× N

CacheEngine

vllm/worker/cache_engine.py

#GPU KV blocks

BlockAllocator (GPU/CPU)

vllm/core/block_manager.py

#GPU KV blocks

Initialize block table

Allocate KV cache memory

21 of 58

When requests arrive


LLMEngine

vllm/engine/llm_engine.py

Request prompt: “The future of Artificial Intelligence”

Scheduler

vllm/core/scheduler.py

1. Tokenization: [1, 450, 5434, 310, 3012, 928, 616, 3159, 28286] (see the tokenizer sketch below)

2. Add to the scheduler’s waiting queue

  • Waiting request queue
  • Running request queue
  • Swapped request queue
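A minimal sketch of the tokenization step using the Hugging Face tokenizer for this model (the IDs in the comment are the ones shown above):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
token_ids = tokenizer.encode("The future of Artificial Intelligence")
# e.g. [1, 450, 5434, 310, 3012, 928, 616, 3159, 28286]  (IDs from the slide)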

22 of 58

At every step, the scheduler


Scheduler

vllm/core/scheduler.py

Decide the set of requests to run at the current step.

  • When there is free KV block memory: waiting → running
  • When no KV block memory is available for new tokens (see the sketch below):
    • Swapping: running → swapped
    • Recomputation: running → waiting

BlockSpaceManager

vllm/core/block_manager.py

BlockAllocator (GPU/CPU)

vllm/core/block_manager.py

  • Allocate space for new KV Cache.
  • Handle block sharing.
  • Swap/delete preempted KV blocks.

Cache instructions & Block table
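A rough pseudocode sketch of the policy above (hypothetical helper names; the real Scheduler also tracks sequence groups, priorities, and swapped-queue re-admission):

def admit_waiting(waiting, running, block_manager):
    # When there is free KV block memory: waiting -> running.
    while waiting and block_manager.can_allocate(waiting[0]):
        req = waiting.pop(0)
        block_manager.allocate(req)        # reserve blocks for the prompt
        running.append(req)

def preempt_one(running, swapped, waiting, block_manager):
    # When no KV block is available for new tokens, preempt the most
    # recently admitted running request.
    victim = running.pop()
    if block_manager.can_swap_out(victim):
        block_manager.swap_out(victim)     # swapping: running -> swapped
        swapped.append(victim)
    else:
        block_manager.free(victim)         # recomputation: running -> waiting
        waiting.insert(0, victim)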

23 of 58

At every step, the worker


Worker.model

vllm/model_executor/models

Token IDs & Block Table

  • Execute the model with tensor parallelism

CacheEngine

vllm/worker/cache_engine.py

Cache instructions

  • Swap blocks between GPU & CPU.
  • Perform copy-on-write for shared blocks.

Worker

vllm/worker/worker.py

N ×

24 of 58

At every step, the model


  • A PyTorch model (tensor model parallel shard)

Token IDs & Block Table

PagedAttention

vllm/model_executor/layers/attention.py

  • xformers/FlashAttention for prompt
  • PagedAttention kernel for generation

ParallelLinear

vllm/parallel_utils/layers.py

QuantizedLinear

vllm/model_executor/layers/quantized_linear/

Faster kernels (e.g. LayerNorm)

vllm/model_executor/layers/…

Optimized kernels for other layers & distributed execution

Sampler

vllm/model_executor/layers/sampler.py

Greedy/Random/Beam search → Generated tokens

Worker.model

vllm/model_executor/models

25 of 58

LLMEngine (vllm/engine/llm_engine.py) ← requests (prompts)

Centralized Controller (same process as LLMEngine, CPU only):
  • Scheduler (vllm/core/scheduler.py)
  • BlockSpaceManager (vllm/core/block_manager.py)
  • BlockAllocator (GPU/CPU) (vllm/core/block_manager.py)

Distributed Workers (distributed processes with GPUs), N ×:
  • Worker (vllm/worker/worker.py)
  • CacheEngine (vllm/worker/cache_engine.py)
  • Worker.model (vllm/model_executor/models)
  • PagedAttention (vllm/model_executor/layers/attention.py)

Each step, the workers return 1 new token ID for each request; the engine detokenizes it, streams the results, and feeds unfinished requests back into the next step.

26 of 58

Summary

  • Core component: LLMEngine
  • Centralized controller → Distributed workers
  • Scheduler prepares the requests at each step
  • Workers run the model with PagedAttention


27 of 58

vLLM Recent Updates

Woosuk Kwon

28 of 58

Our journey since the project release


1. Supporting new models

2. Optimizing performance

3. Ensuring correctness & robustness

4. Improving usability

5. Code gardening & refactoring

29 of 58

Our journey since the project release


1. Supporting new models

2. Optimizing performance

3. Ensuring correctness & robustness

4. Improving usability

5. Code gardening & refactoring

30 of 58

Supported models


LLaMA/Vicuna/LLaMA-2/CodeLlama

BLOOM

OPT

GPT-J/NeoX

Mistral

GPT

StarCoder

MPT

Falcon

Contributed by MistralAI

Supported from Day 1

+ 4 more architectures

+ RoPE scaling

+ Quantization

31 of 58

Our journey since the project release


1. Supporting new models

2. Optimizing performance

3. Ensuring correctness & robustness

4. Improving usability

5. Code gardening & refactoring

32 of 58

Optimization (1): Efficient de-tokenization

  Token IDs: [1, 450, 5434, 310, 319, 29902, 338, 5921]
  Tokens:    [<s>, The, future, of, A, I, is, promise]
  Text:      “The future of AI is promise”

33 of 58

Optimization (1): Efficient de-tokenization

Reduces CPU overhead by caching and incremental updates

  Token IDs: [1, 450, 5434, 310, 319, 29902, 338, 5921]
  Tokens:    [<s>, The, future, of, A, I, is] (cached) + [promise]
  Text:      “The future of AI is” (cached) + “ promise” → “The future of AI is promise”

34 of 58

Optimization (1): Efficient de-tokenization

Reduces CPU overhead by caching and incremental updates

  New token ID: 292 → “-ing”
  Token IDs: [1, 450, 5434, 310, 319, 29902, 338, 5921, 292]
  Tokens:    [<s>, The, future, of, A, I, is, promise] + [-ing]
  Text:      “The future of AI is promising” (only the tail around “promise” + “-ing” is re-decoded)
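A minimal sketch of the incremental idea (not vLLM's exact implementation): keep the text decoded so far and re-decode only a short trailing window of token IDs, because pieces like "promise" + "-ing" only merge correctly when decoded together.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

def append_token(token_ids, cached_text, new_id, window=4):
    """Return updated token IDs, updated text, and the newly added piece."""
    token_ids = token_ids + [new_id]
    # Decode the last few tokens without and with the new ID, and append
    # only the difference to the cached text.
    old_tail = tokenizer.decode(token_ids[-window:-1])
    new_tail = tokenizer.decode(token_ids[-window:])
    delta = new_tail[len(old_tail):]
    return token_ids, cached_text + delta, delta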

35 of 58

Optimization (2): Vectorized sampler

[Figure: Iterative (previous) vs. vectorized (current) sampling. Previously, each sequence was sampled one at a time: its own logits row went through separate torch.multinomial() / torch.topk() calls driven by its SamplingParams (e.g. n, log_probs), producing its next_token_ids and log_probs individually. Now the logits of all sequences are stacked into one tensor and sampled with a single batched set of calls, producing next_token_ids and log_probs for every sequence at once.]
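A minimal sketch of the difference, assuming plain multinomial sampling with a shared temperature (the real sampler also handles per-sequence n, top-k/top-p, and log-prob collection):

import torch

logits = torch.randn(8, 32000)     # [num_seqs, vocab_size]
probs = torch.softmax(logits, dim=-1)

# Iterative (previous): one small sampling call per sequence.
next_ids_iter = [torch.multinomial(probs[i], num_samples=1) for i in range(probs.size(0))]

# Vectorized (current): a single batched call over all sequences.
next_ids_vec = torch.multinomial(probs, num_samples=1)   # [num_seqs, 1]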

36 of 58

Performance improvements

v0.2.0: Up to 66% throughput increase compared to v0.1.7

[Chart: serving throughput of v0.2.0 vs. v0.1.7, with 18%↑ and 42%↑ gains annotated on the benchmarked configurations]

37 of 58

Our journey since the project release


1. Supporting new models

2. Optimizing performance

3. Ensuring correctness & robustness

4. Improving usability

5. Code gardening & refactoring

38 of 58

Ensuring correctness is hard

  • Because of its custom CUDA kernels, vLLM cannot always ensure bit-wise correctness
  • As a result, vLLM can sometimes generate different outputs from reference implementations

  Ref. logits:  -0.88   0.10   0.13   -0.06
  vLLM logits:  -0.85   0.10   0.17   -0.09

  Ref. output:  “The future of AI is unpredictable but full of potential.”
  vLLM output:  “The future of AI is promising and transformative.”

39 of 58

Tests for ensuring correctness


1. Unit tests for op-level correctness

2. End-to-end tests for certain cases

  • Ex) argmax sampling, beam search

3. Accuracy tests on standard benchmarks

40 of 58

Accuracy tests

  • Benchmark: AI2 Reasoning Challenge (ARC)*


41 of 58

Correctness & robustness


Making AsyncLLMEngine robust

  • PR #880, #943, #969, #988, #1059, #1102
  • Special thanks to Antoni @ Anyscale

Making beam search algorithm compatible with HF Transformers

  • PR #646, #847

Bug fixes for corner cases

  • PR #936, #1004, #1154, #1241

42 of 58

Our journey since the project release


1. Supporting new models

2. Optimizing performance

3. Ensuring correctness & robustness

4. Improving usability

5. Code gardening & refactoring

43 of 58

Pre-built CUDA binaries (since v0.1.4)

  • No need for compilation
  • Fast installation
    • 12 secs for installing vLLM
    • 2 mins for end-to-end installation


44 of 58

Integration with serving libraries

Inference engine: vLLM

Serving libraries on top: OpenLLM, API Server, NVIDIA Triton, . . .

The serving library layer adds observability, model versioning, autoscaling, . . .

45 of 58

Our journey since the project release


1. Supporting new models

2. Optimizing performance

3. Ensuring correctness & robustness

4. Improving usability

5. Code gardening & refactoring

46 of 58

Maintaining simple and clean code

Goal: Making vLLM easy to understand, hack, and contribute to

  • High standard for code complexity
  • Refactoring
    • Scheduler (PR #658, #1251)
    • Tensor parallelism (PR #1181)
    • Quantization (PR #926, #1032)
    • Ray utils (PR #397)
    • AsyncLLMEngine (PR #880)

47 of 58

vLLM Roadmap

Woosuk Kwon & Zhuohan Li

48 of 58

vLLM will remain open-source

Apache 2.0 License


49 of 58

Future Roadmap (Overview)


Optimizing latency

Better quantization support

CUDA-level optimizations

50 of 58

Optimizing latency (1): Lightweight runtime

[Figure: CPU/GPU timeline over time. Each torch.nn.Linear / torch.nn.GELU call spends CPU time in Python/PyTorch dispatch before its matmul / gelu kernel launches, leaving the GPU idle between kernels.]

Python/PyTorch overhead takes up to 50% of overall latency

51 of 58

Optimizing latency (1): Lightweight runtime

[Figure: With a lightweight runtime driving the same matmul / gelu kernels from the CPU side, the launch overhead shrinks and the GPU idle gaps disappear.]

Two solutions
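For illustration only: one well-known way to cut this per-op launch overhead is CUDA Graphs in PyTorch, which captures a sequence of kernels once and replays it with a single launch. The slides do not say which runtime design vLLM will adopt, so treat this purely as a sketch of the general technique (toy model, hypothetical shapes):

import torch

model = torch.nn.Sequential(torch.nn.Linear(4096, 4096), torch.nn.GELU()).cuda().half()
static_x = torch.zeros(8, 4096, device="cuda", dtype=torch.half)

# Warm up on a side stream, then capture the forward pass into a graph.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        model(static_x)
torch.cuda.current_stream().wait_stream(s)

graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_y = model(static_x)

# At run time: copy new inputs into the static buffer and replay the whole
# kernel sequence with one launch, skipping per-op Python/PyTorch dispatch.
static_x.copy_(torch.randn_like(static_x))
graph.replay()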

52 of 58

Optimizing latency (2): Multi-process architecture

Single-process architecture (current): request handling, tokenization, model execution, de-tokenization, and token streaming all run in one process.

Multi-process architecture: the request handler, tokenizer, model executor, de-tokenizer, and token streamer each run in their own process (Processes 1-5), handing work to the next stage via asynchronous sends.
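A minimal sketch of the pipelined idea using Python multiprocessing queues (hypothetical stage functions, not vLLM code): each stage pulls from its inbox, does its work, and asynchronously hands the result to the next stage.

import multiprocessing as mp

def stage(work_fn, inbox, outbox):
    # Generic pipeline stage: transform items until the sentinel (None) arrives.
    for item in iter(inbox.get, None):
        outbox.put(work_fn(item))
    outbox.put(None)

def tokenize(prompt):  return f"tokens({prompt})"      # placeholder work
def run_model(tokens): return f"token_ids({tokens})"   # placeholder work
def detokenize(ids):   return f"text({ids})"           # placeholder work

if __name__ == "__main__":
    queues = [mp.Queue() for _ in range(4)]
    workers = [mp.Process(target=stage, args=(fn, queues[i], queues[i + 1]))
               for i, fn in enumerate([tokenize, run_model, detokenize])]
    for w in workers:
        w.start()
    queues[0].put("The future of Artificial Intelligence")
    queues[0].put(None)                  # sentinel: shut the pipeline down
    print(queues[-1].get())              # streamed result from the last stage
    for w in workers:
        w.join()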

53 of 58

Optimizing latency (3): Speculative decoding

Small model writes a draft → Large model verifies it

[Figure: Draft: a small model (#parameters: N) generates “Several famous songs are ...” autoregressively (sequentially). Verification: a large model (#parameters: 10×N) checks the drafted tokens non-autoregressively (in parallel). Figure adapted from Kim et al., “Speculative Decoding with Big Little Decoder” (NeurIPS 2023).]
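A rough sketch of the draft-then-verify loop (greedy variant for clarity; real speculative decoding uses a rejection-sampling rule to preserve the large model's distribution, and the model methods below are hypothetical):

def speculative_step(prompt_ids, small_model, large_model, k=4):
    # 1. Draft: the small model proposes k tokens autoregressively.
    draft, ctx = [], list(prompt_ids)
    for _ in range(k):
        t = small_model.next_token(ctx)
        draft.append(t)
        ctx.append(t)

    # 2. Verify: the large model scores prompt + draft in ONE parallel pass;
    #    preds[j] is its predicted next token after position j.
    preds = large_model.next_tokens_for_all_positions(prompt_ids + draft)

    # 3. Accept drafted tokens while they match; on the first mismatch,
    #    take the large model's token instead and stop.
    accepted = []
    for i, t in enumerate(draft):
        expected = preds[len(prompt_ids) + i - 1]
        if t == expected:
            accepted.append(t)
        else:
            accepted.append(expected)
            break
    return prompt_ids + accepted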

54 of 58

Better quantization support


Step 1: Develop a general abstraction for diverse quantization methods.

Step 2: Implement diverse quantization methods.

  • [WIP] GPTQ, SmoothQuant, SqueezeLLM
  • [TODO] GGML, GGUF, . . .

Step 3: Optimize the performance of quantized ops.

55 of 58

CUDA-level optimization: PagedAttention kernel

PagedAttention takes 20-40% of overall running time


Optimization 1: Better work partitioning

(leveraging sequence-level parallelism)

Optimization 2: Efficient MQA/GQA support

56 of 58

Second-priority items


* Figure adapted from Hu et al., “LoRA: Low-Rank Adaptation of Large Language Models” (ICLR 2022)

(Efficient) LoRA support*

AMD GPU support

Mixture-of-Experts

Pipeline parallelism

. . .


57 of 58

Q&A on future roadmap (5 min)

  • Latency optimization
    • Lightweight runtime
    • Multi-process architecture
    • Speculative decoding
  • Better quantization support
  • CUDA-level optimization
  • (Efficient) LoRA support
  • AMD support
  • Mixture-of-experts
  • Pipeline parallelism

First-Priority Features

Second-Priority Features


58 of 58

vLLM Networking Hour!

@woosuk_k

@zhuohan123

Please fill out our survey before you leave: