First Meetup @ SF
Oct 5th, 2023
@ a16z San Francisco Office
About us
2
Woosuk Kwon
PhD Student, UC Berkeley
@woosuk_k
Zhuohan Li
PhD Student, UC Berkeley
@zhuohan123
Thanks to our sponsors!
3
for venue & open-source AI grant
for catering & contributions to vLLM
vLLM Overview
Woosuk Kwon
Our Goal
Build the fastest and
easiest-to-use open-source
LLM inference & serving engine
5
PagedAttention
An attention algorithm that allows for storing continuous keys and values in non-contiguous memory space.
6
Manage KV cache like OS virtual memory
7
[Figure: Request A's logical KV blocks hold “Alan Turing is a computer scientist and mathematician renowned …” and Request B's hold “Artificial Intelligence is the future of technology …”; each request's block table maps its logical blocks onto scattered, non-contiguous physical KV blocks.]
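As a rough sketch of the idea (illustrative toy code, not vLLM's BlockSpaceManager), each request keeps a block table mapping its logical block numbers to physical block numbers handed out by a shared allocator:

# Toy sketch of paged KV-cache bookkeeping (illustrative only, not vLLM code).
BLOCK_SIZE = 4  # tokens per KV block, as in the figure

class BlockAllocator:
    def __init__(self, num_physical_blocks: int):
        self.free_blocks = list(range(num_physical_blocks))

    def allocate(self) -> int:
        return self.free_blocks.pop()  # any free physical block will do

class BlockTable:
    """Maps a request's logical blocks to non-contiguous physical blocks."""
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.physical_blocks = []  # index = logical block number
        self.num_tokens = 0

    def append_token(self):
        if self.num_tokens % BLOCK_SIZE == 0:
            # Current block is full (or this is the first token): grab a new one.
            self.physical_blocks.append(self.allocator.allocate())
        self.num_tokens += 1

allocator = BlockAllocator(num_physical_blocks=16)
request_a = BlockTable(allocator)
for _ in "Alan Turing is a computer scientist".split():
    request_a.append_token()
print(request_a.physical_blocks)  # e.g. [15, 14]: two non-contiguous physical blocks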
vLLM GitHub Repo
8
$ pip install vllm
7.9K Stars
Official release!
vLLM API (1): LLM class
9
from vllm import LLM
# Example prompts.
prompts = ["Hello, my name is", "The capital of France is"]
# Create an LLM with HF model name.
llm = LLM(model="meta-llama/Llama-2-7b-hf")
# Generate texts from the prompts.
outputs = llm.generate(prompts)
# Print the generated texts.
for output in outputs:
    print(output.prompt, output.outputs[0].text)
A Python interface for offline batched inference
vLLM API (2): OpenAI-compatible server
10
$ python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-2-7b-hf
$ curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-2-7b-hf",
"prompt": "San Francisco is a",
"max_tokens": 7,
"temperature": 0
}'
A FastAPI-based server for online serving
Server
Client
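Because the server speaks the OpenAI API, the official openai Python client (the pre-1.0 interface current at the time) can simply be pointed at it. A sketch, assuming the server above is running locally:

import openai

# Point the client at the local vLLM server instead of api.openai.com.
openai.api_key = "EMPTY"  # the demo server does not check the key
openai.api_base = "http://localhost:8000/v1"

completion = openai.Completion.create(
    model="meta-llama/Llama-2-7b-hf",
    prompt="San Francisco is a",
    max_tokens=7,
    temperature=0,
)
print(completion.choices[0].text)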
vLLM Adopters
11
Open-Source Projects: lm-sys/FastChat, WizardLM, allenai/open-instruct, …
Companies: …
“Talk is cheap. Show me the code.”
12
[Code snippets: how lm-sys/FastChat, WizardLM, allenai/open-instruct, and other projects use vLLM]
Contributors
13
…
Thanks to all the contributors who raised issues, participated in discussions, and submitted PRs!
vLLM System Walkthrough
Zhuohan Li
Goal of the walkthrough
15
User Interface
16
LLMEngine
vllm/engine/llm_engine.py
AsyncLLMEngine
vllm/engine/async_llm_engine.py
Synchronous
Asynchronous
Developer Interface (vllm/engine)
User custom server
def add_request()
def abort_request()
def step() → List[RequestOutput]
async def generate()
async def abort()
+ background engine loop
LLM
vllm/entrypoints/llm.py
api_server
vllm/entrypoints/api_server.py
openai_api_server
vllm/entrypoints/openai/api_server.py
End-user Interface (vllm/entrypoints)
Batched inference
Simple demo API server
OpenAI-compatible API server
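Roughly, a user custom server drives the synchronous engine like this (a condensed sketch of the interface listed above; error handling and arguments simplified):

from vllm import EngineArgs, LLMEngine, SamplingParams

# Build the engine directly instead of going through the LLM entrypoint.
engine = LLMEngine.from_engine_args(EngineArgs(model="meta-llama/Llama-2-7b-hf"))

engine.add_request(request_id="0",
                   prompt="The capital of France is",
                   sampling_params=SamplingParams(temperature=0.0))

# step() runs one scheduler + model iteration and returns RequestOutputs.
while engine.has_unfinished_requests():
    for output in engine.step():
        if output.finished:
            print(output.outputs[0].text)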
17
LLMEngine
vllm/engine/llm_engine.py
requests (prompts)
Scheduler
vllm/core/scheduler.py
BlockSpaceManager
vllm/core/block_manager.py
BlockAllocator (GPU/CPU)
vllm/core/block_manager.py
Worker
vllm/worker/worker.py
N ×
CacheEngine
vllm/worker/cache_engine.py
Worker.model
vllm/model_executor/models
PagedAttention
vllm/model_executor/layers/attention.py
…
Centralized Controller
(same process as LLMEngine, CPU only)
Distributed Workers (distributed processes with GPUs)
Before serving requests: 1. Initialize & load model
18
Worker
vllm/worker/worker.py
Worker.model
vllm/model_executor/models
LLMEngine
vllm/engine/llm_engine.py
model.load_weights
vllm/model_executor/models
Initialize & load from HuggingFace Model Hub
× N
Before serving requests: 2. Profile memory usage
19
Worker
vllm/worker/worker.py
Worker.model
vllm/model_executor/models
LLMEngine
vllm/engine/llm_engine.py
Worker.profile_num_available_blocks
vllm/worker/worker.py
Profile the memory usage with a batch of the maximum possible number of tokens.
#GPU KV blocks = remaining memory / block byte size
× N
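For intuition, a back-of-the-envelope version of that formula (illustrative numbers assuming a Llama-2-7B-like model in fp16 with 16-token blocks; not the actual profiling code):

# block byte size = 2 (K and V) * tokens/block * layers * heads * head_size * bytes/elem
block_size, num_layers, num_heads, head_size, dtype_bytes = 16, 32, 32, 128, 2
block_bytes = 2 * block_size * num_layers * num_heads * head_size * dtype_bytes   # 8 MiB

free_memory = 20 * 1024**3          # say 20 GiB remain after weights + peak activations
num_gpu_kv_blocks = free_memory // block_bytes
print(num_gpu_kv_blocks)            # 2560 blocks -> 2560 * 16 = 40960 tokens of KV cache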
Before serving requests: 3. Pre-allocate KV blocks
20
LLMEngine
vllm/engine/llm_engine.py
Scheduler
vllm/core/scheduler.py
BlockSpaceManager
vllm/core/block_manager.py
Worker
vllm/worker/worker.py
× N
CacheEngine
vllm/worker/cache_engine.py
#GPU KV blocks
BlockAllocator (GPU/CPU)
vllm/core/block_manager.py
#GPU KV blocks
Initialize block table
Allocate KV cache memory
When requests arrive
21
LLMEngine
vllm/engine/llm_engine.py
Request prompt: “The future of Artificial Intelligence”
Scheduler
vllm/core/scheduler.py
1. Tokenization: [1, 450, 5434, 310, 3012, 928, 616, 3159, 28286]
2. Add to the scheduler’s waiting queue
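The tokenization step is just the model's HuggingFace tokenizer; a sketch that reproduces the IDs on this slide (assuming access to the Llama-2 tokenizer):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
token_ids = tokenizer("The future of Artificial Intelligence").input_ids
print(token_ids)  # the IDs shown above: [1, 450, 5434, 310, 3012, 928, 616, 3159, 28286]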
At every step, the scheduler
22
Scheduler
vllm/core/scheduler.py
Decide the set of requests to run at the current step.
BlockSpaceManager
vllm/core/block_manager.py
BlockAllocator (GPU/CPU)
vllm/core/block_manager.py
→ Cache instructions & Block table
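Conceptually (a toy policy, much simpler than the real Scheduler; blocks_needed is a hypothetical helper), the per-step decision is gated on whether free physical KV blocks are available for each request:

def schedule_step(running, waiting, free_gpu_blocks, blocks_needed):
    """Toy scheduling policy: keep requests whose next step fits in the free
    KV blocks; the rest stay queued (vLLM can also preempt or swap them)."""
    scheduled = []
    for request in running + waiting:
        need = blocks_needed(request)      # new blocks this request needs this step
        if need <= free_gpu_blocks:
            free_gpu_blocks -= need
            scheduled.append(request)
    return scheduled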
At every step, the worker
23
Worker.model
vllm/model_executor/models
Token IDs & Block Table
CacheEngine
vllm/worker/cache_engine.py
Cache instructions
Worker
vllm/worker/worker.py
N ×
At every step, the model
24
Token IDs & Block Table
PagedAttention
vllm/model_executor/layers/attention.py
ParallelLinear
vllm/parallel_utils/layers.py
QuantizedLinear
vllm/model_executor/layers/quantized_linear/
Faster kernels (e.g. LayerNorm)
vllm/model_executor/layers/…
Optimized kernels for other layers & distributed execution
Sampler
vllm/model_executor/layers/sampler.py
Greedy/Random/Beam search → Generated tokens
Worker.model
vllm/model_executor/models
25
LLMEngine
vllm/engine/llm_engine.py
requests (prompts)
Scheduler
vllm/core/scheduler.py
BlockSpaceManager
vllm/core/block_manager.py
BlockAllocator (GPU/CPU)
vllm/core/block_manager.py
Worker
vllm/worker/worker.py
N ×
CacheEngine
vllm/worker/cache_engine.py
Worker.model
vllm/model_executor/models
PagedAttention
vllm/model_executor/layers/attention.py
…
Centralized Controller
(same process as LLMEngine, CPU only)
Distributed Workers (distributed processes with GPUs)
unfinished requests
streaming results
detokenize
1 new token ID for each request
Summary
26
vLLM Recent Updates
Woosuk Kwon
Our journey since the project release
28
1. Supporting new models
2. Optimizing performance
3. Ensuring correctness & robustness
4. Improving usability
5. Code gardening & refactoring
Our journey since the project release
29
1. Supporting new models
2. Optimizing performance
3. Ensuring correctness & robustness
4. Improving usability
5. Code gardening & refactoring
Supported models
30
LLaMA/Vicuna/LLaMA-2/CodeLlama
BLOOM
OPT
GPT-J/NeoX
Mistral
GPT
StarCoder
MPT
Falcon
Contributed by MistralAI
Supported from Day 1
+ 4 more architectures
+ RoPE scaling
+ Quantization
Our journey since the project release
31
1. Supporting new models
2. Optimizing performance
3. Ensuring correctness & robustness
4. Improving usability
5. Code gardening & refactoring
Optimization (1): Efficient de-tokenization
32
Token IDs: 1 | 450 | 5434 | 310 | 319 | 29902 | 338 | 5921
Tokens:    <s> | The | future | of | A | I | is | promise
Text:      “The future of AI is promise”
Optimization (1): Efficient de-tokenization
Reduces CPU overhead by caching and incremental updates
33
Token IDs: 1 | 450 | 5434 | 310 | 319 | 29902 | 338 | 5921
Tokens:    <s> | The | future | of | A | I | is  (cached)  + promise (new)
Text:      “The future of AI is promise”  (cached prefix reused)
Optimization (1): Efficient de-tokenization
Reduces CPU overhead by caching and incremental updates
34
Token IDs: 1 | 450 | 5434 | 310 | 319 | 29902 | 338 | 5921 | 292 (new ID)
Tokens:    <s> | The | future | of | A | I | is | promise | -ing
Text:      “The future of AI is promising”
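One common way to implement this caching (a simplified sketch in the spirit of these slides, not vLLM's exact code), assuming a HuggingFace tokenizer: keep two token offsets per sequence and re-detokenize only a small tail window, emitting just the newly produced characters.

def detokenize_incrementally(tokenizer, token_strs, prefix_offset, read_offset):
    # Text of the tokens that have already been emitted as output.
    prefix_text = tokenizer.convert_tokens_to_string(token_strs[prefix_offset:read_offset])
    # Text of those tokens plus the newly generated one(s).
    new_text = tokenizer.convert_tokens_to_string(token_strs[prefix_offset:])
    if len(new_text) > len(prefix_text) and not new_text.endswith("\ufffd"):
        # Emit only the newly produced characters and advance the offsets.
        return new_text[len(prefix_text):], read_offset, len(token_strs)
    # Tail is an incomplete multi-byte character: emit nothing this step.
    return "", prefix_offset, read_offset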
Optimization (2): Vectorized sampler
Iterative (Previous)
Vectorized (Current)
35
[Diagram: Iterative (previous): each sequence carries its own SamplingParams (e.g. n, log_probs) and its own row of logits, and torch.multinomial() / torch.topk() are invoked once per sequence to produce that sequence's next_token_ids and log_probs. Vectorized (current): the logits of all sequences are stacked into one tensor, and a single batched torch.multinomial() / torch.topk() call produces the outputs for every sequence at once.]
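The gist of the change, as a toy example (not the real Sampler): the per-sequence calls to torch.multinomial() / torch.topk() are replaced by single batched calls over the stacked logits of all sequences.

import torch

logits = torch.tensor([[0.10, -0.20, 0.03, 0.13, -0.10],
                       [-0.30, 0.71, 0.11, -0.20, 0.58]])   # [num_seqs, vocab_size]

# One batched call for all sequences instead of one call per sequence.
probs = torch.softmax(logits, dim=-1)
next_token_ids = torch.multinomial(probs, num_samples=1)           # sampling
top_logprobs, top_ids = torch.topk(torch.log_softmax(logits, dim=-1), k=3, dim=-1)
print(next_token_ids.squeeze(-1).tolist(), top_ids.tolist())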
Performance improvements
v0.2.0: Up to 66% throughput increase compared to v0.1.7
36
[Benchmark charts: 18%↑ and 42%↑ throughput improvements shown]
Our journey since the project release
37
1. Supporting new models
2. Optimizing performance
3. Ensuring correctness & robustness
4. Improving usability
5. Code gardening & refactoring
Ensuring correctness is hard
38
Ref. logits:  -0.88 | 0.10 | … | 0.13 | -0.06
vLLM logits:  -0.85 | 0.10 | … | 0.17 | -0.09
Ref. output:  “The future of AI is unpredictable but full of potential.”
vLLM output:  “The future of AI is promising and transformative.”
Tests for ensuring correctness
39
1. Unit tests for op-level correctness
2. End-to-end tests for certain cases
3. Accuracy tests on standard benchmarks
Accuracy tests
40
* acc_norm with 25 shots, https://github.com/EleutherAI/lm-evaluation-harness
Correctness & robustness
41
Making AsyncLLMEngine robust
Our journey since the project release
42
1. Supporting new models
2. Optimizing performance
3. Ensuring correctness & robustness
4. Improving usability
5. Code gardening & refactoring
Pre-built CUDA binaries (since v0.1.4)
43
Integration with serving libraries
Inference Engine
Serving Library
44
OpenLLM
API Server
NVIDIA Triton
Observability
Model versioning
Autoscaling
. . .
Our journey since the project release
45
1. Supporting new models
2. Optimizing performance
3. Ensuring correctness & robustness
4. Improving usability
5. Code gardening & refactoring
Maintaining simple and clean code
46
Goal: Making vLLM easy to understand, hack, and contribute to
vLLM Roadmap
Woosuk Kwon & Zhuohan Li
vLLM will remain open-source
Apache 2.0 License
48
Future Roadmap (Overview)
49
Optimizing latency
Better quantization support
CUDA-level optimizations
Optimizing latency (1): Lightweight runtime
50
[Timeline: on the CPU, Python/PyTorch overhead surrounds each torch.nn.Linear / torch.nn.GELU call, so the GPU sits idle between the matmul and gelu kernels.]
Python/PyTorch overhead takes up to 50% of overall latency
Optimizing latency (1): Lightweight runtime
51
[Timeline: with a lightweight runtime driving the CPU side, the matmul and gelu kernels launch back-to-back and the GPU is never idle.]
Two solutions
Optimizing latency (2): Multi-process architecture
Single-process architecture
(Current)
Multi-process architecture
52
[Diagram: in the single-process architecture, request handling, tokenization, model execution, de-tokenization, and token streaming all share one process. In the multi-process architecture they are split across five processes (Process 1: request handler, Process 2: tokenizer, Process 3: model executor, Process 4: de-tokenizer, Process 5: token streamer) connected by asynchronous sends.]
Optimizing latency (3): Speculative decoding
Draft
Verification
Small model writes a draft → Large model verifies it
53
[Diagram: Draft: the small model (#parameters: N) generates “Several famous songs are” token by token, autoregressively (sequential). Verification: the large model (#parameters: 10×N) checks all drafted tokens in one parallel, non-autoregressive pass and emits the next token, “composed”.]
* Figure adapted from Kim et al., “Speculative Decoding with Big Little Decoder” (NeurIPS 2023)
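In pseudocode (a conceptual sketch only; small_model and large_model are hypothetical greedy next-token functions, with large_model returning one prediction per position in a single parallel pass):

def speculative_step(small_model, large_model, tokens, k=4):
    # 1) Draft: the small model proposes k tokens one-by-one (sequential).
    draft = []
    for _ in range(k):
        draft.append(small_model(tokens + draft))
    # 2) Verify: the large model scores the whole draft in one parallel pass.
    #    predictions[j] = large model's next token given the first j+1 tokens.
    predictions = large_model(tokens + draft)
    n = len(tokens)
    accepted = []
    for i, tok in enumerate(draft):
        if predictions[n - 1 + i] != tok:
            # First mismatch: take the large model's token here and stop.
            accepted.append(predictions[n - 1 + i])
            break
        accepted.append(tok)
    else:
        # Every draft token matched, so the large model contributes one extra token.
        accepted.append(predictions[n - 1 + k])
    return tokens + accepted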
Better quantization support
54
Step 1: Develop a general abstraction for diverse quantization methods.
Step 2: Implement diverse quantization methods.
Step 3: Optimize the performance of quantized ops.
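A purely hypothetical sketch of what such an abstraction could look like (class names invented for illustration; not a vLLM interface): each quantization method contributes a config object plus a drop-in linear layer behind one base class.

from abc import ABC, abstractmethod
import torch
import torch.nn as nn

class QuantizationMethod(ABC):
    """Hypothetical base class: one subclass per quantization scheme."""
    @abstractmethod
    def name(self) -> str: ...

    @abstractmethod
    def create_linear(self, in_features: int, out_features: int) -> nn.Module: ...

class NoQuant(QuantizationMethod):
    """Trivial fallback that keeps weights in fp16."""
    def name(self) -> str:
        return "none"

    def create_linear(self, in_features: int, out_features: int) -> nn.Module:
        return nn.Linear(in_features, out_features, dtype=torch.float16)

# Model code would then ask the active method for its layers:
layer = NoQuant().create_linear(4096, 11008)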
CUDA-level optimization: PagedAttention kernel
PagedAttention takes 20-40% of overall running time
55
Optimization 1: Better work partitioning
(leveraging sequence-level parallelism)
Optimization 2: Efficient MQA/GQA support
Second-priority items
56
* Figure adapted from Hu et al., “LoRA: Low-Rank Adaptation of Large Language Models” (ICLR 2022)
(Efficient) LoRA support*
AMD GPU support
Mixture-of-Experts
Pipeline parallelism
. . .
Q&A on future roadmap (5 min)
First-Priority Features
Second-Priority Features
57
vLLM Networking Hour!
@woosuk_k
@zhuohan123
Please fill out our survey before you leave: