Project Update
The Fourth vLLM Meetup
x AGI Builders Meetup
The vLLM Team
🙏 Thank you Cloudflare & BentoML for Hosting!
Topics
The era of Large Language Models (LLMs)
[Figure: LLM-powered applications (Chat, Program, Search), served by GPUs]
Serving LLMs is (surprisingly) slow and expensive
Our Goal
Build the fastest and
easiest-to-use open-source
LLM inference & serving engine
KV Cache
Caching Key and Value vectors in self-attention saves redundant computation and accelerates decoding
https://medium.com/@joaolages/kv-caching-explained-276520203249
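To make the mechanism concrete, here is a toy, self-contained sketch of decoding with a KV cache in single-head attention; all names (attend, Wq, Wk, Wv) and the random weights are illustrative, not vLLM internals.

import numpy as np

rng = np.random.default_rng(0)
d = 16                                    # toy hidden size
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def attend(x, k_cache, v_cache):
    # One decode step of single-head self-attention with a KV cache:
    # only the new token's K and V are computed; past ones are reused.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    k_cache.append(k)
    v_cache.append(v)
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                    # attention output for this token

k_cache, v_cache = [], []
for step in range(5):                     # pretend we decode 5 tokens
    x = rng.normal(size=d)                # stand-in for the token's hidden state
    out = attend(x, k_cache, v_cache)
print(len(k_cache))                       # 5 cached K vectors, none recomputed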
PagedAttention
An attention algorithm that allows for storing continuous keys and values in non-contiguous memory space.
Manage KV cache like OS virtual memory
[Figure: Each request's logical KV blocks are mapped through a per-request block table onto physical KV blocks. Request A ("Alan Turing is a computer scientist and mathematician renowned …") and Request B ("Artificial Intelligence is the future of technology") fill contiguous logical blocks, while the corresponding physical blocks are scattered non-contiguously and interleaved between the two requests.]
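The mapping in the figure can be sketched as a tiny allocator. The class and method names below (PagedKVCache, append_token) are illustrative stand-ins, not vLLM's actual block manager.

class PagedKVCache:
    # Toy block-table allocator in the spirit of the figure: each request's
    # logical KV blocks map to arbitrary free physical blocks, so no request
    # needs contiguous memory and blocks are allocated only when filled.
    def __init__(self, num_physical_blocks, block_size=4):
        self.block_size = block_size
        self.free_blocks = list(range(num_physical_blocks))
        self.block_tables = {}            # request_id -> [physical block ids]
        self.num_tokens = {}              # request_id -> tokens written so far

    def append_token(self, request_id):
        table = self.block_tables.setdefault(request_id, [])
        n = self.num_tokens.get(request_id, 0)
        if n % self.block_size == 0:      # previous block is full: grab a new one
            table.append(self.free_blocks.pop())
        self.num_tokens[request_id] = n + 1
        # This token's K/V lives at (physical block id, offset within block).
        return table[-1], n % self.block_size

cache = PagedKVCache(num_physical_blocks=8)
for _ in range(6):                        # e.g. "Alan Turing is a computer scientist"
    cache.append_token("request_A")
print(cache.block_tables["request_A"])    # two physical blocks, assigned on demand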
vLLM GitHub Repo
$ pip install vllm
20.8K Stars
Official release!
vLLM API (1): LLM class
from vllm import LLM
# Example prompts.
prompts = ["Hello, my name is", "The capital of France is"]
# Create an LLM with HF model name.
llm = LLM(model="meta-llama/Meta-Llama-3-8B")
# Generate texts from the prompts.
outputs = llm.generate(prompts)
A Python interface for offline batched inference
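As a hedged follow-up to the snippet above, sampling parameters and reading the results might look like this (attribute names follow vLLM's RequestOutput/CompletionOutput objects):

from vllm import SamplingParams

# Optional decoding controls; vLLM uses defaults if these are omitted.
sampling_params = SamplingParams(temperature=0.8, max_tokens=32)
outputs = llm.generate(prompts, sampling_params)

# Each RequestOutput pairs a prompt with its generated completions.
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)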
vLLM API (2): OpenAI-compatible server
Server:
$ python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-8B

Client:
$ curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Meta-Llama-3-8B",
"prompt": "San Francisco is a",
"max_tokens": 7,
"temperature": 0
}'

A FastAPI-based server for online serving
Why vLLM?
Wide range of model support
Diverse hardware support
End-to-end inference optimizations
Contributors
…
Thanks to 389+ contributors who raised issues, participated in discussions, and submitted PRs!
vLLM is a true community project!
vLLM Adopters
Open-Source Projects: lm-sys/FastChat, allenai/open-instruct, EleutherAI/lm-evaluation-harness, outlines-dev/outlines, …
Companies: …
Topics
Extensive Model Support
Jan: DeepSeek MoE
Feb: OLMo, Gemma
March: StarCoder2, Command R, Qwen2 MoE, DBRX, XVerse, Jais, LLaVA
April: Command R+, MiniCPM, Meta Llama 3, Mixtral 8x22B
May: Phi-3-mini, Phi-3-small, Arctic, IBM Granite, E5 Mistral
June (in progress): Jamba, PaliGemma
(Color legend from the slide: blue = added by the model vendor; red = exciting new architecture.)
Out-of-tree proprietary model support! (#3871)
OpenAI API Compatibility
from openai import OpenAI
# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
- openai_api_base = "http://api.openai.com/v1"
+ openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
api_key=openai_api_key,
base_url=openai_api_base,
)
A plug-and-play replacement
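Continuing the snippet, an actual request through the redirected client might look like the following sketch; the model name, prompt, and parameters are just examples.

# With the base URL pointed at vLLM, the rest of the client code is unchanged.
completion = client.completions.create(
    model="meta-llama/Meta-Llama-3-8B",
    prompt="San Francisco is a",
    max_tokens=7,
    temperature=0,
)
print(completion.choices[0].text)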
OpenAI Vision API Compatibility (by @ywang96, @DarkLight1337)
Server:
$ python -m \
vllm.entrypoints.openai.api_server \
--model llava-hf/llava-1.5-7b-hf \
--image-input-type pixel_values \
--image-token-id 32000 \
--image-input-shape 1,3,336,336 \
--image-feature-size 576 \
--chat-template template_llava.jinja
Client:
$ curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llava-hf/llava-1.5-7b-hf",
"messages": [{
"role": "user",
"content": [
{"type": "text", "text": "What's in this image?"},
{"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/c/ce/City-of-the-future.jpg"}}]}],
"max_tokens": 7}'
Diverse Hardware Support
NVIDIA GPU
AMD GPU
AWS Neuron
Google TPU
Intel Gaudi
x86 CPU
Intel GPU
Feature Highlights
All features are ready for testing and production usage.
They will be on by default in upcoming releases!
Automatic Prefix Caching
Example 1: Shared system prompt
Request A
A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. User: Hello!
Request B
A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. User: How are you?
Related work: [SGLang], [PromptCache], [Hydragen], [Parrot], …
Example 2: Multi-round conversation
Prompt (round 1)
Human: What's AI?
LLM Result (round 1)
LLM: AI is technology that simulates human intelligence, like Siri or Google Maps.
Prompt (round 2)
Human: What's AI?
LLM: AI is technology that simulates human intelligence, like Siri or Google Maps.
Human: Cool, thanks!
LLM Result (round 2)
LLM: No problem!
Shared prefix: the round-1 prompt and response are reused as cached KV blocks in round 2.
Related work: [SGLang], [PromptCache], [Hydragen], [Parrot], …
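A minimal sketch of enabling this with the offline API, assuming the enable_prefix_caching flag on the LLM class; the shared system prompt is the one from Example 1.

from vllm import LLM

system = ("A chat between a curious user and an artificial intelligence "
          "assistant. The assistant gives helpful, detailed, and polite "
          "answers to the user's questions. ")

# With prefix caching on, the KV blocks of the shared system prompt are
# computed once and reused by every request that starts with it.
llm = LLM(model="meta-llama/Meta-Llama-3-8B", enable_prefix_caching=True)
outputs = llm.generate([system + "User: Hello!",
                        system + "User: How are you?"])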
Speculative Decoding
[Diagram: a Lookahead Scheduler (schedules more than one slot per step) feeds a Speculative Worker that wraps a Draft Worker (small draft model, with its own KV cache) and a Target Worker (target model KV cache). The draft model proposes tokens, the target model scores them, and a Rejection Sampler accepts or rejects the proposals. In the future: other propose methods (ngram, RAG, …).]
Continuous Optimization
vLLM will become faster and faster!
Integration with FlashAttention and FlashInfer
vLLM handles distributed inference for you
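For example, tensor parallelism across several GPUs only needs one extra argument; a minimal sketch (the GPU count below is arbitrary):

from vllm import LLM

# Shard the model across 4 GPUs with tensor parallelism; vLLM sets up the
# distributed workers and communication for you. The server equivalent is
# the --tensor-parallel-size flag.
llm = LLM(model="meta-llama/Meta-Llama-3-8B", tensor_parallel_size=4)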
Multi-Processing Based GPU Executor (by @njhill)
Other Improvements
Topics
Development Roadmap: Major Themes
Preliminary Q3 Roadmap
Feedback Welcome!
Model Support
Hardware Support
Production Engine
Performance
Re-Architecture
OSS Community
Our Goal
Build the fastest and
easiest-to-use open-source
LLM inference & serving engine
Topics
What is Speculative Decoding?
[Figure: Without speculation, the 7B LLM generates T1, T2, T3 one token per forward pass. With speculative decoding, a small 80M draft model proposes T1, T2, T3'; the 7B LLM verifies all proposals in a single forward pass, accepting T1 and T2 (✅), rejecting T3' (❌), and emitting the corrected T3.]
Vanilla speculative decoding with continuous batching
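A minimal numeric sketch of the verify/accept step, not vLLM's implementation: draft proposals are accepted with probability min(1, p/q), and the first rejection is resampled from the residual distribution so the output matches the target model. All names and the toy distributions below are illustrative.

import numpy as np

def verify_draft_tokens(draft_tokens, draft_probs, target_probs, rng):
    # Minimal rejection-sampling verification (illustrative only).
    # draft_probs[i] / target_probs[i]: next-token distributions at position i.
    accepted = []
    for i, tok in enumerate(draft_tokens):
        p, q = target_probs[i][tok], draft_probs[i][tok]
        if rng.random() < min(1.0, p / q):
            accepted.append(tok)           # proposal accepted
        else:
            # Rejected: resample from the residual max(0, p - q), renormalized,
            # which keeps the output distribution identical to the target model.
            residual = np.maximum(target_probs[i] - draft_probs[i], 0.0)
            residual /= residual.sum()
            accepted.append(int(rng.choice(len(residual), p=residual)))
            break
    return accepted

rng = np.random.default_rng(0)
vocab = 8
draft_probs = rng.dirichlet(np.ones(vocab), size=3)     # toy distributions
target_probs = rng.dirichlet(np.ones(vocab), size=3)
draft_tokens = [int(np.argmax(p)) for p in draft_probs]
print(verify_draft_tokens(draft_tokens, draft_probs, target_probs, rng))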
Speculative Decoding (by @cadedaniel)
[Diagram (same architecture as above): a Lookahead Scheduler (schedules more than one slot per step) feeds a Speculative Worker wrapping a Draft Worker (small draft model or Prompt Lookup Decoding, with its own KV cache) and a Target Worker (target model KV cache); proposals are scored by the target model and filtered by a Rejection Sampler. In the future: RAG, Medusa, etc.]
Vanilla Speculative Decoding is not the final solution!
Dynamic Speculative Decoding (DSD) (RFC)
More detailed benchmark numbers here.
Optimize Speculative Decoding (by @cadedaniel)
More Types of Speculative Decoding in Progress
Spec Decode Usage
from vllm import LLM

# Example prompts.
prompts = ["Hello, my name is", "The capital of France is"]

# Create an LLM with a small draft model for speculative decoding.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B",
    speculative_model="facebook/opt-125m",
    num_speculative_tokens=5,
    use_v2_block_manager=True,
)

# Generate texts from the prompts.
outputs = llm.generate(prompts)
Alternative: prompt-lookup (ngram) speculation, configured with speculative_model="[ngram]"
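A hedged sketch of that configuration: with "[ngram]", draft tokens are copied from earlier occurrences in the prompt instead of coming from a separate draft model. The ngram_prompt_lookup_max knob below is my understanding of the then-current flag name; treat it as an assumption.

from vllm import LLM

# Prompt-lookup (ngram) speculation: no separate draft model is loaded.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B",
    speculative_model="[ngram]",
    num_speculative_tokens=5,
    ngram_prompt_lookup_max=4,            # assumed flag name for the lookup window
    use_v2_block_manager=True,
)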
Speedup of DSD on Online Chatting (Arena) Dataset
QPS | Llama-7B (160M Draft) | QPS | Llama-33B (1.1B Draft) | QPS | Llama-70B (7B Draft) |
2 | 1.8x | 1 | 1.7x | 1 | 1.3x |
4 | 2.2x | 2 | 1.8x | 2 | 1.2x |
6 | 1.2x | 3 | 1.2x | 3 | 1.1x |
8 | 1.0x | 4 | 1.1x | 4 | 1.0x |
Speedup = request latency of non-SD / request latency with SD
Sponsors (funding compute!)
Building the fastest and easiest-to-use open-source LLM inference & serving engine!
https://twitter.com/vllm_project
https://opencollective.com/vllm
Ray Summit (09/30-10/02) now has a vLLM Track. The CFP is open!