1 of 50

Project Update

The Fourth vLLM Meetup

x AGI Builders Meetup

The vLLM Team

2 of 50

🙏 Thank you, Cloudflare & BentoML, for hosting!


3 of 50

Topics

  • vLLM Introduction
  • Project Update
  • Roadmap
  • Spec Decode Deep Dive



5 of 50

The era of Large Language Models (LLMs)

[Figure: LLM-powered applications such as Chat, Program, and Search, served by GPUs]

6 of 50

Serving LLMs is (surprisingly) slow and expensive

  • A single A100 GPU can serve < 1 request per second
    • For a moderately sized model (13B parameters) and input length
  • A ton of GPUs are required for production-scale LLM services


7 of 50

Our Goal

Build the fastest and easiest-to-use open-source LLM inference & serving engine

8 of 50

KV Cache

Caching Key and Value vectors in self-attention saves redundant computation and accelerates decoding


https://medium.com/@joaolages/kv-caching-explained-276520203249
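As a toy illustration of the idea (not vLLM code; single-head attention in NumPy with made-up weight matrices), caching K and V means each decoding step only projects the newest token instead of re-projecting the whole prefix:

import numpy as np

d = 64
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

def decode_step(x_new, k_cache, v_cache):
    """x_new: hidden state (d,) of the newest token."""
    q = x_new @ W_q                        # only the new token needs a fresh query
    k_cache.append(x_new @ W_k)            # append K/V instead of recomputing the prefix
    v_cache.append(x_new @ W_v)
    K, V = np.stack(k_cache), np.stack(v_cache)
    s = q @ K.T / np.sqrt(d)
    w = np.exp(s - s.max()); w /= w.sum()  # softmax over all cached positions
    return w @ V                           # attention output for the new token

k_cache, v_cache = [], []
for _ in range(8):                         # 8 toy decoding steps
    out = decode_step(rng.standard_normal(d), k_cache, v_cache)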

9 of 50

PagedAttention

An attention algorithm that allows for storing continuous keys and values in non-contiguous memory space.


10 of 50

Manage KV cache like OS virtual memory

[Figure: Request A's logical KV blocks ("Alan Turing is a computer scientist and mathematician renowned") and Request B's logical KV blocks ("Artificial Intelligence is the future of technology") are each mapped through a per-request block table onto physical KV blocks, which can be non-contiguous and interleaved in GPU memory.]
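To make the block-table idea concrete, here is a toy sketch (not vLLM's actual block manager) of two requests mapping their logical KV blocks onto a shared pool of physical blocks:

BLOCK_SIZE = 4  # tokens per KV block

class BlockTable:
    """Per-request table mapping logical KV block index -> physical block id."""
    def __init__(self, free_blocks):
        self.free_blocks = free_blocks            # shared pool of physical block ids
        self.physical_blocks = []

    def append_token(self, num_tokens_so_far):
        if num_tokens_so_far % BLOCK_SIZE == 0:   # current block is full: grab a new one
            self.physical_blocks.append(self.free_blocks.pop())

free_blocks = list(range(16))                     # physical KV blocks on the GPU
req_a, req_b = BlockTable(free_blocks), BlockTable(free_blocks)
for t in range(10):                               # interleaved decoding of two requests
    req_a.append_token(t)
    req_b.append_token(t)
print(req_a.physical_blocks)                      # e.g. [15, 13, 11] -- non-contiguous
print(req_b.physical_blocks)                      # e.g. [14, 12, 10]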

11 of 50

vLLM Github Repo


$ pip install vllm

20.8K Stars · Official release!

12 of 50

vLLM API (1): LLM class

from vllm import LLM

# Example prompts.
prompts = ["Hello, my name is", "The capital of France is"]

# Create an LLM with an HF model name.
llm = LLM(model="meta-llama/Meta-Llama-3-8B")

# Generate texts from the prompts.
outputs = llm.generate(prompts)

A Python interface for offline batched inference
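The returned objects can be inspected as follows (a minimal usage sketch; `prompt` and `outputs[0].text` are the fields on vLLM's RequestOutput as documented):

for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)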

13 of 50

vLLM API (2): OpenAI-compatible server

Server:
$ python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-8B

Client:
$ curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Meta-Llama-3-8B",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0
    }'

A FastAPI-based server for online serving

14 of 50

Why vLLM?


Wide range of model support

  • 35+ model architectures including vision language models
  • Collaborating with model vendors

Diverse hardware support

  • NVIDIA, AMD, Intel GPUs
  • Intel/AMD CPU
  • Inferentia, TPU, Gaudi

End-to-end inference optimizations

  • CUDA graph
  • Speculative decoding
  • Quantization (GPTQ, AWQ, FP8)
  • Automatic prefix caching
  • Chunked prefills (a.k.a. Dynamic SplitFuse)
  • Multi-LoRA serving
  • Constrained decoding
  • FlashAttention & FlashDecoding

15 of 50

Contributors


Thanks to 389+ contributors who raised issues, participated in discussions, and submitted PRs!

16 of 50

vLLM is a true community project!


17 of 50

vLLM Adopters


Open-Source Projects: lm-sys/FastChat, allenai/open-instruct, EleutherAI/lm-evaluation-harness, outlines-dev/outlines

Companies: (shown as logos)

18 of 50

Topics

  • vLLM Introduction
  • Project Update
  • Roadmap
  • Spec Decode Deep Dive


19 of 50

Extensive Model Support

Jan: Deepseek MoE

Feb: OLMo, Gemma

March: StarCoder2, Command R, Qwen2 MoE, DBRX, XVerse, Jais, LLaVA

April: Command R+, MiniCPM, Meta Llama 3, Mixtral 8x22B

May: Phi-3-mini, Phi-3-small, Arctic, IBM Granite, E5 Mistral

June (in progress): Jamba, PaliGemma

Blue: Added by model vendor; Red: Exciting new architecture


Out-of-tree proprietary model support! (#3871)

20 of 50

OpenAI API Compatibility

  • Addition of the Embedding API
  • Support for OpenAI batch file format
  • Initial support for tool_use

from openai import OpenAI

# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
- openai_api_base = "http://api.openai.com/v1"
+ openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

A plug-and-play replacement
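With the base URL pointed at vLLM, the rest is standard OpenAI client usage (a short sketch; the model name matches the server started earlier in these slides):

completion = client.completions.create(
    model="meta-llama/Meta-Llama-3-8B",
    prompt="San Francisco is a",
    max_tokens=7,
    temperature=0,
)
print(completion.choices[0].text)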

21 of 50

OpenAI Vision API Compatibility (by @ywang96, @DarkLight1337)

  • Extensive refactoring of the multi-modal infrastructure is in progress to improve the user & developer experience
  • End-to-end serving compatibility with the OpenAI Vision API

Server:
$ python -m vllm.entrypoints.openai.api_server \
    --model llava-hf/llava-1.5-7b-hf \
    --image-input-type pixel_values \
    --image-token-id 32000 \
    --image-input-shape 1,3,336,336 \
    --image-feature-size 576 \
    --chat-template template_llava.jinja

Client:
$ curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "llava-hf/llava-1.5-7b-hf",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "What'\''s in this image?"},
                {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/c/ce/City-of-the-future.jpg"}}]}],
        "max_tokens": 7}'

22 of 50

Diverse Hardware Support

  • Already supported or in development (grouping shown graphically on the slide): NVIDIA GPU, AMD GPU, AWS Neuron, Google TPU, Intel Gaudi, x86 CPU, Intel GPU

23 of 50

Feature Highlights

  • Chunked Prefill: trade off time-to-first-token for inter-token latency
  • FP8 Quantization: trade off accuracy for lower latency
  • Prefix Caching: trade off higher memory for lower latency
  • Speculative Decoding: trade off higher compute for lower latency
    • Deep dive later!

All features are ready for testing and production use, and they will be enabled by default in upcoming releases! (A configuration sketch follows.)
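A minimal configuration sketch for turning some of these features on explicitly (these keyword arguments exist in recent vLLM releases, but exact names and defaults may change between versions):

from vllm import LLM

# Hedged sketch: flags as documented in recent vLLM releases; check your version.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B",
    enable_chunked_prefill=True,   # trade a bit of TTFT for lower inter-token latency
    enable_prefix_caching=True,    # reuse KV cache across shared prompt prefixes
)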

24 of 50


25 of 50

  • vLLM now supports loading:
    • FP8 E4M3 checkpoints with per-tensor static weight scales and activation scales
    • FP16/BF16 checkpoints quantized at runtime, with dynamic scaling enabled through quantization="fp8" (see the sketch after this list)
  • Leverages PyTorch and CUTLASS for hardware acceleration on Ada Lovelace and Hopper GPUs
  • Collection on HF with quantized checkpoints
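A minimal sketch of the runtime-quantization path named above, using the quantization="fp8" argument from this slide (loading a pre-quantized FP8 checkpoint goes through the same LLM entry point):

from vllm import LLM

# Quantize an FP16/BF16 checkpoint to FP8 at load time with dynamic scales.
llm = LLM(model="meta-llama/Meta-Llama-3-8B", quantization="fp8")
outputs = llm.generate(["Hello, my name is"])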


26 of 50


Example 1: Shared system prompt

Request A

A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. User: Hello!

Request B

A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. User: How are you?

27 of 50


Example 2: Multi-round conversation

Prompt (round 1)

Human: What's AI?

LLM Result (round 1)

LLM: AI is technology that simulates human intelligence, like Siri or Google Maps.

Prompt (round 2)

Human: What's AI?

LLM: AI is technology that simulates human intelligence, like Siri or Google Maps.

Human: Cool, thanks!

LLM Result (round 2)

LLM: No problem!

(The round-1 prompt and response form a shared prefix that is reused in round 2.)
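These two examples map onto the API roughly as follows (a sketch; enable_prefix_caching is the flag vLLM documents for automatic prefix caching, and the exact name may vary by version):

from vllm import LLM

# With automatic prefix caching, the KV cache computed for the shared system
# prompt (or earlier conversation rounds) is reused across requests.
system = ("A chat between a curious user and an artificial intelligence assistant. "
          "The assistant gives helpful, detailed, and polite answers to the "
          "user's questions. ")

llm = LLM(model="meta-llama/Meta-Llama-3-8B", enable_prefix_caching=True)
outputs = llm.generate([system + "User: Hello!",
                        system + "User: How are you?"])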

28 of 50

Speculative Decoding

[Architecture diagram: a Lookahead Scheduler (schedules more than one slot per step) drives a Speculative Worker, which wraps a Draft Worker running a small draft model (in the future: other proposal methods such as ngram, RAG, ...) and a Target Worker, each with its own KV cache (draft model KV, target model KV). Flow: Propose -> Score -> Rejection Sampler -> Accept.]

29 of 50

Continuous Optimization

  • State-of-the-art kernels: FlashAttention, FlashInfer
  • GPU-GPU communication: NCCL, custom allreduce
  • Better architecture: Multiprocessing GPU executor
  • ……


vLLM will become faster and faster!

30 of 50

Integration with FlashAttention and FlashInfer


31 of 50

vLLM handles distributed inference for you

  • As LLMs grow rapidly in size, multi-GPU/multi-node inference is required
  • GPU communication is difficult to configure
    • Hangs, crashes, and communication-library version mismatches
  • vLLM takes care of it for you (see the sketch below)
    • A debugging guide for hangs and crashes
    • Selects the best NCCL version and sets the proper flags
    • Optimized custom allreduce for high-end machines
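For reference, a minimal multi-GPU sketch (tensor_parallel_size is the standard vLLM argument for tensor parallelism; the model name and GPU count here are placeholders):

from vllm import LLM

# Shard the model across 4 GPUs; vLLM sets up the NCCL process group,
# picks communication flags, and uses custom allreduce where applicable.
llm = LLM(model="meta-llama/Meta-Llama-3-70B", tensor_parallel_size=4)
outputs = llm.generate(["The capital of France is"])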


32 of 50

Multi-Processing Based GPU Executor

(by @njhill)

  • Added an option for using multiprocessing as an alternative backend
  • See more details in PR #3466
  • Benefits
    • Simpler deployment model using Python's built-in multiprocessing library
    • Performance benefits
      • ~4-5% latency speedup (TTFT and TPOT) for Llama 70B and Mixtral 8x7B
  • Try it out using --distributed-backend=mp (see the launch sketch below)
  • Will be the default option in the future
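A launch sketch based on the flag named on this slide (the flag spelling has changed across vLLM versions, so verify against --help on your installed version; --tensor-parallel-size is the usual companion flag):

$ python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-70B \
    --tensor-parallel-size 4 \
    --distributed-backend=mp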


33 of 50

Other Improvements

  • Usability
    • Centralized and documented environment variables (#4407)
  • Continuous performance optimization:
    • Reduced per-step scheduling overhead (#4894)
    • Automatic prefix caching optimization (#4696)
    • Kernel tuning (#4343, #5294, …)


34 of 50

Topics

  • vLLM Introduction
  • Project Update
  • Roadmap
  • Spec Decode Deep Dive


35 of 50

Development Roadmap: Major Themes

  1. Broad Model Support
  2. Excellent Hardware Coverage
  3. Production Level Engine
  4. Performance Optimization
  5. Extensible Architectures
  6. Strong OSS Community


36 of 50

Preliminary Q3 Roadmap


Feedback Welcome!

Model Support

  • More VLMs
  • Encoder-decoder
  • SSM (Jamba)

Hardware Support

  • fp8 tensor cores
  • MI300x perf
  • TPU, Intel GPU

Production Engine

  • Chunked prefill on by default
  • APC on by default
  • Pipeline parallelism

Performance

  • Better fp8
  • More spec decode
  • KV Cache swapping

Re-Architecture

  • Scheduler out of critical loop
  • torch.compile

OSS Community

  • Continuous Benchmark
  • Better CI
  • Code quality

37 of 50

Our Goal

Build the fastest and easiest-to-use open-source LLM inference & serving engine

38 of 50

Topics

  • vLLM Introduction
  • Project Update
  • Roadmap
  • Spec Decode Deep Dive


39 of 50

What is Speculative Decoding?

  • Methods to reduce generation latency
  • Intuition: Some tokens are easier to generate than others, so use a small model to generate the easy tokens and the large model to generate the difficult ones (see the sketch below)
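To make the propose-and-verify loop concrete, here is a toy greedy sketch (not vLLM's implementation, which uses rejection sampling; draft_model.next_token and target_model.next_token_at_each_position are hypothetical helpers):

def speculative_step(prompt_ids, draft_model, target_model, k=3):
    # 1. The cheap draft model proposes k tokens autoregressively.
    ctx, proposal = list(prompt_ids), []
    for _ in range(k):
        tok = draft_model.next_token(ctx)               # hypothetical helper
        proposal.append(tok)
        ctx.append(tok)
    # 2. The target model scores prompt + proposal in ONE forward pass and
    #    returns its own greedy choice at each of the k+1 positions.
    target_choice = target_model.next_token_at_each_position(prompt_ids, proposal)
    # 3. Keep the longest agreeing prefix, plus one token from the target model.
    accepted = []
    for i in range(k):
        if proposal[i] == target_choice[i]:
            accepted.append(proposal[i])                # draft token verified
        else:
            accepted.append(target_choice[i])           # correction from the target
            return accepted
    accepted.append(target_choice[k])                   # bonus token: all k accepted
    return accepted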

[Figure: Running the 7B LLM alone generates T1, T2, T3 one token at a time from the prompt. With speculative decoding, a small 80M model proposes T1, T2, T3' cheaply, and the 7B LLM verifies the whole proposal in a single pass, keeping the matching tokens (T1, T2) and emitting its own token (T3) where the draft diverges.]

40 of 50

Vanilla speculative decoding with continuous batching

41 of 50

Speculative Decoding (by @cadedaniel)

[Architecture diagram (same structure as the earlier Speculative Decoding slide): a Lookahead Scheduler (schedules more than one slot per step) drives a Speculative Worker wrapping a Draft Worker (small draft model or prompt lookup decoding; in the future: RAG, Medusa, etc.) and a Target Worker, each with its own KV cache. Flow: Propose -> Score -> Rejection Sampler -> Accept.]

42 of 50

Vanilla Speculative Decoding is not the final solution!

  • Vanilla Speculative Decoding (VSD) is not always beneficial
  • Speculative decoding can even hurt performance
  • Different request rates have different optimal proposal lengths (k)


43 of 50

Dynamic Speculative Decoding (DSD) (RFC)

  • Decide the best proposal length (k) at different QPS
  • Decide when to stop doing speculative decoding
  • Generalizes to different types of speculative decoding
    • Draft-model based
    • Prompt lookup decoding
    • Medusa, etc.


44 of 50

Prompt Lookup Decoding + DSD

(by @leiwen83, @comaniac)

  • Up to 1.95x speedup at low QPS
  • Almost no performance degradation


More detailed benchmark numbers here.

45 of 50

Optimize Speculative Decoding (by @cadedaniel)

  • Correct and fully functional
  • Performance optimization (RFC)
    • Optimizing proposal time
      • P0: Reduce draft model control-plane communication
      • P0: Support draft model on different TP than target model
    • Optimizations for scoring time
      • P0: Re-enable bonus tokens to increase % accepted tokens
      • P1: Replace CPU-based batch expansion with multi-query attention kernel call
      • P1: Dynamic speculative decoding
    • Optimizations for both proposal and scoring time
      • P0: Decouple sampling serialization from sampling
      • P1: Amortize prepare_inputs over multiple forward passes
    • Optimizations for scheduling time
      • P0: Profile & optimize block manager V2


46 of 50

More Types of Speculative Decoding in Progress


47 of 50

Spec Decode Usage

from vllm import LLM

# Example prompts.
prompts = ["Hello, my name is", "The capital of France is"]

# Create an LLM with an HF model name and a draft model for speculative decoding.
llm = LLM(model="meta-llama/Meta-Llama-3-8B",
          speculative_model="facebook/opt-125m",
          num_speculative_tokens=5,
          use_v2_block_manager=True)

# Generate texts from the prompts.
outputs = llm.generate(prompts)

To use prompt lookup (ngram) speculative decoding, pass speculative_model="[ngram]" instead of a draft model (see the sketch below).
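A hedged sketch of the ngram variant; ngram_prompt_lookup_max is the window-size parameter recent vLLM versions document, so verify the name against your installed version:

llm = LLM(model="meta-llama/Meta-Llama-3-8B",
          speculative_model="[ngram]",        # prompt lookup decoding, no draft model
          num_speculative_tokens=5,
          ngram_prompt_lookup_max=4,          # assumed parameter: max ngram window size
          use_v2_block_manager=True)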

48 of 50

Speedup of DSD on Online Chatting (Arena) Dataset

Llama-7B (160M draft):   QPS 2 → 1.8x   QPS 4 → 2.2x   QPS 6 → 1.2x   QPS 8 → 1.0x
Llama-33B (1.1B draft):  QPS 1 → 1.7x   QPS 2 → 1.8x   QPS 3 → 1.2x   QPS 4 → 1.1x
Llama-70B (7B draft):    QPS 1 → 1.3x   QPS 2 → 1.2x   QPS 3 → 1.1x   QPS 4 → 1.0x

Speedup = request latency of non-SD / request latency with SD

49 of 50

Sponsors (funding compute!)


50 of 50


Building the fastest and easiest-to-use open-source LLM inference & serving engine!

https://twitter.com/vllm_project

https://opencollective.com/vllm

Ray Summit (09/30-10/02) now has a vLLM track. The CFP is open!