1 of 50

Project Update

The Fourth vLLM Meetup

x AGI Builders Meetup

The vLLM Team

2 of 50

🙏 Thank you, Cloudflare & BentoML, for hosting!


3 of 50

Topics

  • vLLM Introduction
  • Project Update
  • Roadmap
  • Spec Decode Deep Dive



5 of 50

The era of Large Language Models (LLMs)

[Figure: LLM-powered applications such as Chat, Program, and Search, served by GPUs]

6 of 50

Serving LLMs is (surprisingly) slow and expensive

  • A single A100 GPU can serve < 1 request per second
    • For a moderately sized model (13B parameters) and input length
  • A ton of GPUs are required for production-scale LLM services


7 of 50

Our Goal

Build the fastest and easiest-to-use open-source LLM inference & serving engine

8 of 50

KV Cache

Caching Key and Value vectors in self-attention saves redundant computation and accelerates decoding


https://medium.com/@joaolages/kv-caching-explained-276520203249
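As a toy illustration of the idea (not vLLM code; single-head attention in NumPy with made-up weight matrices), caching K and V means each decoding step only projects the newest token instead of re-projecting the whole prefix:

import numpy as np

d = 64
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

def decode_step(x_new, k_cache, v_cache):
    """x_new: hidden state (d,) of the newest token."""
    q = x_new @ W_q                        # only the new token needs a fresh query
    k_cache.append(x_new @ W_k)            # append K/V instead of recomputing the prefix
    v_cache.append(x_new @ W_v)
    K, V = np.stack(k_cache), np.stack(v_cache)
    s = q @ K.T / np.sqrt(d)
    w = np.exp(s - s.max()); w /= w.sum()  # softmax over all cached positions
    return w @ V                           # attention output for the new token

k_cache, v_cache = [], []
for _ in range(8):                         # 8 toy decoding steps
    out = decode_step(rng.standard_normal(d), k_cache, v_cache)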

9 of 50

PagedAttention

An attention algorithm that allows for storing continuous keys and values in non-contiguous memory space.


10 of 50

Manage KV cache like OS virtual memory

[Figure: Request A's logical KV blocks ("Alan Turing is a computer scientist and mathematician renowned") and Request B's logical KV blocks ("Artificial Intelligence is the future of technology") are each mapped through a per-request block table onto physical KV blocks, which can be non-contiguous and interleaved in GPU memory.]
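To make the block-table idea concrete, here is a toy sketch (not vLLM's actual block manager) of two requests mapping their logical KV blocks onto a shared pool of physical blocks:

BLOCK_SIZE = 4  # tokens per KV block

class BlockTable:
    """Per-request table mapping logical KV block index -> physical block id."""
    def __init__(self, free_blocks):
        self.free_blocks = free_blocks            # shared pool of physical block ids
        self.physical_blocks = []

    def append_token(self, num_tokens_so_far):
        if num_tokens_so_far % BLOCK_SIZE == 0:   # current block is full: grab a new one
            self.physical_blocks.append(self.free_blocks.pop())

free_blocks = list(range(16))                     # physical KV blocks on the GPU
req_a, req_b = BlockTable(free_blocks), BlockTable(free_blocks)
for t in range(10):                               # interleaved decoding of two requests
    req_a.append_token(t)
    req_b.append_token(t)
print(req_a.physical_blocks)                      # e.g. [15, 13, 11] -- non-contiguous
print(req_b.physical_blocks)                      # e.g. [14, 12, 10]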

11 of 50

vLLM Github Repo


$ pip install vllm

20.8K Stars · Official release!

12 of 50

vLLM API (1): LLM class

from vllm import LLM

# Example prompts.
prompts = ["Hello, my name is", "The capital of France is"]

# Create an LLM with an HF model name.
llm = LLM(model="meta-llama/Meta-Llama-3-8B")

# Generate texts from the prompts.
outputs = llm.generate(prompts)

A Python interface for offline batched inference
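The returned objects can be inspected as follows (a minimal usage sketch; `prompt` and `outputs[0].text` are the fields on vLLM's RequestOutput as documented):

for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)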

13 of 50

vLLM API (2): OpenAI-compatible server

Server:
$ python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-8B

Client:
$ curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Meta-Llama-3-8B",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0
    }'

A FastAPI-based server for online serving

14 of 50

Why vLLM?


Wide range of model support

  • 35+ model architectures including vision language models
  • Collaborating with model vendors

Diverse hardware support

  • NVIDIA, AMD, Intel GPUs
  • Intel/AMD CPU
  • Inferentia, TPU, Gaudi

End-to-end inference optimizations

  • CUDA graph
  • Speculative decoding
  • Quantization (GPTQ, AWQ, FP8)
  • Automatic prefix caching
  • Chunked prefills (a.k.a. Dynamic SplitFuse)
  • Multi-LoRA serving
  • Constrained decoding
  • FlashAttention & FlashDecoding

15 of 50

Contributors


Thanks to 389+ contributors who raised issues, participated in discussions, and submitted PRs!

16 of 50

vLLM is a true community project!


17 of 50

vLLM Adopters


Open-Source Projects: lm-sys/FastChat, allenai/open-instruct, EleutherAI/lm-evaluation-harness, outlines-dev/outlines

Companies: (shown as logos)

18 of 50

Topics

  • vLLM Introduction
  • Project Update
  • Roadmap
  • Spec Decode Deep Dive


19 of 50

Extensive Model Support

Jan: Deepseek MoE

Feb: OLMo, Gemma

March: StarCoder2, Command R, Qwen2 MoE, DBRX, XVerse, Jais, LLaVA

April: Command R+, MiniCPM, Meta Llama 3, Mixtral 8x22B

May: Phi-3-mini, Phi-3-small, Arctic, IBM Granite, E5 Mistral

June (in progress): Jamba, PaliGemma

Blue: Added by model vendor; Red: Exciting new architecture


Out-of-tree proprietary model support! (#3871)

20 of 50

OpenAI API Compatibility

  • Addition of the Embedding API
  • Support for OpenAI batch file format
  • Initial support for tool_use

from openai import OpenAI

# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
- openai_api_base = "http://api.openai.com/v1"
+ openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

A plug-and-play replacement
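With the base URL pointed at vLLM, the rest is standard OpenAI client usage (a short sketch; the model name matches the server started earlier in these slides):

completion = client.completions.create(
    model="meta-llama/Meta-Llama-3-8B",
    prompt="San Francisco is a",
    max_tokens=7,
    temperature=0,
)
print(completion.choices[0].text)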

21 of 50

OpenAI Vision API Compatibility (by @ywang96, @DarkLight1337)

  • Extensive refactoring of the multi-modal infrastructure is in progress to improve the user & developer experience
  • End-to-end serving compatibility with the OpenAI Vision API

Server:
$ python -m vllm.entrypoints.openai.api_server \
    --model llava-hf/llava-1.5-7b-hf \
    --image-input-type pixel_values \
    --image-token-id 32000 \
    --image-input-shape 1,3,336,336 \
    --image-feature-size 576 \
    --chat-template template_llava.jinja

Client:
$ curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "llava-hf/llava-1.5-7b-hf",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "What'\''s in this image?"},
                {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/c/ce/City-of-the-future.jpg"}}]}],
        "max_tokens": 7}'

22 of 50

Diverse Hardware Support

  • Already supported or in development (grouping shown graphically on the slide): NVIDIA GPU, AMD GPU, AWS Neuron, Google TPU, Intel Gaudi, x86 CPU, Intel GPU

23 of 50

Feature Highlights

  • Chunked Prefill: trade off time-to-first-token for inter-token latency
  • FP8 Quantization: trade off accuracy for lower latency
  • Prefix Caching: trade off higher memory for lower latency
  • Speculative Decoding: trade off higher compute for lower latency
    • Deep dive later!

All features are ready for testing and production use, and they will be enabled by default in upcoming releases! (A configuration sketch follows.)
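A minimal configuration sketch for turning some of these features on explicitly (these keyword arguments exist in recent vLLM releases, but exact names and defaults may change between versions):

from vllm import LLM

# Hedged sketch: flags as documented in recent vLLM releases; check your version.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B",
    enable_chunked_prefill=True,   # trade a bit of TTFT for lower inter-token latency
    enable_prefix_caching=True,    # reuse KV cache across shared prompt prefixes
)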

24 of 50


25 of 50

  • vLLM now supports loading:
    • FP8 E4M3 checkpoints with per-tensor static weight scales and activation scales
    • FP16/BF16 checkpoints quantized at runtime, with dynamic scaling enabled through quantization="fp8" (see the sketch after this list)
  • Leverages PyTorch and CUTLASS for hardware acceleration on Ada Lovelace and Hopper GPUs
  • Collection on HF with quantized checkpoints
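A minimal sketch of the runtime-quantization path named above, using the quantization="fp8" argument from this slide (loading a pre-quantized FP8 checkpoint goes through the same LLM entry point):

from vllm import LLM

# Quantize an FP16/BF16 checkpoint to FP8 at load time with dynamic scales.
llm = LLM(model="meta-llama/Meta-Llama-3-8B", quantization="fp8")
outputs = llm.generate(["Hello, my name is"])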


26 of 50


Example 1: Shared system prompt

Request A

A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. User: Hello!

Request B

A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. User: How are you?

27 of 50


Example 2: Multi-round conversation

Prompt (round 1)

Human: What's AI?

LLM Result (round 1)

LLM: AI is technology that simulates human intelligence, like Siri or Google Maps.

Prompt (round 2)

Human: What's AI?

LLM: AI is technology that simulates human intelligence, like Siri or Google Maps.

Human: Cool, thanks!

LLM Result (round 2)

LLM: No problem!

(The round-1 prompt and response form a shared prefix that is reused in round 2.)
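These two examples map onto the API roughly as follows (a sketch; enable_prefix_caching is the flag vLLM documents for automatic prefix caching, and the exact name may vary by version):

from vllm import LLM

# With automatic prefix caching, the KV cache computed for the shared system
# prompt (or earlier conversation rounds) is reused across requests.
system = ("A chat between a curious user and an artificial intelligence assistant. "
          "The assistant gives helpful, detailed, and polite answers to the "
          "user's questions. ")

llm = LLM(model="meta-llama/Meta-Llama-3-8B", enable_prefix_caching=True)
outputs = llm.generate([system + "User: Hello!",
                        system + "User: How are you?"])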

28 of 50

Speculative Decoding

[Architecture diagram: a Lookahead Scheduler (schedules more than one slot per step) drives a Speculative Worker, which wraps a Draft Worker running a small draft model (in the future: other proposal methods such as ngram, RAG, ...) and a Target Worker, each with its own KV cache (draft model KV, target model KV). Flow: Propose -> Score -> Rejection Sampler -> Accept.]

29 of 50

Continuous Optimization

  • State-of-the-art kernels: FlashAttention, FlashInfer
  • GPU-GPU communication: NCCL, custom allreduce
  • Better architecture: Multiprocessing GPU executor
  • ……


vLLM will become faster and faster!

30 of 50

Integration with FlashAttention and FlashInfer


31 of 50

vLLM handles distributed inference for you

  • As LLMs grow rapidly in size, multi-GPU/multi-node inference is required
  • GPU communication is difficult to configure
    • Hangs, crashes, and communication-library version mismatches
  • vLLM takes care of it for you (see the sketch below)
    • A debugging guide for hangs and crashes
    • Selects the best NCCL version and sets the proper flags
    • Optimized custom allreduce for high-end machines
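For reference, a minimal multi-GPU sketch (tensor_parallel_size is the standard vLLM argument for tensor parallelism; the model name and GPU count here are placeholders):

from vllm import LLM

# Shard the model across 4 GPUs; vLLM sets up the NCCL process group,
# picks communication flags, and uses custom allreduce where applicable.
llm = LLM(model="meta-llama/Meta-Llama-3-70B", tensor_parallel_size=4)
outputs = llm.generate(["The capital of France is"])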


32 of 50

Multi-Processing Based GPU Executor

(by @njhill)

  • Added an option for using multiprocessing as an alternative backend
  • See more details in PR #3466
  • Benefits
    • Simpler deployment model using Python's built-in multiprocessing library
    • Performance benefits
      • ~4-5% latency speedup (TTFT and TPOT) for Llama 70B and Mixtral 8x7B
  • Try it out using --distributed-backend=mp (see the launch sketch below)
  • Will be the default option in the future
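A launch sketch based on the flag named on this slide (the flag spelling has changed across vLLM versions, so verify against --help on your installed version; --tensor-parallel-size is the usual companion flag):

$ python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-70B \
    --tensor-parallel-size 4 \
    --distributed-backend=mp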


33 of 50

Other Improvements

  • Usability
    • Centralized and documented environment variables (#4407)
  • Continuous performance optimization:
    • Reduced per-step scheduling overhead (#4894)
    • Automatic prefix caching optimization (#4696)
    • Kernel tuning (#4343, #5294, …)


34 of 50

Topics

  • vLLM Introduction
  • Project Update
  • Roadmap
  • Spec Decode Deep Dive


35 of 50

Development Roadmap: Major Themes

  1. Broad Model Support
  2. Excellent Hardware Coverage
  3. Production Level Engine
  4. Performance Optimization
  5. Extensible Architectures
  6. Strong OSS Community


36 of 50

Preliminary Q3 Roadmap


Feedback Welcome!

Model Support

  • More VLMs
  • Encoder-decoder
  • SSM (Jamba)

Hardware Support

  • fp8 tensor cores
  • MI300x perf
  • TPU, Intel GPU

Production Engine

  • Chunked prefill on by default
  • APC on by default
  • Pipeline parallelism

Performance

  • Better fp8
  • More spec decode
  • KV Cache swapping

Re-Architecture

  • Scheduler out of critical loop
  • torch.compile

OSS Community

  • Continuous Benchmark
  • Better CI
  • Code quality

37 of 50

Our Goal

Build the fastest and easiest-to-use open-source LLM inference & serving engine

38 of 50

Topics

  • vLLM Introduction
  • Project Update
  • Roadmap
  • Spec Decode Deep Dive


39 of 50

What is Speculative Decoding?

  • Methods to reduce generation latency
  • Intuition: Some tokens are easier to generate than others, so use a small model to generate the easy tokens and the large model to generate the difficult ones (see the sketch below)
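To make the propose-and-verify loop concrete, here is a toy greedy sketch (not vLLM's implementation, which uses rejection sampling; draft_model.next_token and target_model.next_token_at_each_position are hypothetical helpers):

def speculative_step(prompt_ids, draft_model, target_model, k=3):
    # 1. The cheap draft model proposes k tokens autoregressively.
    ctx, proposal = list(prompt_ids), []
    for _ in range(k):
        tok = draft_model.next_token(ctx)               # hypothetical helper
        proposal.append(tok)
        ctx.append(tok)
    # 2. The target model scores prompt + proposal in ONE forward pass and
    #    returns its own greedy choice at each of the k+1 positions.
    target_choice = target_model.next_token_at_each_position(prompt_ids, proposal)
    # 3. Keep the longest agreeing prefix, plus one token from the target model.
    accepted = []
    for i in range(k):
        if proposal[i] == target_choice[i]:
            accepted.append(proposal[i])                # draft token verified
        else:
            accepted.append(target_choice[i])           # correction from the target
            return accepted
    accepted.append(target_choice[k])                   # bonus token: all k accepted
    return accepted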

[Figure: Running the 7B LLM alone generates T1, T2, T3 one token at a time from the prompt. With speculative decoding, a small 80M model proposes T1, T2, T3' cheaply, and the 7B LLM verifies the whole proposal in a single pass, keeping the matching tokens (T1, T2) and emitting its own token (T3) where the draft diverges.]

40 of 50

Vanilla speculative decoding with continuous batching

41 of 50

Speculative Decoding (by @cadedaniel)

[Architecture diagram (same structure as the earlier Speculative Decoding slide): a Lookahead Scheduler (schedules more than one slot per step) drives a Speculative Worker wrapping a Draft Worker (small draft model or prompt lookup decoding; in the future: RAG, Medusa, etc.) and a Target Worker, each with its own KV cache. Flow: Propose -> Score -> Rejection Sampler -> Accept.]

42 of 50

Vanilla Speculative Decoding is not the final solution!

  • Vanilla Speculative Decoding (VSD) is not always beneficial
  • Speculative decoding can even hurt performance
  • Different request rates have different optimal proposal lengths (k)


43 of 50

Dynamic Speculative Decoding (DSD) (RFC)

  • Decide the best proposal length (k) at different QPS
  • Decide when to stop doing speculative decoding
  • Generalizes to different types of speculative decoding
    • Draft-model based
    • Prompt lookup decoding
    • Medusa, etc.


44 of 50

Prompt Lookup Decoding + DSD

(by @leiwen83, @comaniac)

  • Up to 1.95x speedup at low QPS
  • Almost no performance degradation


More detailed benchmark numbers here.

45 of 50

Optimize Speculative Decoding (by @cadedaniel)

  • Correct and fully functional
  • Performance optimization (RFC)
    • Optimizing proposal time
      • P0: Reduce draft model control-plane communication
      • P0: Support draft model on different TP than target model
    • Optimizations for scoring time
      • P0: Re-enable bonus tokens to increase % accepted tokens
      • P1: Replace CPU-based batch expansion with multi-query attention kernel call
      • P1: Dynamic speculative decoding
    • Optimizations for both proposal and scoring time
      • P0: Decouple sampling serialization from sampling
      • P1: Amortize prepare_inputs over multiple forward passes
    • Optimizations for scheduling time
      • P0: Profile & optimize block manager V2


46 of 50

More Types of Speculative Decoding in Progress


47 of 50

Spec Decode Usage

from vllm import LLM

# Example prompts.
prompts = ["Hello, my name is", "The capital of France is"]

# Create an LLM with an HF model name and a draft model for speculative decoding.
llm = LLM(model="meta-llama/Meta-Llama-3-8B",
          speculative_model="facebook/opt-125m",
          num_speculative_tokens=5,
          use_v2_block_manager=True)

# Generate texts from the prompts.
outputs = llm.generate(prompts)

To use prompt lookup (ngram) speculative decoding, pass speculative_model="[ngram]" instead of a draft model (see the sketch below).
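A hedged sketch of the ngram variant; ngram_prompt_lookup_max is the window-size parameter recent vLLM versions document, so verify the name against your installed version:

llm = LLM(model="meta-llama/Meta-Llama-3-8B",
          speculative_model="[ngram]",        # prompt lookup decoding, no draft model
          num_speculative_tokens=5,
          ngram_prompt_lookup_max=4,          # assumed parameter: max ngram window size
          use_v2_block_manager=True)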

48 of 50

Speedup of DSD on Online Chatting (Arena) Dataset

Llama-7B (160M draft):   QPS 2 → 1.8x   QPS 4 → 2.2x   QPS 6 → 1.2x   QPS 8 → 1.0x
Llama-33B (1.1B draft):  QPS 1 → 1.7x   QPS 2 → 1.8x   QPS 3 → 1.2x   QPS 4 → 1.1x
Llama-70B (7B draft):    QPS 1 → 1.3x   QPS 2 → 1.2x   QPS 3 → 1.1x   QPS 4 → 1.0x

Speedup = request latency of non-SD / request latency with SD

49 of 50

Sponsors (funding compute!)


50 of 50


Building the fastest and easiest-to-use open-source LLM inference & serving engine!

https://twitter.com/vllm_project

https://opencollective.com/vllm

Ray Summit (09/30-10/02) now has a vLLM track. The CFP is open!