1 of 103

NYC vLLM Meetup

May 7, 2025

2 of 103

Welcome!

5:30  Intro to vLLM & Project Update
5:50  Intro to torch.compile and How It Works with vLLM
6:20  Demo of Production Deployment of vLLM on AMD
6:50  Intro to Mamba SSM Architecture
7:10  Q&A and Open Discussion
7:30  Pizza and Networking 🍕 🤝

3 of 103

Intro to vLLM & Project Update

Robert Shaw

Director of Engineering, Red Hat

vLLM Committer

4 of 103

Intro to vLLM

5 of 103

5

Build the fastest and easiest-to-use open-source LLM inference & serving engine

6 of 103

What Problem is vLLM Solving?

6

Production Inference Serving

  • Batch size > 1 & data center hardware
  • How do you:
    • Efficiently schedule requests into the next forward pass?
    • Manage the KV cache context and runtime memory footprint?
    • Make the GPUs go brrrr?

7 of 103

Why Is This A Hard Problem?

7

  • An LLM is a function that predicts the next token in a sequence
    • P(x_n | x_0, …, x_{n-1})

  • To generate text, we “chain together” passes through the model
    • → A single request requires multiple passes through the model
    • → A single generation request can last multiple seconds

  • Key Challenge: How to handle multiple concurrent requests

8 of 103

Challenge 1: Batching

8

Dynamic Batching 🙅🙅🙅

Continuous Batching 🙏🙏🙏

9 of 103

Challenge 2: KV Caching

9

KV Cache: Caching Key and Value vectors in self-attention saves redundant computation and accelerates decoding

10 of 103

Challenge 2: KV Caching

10

KV Cache: Caching Key and Value vectors in self-attention saves redundant computation and accelerates decoding - but takes up memory!

11 of 103

vLLM’s Original Innovation

11

Paged Attention + Continuous Batching

12 of 103

vLLM’s Original Innovation

12

[Figure: PagedAttention. Each request keeps logical KV blocks (Request A: “Alan Turing is a computer scientist and mathematician renowned …”; Request B: “Artificial Intelligence is the future of technology …”) that a per-request block table maps onto non-contiguous physical KV blocks shared across the pool.]

Paged Attention + Continuous Batching

13 of 103

vLLM Ecosystem

14 of 103

From this base, we have built…

14

  • Modeling interface: multi-modality too!
  • Hardware interface & optimized kernels on each
    • Quantization, Attention, Collectives
  • Classes of optimizations (see the config sketch after this list)
    • Distributed inference: Tensor, Pipeline, Data, Expert Parallelism
    • Chunked prefill scheduling
    • Prefix caching
    • Speculative decoding
    • Multi-LoRA
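
Many of these can be toggled when constructing an engine. A minimal sketch (argument names follow the vLLM LLM API; the model id and sizes are only examples, and some flags have moved between releases):

```python
from vllm import LLM, SamplingParams

# Example configuration combining several of the features listed above.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model id
    tensor_parallel_size=2,        # distributed inference (TP)
    enable_prefix_caching=True,    # prefix caching
    enable_chunked_prefill=True,   # chunked prefill scheduling
    enable_lora=True,              # multi-LoRA serving
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```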

15 of 103

2 Year Journey Of vLLM

15

vLLM has rapidly evolved from a research project to the open source default.

Pervasive → 100k daily installs in 2025

Explosive Growth → 10x usage increase in 2024

16 of 103

Community Flywheel

16

Hardware Vendors → Choice for MI300X, TPU enablement, Neuron enablement, Gaudi enablement, features for new HW

Model Creators → Llama, Qwen, Mistral, Molmo, Arctic, Phi, Jamba, DBRX, Transformers

Contribution Trajectory → [chart: commits by organization over time]

vLLM’s position is attracting investment from key ecosystem participants.

17 of 103

Cross Platform

17

vLLM supports the key models on the key hardware accelerators.

Accelerators: CPU, Neuron, TPU, Gaudi, Instinct, GPU

Models: Llama, Qwen, DeepSeek, Gemma, Mistral, Molmo, Phi, Nemotron, Granite

Single Platform To Run Your Models Across Accelerators and OEMs.

Environments: Edge, Virtual, Public Cloud, Private Cloud, Physical

18 of 103

Who Uses vLLM?

  • Model as a Service: AWS, GCP, Azure, …

  • AI in Scaled Production: Amazon, LinkedIn, Meta, …

  • Proprietary Deployments: Snowflake, IBM, …

  • Foundation Model Labs: Mistral, Cohere, Qwen, …

  • Fine-tuning Frameworks: veRL, TRL, OpenRLHF, …

  • Hardware Platforms: NVIDIA, AMD, Intel, …

19 of 103

vLLM Features

20 of 103

Why vLLM For Performance?

20

vLLM implements the key optimizations for fast inference

  • Inference Optimizations → to make your models faster
  • Distributed Inference → to deploy large models efficiently

21 of 103

Quantization

21

Use low bit precisions (e.g., FP8, INT8, FP4) to store and compute

  1. Weight Quantization
     • Reduced storage & memory footprint
     • E.g. a 100B-parameter model: BFloat16 → 200 GB, FP8 → 100 GB
  2. KV Cache Quantization
     • Reduced KV cache storage & faster attention
     • Crucial for long-context workloads
  3. Activation Quantization
     • Faster matrix multiplication & communication
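
A sketch of turning these on when serving (argument names per the vLLM docs; the model id is only an example):

```python
from vllm import LLM

# Sketch: quantize weights to FP8 on the fly and use an FP8 KV cache.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    quantization="fp8",     # weight quantization (dynamic FP8)
    kv_cache_dtype="fp8",   # KV cache quantization
)
```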

22 of 103

Get Started with Model Optimization Now

22

→ LLM Compressor Tools: red.ht/llm-compressor

→ Optimized Model Hub (Llama, Qwen, Mistral, DeepSeek, Gemma, Phi, Molmo, Granite, Nemotron): red.ht/optimized-models
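
For producing your own optimized checkpoints, the LLM Compressor flow is roughly as follows. This is a sketch based on the project's FP8 example; the oneshot import path has moved between releases, so check the repo linked above:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.transformers import oneshot  # newer releases: from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"   # example model id
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# FP8 dynamic quantization of Linear layers, skipping the LM head.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
oneshot(model=model, recipe=recipe)

model.save_pretrained("Llama-3.1-8B-Instruct-FP8-Dynamic", save_compressed=True)
tokenizer.save_pretrained("Llama-3.1-8B-Instruct-FP8-Dynamic")
```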

23 of 103

Impact of Quantization

23

Quantization Enables More Tokens For Fixed Hardware

24 of 103

Automatic Prefix Caching

24

Re-use KV caches across requests!

Request A

A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. User: Hello!

Request B

A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. User: How are you?
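
A sketch of the scenario above through the offline API (flag name per the vLLM docs; the model id is an example):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_prefix_caching=True)

shared = ("A chat between a curious user and an artificial intelligence assistant. "
          "The assistant gives helpful, detailed, and polite answers to the user's questions. ")

params = SamplingParams(max_tokens=64)
# Request A populates the prefix cache; Request B reuses the cached KV blocks
# for the shared system prompt and only computes the new suffix.
print(llm.generate([shared + "User: Hello!"], params)[0].outputs[0].text)
print(llm.generate([shared + "User: How are you?"], params)[0].outputs[0].text)
```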

25 of 103

Speculative Decoding

25

Accelerate the decoding phase with speculation; a variety of methods are supported.
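
One way to enable it, sketched below with n-gram (prompt-lookup) speculation, which needs no draft model. The exact argument shape has changed across vLLM versions and the values are only examples:

```python
from vllm import LLM

# Sketch: n-gram speculation proposes draft tokens by looking them up in the prompt.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    speculative_config={
        "method": "ngram",
        "num_speculative_tokens": 3,
        "prompt_lookup_max": 4,
    },
)
```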

26 of 103

Impact of Speculative Decoding

26

Spec Decoding Enables Better Latency In Bandwidth Bound Regimes

27 of 103

vLLM Combines All Optimizations Together

27

[Chart: throughput starting from a “Without Optimizations” baseline and improving as vLLM’s optimizations are combined.]

28 of 103

vLLM Goes Distributed

28

Single Device → Single Host, Multi-Device → Multi-Host, Multi-Device

29 of 103

Forms of Parallelism in vLLM

29

Tensor Parallelism (TP)

Pipeline Parallelism (PP)

Expert Parallelism (EP)

Data Parallelism (DP)

Disaggregated Serving

30 of 103

Tensor Parallelism

30

Partition the model’s hidden dimension → All-reduce to aggregate the outputs

  • Works well for ≤ 8 devices

  • vLLM provides optimized all-reduce implementation

  • Limited scalability

31 of 103

Pipeline Parallelism

31

Distribute layers to different devices → execute in a pipelined fashion

  • Point-to-point communication instead of expensive all-reduce

  • Load imbalance between stages

  • Doesn’t reduce latency

32 of 103

Expert Parallelism

32

Place experts on different devices

→ All-to-all to exchange tokens

  • Lower communication overheads than tensor parallelism

  • Load imbalance between experts

33 of 103

Data Parallelism

33

Partition the inputs instead of the model → Model weights are replicated

  • Lower communication overheads

  • Load imbalance between replicas

  • Increased memory consumption for model weights

34 of 103

Disaggregated Serving

34

Partition the “time” dimension

→ Separate instances for prompt processing & token generation

  • Separation of concerns

  • Better control over latency

  • KV cache transfer overheads

  • Lower device utilization

35 of 103

vLLM Supports Mixed Parallelism

35

Data + Expert Parallelism

(e.g., DeepSeek V3)

Tensor + Pipeline Parallelism

(e.g., Llama 3 405B)
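
A rough sketch of how these combinations are expressed through the Python API (argument names per recent vLLM releases; model ids and parallel degrees are illustrative, and the two engines are shown side by side only for contrast):

```python
from vllm import LLM

# Tensor + pipeline parallelism for a large dense model (e.g., Llama 3 405B).
dense = LLM(
    model="meta-llama/Llama-3.1-405B-Instruct",
    tensor_parallel_size=8,
    pipeline_parallel_size=2,
)

# Data + expert parallelism for a large MoE model (e.g., DeepSeek V3).
moe = LLM(
    model="deepseek-ai/DeepSeek-V3",
    data_parallel_size=8,
    enable_expert_parallel=True,
    trust_remote_code=True,
)
```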

36 of 103

Intro to torch.compile and How It Works with vLLM

Richard Zou

Staff Software Engineer, Meta

Luka Govedič

Software Engineer II, Red Hat

37 of 103

What is torch.compile?

37

Just-in-time compiler for PyTorch code

38 of 103

What is torch.compile?

38

Just-in-time compiler for PyTorch code

39 of 103

Why use torch.compile?

39

Our value proposition: fast baseline performance, saving YOU the development time of hand-tuning model performance.

40 of 103

Why use torch.compile?

40

Performance with the Flexibility of PyTorch

41 of 103

How does torch.compile work?

41

  • Our frontend (TorchDynamo) captures graphs via a custom Python bytecode interpreter; it produces graphs plus rewritten bytecode.

42 of 103

How does torch.compile work?

42

Our frontend (TorchDynamo) captures 1+ straightline graphs.

43 of 103

torch.compile optimization highlights

43

  1. Pointwise + reduction fusion
  2. autotuning (e.g. block size selection)

44 of 103

torch.compile optimization highlights

44

3. matmul selection and fusion:

  • Given a matmul (and pointwise epilogue), with mode=“max-autotune”, we will benchmark (1) torch.matmul vs (2) triton matmul vs (3) cutlass matmul (hidden behind a config)

  • In the triton config the pointwise epilogue may be fused onto the matmul.

45 of 103

torch.compile optimization highlights

45

4. CUDAGraphs

  • CUDAGraphs: captures a sequence of kernel launches so that they can be re-launched in the future with low overhead

  • Does not capture CPU Compute, only CUDA kernels

  • It is difficult to use the raw CUDAGraphs API safely; torch.compile has more safety in that it’ll split the graph on known unsupported operators.
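
For reference, the raw capture pattern (standard PyTorch API; the model and shapes are illustrative) shows why it is easy to get wrong: inputs and outputs must live in static buffers and the capture needs a warm-up.

```python
import torch

model = torch.nn.Linear(4096, 4096).cuda()
static_x = torch.randn(16, 4096, device="cuda")

# Warm up on a side stream before capturing.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        static_y = model(static_x)
torch.cuda.current_stream().wait_stream(s)

# Capture the kernel launches into a graph.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_y = model(static_x)

# Replay later: copy new data into the static input buffer and replay.
static_x.copy_(torch.randn(16, 4096, device="cuda"))
g.replay()  # static_y now holds the output for the new input
```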

46 of 103

How does vLLM use torch.compile?

  • torch.compile is on by default with vLLM V1 engine
  • To turn it off, pass --enforce-eager

46

47 of 103

How does vLLM use torch.compile?

Simple mental model for compilation. We’ll see how vLLM customizes it.

47

48 of 103

How does vLLM use torch.compile?

Caching (cold start, warm start)

  • Compilation is guaranteed to finish before request serving
  • If you are serving on multiple machines, copy the cache directory across machines to speed up spin-up time

48

49 of 103

How does vLLM use torch.compile?

dynamic shapes, compile_sizes, and autotuning

  • By default vLLM compiles with dynamic batch size (e.g. the graph can be used for multiple batch sizes)

  • torch.compile is able to tune performance better via specializing on the batch size (especially at batch_size=1)

  • Use compile_sizes={1, 2, 4} to control what batch sizes torch.compile should further specialize on.
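
A sketch of passing this through LLM(compilation_config=...); the sizes are just examples, and the accepted shapes (dict vs. CompilationConfig object) vary slightly across versions:

```python
from vllm import LLM

# Dynamic-shape compilation by default, plus extra specialization
# for batch sizes 1, 2 and 4.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    compilation_config={"compile_sizes": [1, 2, 4]},
)
```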

49

50 of 103

How does vLLM use torch.compile?

50

51 of 103

How does vLLM use torch.compile?

Piecewise CUDAGraphs

  • CUDAGraphs: captures a sequence of kernel launches so that they can be re-launched in the future with low overhead
  • Not all operators are supported (e.g. cascade attention)

51

52 of 103

How does vLLM use torch.compile?

FlexAttention integration (coming soon)

  • FlexAttention is an API that allows the easy implementation of many attention variants (Causal, Alibi, Tanh Soft-Capping) in a few lines of PyTorch code

  • The attention variant gets lowered into a fused FlashAttention kernel via torch.compile.

52

53 of 103

How does vLLM use torch.compile?

FlexAttention examples (in standard PyTorch)
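
A minimal sketch of two variants written with FlexAttention in standard PyTorch (torch >= 2.5; a CUDA device is assumed, and the ALiBi slope and soft-cap value are illustrative):

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

# Compiling flex_attention lowers each score_mod into a fused FlashAttention-style kernel.
flex_attention = torch.compile(flex_attention)

B, H, S, D = 1, 8, 1024, 64
q, k, v = (torch.randn(B, H, S, D, device="cuda", dtype=torch.float16) for _ in range(3))

def alibi(score, b, h, q_idx, kv_idx):
    # ALiBi-style relative position bias (single slope, for illustration).
    return score - 0.5 * (q_idx - kv_idx)

def tanh_soft_cap(score, b, h, q_idx, kv_idx):
    # Tanh soft-capping of attention scores.
    return 30.0 * torch.tanh(score / 30.0)

out_alibi = flex_attention(q, k, v, score_mod=alibi)
out_capped = flex_attention(q, k, v, score_mod=tanh_soft_cap)
```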

53

54 of 103

Custom torch.compile passes in vLLM

54

  1. Why?
  2. How?
  3. Perf?
  4. Status?

55 of 103

Custom torch.compile passes in vLLM: Why?

55

Performance/simplicity tradeoff:

  • vLLM model definitions are declarative and expressive;
    • model writers care about what, not how
  • vLLM users want maximum performance
  • We can achieve both with torch.compile!

Why do we need custom passes?

  • Transformations involving custom kernels
  • Additional optimizations not present in Torch Inductor

56 of 103

LLaMa model capture example

56

  • Model definition can read config, involve control flow, and use abstractions
  • Traced graph can optimize out CPU overhead and fuse operations, producing efficient code
  • What if custom kernels are involved?

57 of 103

SiLU-Mul + Quant fusion

57

  • SiLU-Mul (orange) is an activation function
  • Quantize (green) happens at the start of Linear varieties for a quantized model
  • They cannot be fused in eager mode without breaking abstractions
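
For concreteness, the unfused pattern looks roughly like this in eager PyTorch; a simplified reference with static FP8 scaling, not vLLM's actual kernels:

```python
import torch
import torch.nn.functional as F

def silu_mul(x: torch.Tensor) -> torch.Tensor:
    # Gated activation used in LLaMA-style MLPs: SiLU(gate) * up
    d = x.shape[-1] // 2
    return F.silu(x[..., :d]) * x[..., d:]

def quant_fp8(y: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Static FP8 quantization at the input of the next quantized Linear.
    return (y / scale).clamp(-448.0, 448.0).to(torch.float8_e4m3fn)

# Eager mode runs these as two separate kernels with an intermediate tensor;
# the custom pass rewrites the traced graph to call one fused kernel instead.
x = torch.randn(4, 8192, device="cuda", dtype=torch.bfloat16)
out = quant_fp8(silu_mul(x), scale=torch.tensor(0.02, device="cuda"))
```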

58 of 103

Custom torch.compile passes in vLLM: How?

58

  • Using the Inductor pattern-matcher, we can do match-and-replace on a pattern
  • Pattern and replacement functions traced with fake tensors to produce fx.Graphs

59 of 103

Custom torch.compile passes in vLLM: Perf?

59

  • Setup: 8x MI300, meta-llama/Llama-3.1-405B-Instruct-FP8, TP=8
  • -1% to 8% improvement in throughput, close to the maximum possible (~8%)
  • Torch-based implementations are excellent as of torch==2.7

60 of 103

Custom torch.compile passes in vLLM: Perf?

60

  • Setup: H100, RedHatAI/Meta-Llama-3-70B-Instruct-FP8, TP=1
  • Up to 4% TPOT improvement and 5% TTFT improvement (lower is better)
  • Torch-based implementations of layernorm and silu-mul are even faster than fusion!

61 of 103

Custom torch.compile passes in vLLM

61

Existing passes in vLLM:

  • RMSNorm + Quant fusion (#9138, #10906)
  • SiluMul + Quant fusion (#10867)
  • Sequence Parallelism (#16155 - community contribution)
  • No-op Elimination & Fix Functionalization - complementary graph opts

Coming soon:

  • Attention + Quant fusion (#16756)

Ways to add custom passes:

  • Add to PostGradPassManager
  • -O {'inductor_passes': {'post_grad_custom_post_pass': <pass>}}
  • LLM(compilation_config=...) (same as above)

62 of 103

Demo of Production Deployment of vLLM on AMD

Andy Luo

Director of AI Product Marketing, AMD

Mahdi Ghodsi

AI Solution Architect, AMD

63 of 103

AMD ROCm™ Software, Ecosystem & Performance Optimizations

Andy Luo, Mahdi Ghodsi - AMD AI Group

May 7, 2025

64 of 103

Delivering New Capabilities for Generative AI

65 of 103

65

66 of 103

vLLM on AMD CI Coverage and Plans

66

67 of 103

Typical LLM Inference Optimizations

67

68 of 103

AMD Quark: Unified Model Optimization & Quantization Library

WHAT IS AMD QUARK?

  • AMD's new open-source model optimization library, focusing on AI model quantization.

KEY CAPABILITIES:

  • Supports major AI ecosystems: PyTorch & ONNX.
  • Models compatible with multiple execution stacks: PyTorch, ONNXRuntime, vLLM, HF Transformers, Llama.cpp, etc.
  • Provides user-centric quantization flows (PyTorch Native, ONNX-to-ONNX, Hybrid) to "meet customers where they are".
  • Advanced algorithms and datatype support: QuaRot, SmoothQuant, AWQ, GPTQ, LSQ and others

MATURITY & VALIDATION:

  • Widely used in production deployments by internal and external AMD customers.
  • Proven results: Contributed high-accuracy models for AMD's MLPerf Inference submissions.

68

69 of 103

Quark + vLLM Integration

GOAL: Seamlessly connect Quark's quantization capabilities with the vLLM inference engine, allowing direct loading and execution of Quark-quantized models.

Core Integrations:

  • Native Quark Format Support: Direct support for the Quark quantization format within vLLM. (#10765)
  • Quantization Schemes: W8A8 (INT8), W8A8 (FP8), FP8 KV Cache. (#16236)
  • Advanced algorithms: AutoSmoothQuant (#15861)

WIP Features by Quark Team:

  • OCP MXFP4: Enable vLLM to load and run Quark-quantized OCP MXFP4 models. (#16943)
  • FP8 Attention: Extend the Triton FAv2 kernel to run Quark-quantized models with FP8 attention. (#12591) (jointly with the AMD vLLM team)
  • Rotation Algorithms for Quantization: Introduce "online rotation" capabilities in vLLM to efficiently run advanced 4-bit Quark-quantized models such as QuaRot. (#16443)
  • Enhancements for ROCm: Improve FP8 quantized model support in vLLM on AMD GPUs. (#12612)

69

70 of 103

AMD AI Tensor Engine for ROCm (AITER)

  • High-performance AI operators designed to accelerate various AI workloads
  • AITER serves as a unified platform where customers can easily find and integrate optimized operators into their existing frameworks—be it private, public, or custom-built

70

71 of 103

AITER Solution for Deepseek-R1/V3

71

72 of 103

AITER op Plug-In

72

73 of 103

AITER’s Integration in vLLM for DeepSeek V3/R1

  • Incorporating AITER’s Optimizations
    • 2x performance improvement: 6,484.76 → 13,704.36 tok/s

73

74 of 103

AITER’s Integration in vLLM for DeepSeek V3/R1

VLLM_USE_V1=1

VLLM_ROCM_USE_AITER=1

1. rocm_aiter_ck_moe
2. rocm_aiter_fmoe_fp8_blockscale_g1u1
3. rocm_aiter_asm_moe
4. rocm_aiter_topk_softmax
5. rocm_aiter_shuffle_weight
6. rocm_aiter_asm_moe_tkw1

https://github.com/vllm-project/vllm/pull/16752

https://github.com/vllm-project/vllm/pull/16727

https://github.com/vllm-project/vllm/pull/16674
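
Putting the two environment variables together with an engine launch (a sketch; the model id and parallel degree are illustrative, and the variables must be set before vLLM is imported):

```python
import os

# Enable the V1 engine and the AITER-backed ops listed above.
os.environ["VLLM_USE_V1"] = "1"
os.environ["VLLM_ROCM_USE_AITER"] = "1"

from vllm import LLM

llm = LLM(
    model="deepseek-ai/DeepSeek-R1",   # example model id
    tensor_parallel_size=8,
    trust_remote_code=True,
)
```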

74

75 of 103

AITER Upstream

to vLLM

75

76 of 103

AMD vLLM Roadmap

76

77 of 103

vLLM on ROCm

Latest ROCm developer vLLM Docker image:

77

78 of 103

vLLM Based ROCm AI Tutorials

78

79 of 103

AI Developer Hub for AMD GPUs: Simplified Onboarding with Clear Guidance & Tutorials

Recommended Flows: Docker containers

vLLM | Megatron-LM | PyTorch

Tutorials (Jupyter notebook based)

Inference | Training | Fine-tuning

Performance Results on MI300X

Selected LLM model inference benchmarks

Selected LLM model training benchmarks

Supported open-source projects

Open-source projects popular among AI developers

79

80 of 103

Demo #1: Open WebUI + MCP

80

81 of 103

Demo #2: Browser Use (Web UI) Agent

81

82 of 103

83 of 103


84 of 103

Intro to Mamba SSM Architecture

Thomas Parnell

Principal Research Scientist, IBM

85 of 103

Attention is all you need?

85

  • Attention is incredibly good at language modeling.
  • In vLLM, we use lots of clever methods for efficient implementation on modern GPUs:
    • Paged KV caching, Kernel fusion, Tensor cores, TMA
  • Despite all this, there are still some more “fundamental” issues for inference:
    • KV cache grows linearly with the sequence length and batch size.
    • Arithmetic intensity during decode is low (e.g., ~20 ops/byte for llama3-8b at bs=16)
    • Prefill time (TTFT) is quadratic in the prompt length

Device | TFLOPS (FP8/INT8) | Memory bandwidth (TB/s) | Arithmetic intensity (ops/byte)
A100   | 624               | 1.6                     | 390
H100   | 2000              | 3.2                     | 625
B100   | 3500              | 8                       | 437

86 of 103

Why are long sequences important?

86

[Figure: three long-context serving patterns. RAG: system + docs + user. Agentic: system, user, then repeated generation/tool turns. Thinking/Reasoning: system, user, then generation interleaved with long “think” segments.]

How can we scale language modeling to extremely long sequences without killing performance?

87 of 103

Structured State Space Sequence Models (S4)

87

  • S4 models [Gu, Goel, Ré, 2021] are functions that map sequence x to sequence y through implicit latent state h.
  • Matrix A is structured to enable efficient implementation.
  • Theoretical benefits over attention:
    • Linear in sequence length (T)
    • State size (N) does not grow with T
  • Decoding can be implemented very efficiently using recurrent form but S4 also admits a convolutional representation for efficient prefill + training.
  • However, S4 is not good at:
    • Selective copying
    • In-context reasoning
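
For reference, the discrete-time recurrence described above (sequence x mapped to sequence y through a latent state h of size N, with discretized parameters written with a bar) is commonly stated as:

```latex
\begin{aligned}
h_t &= \bar{A}\, h_{t-1} + \bar{B}\, x_t \\
y_t &= C\, h_t
\end{aligned}
```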

88 of 103

Mamba-1 aka S6 (Gu + Dao, 2023)

88

  • Matrices A, B, C become dependent on the timestep.
    • Model can selectively focus on individual tokens.
    • Huge improvement in selective copying and in-context reasoning.
  • However, the model no longer admits a convolutional form, making training/prefill inefficient.
  • To overcome this, the authors proposed a hardware-efficient method for unrolling the recurrence
    • “Selective Scan” + S4 = S6
  • However, Mamba-1 is still less efficient than attention-based models since it does not make use of matrix multiplication units (e.g., tensor cores).

[Figure: recurrence equations for S4 vs. Mamba-1 (S6); see the sketch below.]
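
A sketch of the contrast the figure draws (notation follows Gu & Dao, 2023; the bar denotes discretization with step size Δ):

```latex
\begin{aligned}
\text{S4 (time-invariant):} \quad & h_t = \bar{A}\, h_{t-1} + \bar{B}\, x_t, \quad y_t = C\, h_t \\
\text{Mamba-1 / S6 (selective):} \quad & h_t = \bar{A}_t\, h_{t-1} + \bar{B}_t\, x_t, \quad y_t = C_t\, h_t
\end{aligned}
```

In S6 the discretized A̅_t, B̅_t and the output matrix C_t depend on the input x_t (through Δ_t, B_t, C_t), while the underlying A stays fixed.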

89 of 103

SSMs are Matrix Transformations

89

Source: T. Dao + A. Gu, “Transformers are SSMs” (2024)

  • SSMs (including both S4 and Mamba-1) can be re-written as a matrix transformation from input sequence X to output sequence Y [Dao + Gu, 2024]
  • This matrix has a special structure (“semiseparable matrix of rank N”) that allows us to perform matrix multiplication efficiently.
  • Exactly how efficiently depends on the structure of the matrices A_t

90 of 103

Mamba-2 (Dao + Gu, 2024)

90

  • In Mamba-2, the authors introduce even more structure to the matrices A_t
  • They use this structure to derive a matrix multiplication-based algorithm that uses O(TN²) FLOPs and O(TN) memory.
  • This algorithm has been implemented in vLLM using Triton kernels and enables:
    • Efficient use of tensor cores.
    • Support for tensor parallelism.
    • Chunked prefill

[Figure: Mamba-1 vs. Mamba-2 computation. Source: T. Dao + A. Gu, “Transformers are SSMs” (2024)]

91 of 103

Mamba-2 is Linear Attention

91

Mamba-2 (Dao + Gu 2024)

Now the matrix transformation can be written:

Since the A_t are just scalars:

Linear Attention (Katharopoulos et al. 2020)

If a_t = 1 for all t, then Mamba-2 is equivalent to Linear Attention! 🤯
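
A sketch of the two expressions referenced above (notation after Dao & Gu, 2024, with scalar decays a_t):

```latex
% Mamba-2 as a lower-triangular matrix transformation Y = M X:
M_{ij} = C_i^{\top} \Big( \prod_{k=j+1}^{i} a_k \Big) B_j \quad (i \ge j), \qquad M_{ij} = 0 \quad (i < j)

% With a_t = 1 for all t, the decay product disappears:
M_{ij} = C_i^{\top} B_j \quad (i \ge j)
```

The second form is exactly causal linear attention, with queries Q_i = C_i and keys K_j = B_j.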

92 of 103

Hybrid Architectures

92

  • The Mamba-2 sequence transformation can be integrated into neural network architectures by forming blocks.
  • From these blocks, one can construct complex hybrid architectures interleaving Mamba-2 layers with other layers such as RMSNorm, Linear + Softmax, MoE, and Attention.
  • Example: Bamba (IBM Research, Princeton, CMU, and UIUC) mixes Mamba-2 and Attention layers in a 9:1 ratio.

Source: T. Dao + A. Gu, “Transformers are SSMs” (2024)

Supported in vLLM:

93 of 103

Model Quality (Bamba-9B-v2)

93

Source: R. Ganti et al., “Bamba-9B-v2 - Fast and powerful!” (2025)

Note: Bamba-9B v2 is trained on 2T tokens; most other models are trained with 10T+ tokens

94 of 103

Performance in vLLM (Throughput)

94

95 of 103

Granite 4.0 Tiny Preview

95

  • IBM Granite 4.0 Tiny Preview is a preliminary version of the smallest model in the upcoming Granite 4.0 family of language models.
  • Like Bamba, it is a hybrid model mixing Mamba-2 and Attention layers in a 9:1 ratio.
  • Unlike Bamba, it uses MoE with 7B total parameters and 64 experts, yielding 1B active parameters at inference time.

Granite 4.0 Tiny is already supported in vLLM (via GraniteMoeHybrid model) 🎉

96 of 103

Limitations and next steps

96

  • Performance issues on serving benchmarks with relatively small prompts (e.g., ShareGPT)
    • May be related to chunked prefill implementation and/or Triton launch overhead.
    • We are working on it!
  • Enabling Mamba-based models for vLLM V1
    • Need to enable prefix caching.
    • Requires some work to properly manage state for complex hybrid models.

Source: C. Zhang et al. - “Jenga: Effective Memory Management for Serving LLM with Heterogeneity” (2025)

97 of 103

Join the Open Source AI Movement

Saša Zelenović

Principal Product Marketing Manager, Red Hat

98 of 103

Contribute to Key vLLM Features via GitHub

  • Comment/review PRs that are interesting to you
  • Join the discussions on RFCs
  • Check out “good first issue” tags
  • Build AI apps with open source models and vLLM - and let us know!

98

99 of 103

Join Bi-Weekly vLLM Office Hours

  • Happening every other Thursday at 2:00PM ET
  • Hear the bi-weekly vLLM update
  • Give feedback & ask questions
  • Deep dive into cutting-edge developments to accelerate your vLLM inference

99

100 of 103

Let Us Know How We Did Today!

Take 1 minute now to complete our 3-question survey.

100

red.ht/nyc-vllm-meetup-survey

101 of 103

Join us at Women in Data Science on May 16!

101

102 of 103

Q&A

103 of 103

Thank You!