NYC vLLM Meetup
May 7, 2025
Welcome!
5:30 - Intro to vLLM & Project Update
5:50 - Intro to torch.compile and How It Works with vLLM
6:20 - Demo of Production Deployment of vLLM on AMD
6:50 - Intro to Mamba SSM Architecture
7:10 - Q&A and Open Discussion
7:30 - Pizza and Networking 🍕 🤝
Intro to vLLM & Project Update
Robert Shaw
Director of Engineering, Red Hat
vLLM Committer
Intro to vLLM
Build the fastest and easiest-to-use open-source LLM inference & serving engine
What Problem is vLLM Solving?
Production Inference Serving
Why Is This A Hard Problem?
Challenge 1: Batching
Dynamic Batching 🙅🙅🙅
Continuous Batching 🙏🙏🙏
Challenge 2: KV Caching
KV Cache: Caching Key and Value vectors in self-attention saves redundant computation and accelerates decoding
Challenge 2: KV Caching
KV Cache: Caching Key and Value vectors in self-attention saves redundant computation and accelerates decoding - but takes up memory!
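A back-of-envelope sizing sketch of why the memory cost matters; the model shape and workload below are illustrative assumptions, not numbers from the talk.

```python
# Rough KV cache sizing: 2 (K and V) x layers x KV heads x head_dim x tokens x bytes/elem.
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch * dtype_bytes

# Hypothetical Llama-8B-like shape (32 layers, 8 KV heads, head_dim 128), FP16 cache.
gib = kv_cache_bytes(32, 8, 128, seq_len=8192, batch=16) / 2**30
print(f"~{gib:.0f} GiB of KV cache")  # ~16 GiB for this workload alone
```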
vLLM’s Original Innovation
Paged Attention + Continuous Batching
Diagram: Request A's logical KV blocks ("Alan Turing is a computer scientist and mathematician renowned ...") and Request B's logical KV blocks ("Artificial Intelligence is the future of technology") are mapped through per-request block tables onto non-contiguous physical KV blocks.
Paged Attention + Continuous Batching
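A minimal, hypothetical sketch of the block-table indirection in the diagram above; the names and data structure are invented for illustration and are not vLLM's internals.

```python
# Hypothetical sketch of a per-request block table (invented names, not vLLM internals).
BLOCK_SIZE = 4  # tokens per KV block, matching the diagram

class BlockTable:
    def __init__(self):
        self.logical_to_physical = []   # index = logical block, value = physical block id

    def append_token(self, position, free_blocks):
        # Allocate a fresh physical block whenever a new logical block starts.
        if position % BLOCK_SIZE == 0:
            self.logical_to_physical.append(free_blocks.pop())

    def physical_slot(self, position):
        # Map a token position to (physical block id, offset) for KV reads/writes.
        return self.logical_to_physical[position // BLOCK_SIZE], position % BLOCK_SIZE

free = list(range(100, 0, -1))   # pool of free physical block ids
table = BlockTable()
for pos in range(9):             # "Alan Turing is a computer scientist and mathematician renowned"
    table.append_token(pos, free)
print(table.physical_slot(8))    # the 9th token lives in the request's 3rd, non-contiguous block
```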
vLLM Ecosystem
From this base, we have built…
2 Year Journey Of vLLM
vLLM has rapidly evolved from a research project to the open source default.
Pervasive → 100k daily installs in 2025
Explosive Growth → 10x usage increase in 2024
Community Flywheel
Hardware Vendors
Contribution Trajectory
Model Creators
Choice for MI300X
TPU enablement
Neuron enablement
Gaudi enablement
Features for new HW
Llama
Qwen
Mistral
Molmo
Arctic
Phi
Jamba
DBRX
Transformers
Commits By Organization
vLLM’s position is attracting investment from key ecosystem participants.
Cross Platform
vLLM supports the key models on the key hardware accelerators.
CPU
Neuron
TPU
Gaudi
Instinct
GPU
Llama
Qwen
DeepSeek
Gemma
Mistral
Molmo
Phi
Nemotron
Granite
Single Platform To Run Your Models Across Accelerators and OEMs.
Edge
Virtual
Public Cloud
Private Cloud
Physical
Who Uses vLLM?
vLLM Features
Why vLLM For Performance?
vLLM implements the key optimizations for fast inference
Inference
Optimizations
To make your models faster
Distributed
Inference
To deploy large models efficiently
Quantization
Use low-bit precision (e.g., FP8, INT8, FP4) to store and compute
1. Weight Quantization
Reduced storage & memory footprint
E.g., a 100B-parameter model: BFloat16 → 200GB / FP8 → 100GB
2. KV Cache Quantization
Reduced KV cache storage & faster attention
Crucial for long-context workloads
3. Activation Quantization
Faster matrix multiplication & communication
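The weight-memory arithmetic behind the 100B-parameter example above, as a quick sketch:

```python
# Weight-memory arithmetic for the 100B-parameter example (storage only, no KV cache).
def weight_gb(num_params, bits_per_param):
    return num_params * bits_per_param / 8 / 1e9

params = 100e9
print(weight_gb(params, 16))   # BFloat16 -> 200.0 GB
print(weight_gb(params, 8))    # FP8      -> 100.0 GB
print(weight_gb(params, 4))    # FP4/INT4 ->  50.0 GB
```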
Get Started with Model Optimization Now
LLM Compression Tools
Optimized Model Hub
Llama
Qwen
Mistral
DeepSeek
Gemma
Phi
Molmo
Granite
Nemotron
→ Optimized Model Hub
red.ht/optimized-models
→ LLM Compressor Tools
red.ht/llm-compressor
Impact of Quantization
Quantization Enables More Tokens For Fixed Hardware
Automatic Prefix Caching
Re-use KV caches across requests!
Request A
A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. User: Hello!
Request B
A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. User: How are you?
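A hedged sketch of turning this on with vLLM's offline API (`enable_prefix_caching` is a real engine argument; the model choice is illustrative). The shared system prompt is prefilled once and its KV blocks are reused.

```python
# Sketch: automatic prefix caching with vLLM's offline API (model choice is illustrative).
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_prefix_caching=True)

shared = ("A chat between a curious user and an artificial intelligence assistant. "
          "The assistant gives helpful, detailed, and polite answers to the user's questions. ")
params = SamplingParams(max_tokens=64)

# The shared prefix is prefilled once; request B reuses its cached KV blocks.
outputs = llm.generate([shared + "User: Hello!", shared + "User: How are you?"], params)
```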
Speculative Decoding
Accelerate decoding phase with speculation - variety of methods.
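A toy draft-and-verify loop showing the idea; every function here is an invented stand-in, not vLLM's implementation. A cheap drafter proposes several tokens and the target model verifies them in one forward pass.

```python
# Toy draft-and-verify loop (invented stand-ins, not vLLM's implementation).
import random
random.seed(0)

def draft_model(ctx, k):
    # Cheap drafter: proposes k candidate tokens.
    return [random.randint(0, 9) for _ in range(k)]

def target_model(ctx, proposal):
    # Target model verifies the proposal: accepts a prefix, then adds one token of its own.
    accepted = []
    for tok in proposal:
        if random.random() < 0.7:   # stand-in for the accept/reject test
            accepted.append(tok)
        else:
            break
    return accepted, random.randint(0, 9)

tokens, k = [], 4
while len(tokens) < 32:
    accepted, bonus = target_model(tokens, draft_model(tokens, k))
    tokens += accepted + [bonus]    # one target forward pass can emit several tokens
print(f"generated {len(tokens)} tokens")
```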
Impact of Speculative Decoding
Spec Decoding Enables Better Latency In Bandwidth-Bound Regimes
vLLM Combines All Optimizations Together
Without Optimizations
vLLM Goes Distributed
Single Device
Single Host
Multi-Device
Multi-Host
Multi-Device
Forms of Parallelism in vLLM
Tensor Parallelism (TP)
Pipeline Parallelism (PP)
Expert Parallelism (EP)
Data Parallelism (DP)
Disaggregated Serving
Tensor Parallelism
Partition the model’s hidden dimension → All-reduce to aggregate the outputs
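Why the all-reduce works: sharding the contraction (hidden) dimension yields partial outputs that sum to the full matmul. A single-process sketch with plain torch:

```python
# Row-parallel linear layer, single-process sketch: partial matmuls over shards of the
# hidden dimension sum to the full result, which is what the all-reduce computes.
import torch
torch.manual_seed(0)

x = torch.randn(2, 8)           # activations, hidden size 8
w = torch.randn(8, 4)           # weight, sharded across two "devices"

partial0 = x[:, :4] @ w[:4]     # device 0
partial1 = x[:, 4:] @ w[4:]     # device 1

print(torch.allclose(partial0 + partial1, x @ w, atol=1e-6))  # True
```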
Pipeline Parallelism
Distribute layers to different devices → execute in a pipelined fashion
Expert Parallelism
Place experts to different devices
→ All-to-all to exchange tokens
Data Parallelism
Partition the inputs instead of the model → Model weights are replicated
Disaggregated Serving
Partition the “time” dimension
→ Separate instances for prompt processing & token generation
vLLM Supports Mixed Parallelism
Data + Expert Parallelism
(e.g., DeepSeek V3)
Tensor + Pipeline Parallelism
(e.g., Llama 3 405B)
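A hedged configuration sketch for the Llama-405B-style case; `tensor_parallel_size` and `pipeline_parallel_size` are real engine arguments, while the model and sizes are illustrative.

```python
# Configuration sketch: TP + PP for a Llama-3-405B-class model (sizes are illustrative).
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-405B-Instruct",
    tensor_parallel_size=8,      # shard each layer across 8 GPUs within a node
    pipeline_parallel_size=2,    # split the layer stack across 2 nodes
)
```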
Intro to torch.compile and How It Works with vLLM
Richard Zou
Staff Software Engineer, Meta
Luka Govedič
Software Engineer II, Red Hat
What is torch.compile?
Just-in-time compiler for PyTorch code
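Minimal usage sketch: wrapping a function compiles it on the first call and reuses the compiled artifact afterwards.

```python
import torch

def gelu_mlp(x, w1, w2):
    return torch.nn.functional.gelu(x @ w1) @ w2

compiled = torch.compile(gelu_mlp)   # compilation happens lazily, on the first call

x, w1, w2 = torch.randn(4, 64), torch.randn(64, 256), torch.randn(256, 64)
out = compiled(x, w1, w2)            # JIT-compiles here, then reuses the compiled code
print(torch.allclose(out, gelu_mlp(x, w1, w2), atol=1e-4))
```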
Why use torch.compile?
Our value proposition: fast baseline performance that saves YOU the development time of hand-tuning model performance.
Why use torch.compile?
Performance with the Flexibility of PyTorch
How does torch.compile work?
Our frontend (TorchDynamo) captures one or more straight-line graphs.
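A small sketch of why capture can produce more than one graph: data-dependent control flow forces Dynamo to break the trace into straight-line pieces (illustrative function, not from the talk).

```python
import torch

def fn(x):
    y = x * 2
    if y.sum() > 0:        # value depends on the data -> Dynamo breaks the trace here
        return y.relu()
    return y - 1

compiled = torch.compile(fn)
print(compiled(torch.randn(8)))   # the pieces before and after the branch are captured separately
```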
torch.compile optimization highlights
3. matmul selection and fusion:
torch.compile optimization highlights
4. CUDAGraphs
How does vLLM use torch.compile?
Simple mental model for compilation. We’ll see how vLLM customizes it.
How does vLLM use torch.compile?
Caching (cold start, warm start)
How does vLLM use torch.compile?
dynamic shapes, compile_sizes, and autotuning
How does vLLM use torch.compile?
Piecewise CUDAGraphs
How does vLLM use torch.compile?
FlexAttention integration (coming soon)
How does vLLM use torch.compile?
FlexAttention examples (in standard PyTorch)
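A hedged FlexAttention sketch in standard PyTorch (requires a recent PyTorch release; the bias term is a toy example): `score_mod` rewrites attention scores elementwise, here a causal mask plus a linear positional bias.

```python
# Requires a recent PyTorch (2.5+); the bias term is a toy example.
import torch
from torch.nn.attention.flex_attention import flex_attention

def causal_with_bias(score, b, h, q_idx, kv_idx):
    # score_mod runs per (query, key) pair: add a linear positional bias and mask the future.
    return torch.where(q_idx >= kv_idx, score - 0.1 * (q_idx - kv_idx), -float("inf"))

q, k, v = (torch.randn(1, 4, 128, 64) for _ in range(3))  # (batch, heads, seq, head_dim)
out = flex_attention(q, k, v, score_mod=causal_with_bias)
print(out.shape)   # torch.Size([1, 4, 128, 64])
```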
Custom torch.compile passes in vLLM
Custom torch.compile passes in vLLM: Why?
Performance/simplicity tradeoff:
Why do we need custom passes?
LLaMa model capture example
SiLU-Mul + Quant fusion
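The unfused pattern this pass targets, as a toy sketch; the quantization helper below is a stand-in, not vLLM's custom op. A fused kernel computes the same result in a single pass over memory.

```python
# Unfused reference pattern (the quant helper is a stand-in, not vLLM's custom op).
import torch

def silu_and_mul(x):                       # gate/up projection output, shape (..., 2d)
    d = x.shape[-1] // 2
    return torch.nn.functional.silu(x[..., :d]) * x[..., d:]

def quant_fp8(x):                          # symmetric scale + cast to float8_e4m3fn
    scale = x.abs().amax() / 448.0         # 448 = max finite value of float8_e4m3fn
    return (x / scale).to(torch.float8_e4m3fn), scale

y, s = quant_fp8(silu_and_mul(torch.randn(4, 2048)))
print(y.shape, y.dtype)                    # a fused kernel does this in one pass over memory
```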
Custom torch.compile passes in vLLM: How?
Custom torch.compile passes in vLLM: Perf?
Custom torch.compile passes in vLLM
Existing passes in vLLM:
Coming soon:
Ways to add custom passes:
Demo of Production Deployment of vLLM on AMD
Andy Luo
Director of AI Product Marketing, AMD
Mahdi Ghodsi
AI Solution Architect, AMD
AMD ROCm™ Software, Ecosystem & Performance Optimizations
Andy Luo, Mahdi Ghodsi - AMD AI Group
May 7, 2025
Delivering New Capabilities for Generative AI
vLLM on AMD CI Coverage and Plans
Typical LLM Inference Optimizations
AMD Quark: Unified Model Optimization & Quantization Library
WHAT IS AMD QUARK?
KEY CAPABILITIES:
MATURITY & VALIDATION:
Quark + vLLM Integration
GOAL: Seamlessly connect Quark's quantization capabilities with the vLLM inference engine, allowing direct loading and execution of Quark-quantized models.
Core Integrations:
WIP Features by Quark Team:
AMD AI Tensor Engine for ROCm (AITER)
AITER Solution for DeepSeek-R1/V3
AITER op Plug-In
AITER’s Integration in vLLM for DeepSeek V3/R1
AITER’s Integration in vLLM for DeepSeek V3/R1
VLLM_USE_V1=1
VLLM_ROCM_USE_AITER=1
(launch sketch after the PR links below)
1. rocm_aiter_ck_moe
2. rocm_aiter_fmoe_fp8_blockscale_g1u1
3. rocm_aiter_asm_moe
4. rocm_aiter_topk_softmax
5. rocm_aiter_shuffle_weight
6. rocm_aiter_asm_moe_tkw1
https://github.com/vllm-project/vllm/pull/16752
https://github.com/vllm-project/vllm/pull/16727
https://github.com/vllm-project/vllm/pull/16674
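A hedged launch sketch: the two flags above are environment variables that vLLM reads at startup, so they can be set before constructing the engine; the model and parallelism below are illustrative.

```python
# The flags above are environment variables vLLM reads at startup; set them before
# constructing the engine. Model and parallelism here are illustrative.
import os
os.environ["VLLM_USE_V1"] = "1"
os.environ["VLLM_ROCM_USE_AITER"] = "1"

from vllm import LLM
llm = LLM(model="deepseek-ai/DeepSeek-R1", tensor_parallel_size=8)
```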
AITER Upstream
to vLLM
AMD vLLM Roadmap
vLLM on ROCm
Latest ROCm developer vLLM Docker image:
vLLM Based ROCm AI Tutorials
AI Developer Hub for AMD GPUs
Simplified Onboarding with Clear Guidance & Tutorials
Recommended Flows: Docker containers
vLLM | Megatron-LM | PyTorch
Tutorials (Jupyter notebook based)
Inference | Training | Fine-tuning
Performance Results on MI300X
Selected LLM model inference benchmarks
Selected LLM model training benchmarks
Supported open-source projects
Open-source projects popular among AI developers
Demo #1: Open WebUI + MCP
Demo #2: Browser Use (Web UI) Agent
Intro to Mamba SSM Architecture
Thomas Parnell
Principal Research Scientist, IBM
Attention is all you need?
Device | TFLOPS (fp8/int8) | Memory bandwidth TB/s | Arithmetic intensity (ops/B) |
A100 | 624 | 1.6 | 390 |
H100 | 2000 | 3.2 | 625 |
B100 | 3500 | 8 | 437 |
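The arithmetic-intensity column is just peak compute divided by memory bandwidth; a quick check of the table's numbers:

```python
# Arithmetic intensity = peak compute / memory bandwidth (values from the table).
specs = {"A100": (624e12, 1.6e12), "H100": (2000e12, 3.2e12), "B100": (3500e12, 8e12)}
for name, (flops, bandwidth) in specs.items():
    print(name, int(flops / bandwidth), "ops/byte")   # 390, 625, 437
```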
Why are long sequences important?
Diagram: prompt patterns that drive long sequences.
RAG pattern: system + docs + user
Agentic pattern: system, user, then repeated gen / tool / user turns
Thinking/Reasoning pattern: system, user, then interleaved gen and think segments
How can we scale language modeling to extremely long sequences without killing performance?
Structured State Space Sequence Models (S4)
Mamba-1 aka S6 (Gu + Dao, 2023)
S4
Mamba-1 aka S6
SSMs are Matrix Transformations
Source: T. Dao + A. Gu, “Transformers are SSMs” (2024)
Mamba-2 (Dao + Gu, 2024)
Mamba-1
Mamba-2
Source: T. Dao + A. Gu, “Transformers are SSMs” (2024)
Mamba-2 is Linear Attention
Mamba-2 (Dao + Gu 2024)
Now the matrix transformation can be written as:
Since the A_t are just scalars:
Linear Attention (Katharopoulos et al., 2020)
If a_t = 1 for all t, then Mamba-2 is equivalent to Linear Attention! 🤯
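A sketch of the equations behind this claim, reconstructed following Dao & Gu (2024); notation follows the paper and the slide above.

```latex
% Reconstruction sketch, following Dao & Gu, "Transformers are SSMs" (2024).
% With scalar decays a_t, the SSM output y = M x has matrix entries
\[
  M_{ts} = a_t a_{t-1} \cdots a_{s+1}\, C_t^{\top} B_s , \qquad t \ge s ,
\]
% while causal linear attention computes
\[
  y_t = \sum_{s \le t} \left( q_t^{\top} k_s \right) v_s .
\]
% Setting a_t = 1 for all t gives M_{ts} = C_t^{\top} B_s: unnormalized causal linear
% attention with C playing the role of queries and B the role of keys.
```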
Hybrid Architectures
Source: T. Dao + A. Gu, “Transformers are SSMs” (2024)
Supported in vLLM: ✅ ✅ ✅ ✅
Model Quality (Bamba-9B-v2)
Source: R. Ganti et al., “Bamba-9B-v2 - Fast and powerful!” (2025)
Note: Bamba-9B v2 is trained on 2T tokens; most other models are trained with 10T+ tokens
Performance in vLLM (Throughput)
Granite 4.0 Tiny Preview
Granite 4.0 Tiny is already supported in vLLM (via GraniteMoeHybrid model) 🎉
Limitations and next steps
Source: C. Zhang et al. - “Jenga: Effective Memory Management for Serving LLM with Heterogeneity” (2025)
Join the Open Source AI Movement
Saša Zelenović
Principal Product Marketing Manager, Red Hat
Contribute to Key vLLM Features via GitHub
Join Bi-Weekly vLLM Office Hours
Let Us Know How We Did Today!
Take 1 minute now to complete our 3-question survey.
red.ht/nyc-vllm-meetup-survey
Join us for Women in Data Science on May 16!
Q&A
Thank You!