1 of 103

NYC vLLM Meetup

May 7, 2025

2 of 103

Welcome!

5:30  Intro to vLLM & Project Update
5:50  Intro to torch.compile and How It Works with vLLM
6:20  Demo of Production Deployment of vLLM on AMD
6:50  Intro to Mamba SSM Architecture
7:10  Q&A and Open Discussion
7:30  Pizza and Networking 🍕 🤝

3 of 103

Intro to vLLM & Project Update

Robert Shaw

Director of Engineering, Red Hat

vLLM Committer

4 of 103

Intro to vLLM

5 of 103

5

Build the fastest and easiest-to-use open-source LLM inference & serving engine

6 of 103

What Problem is vLLM Solving?

6

Production Inference Serving

  • Batch size > 1 & data center hardware
  • How do you:
    • Efficiently schedule requests into the next forward pass?
    • Manage the KV cache context and runtime memory footprint?
    • Make the GPUs go brrrr?

7 of 103

Why Is This A Hard Problem?

7

  • An LLM is a function that predicts the next token in a sequence
    • P(x_n | x_0, …, x_{n-1})

  • To generate text, we “chain together” passes through the model
    • → A single request requires multiple passes through the model
    • → A single generation request can last multiple seconds

  • Key Challenge: How to handle multiple concurrent requests

8 of 103

Challenge 1: Batching

8

Dynamic Batching 🙅🙅🙅

Continuous Batching 🙏🙏🙏

9 of 103

Challenge 2: KV Caching

9

KV Cache: Caching Key and Value vectors in self-attention saves redundant computation and accelerates decoding

10 of 103

Challenge 2: KV Caching

10

KV Cache: Caching Key and Value vectors in self-attention saves redundant computation and accelerates decoding - but takes up memory!

11 of 103

vLLM’s Original Innovation

11

Paged Attention + Continuous Batching

12 of 103

vLLM’s Original Innovation

12

[Figure: PagedAttention. Each request keeps logical KV blocks (Request A: “Alan Turing is a computer scientist and mathematician renowned …”; Request B: “Artificial Intelligence is the future of technology …”) that a per-request block table maps onto non-contiguous physical KV blocks shared across the pool.]

Paged Attention + Continuous Batching

13 of 103

vLLM Ecosystem

14 of 103

From this base, we have built…

14

  • Modeling interface: multi-modality too!
  • Hardware interface & optimized kernels on each
    • Quantization, Attention, Collectives
  • Classes of optimizations (see the config sketch after this list)
    • Distributed inference: Tensor, Pipeline, Data, Expert Parallelism
    • Chunked prefill scheduling
    • Prefix caching
    • Speculative decoding
    • Multi-LoRA
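
Many of these can be toggled when constructing an engine. A minimal sketch (argument names follow the vLLM LLM API; the model id and sizes are only examples, and some flags have moved between releases):

```python
from vllm import LLM, SamplingParams

# Example configuration combining several of the features listed above.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model id
    tensor_parallel_size=2,        # distributed inference (TP)
    enable_prefix_caching=True,    # prefix caching
    enable_chunked_prefill=True,   # chunked prefill scheduling
    enable_lora=True,              # multi-LoRA serving
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```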

15 of 103

2 Year Journey Of vLLM

15

vLLM has rapidly evolved from a research project to the open source default.

Pervasive → 100k daily installs in 2025

Explosive Growth → 10x usage increase in 2024

16 of 103

Community Flywheel

16

Hardware Vendors → Choice for MI300X, TPU enablement, Neuron enablement, Gaudi enablement, features for new HW

Model Creators → Llama, Qwen, Mistral, Molmo, Arctic, Phi, Jamba, DBRX, Transformers

Contribution Trajectory → [chart: commits by organization over time]

vLLM’s position is attracting investment from key ecosystem participants.

17 of 103

Cross Platform

17

vLLM supports the key models on the key hardware accelerators.

Accelerators: CPU, Neuron, TPU, Gaudi, Instinct, GPU

Models: Llama, Qwen, DeepSeek, Gemma, Mistral, Molmo, Phi, Nemotron, Granite

Single Platform To Run Your Models Across Accelerators and OEMs.

Environments: Edge, Virtual, Public Cloud, Private Cloud, Physical

18 of 103

Who Uses vLLM?

  • Model as a Service: AWS, GCP, Azure, …

  • AI in Scaled Production: Amazon, LinkedIn, Meta, …

  • Proprietary Deployments: Snowflake, IBM, …

  • Foundation Model Labs: Mistral, Cohere, Qwen, …

  • Fine-tuning Frameworks: veRL, TRL, OpenRLHF, …

  • Hardware Platforms: NVIDIA, AMD, Intel, …

19 of 103

vLLM Features

20 of 103

Why vLLM For Performance?

20

vLLM implements the key optimizations for fast inference

  • Inference Optimizations → to make your models faster
  • Distributed Inference → to deploy large models efficiently

21 of 103

Quantization

21

Use low bit precisions (e.g., FP8, INT8, FP4) to store and compute

  1. Weight Quantization
     • Reduced storage & memory footprint
     • E.g. a 100B-parameter model: BFloat16 → 200 GB, FP8 → 100 GB
  2. KV Cache Quantization
     • Reduced KV cache storage & faster attention
     • Crucial for long-context workloads
  3. Activation Quantization
     • Faster matrix multiplication & communication
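
A sketch of turning these on when serving (argument names per the vLLM docs; the model id is only an example):

```python
from vllm import LLM

# Sketch: quantize weights to FP8 on the fly and use an FP8 KV cache.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    quantization="fp8",     # weight quantization (dynamic FP8)
    kv_cache_dtype="fp8",   # KV cache quantization
)
```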

22 of 103

Get Started with Model Optimization Now

22

→ LLM Compressor Tools: red.ht/llm-compressor

→ Optimized Model Hub (Llama, Qwen, Mistral, DeepSeek, Gemma, Phi, Molmo, Granite, Nemotron): red.ht/optimized-models
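
For producing your own optimized checkpoints, the LLM Compressor flow is roughly as follows. This is a sketch based on the project's FP8 example; the oneshot import path has moved between releases, so check the repo linked above:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.transformers import oneshot  # newer releases: from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"   # example model id
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# FP8 dynamic quantization of Linear layers, skipping the LM head.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
oneshot(model=model, recipe=recipe)

model.save_pretrained("Llama-3.1-8B-Instruct-FP8-Dynamic", save_compressed=True)
tokenizer.save_pretrained("Llama-3.1-8B-Instruct-FP8-Dynamic")
```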

23 of 103

Impact of Quantization

23

Quantization Enables More Tokens For Fixed Hardware

24 of 103

Automatic Prefix Caching

24

Re-use KV caches across requests!

Request A

A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. User: Hello!

Request B

A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. User: How are you?
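
A sketch of the scenario above through the offline API (flag name per the vLLM docs; the model id is an example):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_prefix_caching=True)

shared = ("A chat between a curious user and an artificial intelligence assistant. "
          "The assistant gives helpful, detailed, and polite answers to the user's questions. ")

params = SamplingParams(max_tokens=64)
# Request A populates the prefix cache; Request B reuses the cached KV blocks
# for the shared system prompt and only computes the new suffix.
print(llm.generate([shared + "User: Hello!"], params)[0].outputs[0].text)
print(llm.generate([shared + "User: How are you?"], params)[0].outputs[0].text)
```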

25 of 103

Speculative Decoding

25

Accelerate the decoding phase with speculation; a variety of methods are supported.
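
One way to enable it, sketched below with n-gram (prompt-lookup) speculation, which needs no draft model. The exact argument shape has changed across vLLM versions and the values are only examples:

```python
from vllm import LLM

# Sketch: n-gram speculation proposes draft tokens by looking them up in the prompt.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    speculative_config={
        "method": "ngram",
        "num_speculative_tokens": 3,
        "prompt_lookup_max": 4,
    },
)
```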

26 of 103

Impact of Speculative Decoding

26

Spec Decoding Enables Better Latency In Bandwidth Bound Regimes

27 of 103

vLLM Combines All Optimizations Together

27

[Chart: throughput starting from a “Without Optimizations” baseline and improving as vLLM’s optimizations are combined.]

28 of 103

vLLM Goes Distributed

28

Single Device → Single Host, Multi-Device → Multi-Host, Multi-Device

29 of 103

Forms of Parallelism in vLLM

29

Tensor Parallelism (TP)

Pipeline Parallelism (PP)

Expert Parallelism (EP)

Data Parallelism (DP)

Disaggregated Serving

30 of 103

Tensor Parallelism

30

Partition the model’s hidden dimension → All-reduce to aggregate the outputs

  • Works well for ≤ 8 devices

  • vLLM provides optimized all-reduce implementation

  • Limited scalability

31 of 103

Pipeline Parallelism

31

Distribute layers to different devices → execute in a pipelined fashion

  • Point-to-point communication instead of expensive all-reduce

  • Load imbalance between stages

  • Doesn’t reduce latency

32 of 103

Expert Parallelism

32

Place experts on different devices

→ All-to-all to exchange tokens

  • Lower communication overheads than tensor parallelism

  • Load imbalance between experts

33 of 103

Data Parallelism

33

Partition the inputs instead of the model → Model weights are replicated

  • Lower communication overheads

  • Load imbalance between replicas

  • Increased memory consumption for model weights

34 of 103

Disaggregated Serving

34

Partition the “time” dimension

→ Separate instances for prompt processing & token generation

  • Separation of concerns

  • Better control over latency

  • KV cache transfer overheads

  • Lower device utilization

35 of 103

vLLM Supports Mixed Parallelism

35

Data + Expert Parallelism

(e.g., DeepSeek V3)

Tensor + Pipeline Parallelism

(e.g., Llama 3 405B)
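
A rough sketch of how these combinations are expressed through the Python API (argument names per recent vLLM releases; model ids and parallel degrees are illustrative, and the two engines are shown side by side only for contrast):

```python
from vllm import LLM

# Tensor + pipeline parallelism for a large dense model (e.g., Llama 3 405B).
dense = LLM(
    model="meta-llama/Llama-3.1-405B-Instruct",
    tensor_parallel_size=8,
    pipeline_parallel_size=2,
)

# Data + expert parallelism for a large MoE model (e.g., DeepSeek V3).
moe = LLM(
    model="deepseek-ai/DeepSeek-V3",
    data_parallel_size=8,
    enable_expert_parallel=True,
    trust_remote_code=True,
)
```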

36 of 103

Intro to torch.compile and How It Works with vLLM

Richard Zou

Staff Software Engineer, Meta

Luka Govedič

Software Engineer II, Red Hat

37 of 103

What is torch.compile?

37

Just-in-time compiler for PyTorch code

38 of 103

What is torch.compile?

38

Just-in-time compiler for PyTorch code

39 of 103

Why use torch.compile?

39

Our value proposition: fast baseline performance, saving YOU the development time of hand-tuning model performance.

40 of 103

Why use torch.compile?

40

Performance with the Flexibility of PyTorch

41 of 103

How does torch.compile work?

41

  • Our frontend (TorchDynamo) captures graphs via a custom Python bytecode interpreter; it produces graphs plus rewritten bytecode.

42 of 103

How does torch.compile work?

42

Our frontend (TorchDynamo) captures 1+ straightline graphs.

43 of 103

torch.compile optimization highlights

43

  1. Pointwise + reduction fusion
  2. autotuning (e.g. block size selection)

44 of 103

torch.compile optimization highlights

44

3. matmul selection and fusion:

  • Given a matmul (and pointwise epilogue), with mode=“max-autotune”, we will benchmark (1) torch.matmul vs (2) triton matmul vs (3) cutlass matmul (hidden behind a config)

  • In the triton config the pointwise epilogue may be fused onto the matmul.

45 of 103

torch.compile optimization highlights

45

4. CUDAGraphs

  • CUDAGraphs: captures a sequence of kernel launches so that they can be re-launched in the future with low overhead

  • Does not capture CPU Compute, only CUDA kernels

  • It is difficult to use the raw CUDAGraphs API safely; torch.compile has more safety in that it’ll split the graph on known unsupported operators.
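
For reference, the raw capture pattern (standard PyTorch API; the model and shapes are illustrative) shows why it is easy to get wrong: inputs and outputs must live in static buffers and the capture needs a warm-up.

```python
import torch

model = torch.nn.Linear(4096, 4096).cuda()
static_x = torch.randn(16, 4096, device="cuda")

# Warm up on a side stream before capturing.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        static_y = model(static_x)
torch.cuda.current_stream().wait_stream(s)

# Capture the kernel launches into a graph.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_y = model(static_x)

# Replay later: copy new data into the static input buffer and replay.
static_x.copy_(torch.randn(16, 4096, device="cuda"))
g.replay()  # static_y now holds the output for the new input
```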

46 of 103

How does vLLM use torch.compile?

  • torch.compile is on by default with vLLM V1 engine
  • To turn it off, pass --enforce-eager

46

47 of 103

How does vLLM use torch.compile?

Simple mental model for compilation. We’ll see how vLLM customizes it.

47

48 of 103

How does vLLM use torch.compile?

Caching (cold start, warm start)

  • Compilation is guaranteed to finish before request serving
  • If you are serving on multiple machines, copy the cache directory across machines to speed up spin-up time

48

49 of 103

How does vLLM use torch.compile?

dynamic shapes, compile_sizes, and autotuning

  • By default vLLM compiles with dynamic batch size (e.g. the graph can be used for multiple batch sizes)

  • torch.compile is able to tune performance better via specializing on the batch size (especially at batch_size=1)

  • Use compile_sizes={1, 2, 4} to control what batch sizes torch.compile should further specialize on.
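
A sketch of passing this through LLM(compilation_config=...); the sizes are just examples, and the accepted shapes (dict vs. CompilationConfig object) vary slightly across versions:

```python
from vllm import LLM

# Dynamic-shape compilation by default, plus extra specialization
# for batch sizes 1, 2 and 4.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    compilation_config={"compile_sizes": [1, 2, 4]},
)
```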

49

50 of 103

How does vLLM use torch.compile?

50

51 of 103

How does vLLM use torch.compile?

Piecewise CUDAGraphs

  • CUDAGraphs: captures a sequence of kernel launches so that they can be re-launched in the future with low overhead
  • Not all operators are supported (e.g. cascade attention)

51

52 of 103

How does vLLM use torch.compile?

FlexAttention integration (coming soon)

  • FlexAttention is an API that allows the easy implementation of many attention variants (Causal, Alibi, Tanh Soft-Capping) in a few lines of PyTorch code

  • The attention variant gets lowered into a fused FlashAttention kernel via torch.compile.

52

53 of 103

How does vLLM use torch.compile?

FlexAttention examples (in standard PyTorch)
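
A minimal sketch of two variants written with FlexAttention in standard PyTorch (torch >= 2.5; a CUDA device is assumed, and the ALiBi slope and soft-cap value are illustrative):

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

# Compiling flex_attention lowers each score_mod into a fused FlashAttention-style kernel.
flex_attention = torch.compile(flex_attention)

B, H, S, D = 1, 8, 1024, 64
q, k, v = (torch.randn(B, H, S, D, device="cuda", dtype=torch.float16) for _ in range(3))

def alibi(score, b, h, q_idx, kv_idx):
    # ALiBi-style relative position bias (single slope, for illustration).
    return score - 0.5 * (q_idx - kv_idx)

def tanh_soft_cap(score, b, h, q_idx, kv_idx):
    # Tanh soft-capping of attention scores.
    return 30.0 * torch.tanh(score / 30.0)

out_alibi = flex_attention(q, k, v, score_mod=alibi)
out_capped = flex_attention(q, k, v, score_mod=tanh_soft_cap)
```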

53

54 of 103

Custom torch.compile passes in vLLM

54

  1. Why?
  2. How?
  3. Perf?
  4. Status?

55 of 103

Custom torch.compile passes in vLLM: Why?

55

Performance/simplicity tradeoff:

  • vLLM model definitions are declarative and expressive;
    • model writers care about what, not how
  • vLLM users want maximum performance
  • We can achieve both with torch.compile!

Why do we need custom passes?

  • Transformations involving custom kernels
  • Additional optimizations not present in Torch Inductor

56 of 103

LLaMa model capture example

56

  • Model definition can read config, involve control flow, and use abstractions
  • Traced graph can optimize out CPU overhead and fuse operations, producing efficient code
  • What if custom kernels are involved?

57 of 103

SiLU-Mul + Quant fusion

57

  • SiLU-Mul (orange) is an activation function
  • Quantize (green) happens at the start of Linear varieties for a quantized model
  • They cannot be fused in eager mode without breaking abstractions
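
For concreteness, the unfused pattern looks roughly like this in eager PyTorch; a simplified reference with static FP8 scaling, not vLLM's actual kernels:

```python
import torch
import torch.nn.functional as F

def silu_mul(x: torch.Tensor) -> torch.Tensor:
    # Gated activation used in LLaMA-style MLPs: SiLU(gate) * up
    d = x.shape[-1] // 2
    return F.silu(x[..., :d]) * x[..., d:]

def quant_fp8(y: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Static FP8 quantization at the input of the next quantized Linear.
    return (y / scale).clamp(-448.0, 448.0).to(torch.float8_e4m3fn)

# Eager mode runs these as two separate kernels with an intermediate tensor;
# the custom pass rewrites the traced graph to call one fused kernel instead.
x = torch.randn(4, 8192, device="cuda", dtype=torch.bfloat16)
out = quant_fp8(silu_mul(x), scale=torch.tensor(0.02, device="cuda"))
```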

58 of 103

Custom torch.compile passes in vLLM: How?

58

  • Using the Inductor pattern-matcher, we can do match-and-replace on a pattern
  • Pattern and replacement functions traced with fake tensors to produce fx.Graphs

59 of 103

Custom torch.compile passes in vLLM: Perf?

59

  • Setup: 8x MI300, meta-llama/Llama-3.1-405B-Instruct-FP8, TP=8
  • -1% to 8% improvement in throughput, close to the maximum possible (~8%)
  • Torch-based implementations are excellent as of torch==2.7

60 of 103

Custom torch.compile passes in vLLM: Perf?

60

  • Setup: H100, RedHatAI/Meta-Llama-3-70B-Instruct-FP8, TP=1
  • Up to 4% TPOT improvement and 5% TTFT improvement (lower is better)
  • Torch-based implementations of layernorm and silu-mul are even faster than fusion!

61 of 103

Custom torch.compile passes in vLLM

61

Existing passes in vLLM:

  • RMSNorm + Quant fusion (#9138, #10906)
  • SiluMul + Quant fusion (#10867)
  • Sequence Parallelism (#16155 - community contribution)
  • No-op Elimination & Fix Functionalization - complementary graph opts

Coming soon:

  • Attention + Quant fusion (#16756)

Ways to add custom passes:

  • Add to PostGradPassManager
  • -O {'inductor_passes': {'post_grad_custom_post_pass': <pass>}}
  • LLM(compilation_config=...) (same as above)

62 of 103

Demo of Production Deployment of vLLM on AMD

Andy Luo

Director of AI Product Marketing, AMD

Mahdi Ghodsi

AI Solution Architect, AMD

63 of 103

AMD ROCm™ Software, Ecosystem & Performance Optimizations

Andy Luo, Mahdi Ghodsi - AMD AI Group

May 7, 2025

64 of 103

Delivering New Capabilities for Generative AI

65 of 103

65

66 of 103

vLLM on AMD CI Coverage and Plans

66

67 of 103

Typical LLM Inference Optimizations

67

68 of 103

AMD Quark: Unified Model Optimization & Quantization Library

WHAT IS AMD QUARK?

  • AMD's new open-source model optimization library, focusing on AI model quantization.

KEY CAPABILITIES:

  • Supports major AI ecosystems: PyTorch & ONNX.
  • Models compatible with multiple execution stacks: PyTorch, ONNXRuntime, vLLM, HF Transformers, Llama.cpp, etc.
  • Provides user-centric quantization flows (PyTorch Native, ONNX-to-ONNX, Hybrid) to "meet customers where they are".
  • Advanced algorithms and datatype support: QuaRot, SmoothQuant, AWQ, GPTQ, LSQ and others

MATURITY & VALIDATION:

  • Widely used in production deployments by internal and external AMD customers.
  • Proven results: Contributed high-accuracy models for AMD's MLPerf Inference submissions.

68

69 of 103

Quark + vLLM Integration

GOAL: Seamlessly connect Quark's quantization capabilities with the vLLM inference engine, allowing direct loading and execution of Quark-quantized models.

Core Integrations:

  • Native Quark Format Support: Direct support for the Quark quantization format within vLLM. (#10765)
  • Quantization Schemes: W8A8 (INT8), W8A8 (FP8), FP8 KV Cache. (#16236)
  • Advanced algorithms: AutoSmoothQuant (#15861)

WIP Features by Quark Team:

  • OCP MXFP4: Enable vLLM to load and run Quark-quantized OCP MXFP4 models. (#16943)
  • FP8 Attention: Extend the Triton FAv2 kernel to run Quark-quantized models with FP8 attention. (#12591) (jointly with the AMD vLLM team)
  • Rotation Algorithms for Quantization: Introduce "online rotation" capabilities in vLLM to efficiently run advanced 4-bit Quark-quantized models such as QuaRot. (#16443)
  • Enhancements for ROCm: Improve FP8 quantized model support in vLLM on AMD GPUs. (#12612)

69

70 of 103

AMD AI Tensor Engine for ROCm (AITER)

  • High-performance AI operators designed to accelerate various AI workloads
  • AITER serves as a unified platform where customers can easily find and integrate optimized operators into their existing frameworks—be it private, public, or custom-built

70

71 of 103

AITER Solution for Deepseek-R1/V3

71

72 of 103

AITER op Plug-In

72

73 of 103

AITER’s Integration in vLLM for DeepSeek V3/R1

  • Incorporating AITER’s Optimizations
    • 2x performance improvement: 6,484.76 → 13,704.36 tok/s

73

74 of 103

AITER’s Integration in vLLM for DeepSeek V3/R1

VLLM_USE_V1=1

VLLM_ROCM_USE_AITER=1

1. rocm_aiter_ck_moe
2. rocm_aiter_fmoe_fp8_blockscale_g1u1
3. rocm_aiter_asm_moe
4. rocm_aiter_topk_softmax
5. rocm_aiter_shuffle_weight
6. rocm_aiter_asm_moe_tkw1

https://github.com/vllm-project/vllm/pull/16752

https://github.com/vllm-project/vllm/pull/16727

https://github.com/vllm-project/vllm/pull/16674
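
Putting the two environment variables together with an engine launch (a sketch; the model id and parallel degree are illustrative, and the variables must be set before vLLM is imported):

```python
import os

# Enable the V1 engine and the AITER-backed ops listed above.
os.environ["VLLM_USE_V1"] = "1"
os.environ["VLLM_ROCM_USE_AITER"] = "1"

from vllm import LLM

llm = LLM(
    model="deepseek-ai/DeepSeek-R1",   # example model id
    tensor_parallel_size=8,
    trust_remote_code=True,
)
```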

74

75 of 103

AITER Upstream

to vLLM

75

76 of 103

AMD vLLM Roadmap

76

77 of 103

vLLM on ROCm

Latest ROCm developer vLLM Docker image:

77

78 of 103

vLLM Based ROCm AI Tutorials

78

79 of 103

AI Developer Hub for AMD GPUs: Simplified Onboarding with Clear Guidance & Tutorials

Recommended Flows: Docker containers

vLLM | Megatron-LM | PyTorch

Tutorials (Jupyter notebook based)

Inference | Training | Fine-tuning

Performance Results on MI300X

Selected LLM model inference benchmarks

Selected LLM model training benchmarks

Supported open-source projects

Open-source projects popular among AI developers

79

80 of 103

Demo #1: Open WebUI + MCP

80

81 of 103

Demo #2: Browser Use (Web UI) Agent

81

82 of 103

83 of 103


84 of 103

Intro to Mamba SSM Architecture

Thomas Parnell

Principal Research Scientist, IBM

85 of 103

Attention is all you need?

85

  • Attention is incredibly good at language modeling.
  • In vLLM, we use lots of clever methods for efficient implementation on modern GPUs:
    • Paged KV caching, Kernel fusion, Tensor cores, TMA
  • Despite all this, there are still some more “fundamental” issues for inference:
    • KV cache grows linearly with the sequence length and batch size.
    • Arithmetic intensity during decode is low (e.g., ~20 ops/byte for llama3-8b at bs=16)
    • Prefill time (TTFT) is quadratic in the prompt length

Device | TFLOPS (FP8/INT8) | Memory bandwidth (TB/s) | Arithmetic intensity (ops/byte)
A100   | 624               | 1.6                     | 390
H100   | 2000              | 3.2                     | 625
B100   | 3500              | 8                       | 437

86 of 103

Why are long sequences important?

86

[Figure: three long-context serving patterns. RAG: system + docs + user. Agentic: system, user, then repeated generation/tool turns. Thinking/Reasoning: system, user, then generation interleaved with long “think” segments.]

How can we scale language modeling to extremely long sequences without killing performance?

87 of 103

Structured State Space Sequence Models (S4)

87

  • S4 models [Gu, Goel, Ré, 2021] are functions that map sequence x to sequence y through implicit latent state h.
  • Matrix A is structured to enable efficient implementation.
  • Theoretical benefits over attention:
    • Linear in sequence length (T)
    • State size (N) does not grow with T
  • Decoding can be implemented very efficiently using recurrent form but S4 also admits a convolutional representation for efficient prefill + training.
  • However, S4 is not good at:
    • Selective copying
    • In-context reasoning
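
For reference, the discrete-time recurrence described above (sequence x mapped to sequence y through a latent state h of size N, with discretized parameters written with a bar) is commonly stated as:

```latex
\begin{aligned}
h_t &= \bar{A}\, h_{t-1} + \bar{B}\, x_t \\
y_t &= C\, h_t
\end{aligned}
```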

88 of 103

Mamba-1 aka S6 (Gu + Dao, 2023)

88

  • Matrices A, B, C become dependent on the timestep.
    • Model can selectively focus on individual tokens.
    • Huge improvement in selective copying and in-context reasoning.
  • However, the model no longer admits a convolutional form, making training/prefill inefficient.
  • To overcome this, the authors proposed a hardware-efficient method for unrolling the recurrence
    • “Selective Scan” + S4 = S6
  • However, Mamba-1 is still less efficient than attention-based models since it does not make use of matrix multiplication units (e.g., tensor cores).

[Figure: recurrence equations for S4 vs. Mamba-1 (S6); see the sketch below.]
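
A sketch of the contrast the figure draws (notation follows Gu & Dao, 2023; the bar denotes discretization with step size Δ):

```latex
\begin{aligned}
\text{S4 (time-invariant):} \quad & h_t = \bar{A}\, h_{t-1} + \bar{B}\, x_t, \quad y_t = C\, h_t \\
\text{Mamba-1 / S6 (selective):} \quad & h_t = \bar{A}_t\, h_{t-1} + \bar{B}_t\, x_t, \quad y_t = C_t\, h_t
\end{aligned}
```

In S6 the discretized A̅_t, B̅_t and the output matrix C_t depend on the input x_t (through Δ_t, B_t, C_t), while the underlying A stays fixed.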

89 of 103

SSMs are Matrix Transformations

89

Source: T. Dao + A. Gu, “Transformers are SSMs” (2024)

  • SSMs (including both S4 and Mamba-1) can be re-written as a matrix transformation from input sequence X to output sequence Y [Dao + Gu, 2024]
  • This matrix has a special structure (“semiseparable matrix of rank N”) that allows us to perform matrix multiplication efficiently.
  • Exactly how efficiently depends on the structure of the matrices A_t

90 of 103

Mamba-2 (Dao + Gu, 2024)

90

  • In Mamba-2, the authors introduce even more structure to the matrices A_t
  • They use this structure to derive a matrix multiplication-based algorithm that uses O(TN²) FLOPs and O(TN) memory.
  • This algorithm has been implemented in vLLM using Triton kernels and enables:
    • Efficient use of tensor cores.
    • Support for tensor parallelism.
    • Chunked prefill

[Figure: Mamba-1 vs. Mamba-2 computation. Source: T. Dao + A. Gu, “Transformers are SSMs” (2024)]

91 of 103

Mamba-2 is Linear Attention

91

Mamba-2 (Dao + Gu 2024)

Now the matrix transformation can be written:

Since the A_t are just scalars:

Linear Attention (Katharopoulos et al. 2020)

If a_t = 1 for all t, then Mamba-2 is equivalent to Linear Attention! 🤯
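
A sketch of the two expressions referenced above (notation after Dao & Gu, 2024, with scalar decays a_t):

```latex
% Mamba-2 as a lower-triangular matrix transformation Y = M X:
M_{ij} = C_i^{\top} \Big( \prod_{k=j+1}^{i} a_k \Big) B_j \quad (i \ge j), \qquad M_{ij} = 0 \quad (i < j)

% With a_t = 1 for all t, the decay product disappears:
M_{ij} = C_i^{\top} B_j \quad (i \ge j)
```

The second form is exactly causal linear attention, with queries Q_i = C_i and keys K_j = B_j.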

92 of 103

Hybrid Architectures

92

  • The Mamba-2 sequence transformation can be integrated into neural network architectures by forming blocks.
  • From these blocks, one can construct complex hybrid architectures interleaving Mamba-2 layers with other layers such as RMSNorm, Linear + Softmax, MoE, and Attention.
  • Example: Bamba (IBM Research, Princeton, CMU, and UIUC) mixes Mamba-2 and Attention layers in a 9:1 ratio.

Source: T. Dao + A. Gu, “Transformers are SSMs” (2024)

Supported in vLLM:

93 of 103

Model Quality (Bamba-9B-v2)

93

Source: R. Ganti et al., “Bamba-9B-v2 - Fast and powerful!” (2025)

Note: Bamba-9B v2 is trained on 2T tokens; most other models are trained with 10T+ tokens

94 of 103

Performance in vLLM (Throughput)

94

95 of 103

Granite 4.0 Tiny Preview

95

  • IBM Granite 4.0 Tiny Preview is a preliminary version of the smallest model in the upcoming Granite 4.0 family of language models.
  • Like Bamba, it is a hybrid model mixing Mamba-2 and Attention layers in a 9:1 ratio.
  • Unlike Bamba, it uses MoE with 7B total parameters and 64 experts, yielding 1B active parameters at inference time.

Granite 4.0 Tiny is already supported in vLLM (via GraniteMoeHybrid model) 🎉

96 of 103

Limitations and next steps

96

  • Performance issues on serving benchmarks with relatively small prompts (e.g., ShareGPT)
    • May be related to chunked prefill implementation and/or Triton launch overhead.
    • We are working on it!
  • Enabling Mamba-based models for vLLM V1
    • Need to enable prefix caching.
    • Requires some work to properly manage state for complex hybrid models.

Source: C. Zhang et al. - “Jenga: Effective Memory Management for Serving LLM with Heterogeneity” (2025)

97 of 103

Join the Open Source AI Movement

Saša Zelenović

Principal Product Marketing Manager, Red Hat

98 of 103

Contribute to Key vLLM Features via GitHub

  • Comment/review PRs that are interesting to you
  • Join the discussions on RFCs
  • Check out “good first issue” tags
  • Build AI apps with open source models and vLLM - and let us know!

98

99 of 103

Join Bi-Weekly vLLM Office Hours

  • Happening every other Thursday at 2:00PM ET
  • Hear the bi-weekly vLLM update
  • Give feedback & ask questions
  • Deep dive into cutting-edge developments to accelerate your vLLM inference

99

100 of 103

Let Us Know How We Did Today!

Take 1 minute now to complete our 3-question survey.

100

red.ht/nyc-vllm-meetup-survey

101 of 103

Join us at Women in Data Science on May 16!

101

102 of 103

Q&A

103 of 103

Thank You!