1 of 99

Toronto Developer Community Meetup

September 25, 2025

2 of 99

Agenda


Agenda item | Presenter | Time
Doors open and meet the vLLM Team | | 5:00 - 5:30 PM
Intro to vLLM and project update | Lucas Wilkinson | 5:30 - 6:00 PM
Topic #1: Tackling inference at scale | Lucas Wilkinson, Ryan McCormick | 6:00 - 6:30 PM
15 min break | | 6:30 - 6:45 PM
Topic #2: Reducing latency with EAGLE speculative decoding | Benjamin Chislett | 6:45 - 7:05 PM
Topic #3: Enabling SpecDec with any model using Speculators | Dipika Sikka | 7:05 - 7:20 PM
Ways to contribute & closing remarks | Aaron Pham | 7:20 - 7:30 PM
Networking with light refreshments | | 7:30 - 9:00 PM

3 of 99

Event Sponsor

Vector Institute

4 of 99

CentML officially joins NVIDIA

Past major vLLM contributions by CentML:

  • Benjamin Chislett: Fully overlapped scheduler¹
  • Benjamin Chislett: SpecDec + structured outputs²
  • Vadim Gimpelson: Async guided decoding (Open)³
  • Benjamin Chislett: SpecDec, EAGLE1, EAGLE3, and MTP⁴
  • Murali Andoorveedu: Pipeline parallel support⁵

And many more to come!

vLLM PRs: (1) #23569 (2) #14702 (3) #23224 (4) #12915, #13626, #16937 (5) #4412

5 of 99

Event Sponsor

6 of 99

Intro to vLLM and Project Update

Lucas Wilkinson

vLLM Core Committer

Principal Software Engineer, Red Hat


7 of 99

vLLM’s Goal

Build the fastest and easiest-to-use open-source LLM inference & serving engine


8 of 99

What Problem is vLLM Solving?

  • Batch Size > 1 & Data Center Hardware
    • Not the same workload as on-device inference for a single user
  • How do you …
    • Efficiently schedule requests into the next forward pass?
    • Manage KV cache context and runtime memory footprint?


9 of 99

Why Is This A Hard Problem?

  • An LLM is a function that predicts the next token in a sequence
    • P(x_n | x_0, …, x_{n-1})
  • To generate text, we “chain together” passes through the model (see the sketch below)
    • → A single request requires multiple passes through the model
    • → A single generation request can last multiple seconds
  • Key Challenge: How to handle multiple concurrent requests
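A minimal sketch of this chaining, assuming a hypothetical model function that returns next-token logits and a sample helper; real engines batch and cache, but the shape of the loop is the same:

# Hypothetical autoregressive decode loop (illustration only).
# `model` maps a token sequence to next-token logits; `sample` picks one token.
def generate(model, sample, prompt_ids, max_new_tokens, eos_id):
    tokens = list(prompt_ids)
    for _ in range(max_new_tokens):      # one forward pass per generated token
        logits = model(tokens)           # P(x_n | x_0 ... x_{n-1})
        next_id = sample(logits)
        tokens.append(next_id)
        if next_id == eos_id:            # a single request can span many passes
            break
    return tokens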


10 of 99

Challenge 1: Batching


Naive Batching 🙅 vs. Continuous Batching 🙏
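A rough sketch of the idea behind continuous batching, with hypothetical waiting/running queues; the point is that the batch is rebuilt every step instead of waiting for the whole batch to finish:

# Illustration of continuous batching (hypothetical scheduler, not vLLM's API).
def scheduler_loop(model, waiting, max_batch):
    running = []
    while waiting or running:
        # Refill the batch every step; naive/static batching would instead
        # wait for all requests in the current batch to finish.
        while waiting and len(running) < max_batch:
            running.append(waiting.pop(0))
        model.forward_step(running)                          # one token per request
        running = [r for r in running if not r.finished()]   # free slots immediately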

11 of 99

Challenge 2: KV Caching

KV Cache: Caching Key and Value vectors in self-attention saves redundant computation and accelerates decoding


12 of 99

Challenge 2: KV Caching

KV Cache: Caching Key and Value vectors in self-attention saves redundant computation and accelerates decoding - but takes up memory!
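A back-of-the-envelope sketch of why this matters; the layer and head counts below are illustrative rather than tied to any specific model:

# Rough KV-cache size estimate for one request (FP16 cache, made-up model shape).
layers, kv_heads, head_dim = 32, 8, 128
bytes_per_token = 2 * layers * kv_heads * head_dim * 2   # K and V, 2 bytes each
tokens = 32_768                                          # one long-context request
print(f"{bytes_per_token * tokens / 2**30:.1f} GiB")     # ~4 GiB for a single request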


13 of 99

vLLM’s Original Innovation


14 of 99

vLLM’s Original Innovation

PagedAttention + Continuous Batching

[Figure: PagedAttention block tables. Request A ("Alan Turing is a computer scientist and mathematician renowned …") and Request B ("Artificial Intelligence is the future of technology …") each have logical KV blocks; a per-request block table maps them onto non-contiguous physical KV blocks, so GPU memory is allocated in fixed-size pages from a shared pool.]
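A toy sketch of the block-table idea (hypothetical structures, not vLLM internals): each request's logical block indices map to whatever physical blocks happen to be free in a shared pool.

# Toy paged KV-cache allocator illustrating block tables (not vLLM code).
BLOCK_SIZE = 16                           # tokens per KV block

class BlockTable:
    def __init__(self, free_blocks):
        self.free = free_blocks           # shared pool of physical block ids
        self.table = []                   # logical block index -> physical block id

    def append_token(self, num_tokens):
        # A new physical block is allocated only when a logical block fills up.
        if num_tokens % BLOCK_SIZE == 0:
            self.table.append(self.free.pop())
        return self.table

free_pool = list(range(1024))             # physical blocks available on the GPU
req_a, req_b = BlockTable(free_pool), BlockTable(free_pool)
for t in range(40):                       # both requests grow independently while
    req_a.append_token(t)                 # drawing pages from the same pool
    req_b.append_token(t)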

15 of 99

2 Year Journey Of vLLM

vLLM has rapidly evolved from a research project to the open source default

  • Pervasive → 1+ million weekly installs; 58k GitHub stars
  • Explosive Growth → Millions of deployed GPU hours per day
  • Vibrant Community → 1500+ contributors


16 of 99

[Figure: vLLM runs across hardware backends (GPU, Instinct, TPU, Gaudi, Neuron, Spyre), model families (Llama, Qwen, DeepSeek, Gemma, Mistral, Molmo, Phi, Nemotron, Granite), and environments (edge, private cloud: physical and virtual, and public cloud).]

vLLM: The De Facto Open GenAI Inference Platform

vLLM has emerged as the Linux of GenAI Inference

17 of 99

From this base, we have built…


18 of 99

Why vLLM For Performance?


Inference Optimizations: to make your models faster

Distributed Inference: to deploy large models efficiently

19 of 99

Automatic Prefix Caching


Example: Multi-turn conversation

Prompt (round 1)

Human: What's AI?

LLM Result (round 1)

LLM: AI is technology that simulates human intelligence, like Siri or Google Maps.

Prompt (round 2): the round-1 prompt and response below are already cached

Human: What's AI?

LLM: AI is technology that simulates human intelligence, like Siri or Google Maps.

Human: Cool, thanks!

LLM Result (round 2)

LLM: No problem!
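A minimal sketch of turning this on in vLLM's offline API (the model name is just an example); with automatic prefix caching enabled, the shared round-1 prefix is computed once and reused in round 2:

# Minimal prefix-caching sketch (example model; offline LLM API).
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-0.6B", enable_prefix_caching=True)
params = SamplingParams(max_tokens=64)

round1 = "Human: What's AI?\nLLM:"
reply1 = llm.generate([round1], params)[0].outputs[0].text
round2 = round1 + reply1 + "\nHuman: Cool, thanks!\nLLM:"
reply2 = llm.generate([round2], params)[0].outputs[0].text   # round-1 KV reused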

20 of 99

Quantization in vLLM


  1. Weight Quantization
    • Reduced storage & memory footprint
    • e.g., a 100B-parameter model takes 200 GB at BF16 but only 50 GB at INT4 (W4A16)

  2. Activation Quantization
    • Quantized weights and activations (e.g., W8A8 with INT8 or FP8)
    • Faster linear layers by utilizing low-precision tensor cores

  3. KV Cache Quantization
    • Reduced KV cache footprint & faster attention
    • Crucial for long-context workloads
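A small, hedged sketch of applying these schemes at load time with the offline API; the model name is illustrative, and FP8 here stands in for whichever scheme your hardware supports:

# Sketch: quantize weights/activations and the KV cache at load time.
from vllm import LLM

llm = LLM(model="Qwen/Qwen3-8B",
          quantization="fp8",        # W8A8-style on-the-fly FP8 quantization
          kv_cache_dtype="fp8")      # also shrink the KV cache
print(llm.generate("What is quantization?")[0].outputs[0].text)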

21 of 99

Speculative Decoding

Accelerate the decode phase with speculation. There are a variety of methods: n-gram, draft model, EAGLE, etc.
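A hedged sketch of enabling one of these methods (n-gram lookup, which needs no draft model) through vLLM's speculative_config; keys follow the current docs and the model name is just an example:

# Sketch: n-gram speculative decoding via speculative_config.
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen3-8B",
    speculative_config={
        "method": "ngram",              # draft by matching n-grams in the context
        "num_speculative_tokens": 5,    # propose up to 5 tokens per step
        "prompt_lookup_max": 4,         # longest n-gram to match
    },
)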


22 of 99

torch.compile


  • Automatic kernel generation
    • Fast iteration, automatic fusion, performance portability
    • RoPE, RMSNorm, SiluMul, QuantFP8, etc.
  • Applying optimizations independently of model definitions
    • Graph-level transformations (Nanoflow, DBO)
    • Custom op and collective fusion passes
      • Attention + Quant (FP8) (~7% improvement)
      • Attention + Quant (FP4) (~11% improvement)
      • SiLU-Mul + Quant (FP4) (~3% improvement)
      • Sequence Parallelism & Async TP (~10% improvement)
      • AllReduce + RMSNorm + Quant (FP8) (~8% improvement)
      • AllReduce + RMSNorm + Quant (FP4) (~10% improvement)
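A tiny sketch of the kind of pattern these passes target: an RMSNorm followed by FP8 quantization written as plain PyTorch, which torch.compile can turn into a single fused kernel (shapes and eps are illustrative, and a CUDA device is assumed):

import torch

def rmsnorm_quant_fp8(x, weight, scale, eps=1e-6):
    # RMSNorm ...
    x = x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps) * weight
    # ... followed by FP8 quantization; a fusion pass can emit one kernel for both.
    return (x / scale).to(torch.float8_e4m3fn)

fused = torch.compile(rmsnorm_quant_fp8)
out = fused(torch.randn(8, 4096, device="cuda"),
            torch.ones(4096, device="cuda"),
            torch.tensor(1.0, device="cuda"))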

23 of 99

FlashInfer

Nvidia’s SOTA Inference Kernels for LLM Research and Deployment

Upcoming Release Highlights

  • Native integration with vLLM
    • Attention, MoE, MLA, sampling
    • Distributed communication primitives (Allreduce, AlltoAllv)
    • Various fused kernels via torch.compile
  • Improved Blackwell support with extended kernel support for:
    • GB300, B300
    • GeForce RTX Pro 6000
    • Nvidia Jetson Thor
    • Nvidia DGX Spark
  • Available with AOT (faster startup) or JIT (small binary size)


24 of 99

FlashInfer

GitHub First

github.com/flashinfer-ai/flashinfer

  • Please raise feature requests and issues on GitHub

  • Every GitHub issue will be looked at and resolved

  • Every feature request will be prioritized and addressed

  • Get started today and tell us how we are doing!


25 of 99

vLLM Combines All Optimizations Together


[Chart: throughput with vLLM's optimizations combined vs. without optimizations]

26 of 99

vLLM Goes Distributed

Single Device → Single-Node Multi-Device → Multi-Node Multi-Device


In-depth Later

27 of 99

Recent Features


  • V0 Deprecation
    • v0.9.2: last version with V0 intact
    • v0.10: begin removing V0 code
    • v0.11: all V0 code removed
  • Async Scheduling
    • Run the scheduler one step ahead of the model executor (--async-scheduling)
    • Improves TPOT and E2E latency at the cost of longer TTFT
  • CUDA Graph Mode
    • New API --compilation-config '{"cudagraph_mode": "FULL"}' with options:
      • NONE: mixed and decode steps are not captured in a CUDA graph
      • PIECEWISE (current default): mixed and decode are captured in piecewise CUDA graphs
      • FULL: mixed and decode are captured in a full CUDA graph
      • FULL_DECODE_ONLY: mixed is not captured, but decode is captured in a full CUDA graph
      • FULL_AND_PIECEWISE (default in v0.11): mixed is captured piecewise and decode in a full CUDA graph

We are here!
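A hedged sketch of setting the same options through the offline API (mirrors the CLI flags above; exact accepted values can vary by version, and the model name is illustrative):

# Sketch: choose a CUDA-graph capture mode via compilation_config.
from vllm import LLM

llm = LLM(model="Qwen/Qwen3-0.6B",
          compilation_config={"cudagraph_mode": "FULL_AND_PIECEWISE"})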

28 of 99

Q&A


29 of 99

Topic #1: Tackling inference at scale

Lucas Wilkinson, Ryan McCormick


30 of 99

Why Distributed Inference


Graph credit: MIT Han Lab

31 of 99

Ways to distribute a model's weights and KV cache


Distributing Weights

  • Tensor Parallelism
  • Pipeline Parallelism
  • Expert Parallelism

Distributing KV-caches

  • Data Parallel Attention
  • Context Parallel
  • Prefill/Decode Disaggregation (sort-of)

32 of 99

Transform Model


33 of 99

Tensor Parallelism (TP)


[Figure: tensor parallelism: attention is distributed across heads; MLP weights are distributed across the hidden dimension.]
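A toy sketch of the core idea: split a linear layer's weight across two "devices" along the output (hidden) dimension; sizes are made up and the all-gather is just a concat, but the math is the same:

import torch

# Toy tensor parallelism: column-split a weight matrix across 2 ranks.
hidden, ffn = 4096, 11008
x = torch.randn(1, hidden)
W = torch.randn(hidden, ffn)

W0, W1 = W.chunk(2, dim=1)            # each rank holds half the output columns
y0, y1 = x @ W0, x @ W1               # partial results computed in parallel
y = torch.cat([y0, y1], dim=1)        # "all-gather" restores the full output

assert torch.allclose(y, x @ W, atol=1e-4)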

34 of 99

Pipeline Parallelism (PP)


35 of 99

Pipeline Parallelism + Tensor Parallelism


36 of 99

Pipeline Parallelism + Tensor Parallelism Summary


Tensor Parallelism

  • Splits each layer across devices
  • All devices (inside a TP group) work on the same forward pass simultaneously
  • More communication
  • Lower latency if communication is fast

Pipeline Parallelism

  • Splits across layers
  • Better for multi-node deployments
  • Increases latency

37 of 99

The DeepSeek Era


MLA Attention

Mixture of Experts

38 of 99

The DeepSeek Era


[Figure: Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8, a mixture-of-experts model: activated parameters (3B) are a small fraction of total parameters (30B).]

39 of 99

The DeepSeek Era: More, Smaller Experts

  • Mixtral had 8 experts (pick 2 per token)
  • DeepSeek R1 has 256! (pick 8 per token)

40 of 99

Tensor Parallel MoE


Experts are small; they can only be subdivided efficiently so many times

41 of 99

Expert Parallelism


[Figure: expert parallelism: Device 1 and Device 2 each hold full, unsplit experts.]
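A toy sketch of expert-parallel routing: each token's top-k experts decide which device it must be dispatched to (a sparse all-to-all in a real system); everything here is illustrative:

import torch

# Toy MoE dispatch: 8 experts placed whole across 2 "devices", top-2 routing.
num_experts, num_devices, top_k = 8, 2, 2
expert_to_device = torch.arange(num_experts) % num_devices

router_logits = torch.randn(16, num_experts)                 # 16 tokens
topk_experts = router_logits.topk(top_k, dim=-1).indices     # chosen experts per token

for device in range(num_devices):
    needs_device = (expert_to_device[topk_experts] == device).any(dim=-1)
    print(f"device {device} receives {int(needs_device.sum())} of 16 tokens")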

42 of 99

Multi-Head Latent Attention (MLA)


MLA Attention

43 of 99

Tensor Parallelism and MLA


GQA: the KV cache can be sharded along KV heads.

MLA: the latent dimension can't be sharded (the cache must be wastefully replicated across TP ranks).

44 of 99

Expert Parallelism


45 of 99

Data Parallel Attention + Expert Parallelism


[Figure: data-parallel attention + expert parallelism: requests 0-4 go to one DP rank and requests 5-9 to another, so KV caches are parallelized across requests; tokens are dispatched to experts with a sparse all-to-all or sparse all-gather (DS-DeepEP).]

46 of 99

Decode Context Parallel


Context/KV-cache tokens are sharded across devices (distributed round-robin).

Run with `-tp 8 -dcp 8` for an 8× larger KV cache and a 2-3× throughput gain on a single-node H200.

47 of 99

Prefill Decode Disaggregation


Partition the “time” dimension → separate instances for prompt processing & token generation

  • Separation of concerns
  • Better control over latency
  • KV cache transfer overheads
  • Lower device utilization

KV caches are transferred from prefill to decode instances via NIXL.
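A hedged sketch of wiring a vLLM instance to a NIXL-backed KV connector, with keys as in vLLM's disaggregated-prefill examples (details may differ by version; the model name is illustrative):

# Sketch: attach a NIXL-based KV connector for P/D disaggregation.
from vllm import LLM
from vllm.config import KVTransferConfig

llm = LLM(model="Qwen/Qwen3-0.6B",
          kv_transfer_config=KVTransferConfig(kv_connector="NixlConnector",
                                              kv_role="kv_both"))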

48 of 99

NVIDIA Inference Transfer Library (NIXL)

 Low-latency, hardware-agnostic communication 

[Figure: NIXL architecture: applications post transfer requests and receive completions through the NIXL API; the NIXL core manages metadata and memory registration; pluggable backends behind the Backend API (UCX, GDS, S3, custom backends) move data between DRAM, HBM, files, and object storage.]

  • Optimized for inference data movement
  • Consistent API across heterogeneous data paths
  • Supports different types of memory, SSDs, and networked storage


49 of 99

Prefill Decode Disaggregation


50 of 99

Other MoE Optimizations


  • Dual Batch Overlap
  • Expert parallel load-balancer


51 of 99

Summary


Parallelism | Partitions across devices | Best used for
Tensor Parallelism | within layers (hidden / head dim) | single node / low latency
Pipeline Parallelism | layers | multi-node
Data Parallelism | requests | MLA models
Expert Parallelism | experts in MoE | multi-node or highly sparse MoEs
Prefill/Decode Disaggregation | prefill / decode stages | large-scale deployments
Decode Context Parallelism | tokens in the KV cache | latency / very long context (TP > num KV heads)

52 of 99

Backed by industry leaders: founded in collaboration with CoreWeave, Google, IBM Research, NVIDIA, and AMD

53 of 99


Operationalizability

  • Modular and resilient architecture with native integration into Kubernetes via Inference Gateway API

Flexibility

  • Cross-platform with extensible implementations of key composable layers of the stack

Performance

  • Leverage distributed optimizations like prefix-aware routing and disaggregation to achieve the highest throughput while meeting SLOs

54 of 99

Why Distributed LLM Inference?


We can exploit the unique properties of LLM inference to improve perf/$ over naive load balancing

55 of 99

“Well-lit” Paths

  • Intelligent Inference Scheduling
  • P/D Disaggregation
  • Wide Expert Parallelism

See guides: https://github.com/llm-d/llm-d/tree/main/guides


llm-d provides “well-lit” paths for running LLM inference workloads

Questions? Want to know more?

Join the Slack: https://inviter.co/llm-d-slack

56 of 99

“Well-lit” Path: Intelligent Inference Scheduling


vLLM-aware load balancing enables smarter request routing that improves SLOs.

[Figure: two routing strategies in the endpoint picker (EPP). Prefix-aware routing: the EPP maintains a prefix tree, determines "Pod A has a hit" for the incoming prompt, and routes the GET completions request to Pod A, dramatically increasing the prefix-cache hit rate. Load-aware routing: the EPP scrapes /metrics, determines "A has low load", and load-balances based on actual replica state.]

57 of 99

Intelligent Inference Scheduling


Inference scheduling is a no-brainer optimization that can have a huge impact on repeated prompts

58 of 99


“Well-lit” Path: P/D Disaggregation

P/D disaggregation is a key optimization for demanding inference workloads.

[Figure: disaggregated serving: the EPP sees a long prompt, decides to "use disagg", and routes the GET completions request to specialized prefill and decode pods; the KV cache is transferred between them via a NIXL sidecar.]

NIXL supports a variety of heterogeneous transports via UCX: NVLink, InfiniBand, RoCE, ICI, EFA, and TCP.

59 of 99

“Well-lit” Path: Wide Expert Parallelism

llm-d's K8s-native design composes the EP implementation with the rest of the llm-d system.

[Figure: the EPP routes a GET completions request to a LeaderWorkerSet of pods A-H (DP ranks 0-7) running an expert-parallel MLP; this composes with the other scorers, e.g. the P/D disaggregation path.]

60 of 99

NVIDIA Dynamo

Ryan McCormick, NVIDIA

61 of 99

NVIDIA Dynamo

 Systematic approach to AI inference at scale

# P/D Disagg Quickstart with Dynamo + vLLM
uv pip install ai-dynamo[vllm]

# Start Frontend (auto-discovers workers)
python -m dynamo.frontend

# Start Decode Worker(s)
python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B

# Start Prefill Worker(s)
python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --is-prefill-worker

Dynamo's pillars: Scheduling, Data Transfer, Memory Management, Disaggregated Serving

62 of 99

Scheduling

[Figure: the Dynamo Router and Planner route requests across prefill and decode workers using per-worker KV load and KV cache hit-rate signals, plus a prefill queue.]

Routing requests and real-time perf tuning

63 of 99

Data Transfer

NIXL (NVIDIA Inference Xfer Library)

[Figure: NIXL moves KV data point-to-point between HBM, host memory, local SSDs, and network storage.]

  • Low-latency KV transfer
  • Backbone for the KV Block Manager
  • Unified API for both storage and memory access
  • Best-in-class networking performance
  • Dynamically auto-scalable

Move data from point to point with low latency and high bandwidth

64 of 99

Memory Management

KV Block Manager (KVBM)

  • Faster TTFT
  • Super long context (> 100K tokens)

Unlimited use cases:

  • Multi-turn chat
  • Agentic calls (offload & prefetch)
  • Prefill only for generative recommender
  • Separation of knowledge and reasoning

 Leverage all memory and storage available in the data center

65 of 99

NVIDIA Dynamo Community Updates

KVBM - vLLM Integration

# KVBM + vLLM Serve

vllm serve Qwen/Qwen3-8B \
  --kv-transfer-config '{"kv_role": "kv_both", "kv_connector": "DynamoConnector", "kv_connector_module_path": "dynamo.llm.vllm_integration.connector"}'

# KVBM + Dynamo + vLLM

python -m dynamo.vllm --model Qwen/Qwen3-8B --connector kvbm

66 of 99

NVIDIA Dynamo Community Updates

SLA Planner

[Chart: goodput of the SLA Planner vs. the best static configuration]

💡 Goodput = throughput of requests meeting the SLA

67 of 99

NVIDIA Dynamo Community Updates

Smart Router - High Availability, Replica Syncing, and Warm Starts

Ability to sync state between replicas for routing consistency

Ability to warm start replicas from router state snapshots

High Availability with replica support

68 of 99

NVIDIA Dynamo Community Updates

GPT-OSS / Harmony Support

69 of 99

Q&A


Join and follow Dynamo today! github.com/ai-dynamo/dynamo

70 of 99

Break until 7:00 PM


71 of 99

Topic #2: Reducing latency with EAGLE Speculative Decoding

Ben Chislett


72 of 99

Opportunities in low-latency inference

  • At low concurrency, running a larger batch is very cheap


73 of 99

Speculative Decoding Overview

  • Large and small models tend to agree on most tokens


74 of 99

Speculative Decoding Challenges

  • Speeding up one part always reveals another bottleneck
  • Constant costs are killers: we are doing N+1 forward passes for each step!


75 of 99

EAGLE: Custom-made for Drafting

  • EAGLE is a 1-layer LLM designed specifically for efficient drafting


76 of 99

Optimizing EAGLE in vLLM

  • Framework-level overheads are amplified relative to a tiny drafter
    • vLLM recently merged “overlapped execution”, a step towards zero-overhead drafting
    • Fused kernels and optimized logic are coming soon, further reducing drafting overhead
  • Optimized kernels are being integrated for even faster verification
  • Increased parallelism via “tree” drafting is ongoing and partly merged


77 of 99

Future work on EAGLE

  • Lots of ways to build out from EAGLE!
    • PARD for parallel drafting
    • Combining EAGLE with other speculation methods
    • Custom fast attention kernels for verification
    • Multimodal support
    • Extended CUDA graphs to fuse drafting and verification


78 of 99

Q&A


79 of 99

Topic #3

Enabling SpecDec with any model using Speculators

Dipika Sikka | Principal Software Engineer, Red Hat


80 of 99


  • Lossless technique to speed up LLM inference by using a draft model (i.e., the speculator, or “small model”) to propose tokens
  • Boosts performance without sacrificing quality
    • The draft model does the heavy lifting:
      • Intelligently drafts multiple tokens ahead of time
      • Able to predict simple tokens (e.g. “the”, “of”)
    • Every accepted token is guaranteed to match what the base model would have generated independently

Speculative Decoding

81 of 99


  • Draft model speculates K tokens
    • Worst case scenario → generate 1 token (equivalent to vanilla generation)
    • Best case scenario → K + 1 tokens in a single pass

[Figure: drafted tokens “It”, “is”, “orange”, “and” with per-token acceptance probabilities in the 0.7-0.9 range]
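A small illustration of what those acceptance probabilities buy you. If each drafted token is accepted independently with probability a (an idealizing assumption), the expected number of tokens emitted per verification step, including the bonus token, is a short geometric sum; the numbers below are made up:

# Expected tokens per step with K drafted tokens and per-token acceptance rate a.
# Generation stops at the first rejection and the verifier always emits one extra
# token, so E[tokens] = 1 + a + a^2 + ... + a^K = (1 - a**(K + 1)) / (1 - a).
def expected_tokens(a: float, k: int) -> float:
    return (1 - a ** (k + 1)) / (1 - a)

print(expected_tokens(a=0.8, k=4))   # ~3.36 tokens per base-model forward pass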

82 of 99

Introducing Speculators

https://github.com/vllm-project/speculators

  • The speculators repository provides a unified library for building, evaluating, and storing speculative decoding algorithms for LLM inference
    • Create draft models for vLLM deployment
  • Provides a Hugging Face-compatible format
  • Tools to convert from external repositories into a standard speculators format
    • If you have a trained model from another library, you can convert it to the speculators format, which allows it to run seamlessly in vLLM


83 of 99

84 of 99

Introducing Speculators

  • Example trained speculator: RedHatAI/Llama-3.3-70B-Instruct-speculator.eagle3
    • Follows the EAGLE-3 algorithm


85 of 99

RedHatAI/Llama-3.3-70B-Instruct-speculator.eagle3

The checkpoint includes the draft model, which consists of a single decoder layer.


86 of 99

An updated config.json contains metadata about the model and how it should be served in vLLM, i.e., the “speculators format”


Base Model

87 of 99

Run the trained models through vLLM

vllm serve RedHatAI/Llama-3.3-70B-Instruct-speculator.eagle3


88 of 99

Latency Improvements


89 of 99

Latency Improvements


We can combine the power of speculative decoding with quantization to maximize LLM Performance!

90 of 99

Check out existing speculator models


Missing a model? Generate your own!

pip install speculators

Join the discussion on the vLLM Slack: #feat-spec-decode, #sig-quantization, #llm-compressor

91 of 99

Q&A


92 of 99

Ways to contribute and closing remarks

Aaron Pham


93 of 99

Help make future vLLM events better with a 2-min survey!


94 of 99

Journey to contribute to vLLM


2022: Building an LLM serving offering

vLLM stood out: it outperforms Transformers with an additional scheduling loop, and supports both continuous batching and paged attention (https://arxiv.org/abs/2309.06180).

2023: Using vLLM in production

Most of our deployments use vLLM. Pros: easy to use, broad support, hackable. Cons: requires a lot of tuning.

2024: Contributing upstream

Had some frustrations with structured outputs and tool calling, so contributed #10785 (xgrammar support), #12388 (V1 structured outputs), #16577 (thinking + structured), … (hardening).

2025-: Regular committer and contributor

Interests: frontend/tool support + UX, structured outputs, and speculative decoding.

95 of 99

Roadmap

  1. Large MoE support
  2. RL integration
  3. Hardened CI, developer experience
  4. Stability and performance
    1. Async scheduler feature parity
    2. Large Scale MoE Serving
    3. Attn-FFN disaggregation, Hybrid models
    4. Kernels, cold start, etc.
  5. Vertical integrations
    • LlamaStack, prime-rl
  6. roadmap.vllm.ai ← you can add to it here!


96 of 99

Ways to contribute

Try out and contribute to vLLM Recipes


Participate in topics/RFCs

97 of 99

Welcome to the vLLM community!

Contribute to key vLLM features: comment on and review PRs that interest you, join the discussion on RFCs, and check out “good first issue” tags.

Join the vLLM Developer Slack: ask questions and engage with us at slack.vllm.ai

Come to vLLM Office Hours: project updates and special topics biweekly at red.ht/office-hours


$ uv pip install vllm

Thanks to our 1500+ contributors!

98 of 99

Join our [Virtual] Red Hat AI Events

Your Path to Enterprise-Ready AI

Red Hat AI: Day of Learning - October 16, 2025

  • Deep dives from Red Hat and IBM experts on:
    • ⚡ Fast & efficient inference
    • 🎯 Model customization
    • 🤖 Agentic AI
    • 🌐 Scaling AI over hybrid cloud
  • Hands-on insights to accelerate AI adoption
  • Practical guidance to scale AI securely & efficiently

Red Hat AI: What’s New & What’s Next - October 14, 2025

  • Latest advancements in Red Hat AI addressing:
    • 🚀 Cost
    • ⚙️ Complexity
    • 🔒 Control
    • 📈 Scalability
  • Inference across any model, hardware & cloud
  • New features: agents, enterprise data, model optimization


Sign up → red.ht/ai-day-of-learning

Sign up → red.ht/whats-new-whats-next

99 of 99


https://blog.vllm.ai

Building the fastest and easiest-to-use open-source LLM inference & serving engine!

https://twitter.com/vllm_project

https://opencollective.com/vllm

https://slack.vllm.ai