Toronto Developer Community Meetup
September 25, 2025
Agenda
Agenda item | Presenter | Time |
Doors open and meet the vLLM Team | | 5:00 - 5:30 PM |
Intro to vLLM and project update | Lucas Wilkinson | 5:30 - 6:00 PM |
Topic #1: Tackling inference at scale | Lucas Wilkinson, Ryan McCormick | 6:00 - 6:30 PM |
15 min break | | 6:30 - 6:45 PM |
Topic #2: Reducing latency with EAGLE speculative decoding | Benjamin Chislett | 6:45 - 7:05 PM |
Topic #3: Enabling SpecDec with any model using Speculators | Dipika Sikka | 7:05 - 7:20 PM |
Ways to contribute & closing remarks | Aaron Pham | 7:20 - 7:30 PM |
Networking with light refreshments | | 7:30 - 9:00 PM |
Event Sponsor
Vector Institute
CentML officially joins NVIDIA
Past vLLM major contributions by CentML:
And many more to come!
Event Sponsor
Intro to vLLM and Project Update
Lucas Wilkinson
vLLM Core Committer
Principal Software Engineer, Red Hat
vLLM’s Goal
Build the fastest and easiest-to-use open-source LLM inference & serving engine
What Problem is vLLM Solving?
Why Is This A Hard Problem?
Challenge 1: Batching
Naive Batching 🙅🙅🙅
Continuous Batching 🙏🙏🙏
…
Challenge 2: KV Caching
KV Cache: Caching Key and Value vectors in self-attention saves redundant computation and accelerates decoding
Challenge 2: KV Caching
KV Cache: Caching Key and Value vectors in self-attention saves redundant computation and accelerates decoding - but takes up memory!
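To see why this matters, here is a minimal PyTorch sketch (not vLLM code; all names are illustrative) of decode-time KV caching: each step appends one row of K and V instead of recomputing the whole prefix, so compute per token stays flat while cache memory grows linearly with context length.

# Sketch: KV caching during decode (single head, no batching)
import torch

d, n_ctx = 64, 16                      # head dim, tokens to decode
W_k = torch.randn(d, d)
W_v = torch.randn(d, d)

k_cache = torch.zeros(0, d)            # grows by one row per decoded token
v_cache = torch.zeros(0, d)

def decode_step(x_new: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
    """Append K/V for the newest token instead of recomputing the whole prefix."""
    global k_cache, v_cache
    k_cache = torch.cat([k_cache, x_new @ W_k])        # (t, d)
    v_cache = torch.cat([v_cache, x_new @ W_v])        # (t, d)
    attn = torch.softmax(q @ k_cache.T / d**0.5, dim=-1)
    return attn @ v_cache                               # output for the new token only

for _ in range(n_ctx):
    x = torch.randn(1, d)
    out = decode_step(x, q=x)          # memory, not compute, becomes the bottleneck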
vLLM’s Original Innovation
vLLM’s Original Innovation
Paged Attention + Continuous Batching
Diagram: PagedAttention block tables. Request A's logical KV blocks ("Alan Turing is a | computer scientist and | mathematician renowned …") and Request B's logical KV blocks ("Artificial Intelligence is the | future of technology") are mapped through per-request block tables onto non-contiguous physical KV blocks shared across requests.
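A minimal sketch (not vLLM internals; names and the block size are illustrative) of the block-table indirection the diagram shows: each request's logical blocks map onto physical blocks drawn from a shared pool, so KV memory is allocated on demand, non-contiguously, and without fragmentation.

# Sketch: logical-to-physical KV block mapping
BLOCK_SIZE = 4                                  # tokens per KV block

physical_blocks = {}                            # block_id -> tokens cached in that block
free_blocks = list(range(8))                    # pool shared by all requests
block_tables = {"A": [], "B": []}               # logical index -> physical block id

def append_token(req: str, token: str) -> None:
    table = block_tables[req]
    if not table or len(physical_blocks[table[-1]]) == BLOCK_SIZE:
        table.append(free_blocks.pop(0))        # allocate a new physical block on demand
        physical_blocks[table[-1]] = []
    physical_blocks[table[-1]].append(token)

for tok in "Alan Turing is a computer scientist".split():
    append_token("A", tok)
for tok in "Artificial Intelligence is the future".split():
    append_token("B", tok)

print(block_tables)   # {'A': [0, 1], 'B': [2, 3]} -- non-contiguous, no fragmentation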
2-Year Journey of vLLM
vLLM has rapidly evolved from a research project to the open source default
Neuron
TPU
Gaudi
Instinct
GPU
Llama
Qwen
DeepSeek
Gemma
Mistral
Molmo
Phi
Nemotron
Granite
Spyre
Edge
Private Cloud
Physical
Virtual
Public Cloud
vLLM: The De Facto Open GenAI Inference Platform
vLLM has emerged as the Linux of GenAI Inference
From this base, we have built…
Why vLLM For Performance?
Inference Optimizations: to make your models faster
Distributed Inference: to deploy large models efficiently
Automatic Prefix Caching
Example: Multi-turn conversation
Prompt (round 1)
Human: What's AI?
LLM Result (round 1)
LLM: AI is technology that simulates human intelligence, like Siri or Google Maps.
Prompt (round 2)
Human: What's AI?
LLM: AI is technology that simulates human intelligence, like Siri or Google Maps.
Human: Cool, thanks!
LLM Result (round 2)
LLM: No problem!
Cached
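A hedged sketch of enabling this with vLLM's offline API: `enable_prefix_caching` is a real engine argument, but defaults and flag names can differ between vLLM versions, and the model id is just an example.

# Sketch: automatic prefix caching across conversation turns
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-0.6B", enable_prefix_caching=True)
params = SamplingParams(max_tokens=64)

history = "Human: What's AI?\nLLM: AI is technology that simulates human intelligence.\n"
# Round 2 reuses the KV blocks already computed for the shared prefix above.
outputs = llm.generate([history + "Human: Cool, thanks!\nLLM:"], params)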
Quantization in vLLM
BF16 (baseline)
W4A16: INT4 weights, 16-bit activations
W8A8: INT8 weights and activations
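A hedged sketch of loading a pre-quantized checkpoint with vLLM; the model id is hypothetical, and vLLM typically infers the quantization scheme from the checkpoint's config rather than needing an explicit flag.

# Sketch: serving a W4A16 checkpoint (hypothetical model id)
from vllm import LLM

llm = LLM(model="RedHatAI/Llama-3.1-8B-Instruct-quantized.w4a16")
print(llm.generate(["Quantization reduces memory by"])[0].outputs[0].text)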
Speculative Decoding
Accelerate the decode phase with speculation; a variety of methods: n-gram, draft model, EAGLE, etc. (sketch below)
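A hedged sketch of n-gram speculative decoding via the offline API; the `speculative_config` schema has changed across vLLM versions, so treat the keys as illustrative, and the model id is just an example.

# Sketch: n-gram (prompt-lookup) speculative decoding
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    speculative_config={
        "method": "ngram",             # other methods: draft model, EAGLE, etc.
        "num_speculative_tokens": 4,   # tokens proposed per verification step
        "prompt_lookup_max": 4,
    },
)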
torch.compile
FlashInfer
Nvidia’s SOTA Inference Kernels for LLM Research and Deployment
Upcoming Release Highlights
FlashInfer
GitHub First
github.com/flashinfer-ai/flashinfer
vLLM Combines All Optimizations Together
Without Optimizations
vLLM Goes Distributed
Single Device → Single-Node Multi-Device → Multi-Node Multi-Device
In-depth Later
Recent Features
We are here!
Q&A
Topic #1: Tackling inference at scale
Lucas Wilkinson, Ryan McCormick
Why Distributed Inference
Graph credit: MIT Han Lab
Ways to distribute a model's weights and KV cache
Distributing Weights
Distributing KV-caches
Transform Model
Tensor Parallelism (TP)
Distributed Across Heads
Distributed Across Hidden Dim
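A minimal sketch (plain PyTorch, not vLLM code) of the idea: split a weight matrix column-wise across ranks, compute partial outputs independently, then gather; the result matches the unsharded matmul.

# Sketch: column-wise tensor parallelism of a linear layer
import torch

hidden, world_size = 1024, 4
x = torch.randn(2, hidden)                      # activations, replicated on every rank
W = torch.randn(hidden, hidden)

shards = torch.chunk(W, world_size, dim=1)      # each rank holds hidden/world_size columns
partials = [x @ w for w in shards]              # computed independently per device
y = torch.cat(partials, dim=1)                  # an all-gather in a real multi-GPU setup

assert torch.allclose(y, x @ W, atol=1e-3)      # same result as the unsharded matmul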
Pipeline Parallelism (PP)
Pipeline Parallelism + Tensor Parallelism
Pipeline Parallelism + Tensor Parallelism Summary
Tensor Parallelism
Pipeline Parallelism
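A hedged sketch of combining both with vLLM's offline API; the model id is illustrative, and tensor_parallel_size × pipeline_parallel_size must match the number of available GPUs.

# Sketch: TP within a node, PP across nodes
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    tensor_parallel_size=4,      # shard each layer across GPUs in a node (low latency)
    pipeline_parallel_size=2,    # partition the layer stack across nodes
)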
The DeepSeek Era
MLA Attention
Mixture of Experts
The DeepSeek Era
Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8
Mixture of Experts
Activated Params
Total Params
The DeepSeek Era: Smaller Experts
DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models - Jan 2024
More, Smaller Experts
(pick 2)
(pick 8)
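A minimal sketch (not DeepSeek code) of top-k expert routing: the router scores every expert, but only the selected few run per token, which is why activated parameters stay far below total parameters.

# Sketch: top-k MoE routing
import torch

tokens, hidden, n_experts, top_k = 4, 64, 8, 2
x = torch.randn(tokens, hidden)
router = torch.nn.Linear(hidden, n_experts)
experts = torch.nn.ModuleList([torch.nn.Linear(hidden, hidden) for _ in range(n_experts)])

scores = torch.softmax(router(x), dim=-1)
weights, idx = scores.topk(top_k, dim=-1)                 # (tokens, top_k)

out = torch.zeros_like(x)
for t in range(tokens):
    for w, e in zip(weights[t], idx[t]):
        out[t] += w * experts[int(e)](x[t])               # only top_k of n_experts run per token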
Tensor Parallel MoE
Experts are small; they can only be subdivided efficiently so many times
Expert Parallelism
Device 1
Device 2
Full Experts on each device
Multi-head Latent Attention (MLA)
MLA Attention
Tensor Parallelism and MLA
Can be sharded along KV heads
GQA
MLA
Can’t shard the latent dimension
(Cache must be wastefully replicated across TP ranks)
Expert Parallelism
Data Parallel Attention + Expert Parallelism
Sparse A2A
Sparse AG
(DS-DeepEP)
Reqs 0-4
Reqs 5-9
KV-caches parallelized across requests
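A hedged sketch of enabling this combination in vLLM; `data_parallel_size` and `enable_expert_parallel` are real engine arguments, but availability and defaults depend on the vLLM version and hardware, and the model id is taken from the earlier slide as an example.

# Sketch: data-parallel attention + expert parallelism
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8",
    data_parallel_size=8,         # each DP rank owns its own requests and KV cache
    enable_expert_parallel=True,  # experts are sharded across ranks instead of replicated
)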
Decode Context Parallel
Context/KV-cache tokens are sharded across devices (distributed round robin)
Run with: `-tp 8 -dcp 8`
8× larger KV cache
2–3× throughput gain on single-node H200
Prefill Decode Disaggregation
Partition the “time” dimension
→ Separate instances for prompt processing & token generation
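A hedged sketch of pointing a vLLM instance at the NIXL KV connector for disaggregation; the config fields follow vLLM's KV-transfer interface, but exact names and the deployment topology (separate prefill and decode instances) vary by version.

# Sketch: a prefill instance using the NIXL KV connector
from vllm import LLM
from vllm.config import KVTransferConfig

prefill = LLM(
    model="Qwen/Qwen3-0.6B",
    kv_transfer_config=KVTransferConfig(kv_connector="NixlConnector", kv_role="kv_both"),
)
# A separate decode instance configured with the same connector receives the KV blocks.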
NIXL
NVIDIA Inference Transfer Library (NIXL)
Low-latency, hardware-agnostic communication
Diagram: NIXL architecture. The NIXL API handles metadata, memory registration, and request post/completion; the backend API plugs in transports and storage backends (UCX, GDS, S3, custom) spanning DRAM, HBM, files, and object storage.
Prefill Decode Disaggregation
Other MoE Optimizations
Summary
Parallelism | Partitions across devices | Best used for |
Tensor Parallelism | within layers (hidden/head-dim) | Single Node / Low Latency |
Pipeline Parallelism | layers | Multi Node |
Data Parallelism | requests | MLA Models |
Expert Parallelism | experts in MoE | Multi Node or Highly Sparse MoEs |
Prefill Decode Disaggregation | prefill / decode stages | Large scale deployments |
Decode Context Parallelism | tokens in the KV-cache | Latency / Very long context (TP > Num-KV-Heads) |
Backed by industry leaders: founded in collaboration with CoreWeave, Google, IBM Research, NVIDIA, and AMD
Operationalizability
Flexibility
Performance
Why Distributed LLM Inference?
We can exploit the unique properties of LLM inference to improve perf/$ over naive load balancing
“Well-lit” Paths
See guides: https://github.com/llm-d/llm-d/tree/main/guides
llm-d provides “well-lit” paths for running LLM inference workloads
Questions? Want to know more?
Join the Slack: https://inviter.co/llm-d-slack
“Well-lit” Path: Intelligent Inference Scheduling
vLLM-aware load-balancing enables smarter request routing that improves SLOs
Diagram: Prefix-Aware Routing — the EPP maintains a prefix tree, detects "Pod A has a hit" for the incoming prompt, and routes the completions request to Pod A, dramatically increasing the prefix-cache hit rate. Load-Aware Routing — the EPP scrapes /metrics from Pods A–C, sees "A has low load", and load-balances based on actual replica state.
Intelligent Inference Scheduling
Inference scheduling is a no-brainer optimization that can have a huge impact on repeated prompts
Disaggregated Serving
Diagram: the EPP sees a long prompt and decides to use disaggregation; Prefill Pod B processes the prompt and transfers the KV cache over NIXL (via a sidecar) to Decode Pod A, specializing pods for the prefill and decode phases.
“Well-lit” Path: P/D Disaggregation
P/D disaggregation is a key optimization for demanding inference workloads
NIXL supports a variety of transports via UCX: heterogeneous transfer protocols including NVLink, InfiniBand, RoCE, ICI, EFA, and TCP, layered across the API, transport, and networking levels used for P/D disaggregation.
LeaderWorkerSet
“Well-lit” Path: Wide Expert Parallelism
llm-d’s K8s-native design composes the EP implementation with the rest of the llm-d system
Diagram: the EPP routes an incoming completions request across Pods A–H (DP ranks 0–7), which jointly execute the expert-parallel MLP; the EP-aware scoring composes with other scorers.
NVIDIA Dynamo
Ryan McCormick, NVIDIA
NVIDIA Dynamo
Systematic approach to AI inference at scale
# P/D Disagg Quickstart with Dynamo + vLLM
uv pip install "ai-dynamo[vllm]"

# Start Frontend (auto-discovers workers)
python -m dynamo.frontend

# Start Decode Worker(s)
python -m dynamo.vllm \
  --model Qwen/Qwen3-0.6B

# Start Prefill Worker(s)
python -m dynamo.vllm \
  --model Qwen/Qwen3-0.6B \
  --is-prefill-worker
Dynamo building blocks: Scheduling, Data Transfer, Memory Management, Disaggregated Serving.

Scheduling
Diagram: a Router and Planner direct requests to prefill and decode workers based on per-worker KV load, KV hit rate, and prefill queue depth — routing requests and real-time performance tuning.
Data Transfer
NIXL (NVIDIA Inference Transfer Library): low-latency KV transfer across HBM, host memory, local SSD, and network storage; the backbone for KV Block Manager storage and memory access. Move data from point to point with low latency and high bandwidth.
Memory Management
KV Block Manager (KVBM): leverage all memory and storage available in the data center; unlimited use cases.
NVIDIA Dynamo Community Updates
KVBM - vLLM Integration
# KVBM + vLLM Serve
vllm serve Qwen/Qwen3-8B \
  --kv-transfer-config '{"kv_role": "kv_both",
    "kv_connector": "DynamoConnector",
    "kv_connector_module_path": "dynamo.llm.vllm_integration.connector"}'

# KVBM + Dynamo + vLLM
python -m dynamo.vllm \
  --model Qwen/Qwen3-8B \
  --connector kvbm
For more details on Dynamo and KV caching: https://developer.nvidia.com/blog/how-to-reduce-kv-cache-bottlenecks-with-nvidia-dynamo/
NVIDIA Dynamo Community Updates
SLA Planner
Best Static
SLA Planner
💡Goodput = Throughput of requests meeting SLA
NVIDIA Dynamo Community Updates
Smart Router - High Availability, Replica Syncing, and Warm Starts
Ability to sync state between replicas for routing consistency
Ability to warm start replicas from router state snapshots
High Availability with replica support
NVIDIA Dynamo Community Updates
GPT-OSS / Harmony Support
Achieving 1.5M+ tok/s on GPT-OSS on GB200 NVL72 with NVIDIA Dynamo.
Q&A
Join and follow Dynamo today!
github.com/ai-dynamo/dynamo
Break until 7:00 PM
Topic #2: Reducing latency with EAGLE Speculative Decoding
Ben Chislett
Opportunities in low-latency inference
Speculative Decoding Overview
Speculative Decoding Challenges
EAGLE: Custom-made for Drafting
Optimizing EAGLE in vLLM
Future work on EAGLE
Q&A
Topic #3
Enabling SpecDec with any model using Speculators
Dipika Sikka | Principal Software Engineer, Red Hat
Speculative Decoding
Diagram: draft tokens "It", "is", "orange", "and" with draft probabilities 0.7, 0.9, 0.8 and target probabilities 0.9, 0.9, 0.7, 0.8 compared during verification.
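A minimal sketch (not vLLM code) of the standard accept/reject rule used during verification, using the probabilities from the diagram above as example values; rejected tokens are resampled from the target distribution, so outputs still match the target model.

# Sketch: accept/reject rule for draft tokens
import random

def accept_draft_token(p_target: float, p_draft: float) -> bool:
    # Accept with probability min(1, p_target / p_draft)
    return random.random() < min(1.0, p_target / p_draft)

draft_probs  = [0.7, 0.9, 0.8]        # "It", "is", "orange" as proposed by the draft
target_probs = [0.9, 0.9, 0.7]        # the same tokens scored by the target model
accepted = [accept_draft_token(t, d) for t, d in zip(target_probs, draft_probs)]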
Introducing Speculators
https://github.com/vllm-project/speculators
RedHatAI/Llama-3.3-70B-Instruct-speculator.eagle3
Checkpoint includes the draft model, which consists of a single decoder layer
Updated config.json, containing metadata about the model and how it should be served in vLLM, i.e., the “speculators format”
Base Model
Run the trained models through vLLM
vllm serve RedHatAI/Llama-3.3-70B-Instruct-speculator.eagle3
Latency Improvements
We can combine the power of speculative decoding with quantization to maximize LLM performance!
Check out existing speculator models
Missing a model? Generate your own!
pip install speculators
#feat-spec-decode
#sig-quantization
#llm-compressor
Q&A
Ways to contribute and closing remarks
Aaron Pham
Help make future vLLM events better with a 2-min survey!
Journey to contribute to vLLM
2022
Building an LLM serving offering
vLLM stood out: it outperforms transformers with an additional scheduling loop, and supports both continuous batching and PagedAttention (https://arxiv.org/abs/2309.06180)

2023
Using vLLM in production
Most of our deployments use vLLM.
Pros: easy to use, broad support, hackable
Cons: requires a lot of tuning

2025-
Regular committer and contributor
Interests: frontend/tool support + UX, structured outputs, and speculative decoding
Roadmap
Ways to contribute
Try out and contribute to vLLM Recipes
Contribute to “Good first issues”
Participate in topics/RFCs
Welcome to the vLLM community!
Contribute to key vLLM features: comment and review PRs that are interesting to you, join the discussion on RFCs, and check out “good first issue” tags.
Join vLLM Developer Slack: ask questions and engage with us via Slack at slack.vllm.ai
Come to vLLM Office Hours: project updates and special topics biweekly at red.ht/office-hours
$ uv pip install vllm
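A hedged “hello world” after installation, using vLLM's offline API; the model id is just an example.

# Sketch: first generation after installing vLLM
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-0.6B")   # any HF-compatible model id works here
params = SamplingParams(temperature=0.8, max_tokens=32)
for out in llm.generate(["The capital of Canada is"], params):
    print(out.outputs[0].text)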
Thanks to the 1,500+ contributors!
Join our [Virtual] Red Hat AI Events
Your Path to Enterprise-Ready AI
Red Hat AI: Day of Learning
October 16, 2025
Sign up → red.ht/ai-day-of-learning

Red Hat AI: What’s New & What’s Next
October 14, 2025
Sign up → red.ht/whats-new-whats-next
https://blog.vllm.ai
Building the fastest and easiest-to-use open-source LLM inference & serving engine!
https://twitter.com/vllm_project
https://opencollective.com/vllm
https://slack.vllm.ai