1 of 99

Toronto Developer Community Meetup

September 25, 2025

2 of 99

Agenda


Agenda item | Presenter | Time
Doors open and meet the vLLM Team | | 5:00 - 5:30 PM
Intro to vLLM and project update | Lucas Wilkinson | 5:30 - 6:00 PM
Topic #1: Tackling inference at scale | Lucas Wilkinson, Ryan McCormick | 6:00 - 6:30 PM
15 min break | | 6:30 - 6:45 PM
Topic #2: Reducing latency with EAGLE speculative decoding | Benjamin Chislett | 6:45 - 7:05 PM
Topic #3: Enabling SpecDec with any model using Speculators | Dipika Sikka | 7:05 - 7:20 PM
Ways to contribute & closing remarks | Aaron Pham | 7:20 - 7:30 PM
Networking with light refreshments | | 7:30 - 9:00 PM

3 of 99

Event Sponsor

Vector Institute

4 of 99

CentML officially joins NVIDIA

Past major vLLM contributions by CentML:

  • Benjamin Chislett: Fully overlapped scheduler¹
  • Benjamin Chislett: SpecDec + structured outputs²
  • Vadim Gimpelson: Async guided decoding (Open)³
  • Benjamin Chislett: SpecDec, EAGLE1, EAGLE3, and MTP⁴
  • Murali Andoorveedu: Pipeline parallel support⁵

And many more to come!

vLLM PRs: (1) #23569 (2) #14702 (3) #23224 (4) #12915, #13626, #16937 (5) #4412

5 of 99

Event Sponsor

6 of 99

Intro to vLLM and Project Update

Lucas Wilkinson

vLLM Core Committer

Principal Software Engineer, Red Hat


7 of 99

vLLM’s Goal

Build the fastest and easiest-to-use open-source LLM inference & serving engine


8 of 99

What Problem is vLLM Solving?

  • Batch Size > 1 & Data Center Hardware
    • Not the same workload as on-device inference for a single user
  • How do you …
    • Efficiently schedule requests into the next forward pass?
    • Manage KV cache context and runtime memory footprint?


9 of 99

Why Is This A Hard Problem?

  • An LLM is a function that predicts the next token in a sequence
    • P(x_n | x_0, …, x_{n-1})
  • To generate text, we “chain together” passes through the model (see the sketch below)
    • → A single request requires multiple passes through the model
    • → A single generation request can last multiple seconds
  • Key Challenge: How to handle multiple concurrent requests
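A minimal sketch of this chaining, assuming a hypothetical model function that returns next-token logits and a sample helper; real engines batch and cache, but the shape of the loop is the same:

# Hypothetical autoregressive decode loop (illustration only).
# `model` maps a token sequence to next-token logits; `sample` picks one token.
def generate(model, sample, prompt_ids, max_new_tokens, eos_id):
    tokens = list(prompt_ids)
    for _ in range(max_new_tokens):      # one forward pass per generated token
        logits = model(tokens)           # P(x_n | x_0 ... x_{n-1})
        next_id = sample(logits)
        tokens.append(next_id)
        if next_id == eos_id:            # a single request can span many passes
            break
    return tokens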


10 of 99

Challenge 1: Batching


Naive Batching 🙅 vs. Continuous Batching 🙏
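A rough sketch of the idea behind continuous batching, with hypothetical waiting/running queues; the point is that the batch is rebuilt every step instead of waiting for the whole batch to finish:

# Illustration of continuous batching (hypothetical scheduler, not vLLM's API).
def scheduler_loop(model, waiting, max_batch):
    running = []
    while waiting or running:
        # Refill the batch every step; naive/static batching would instead
        # wait for all requests in the current batch to finish.
        while waiting and len(running) < max_batch:
            running.append(waiting.pop(0))
        model.forward_step(running)                          # one token per request
        running = [r for r in running if not r.finished()]   # free slots immediately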

11 of 99

Challenge 2: KV Caching

KV Cache: Caching Key and Value vectors in self-attention saves redundant computation and accelerates decoding


12 of 99

Challenge 2: KV Caching

KV Cache: Caching Key and Value vectors in self-attention saves redundant computation and accelerates decoding - but takes up memory!
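A back-of-the-envelope sketch of why this matters; the layer and head counts below are illustrative rather than tied to any specific model:

# Rough KV-cache size estimate for one request (FP16 cache, made-up model shape).
layers, kv_heads, head_dim = 32, 8, 128
bytes_per_token = 2 * layers * kv_heads * head_dim * 2   # K and V, 2 bytes each
tokens = 32_768                                          # one long-context request
print(f"{bytes_per_token * tokens / 2**30:.1f} GiB")     # ~4 GiB for a single request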


13 of 99

vLLM’s Original Innovation


14 of 99

vLLM’s Original Innovation

PagedAttention + Continuous Batching

[Figure: PagedAttention block tables. Request A ("Alan Turing is a computer scientist and mathematician renowned …") and Request B ("Artificial Intelligence is the future of technology …") each have logical KV blocks; a per-request block table maps them onto non-contiguous physical KV blocks, so GPU memory is allocated in fixed-size pages from a shared pool.]
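A toy sketch of the block-table idea (hypothetical structures, not vLLM internals): each request's logical block indices map to whatever physical blocks happen to be free in a shared pool.

# Toy paged KV-cache allocator illustrating block tables (not vLLM code).
BLOCK_SIZE = 16                           # tokens per KV block

class BlockTable:
    def __init__(self, free_blocks):
        self.free = free_blocks           # shared pool of physical block ids
        self.table = []                   # logical block index -> physical block id

    def append_token(self, num_tokens):
        # A new physical block is allocated only when a logical block fills up.
        if num_tokens % BLOCK_SIZE == 0:
            self.table.append(self.free.pop())
        return self.table

free_pool = list(range(1024))             # physical blocks available on the GPU
req_a, req_b = BlockTable(free_pool), BlockTable(free_pool)
for t in range(40):                       # both requests grow independently while
    req_a.append_token(t)                 # drawing pages from the same pool
    req_b.append_token(t)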

15 of 99

2 Year Journey Of vLLM

vLLM has rapidly evolved from a research project to the open source default

  • Pervasive → 1+ million weekly installs; 58k GitHub stars
  • Explosive Growth → Millions of deployed GPU hours per day
  • Vibrant Community → 1500+ contributors


16 of 99

[Figure: vLLM runs across hardware backends (GPU, Instinct, TPU, Gaudi, Neuron, Spyre), model families (Llama, Qwen, DeepSeek, Gemma, Mistral, Molmo, Phi, Nemotron, Granite), and environments (edge, private cloud: physical and virtual, and public cloud).]

vLLM: The De Facto Open GenAI Inference Platform

vLLM has emerged as the Linux of GenAI Inference

17 of 99

From this base, we have built…


18 of 99

Why vLLM For Performance?


Inference Optimizations: to make your models faster

Distributed Inference: to deploy large models efficiently

19 of 99

Automatic Prefix Caching


Example: Multi-turn conversation

Prompt (round 1)

Human: What's AI?

LLM Result (round 1)

LLM: AI is technology that simulates human intelligence, like Siri or Google Maps.

Prompt (round 2): the round-1 prompt and response below are already cached

Human: What's AI?

LLM: AI is technology that simulates human intelligence, like Siri or Google Maps.

Human: Cool, thanks!

LLM Result (round 2)

LLM: No problem!
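A minimal sketch of turning this on in vLLM's offline API (the model name is just an example); with automatic prefix caching enabled, the shared round-1 prefix is computed once and reused in round 2:

# Minimal prefix-caching sketch (example model; offline LLM API).
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-0.6B", enable_prefix_caching=True)
params = SamplingParams(max_tokens=64)

round1 = "Human: What's AI?\nLLM:"
reply1 = llm.generate([round1], params)[0].outputs[0].text
round2 = round1 + reply1 + "\nHuman: Cool, thanks!\nLLM:"
reply2 = llm.generate([round2], params)[0].outputs[0].text   # round-1 KV reused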

20 of 99

Quantization in vLLM


  1. Weight Quantization
    • Reduced storage & memory footprint
    • e.g., a 100B-parameter model takes 200 GB at BF16 but only 50 GB at INT4 (W4A16)

  2. Activation Quantization
    • Quantized weights and activations (e.g., W8A8 with INT8 or FP8)
    • Faster linear layers by utilizing low-precision tensor cores

  3. KV Cache Quantization
    • Reduced KV cache footprint & faster attention
    • Crucial for long-context workloads
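A small, hedged sketch of applying these schemes at load time with the offline API; the model name is illustrative, and FP8 here stands in for whichever scheme your hardware supports:

# Sketch: quantize weights/activations and the KV cache at load time.
from vllm import LLM

llm = LLM(model="Qwen/Qwen3-8B",
          quantization="fp8",        # W8A8-style on-the-fly FP8 quantization
          kv_cache_dtype="fp8")      # also shrink the KV cache
print(llm.generate("What is quantization?")[0].outputs[0].text)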

21 of 99

Speculative Decoding

Accelerate the decode phase with speculation. There are a variety of methods: n-gram, draft model, EAGLE, etc.
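A hedged sketch of enabling one of these methods (n-gram lookup, which needs no draft model) through vLLM's speculative_config; keys follow the current docs and the model name is just an example:

# Sketch: n-gram speculative decoding via speculative_config.
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen3-8B",
    speculative_config={
        "method": "ngram",              # draft by matching n-grams in the context
        "num_speculative_tokens": 5,    # propose up to 5 tokens per step
        "prompt_lookup_max": 4,         # longest n-gram to match
    },
)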


22 of 99

torch.compile


  • Automatic kernel generation
    • Fast iteration, automatic fusion, performance portability
    • RoPE, RMSNorm, SiluMul, QuantFP8, etc.
  • Applying optimizations independently of model definitions
    • Graph-level transformations (Nanoflow, DBO)
    • Custom op and collective fusion passes
      • Attention + Quant (FP8) (~7% improvement)
      • Attention + Quant (FP4) (~11% improvement)
      • SiLU-Mul + Quant (FP4) (~3% improvement)
      • Sequence Parallelism & Async TP (~10% improvement)
      • AllReduce + RMSNorm + Quant (FP8) (~8% improvement)
      • AllReduce + RMSNorm + Quant (FP4) (~10% improvement)
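A tiny sketch of the kind of pattern these passes target: an RMSNorm followed by FP8 quantization written as plain PyTorch, which torch.compile can turn into a single fused kernel (shapes and eps are illustrative, and a CUDA device is assumed):

import torch

def rmsnorm_quant_fp8(x, weight, scale, eps=1e-6):
    # RMSNorm ...
    x = x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps) * weight
    # ... followed by FP8 quantization; a fusion pass can emit one kernel for both.
    return (x / scale).to(torch.float8_e4m3fn)

fused = torch.compile(rmsnorm_quant_fp8)
out = fused(torch.randn(8, 4096, device="cuda"),
            torch.ones(4096, device="cuda"),
            torch.tensor(1.0, device="cuda"))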

23 of 99

FlashInfer

Nvidia’s SOTA Inference Kernels for LLM Research and Deployment

Upcoming Release Highlights

  • Native integration with vLLM
    • Attention, MoE, MLA, sampling
    • Distributed communication primitives (Allreduce, AlltoAllv)
    • Various fused kernels via torch.compile
  • Improved Blackwell support with extended kernel support for:
    • GB300, B300
    • GeForce RTX Pro 6000
    • Nvidia Jetson Thor
    • Nvidia DGX Spark
  • Available with AOT (faster startup) or JIT (small binary size)


24 of 99

FlashInfer

GitHub First

github.com/flashinfer-ai/flashinfer

  • Please raise feature requests and issues on GitHub

  • Every GitHub issue will be looked at and resolved

  • Every feature request will be prioritized and addressed

  • Get started today and tell us how we are doing!


25 of 99

vLLM Combines All Optimizations Together


[Chart: throughput with vLLM's optimizations combined vs. without optimizations]

26 of 99

vLLM Goes Distributed

Single Device → Single-Node Multi-Device → Multi-Node Multi-Device


In-depth Later

27 of 99

Recent Features


  • V0 Deprecation
    • v0.9.2: last version with V0 intact
    • v0.10: begin removing V0 code
    • v0.11: all V0 code removed
  • Async Scheduling
    • Run the scheduler one step ahead of the model executor (--async-scheduling)
    • Improves TPOT and E2E latency at the cost of longer TTFT
  • CUDA Graph Mode
    • New API --compilation-config '{"cudagraph_mode": "FULL"}' with options:
      • NONE: mixed and decode steps are not captured in a CUDA graph
      • PIECEWISE (current default): mixed and decode are captured in piecewise CUDA graphs
      • FULL: mixed and decode are captured in a full CUDA graph
      • FULL_DECODE_ONLY: mixed is not captured, but decode is captured in a full CUDA graph
      • FULL_AND_PIECEWISE (default in v0.11): mixed is captured piecewise and decode in a full CUDA graph

We are here!
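A hedged sketch of setting the same options through the offline API (mirrors the CLI flags above; exact accepted values can vary by version, and the model name is illustrative):

# Sketch: choose a CUDA-graph capture mode via compilation_config.
from vllm import LLM

llm = LLM(model="Qwen/Qwen3-0.6B",
          compilation_config={"cudagraph_mode": "FULL_AND_PIECEWISE"})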

28 of 99

Q&A


29 of 99

Topic #1: Tackling inference at scale

Lucas Wilkinson, Ryan McCormick


30 of 99

Why Distributed Inference


Graph credit: MIT Han Lab

31 of 99

Ways to distribute a model's weights and KV cache


Distributing Weights

  • Tensor Parallelism
  • Pipeline Parallelism
  • Expert Parallelism

Distributing KV-caches

  • Data Parallel Attention
  • Context Parallel
  • Prefill/Decode Disaggregation (sort-of)

32 of 99

Transform Model


33 of 99

Tensor Parallelism (TP)


[Figure: tensor parallelism: attention is distributed across heads; MLP weights are distributed across the hidden dimension.]
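A toy sketch of the core idea: split a linear layer's weight across two "devices" along the output (hidden) dimension; sizes are made up and the all-gather is just a concat, but the math is the same:

import torch

# Toy tensor parallelism: column-split a weight matrix across 2 ranks.
hidden, ffn = 4096, 11008
x = torch.randn(1, hidden)
W = torch.randn(hidden, ffn)

W0, W1 = W.chunk(2, dim=1)            # each rank holds half the output columns
y0, y1 = x @ W0, x @ W1               # partial results computed in parallel
y = torch.cat([y0, y1], dim=1)        # "all-gather" restores the full output

assert torch.allclose(y, x @ W, atol=1e-4)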

34 of 99

Pipeline Parallelism (PP)


35 of 99

Pipeline Parallelism + Tensor Parallelism


36 of 99

Pipeline Parallelism + Tensor Parallelism Summary


Tensor Parallelism

  • Splits each layer across devices
  • All devices (inside a TP group) work on the same forward pass simultaneously
  • More communication
  • Lower latency if communication is fast

Pipeline Parallelism

  • Splits across layers
  • Better for multi-node deployments
  • Increases latency

37 of 99

The DeepSeek Era


MLA Attention

Mixture of Experts

38 of 99

The DeepSeek Era


[Figure: Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8, a mixture-of-experts model: activated parameters (3B) are a small fraction of total parameters (30B).]

39 of 99

The DeepSeek Era: More, Smaller Experts

  • Mixtral had 8 experts (pick 2 per token)
  • DeepSeek R1 has 256! (pick 8 per token)

40 of 99

Tensor Parallel MoE


Experts are small; they can only be subdivided efficiently so many times

41 of 99

Expert Parallelism


[Figure: expert parallelism: Device 1 and Device 2 each hold full, unsplit experts.]
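A toy sketch of expert-parallel routing: each token's top-k experts decide which device it must be dispatched to (a sparse all-to-all in a real system); everything here is illustrative:

import torch

# Toy MoE dispatch: 8 experts placed whole across 2 "devices", top-2 routing.
num_experts, num_devices, top_k = 8, 2, 2
expert_to_device = torch.arange(num_experts) % num_devices

router_logits = torch.randn(16, num_experts)                 # 16 tokens
topk_experts = router_logits.topk(top_k, dim=-1).indices     # chosen experts per token

for device in range(num_devices):
    needs_device = (expert_to_device[topk_experts] == device).any(dim=-1)
    print(f"device {device} receives {int(needs_device.sum())} of 16 tokens")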

42 of 99

Multi-Head Latent Attention (MLA)


MLA Attention

43 of 99

Tensor Parallelism and MLA


GQA: the KV cache can be sharded along KV heads.

MLA: the latent dimension can't be sharded (the cache must be wastefully replicated across TP ranks).

44 of 99

Expert Parallelism


45 of 99

Data Parallel Attention + Expert Parallelism


[Figure: data-parallel attention + expert parallelism: requests 0-4 go to one DP rank and requests 5-9 to another, so KV caches are parallelized across requests; tokens are dispatched to experts with a sparse all-to-all or sparse all-gather (DS-DeepEP).]

46 of 99

Decode Context Parallel


Context/KV-cache tokens are sharded across devices (distributed round-robin).

Run with `-tp 8 -dcp 8` for an 8× larger KV cache and a 2-3× throughput gain on a single-node H200.

47 of 99

Prefill Decode Disaggregation


Partition the “time” dimension → separate instances for prompt processing & token generation

  • Separation of concerns
  • Better control over latency
  • KV cache transfer overheads
  • Lower device utilization

KV caches are transferred from prefill to decode instances via NIXL.
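A hedged sketch of wiring a vLLM instance to a NIXL-backed KV connector, with keys as in vLLM's disaggregated-prefill examples (details may differ by version; the model name is illustrative):

# Sketch: attach a NIXL-based KV connector for P/D disaggregation.
from vllm import LLM
from vllm.config import KVTransferConfig

llm = LLM(model="Qwen/Qwen3-0.6B",
          kv_transfer_config=KVTransferConfig(kv_connector="NixlConnector",
                                              kv_role="kv_both"))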

48 of 99

NVIDIA Inference Transfer Library (NIXL)

 Low-latency, hardware-agnostic communication 

[Figure: NIXL architecture: applications post transfer requests and receive completions through the NIXL API; the NIXL core manages metadata and memory registration; pluggable backends behind the Backend API (UCX, GDS, S3, custom backends) move data between DRAM, HBM, files, and object storage.]

  • Optimized for inference data movement
  • Consistent API across heterogeneous data paths
  • Supports different types of memory, SSDs, and networked storage


49 of 99

Prefill Decode Disaggregation


50 of 99

Other MoE Optimizations


  • Dual Batch Overlap
  • Expert parallel load-balancer


51 of 99

Summary


Parallelism | Partitions across devices | Best used for
Tensor Parallelism | within layers (hidden / head dim) | single node / low latency
Pipeline Parallelism | layers | multi-node
Data Parallelism | requests | MLA models
Expert Parallelism | experts in MoE | multi-node or highly sparse MoEs
Prefill/Decode Disaggregation | prefill / decode stages | large-scale deployments
Decode Context Parallelism | tokens in the KV cache | latency / very long context (TP > num KV heads)

52 of 99

Backed by industry leaders: founded in collaboration with CoreWeave, Google, IBM Research, NVIDIA, and AMD

53 of 99


Operationalizability

  • Modular and resilient architecture with native integration into Kubernetes via Inference Gateway API

Flexibility

  • Cross-platform with extensible implementations of key composable layers of the stack

Performance

  • Leverage distributed optimizations like prefix-aware routing and disaggregation to achieve the highest throughput while meeting SLOs

54 of 99

Why Distributed LLM Inference?


We can exploit the unique properties of LLM inference to improve perf/$ over naive load balancing

55 of 99

“Well-lit” Paths

  • Intelligent Inference Scheduling
  • P/D Disaggregation
  • Wide Expert Parallelism

See guides: https://github.com/llm-d/llm-d/tree/main/guides


llm-d provides “well-lit” paths for running LLM inference workloads

Questions? Want to know more?

Join the Slack: https://inviter.co/llm-d-slack

56 of 99

“Well-lit” Path: Intelligent Inference Scheduling


vLLM-aware load balancing enables smarter request routing that improves SLOs.

[Figure: two routing strategies in the endpoint picker (EPP). Prefix-aware routing: the EPP maintains a prefix tree, determines "Pod A has a hit" for the incoming prompt, and routes the GET completions request to Pod A, dramatically increasing the prefix-cache hit rate. Load-aware routing: the EPP scrapes /metrics, determines "A has low load", and load-balances based on actual replica state.]

57 of 99

Intelligent Inference Scheduling


Inference scheduling is a no-brainer optimization that can have a huge impact on repeated prompts

58 of 99


“Well-lit” Path: P/D Disaggregation

P/D disaggregation is a key optimization for demanding inference workloads.

[Figure: disaggregated serving: the EPP sees a long prompt, decides to "use disagg", and routes the GET completions request to specialized prefill and decode pods; the KV cache is transferred between them via a NIXL sidecar.]

NIXL supports a variety of heterogeneous transports via UCX: NVLink, InfiniBand, RoCE, ICI, EFA, and TCP.

59 of 99

“Well-lit” Path: Wide Expert Parallelism

llm-d's K8s-native design composes the EP implementation with the rest of the llm-d system.

[Figure: the EPP routes a GET completions request to a LeaderWorkerSet of pods A-H (DP ranks 0-7) running an expert-parallel MLP; this composes with the other scorers, e.g. the P/D disaggregation path.]

60 of 99

NVIDIA Dynamo

Ryan McCormick, NVIDIA

61 of 99

NVIDIA Dynamo

 Systematic approach to AI inference at scale

# P/D Disagg Quickstart with Dynamo + vLLM
uv pip install ai-dynamo[vllm]

# Start Frontend (auto-discovers workers)
python -m dynamo.frontend

# Start Decode Worker(s)
python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B

# Start Prefill Worker(s)
python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --is-prefill-worker

Dynamo's pillars: Scheduling, Data Transfer, Memory Management, Disaggregated Serving

62 of 99

Scheduling

[Figure: the Dynamo Router and Planner route requests across prefill and decode workers using per-worker KV load and KV cache hit-rate signals, plus a prefill queue.]

Routing requests and real-time perf tuning

63 of 99

Data Transfer

NIXL (NVIDIA Inference Xfer Library)

[Figure: NIXL moves KV data point-to-point between HBM, host memory, local SSDs, and network storage.]

  • Low-latency KV transfer
  • Backbone for the KV Block Manager
  • Unified API for both storage and memory access
  • Best-in-class networking performance
  • Dynamically auto-scalable

Move data from point to point with low latency and high bandwidth

64 of 99

Memory Management

KV Block Manager (KVBM)

  • Faster TTFT
  • Super long context (> 100K tokens)

Unlimited use cases:

  • Multi-turn chat
  • Agentic calls (offload & prefetch)
  • Prefill only for generative recommender
  • Separation of knowledge and reasoning

 Leverage all memory and storage available in the data center

65 of 99

NVIDIA Dynamo Community Updates

KVBM - vLLM Integration

# KVBM + vLLM Serve

vllm serve Qwen/Qwen3-8B \
  --kv-transfer-config '{"kv_role": "kv_both", "kv_connector": "DynamoConnector", "kv_connector_module_path": "dynamo.llm.vllm_integration.connector"}'

# KVBM + Dynamo + vLLM

python -m dynamo.vllm --model Qwen/Qwen3-8B --connector kvbm

66 of 99

NVIDIA Dynamo Community Updates

SLA Planner

[Chart: goodput of the SLA Planner vs. the best static configuration]

💡 Goodput = throughput of requests meeting the SLA

67 of 99

NVIDIA Dynamo Community Updates

Smart Router - High Availability, Replica Syncing, and Warm Starts

Ability to sync state between replicas for routing consistency

Ability to warm start replicas from router state snapshots

High Availability with replica support

68 of 99

NVIDIA Dynamo Community Updates

GPT-OSS / Harmony Support

69 of 99

Q&A


Join and follow Dynamo today! github.com/ai-dynamo/dynamo

70 of 99

Break until 7:00 PM


71 of 99

Topic #2: Reducing latency with EAGLE Speculative Decoding

Ben Chislett


72 of 99

Opportunities in low-latency inference

  • At low concurrency, running a larger batch is very cheap


73 of 99

Speculative Decoding Overview

  • Large and small models tend to agree on most tokens


74 of 99

Speculative Decoding Challenges

  • Speeding up one part always reveals another bottleneck
  • Constant costs are killers: we are doing N+1 forward passes for each step!


75 of 99

EAGLE: Custom-made for Drafting

  • EAGLE is a 1-layer LLM designed specifically for efficient drafting


76 of 99

Optimizing EAGLE in vLLM

  • Framework-level overheads are amplified relative to a tiny drafter
    • vLLM recently merged “overlapped execution”, a step towards zero-overhead drafting
    • Fused kernels and optimized logic are coming soon, further reducing drafting overhead
  • Optimized kernels are being integrated for even faster verification
  • Increased parallelism via “tree” drafting is ongoing and partly merged


77 of 99

Future work on EAGLE

  • Lots of ways to build out from EAGLE!
    • PARD for parallel drafting
    • Combining EAGLE with other speculation methods
    • Custom fast attention kernels for verification
    • Multimodal support
    • Extended CUDA graphs to fuse drafting and verification


78 of 99

Q&A


79 of 99

Topic #3

Enabling SpecDec with any model using Speculators

Dipika Sikka | Principal Software Engineer, Red Hat


80 of 99


  • Lossless technique to speed up LLM inference by using a draft model (i.e., the speculator, or “small model”) to propose tokens
  • Boosts performance without sacrificing quality
    • The draft model does the heavy lifting:
      • Intelligently drafts multiple tokens ahead of time
      • Able to predict simple tokens (e.g. “the”, “of”)
    • Every accepted token is guaranteed to match what the base model would have generated independently

Speculative Decoding

81 of 99


  • Draft model speculates K tokens
    • Worst case scenario → generate 1 token (equivalent to vanilla generation)
    • Best case scenario → K + 1 tokens in a single pass

[Figure: drafted tokens “It”, “is”, “orange”, “and” with per-token acceptance probabilities in the 0.7-0.9 range]
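A small illustration of what those acceptance probabilities buy you. If each drafted token is accepted independently with probability a (an idealizing assumption), the expected number of tokens emitted per verification step, including the bonus token, is a short geometric sum; the numbers below are made up:

# Expected tokens per step with K drafted tokens and per-token acceptance rate a.
# Generation stops at the first rejection and the verifier always emits one extra
# token, so E[tokens] = 1 + a + a^2 + ... + a^K = (1 - a**(K + 1)) / (1 - a).
def expected_tokens(a: float, k: int) -> float:
    return (1 - a ** (k + 1)) / (1 - a)

print(expected_tokens(a=0.8, k=4))   # ~3.36 tokens per base-model forward pass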

82 of 99

Introducing Speculators

https://github.com/vllm-project/speculators

  • The speculators repository provides a unified library for building, evaluating, and storing speculative decoding algorithms for LLM inference
    • Create draft models for vLLM deployment
  • Provides a Hugging Face-compatible format
  • Tools to convert from external repositories into a standard speculators format
    • If you have a trained model from another library, you can convert it to the speculators format, which allows it to run seamlessly in vLLM


83 of 99

84 of 99

Introducing Speculators

  • Example trained speculator: RedHatAI/Llama-3.3-70B-Instruct-speculator.eagle3
    • Follows the EAGLE-3 algorithm


85 of 99

RedHatAI/Llama-3.3-70B-Instruct-speculator.eagle3

The checkpoint includes the draft model, which consists of a single decoder layer.


86 of 99

An updated config.json contains metadata about the model and how it should be served in vLLM, i.e., the “speculators format”


Base Model

87 of 99

Run the trained models through vLLM

vllm serve RedHatAI/Llama-3.3-70B-Instruct-speculator.eagle3


88 of 99

Latency Improvements


89 of 99

Latency Improvements


We can combine the power of speculative decoding with quantization to maximize LLM Performance!

90 of 99

Check out existing speculator models


Missing a model? Generate your own!

pip install speculators

Join the discussion on the vLLM Slack: #feat-spec-decode, #sig-quantization, #llm-compressor

91 of 99

Q&A


92 of 99

Ways to contribute and closing remarks

Aaron Pham


93 of 99

Help make future vLLM events better with a 2-min survey!


94 of 99

Journey to contribute to vLLM


2022: Building an LLM serving offering

vLLM stood out: it outperforms Transformers with an additional scheduling loop, and supports both continuous batching and paged attention (https://arxiv.org/abs/2309.06180).

2023: Using vLLM in production

Most of our deployments use vLLM. Pros: easy to use, broad support, hackable. Cons: requires a lot of tuning.

2024: Contributing upstream

Had some frustrations with structured outputs and tool calling, so contributed #10785 (xgrammar support), #12388 (V1 structured outputs), #16577 (thinking + structured), … (hardening).

2025-: Regular committer and contributor

Interests: frontend/tool support + UX, structured outputs, and speculative decoding.

95 of 99

Roadmap

  1. Large MoE support
  2. RL integration
  3. Hardened CI, developer experience
  4. Stability and performance
    1. Async scheduler feature parity
    2. Large Scale MoE Serving
    3. Attn-FFN disaggregation, Hybrid models
    4. Kernels, cold start, etc.
  5. Vertical integrations
    • LlamaStack, prime-rl
  6. roadmap.vllm.ai ← you can add to it here!


96 of 99

Ways to contribute

Try out and contribute to vLLM Recipes


Participate in topics/RFCs

97 of 99

Welcome to the vLLM community!

Contribute to key vLLM features: comment on and review PRs that interest you, join the discussion on RFCs, and check out “good first issue” tags.

Join the vLLM Developer Slack: ask questions and engage with us at slack.vllm.ai

Come to vLLM Office Hours: project updates and special topics biweekly at red.ht/office-hours


$ uv pip install vllm

Thanks to our 1500+ contributors!

98 of 99

Join our [Virtual] Red Hat AI Events

Your Path to Enterprise-Ready AI

Red Hat AI: Day of Learning - October 16, 2025

  • Deep dives from Red Hat and IBM experts on:
    • ⚡ Fast & efficient inference
    • 🎯 Model customization
    • 🤖 Agentic AI
    • 🌐 Scaling AI over hybrid cloud
  • Hands-on insights to accelerate AI adoption
  • Practical guidance to scale AI securely & efficiently

Red Hat AI: What’s New & What’s Next - October 14, 2025

  • Latest advancements in Red Hat AI addressing:
    • 🚀 Cost
    • ⚙️ Complexity
    • 🔒 Control
    • 📈 Scalability
  • Inference across any model, hardware & cloud
  • New features: agents, enterprise data, model optimization


Sign up → red.ht/ai-day-of-learning

Sign up → red.ht/whats-new-whats-next

99 of 99


https://blog.vllm.ai

Building the fastest and easiest-to-use open-source LLM inference & serving engine!

https://twitter.com/vllm_project

https://opencollective.com/vllm

https://slack.vllm.ai