
1 of 45

meetup

Apr 2026

2 of 45

docs.vllm.ai/projects/vllm-omni/en/latest

Wednesday 19:30 PDT

Stage

3 of 45

Our Goal

Build the fastest and

easiest-to-use open-source

Omni-Modality model inference & serving engine

4 of 45

Omni-Modality models

Omni-modality: Text, image, video, and audio data processing

Non-autoregressive Architectures: extend the AR support of vLLM to Diffusion Transformers (DiT) and other parallel generation models

Heterogeneous outputs: from traditional text generation to multimodal outputs

5 of 45

Broad Model Support

vLLM-Omni supports 40+ popular omni and diffusion model architectures (growing rapidly)

Qwen-Omni

Qwen-Image

BAGEL

Mistral

Wan

Ovis-Image

LongCat

SD3

Flux

Image/3D

StepFun

GLM

MiMo

…plus many more

6 of 45

Contributors

…

Thanks to all the contributors who raised issues, participated in discussions, and submitted PRs!

7 of 45

vLLM-Omni GitHub Repo

https://github.com/vllm-project/vllm-omni

$ uv pip install vllm==0.20.0 --torch-backend=auto

$ uv pip install vllm-omni

4700+ Stars

Official release

8 of 45

vLLM-Omni System Walkthrough

vLLM-Omni Team

9 of 45

Goal of the walkthrough

1. Understand how vLLM-Omni processes a multi-modal request and generates its multi-modal outputs.

2. Learn where to modify if you would like to make a specific modification/contribution.

10 of 45

Multi-modality models

Backbone: AR + DiT

Models: Qwen-Image/GLM-Image

Tasks: t2i, t2v, i2i...

Backbone: AR + Spec. Gen.

Models: BAGEL, Hunyuan Image 3.0

Tasks: t2i, i2i, i2t...

Backbone: (multi) AR + DiT

Models: Qwen-Omni/Ming-Omni

Tasks: any-to-any

Yin, Peiqi, et al. "vLLM-Omni: Fully Disaggregated Serving for Any-to-Any Multimodal Models." arXiv preprint arXiv:2602.02204 (2026).

11 of 45

Multi-modality models: AR/DiT comparison

	AR	DiT
Use cases	Text generation	Multi-modality generation
Generation process	Token by token, KV-cache based	Iterative diffusion steps
Bottleneck	Prefill: compute bound; decode: memory bound	Compute bound
Sequence length	Varied	Fixed
Attention mask	Causal mask	Full mask
Parallelism	TP/DP/EP/PP/CP/SP	TP/EP/USP/CFG
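The "Attention mask" row can be illustrated with a small sketch. This is a toy in plain Python, not vLLM-Omni code; the helper names are made up for illustration. It shows that a causal mask lets position i see only positions up to i, while a full mask lets every position see every other.

```python
def causal_mask(seq_len: int) -> list[list[bool]]:
    """mask[i][j] is True when query position i may attend to key position j."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def full_mask(seq_len: int) -> list[list[bool]]:
    """Every query position may attend to every key position."""
    return [[True] * seq_len for _ in range(seq_len)]


def visible_pairs(mask: list[list[bool]]) -> int:
    """Count the (query, key) pairs that the mask allows."""
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    for row in causal_mask(4):
        print("".join("x" if allowed else "." for allowed in row))
    print(visible_pairs(causal_mask(4)), visible_pairs(full_mask(4)))
```

For a sequence of length n, the causal mask allows n(n+1)/2 pairs and the full mask allows n² pairs.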

12 of 45

Main architecture of vLLM-Omni

imported

modified

new

vLLM-Omni

APIServer

Omni/AsyncOmni

Model/Layer/Ops

OmniConnector（E/P/D/G）

LLMEngine

Executor

ModelRunner

Worker

Cache Engine

Scheduler

Diffusion

Worker

ModelRunner

/Pipeline

Scheduler

DiffusionEngine

Main component

Entrypoints: offline/online serving, StageClient abstraction for model stages (AR/DiT). One runtime, multiple interfaces.
AR module: inherited from vLLM (CB/PA/Prefix Cache…) and adapted to the Omni-modality model
Diffusion module: implemented natively and optimized by acceleration components
Model/Layer/Ops: parallelism, quantization, attention…
OmniConnector: natively supports E/P/D/G disaggregation

EntryPoints

Orchestrator

13 of 45

Interface Design

AsyncOmniEngine

engine/async_omni_engine.py

AsyncOmni

vllm_omni/entrypoints/async_omni.py

Synchronous

Asynchronous

Developer Interface: vllm/engine

User custom server

def add_request()

def abort_requests()

def step()

async def generate()

async def abort()

+ background engine loop
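The split between the synchronous methods (add_request, abort_requests, step) and the asynchronous ones (generate, abort, plus a background engine loop) can be sketched as follows. This is a toy under stated assumptions: ToySyncEngine and ToyAsyncEngine are hypothetical classes written for illustration, only the method names come from the slide, and the "model" simply upper-cases the prompt.

```python
import asyncio
from collections import deque


class ToySyncEngine:
    """Synchronous style: the caller drives progress by calling step()."""

    def __init__(self) -> None:
        self._waiting = deque()

    def add_request(self, request_id: str, prompt: str) -> None:
        self._waiting.append((request_id, prompt))

    def abort_requests(self, request_ids: set) -> None:
        self._waiting = deque(r for r in self._waiting if r[0] not in request_ids)

    def step(self) -> list:
        """Process one queued request and return the outputs finished this step."""
        if not self._waiting:
            return []
        request_id, prompt = self._waiting.popleft()
        return [(request_id, prompt.upper())]  # stand-in for model execution


class ToyAsyncEngine:
    """Asynchronous style: a background loop calls step() and resolves futures."""

    def __init__(self, engine: ToySyncEngine) -> None:
        self._engine = engine
        self._futures = {}
        self._loop_task = None

    async def generate(self, request_id: str, prompt: str) -> str:
        future = asyncio.get_running_loop().create_future()
        self._futures[request_id] = future
        self._engine.add_request(request_id, prompt)
        if self._loop_task is None or self._loop_task.done():
            self._loop_task = asyncio.create_task(self._run_engine_loop())
        return await future

    async def abort(self, request_id: str) -> None:
        self._engine.abort_requests({request_id})
        future = self._futures.pop(request_id, None)
        if future is not None and not future.done():
            future.cancel()

    async def _run_engine_loop(self) -> None:
        while self._futures:
            for request_id, output in self._engine.step():
                future = self._futures.pop(request_id, None)
                if future is not None and not future.done():
                    future.set_result(output)
            await asyncio.sleep(0)  # yield so new requests can be accepted


async def demo() -> list:
    engine = ToyAsyncEngine(ToySyncEngine())
    return await asyncio.gather(
        engine.generate("a", "hello"), engine.generate("b", "omni")
    )


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Callers of the async interface never call step() themselves; they only await generate() while the background loop makes progress.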

Omni

entrypoints/omni.py

openai_api_server

entrypoints/openai/

api_server.py

End-user Interface: vllm_omni/entrypoints

Batched inference

OpenAI-compatible API server

StageEngineCoreClient

StageDiffusionClient

Orchestrator

StageEngineCoreProc

StageDiffusionProc

14 of 45

When requests arrive

Example: Qwen3-Omni

AR-decoder(Thinker)

Visual

Encoder

Text

Tokenizer

Code2Wav

AR-decoder(Talker)

Audio

Encoder

StageClient

EngineCoreProc

StageClient

EngineCore

EngineCoreProc

15 of 45

Two-level Config system for Qwen3-Omni

OmniStage_0:

Thinker

OmniStage_1:

Talker

OmniStage_2:

Code2wav

- stage_id: 0
  gpu_memory_utilization: 0.9
  devices: "0"
  default_sampling_params:
    temperature: 0.4
    top_p: 0.9
    top_k: 1
    max_tokens: 2048
    seed: 42
    repetition_penalty: 1.05

- stage_id: 1
  gpu_memory_utilization: 0.6
  devices: "1"
  input_connectors:
    from_stage_0: connector_of_shared_memory
  default_sampling_params:
    temperature: 0.9
    top_k: 50
    max_tokens: 4096
    seed: 42
    repetition_penalty: 1.05

- stage_id: 2
  gpu_memory_utilization: 0.1
  max_num_seqs: 1
  enforce_eager: true
  async_scheduling: false
  max_num_batched_tokens: 51200
  devices: "1"
  input_connectors:
    from_stage_1: connector_of_shared_memory
  default_sampling_params:
    temperature: 0.0
    top_p: 1.0
    top_k: -1
    max_tokens: 65536
    seed: 42
    repetition_penalty: 1.1

https://github.com/vllm-project/vllm-omni/blob/main/vllm_omni/deploy/qwen3_omni_moe.yaml

Deploy Config: For setting of different deployment tasks, exposed to users
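As a rough illustration of what the deploy config encodes, the sketch below takes a few values from the slide and derives the stage-to-stage connections and the memory fraction requested per device. The dict layout and both helper functions are assumptions made for this example; they are not the vLLM-Omni config loader.

```python
import re
from collections import defaultdict

# Values copied from the slide; the Python dict layout is an assumption.
DEPLOY_STAGES = [
    {"stage_id": 0, "devices": "0", "gpu_memory_utilization": 0.9},
    {"stage_id": 1, "devices": "1", "gpu_memory_utilization": 0.6,
     "input_connectors": {"from_stage_0": "connector_of_shared_memory"}},
    {"stage_id": 2, "devices": "1", "gpu_memory_utilization": 0.1,
     "input_connectors": {"from_stage_1": "connector_of_shared_memory"}},
]


def stage_edges(stages: list) -> list:
    """Return (source_stage, target_stage, connector) for every input connector."""
    edges = []
    for stage in stages:
        for key, connector in stage.get("input_connectors", {}).items():
            match = re.fullmatch(r"from_stage_(\d+)", key)
            if match is None:
                raise ValueError(f"unexpected connector key: {key}")
            edges.append((int(match.group(1)), stage["stage_id"], connector))
    return edges


def memory_per_device(stages: list) -> dict:
    """Sum the requested GPU memory fraction for each device string."""
    totals = defaultdict(float)
    for stage in stages:
        totals[stage["devices"]] += stage["gpu_memory_utilization"]
    return dict(totals)


if __name__ == "__main__":
    print(stage_edges(DEPLOY_STAGES))
    print(memory_per_device(DEPLOY_STAGES))
```

With the slide's values, stages 1 and 2 share device "1" and together request about 0.7 of its memory, while stage 0 has device "0" to itself.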

PR #2383

16 of 45

Two-level Config system for Qwen3-Omni

https://github.com/vllm-project/vllm-omni/blob/main/vllm_omni/model_executor/models/qwen3_omni/pipeline.py

Pipeline Config: For organizing data and computing pipeline of certain model, exposed to developers

StagePipelineConfig(
    stage_id=0,
    model_stage="thinker",
    execution_type=StageExecutionType.LLM_AR,
    input_sources=(),
    final_output=True,
    final_output_type="text",
    owns_tokenizer=True,
    requires_multimodal_data=True,
    hf_config_name="thinker_config",
    engine_output_type="latent",
    …)

OmniStage_0:

Thinker

OmniStage_1:

Talker

OmniStage_2:

Code2wav

StagePipelineConfig(
    stage_id=1,
    model_stage="talker",
    execution_type=StageExecutionType.LLM_AR,
    input_sources=(0,),
    hf_config_name="talker_config",
    engine_output_type="latent",
    custom_process_input_func=f"{_PROC}.thinker2talker",
    custom_process_next_stage_input_func=f"{_PROC}.talker2code2wav_async_chunk",
    sampling_constraints={
        "detokenize": False,
        "stop_token_ids": [2150],
    },
)

StagePipelineConfig(
    stage_id=2,
    model_stage="code2wav",
    execution_type=StageExecutionType.LLM_GENERATION,
    input_sources=(1,),
    final_output=True,
    final_output_type="audio",
    hf_config_name="thinker_config",
    engine_output_type="audio",
    custom_process_input_func=f"{_PROC}.talker2code2wav",
    sampling_constraints={"detokenize": True},
)
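The three pipeline entries form a small dependency graph through input_sources. The sketch below mirrors only a few of the fields shown above in a hypothetical ToyStage dataclass (not the real StagePipelineConfig) and derives an execution order and the stages that emit final outputs.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter
from typing import Optional


@dataclass(frozen=True)
class ToyStage:
    """Hypothetical stand-in holding a subset of the fields shown on the slide."""
    stage_id: int
    model_stage: str
    input_sources: tuple = ()
    final_output_type: Optional[str] = None


STAGES = [
    ToyStage(0, "thinker", (), "text"),
    ToyStage(1, "talker", (0,)),
    ToyStage(2, "code2wav", (1,), "audio"),
]


def execution_order(stages: list) -> list:
    """Order stage ids so that every stage runs after its input sources."""
    known = {stage.stage_id for stage in stages}
    for stage in stages:
        missing = set(stage.input_sources) - known
        if missing:
            raise ValueError(f"stage {stage.stage_id} reads unknown stages {missing}")
    graph = {stage.stage_id: set(stage.input_sources) for stage in stages}
    return list(TopologicalSorter(graph).static_order())


def final_outputs(stages: list) -> dict:
    """Map each stage that produces a final output to its output type."""
    return {
        stage.model_stage: stage.final_output_type
        for stage in stages
        if stage.final_output_type is not None
    }


if __name__ == "__main__":
    print(execution_order(STAGES))
    print(final_outputs(STAGES))
```

In this reading, the thinker returns text to the user directly, while audio only becomes available after the talker and code2wav stages have run.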

17 of 45

Natively Disaggregated Serving

Standardized Unified Abstraction: A generalized interface that handles heterogeneous data (Text, Image, Audio).

Control & Data Plane Decoupling: Metadata travels via lightweight control signals, while heavy payloads are offloaded to high-performance data planes.

Hybrid Backend Support: Native support for Shared Memory (SHM) and Mooncake for distributed transfers.

Disaggregated Multi-Modal Execution: Enables seamless communication between decoupled stages.

Multi-Instance Scaling: Supports multiple instances for each omni stage, enabling elastic deployment and efficient load distribution across distributed clusters.
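The control/data plane split can be sketched with Python's standard shared-memory module: the heavy payload goes into a shared-memory block, and only a small metadata dict would travel over the control channel. This is a single-process toy, not the OmniConnector implementation; the function names and metadata keys are assumptions.

```python
from multiprocessing import shared_memory


def publish(payload: bytes):
    """Data plane: copy the payload into a new shared-memory block.

    Returns the small metadata message for the control plane and the owner
    handle, which must stay open until the consumer has read the data.
    """
    block = shared_memory.SharedMemory(create=True, size=len(payload))
    block.buf[: len(payload)] = payload
    metadata = {"shm_name": block.name, "num_bytes": len(payload)}
    return metadata, block


def receive(metadata: dict) -> bytes:
    """Attach to the block named in the metadata and copy the payload out."""
    block = shared_memory.SharedMemory(name=metadata["shm_name"])
    try:
        # Read only num_bytes: the block may be rounded up to a page size.
        return bytes(block.buf[: metadata["num_bytes"]])
    finally:
        block.close()


def roundtrip(payload: bytes):
    """Send a non-empty payload through shared memory and clean up afterwards."""
    metadata, owner = publish(payload)
    try:
        return metadata, receive(metadata)
    finally:
        owner.close()
        owner.unlink()


if __name__ == "__main__":
    meta, data = roundtrip(b"\x01\x02\x03" * 1000)
    print(meta["num_bytes"], len(data))
```

The metadata message stays a few dozen bytes regardless of payload size, which is the point of keeping the two planes separate.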

PR #215

#979

Metadata flow

D2H2D flow

D2D flow

StageEngine-1

Scheduler

Worker

OmniConnector

ModelRunner

Memory Pool

APIServer/AsyncOmni

Transfer engine

StageEngine-2

Scheduler

Worker

OmniConnector

ModelRunner

StageEngine-K

Scheduler

Worker

OmniConnector

ModelRunner

…

N_1

N_2

N_k

PipelineOrchestrator（multi-stage）

ZMQ IPC

asyncio

18 of 45

Async Chunk & Audio Streaming Output

Pipeline Between Stages: Asynchronous chunked computation and communication across stages

Audio Streaming Output: The waveform is output immediately after Talker generates each token.

APIServer Support: OpenAI v1/chat/completions with “stream” argument

PR #367

#727

#951

#1438

E2E generation

Streaming generation

19 of 45

Async Chunk & Audio Streaming Output

OmniConnector: Transmit data between stages.

OmniChunkTransferAdapter: Chunk-specific implementation that owns the full chunk lifecycle when async_chunk is enabled.

Stage Input Processors: Custom functions that process stage outputs into chunks for different models.

Schedulers: Modified to handle chunk-based scheduling with async IO-compute overlap.

Model Runners: Handle chunk processing.
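The idea behind async chunking can be sketched with two coroutines and a queue: the upstream stage hands over each chunk as soon as it is full, so the downstream stage starts before the upstream stage finishes. The stage functions below are toys with made-up names and do not reflect the OmniChunkTransferAdapter API.

```python
import asyncio

END = None  # sentinel marking the end of the token stream


async def toy_talker(num_tokens: int, chunk_size: int, queue: asyncio.Queue) -> None:
    """Upstream stage: emit tokens and forward them chunk by chunk."""
    chunk = []
    for token in range(num_tokens):
        await asyncio.sleep(0)  # stand-in for one decode step
        chunk.append(token)
        if len(chunk) == chunk_size:
            await queue.put(chunk)
            chunk = []
    if chunk:
        await queue.put(chunk)  # flush the last partial chunk
    await queue.put(END)


async def toy_code2wav(queue: asyncio.Queue, audio_chunks: list) -> None:
    """Downstream stage: convert each chunk as soon as it arrives."""
    while (chunk := await queue.get()) is not END:
        audio_chunks.append([f"wav{token}" for token in chunk])


async def run_pipeline(num_tokens: int = 5, chunk_size: int = 2) -> list:
    queue = asyncio.Queue()
    audio_chunks = []
    await asyncio.gather(
        toy_talker(num_tokens, chunk_size, queue),
        toy_code2wav(queue, audio_chunks),
    )
    return audio_chunks


if __name__ == "__main__":
    print(asyncio.run(run_pipeline()))
```

A smaller chunk_size lowers the time to the first audio chunk at the cost of more hand-offs between stages.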

20 of 45

Async Chunk & Audio Streaming Output

1 / 10 concurrency: 12.4x / 8.2x improvement

1 / 10 concurrency: 1.1x / 1.2x improvement

https://docs.vllm.ai/projects/vllm-omni/en/latest/design/feature/async_chunk_design/

Qwen3-Omni

21 of 45

Async Chunk & Audio Streaming Output

Qwen3-TTS

PR #1161

#1591

#1617

22 of 45

Scaling Disaggregated Serving

Command Line Interface (CLI) Design (PR #2020)

Distributed Deployment Structure

a. Multiple APIServers (PR #2020)

b. DP based Scaling for OmniStage

c. Coordinator for global management

Scheduling & Routing for load balancing

Example Multi-Node Deployment Diagram

23 of 45

Omni-Modality models Acceleration Support

Model	Output Modalities	Disaggregated deployment	Streaming	Async chunk	Graph	Quantization	NPU Support
Qwen2.5-Omni	Text/Audio	✅	Only text	❌	❌	❌	✅
Qwen3-Omni	Text/Audio	✅	✅	✅	✅	✅	✅
MiMo-Audio	Text/Audio	✅	✅	✅	✅	❌	❌
Qwen3-TTS	Audio	✅	✅	✅	✅	⏳	✅
Bagel	Text/Image	✅	Only text	N/A	Only AR	❌	❌
GLM-Image	Image	✅	N/A	N/A	✅	⏳	✅
Hunyuan-Image	Text/Image	✅	Only text	N/A	Only AR	⏳	✅

✅ = supported

❌ = not yet supported

⏳ = in progress (PR under review)

24 of 45

Diffusion Core

Natively implemented and optimized by configuring the acceleration layer
Aligned with the AR architecture through the model_runner abstraction

Encoders are disaggregated to run on the vLLM engine for higher throughput in large-scale deployment
Supports a step-wise scheduler with continuous batching for better SLA attainment
Connects with the AR module through OmniConnector to speed up AR+DiT model inference

Acceleration Features

Cache backend: Cache-DiT, TeaCache…

Parallelism: TP/HSDP/SP/CFG/VAE…
Attention: interface abstraction for third-party integration (FA/SAGE/MindIE-SD…)
Quantization: FP8/GGUF…
CPU Offload: module-wise/layer-wise
Extensions: LoRA

Diffusion Module Design

vllm import

native

Third-party

Diffusion Engine

Acceleration Features

Attention Backend

Cache Backend

Parallelism

SDPA

CacheDiT

TeaCache

Sparse

CPU Offload

(module-wise/

layer-wise)

Quantization

(FP8/GGUF)

Extensions

LoRA

HSDP

CFG

VAE

Diffusion Worker

Scheduler(step-wise)

Diffusion Pipeline

Prompt Encode

N-step Sampling

VAE Decode

embed

latent

DiT

Diffusion Model Runner

25 of 45

Diffusion Dynamic Batching

Async arrival over time

time

Request A

Request C

Request B

Request D

…

Async

Diffusion

Engine

async_add_request()

(accept independently)

Future per request

(created immediately)

Scheduler

Pick compatible requests

across async arrivals

batch compatible

ready

same

shape

same

CFG

Selected for next batch

(example)

Worker

state manager

batch builder

Per-request lifecycle

waiting → scheduled → running → finished

InputBatch

(formed from async arrivals)

ModelRunner /

execute_stepwise()

(process step by step)

Return immediately

when finished

(resolve Futures)

Result C

t4.2

Result A

t5.1

Result D

t6.3

Result B

t7.8

Out-of-order

return

Motivation

Image/video requests arrive at different timestamps.
Synchronous DiffusionEngine execution misses batching opportunities and under-utilizes DiT compute.

Design

Engine accepts requests asynchronously.
Scheduler selects compatible requests across arrivals.
Worker manages states and builds InputBatch for step-wise execution.

Async arrival → dynamic batching → larger effective DiT batch → higher GPU utilization / MFU → higher throughput and lower latency.

CPU batching overhead is negligible: make_batch < 1 ms vs. execute_stepwise ≈ 1000–1200 ms.
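The scheduler's "pick compatible requests across async arrivals" step can be sketched as below. The compatibility key (same shape, same CFG setting) comes from the slide; the ToyRequest class, the oldest-request-first policy, and max_batch_size are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyRequest:
    """Hypothetical request record; only shape and CFG matter for batching here."""
    request_id: str
    shape: tuple
    use_cfg: bool


def pick_next_batch(waiting: list, max_batch_size: int) -> list:
    """Batch the oldest waiting request with later arrivals that match it."""
    if not waiting:
        return []
    head = waiting[0]
    key = (head.shape, head.use_cfg)
    compatible = [r for r in waiting if (r.shape, r.use_cfg) == key]
    return compatible[:max_batch_size]


if __name__ == "__main__":
    waiting = [
        ToyRequest("A", (1024, 1024), True),
        ToyRequest("B", (512, 512), True),
        ToyRequest("C", (1024, 1024), True),
        ToyRequest("D", (1024, 1024), False),
    ]
    print([r.request_id for r in pick_next_batch(waiting, max_batch_size=4)])
```

Here A and C are batched together, while B (different shape) and D (different CFG setting) wait for a later batch. With batch construction under 1 ms against roughly 1000 ms of stepwise execution, the scheduling cost is on the order of 0.1% or less.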

26 of 45

Parallelism

Parallel Method	Core Description	Configuration Parameters
Tensor Parallelism (TP)	Shards model weights across multiple devices.	tensor_parallel_size
Sequence Parallelism (SP)	Splits the input along the sequence dimension.	ulysses_degree, ring_degree
Expert Parallelism (EP)	Applies only to MoE expert MLP blocks.	enable_expert_parallel
CFG-Parallelism	Assigns CFG's positive and negative branches to different GPUs.	cfg_parallel_size
HSDP	Hybrid Sharded Data Parallelism	use_hsdp, hsdp_shard_size
VAE Patch Parallelism	Distributes VAE encoding and decoding across GPUs.	vae_patch_parallel_size

Core Compatibility Matrix

Method	TP	Ulysses	Ring	CFG	HSDP	EP	VAE Patch
TP	-	✅	✅	✅	❌	✅	✅
Ulysses	✅	-	✅	✅	✅	✅	✅
Ring	✅	✅	-	✅	✅	✅	✅
CFG	✅	✅	✅	-	✅	✅	✅
HSDP	❌	✅	✅	✅	-	✅	✅
EP	✅	✅	✅	✅	✅	-	✅
VAE Patch	✅	✅	✅	✅	✅	✅	-
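The matrix has a single incompatible pair, TP with HSDP. A configuration check based on it could look like the sketch below; the function and constant names are hypothetical and this is not how vLLM-Omni validates its arguments.

```python
from itertools import combinations

METHODS = ("TP", "Ulysses", "Ring", "CFG", "HSDP", "EP", "VAE Patch")

# The only pair marked as incompatible in the matrix above.
INCOMPATIBLE = {frozenset({"TP", "HSDP"})}


def conflicts(enabled: set) -> list:
    """Return every pair of enabled methods that the matrix marks incompatible."""
    unknown = enabled - set(METHODS)
    if unknown:
        raise ValueError(f"unknown parallel methods: {sorted(unknown)}")
    return [
        (a, b)
        for a, b in combinations(sorted(enabled), 2)
        if frozenset({a, b}) in INCOMPATIBLE
    ]


if __name__ == "__main__":
    print(conflicts({"TP", "HSDP", "CFG"}))
    print(conflicts({"Ulysses", "Ring", "CFG"}))
```

Per-model support still has to be checked separately, since a pair being compatible in general does not mean a given model implements both methods.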

https://github.com/vllm-project/vllm-omni/issues/1217

Wan2.2 support

Parallel Method	Wan2.2 Support
SP (Ulysses and Ring)	✅
CFG-Parallel	✅
Tensor Parallel	✅
HSDP	✅
VAE patch parallel	✅
Expert parallelism	❌

27 of 45

Quantization

28 of 45

CPU Offload Support (1)

Sequential Offloading (coarse-grained model component level)

Register offload targets on model component (e.g., Encoder/DiT/…)
Pre-hooks registered on each target: before it runs, offload the other targets to CPU and move itself to GPU
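The pre-hook behaviour can be simulated without a GPU. In this toy, a "device" is just a string attribute and the hook is called by hand; ToyComponent and SequentialOffloader are hypothetical names, and a real implementation would hang the hook on the module's forward pass.

```python
class ToyComponent:
    """Stand-in for a model component such as an encoder, DiT, or VAE."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.device = "cpu"

    def to(self, device: str) -> None:
        self.device = device


class SequentialOffloader:
    """Keeps only the component that is about to run on the GPU."""

    def __init__(self, components: list) -> None:
        self.components = list(components)

    def pre_hook(self, active: ToyComponent) -> None:
        # Offload every other registered target, then load the active one.
        for component in self.components:
            component.to("gpu" if component is active else "cpu")

    def run(self, active: ToyComponent) -> list:
        """Fire the pre-hook, then report which components are GPU-resident."""
        self.pre_hook(active)
        return [c.name for c in self.components if c.device == "gpu"]


def demo() -> list:
    components = [ToyComponent("encoder"), ToyComponent("dit"), ToyComponent("vae")]
    offloader = SequentialOffloader(components)
    return [offloader.run(component) for component in components]


if __name__ == "__main__":
    print(demo())
```

At every point in the simulated pipeline exactly one component is on the GPU, which is what bounds peak memory at the largest single component.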

29 of 45

CPU Offload Support (2)

Layerwise Offloading

Keep a single TransformerBlock on GPU during computation
Hooks registered: prefetch parameters for the next block; release memory after computation
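The prefetch-then-release schedule can be traced with a small simulation. This is a toy that tracks block indices in a set instead of moving tensors, and the function name is made up; it assumes the next block is fetched before the current one computes.

```python
def run_layerwise(num_blocks: int):
    """Simulate layer-wise offloading and report the peak number of resident blocks.

    Returns (compute_order, peak_resident_blocks, blocks_left_on_gpu).
    """
    on_gpu = set()
    if num_blocks > 0:
        on_gpu.add(0)  # the first block is fetched before the loop starts
    peak = len(on_gpu)
    compute_order = []
    for index in range(num_blocks):
        if index + 1 < num_blocks:
            on_gpu.add(index + 1)  # pre-hook: prefetch the next block's params
        peak = max(peak, len(on_gpu))
        compute_order.append(index)  # stand-in for running block `index`
        on_gpu.discard(index)  # post-hook: release the finished block
    return compute_order, peak, on_gpu


if __name__ == "__main__":
    print(run_layerwise(4))
```

In this simulation one block computes while the next block's parameters are already resident, so the peak is two blocks rather than the whole transformer.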

30 of 45

Diffusion modules with Acceleration Support

https://docs.vllm.ai/projects/vllm-omni/en/latest/user_guide/diffusion_features/#imagegen

Model	Output modality	TeaCache	Cache-DiT	SP (Ulysses & Ring)	CFG-Parallel	TP	HSDP	VAE-Patch-Parallel	CPU Offload (Layerwise)	Quant (fp8/fp4…)
FLUX.1-dev	Image	⏳	✅	❌	✅	✅	⏳	❌	❌	✅
FLUX.2-klein		⏳	⏳	⏳	✅	✅	⏳	❌	❌	✅
LongCat-Image		❌	✅	✅	✅	✅	⏳	❌	❌	❌
LongCat-Image-Edit		❌	✅	✅	✅	✅	⏳	❌	❌	❌
Qwen-Image		✅	✅	✅	✅	✅	✅	N.A.	✅	✅
Qwen-Image-Edit		✅	✅	✅	✅	✅	✅	N.A.	✅	✅
Qwen-Image-Layered		❌	✅	✅	✅	✅	⏳	N.A.	✅	❌
SD3.5		⏳	✅	⏳	✅	✅	⏳	❌	❌	❌
NextStep1.1		⏳	⏳	❌	❌	✅	⏳	❌	❌	❌
Z-Image		✅	✅	✅	❌	✅	⏳	✅	❌	✅
Wan2.2	Video	❌	✅	✅	✅	✅	✅	✅	✅	✅
HunyuanVideo1.5		⏳	✅	✅	✅	✅	✅	❌	⏳	⏳
Seed-Helios		❌	❌	❌	❌	⏳	❌	❌	❌	⏳

31 of 45

Hardware plugin system

Pluggable

vLLM-Omni

EntryPoints

APIServer

Omni/AsyncOmni

OmniStage

Pluggable Layer

OmniConnector（E/P/D/G）

LLM Engine

Executor

Model Runner

Worker

Cache Engine

Scheduler

Diffusion

Scheduler

Diffusion Engine

Plugin

Hardware Platform

vLLM-Omni Plugin Interface

Executor

Worker

Model Runner

NPU/XPU/ROCm

Model Runner

Worker

Attention Backend

Attention

Custom kernels

Custom Layer

32 of 45

RL Support: VeRL-Omni

Key Features

Efficient multimodal rollout: vLLM-Omni provides high-throughput async serving for multimodal generation
Modular training engines: allow easy integration of parallelism (FSDP/SP) and other optimizations for various diffusion models
Flexible reward engine: supports both rule-based and model-based reward computations
Fast diffusion RL trainers: FlowGRPO trainer achieves ~25% higher E2E training throughput than the diffusers-based implementation

Ongoing

Model support: Qwen3-Omni, BAGEL, …
Algorithm support: GSPO, MixGRPO, …

Fully async for efficient diffusion/omni-modality RL
Training stability: determinism, staleness control, …
Hardware support: NPU
…

https://github.com/verl-project/verl-omni Welcome contributions!

33 of 45

LVSA — Long-Video Sparse Attention

Training-free block-sparse attention for long-video diffusion – vLLM-Omni plugin

PR #4192

Key1: Adaptive Expanded Sparse Pattern

Plugin for versions 0.18.0 & 0.22.0; supports Wan 2.X and HunyuanVideo on NPU and GPU

https://arxiv.org/abs/2605.31057

https://github.com/JiusiServe/LongVideoSparseAttention

Key2: Rotating Keyframes

Eliminates long-range temporal artifacts while keeping per-step attention budget constant.

Problem: Dense Attention Hits a Wall

Quadratic Compute. Attention over N = T×P tokens scales as O(N²).

Quality Collapse. Beyond training horizon, dense → ‘frozen video’.

Capability Wall at 14B Scale. Dense 14B at 2×: 72.9 GB. Dense at 4×+: > 2h on single GPU. Long-video serving = not viable at dense cost.
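The quadratic-compute point can be checked with simple arithmetic. The dense formula follows directly from N = T×P; the budgeted variant is only an illustrative model of a fixed number of keys per query and is not the LVSA sparse pattern itself. Function names and the example sizes are assumptions.

```python
def dense_attention_pairs(num_frames: int, tokens_per_frame: int) -> int:
    """Query-key pairs for dense attention over N = T * P tokens."""
    n = num_frames * tokens_per_frame
    return n * n


def budgeted_attention_pairs(
    num_frames: int, tokens_per_frame: int, keys_per_query: int
) -> int:
    """Query-key pairs when each query attends to at most keys_per_query keys."""
    n = num_frames * tokens_per_frame
    return n * min(keys_per_query, n)


if __name__ == "__main__":
    base = dense_attention_pairs(20, 1000)
    doubled = dense_attention_pairs(40, 1000)
    print(doubled / base)  # doubling the frame count quadruples dense cost
    sparse_base = budgeted_attention_pairs(20, 1000, 4096)
    sparse_doubled = budgeted_attention_pairs(40, 1000, 4096)
    print(sparse_doubled / sparse_base)  # a fixed budget grows linearly
```

Under this simplified model, a 4x longer video costs 16x the dense attention compute but only 4x with a fixed per-query budget.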

34 of 45

Summary

Support Omni-Modality model inference & serving
Consistent and unified API interface with vLLM
Native disaggregated deployment for different model stages
Pipeline async streaming for multiple stage engines
Native support for diffusion stage acceleration

35 of 45

vLLM-Omni 0.20.0 release

vLLM-Omni Team

36 of 45

vllm-omni 0.20.0 release highlights（1）

Rebased to upstream vLLM v0.20.0, with CUDA 13.0 and PyTorch 2.11 alignment, Transformers 5.x compatibility fixes, removal of the old vLLM entrypoint hijack, and runtime changes needed for the 0.20.0 integration path. (#3232, #3082, #3352, #3393, #2306)

CLI and configuration refactor, including the stage CLI refactor, forwarding CLI tokenizer settings into per-stage engine configs, removal of legacy Omni CLI helpers, cleaner deploy/pipeline config migration, and updated CLI documentation. (#2020, #3120, #3144, #2383, #2978)

Hardware plugin and platform optimization, expanding MUSA flash attention and torch.accelerator support, aligning NPU with the v0.20.0/GPU model-runner path, restoring ROCm/AMD CI signal, and refreshing XPU Docker/CI readiness for the PyTorch 2.11 stack. (#2451, #3101, #3325, #3343, #3083, #3393)

More SOTA model support, including Ming-flash-omni-2.0, XiaomiMiMo/MiMo-V2.5-ASR, MOSS-TTS-Nano, VoxCPM2 native AR TTS, HunyuanImage-3.0 IT2I, ERNIE image T2I, AudioX, Wan2.2-S2V, DreamID-Omni HSDP, LTX-2.3, and FastGen Wan 2.1 pipelines. (#2890, #3089, #2753, #2658, #3107, #2861, #2077, #2751, #3138, #2893, #2749)

37 of 45

vllm-omni 0.20.0 release highlights（2）

Diffusion continuous batching: adds async batch inference in the DiffusionEngine and strengthens step-level/diffusion serving paths with pipeline-declared offload modules, achieving a 7.8% increase in throughput and a 5.8% reduction in mean latency compared to the baseline. Also includes CFG/HSDP improvements, VAE tiling, and performance validation. (#2729, #2707, #2427, #2423, #2368, #2899, #2982)

Large-scale serving for Qwen3-Omni: scaling stages such as talker and code2wav with multiple replicas improves overall throughput for Qwen3-Omni deployment on H20 GPUs at concurrency 32 from 0.241 req/s to 0.414 req/s (+72%), while also increasing per-GPU efficiency to 0.138 req/s/GPU (+14%). (#3203, #2376, #3306, #2396, #2598, #2600)
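The throughput figures can be cross-checked with a few lines of arithmetic. The four input numbers are from the release note; the GPU counts are inferred from them and are not stated in the source, so treat them as a consistency check only.

```python
def derive_scaling_numbers() -> dict:
    """Recompute the reported gain and infer GPU counts from the quoted figures."""
    throughput_before = 0.241  # req/s, from the release note
    throughput_after = 0.414  # req/s, from the release note
    per_gpu_after = 0.138  # req/s/GPU, from the release note
    per_gpu_gain = 0.14  # +14%, from the release note

    speedup_pct = (throughput_after / throughput_before - 1) * 100
    gpus_after = throughput_after / per_gpu_after  # inferred, not stated
    per_gpu_before = per_gpu_after / (1 + per_gpu_gain)  # inferred
    gpus_before = throughput_before / per_gpu_before  # inferred

    return {
        "speedup_pct": speedup_pct,
        "gpus_after": gpus_after,
        "per_gpu_before": per_gpu_before,
        "gpus_before": gpus_before,
    }


if __name__ == "__main__":
    for name, value in derive_scaling_numbers().items():
        print(f"{name}: {value:.3f}")
```

The numbers are consistent with a move from roughly two to roughly three GPUs: total throughput rises by about 72%, and because that exceeds the growth in GPU count, per-GPU efficiency also rises.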

Expanded quantization coverage, including AutoRound W4A16 support for Qwen Omni, offline W4A16 quantized model support, OmniGen2 FP8, Z-Image text-encoder FP8 online quantization, HunyuanImage3 NPU quantization, GLM-Image quantization, and fixes for pre-quantized checkpoints. (#2670, #1777, #2441, #3279, #2979, #2292, #2702, #2795)

38 of 45

vllm-omni 0.20.0 release highlights（3）

TTS model speedups through native decoder construction, CUDA graph capture and shared memory pools, streaming VAE with full-graph, global speaker/reference-audio caches, and deterministic Fast AR sampling, delivering a roughly 9x VoxCPM2 RTF reduction (0.946 → 0.106 on H20), 53% lower Fish Speech Fast AR latency, and ~3.2 GiB Code2Wav memory savings for Qwen3-TTS/Voxtral-TTS. (#2690, #2758, #2803, #2341, #2630, #2657, #2520)

Omni model optimization: enables async scheduling and the uni-process executor to align with upstream AR performance, delivering a 50% reduction in TTFT/TPOT. (#3164, #3203)

Wan2.2 on NPU is now production-ready with major I2V performance optimizations, including MindIE-SD LA, fused RoPE/AdaLayerNorm/RMSNorm, VAE BF16 and parallelism fixes, and HSDP/USP deployment recipes, delivering a 50-60% performance improvement. (#2919, #2393, #2459, #2391, #2585, #2583, #2571, #3067, #2969, #2852, #3063, #2262, #2817)

39 of 45

Qwen3-Omni

High throughput (Concurrency 32)

Low latency (Concurrency 1)

The experiments were run on two NVIDIA H100 GPUs.

40 of 45

Qwen3-TTS

High throughput (Concurrency 32)

Low latency (Concurrency 1)

The experiments were run on a single H20 GPU.

41 of 45

Fish Speech S2 Pro

High throughput (Concurrency 8)

Low latency (Concurrency 1)

The experiments were run on a single H20 GPU.

42 of 45

Diffusion models

43 of 45

vLLM-Omni Roadmap

vLLM-Omni Team

44 of 45

Future Roadmap (Overview)

P0:

CI/CD: Performance monitoring
Large scale deployment
Diffusion continuous batching
Quantization (fp8/fp4/autoround w4a16…)

P1:

Auto-regressive DiT models (interactive/world models)
Diffusers backend support
Full ModelRunnerV2 support
Video streaming input/output

https://github.com/vllm-project/vllm-omni/issues/2136

45 of 45

vLLM Networking Hour!

https://github.com/vllm-project/vllm-omni

https://blog.vllm.ai/2025/11/30/vllm-omni.html

https://communityinviter.com/apps/vllm-dev/join-vllm-developers-slack