meetup
Apr 2026
docs.vllm.ai/projects/vllm-omni/en/latest
Wednesday 19:30 PDT
Stage
Our Goal
Build the fastest and easiest-to-use open-source omni-modality model inference and serving engine
Omni-Modality models
Omni-modality: Text, image, video, and audio data processing
Non-autoregressive Architectures: extend the AR support of vLLM to Diffusion Transformers (DiT) and other parallel generation models
Heterogeneous outputs: from traditional text generation to multimodal outputs
Broad Model Support
vLLM-Omni supports 40+ popular omni and diffusion model architectures (growing rapidly)
Qwen-Omni
Qwen-Image
BAGEL
Mistral
Wan
Ovis-Image
LongCat
SD3
Flux
Image/3D
StepFun
GLM
MiMo
…plus many more
Contributors
…
Thanks to all the contributors who raised issues, participated in discussions, and submitted PRs!
vLLM GitHub Repo
$ uv pip install vllm==0.20.0 --torch-backend=auto
$ uv pip install vllm-omni
4700+ Stars
Official release
vLLM-Omni System Walkthrough
vLLM-Omni Team
Goal of the walkthrough
2. Learn where to make changes if you would like to contribute a specific modification.
Multi-modality models
Backbone: AR + DiT
Models: Qwen-Image/GLM-Image
Tasks: t2i, t2v, i2i...
Backbone: AR + Spec. Gen.
Models: BAGEL, Hunyuan Image 3.0
Tasks: t2i, i2i, i2t...
Backbone: (multi) AR + DiT
Models: Qwen-Omni/Ming-Omni
Tasks: any-to-any
Yin, Peiqi, et al. "vLLM-Omni: Fully Disaggregated Serving for Any-to-Any Multimodal Models." arXiv preprint arXiv:2602.02204 (2026).
Multi-modality models: AR/DiT comparison
Aspect | AR | DiT |
Use cases | Text generation | Multi-modality generation |
Generation process | Token-by-token, KV-cache based | Diffusion steps |
Bottleneck | Prefill: compute-bound; decode: memory-bound | Compute-bound |
Sequence length | Varied | Fixed |
Attention mask | Causal mask | Full mask |
Parallelism | TP/DP/EP/PP/CP/SP | TP/EP/USP/CFG |
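The attention-mask row can be made concrete with a minimal sketch in plain Python. The helper names are illustrative only and are not vLLM-Omni APIs: an AR decoder lets position i attend to positions up to i, while a DiT lets every position attend to every other.

```python
def causal_mask(n: int) -> list[list[bool]]:
    """AR: query i may attend to key j only when j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def full_mask(n: int) -> list[list[bool]]:
    """DiT: every query attends to every key."""
    return [[True] * n for _ in range(n)]


def allowed_pairs(mask: list[list[bool]]) -> int:
    """Count the (query, key) pairs that the mask lets through."""
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    n = 4
    print(allowed_pairs(causal_mask(n)))  # n * (n + 1) // 2 pairs
    print(allowed_pairs(full_mask(n)))    # n * n pairs
```

For a fixed sequence length the full mask admits roughly twice as many query/key pairs as the causal mask.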
Main architecture of vLLM-Omni
imported
modified
new
vLLM-Omni
AR
APIServer
Omni/AsyncOmni
Model/Layer/Ops
OmniConnector(E/P/D/G)
LLMEngine
Executor
ModelRunner
Worker
Cache Engine
Scheduler
Diffusion
Worker
ModelRunner/Pipeline
Scheduler
DiffusionEngine
Main component
EntryPoints
Orchestrator
Interface Design
AsyncOmniEngine
engine/async_omni_engine.py
AsyncOmni
vllm_omni/entrypoints/async_omni.py
Synchronous
Asynchronous
Developer Interface: vllm/engine
User custom server
def add_request()
def abort_requests()
def step()
async def generate()
async def abort()
+ background engine loop
Omni
entrypoints/omni.py
openai_api_server
entrypoints/openai/
api_server.py
End-user Interface: vllm_omni/entrypoints
Batched inference
OpenAI-compatible API server
StageEngineCoreClient
StageDiffusionClient
Orchestrator
StageEngineCoreProc
StageDiffusionProc
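The split between the synchronous interface (add_request / abort_requests / step) and the asynchronous one (generate / abort plus a background engine loop) can be sketched with a toy engine. Everything below is a hypothetical stand-in written for illustration; the class names, signatures, and the uppercase "inference" are assumptions, not the actual vLLM-Omni implementation.

```python
import asyncio
from collections import deque


class ToySyncEngine:
    """Synchronous core: callers enqueue requests and drive step() themselves."""

    def __init__(self) -> None:
        self._waiting = deque()

    def add_request(self, request_id: str, prompt: str) -> None:
        self._waiting.append((request_id, prompt))

    def abort_requests(self, request_ids: set) -> None:
        self._waiting = deque(r for r in self._waiting if r[0] not in request_ids)

    def step(self) -> list:
        """Process at most one queued request and return finished outputs."""
        if not self._waiting:
            return []
        request_id, prompt = self._waiting.popleft()
        return [(request_id, prompt.upper())]  # stand-in for real inference


class ToyAsyncEngine:
    """Async wrapper: a background loop calls step() and resolves futures."""

    def __init__(self) -> None:
        self._engine = ToySyncEngine()
        self._futures = {}
        self._loop_task = None

    async def _engine_loop(self) -> None:
        while self._futures:
            for request_id, output in self._engine.step():
                future = self._futures.pop(request_id, None)
                if future is not None and not future.done():
                    future.set_result(output)
            await asyncio.sleep(0)  # yield so new requests can be added

    async def generate(self, request_id: str, prompt: str) -> str:
        future = asyncio.get_running_loop().create_future()
        self._futures[request_id] = future
        self._engine.add_request(request_id, prompt)
        if self._loop_task is None or self._loop_task.done():
            self._loop_task = asyncio.create_task(self._engine_loop())
        return await future

    async def abort(self, request_id: str) -> None:
        self._engine.abort_requests({request_id})
        future = self._futures.pop(request_id, None)
        if future is not None and not future.done():
            future.cancel()


async def demo() -> list:
    engine = ToyAsyncEngine()
    return await asyncio.gather(
        engine.generate("a", "hello"), engine.generate("b", "omni")
    )


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The synchronous class suits batched offline inference where the caller owns the loop; the async wrapper suits a server where many coroutines await results concurrently.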
When requests arrive
Example: Qwen3-Omni
AR-decoder(Thinker)
Visual Encoder
Text Tokenizer
Code2Wav
AR-decoder(Talker)
Audio Encoder
StageClient
StageClient
EngineCoreProc
EngineCoreProc
StageClient
EngineCore
EngineCore
EngineCore
EngineCoreProc
Two-level Config system for Qwen3-Omni
OmniStage_0: Thinker
OmniStage_1: Talker
OmniStage_2: Code2wav
- stage_id: 0
  gpu_memory_utilization: 0.9
  devices: "0"
  default_sampling_params:
    temperature: 0.4
    top_p: 0.9
    top_k: 1
    max_tokens: 2048
    seed: 42
    repetition_penalty: 1.05
- stage_id: 1
  gpu_memory_utilization: 0.6
  devices: "1"
  input_connectors:
    from_stage_0: connector_of_shared_memory
  default_sampling_params:
    temperature: 0.9
    top_k: 50
    max_tokens: 4096
    seed: 42
    repetition_penalty: 1.05
- stage_id: 2
  gpu_memory_utilization: 0.1
  max_num_seqs: 1
  enforce_eager: true
  async_scheduling: false
  max_num_batched_tokens: 51200
  devices: "1"
  input_connectors:
    from_stage_1: connector_of_shared_memory
  default_sampling_params:
    temperature: 0.0
    top_p: 1.0
    top_k: -1
    max_tokens: 65536
    seed: 42
    repetition_penalty: 1.1
https://github.com/vllm-project/vllm-omni/blob/main/vllm_omni/deploy/qwen3_omni_moe.yaml
PR #2383
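A quick sanity check on a deploy config like the one above: stages 1 and 2 share device "1", so their memory fractions add up to 0.6 + 0.1 = 0.7, while stage 0 takes 0.9 of device "0". The sketch below copies only the placement fields from the slide. The helper functions, the comma-separated reading of `devices`, and the rule that fractions on one device should not exceed 1.0 are assumptions made for illustration, not checks that vLLM-Omni is known to perform.

```python
from collections import defaultdict

# Deploy-level settings copied from the slide (sampling params omitted).
STAGES = [
    {"stage_id": 0, "gpu_memory_utilization": 0.9, "devices": "0"},
    {"stage_id": 1, "gpu_memory_utilization": 0.6, "devices": "1",
     "input_connectors": {"from_stage_0": "connector_of_shared_memory"}},
    {"stage_id": 2, "gpu_memory_utilization": 0.1, "devices": "1",
     "input_connectors": {"from_stage_1": "connector_of_shared_memory"}},
]


def memory_per_device(stages):
    """Sum gpu_memory_utilization over the stages sharing each device."""
    totals = defaultdict(float)
    for stage in stages:
        # Assumption: multiple devices would be written as "0,1".
        for device in stage["devices"].split(","):
            totals[device.strip()] += stage["gpu_memory_utilization"]
    return dict(totals)


def upstream_stages(stage):
    """Parse 'from_stage_N' connector keys into upstream stage ids."""
    keys = stage.get("input_connectors", {})
    return sorted(int(key.removeprefix("from_stage_")) for key in keys)


def validate(stages):
    """Return a list of human-readable problems (empty when consistent)."""
    problems = []
    known = {stage["stage_id"] for stage in stages}
    for stage in stages:
        for upstream in upstream_stages(stage):
            if upstream not in known:
                problems.append(
                    f"stage {stage['stage_id']}: unknown upstream {upstream}"
                )
    for device, total in memory_per_device(stages).items():
        if total > 1.0 + 1e-9:
            problems.append(f"device {device}: memory fraction {total:.2f} > 1.0")
    return problems


if __name__ == "__main__":
    print(memory_per_device(STAGES))
    print(validate(STAGES))
```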
Two-level Config system for Qwen3-Omni
https://github.com/vllm-project/vllm-omni/blob/main/vllm_omni/model_executor/models/qwen3_omni/pipeline.py
StagePipelineConfig(
    stage_id=0,
    model_stage="thinker",
    execution_type=StageExecutionType.LLM_AR,
    input_sources=(),
    final_output=True,
    final_output_type="text",
    owns_tokenizer=True,
    requires_multimodal_data=True,
    hf_config_name="thinker_config",
    engine_output_type="latent",
    …)
StagePipelineConfig(
    stage_id=1,
    model_stage="talker",
    execution_type=StageExecutionType.LLM_AR,
    input_sources=(0,),
    hf_config_name="talker_config",
    engine_output_type="latent",
    custom_process_input_func=f"{_PROC}.thinker2talker",
    custom_process_next_stage_input_func=f"{_PROC}.talker2code2wav_async_chunk",
    sampling_constraints={
        "detokenize": False,
        "stop_token_ids": [2150],
    },
)
StagePipelineConfig(
    stage_id=2,
    model_stage="code2wav",
    execution_type=StageExecutionType.LLM_GENERATION,
    input_sources=(1,),
    final_output=True,
    final_output_type="audio",
    hf_config_name="thinker_config",
    engine_output_type="audio",
    custom_process_input_func=f"{_PROC}.talker2code2wav",
    sampling_constraints={"detokenize": True},
)
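The `input_sources` fields in the stage definitions above form a small dependency graph (thinker, then talker, then code2wav). A minimal sketch of how an orchestrator could derive an execution order from them follows. `ToyStage` is a hypothetical stand-in that keeps only a few fields from the slide; it is not the real `StagePipelineConfig`, and the sort is a generic topological sort rather than the project's own logic.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ToyStage:
    """Hypothetical stand-in holding only the topology-related fields."""
    stage_id: int
    model_stage: str
    input_sources: Tuple[int, ...] = ()
    final_output_type: Optional[str] = None


PIPELINE = [
    ToyStage(0, "thinker", (), "text"),
    ToyStage(1, "talker", (0,)),
    ToyStage(2, "code2wav", (1,), "audio"),
]


def execution_order(stages):
    """Kahn-style topological sort over the input_sources edges."""
    remaining = {s.stage_id: set(s.input_sources) for s in stages}
    order = []
    while remaining:
        ready = sorted(sid for sid, deps in remaining.items() if not deps)
        if not ready:
            raise ValueError("cycle or unknown stage in input_sources")
        for sid in ready:
            order.append(sid)
            del remaining[sid]
        for deps in remaining.values():
            deps.difference_update(ready)
    return order


def final_outputs(stages):
    """Map each stage that produces a user-visible output to its modality."""
    return {s.model_stage: s.final_output_type for s in stages if s.final_output_type}


if __name__ == "__main__":
    print(execution_order(PIPELINE))
    print(final_outputs(PIPELINE))
```

Two stages are marked as final outputs, which matches the Text/Audio output modalities of Qwen3-Omni: text from the thinker and audio from code2wav.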
Natively Disaggregated Serving
PR #215, #979
Meta data flow
D2H2D flow
D2D flow
StageEngine-1
Scheduler
Worker
OmniConnector
ModelRunner
Memory Pool
APIServer/AsyncOmni
Transfer engine
StageEngine-2
Scheduler
Worker
OmniConnector
ModelRunner
StageEngine-K
Scheduler
Worker
OmniConnector
ModelRunner
…
N_1
N_2
N_k
PipelineOrchestrator(multi-stage)
ZMQ IPC
ZMQ IPC
ZMQ IPC
asyncio
Async Chunk & Audio Streaming Output
PR #367, #727, #951, #1438
E2E generation
Streaming generation
Async Chunk & Audio Streaming Output
Concurrency 1 / 10: 12.4x / 8.2x improvement
Concurrency 1 / 10: 1.1x / 1.2x improvement
https://docs.vllm.ai/projects/vllm-omni/en/latest/design/feature/async_chunk_design/
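The idea behind async chunking can be sketched with two coroutines and a queue: the upstream stage hands over codec tokens in chunks, so the downstream stage can start producing audio before generation finishes. This is a toy model under stated assumptions; the function names, the integer "codes", and the use of `asyncio.Queue` as the connector are illustrative and are not the vLLM-Omni design.

```python
import asyncio


async def talker(num_codes: int, chunk_size: int, out: asyncio.Queue) -> None:
    """Produce codec tokens one by one and hand them over chunk by chunk."""
    chunk = []
    for code in range(num_codes):
        await asyncio.sleep(0)  # stand-in for one decode step
        chunk.append(code)
        if len(chunk) == chunk_size:
            await out.put(chunk)
            chunk = []
    if chunk:
        await out.put(chunk)  # flush the final partial chunk
    await out.put(None)  # end-of-stream marker


async def code2wav(inp: asyncio.Queue, log: list) -> None:
    """Turn each received chunk into 'audio' as soon as it arrives."""
    while (chunk := await inp.get()) is not None:
        log.append((chunk[0], chunk[-1]))  # record the span that was vocoded


async def run(num_codes: int, chunk_size: int) -> list:
    queue = asyncio.Queue()
    log = []
    await asyncio.gather(talker(num_codes, chunk_size, queue), code2wav(queue, log))
    return log


if __name__ == "__main__":
    print(asyncio.run(run(10, 4)))   # streaming: several small spans
    print(asyncio.run(run(10, 10)))  # end-to-end: one span after all codes
```

With a chunk size of 4 the first audio span becomes available after 4 decode steps instead of 10, which is the effect that reduces time to first audio in streaming mode.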
Async Chunk & Audio Streaming Output
PR #1161, #1591, #1617
Scaling Disaggregated Serving
a. Multiple APIServers (PR #2020)
b. DP based Scaling for OmniStage
c. Coordinator for global management
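Points b and c can be pictured with a toy coordinator that tracks in-flight requests for each data-parallel replica of a stage and routes new work to the least-loaded one. The class, its method names, and the least-loaded policy are all assumptions made for illustration; the slides do not specify how the real coordinator balances load.

```python
class ToyCoordinator:
    """Tracks in-flight requests per stage replica; routes to the least loaded."""

    def __init__(self, replicas_per_stage: dict) -> None:
        # One load counter per replica, e.g. {"thinker": [0, 0], "talker": [0]}.
        self._load = {
            stage: [0] * count for stage, count in replicas_per_stage.items()
        }

    def route(self, stage: str) -> int:
        """Pick a replica for a new request and count it as in flight."""
        loads = self._load[stage]
        replica = loads.index(min(loads))  # ties go to the lowest replica index
        loads[replica] += 1
        return replica

    def finish(self, stage: str, replica: int) -> None:
        """Mark one request on the given replica as completed."""
        self._load[stage][replica] -= 1

    def load(self, stage: str) -> list:
        return list(self._load[stage])


if __name__ == "__main__":
    coordinator = ToyCoordinator({"thinker": 2, "talker": 1})
    print([coordinator.route("thinker") for _ in range(3)])
    print(coordinator.load("thinker"))
```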
Example Multi-Node Deployment Diagram
Omni-Modality models Acceleration Support
Model | Output Modalities | Disaggregated deployment | Streaming | Async chunk | Graph | Quantization | NPU Support |
Qwen2.5-Omni | Text/Audio | ✅ | Only text | ❌ | ❌ | ❌ | ✅ |
Qwen3-Omni | Text/Audio | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
MiMo-Audio | Text/Audio | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
Qwen3-TTS | Audio | ✅ | ✅ | ✅ | ✅ | ⏳ | ✅ |
Bagel | Text/Image | ✅ | Only text | N/A | Only AR | ❌ | ❌ |
GLM-Image | Image | ✅ | N/A | N/A | ✅ | ⏳ | ✅ |
Hunyuan-Image | Text/Image | ✅ | Only text | N/A | Only AR | ⏳ | ✅ |
Diffusion Core
Acceleration Features
Diffusion Module Design
vllm import
native
Third-party
Diffusion Engine
Acceleration Features
Attention Backend
Cache Backend
Parallelism
SP
TP
SDPA
FA
CacheDiT
TeaCache
Sparse
CPU Offload (module-wise/layer-wise)
Quantization (FP8/GGUF)
Extensions
LoRA
HSDP
CFG
VAE
Diffusion Worker
Scheduler(step-wise)
Diffusion Pipeline
Prompt Encode
N-step Sampling
VAE Decode
embed
latent
DiT
Diffusion Model Runner
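The three pipeline phases named above (prompt encode producing an embedding, N-step sampling with the DiT over a latent, VAE decode) can be traced with a toy numeric pipeline. All functions here are made-up stand-ins operating on lists of floats; the "DiT" is a hand-written velocity field and the sampler is a plain Euler loop, chosen only to show the data flow, not to mirror any model in vLLM-Omni.

```python
def prompt_encode(prompt: str) -> list:
    """Toy text encoder: one number in [0, 1) per character."""
    if not prompt:
        raise ValueError("prompt must not be empty")
    return [(ord(ch) % 32) / 32 for ch in prompt]


def toy_dit(latent: list, embed: list, t: float) -> list:
    """Toy denoiser: velocity that reaches the mean embedding at t = 1."""
    target = sum(embed) / len(embed)
    return [(target - x) / (1.0 - t) for x in latent]


def n_step_sampling(embed: list, latent: list, num_steps: int) -> list:
    """Euler integration from t = 0 to t = 1 in num_steps equal steps."""
    dt = 1.0 / num_steps
    for step in range(num_steps):
        velocity = toy_dit(latent, embed, step * dt)
        latent = [x + dt * v for x, v in zip(latent, velocity)]
    return latent


def vae_decode(latent: list) -> list:
    """Toy decoder: clamp to [0, 1] and map to 8-bit pixel values."""
    return [round(255 * min(max(x, 0.0), 1.0)) for x in latent]


def run_pipeline(prompt: str, noise: list, num_steps: int) -> list:
    embed = prompt_encode(prompt)                     # Prompt Encode -> embed
    latent = n_step_sampling(embed, noise, num_steps)  # N-step Sampling -> latent
    return vae_decode(latent)                         # VAE Decode -> pixels


if __name__ == "__main__":
    print(run_pipeline("ab", [0.0, 1.0], num_steps=4))
```

The sampling loop is the part a step-wise scheduler would interleave across requests, since each iteration is one DiT forward pass.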
Diffusion Dynamic Batching
Async arrival over time: Request A (t0), Request C (t1), Request B (t2), Request D (t3), …
Async Diffusion Engine: async_add_request() accepts each request independently; a Future per request is created immediately
Scheduler: picks compatible requests across async arrivals (batch compatible: ready, same shape, same CFG)
Selected for next batch (example): A, C, D
Worker: state manager, batch builder
Per-request lifecycle: waiting → scheduled → running → finished
InputBatch (formed from async arrivals): A, C, D
ModelRunner / execute_stepwise(): processes the batch step by step
Return immediately when finished (resolve Futures): Result C (t4.2), Result A (t5.1), Result D (t6.3), Result B (t7.8)
Out-of-order return
Motivation
Design
Async arrival → dynamic batching → larger effective DiT batch → higher GPU utilization / MFU → higher throughput and lower latency.
CPU batching overhead is negligible: make_batch < 1 ms vs. execute_stepwise ≈ 1000–1200 ms.
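The scheduling rule (pick ready requests with the same shape and the same CFG setting) can be sketched as follows. `ToyRequest`, `pick_batch`, and the concrete shapes are hypothetical and exist only to illustrate the grouping; the real scheduler's data structures and policy are not shown in the slides.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class ToyRequest:
    request_id: str
    shape: Tuple[int, int]  # (height, width) of the requested output
    use_cfg: bool
    state: str = "waiting"


def pick_batch(waiting: list, max_batch_size: int) -> list:
    """Take the oldest request plus later arrivals with the same shape and CFG."""
    if not waiting:
        return []
    key = (waiting[0].shape, waiting[0].use_cfg)
    batch = [r for r in waiting if (r.shape, r.use_cfg) == key][:max_batch_size]
    for request in batch:
        request.state = "scheduled"
    return batch


if __name__ == "__main__":
    # Arrival order A, C, B, D; only B asks for a different shape.
    waiting = [
        ToyRequest("A", (1024, 1024), True),
        ToyRequest("C", (1024, 1024), True),
        ToyRequest("B", (512, 512), True),
        ToyRequest("D", (1024, 1024), True),
    ]
    batch = pick_batch(waiting, max_batch_size=8)
    print([r.request_id for r in batch])
    print([r.request_id for r in waiting if r.state == "waiting"])
```

In this example B stays in the waiting queue and joins a later batch, which is why its result can return after requests that arrived later.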
Parallelism
Parallel Method | Core Description | Configuration Parameters |
Tensor Parallelism (TP) | Shards model weights across multiple devices | tensor_parallel_size |
Sequence Parallelism (SP) | Splits the input along the sequence dimension | ulysses_degree, ring_degree |
Expert Parallelism (EP) | Applies only to MoE expert MLP blocks | enable_expert_parallel |
CFG-Parallelism | Assigns CFG's positive and negative branches to different GPUs | cfg_parallel_size |
HSDP | Hybrid Sharded Data Parallelism | use_hsdp, hsdp_shard_size |
VAE Patch Parallelism | Distributes VAE encoding and decoding across GPUs | vae_patch_parallel_size |
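CFG-Parallelism works because the positive and negative branches of classifier-free guidance are independent forward passes that are only combined afterwards, using the standard rule output = negative + scale * (positive - negative). The sketch below uses two threads as stand-ins for two GPUs and a made-up `toy_denoiser`; it illustrates the idea only and does not reflect how vLLM-Omni distributes the branches.

```python
from concurrent.futures import ThreadPoolExecutor


def toy_denoiser(latent: list, conditioning: float) -> list:
    """Stand-in for one DiT forward pass under a given conditioning."""
    return [x * 0.5 + conditioning for x in latent]


def cfg_step_parallel(latent: list, cond: float, uncond: float,
                      guidance_scale: float) -> list:
    """Run both CFG branches concurrently, then combine their outputs."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        positive = pool.submit(toy_denoiser, latent, cond)    # "GPU 0"
        negative = pool.submit(toy_denoiser, latent, uncond)  # "GPU 1"
        pos_out, neg_out = positive.result(), negative.result()
    # Classifier-free guidance: push the output away from the negative branch.
    return [n + guidance_scale * (p - n) for p, n in zip(pos_out, neg_out)]


if __name__ == "__main__":
    print(cfg_step_parallel([2.0, 4.0], cond=1.0, uncond=0.0, guidance_scale=3.0))
```

Because each branch needs the full model, this form of parallelism roughly halves per-step latency for CFG-enabled requests at the cost of a second device.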
Core Compatibility Matrix
Method | TP | Ulysses | Ring | CFG | HSDP | EP | VAE Patch |
TP | - | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ |
Ulysses | ✅ | - | ✅ | ✅ | ✅ | ✅ | ✅ |
Ring | ✅ | ✅ | - | ✅ | ✅ | ✅ | ✅ |
CFG | ✅ | ✅ | ✅ | - | ✅ | ✅ | ✅ |
HSDP | ❌ | ✅ | ✅ | ✅ | - | ✅ | ✅ |
EP | ✅ | ✅ | ✅ | ✅ | ✅ | - | ✅ |
VAE Patch | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | - |
https://github.com/vllm-project/vllm-omni/issues/1217
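The compatibility matrix has a single ❌ pair (TP with HSDP), so a configuration check reduces to looking for that pair. The helper below is an illustrative sketch; the function name and the encoding of the matrix are assumptions, and the pair list should be updated if the matrix in the linked issue changes.

```python
from itertools import combinations

METHODS = ("TP", "Ulysses", "Ring", "CFG", "HSDP", "EP", "VAE Patch")

# The only pair marked as incompatible in the matrix.
INCOMPATIBLE = {frozenset({"TP", "HSDP"})}


def conflicts(enabled: list) -> list:
    """Return the incompatible pairs among the enabled parallel methods."""
    unknown = set(enabled) - set(METHODS)
    if unknown:
        raise ValueError(f"unknown methods: {sorted(unknown)}")
    return [
        pair
        for pair in combinations(sorted(set(enabled)), 2)
        if frozenset(pair) in INCOMPATIBLE
    ]


if __name__ == "__main__":
    print(conflicts(["TP", "Ulysses", "CFG"]))
    print(conflicts(["HSDP", "TP", "Ring"]))
```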
Wan2.2 support
Parallel Method | Wan2.2 Support |
SP (Ulysses and Ring) | ✅ |
CFG-Parallel | ✅ |
Tensor Parallel | ✅ |
HSDP | ✅ |
VAE patch parallel | ✅ |
Expert parallelism | ❌ |
Quantization
CPU Offload Support (1)
Sequential Offloading (coarse-grained model component level)
CPU Offload Support (2)
Layerwise Offloading
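The difference between the two offload granularities shows up in peak GPU residency. The toy model below assumes that sequential offloading keeps one whole component on the GPU at a time and that layerwise offloading keeps the running layer plus a small prefetch window. Both functions and all sizes are made up for illustration and do not describe measured vLLM-Omni behavior.

```python
def peak_sequential(component_sizes: dict) -> float:
    """Component-level offload: one whole component is resident at a time."""
    return max(component_sizes.values())


def peak_layerwise(layer_sizes: list, prefetch: int = 1) -> float:
    """Layerwise offload: the running layer plus `prefetch` upcoming layers."""
    window = prefetch + 1
    return max(
        sum(layer_sizes[i:i + window]) for i in range(len(layer_sizes))
    )


if __name__ == "__main__":
    # Hypothetical sizes in GB: a 12-layer DiT with 2 GB per layer.
    dit_layers = [2.0] * 12
    components = {"text_encoder": 8.0, "dit": sum(dit_layers), "vae": 1.0}
    print(peak_sequential(components))           # bounded by the largest component
    print(peak_layerwise(dit_layers, prefetch=1))  # bounded by the layer window
```

Under these assumptions layerwise offloading trades more host-to-device transfers for a much smaller resident footprint, which is why it matters most for large DiT backbones.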
Diffusion modules with Acceleration Support
https://docs.vllm.ai/projects/vllm-omni/en/latest/user_guide/diffusion_features/#imagegen
Model | Output modality | TeaCache | Cache-DiT | SP (Ulysses & Ring) | CFG-Parallel | TP | HSDP | VAE-Patch-Parallel | CPU Offload (Layerwise) | Quant (fp8/fp4…) |
FLUX.1-dev | Image | ⏳ | ✅ | ❌ | ✅ | ✅ | ⏳ | ❌ | ❌ | ✅ |
FLUX.2-klein | Image | ⏳ | ⏳ | ⏳ | ✅ | ✅ | ⏳ | ❌ | ❌ | ✅ |
LongCat-Image | Image | ❌ | ✅ | ✅ | ✅ | ✅ | ⏳ | ❌ | ❌ | ❌ |
LongCat-Image-Edit | Image | ❌ | ✅ | ✅ | ✅ | ✅ | ⏳ | ❌ | ❌ | ❌ |
Qwen-Image | Image | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | N.A. | ✅ | ✅ |
Qwen-Image-Edit | Image | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | N.A. | ✅ | ✅ |
Qwen-Image-Layered | Image | ❌ | ✅ | ✅ | ✅ | ✅ | ⏳ | N.A. | ✅ | ❌ |
SD3.5 | Image | ⏳ | ✅ | ⏳ | ✅ | ✅ | ⏳ | ❌ | ❌ | ❌ |
NextStep1.1 | Image | ⏳ | ⏳ | ❌ | ❌ | ✅ | ⏳ | ❌ | ❌ | ❌ |
Z-Image | Image | ✅ | ✅ | ✅ | ❌ | ✅ | ⏳ | ✅ | ❌ | ✅ |
Wan2.2 | Video | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
HunyuanVideo1.5 | Video | ⏳ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ⏳ | ⏳ |
Seed-Helios | Video | ❌ | ❌ | ❌ | ❌ | ⏳ | ❌ | ❌ | ❌ | ⏳ |
Hardware plugin system
Pluggable
vLLM-Omni
AR
EntryPoints
APIServer
Omni/AsyncOmni
OmniStage
Pluggable Layer
OmniConnector(E/P/D/G)
LLM Engine
Executor
Model Runner
Worker
Cache Engine
Scheduler
Diffusion
Scheduler
Diffusion Engine
Plugin
Hardware Platform
vLLM-Omni Plugin Interface
Executor
Worker
Model Runner
NPU/XPU/ROCm
Model Runner
Worker
Attention Backend
Attention
IR
Custom kernels
Custom Layer
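The pluggable-layer idea (a hardware platform supplies its own Worker, Model Runner, attention backend, or kernels, and the core falls back to defaults otherwise) can be sketched with a small registry. This is purely hypothetical: the registry, the function names, and the class paths are invented for illustration and are not the vLLM-Omni plugin interface.

```python
# Hypothetical registry: platform name -> {component name -> implementation path}.
_PLATFORMS = {}

DEFAULTS = {
    "worker": "default.Worker",
    "model_runner": "default.ModelRunner",
    "attention_backend": "default.SDPA",
}


def register_platform(name: str, **components: str) -> None:
    """Record the components a hardware plugin wants to override."""
    _PLATFORMS[name] = dict(components)


def resolve(platform: str, component: str) -> str:
    """Return the plugin's override if present, else the built-in default."""
    return _PLATFORMS.get(platform, {}).get(component, DEFAULTS[component])


if __name__ == "__main__":
    # A made-up NPU plugin overriding two of the three components.
    register_platform("npu", worker="npu.Worker", attention_backend="npu.Attention")
    print(resolve("npu", "worker"))
    print(resolve("npu", "model_runner"))
    print(resolve("cuda", "attention_backend"))
```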
RL Support: VeRL-Omni
Key Features
Ongoing
https://github.com/verl-project/verl-omni Welcome contributions!
Summary
vLLM-Omni 0.20.0 release
vLLM-Omni Team
vLLM-Omni 0.20.0 release highlights (1)
vLLM-Omni 0.20.0 release highlights (2)
vLLM-Omni 0.20.0 release highlights (3)
Qwen3-Omni
High throughput (Concurrency 32)
Low latency (Concurrency 1)
The experiments were run on two NVIDIA H100 GPUs.
Qwen3-TTS
High throughput (Concurrency 32)
Low latency (Concurrency 1)
The experiments were run on a single H20 GPU.
Fish Speech S2 Pro
High throughput (Concurrency 8)
Low latency (Concurrency 1)
The experiments were run on a single H20 GPU.
Diffusion models
vLLM-Omni Roadmap
vLLM-Omni Team
Future Roadmap (Overview)
P0:
P1:
https://github.com/vllm-project/vllm-omni/issues/2136
vLLM Networking Hour!
https://blog.vllm.ai/2025/11/30/vllm-omni.html