1 of 44

meetup

Apr 2026

2 of 44

docs.vllm.ai/projects/vllm-omni/en/latest

Wednesday 19:30 PDT

Stage

3 of 44

Our Goal

Build the fastest and

easiest-to-use open-source

Omni-Modality model inference & serving engine

4 of 44

Omni-Modality models

Omni-modality: Text, image, video, and audio data processing

Non-autoregressive Architectures: extend the AR support of vLLM to Diffusion Transformers (DiT) and other parallel generation models

Heterogeneous outputs: from traditional text generation to multimodal outputs

5 of 44

Broad Model Support

vLLM-Omni supports 40+ popular omni and diffusion model architectures (growing rapidly)

Qwen-Omni

Qwen-Image

BAGEL

Mistral

Wan

Ovis-Image

LongCat

SD3

Flux

Image/3D

StepFun

GLM

MiMo

…plus many more

6 of 44

Contributors

Thanks to all the contributors who raised issues, participated in discussions, and submitted PRs!

7 of 44

vLLM GitHub Repo

$ uv pip install vllm==0.20.0 --torch-backend=auto

$ uv pip install vllm-omni

4700+ Stars

Official release

8 of 44

vLLM-Omni System Walkthrough

vLLM-Omni Team

9 of 44

Goal of the walkthrough

  1. Understand how vLLM-Omni processes a multi-modal request and generates its multi-modal outputs.
  2. Learn where to make changes if you want to contribute a specific modification.

10 of 44

Multi-modality models

Backbone: AR + DiT

Models: Qwen-Image/GLM-Image

Tasks: t2i, t2v, i2i...

Backbone: AR + Spec. Gen.

Models: BAGEL, Hunyuan Image 3.0

Tasks: t2i, i2i, i2t...

Backbone: (multi) AR + DiT

Models: Qwen-Omni/Ming-Omni

Tasks: any-to-any

Yin, Peiqi, et al. "vLLM-Omni: Fully Disaggregated Serving for Any-to-Any Multimodal Models." arXiv preprint arXiv:2602.02204 (2026).

11 of 44

Multi-modality models: AR/DiT comparison

|                    | AR                                              | DiT                       |
|--------------------|-------------------------------------------------|---------------------------|
| Use cases          | Text generation                                 | Multi-modality generation |
| Generation process | Token-by-token, KV-cache based                  | Diffusion steps           |
| Bottleneck         | Prefill: compute bound; Decode: memory bound    | Compute bound             |
| Sequence length    | Varied                                          | Fixed                     |
| Attention mask     | Causal mask                                     | Full mask                 |
| Parallelism        | TP/DP/EP/PP/CP/SP                               | TP/EP/USP/CFG             |
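The attention-mask row of the comparison can be illustrated with a minimal sketch. The helper names below are hypothetical and are not part of vLLM-Omni; the sketch only shows the difference between a causal mask (each position sees itself and earlier positions) and a full mask (every position sees every position).

```python
def causal_mask(n: int) -> list:
    """mask[i][j] is True when query position i may attend to key position j."""
    return [[j <= i for j in range(n)] for i in range(n)]


def full_mask(n: int) -> list:
    """Every position attends to every position, as in a DiT block."""
    return [[True] * n for _ in range(n)]


if __name__ == "__main__":
    for name, mask in (("causal", causal_mask(4)), ("full", full_mask(4))):
        print(name)
        for row in mask:
            print(" ".join("x" if allowed else "." for allowed in row))
```

With a causal mask only the lower triangle is allowed, which is what makes token-by-token decoding with a KV cache possible; a full mask has no such ordering, so all positions are computed together in each diffusion step.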

12 of 44

Main architecture of vLLM-Omni

[Figure: vLLM-Omni architecture, with components marked as imported, modified, or new. Components: EntryPoints (APIServer, Omni/AsyncOmni), Orchestrator, AR (LLMEngine, Executor, Worker, ModelRunner, Scheduler, Cache Engine), Diffusion (DiffusionEngine, Scheduler, Worker, ModelRunner/Pipeline), Model/Layer/Ops, OmniConnector(E/P/D/G).]

Main components

  • Entrypoints: offline and online serving, with a StageClient abstraction for model stages (AR/DiT). One runtime, multiple interfaces.
  • AR module: inherited from vLLM (continuous batching, PagedAttention, prefix caching…) and adapted to omni-modality models
  • Diffusion module: implemented natively and optimized through acceleration components
  • Model/Layer/Ops: parallelism, quantization, attention…
  • OmniConnector: natively supports E/P/D/G disaggregation

13 of 44

Interface Design

AsyncOmniEngine

engine/async_omni_engine.py

AsyncOmni

vllm_omni/entrypoints/async_omni.py

Synchronous

Asynchronous

Developer Interface (vllm/engine)

User custom server

def add_request()

def abort_requests()

def step()

async def generate()

async def abort()

+ background engine loop

Omni

entrypoints/omni.py

openai_api_server

entrypoints/openai/api_server.py

End-user Interface (vllm_omni/entrypoints)

Batched inference

OpenAI-compatible API server

StageEngineCoreClient

StageDiffusionClient

Orchestrator

StageEngineCoreProc

StageDiffusionProc
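The split between a synchronous interface (add_request / abort_requests / step) and an asynchronous one (generate / abort plus a background engine loop) can be sketched as follows. ToyEngine and ToyAsyncEngine are hypothetical stand-ins written for illustration; they are not the real AsyncOmniEngine or AsyncOmni classes, and the "model" here simply upper-cases the prompt.

```python
import asyncio
from collections import deque
from typing import Optional


class ToyEngine:
    """Hypothetical synchronous engine driven by explicit step() calls."""

    def __init__(self) -> None:
        self._waiting = deque()

    def add_request(self, request_id: str, prompt: str) -> None:
        self._waiting.append((request_id, prompt))

    def abort_requests(self, request_ids: set) -> None:
        self._waiting = deque(r for r in self._waiting if r[0] not in request_ids)

    def step(self) -> list:
        """Process one queued request per call and return finished outputs."""
        if not self._waiting:
            return []
        request_id, prompt = self._waiting.popleft()
        return [(request_id, prompt.upper())]


class ToyAsyncEngine:
    """Hypothetical async wrapper: generate() awaits a background step loop."""

    def __init__(self) -> None:
        self._engine = ToyEngine()
        self._futures = {}
        self._loop_task: Optional[asyncio.Task] = None

    async def _run_loop(self) -> None:
        while self._futures:
            for request_id, output in self._engine.step():
                future = self._futures.pop(request_id, None)
                if future is not None and not future.done():
                    future.set_result(output)
            await asyncio.sleep(0)  # yield so callers can submit or abort

    async def generate(self, request_id: str, prompt: str) -> str:
        future = asyncio.get_running_loop().create_future()
        self._futures[request_id] = future
        self._engine.add_request(request_id, prompt)
        if self._loop_task is None or self._loop_task.done():
            self._loop_task = asyncio.create_task(self._run_loop())
        return await future

    async def abort(self, request_id: str) -> None:
        self._engine.abort_requests({request_id})
        future = self._futures.pop(request_id, None)
        if future is not None and not future.done():
            future.cancel()


async def main() -> list:
    engine = ToyAsyncEngine()
    return await asyncio.gather(
        engine.generate("a", "hello"), engine.generate("b", "omni")
    )


if __name__ == "__main__":
    print(asyncio.run(main()))
```

The synchronous form suits batched offline inference, where the caller owns the loop; the asynchronous form suits a server, where many independent callers await their own results while one loop drives the engine.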

14 of 44

When requests arrive

Example: Qwen3-Omni

[Figure: Qwen3-Omni request flow. Model components: Text Tokenizer, Visual Encoder, Audio Encoder, AR-decoder (Thinker), AR-decoder (Talker), Code2Wav. Runtime components: three StageClients, each paired with an EngineCoreProc running an EngineCore.]

15 of 44

Two-level Config system for Qwen3-Omni

OmniStage_0: Thinker
OmniStage_1: Talker
OmniStage_2: Code2wav

- stage_id: 0
  gpu_memory_utilization: 0.9
  devices: "0"
  default_sampling_params:
    temperature: 0.4
    top_p: 0.9
    top_k: 1
    max_tokens: 2048
    seed: 42
    repetition_penalty: 1.05

- stage_id: 1
  gpu_memory_utilization: 0.6
  devices: "1"
  input_connectors:
    from_stage_0: connector_of_shared_memory
  default_sampling_params:
    temperature: 0.9
    top_k: 50
    max_tokens: 4096
    seed: 42
    repetition_penalty: 1.05

- stage_id: 2
  gpu_memory_utilization: 0.1
  max_num_seqs: 1
  enforce_eager: true
  async_scheduling: false
  max_num_batched_tokens: 51200
  devices: "1"
  input_connectors:
    from_stage_1: connector_of_shared_memory
  default_sampling_params:
    temperature: 0.0
    top_p: 1.0
    top_k: -1
    max_tokens: 65536
    seed: 42
    repetition_penalty: 1.1

https://github.com/vllm-project/vllm-omni/blob/main/vllm_omni/deploy/qwen3_omni_moe.yaml

  • Deploy Config: per-stage deployment settings (devices, memory utilization, input connectors, default sampling parameters) for different deployment tasks, exposed to users

PR #2383
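A deploy config of this shape lends itself to simple sanity checks before launch. The sketch below transcribes a subset of the slide's values into Python data and runs two hypothetical checks; check_deploy_config is not a vLLM-Omni function. It assumes that "devices" is a comma-separated list of device indices and that the memory fractions of stages sharing a device should sum to at most 1.0.

```python
import re

# Subset of the slide's deploy config, transcribed by hand into Python data.
STAGES = [
    {"stage_id": 0, "devices": "0", "gpu_memory_utilization": 0.9},
    {"stage_id": 1, "devices": "1", "gpu_memory_utilization": 0.6,
     "input_connectors": {"from_stage_0": "connector_of_shared_memory"}},
    {"stage_id": 2, "devices": "1", "gpu_memory_utilization": 0.1,
     "input_connectors": {"from_stage_1": "connector_of_shared_memory"}},
]


def check_deploy_config(stages):
    """Hypothetical sanity checks; returns a list of human-readable problems."""
    problems = []
    known_ids = {stage["stage_id"] for stage in stages}
    memory_per_device = {}
    for stage in stages:
        # Every "from_stage_N" connector must point at a stage that exists.
        for key in stage.get("input_connectors", {}):
            match = re.fullmatch(r"from_stage_(\d+)", key)
            if match is None or int(match.group(1)) not in known_ids:
                problems.append(f"stage {stage['stage_id']}: unknown source in '{key}'")
        # Assumption: "devices" is a comma-separated list of device indices.
        for device in stage["devices"].split(","):
            used = memory_per_device.get(device, 0.0) + stage["gpu_memory_utilization"]
            memory_per_device[device] = used
    for device, used in sorted(memory_per_device.items()):
        if used > 1.0 + 1e-9:
            problems.append(f"device {device}: memory fractions sum to {used:.2f}")
    return problems


if __name__ == "__main__":
    print(check_deploy_config(STAGES) or "config looks consistent")
```

In the slide's config, Talker (0.6) and Code2wav (0.1) share device "1", which is why their memory fractions are kept well below the Thinker's 0.9 on device "0".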

16 of 44

Two-level Config system for Qwen3-Omni

https://github.com/vllm-project/vllm-omni/blob/main/vllm_omni/model_executor/models/qwen3_omni/pipeline.py

  • Pipeline Config: describes the data flow and compute pipeline of a specific model, exposed to developers

# OmniStage_0: Thinker
StagePipelineConfig(
    stage_id=0,
    model_stage="thinker",
    execution_type=StageExecutionType.LLM_AR,
    input_sources=(),
    final_output=True,
    final_output_type="text",
    owns_tokenizer=True,
    requires_multimodal_data=True,
    hf_config_name="thinker_config",
    engine_output_type="latent",
    …
)

# OmniStage_1: Talker
StagePipelineConfig(
    stage_id=1,
    model_stage="talker",
    execution_type=StageExecutionType.LLM_AR,
    input_sources=(0,),
    hf_config_name="talker_config",
    engine_output_type="latent",
    custom_process_input_func=f"{_PROC}.thinker2talker",
    custom_process_next_stage_input_func=(f"{_PROC}.talker2code2wav_async_chunk"),
    sampling_constraints={
        "detokenize": False,
        "stop_token_ids": [2150],
    },
)

# OmniStage_2: Code2wav
StagePipelineConfig(
    stage_id=2,
    model_stage="code2wav",
    execution_type=StageExecutionType.LLM_GENERATION,
    input_sources=(1,),
    final_output=True,
    final_output_type="audio",
    hf_config_name="thinker_config",
    engine_output_type="audio",
    custom_process_input_func=f"{_PROC}.talker2code2wav",
    sampling_constraints={"detokenize": True},
)
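The input_sources fields define a small dependency graph between stages. The sketch below shows one way such a graph could be turned into an execution order; StageSpec is a hypothetical, reduced stand-in for StagePipelineConfig that keeps only four of the fields shown on the slide, and execution_order is an illustrative helper rather than vLLM-Omni code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StageSpec:
    """Hypothetical, reduced stand-in for StagePipelineConfig (not the real class)."""
    stage_id: int
    model_stage: str
    input_sources: tuple = ()
    final_output_type: Optional[str] = None


# Values taken from the three stage configs on the slide.
QWEN3_OMNI = [
    StageSpec(0, "thinker", (), "text"),
    StageSpec(1, "talker", (0,)),
    StageSpec(2, "code2wav", (1,), "audio"),
]


def execution_order(stages):
    """Return stage names so that every stage runs after all of its input_sources."""
    done, order = set(), []
    while len(order) < len(stages):
        ready = [
            s for s in stages
            if s.stage_id not in done and all(src in done for src in s.input_sources)
        ]
        if not ready:
            raise ValueError("input_sources contain a cycle or an unknown stage id")
        for stage in ready:
            done.add(stage.stage_id)
            order.append(stage.model_stage)
    return order


def final_outputs(stages):
    """Map each stage that produces a user-visible output to its output type."""
    return {s.model_stage: s.final_output_type for s in stages if s.final_output_type}


if __name__ == "__main__":
    print(execution_order(QWEN3_OMNI))
    print(final_outputs(QWEN3_OMNI))
```

For Qwen3-Omni the graph is a simple chain, with text emitted by the Thinker and audio emitted by Code2wav; the same structure would also cover models where one stage feeds several downstream stages.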

17 of 44

Natively Disaggregated Serving

  • Standardized Unified Abstraction: A generalized interface that handles heterogeneous data (Text, Image, Audio).

  • Control & Data Plane Decoupling: Metadata travels via lightweight control signals, while heavy payloads are offloaded to high-performance data planes.

  • Hybrid Backend Support: Native support for Shared Memory (SHM) and Mooncake for distributed transfers.

  • Disaggregated Multi-Modal Execution: Enables seamless communication between decoupled stages.

  • Multi-Instance Scaling: Supports multiple instances for each omni stage, enabling elastic deployment and efficient load distribution across distributed clusters.
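The control and data plane decoupling described above can be sketched in a few lines. Everything here is a hypothetical in-process illustration: PayloadStore stands in for a data plane such as a shared-memory pool, a queue.Queue stands in for the control channel, and none of the names come from the OmniConnector API.

```python
import queue
import uuid


class PayloadStore:
    """Hypothetical in-process stand-in for a data plane (e.g. a shared-memory pool)."""

    def __init__(self) -> None:
        self._buffers = {}

    def put(self, payload: bytes) -> str:
        key = uuid.uuid4().hex
        self._buffers[key] = payload
        return key

    def take(self, key: str) -> bytes:
        # Pop so that the buffer is released once the consumer has read it.
        return self._buffers.pop(key)

    def __len__(self) -> int:
        return len(self._buffers)


def send(control: queue.Queue, store: PayloadStore, request_id: str, payload: bytes) -> None:
    """Write the heavy payload to the data plane; send only a small handle as metadata."""
    key = store.put(payload)
    control.put({"request_id": request_id, "key": key, "nbytes": len(payload)})


def receive(control: queue.Queue, store: PayloadStore):
    """Read the metadata message, then fetch the payload it points to."""
    meta = control.get_nowait()
    return meta["request_id"], store.take(meta["key"])


if __name__ == "__main__":
    control, store = queue.Queue(), PayloadStore()
    send(control, store, "req-1", b"\x00" * 1024)
    print(receive(control, store)[0], len(store))
```

Because the control message carries only a key and a size, the same control path can be kept while the data plane is swapped between backends.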

PR #215, #979

[Figure: disaggregated serving data flow. Legend: metadata flow, D2H2D flow, D2D flow. Components: APIServer/AsyncOmni; PipelineOrchestrator (multi-stage, asyncio) connected over ZMQ IPC to StageEngine-1, StageEngine-2, …, StageEngine-K, with N_1, N_2, …, N_k instances; each StageEngine contains a Scheduler, Worker, ModelRunner, and OmniConnector; Memory Pool and Transfer engine between stages.]

18 of 44

Async Chunk & Audio Streaming Output

  • Pipeline Between Stages: Asynchronous chunked computation and communication across stages

  • Audio Streaming Output: The waveform is output immediately after Talker generates each token.

  • APIServer Support: OpenAI v1/chat/completions with “stream” argument

PR #367, #727, #951, #1438

[Figure: E2E generation vs. streaming generation]

19 of 44

Async Chunk & Audio Streaming Output

  • OmniConnector: Transmit data between stages.

  • OmniChunkTransferAdapter: Chunk-specific implementation that owns the full chunk lifecycle when async_chunk is enabled.

  • Stage Input Processors: Custom functions that process stage outputs into chunks for different models.

  • Schedulers: Modified to handle chunk-based scheduling with async IO-compute overlap.

  • Model Runners: Handle chunk processing.
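The overlap that async chunking aims for can be illustrated with two coroutines and a queue. This is a toy sketch under stated assumptions: the talker and code2wav functions below only mimic the roles of those stages, CHUNK_SIZE is an arbitrary value, and the queue stands in for the connector; none of this is the OmniChunkTransferAdapter implementation.

```python
import asyncio

CHUNK_SIZE = 4  # assumption: tokens per chunk, chosen only for illustration
END = None      # sentinel marking the end of the stream


async def talker(num_tokens: int, out: asyncio.Queue, events: list) -> None:
    """Upstream stage: emits tokens and forwards them in fixed-size chunks."""
    chunk = []
    for token in range(num_tokens):
        await asyncio.sleep(0)  # stand-in for one decode step
        chunk.append(token)
        if len(chunk) == CHUNK_SIZE:
            events.append(f"talker sent {chunk[0]}-{chunk[-1]}")
            await out.put(chunk)
            chunk = []
    if chunk:
        await out.put(chunk)  # flush the final partial chunk
    events.append("talker done")
    await out.put(END)


async def code2wav(inp: asyncio.Queue, events: list) -> list:
    """Downstream stage: converts each chunk as soon as it arrives."""
    audio = []
    while (chunk := await inp.get()) is not END:
        audio.extend(f"wav{t}" for t in chunk)
        events.append(f"code2wav emitted {len(chunk)} frames")
    return audio


async def run_pipeline(num_tokens: int):
    events, chunks = [], asyncio.Queue()
    _, audio = await asyncio.gather(
        talker(num_tokens, chunks, events), code2wav(chunks, events)
    )
    return audio, events


if __name__ == "__main__":
    audio, events = asyncio.run(run_pipeline(10))
    print("\n".join(events))
```

The event log shows the downstream stage emitting audio frames before the upstream stage has finished, which is the property that reduces time to first audio in a streaming response.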

20 of 44

Async Chunk & Audio Streaming Output

Concurrency 1 / 10: 12.4x / 8.2x improvement

Concurrency 1 / 10: 1.1x / 1.2x improvement

https://docs.vllm.ai/projects/vllm-omni/en/latest/design/feature/async_chunk_design/

  • Qwen3-Omni

21 of 44

Async Chunk & Audio Streaming Output

  • Qwen3-TTS

PR #1161, #1591, #1617

22 of 44

Scaling Disaggregated Serving

  • Command Line Interface (CLI) Design (PR #2020)

  • Distributed Deployment Structure

a. Multiple APIServers (PR #2020)

b. DP based Scaling for OmniStage

c. Coordinator for global management

  • Scheduling & Routing for load balancing

Example Multi-Node Deployment Diagram

23 of 44

Omni-Modality models Acceleration Support

Model | Output modalities | Disaggregated deployment | Streaming | Async chunk | Graph | Quantization | NPU support

Qwen2.5-Omni

Text/Audio

Only text

Qwen3-Omni

Text/Audio

MiMo-Audio

Text/Audio

Qwen3-TTS

Audio

Bagel

Text/Image

Only text

N/A

Only AR

GLM-Image

Image

N/A

N/A

Hunyuan-Image

Text/Image

Only text

N/A

Only AR

  • ✅ = supported
  • ❌ = not yet supported
  • ⏳ = in progress (PR under review)

24 of 44

Diffusion Core

  • Natively implemented, with optimizations enabled by configuring the acceleration layer
  • Aligned with the AR architecture through a model_runner abstraction
  • Encoders are disaggregated to run on the vLLM engine for higher throughput in large-scale deployments
  • Step-wise scheduler with continuous batching for better SLA attainment
  • Connects to the AR module through OmniConnector to speed up AR+DiT model inference

Acceleration Features

  • Cache backend: Cache-DiT, TeaCache…
  • Parallelism: TP/HSDP/SP/CFG/VAE…
  • Attention: interface abstraction for third-party integration (FA/SAGE/MindIE-SD…)
  • Quantization: FP8/GGUF…
  • CPU Offload: module-wise/layer-wise
  • Extensions: LoRA

Diffusion Module Design

[Figure: diffusion module design, with components marked as vLLM import, native, or third-party. Diffusion Engine with a step-wise Scheduler, Diffusion Worker, and Diffusion Model Runner. Acceleration features: Attention Backend (SDPA, FA, Sparse), Cache Backend (CacheDiT, TeaCache), Parallelism (SP, TP, HSDP, CFG, VAE), CPU Offload (module-wise/layer-wise), Quantization (FP8/GGUF), Extensions (LoRA). Diffusion Pipeline: Prompt Encode → embed → N-step Sampling (DiT) → latent → VAE Decode.]

25 of 44

Diffusion Dynamic Batching

[Figure: dynamic batching timeline]

  • Async arrival over time: Request A (t0), Request C (t1), Request B (t2), Request D (t3)
  • Async Diffusion Engine: async_add_request() accepts each request independently; a Future per request is created immediately
  • Scheduler: picks compatible requests across async arrivals (ready, same shape, same CFG); selected for the next batch in this example: A, C, D
  • Worker: state manager and batch builder; per-request lifecycle: waiting → scheduled → running → finished; InputBatch formed from async arrivals (A, C, D)
  • ModelRunner / execute_stepwise(): processes step by step and returns immediately when a request finishes (resolves Futures)
  • Out-of-order return: Result C (t4.2), Result A (t5.1), Result D (t6.3), Result B (t7.8)

Motivation

  • Image/video requests arrive at different timestamps.
  • Synchronous DiffusionEngine execution misses batching opportunities and under-utilizes DiT compute.

Design

  • Engine accepts requests asynchronously.
  • Scheduler selects compatible requests across arrivals.
  • Worker manages states and builds InputBatch for step-wise execution.

Async arrival → dynamic batching → larger effective DiT batch → higher GPU utilization / MFU → higher throughput and lower latency.

CPU batching overhead is negligible: make_batch < 1 ms vs. execute_stepwise ≈ 1000–1200 ms.
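The design above can be sketched as a toy engine. The class below is hypothetical and heavily simplified: it borrows the name async_add_request from the slide but is not the DiffusionEngine implementation, it treats "compatible" as "same shape and same CFG setting", and it does not admit new requests into a batch that is already running.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    shape: tuple          # e.g. (height, width) of the requested output
    use_cfg: bool
    steps_left: int
    future: asyncio.Future


class ToyDiffusionEngine:
    """Hypothetical sketch of async arrival plus compatibility-based batching."""

    def __init__(self, max_batch_size: int = 8) -> None:
        self.waiting = []
        self.finished_order = []
        self.max_batch_size = max_batch_size

    def async_add_request(self, name, shape, use_cfg, num_steps) -> asyncio.Future:
        # Accept the request immediately and hand back a Future for its result.
        future = asyncio.get_running_loop().create_future()
        self.waiting.append(Request(name, shape, use_cfg, num_steps, future))
        return future

    def schedule(self) -> list:
        """Pick the oldest request plus every waiting request compatible with it."""
        key = (self.waiting[0].shape, self.waiting[0].use_cfg)
        batch = [r for r in self.waiting if (r.shape, r.use_cfg) == key]
        batch = batch[: self.max_batch_size]
        chosen = {id(r) for r in batch}
        self.waiting = [r for r in self.waiting if id(r) not in chosen]
        return batch

    async def run(self) -> None:
        while self.waiting:
            running = self.schedule()
            while running:
                await asyncio.sleep(0)  # stand-in for one batched denoising step
                for request in running:
                    request.steps_left -= 1
                for request in [r for r in running if r.steps_left == 0]:
                    request.future.set_result(f"result-{request.name}")
                    self.finished_order.append(request.name)
                running = [r for r in running if r.steps_left > 0]


async def main():
    engine = ToyDiffusionEngine()
    futures = [
        engine.async_add_request("A", (1024, 1024), True, 3),
        engine.async_add_request("C", (1024, 1024), True, 2),
        engine.async_add_request("B", (512, 512), True, 2),
        engine.async_add_request("D", (1024, 1024), True, 4),
    ]
    await engine.run()
    return engine.finished_order, await asyncio.gather(*futures)


if __name__ == "__main__":
    print(asyncio.run(main()))
```

With these made-up step counts, A, C, and D share a shape and are batched together while B waits, so requests finish in the order C, A, D, B even though they arrived as A, C, B, D.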

26 of 44

Parallelism

| Parallel method           | Core description                                                      | Configuration parameters    |
|---------------------------|-----------------------------------------------------------------------|-----------------------------|
| Tensor Parallelism (TP)   | Shards model weights across multiple devices                          | tensor_parallel_size        |
| Sequence Parallelism (SP) | Splits the input along the sequence dimension                         | ulysses_degree, ring_degree |
| Expert Parallelism (EP)   | Applies only to MoE expert MLP blocks                                 | enable_expert_parallel      |
| CFG-Parallelism           | CFG's positive and negative branches are assigned to different GPUs   | cfg_parallel_size           |
| HSDP                      | Hybrid Sharding Data Parallelism                                      | use_hsdp, hsdp_shard_size   |
| VAE Patch Parallelism     | Cross-GPU distributed VAE encoding and decoding                       | vae_patch_parallel_size     |
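CFG-parallelism is the easiest of these methods to illustrate: the positive and negative branches are independent forward passes that are combined only at the end. In the sketch below, two threads stand in for two GPUs and toy_denoiser is a hypothetical placeholder for a DiT forward pass; the combination rule is the standard classifier-free guidance formula, and none of the function names come from vLLM-Omni.

```python
from concurrent.futures import ThreadPoolExecutor


def toy_denoiser(latent, conditioning):
    """Hypothetical stand-in for one DiT forward pass."""
    return [x * 0.5 + conditioning for x in latent]


def cfg_step_sequential(latent, cond, uncond, guidance_scale):
    """Baseline: run both branches one after the other."""
    pos_out = toy_denoiser(latent, cond)
    neg_out = toy_denoiser(latent, uncond)
    return [n + guidance_scale * (p - n) for p, n in zip(pos_out, neg_out)]


def cfg_step_parallel(latent, cond, uncond, guidance_scale):
    """Run the two branches concurrently (threads stand in for two GPUs)."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        pos = pool.submit(toy_denoiser, latent, cond)
        neg = pool.submit(toy_denoiser, latent, uncond)
        pos_out, neg_out = pos.result(), neg.result()
    # Classifier-free guidance: move from the negative toward the positive branch.
    return [n + guidance_scale * (p - n) for p, n in zip(pos_out, neg_out)]


if __name__ == "__main__":
    print(cfg_step_parallel([1.0, 2.0], cond=1.0, uncond=0.0, guidance_scale=3.0))
```

Since the branches never exchange data until the final combination, this form of parallelism needs only one small exchange per denoising step, which is why it composes well with the other methods in the table.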

Core Compatibility Matrix

[Table: pairwise compatibility matrix for TP, Ulysses, Ring, CFG, HSDP, EP, and VAE Patch]

https://github.com/vllm-project/vllm-omni/issues/1217

Wan2.2 support

[Table: Wan2.2 support per parallel method: SP (Ulysses and Ring), CFG-Parallel, Tensor Parallel, HSDP, VAE patch parallel, Expert parallelism]

27 of 44

Quantization

28 of 44

CPU Offload Support (1)

Sequential Offloading (coarse-grained, at the model component level)

  • Register offload targets on model components (e.g., Encoder/DiT/…)
  • Pre-hooks registered: offload the other targets, then move the current component to the GPU
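The pre-hook mechanism for sequential offloading can be simulated without any GPU. The Component class and register_sequential_offload helper below are hypothetical; device placement is modelled as a plain string attribute so that the sketch only shows the ordering of "offload the others, then load myself", not real memory movement.

```python
class Component:
    """Hypothetical stand-in for a model component such as an encoder or a DiT."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.device = "cpu"
        self.pre_hooks = []

    def to(self, device: str) -> None:
        self.device = device

    def __call__(self, x):
        for hook in self.pre_hooks:
            hook(self)  # pre-hooks run before the component computes
        if self.device != "gpu":
            raise RuntimeError(f"{self.name} was called while on {self.device}")
        return f"{self.name}({x})"


def register_sequential_offload(components) -> None:
    """Register a pre-hook that keeps only the active component on the GPU."""

    def pre_hook(current: Component) -> None:
        for other in components:
            if other is not current:
                other.to("cpu")   # offload every other registered target
        current.to("gpu")         # then move the active component in

    for component in components:
        component.pre_hooks.append(pre_hook)


if __name__ == "__main__":
    encoder, dit, vae = Component("encoder"), Component("dit"), Component("vae")
    register_sequential_offload([encoder, dit, vae])
    print(vae(dit(encoder("prompt"))))
    print({c.name: c.device for c in (encoder, dit, vae)})
```

At any point only the component that is currently computing occupies the GPU, so peak memory is bounded by the largest single component rather than by the sum of all of them.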

29 of 44

CPU Offload Support (2)

Layerwise Offloading

  • Keep a single TransformerBlock on the GPU during computation
  • Hooks registered: prefetch parameters for the next block; release memory after computation
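The prefetch and release pattern for layerwise offloading can be traced with a small simulation. run_layerwise is a hypothetical helper: the "resident" set only records which block indices would hold GPU memory, and in this sketch the prefetched next block counts as resident, so the peak is two blocks (the one computing and the one being prefetched).

```python
def run_layerwise(blocks, x):
    """Run blocks in order, tracking which block indices are resident on the GPU."""
    resident = set()
    peak = 0
    if blocks:
        resident.add(0)                 # load the first block up front
    for i, block in enumerate(blocks):
        if i + 1 < len(blocks):
            resident.add(i + 1)         # pre-hook: prefetch the next block's params
        peak = max(peak, len(resident))
        x = block(x)                    # compute with block i
        resident.discard(i)             # post-hook: release block i's memory
    return x, peak, resident


if __name__ == "__main__":
    blocks = [lambda v: v + 1 for _ in range(4)]
    print(run_layerwise(blocks, 0))
```

Prefetching overlaps the transfer of block i + 1 with the computation of block i, which hides most of the host-to-device copy time at the cost of holding one extra block in memory.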

30 of 44

Diffusion modules with Acceleration Support

https://docs.vllm.ai/projects/vllm-omni/en/latest/user_guide/diffusion_features/#imagegen

Model | Output modality | TeaCache | Cache-DiT | SP (Ulysses & Ring) | CFG-Parallel | TP | HSDP | VAE-Patch-Parallel | CPU Offload (Layerwise) | Quant (fp8/fp4…)

FLUX.1-dev

Image

FLUX.2-klein

LongCat-Image

LongCat-Image-Edit

Qwen-Image

N.A.

Qwen-Image-Edit

N.A.

Qwen-Image-Layered

N.A.

SD3.5

NextStep1.1

Z-Image

Wan2.2

Video

HunyuanVideo1.5

Seed-Helios

31 of 44

Hardware plugin system

[Figure: hardware plugin architecture. vLLM-Omni (pluggable): EntryPoints (APIServer, Omni/AsyncOmni, OmniStage), OmniConnector(E/P/D/G), AR (LLM Engine, Executor, Worker, Model Runner, Scheduler, Cache Engine), Diffusion (Diffusion Engine, Scheduler). Pluggable Layer: vLLM-Omni Plugin Interface (Executor, Worker, Model Runner). Plugin, Hardware Platform (NPU/XPU/ROCm): Worker, Model Runner, Attention Backend, Attention, IR, Custom kernels, Custom Layer.]

32 of 44

RL Support: VeRL-Omni

Key Features

  • Efficient multimodal rollout: vLLM-Omni provides high-throughput async serving for multimodal generation
  • Modular training engines: allow easy integration of parallelism (FSDP/SP) and other optimizations for various diffusion models
  • Flexible reward engine: supports both rule-based and model-based reward computations
  • Fast diffusion RL trainers: FlowGRPO trainer achieves ~25% higher E2E training throughput than the diffusers-based implementation

Ongoing

  • Model support: Qwen3-Omni, BAGEL, …
  • Algorithm support: GSPO, MixGRPO, …
  • Fully async execution for efficient diffusion/omni-modality RL
  • Training stability: determinism, staleness control, …
  • Hardware support: NPU

33 of 44

Summary

  1. Support Omni-Modality model inference & serving
  2. Consistent and unified API interface with vLLM
  3. Native disaggregated deployment for different model stages
  4. Pipeline async streaming for multiple stage engines
  5. Native support for diffusion stage acceleration

34 of 44

vLLM-Omni 0.20.0 release

vLLM-Omni Team

35 of 44

vLLM-Omni 0.20.0 release highlights (1)

  • Rebased to upstream vLLM v0.20.0, with CUDA 13.0 and PyTorch 2.11 alignment, Transformers 5.x compatibility fixes, removal of the old vLLM entrypoint hijack, and runtime changes needed for the 0.20.0 integration path. (#3232, #3082, #3352, #3393, #2306)

  • CLI and configuration refactor, including the stage CLI refactor, forwarding CLI tokenizer settings into per-stage engine configs, removal of legacy Omni CLI helpers, cleaner deploy/pipeline config migration, and updated CLI documentation. (#2020, #3120, #3144, #2383, #2978)

  • Hardware plugin and platform optimization, expanding MUSA flash attention and torch.accelerator support, aligning NPU with the v0.20.0/GPU model-runner path, restoring ROCm/AMD CI signal, and refreshing XPU Docker/CI readiness for the PyTorch 2.11 stack. (#2451, #3101, #3325, #3343, #3083, #3393)

  • More SOTA model support, including Ming-flash-omni-2.0, XiaomiMiMo/MiMo-V2.5-ASR, MOSS-TTS-Nano, VoxCPM2 native AR TTS, HunyuanImage-3.0 IT2I, ERNIE image T2I, AudioX, Wan2.2-S2V, DreamID-Omni HSDP, LTX-2.3, and FastGen Wan 2.1 pipelines. (#2890, #3089, #2753, #2658, #3107, #2861, #2077, #2751, #3138, #2893, #2749)

36 of 44

vLLM-Omni 0.20.0 release highlights (2)

  • Diffusion continuous batching: adds async batch inference in the DiffusionEngine and strengthens the step-level diffusion serving paths with pipeline-declared offload modules, achieving a 7.8% increase in throughput and a 5.8% reduction in mean latency compared to the baseline. Also includes CFG/HSDP improvements, VAE tiling, and performance validation. (#2729, #2707, #2427, #2423, #2368, #2899, #2982)

  • Large-scale serving for Qwen3-Omni: scaling stages such as talker and code2wav with multiple replicas improves overall throughput for Qwen3-Omni deployment on H20 GPUs at concurrency 32 from 0.241 req/s to 0.414 req/s (+72%), while also increasing per-GPU efficiency to 0.138 req/s/GPU (+14%). (#3203, #2376, #3306, #2396, #2598, #2600)

  • Expanded quantization coverage, including AutoRound W4A16 support for Qwen Omni, offline W4A16 quantized model support, OmniGen2 FP8, Z-Image text-encoder FP8 online quantization, HunyuanImage3 NPU quantization, GLM-Image quantization, and fixes for pre-quantized checkpoints. (#2670, #1777, #2441, #3279, #2979, #2292, #2702, #2795)

37 of 44

vLLM-Omni 0.20.0 release highlights (3)

  • TTS model speedups through native decoder construction, CUDA graph capture and shared memory pools, streaming VAE with full-graph, global speaker/reference-audio caches, and deterministic Fast AR sampling, delivering a roughly 9x VoxCPM2 RTF reduction (0.946 → 0.106 on H20), 53% lower Fish Speech Fast AR latency, and ~3.2 GiB Code2Wav memory savings for Qwen3-TTS/Voxtral-TTS. (#2690, #2758, #2803, #2341, #2630, #2657, #2520)

  • Omni model optimization: enables async scheduling and the uni-process executor to align with upstream AR performance, delivering a 50% reduction in TTFT/TPOT. (#3164, #3203)

  • Wan2.2 on NPU is now production-ready with major I2V performance optimizations, including MindIE-SD LA, fused RoPE/AdaLayerNorm/RMSNorm, VAE BF16 and parallelism fixes, and HSDP/USP deployment recipes, delivering a 50-60% performance improvement. (#2919, #2393, #2459, #2391, #2585, #2583, #2571, #3067, #2969, #2852, #3063, #2262, #2817)

38 of 44

Qwen3-Omni

High throughput (Concurrency 32)

Low latency (Concurrency 1)

The experiments were run on two NVIDIA H100 GPUs.

39 of 44

Qwen3-TTS

High throughput (Concurrency 32)

Low latency (Concurrency 1)

The experiments were run on a single H20 GPU.

40 of 44

Fish Speech S2 Pro

High throughput (Concurrency 8)

Low latency (Concurrency 1)

The experiments were run on a single H20 GPU.

41 of 44

Diffusion models

42 of 44

vLLM-Omni Roadmap

vLLM-Omni Team

43 of 44

Future Roadmap (Overview)

P0:

  • CI/CD: Performance monitoring
  • Large scale deployment
  • Diffusion continuous batching
  • Quantization (fp8/fp4/autoround w4a16…)

P1:

  • Auto-regressive DiT models (interactive/world models)
  • Diffusers backend support
  • Full ModelRunnerV2 support
  • Video streaming input/output

https://github.com/vllm-project/vllm-omni/issues/2136

44 of 44

vLLM Networking Hour!

https://blog.vllm.ai/2025/11/30/vllm-omni.html