Meetup
January 2026
About us
Roger Wang
vLLM & vLLM-Omni Committer
Han Gao
vLLM-Omni Committer
Hongsheng Liu
vLLM-Omni Committer
vLLM-Omni Overview
vLLM-Omni Team
Our Goal
Build the fastest and easiest-to-use open-source omni-modality model inference & serving engine
Omni-Modality models
vLLM-Omni API (1): Omni class
from vllm_omni import Omni
# Example multi-modal inputs.
inputs = {"prompt": prompt,
          "multi_modal_data": {"video": video_frames, "audio": audio_signal}}
# Create an Omni engine with an HF model name.
omni = Omni(model="Qwen/Qwen3-Omni-30B-A3B-Instruct")
# Generate text and audio from the multi-modality inputs.
outputs = omni.generate(inputs)
A Python interface for offline batched inference with Qwen3-Omni and Qwen-Image
https://docs.vllm.ai/projects/vllm-omni/en/latest/user_guide/examples/offline_inference/
from vllm_omni import Omni
# Example prompt.
inputs = "A cup of coffee on the table"
# Create an Omni engine with an HF model name.
omni = Omni(model="Qwen/Qwen-Image-2512")
# Generate an image from the text prompt.
outputs = omni.generate(inputs)
vLLM-Omni API (2): OpenAI-compatible server
A FastAPI-based server for online serving of Qwen3-Omni
Server:
$ vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091

Client:
$ curl -sS -X POST http://localhost:8091/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen3-Omni-30B-A3B-Instruct",
        "messages": [{"role": "user", "content": "Why is this video funny?"}],
        "sampling_params_list": '"$sampling_params_list"'
      }'
https://docs.vllm.ai/projects/vllm-omni/en/latest/user_guide/examples/online_serving
vLLM-Omni API (2): OpenAI-compatible server
A FastAPI-based server for online serving of Qwen-Image

Server:
$ vllm serve Qwen/Qwen-Image-2512 --omni --port 8091

Client:
$ curl -s http://localhost:8091/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "a cup of coffee on the table"}],
        "extra_body": {"height": 1024, "width": 1024}
      }' | jq -r '.choices[0].message.content[0].image_url.url' | cut -d ',' -f2 | base64 -d > output.png

https://docs.vllm.ai/projects/vllm-omni/en/latest/user_guide/examples/online_serving
vLLM-Omni serving demo
A Gradio demo for Qwen3-Omni online serving
python gradio_demo.py --model Qwen/Qwen3-Omni-30B-A3B-Instruct --port 7861
Then open http://localhost:7861/ in your browser to interact with the web UI.
Broad Model Support
vLLM-Omni supports 20+ popular omni and diffusion model architectures (growing rapidly):
Qwen-Omni, Qwen-Image, BAGEL, Z-Image, Wan, Ovis-Image, LongCat, SD3, Flux, Image/3D, StepFun, GLM, MiMo, and many more
Contributors
…
Thanks to all the contributors who raised issues, participated in discussions, and submitted PRs!
vLLM-Omni GitHub Repo
$ uv pip install vllm==0.12.0 --torch-backend=auto
$ uv pip install vllm-omni
2100+ stars · Official release!
vLLM-Omni System Walkthrough
vLLM-Omni Team
Goal of the walkthrough
Multi-modality models
- Backbone: AR + DiT (DiT main). Models: Qwen-Image. Tasks: t2i, t2v, i2i, ...
- Backbone: AR (main) + DiT. Models: BAGEL, Hunyuan Image 3.0. Tasks: t2i, i2i, i2t, ...
- Backbone: AR + DiT. Models: Qwen-Omni, Ming-Omni. Tasks: any-to-text+audio

[Diagram: Qwen-Omni architecture. Visual Encoder, Audio Encoder, and Text Tokenizer feed the AR decoder (Thinker); its output drives the AR decoder (Talker), followed by Code2Wav.]
Multi-modality models: AR/DiT comparison
 | AR | DiT |
Use cases | Text generation | Multi-modality generation |
Generation process | Token-by-token, KV-cache based | Diffusion steps |
Bottleneck | Prefill: compute bound; decode: memory bound | Compute bound |
Seq length | Varied | Fixed |
Attention mask | Causal mask | Full mask |
Parallelism | TP/DP/EP/PP/CP/SP | TP/EP/USP/CFG |
Main architecture of vLLM-Omni
18
imported
modified
new
vLLM-Omni
AR
EntryPoints
APIServer
Omni/AsyncOmni
OmniStage
Model/Layer/Ops
OmniConnector(E/P/D/G)
LLMEngine
Executor
ModelRunner
Worker
Cache Engine
Scheduler
Diffusion
Worker
Pipeline
Scheduler
DiffusionEngine
Main component:
Interface Design
Developer Interface (vllm/engine):
- OmniStage (vllm_omni/entrypoints/omni_stage.py), synchronous: def add_request(), def abort_request(), def step(); for user custom servers
- AsyncOmni (vllm_omni/entrypoints/async_omni.py), asynchronous: async def generate(), async def abort(), plus a background engine loop

End-user Interface (vllm_omni/entrypoints):
- Omni (vllm_omni/entrypoints/omni.py): batched inference
- openai_api_server (vllm_omni/entrypoints/openai/api_server.py): OpenAI-compatible API server

[Diagram: Omni and AsyncOmni sit on top of OmniStage/StageWorker, which drive the LLM Engine and the Diffusion Engine.]
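A minimal usage sketch of the asynchronous developer interface. The slide names async def generate() and async def abort() plus a background engine loop; the constructor arguments and the streaming iteration below are assumptions, not the documented API:

```python
import asyncio

from vllm_omni.entrypoints.async_omni import AsyncOmni

async def main():
    # Constructor mirrors the offline Omni example; exact kwargs are an assumption.
    omni = AsyncOmni(model="Qwen/Qwen3-Omni-30B-A3B-Instruct")
    # The background engine loop drives execution; we only await streamed outputs.
    async for output in omni.generate({"prompt": "Describe this clip."}):
        print(output)

asyncio.run(main())
```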
Autoregressive (AR) Module Design

The AR module of vLLM-Omni extends vLLM's core components to support autoregressive generation stages:

[Diagram; legend: imported / modified / new]
- EngineCore → OmniARScheduler (extends the vLLM Scheduler): schedule() emits OmniNewRequestData in the scheduler_output, carrying the request plus additional_information and prompt_embedding
- Executor → GPUARWorker (extends GPUWorker) → GPUARModelRunner (extends GPUModelRunner): consumes the additional_information and prompt_embedding payloads in execute_model and extracts multimodal outputs (pooler_output) via extract_multimodal_output
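A sketch of the payload the extended scheduler hands to the worker, using only the field names shown on this slide; the dataclass shape and the request_id field are illustrative assumptions, not vLLM-Omni's actual definition:

```python
from dataclasses import dataclass, field
from typing import Any, Optional

import torch

@dataclass
class OmniNewRequestData:
    # The request plus the two extra payloads named on the slide.
    request_id: str                                  # hypothetical identifier
    additional_information: dict[str, Any] = field(default_factory=dict)
    # e.g. Thinker hidden states passed to the Talker stage as prompt embeddings.
    prompt_embedding: Optional[torch.Tensor] = None
```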
Config YAML for Qwen3-Omni
# OmniStage_0: Thinker
- stage_id: 0
  runtime:
    devices: "0,1"
  engine_args:
    model_stage: thinker
    model_arch: …ModelClass
    gpu_memory_utilization: 0.6
  engine_output_type: latent
  final_output: true
  final_output_type: text

# OmniStage_1: Talker
- stage_id: 1
  runtime:
    devices: "1"
  engine_args:
    model_stage: talker
    model_arch: …ModelClass
    gpu_memory_utilization: 0.3
  engine_output_type: latent
  engine_input_source: [0]

# OmniStage_2: Code2wav
- stage_id: 2
  runtime:
    devices: "0"
  engine_args:
    model_stage: code2wav
    model_arch: …ModelClass
    gpu_memory_utilization: 0.1
  engine_input_source: [1]
  final_output: true
  final_output_type: audio
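To make the routing semantics concrete: per engine_input_source, stage 1 consumes stage 0's latents and stage 2 consumes stage 1's. Below is a hedged Python sketch of how a driver could resolve that wiring; the run_stage stub and the list-of-stages file layout are illustrative assumptions, not vLLM-Omni internals:

```python
import yaml

def run_stage(stage_cfg, inputs):
    """Placeholder for launching one OmniStage engine; real dispatch lives in vLLM-Omni."""
    raise NotImplementedError

def run_pipeline(config_path: str, request):
    """Feed each stage from the outputs of its engine_input_source stages."""
    with open(config_path) as f:
        stages = yaml.safe_load(f)
    outputs, finals = {}, {}
    for stage in stages:  # assumes stages appear in dependency order
        sources = stage.get("engine_input_source", [])
        # Stage 0 sees the raw request; later stages consume upstream latents.
        inputs = request if not sources else [outputs[s] for s in sources]
        outputs[stage["stage_id"]] = run_stage(stage, inputs)
        if stage.get("final_output"):
            finals[stage["final_output_type"]] = outputs[stage["stage_id"]]
    return finals  # e.g. {"text": ..., "audio": ...} for Qwen3-Omni
```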
Natively Disaggregated Serving
[Diagram: Omni (global scheduler) dispatches to OmniStage-1, OmniStage-2, …, OmniStage-K (with N_1, N_2, …, N_k instances). Each stage runs its own Scheduler, Worker, ModelRunner, and OmniConnector; a transfer engine with a memory pool moves intermediate tensors between stages. Flows shown: metadata flow, D2H2D flow, D2D flow.]

When requests arrive. Example: Qwen3-Omni: the Visual Encoder, Audio Encoder, and Text Tokenizer feed the AR decoder (Thinker); the AR decoder (Talker) and Code2Wav then produce the audio output, with each stage running as its own OmniStage.
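A hedged sketch of the transfer choice implied by the two flows on this slide: device-to-device (D2D) when stages share a node, device-to-host-to-device (D2H2D) otherwise. The class and method names are illustrative assumptions, not the real OmniConnector API:

```python
import torch

class OmniConnectorSketch:
    """Illustrative stage-to-stage tensor handoff; not the actual OmniConnector."""

    def __init__(self, same_node: bool):
        self.same_node = same_node

    def send(self, tensor: torch.Tensor, dst_device: torch.device) -> torch.Tensor:
        if self.same_node:
            # D2D flow: direct GPU-to-GPU copy (e.g. over NVLink/PCIe).
            return tensor.to(dst_device, non_blocking=True)
        # D2H2D flow: stage through pinned host memory, then to the remote device.
        host = tensor.to("cpu").pin_memory()
        return host.to(dst_device, non_blocking=True)
```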
Pipeline Async Streaming Across Stages
[Benchmark chart: pipeline async streaming vs. the Transformers baseline. Dataset: Seed TTS, 100 samples.]
Async Chunked Prefill & Streaming Input/Output
AS-IS: E2E generation
TO-BE: Streaming generation
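A minimal asyncio sketch of the TO-BE streaming pattern: the downstream stage (Talker) starts consuming upstream (Thinker) chunks as soon as they arrive, instead of waiting for the full end-to-end generation. All names here are illustrative, not vLLM-Omni APIs:

```python
import asyncio

async def thinker(queue: asyncio.Queue):
    # Produce text/latent chunks incrementally instead of one E2E result.
    for chunk in ["Hello", " world", "!"]:
        await queue.put(chunk)
        await asyncio.sleep(0.1)  # stand-in for per-chunk decode time
    await queue.put(None)  # end-of-stream sentinel

async def talker(queue: asyncio.Queue):
    # Begin synthesizing audio per chunk while the Thinker is still decoding.
    while (chunk := await queue.get()) is not None:
        print(f"synthesizing audio for {chunk!r}")

async def main():
    q: asyncio.Queue = asyncio.Queue()
    await asyncio.gather(thinker(q), talker(q))

asyncio.run(main())
```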
Diffusion Module Design

Diffusion Core:
- DiffusionEngine → DiffusionWorker, Scheduler → Diffusion Pipeline
- Pipeline: Prompt Encode → (embed) → N-step Sampling with DiT → (latent) → VAE Decode

Acceleration Components (legend: vllm import / native / third-party):
- Attention Backend: FA, SAGE, MindIE
- Cache Backend: CacheDiT, TeaCache
- Parallelism: USP, TP/EP, CFG
- Fused Ops (coming soon)
- Quantization (coming soon)
- Timestep Distillation (coming soon)
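A schematic of the three pipeline phases named above (Prompt Encode → N-step Sampling → VAE Decode). The function bodies are dummy placeholders so the sketch runs standalone; they are not vLLM-Omni's pipeline code:

```python
import torch

# Placeholder components; the real modules live in vLLM-Omni's pipeline classes.
def encode_prompt(prompt: str) -> torch.Tensor:
    return torch.zeros(1, 77, 768)  # dummy text embedding

def dit(latent, embed, t) -> torch.Tensor:
    return torch.zeros_like(latent)  # dummy denoising prediction

def vae_decode(latent: torch.Tensor) -> torch.Tensor:
    return latent  # dummy decode to pixels

def diffusion_pipeline(prompt: str, num_steps: int = 50) -> torch.Tensor:
    """Prompt Encode -> N-step Sampling (DiT) -> VAE Decode, per the slide."""
    embed = encode_prompt(prompt)
    latent = torch.randn(1, 16, 128, 128)  # initial noise; shape is arbitrary
    for t in reversed(range(num_steps)):
        latent = latent - dit(latent, embed, t)  # one denoising step
    return vae_decode(latent)
```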
When requests arrive
Request with a multi-modal prompt, e.g. "Make it play guitar":
1. Omni (vllm_omni/entrypoints/omni.py)
2. OmniDiffusion (vllm_omni/diffusion/omni_diffusion.py)
3. DiffusionEngine (vllm_omni/diffusion/diffusion_engine.py)
4. Scheduler (vllm_omni/diffusion/schedule.py)
5. Worker (vllm_omni/diffusion/worker/gpu_worker.py)
6. Worker.pipeline (vllm_omni/diffusion/models/pipeline_{model}_{task}.py)
7. Postprocess → generated image/video
Provided Models with Acceleration Support
Model | TeaCache | Cache-DiT | Ulysses-SP | Ring-Attention |
Qwen-Image/Qwen-Image-2512 | ✅ | ✅ | ✅ | ✅ |
Qwen-Image-Edit/Qwen-Image-Edit-2509 | ✅ | ✅ | ✅ | ✅ |
Qwen-Image-Layered | ❌ | ✅ | ✅ | ✅ |
Z-Image | ❌ | ✅ | ❌ | ❌ |
LongCat-Image | ❌ | ✅ | ❌ | ❌ |
LongCat-Image-Edit | ❌ | ✅ | ❌ | ❌ |
Ovis-Image | ❌ | ✅ | ❌ | ❌ |
Stable-Diffusion3.5 | ❌ | ✅ | ❌ | ❌ |
Wan2.2 | ❌ | ✅ | ❌ | ❌ |
Summary
vLLM-Omni Roadmap
vLLM-Omni Team
vLLM-Omni v0.12.0rc1 Release Notes
187 commits · 45 contributors · 34 new contributors
Highlights: Diffusion Engine · Serving · Model · Stability & Hardware
Future Roadmap (Overview)
P0:
P1:
vLLM Networking Hour!
https://blog.vllm.ai/2025/11/30/vllm-omni.html