1 of 33

Meetup

Jan, 2026

2 of 33

About us


Roger Wang, vLLM & vLLM-Omni Committer

Han Gao, vLLM-Omni Committer

Hongsheng Liu, vLLM-Omni Committer

3 of 33

vLLM-Omni Overview

vLLM-Omni Team

4 of 33


5 of 33

Our Goal

Build the fastest and easiest-to-use open-source Omni-Modality model inference & serving engine

6 of 33

Omni-Modality models


  • Omni-modality: text, image, video, and audio data processing
  • Non-autoregressive architectures: extends vLLM's AR support to Diffusion Transformers (DiT) and other parallel generation models
  • Heterogeneous outputs: from traditional text generation to multimodal outputs

7 of 33

vLLM-Omni API (1): Omni class


from vllm_omni import Omni

# Example prompt with multi-modal data.
inputs = {
    "prompt": prompt,
    "multi_modal_data": {"video": video_frames, "audio": audio_signal},
}

# Create an Omni engine with an HF model name.
omni = Omni(model="Qwen/Qwen3-Omni-30B-A3B-Instruct")

# Generate text and audio from the multi-modality inputs.
outputs = omni.generate(inputs)

A Python interface for offline batched inference with Qwen3-Omni / Qwen-Image

https://docs.vllm.ai/projects/vllm-omni/en/latest/user_guide/examples/offline_inference/

from vllm_omni import Omni

# Example text prompt.
inputs = "A cup of coffee on the table"

# Create an Omni engine with an HF model name.
omni = Omni(model="Qwen/Qwen-Image-2512")

# Generate an image from the text prompt.
outputs = omni.generate(inputs)

8 of 33

vLLM-Omni API (2): OpenAI-compatible server


A FastAPI-based server for online serving of Qwen3-Omni

# Server
$ vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091

# Client
$ curl -sS -X POST http://localhost:8091/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "Qwen/Qwen3-Omni-30B-A3B-Instruct",
          "messages": [{"role": "user", "content": "Why is this video funny?"}],
          "sampling_params_list": $sampling_params_list
        }'


https://docs.vllm.ai/projects/vllm-omni/en/latest/user_guide/examples/online_serving
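The same endpoint can also be driven from Python with the stock OpenAI client, since the server is OpenAI-compatible. A minimal sketch (any placeholder API key works locally; the sampling_params_list extension from the curl example is omitted here):

from openai import OpenAI

# Point the standard OpenAI client at the local vLLM-Omni server.
client = OpenAI(base_url="http://localhost:8091/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3-Omni-30B-A3B-Instruct",
    messages=[{"role": "user", "content": "Why is this video funny?"}],
)
print(response.choices[0].message.content)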

9 of 33

vLLM-Omni API (2): OpenAI-compatible server


# Server
$ vllm serve Qwen/Qwen-Image-2512 --omni --port 8091

# Client
$ curl -s http://localhost:8091/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "messages": [{"role": "user", "content": "a cup of coffee on the table"}],
          "extra_body": {"height": 1024, "width": 1024}
        }' | jq -r '.choices[0].message.content[0].image_url.url' | cut -d ',' -f2 | base64 -d > output.png


https://docs.vllm.ai/projects/vllm-omni/en/latest/user_guide/examples/online_serving

A FastAPI-based server for online serving of Qwen-Image

10 of 33

vLLM-Omni serving demo


A Gradio demo for Qwen3-Omni online serving

python gradio_demo.py --model Qwen/Qwen3-Omni-30B-A3B-Instruct --port 7861

Then open http://localhost:7861/ in your local browser to interact with the web UI.

11 of 33

Broad Model Support

vLLM-Omni supports 20+ popular omni and diffusion model architectures (and the list is growing rapidly)

Qwen-Omni, Qwen-Image, BAGEL, Z-Image, Wan, Ovis-Image, LongCat, SD3, Flux, Image/3D, StepFun, GLM, MiMo, plus many more

12 of 33

Contributors


Thanks to all the contributors who raised issues, participated in discussions, and submitted PRs!

13 of 33

vLLM-Omni GitHub Repo


$ uv pip install vllm==0.12.0 --torch-backend=auto

$ uv pip install vllm-omni

2100+ Stars · Official release!

14 of 33

vLLM-Omni System Walkthrough

vLLM-Omni Team

15 of 33

Goal of the walkthrough

  1. Understand how vLLM-Omni processes a multi-modal request and generates its multi-modal outputs.
  2. Learn where to modify the code if you would like to make a specific change or contribution.


16 of 33

Multi-modality models


  • Backbone: AR + DiT (DiT main); Models: Qwen-Image; Tasks: t2i, t2v, i2i...
  • Backbone: AR (main) + DiT; Models: BAGEL, Hunyuan Image 3.0; Tasks: t2i, i2i, i2t...
  • Backbone: AR + DiT; Models: Qwen-Omni / Ming-Omni; Tasks: any-to-text+audio

(Qwen3-Omni architecture diagram: Visual Encoder, Audio Encoder, and Text Tokenizer feed the AR decoder (Thinker); its output drives a second AR decoder (Talker), followed by Code2Wav for audio synthesis.)

17 of 33

Multi-modality models: AR/DiT comparison


  • Use cases: text generation (AR) vs. multi-modality generation (DiT)
  • Generation process: token-by-token with KV cache (AR) vs. diffusion steps (DiT)
  • Bottleneck: compute-bound prefill and memory-bound decode (AR) vs. compute-bound throughout (DiT)
  • Sequence length: varied (AR) vs. fixed (DiT)
  • Attention mask: causal (AR) vs. full (DiT)
  • Parallelism: TP/DP/EP/PP/CP/SP (AR) vs. TP/EP/USP/CFG (DiT)
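To make the contrast concrete, a toy sketch of the two generation loops (pseudocode-style Python with placeholder model and scheduler objects, not vLLM-Omni internals):

# Autoregressive: variable-length output, token-by-token, reuses a KV cache.
def ar_generate(model, prompt_tokens, max_new_tokens):
    tokens, kv_cache = list(prompt_tokens), None
    for _ in range(max_new_tokens):
        # Each decode step touches the whole KV cache: memory-bound.
        logits, kv_cache = model.forward(tokens[-1:], kv_cache)
        tokens.append(int(logits.argmax()))
    return tokens

# Diffusion (DiT): fixed-shape latent, a fixed number of denoising steps, full attention.
def dit_generate(model, condition, num_steps, latent_shape, sample_noise, scheduler_step):
    latent = sample_noise(latent_shape)
    for t in reversed(range(num_steps)):
        # Each step is a full forward pass over the latent: compute-bound.
        noise_pred = model.forward(latent, t, condition)
        latent = scheduler_step(latent, noise_pred, t)
    return latent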

18 of 33

Main architecture of vLLM-Omni


(Architecture diagram, components marked imported / modified / new relative to vLLM: EntryPoints (APIServer, Omni/AsyncOmni, OmniStage) sit on top of an AR path (LLMEngine, Scheduler, Executor, Worker, ModelRunner, Cache Engine) and a Diffusion path (DiffusionEngine, Scheduler, Worker, Pipeline), which share Model/Layer/Ops and are connected through OmniConnector (E/P/D/G).)

Main components:

  • Entrypoints: offline/online serving, plus the OmniStage abstraction for model stages (AR/DiT)
  • AR module: inherited from vLLM (CB/PA/Prefix Cache…) and adapted to omni-modality models
  • Diffusion module: implemented natively and optimized by acceleration components
  • Model/Layer/Ops: parallelism, quantization, attention, …
  • OmniConnector: natively supports E/P/D/G disaggregation

19 of 33

Interface Design


Developer Interface (vllm/engine):

  • OmniStage (vllm/entrypoints/omni_stage.py), synchronous, for user custom servers: def add_request(), def abort_request(), def step()
  • AsyncOmni (vllm_omni/entrypoints/async_omni.py), asynchronous: async def generate(), async def abort(), plus a background engine loop

End-user Interface (vllm_omni/entrypoints):

  • Omni (vllm_omni/entrypoints/omni.py): batched inference
  • openai_api_server (vllm_omni/entrypoints/openai/api_server.py): OpenAI-compatible API server

(Diagram: both end-user entrypoints drive per-stage StageWorkers, which in turn run the underlying LLM Engine or Diffusion Engine.)
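As a rough illustration of the asynchronous developer interface, a minimal sketch follows. It assumes AsyncOmni is importable from vllm_omni, accepts the same input dict as Omni, and that generate() yields outputs as an async generator; check async_omni.py for the actual signatures.

import asyncio

from vllm_omni import AsyncOmni  # assumed import path, mirroring Omni

async def main():
    omni = AsyncOmni(model="Qwen/Qwen3-Omni-30B-A3B-Instruct")
    # Assumed: the same input format as the synchronous Omni example earlier.
    inputs = {"prompt": "Why is this video funny?"}
    # Assumed: generate() is an async generator driven by the background engine loop.
    async for output in omni.generate(inputs):
        print(output)

asyncio.run(main())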

20 of 33

AutoRegressive(AR) Module Design


(AR flow diagram, components marked imported / modified / new: EngineCore calls OmniARScheduler.schedule(), which extends the vLLM scheduler; the scheduler_output carries OmniNewRequestData with per-request additional_information and prompt_embedding. The Executor dispatches to GPUARWorker / GPUARModelRunner (extending GPUWorker / GPUModelRunner), which unpack the additional_information and prompt_embedding payloads, run execute_model, and extract the multimodal output (pooler_output) for downstream stages.)

The AR module in vLLM-Omni handles autoregressive generation stages for:

  • Text generation
  • Chain-of-Thought (CoT)
  • Multimodal latent tokens (e.g., the Talker in Qwen3-Omni)

The AR module of vLLM-Omni extends vLLM's core components to support:

  • Multimodal inputs/outputs: Processing images, videos, and audio alongside text
  • Prompt Embedding: Passing pre-computed prompt embeddings between pipeline stages via serialized payloads
  • Additional information: Carrying per-request metadata (tensors, lists) through the pipeline and exposing per-request hidden representations for downstream stages
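To make the serialized-payload idea concrete, here is a toy sketch of packing a prompt embedding plus per-request metadata into bytes for a downstream stage. This is not the actual vLLM-Omni wire format, just an illustration of the concept.

import io

import torch

# Toy payload: a pre-computed prompt embedding plus per-request metadata.
prompt_embedding = torch.randn(1, 128, 4096)  # [batch, seq_len, hidden]
payload_dict = {
    "prompt_embedding": prompt_embedding,
    "additional_information": {"request_id": "req-0", "stage": 0},
}

# Serialize for the next stage (the real format used by vLLM-Omni may differ).
buf = io.BytesIO()
torch.save(payload_dict, buf)
payload_bytes = buf.getvalue()

# The receiving stage restores the tensors and metadata.
restored = torch.load(io.BytesIO(payload_bytes))
assert torch.equal(restored["prompt_embedding"], prompt_embedding)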

21 of 33

Config yaml for Qwen3-Omni


OmniStage_0: Thinker
  stage_id: 0
  runtime:
    devices: "0,1"
  engine_args:
    model_stage: thinker
    model_arch: …ModelClass
    gpu_memory_utilization: 0.6
    engine_output_type: latent
    final_output: true
    final_output_type: text

OmniStage_1: Talker
  stage_id: 1
  runtime:
    devices: "1"
  engine_args:
    model_stage: talker
    model_arch: …ModelClass
    gpu_memory_utilization: 0.3
    engine_output_type: latent
    engine_input_source: [0]

OmniStage_2: Code2wav
  stage_id: 2
  runtime:
    devices: "0"
  engine_args:
    model_stage: code2wav
    model_arch: …ModelClass
    gpu_memory_utilization: 0.1
    engine_input_source: [1]
    final_output: true
    final_output_type: audio
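To see how engine_input_source wires the stages into a pipeline, here is a small illustrative script. It assumes the three stage blocks above are saved as a YAML list of dicts in a file named qwen3_omni_stages.yaml, which may not match the real vLLM-Omni config schema.

import yaml

# Illustrative only: assumes a YAML list like
#   - stage_id: 0
#     engine_args: {model_stage: thinker, engine_output_type: latent, ...}
with open("qwen3_omni_stages.yaml") as f:
    stages = yaml.safe_load(f)

for stage in stages:
    args = stage.get("engine_args", {})
    sources = args.get("engine_input_source", ["user request"])
    print(f"stage {stage['stage_id']} ({args.get('model_stage')}): "
          f"inputs from {sources}, output type {args.get('engine_output_type', 'final')}")

# For the config above this prints a Thinker -> Talker -> Code2wav chain:
# stage 0 feeds stage 1 (engine_input_source: [0]); stage 1 feeds stage 2.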

22 of 33

Natively Disaggregated Serving

Data flows (diagram legend): metadata flow, D2H2D flow, D2D flow

  • Standardized Unified Abstraction: A generalized interface that handles heterogeneous data (Text, Image, Audio).

  • Control & Data Plane Decoupling: Metadata travels via lightweight control signals, while heavy payloads are offloaded to high-performance data planes.

  • Hybrid Backend Support: Native support for Shared Memory (SHM) and Mooncake for distributed transfers.

  • Disaggregated Multi-Modal Execution: Enables seamless communication between decoupled stages.

  • Multi-Instance Scaling: Supports multiple instances for each omni stage, enabling elastic deployment and efficient load distribution across distributed clusters.
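A conceptual sketch of the control/data plane split, using Python's multiprocessing.shared_memory as the data plane and a small dict as the control-plane message; this only illustrates the idea and is not the OmniConnector implementation.

from multiprocessing import shared_memory

import numpy as np

# Data plane: put the heavy payload (e.g. a latent tensor) into shared memory.
latent = np.random.randn(1, 77, 4096).astype(np.float32)
shm = shared_memory.SharedMemory(create=True, size=latent.nbytes)
np.ndarray(latent.shape, dtype=latent.dtype, buffer=shm.buf)[:] = latent

# Control plane: only lightweight metadata travels between stages.
control_msg = {
    "request_id": "req-0",
    "shm_name": shm.name,
    "shape": latent.shape,
    "dtype": str(latent.dtype),
}

# Receiving stage: attach to the named segment and view the payload zero-copy.
peer = shared_memory.SharedMemory(name=control_msg["shm_name"])
view = np.ndarray(control_msg["shape"], dtype=control_msg["dtype"], buffer=peer.buf)
assert np.allclose(view, latent)

peer.close()
shm.close()
shm.unlink()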

(Diagram: an Omni global scheduler coordinates N_1 … N_k instances of OmniStage-1 … OmniStage-K; each stage has its own Scheduler, Worker, ModelRunner, and OmniConnector, and payloads move through a shared memory pool / transfer engine.)

23 of 33

When requests arrive

Example: Qwen3-Omni

(Qwen3-Omni pipeline diagram: Visual Encoder, Audio Encoder, and Text Tokenizer feed the AR decoder (Thinker), whose output drives the AR decoder (Talker) and then Code2Wav.)

24 of 33

Pipeline Async Streaming for stages

(Benchmark chart vs. the Transformers baseline. Dataset: Seed TTS, 100 samples.)

25 of 33

Async Chunked Prefill & Streaming Input/Output


  • Chunked pipeline: asynchronous chunked computation and communication across stages
  • Streaming input/output: inputs are chunked by frames, so preprocessing, prefill, and decode start as soon as each chunk arrives
  • API server support: OpenAI v1/chat/completions with the "stream" argument (see the client sketch below)

AS-IS: E2E generation

TO-BE: Streaming generation
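A minimal streaming client sketch using the standard OpenAI Python client against the server started earlier; it assumes text deltas arrive in the usual chat-completions chunk format (streamed audio handling is not shown).

from openai import OpenAI

# Point the stock OpenAI client at the local vLLM-Omni server.
client = OpenAI(base_url="http://localhost:8091/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="Qwen/Qwen3-Omni-30B-A3B-Instruct",
    messages=[{"role": "user", "content": "Why is this video funny?"}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)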

26 of 33

Diffusion Module Design

Diffusion Core:

  • Natively implemented and optimized by configuring the acceleration layer
  • Encoders are disaggregated to run on the vLLM engine for higher throughput in large-scale deployments

Acceleration Components:

  • Cache backend: Cache-DiT, TeaCache…
  • Parallelism: TP/EP/USP/CFG…
  • Attention: interface abstraction for third-party integration (FA/SAGE/MindIE-SD…)
  • Quantization: FP4/FP8/AWQ…
  • Fused ops: custom and third-party integration
  • Timestep distillation: rCM, sCM…

(Diagram: DiffusionEngine → Scheduler → DiffusionWorker → Diffusion Pipeline, which runs Prompt Encode → embed → N-step DiT sampling → latent → VAE Decode. The pipeline is backed by the acceleration components: attention backends (FA, SAGE, MindIE), cache backends (CacheDiT, TeaCache), parallelism (USP, TP/EP, CFG), with fused ops, quantization, and timestep distillation coming soon. Components are a mix of vLLM imports, native code, and third-party integrations.)

27 of 33

When requests arrive


Request flow for a multi-modal prompt (e.g. "Make it play guitar"):

  1. Omni (vllm_omni/entrypoints/omni.py)
  2. OmniDiffusion (vllm_omni/diffusion/omni_diffusion.py): converts user inputs into an OmniDiffusionRequest and initializes the DiffusionEngine
  3. DiffusionEngine (vllm_omni/diffusion/diffusion_engine.py): preprocesses the request and adds it to the scheduler's waiting queue
  4. Scheduler (vllm_omni/diffusion/schedule.py): waiting and running request queues, FIFO
  5. Worker (vllm_omni/diffusion/worker/gpu_worker.py)
  6. Worker.pipeline (vllm_omni/diffusion/models/pipeline_{model}_{task}.py)
  7. Postprocess → generated image/video
28 of 33

Provided Models with Acceleration Support

Acceleration features per model: TeaCache, Cache-DiT, Ulysses-SP, Ring-Attention

  • X2I: LongCat-Image, LongCat-Image-Edit, Ovis-Image, Qwen-Image/Qwen-Image-2512, Qwen-Image-Edit/Qwen-Image-Edit-2509, Qwen-Image-Layered, Z-Image, Stable-Diffusion3.5
  • X2V: Wan2.2

29 of 33

Summary

  1. Support Omni-Modality model inference & serving
  2. Consistent and unified API interface with vLLM
  3. Native disaggregated deployment for different model stages
  4. Pipeline async streaming for multiple stage engines
  5. Native support for diffusion stage acceleration


30 of 33

vLLM-Omni Roadmap

vLLM-Omni Team

31 of 33


vLLM-Omni v0.12.0rc1

Release Notes

187 commits · 45 contributors · 34 new contributors

Diffusion Engine

  • Cache-DiT & TeaCache
  • SageAttention
  • USP, RingAttention
  • torch.compile

Serving

  • OpenAI image generation
  • OpenAI create speech
  • Modality control
  • Streaming output

Model

  • Qwen-Image-Edit series
  • Wan2.2
  • SD3 & Ovis-Image
  • LongCat-Image & BAGEL

Stability & Hardware

  • Torch Profiler
  • AMD ROCm
  • NPU

32 of 33

Future Roadmap (Overview)

P0:

  • Stabilize our core API and user interface
  • CI/CD

P1:

  • Wide model support (Omni & DiT…)
  • Performance optimizations (async sched / parallel / cache / sparse…)
  • Hardware support (CPU/GPU/xPU…)
  • Full disaggregation (OmniConnector & OmniRouter…)

33 of 33

vLLM Networking Hour!

https://blog.vllm.ai/2025/11/30/vllm-omni.html