Towards Building Unified Multimodal Models

Speaker: Wenhu Chen

Multimodal Understanding

Liu et al., "Visual Instruction Tuning," NeurIPS 2023

Multimodal Generation

Rombach et al., "High-Resolution Image Synthesis with Latent Diffusion Models," CVPR 2022

Divergence of Understanding & Generation

Understanding:
  • Architecture: Transformer
  • Encoder: CLIP
  • Loss: Autoregressive

Generation:
  • Architecture: Attention + UNet
  • Encoder: CLIP + VAE
  • Loss: Diffusion

The Benefits of Unification

  • A unified interface for understanding and generation
    • Easy to serve and deploy
    • Easy to train and scale up

  • Mutual benefits
    • Better understanding of images and text can help generation
    • Better image generation can help the model understand images and text

1) Architecture Unification: DiT

Peebles and Xie, "Scalable Diffusion Models with Transformers," ICCV 2023
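
As a rough illustration of the idea (not the reference implementation), a DiT block is a standard Transformer block whose layer norms are modulated by the diffusion timestep embedding (adaLN). The PyTorch sketch below uses hypothetical names (DiTBlock, dim, n_heads) purely for illustration.

    # Minimal sketch of a DiT-style block with adaLN conditioning (illustrative only;
    # the original DiT additionally zero-initializes the gates, "adaLN-Zero").
    import torch
    import torch.nn as nn

    class DiTBlock(nn.Module):
        def __init__(self, dim: int, n_heads: int):
            super().__init__()
            self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
            self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
            self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
            self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            self.ada = nn.Linear(dim, 6 * dim)  # timestep embedding -> scales/shifts/gates

        def forward(self, x, cond):
            # x: (B, L, dim) patch tokens; cond: (B, dim) timestep (+ class/text) embedding.
            s1, b1, g1, s2, b2, g2 = self.ada(cond).chunk(6, dim=-1)
            h = self.norm1(x) * (1 + s1.unsqueeze(1)) + b1.unsqueeze(1)
            x = x + g1.unsqueeze(1) * self.attn(h, h, h, need_weights=False)[0]
            h = self.norm2(x) * (1 + s2.unsqueeze(1)) + b2.unsqueeze(1)
            return x + g2.unsqueeze(1) * self.mlp(h)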

1) The success of DiT

OpenAI, "Video generation models as world simulators," Technical Report 2024

2) Encoder Unification: CLIP & VAE

  • CLIP captures strong semantic features but fails to reconstruct visual details.

  • VAE captures strong visual details but fails to capture semantic features.

2) Multimodal Understanding

  • Understanding an image only requires semantic features

  • Text supervision (CLIP)
    • Align images to the text feature space (see the contrastive-loss sketch below)

  • No need for visual details
    • No spatial information
    • No identity information
    • No layout information

Radford et al., "Learning Transferable Visual Models from Natural Language Supervision," ICML 2021
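
To make "align images to the text feature space" concrete, here is a generic sketch of CLIP's symmetric contrastive (InfoNCE) objective. The function name and the assumption of pre-computed, paired image/text embeddings are illustrative rather than taken from any specific CLIP release.

    # Sketch of a CLIP-style contrastive loss over a batch of paired embeddings.
    import torch
    import torch.nn.functional as F

    def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
        img = F.normalize(img_emb, dim=-1)          # move to cosine-similarity space
        txt = F.normalize(txt_emb, dim=-1)
        logits = img @ txt.t() / temperature        # (B, B) image-text similarities
        targets = torch.arange(img.size(0), device=img.device)
        # Matched pairs lie on the diagonal; supervise both retrieval directions.
        return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2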

2) Multimodal Generation

  • Generation requires massive amounts of visual detail

  • Self-supervision
    • Reconstructing the original images (see the reconstruction sketch below)

  • Still needs high-level semantics
    • Object information
    • Relation information
    • Action information

van den Oord et al., "Neural Discrete Representation Learning," NeurIPS 2017
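
Below is a minimal sketch of the self-supervised reconstruction objective behind VAE/VQ-VAE-style encoders, assuming PyTorch. The tiny encoder/decoder and the pure MSE loss are placeholders; real latent encoders add a KL term (VAE) or a codebook/commitment term (VQ-VAE), often alongside perceptual and adversarial losses.

    # Self-supervised reconstruction: the target is the original image, no labels needed.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    encoder = nn.Sequential(                       # image -> spatial latent (keeps detail)
        nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(64, 4, 4, stride=2, padding=1))
    decoder = nn.Sequential(                       # spatial latent -> reconstructed image
        nn.ConvTranspose2d(4, 64, 4, stride=2, padding=1), nn.ReLU(),
        nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1))

    def reconstruction_loss(images):
        latents = encoder(images)
        recon = decoder(latents)
        return F.mse_loss(recon, images)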

3) Loss Unification (Parti, Chameleon)

Chameleon Team, "Chameleon: Mixed-Modal Early-Fusion Foundation Models," arXiv 2024
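
Parti and Chameleon unify the objective by turning images into discrete tokens (via a VQ tokenizer) and training ordinary next-token prediction over one mixed text/image sequence. The sketch below assumes a generic decoder-only transformer that returns per-position logits; the function name is hypothetical.

    # One cross-entropy loss covers both text and image positions.
    import torch
    import torch.nn.functional as F

    def autoregressive_loss(model, text_tokens, image_tokens):
        sequence = torch.cat([text_tokens, image_tokens], dim=1)   # (B, L) mixed token IDs
        logits = model(sequence[:, :-1])                           # (B, L-1, vocab)
        targets = sequence[:, 1:]                                  # predict the next token
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))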

The Current Status

  • VAE and CLIP still need to co-exist.

  • The loss function cannot be unified.

  • Designing unified models can help generation, but not understanding.
    • Better visual quality and instruction following.
    • Understanding benchmarks such as MMMU are not improving.

Towards Unified Models

  • We will cover two papers:
    • Qwen-Image (Alibaba Qwen)
    • UniVideo (TIGER-Lab & Kuaishou)

  • Major contributions:
    • MMDiT architecture
    • Seamless integration of an LLM with a diffusion model

  • Benefits:
    • Better performance on image generation tasks (editing, stylization, text rendering)
    • Maintained performance on image understanding tasks

Qwen-Image Technical Report

Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, Yuxiang Chen, Zecheng Tang, Zekai Zhang, Zhengyi Wang, An Yang, Bowen Yu, Chen Cheng, Dayiheng Liu, Deqing Li, Hang Zhang, Hao Meng, Hu Wei, Jingyuan Ni, Kai Chen, Kuan Cao, Liang Peng, Lin Qu, Minggang Wu, Peng Wang, Shuting Yu, Tingkun Wen, Wensen Feng, Xiaoxiao Xu, Yi Wang, Yichang Zhang, Yongqiang Zhu, Yujia Wu, Yuxuan Cai, Zenan Liu

Background

  • Current T2I models are lagging in
    • Complex prompt following
    • Text rendering
    • Precise image editing
    • Semantic coherence

  • How can we combine VLM features to enhance the model?
    • Integrate Qwen-VL into the DiT architecture

MMDiT Architecture
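
As background, an MMDiT block keeps separate projection weights for the text and image streams but lets both streams attend jointly over one concatenated sequence. The PyTorch sketch below is a generic illustration of that idea (norms, MLPs, and timestep modulation are omitted), not Qwen-Image's exact architecture.

    # Generic MMDiT-style joint attention: per-modality weights, shared attention.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MMDiTAttention(nn.Module):
        def __init__(self, dim: int, n_heads: int):
            super().__init__()
            self.n_heads = n_heads
            self.qkv_txt = nn.Linear(dim, 3 * dim)   # text-stream weights
            self.qkv_img = nn.Linear(dim, 3 * dim)   # image-stream weights
            self.proj_txt = nn.Linear(dim, dim)
            self.proj_img = nn.Linear(dim, dim)

        def forward(self, txt, img):
            B, Lt, D = txt.shape
            qt, kt, vt = self.qkv_txt(txt).chunk(3, dim=-1)
            qi, ki, vi = self.qkv_img(img).chunk(3, dim=-1)
            # Concatenate both modalities so every token attends to every other token.
            q, k, v = (torch.cat(p, dim=1) for p in ((qt, qi), (kt, ki), (vt, vi)))

            def split(x):  # (B, L, D) -> (B, heads, L, head_dim)
                return x.view(B, -1, self.n_heads, D // self.n_heads).transpose(1, 2)

            out = F.scaled_dot_product_attention(split(q), split(k), split(v))
            out = out.transpose(1, 2).reshape(B, -1, D)
            # Route outputs back to their own streams and output projections.
            return self.proj_txt(out[:, :Lt]), self.proj_img(out[:, Lt:])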

Pre-Training (Flow Matching)
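
For reference, a (rectified) flow-matching training step interpolates linearly between clean latents and Gaussian noise and regresses the constant velocity along that path. The sketch below assumes PyTorch and a generic velocity-prediction model; the uniform timestep sampling and argument names are simplifications, not the exact Qwen-Image recipe.

    # One flow-matching training step on image latents.
    import torch
    import torch.nn.functional as F

    def flow_matching_loss(model, latents, text_cond):
        B = latents.size(0)
        t = torch.rand(B, device=latents.device).view(B, 1, 1, 1)   # t ~ U(0, 1)
        noise = torch.randn_like(latents)
        x_t = (1 - t) * latents + t * noise        # straight path: data (t=0) -> noise (t=1)
        target_v = noise - latents                 # velocity is constant along the path
        pred_v = model(x_t, t.flatten(), text_cond)
        return F.mse_loss(pred_v, target_v)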

Instruction Tuning (Editing)

Experimental Results (GenEval)

  • GenEval focuses on object-centric T2I evaluation using compositional prompts with diverse attributes.

Experimental Results (CVTG-2K)

  • CVTG-2K evaluates text rendering capabilities.

Experimental Results (GEdit-Bench)

Experimental Results (GSO)

  • GSO evaluates the model’s capability to synthesize different views of the same object.
  • The authors measure similarity against the ground truth.

Research Questions

  • Qwen-Image only shows the model’s capabilities on seen tasks.
    • Trained on T2I and image editing
      • Evaluated on T2I and image editing
    • Can we train on a subset of tasks?
      • And evaluate on unseen tasks?

  • Qwen-Image is restricted to the image domain.
    • Can we build a video version of MMDiT?
    • Can we unify all the different video tasks?

UniVideo: Unified Understanding, Generation, and Editing for Videos

Cong Wei, Quande Liu, Zixuan Ye, Qiulin Wang, Xintao Wang, Pengfei Wan, Kun Gai, Wenhu Chen

Unified Multimodal Model

Existing Options for Connector

Ablation of Different Connectors

Training Setup (Stage 1)

Training Setup (Stage 2)

Training Setup (Stage 3)

  • Text-to-image/video
  • Image-to-video
  • In-context video generation
  • In-context video editing (addition, swap, deletion, style)
  • Image editing
  • Image/video stylization

Training Dataset

Data Synthesis

Dataset Quality Control

Experiment 1: Foundational Tasks

Experiment 2: Seen Video Tasks

  • In-context
    • Video generation
    • Video editing

Experiment 2: In-context VideoGen

  • We test UniVideo’s capability on in-context video generation

Experiment 2: In-context VideoEdit

  • We evaluate the model’s capabilities on in-context video editing

Experiment 3: Unseen Video->Video Editing

Experiment 3: Findings

  • Our model was never trained with video inputs for these tasks
    • The training task inputs are all images and text

  • UniVideo is still able to generalize to video editing
    • Here the task input is a human-provided video

Qualitative Results

Takeaways

  • Unified models are an important direction going forward

  • The joint representation can help the model generalize well to many unseen and complex tasks

  • We are still in the process of finding a more unified architecture

  • We have seen evidence that VLMs can help generation

  • We do not yet have evidence of generation helping VLMs
