Towards Building Unified Multimodal Models
Speaker: Wenhu Chen
1/20/26
Multimodal Understanding
Liu et al., "Visual Instruction Tuning," NeurIPS 2023
Multimodal Generation
Rombach et al., "High-Resolution Image Synthesis with Latent Diffusion Models," CVPR 2022
Divergence of Understanding & Generation
Understanding:
Architecture: Transformer
Encoder: CLIP
Loss: Autoregressive
Generation:
Architecture: Attention + U-Net
Encoder: CLIP + VAE
Loss: Diffusion
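To make the loss divergence concrete, here is a minimal sketch of the two objectives in plain Python. It assumes a next-token cross-entropy for the autoregressive side and a DDPM-style noise-prediction MSE for the diffusion side; the function names, the `predict_noise` callback, and the `alpha_bar` parameter are illustrative and do not come from the slides.

```python
import math
import random


def autoregressive_loss(logits, targets):
    """Mean next-token cross-entropy.

    logits: one list of unnormalized scores per position.
    targets: the ground-truth token id for each position.
    """
    total = 0.0
    for scores, target in zip(logits, targets):
        top = max(scores)  # subtract the max for numerical stability
        log_z = top + math.log(sum(math.exp(s - top) for s in scores))
        total += log_z - scores[target]
    return total / len(targets)


def diffusion_loss(x0, predict_noise, alpha_bar, rng):
    """Single-sample noise-prediction MSE (assumed DDPM-style objective).

    alpha_bar: cumulative signal level in (0, 1) for the sampled timestep.
    predict_noise: hypothetical model callback taking (x_t, alpha_bar).
    """
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [
        math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * n
        for x, n in zip(x0, noise)
    ]
    predicted = predict_noise(x_t, alpha_bar)
    return sum((p - n) ** 2 for p, n in zip(predicted, noise)) / len(x0)
```

The first loss scores discrete tokens one position at a time, while the second regresses a continuous noise vector, which is one reason the two model families historically used different training pipelines.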
The Benefits of Unification
1) Architecture Unification: DiT
Peebles and Xie, "Scalable Diffusion Models with Transformers," ICCV 2023
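A key step that lets a Transformer replace the U-Net is turning a spatial grid into a token sequence. The sketch below shows that patchify step under simplifying assumptions: a single-channel grid stored as a list of rows, whereas DiT applies the same idea to multi-channel VAE latents. The function name `patchify` is illustrative.

```python
def patchify(image, patch):
    """Split an H x W grid into non-overlapping patch x patch tokens.

    Each token is flattened row-major; tokens are ordered left to right,
    then top to bottom.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("height and width must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tokens.append([
                image[row][col]
                for row in range(top, top + patch)
                for col in range(left, left + patch)
            ])
    return tokens
```

For example, a 4 x 4 grid with patch size 2 yields 4 tokens of 4 values each; smaller patches give longer sequences and therefore more attention compute.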
1) The success of DiT
OpenAI, "Video generation models as world simulators," Technical Report 2024
2) Encoder Unification: CLIP & VAE
CLIP captures strong semantic features but fails to reconstruct visual details.
VAE captures strong visual details but fails to capture semantic features.
2) Multimodal Understanding
Radford et al., "Learning Transferable Visual Models from Natural Language Supervision," ICML 2021
2) Multimodal Generation
van den Oord et al., "Neural Discrete Representation Learning," NeurIPS 2017
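The core operation in this line of work is vector quantization: each continuous encoder output is replaced by the index of its nearest codebook entry, which gives images a discrete token vocabulary. The following is a minimal sketch of that lookup only; the straight-through gradient and the codebook and commitment losses from the paper are omitted, and the function names are illustrative.

```python
def quantize(vector, codebook):
    """Return (index, code) of the codebook entry nearest to vector (squared L2)."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    index = min(
        range(len(codebook)),
        key=lambda i: squared_distance(vector, codebook[i]),
    )
    return index, codebook[index]


def tokenize(vectors, codebook):
    """Map a sequence of encoder outputs to discrete token ids."""
    return [quantize(vector, codebook)[0] for vector in vectors]
```

Once images are expressed as token ids like this, they can be modeled with the same next-token objective used for text.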
3) Loss Unification (Parti, Chameleon)
Chameleon Team, "Chameleon: Mixed-Modal Early-Fusion Foundation Models," arXiv 2024
The Current Status
Towards Unified Model
Qwen-Image Technical Report
Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, Yuxiang Chen, Zecheng Tang, Zekai Zhang, Zhengyi Wang, An Yang, Bowen Yu, Chen Cheng, Dayiheng Liu, Deqing Li, Hang Zhang, Hao Meng, Hu Wei, Jingyuan Ni, Kai Chen, Kuan Cao, Liang Peng, Lin Qu, Minggang Wu, Peng Wang, Shuting Yu, Tingkun Wen, Wensen Feng, Xiaoxiao Xu, Yi Wang, Yichang Zhang, Yongqiang Zhu, Yujia Wu, Yuxuan Cai, Zenan Liu
Background
MMDiT Architecture
Pre-Training (Flow Matching)
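As a rough illustration of a flow-matching objective, the sketch below assumes the common linear-path (rectified-flow style) formulation: interpolate between a noise sample and a data sample, then regress the constant velocity between them. The time convention (t = 0 is noise, t = 1 is data), the function names, and the `predict_velocity` callback are assumptions for illustration; the report's exact parameterization may differ.

```python
def flow_matching_pair(data, noise, t):
    """Build one training pair on the linear path from noise to data.

    Returns x_t = (1 - t) * noise + t * data and the target velocity
    data - noise, which is constant along the path.
    """
    x_t = [(1.0 - t) * n + t * d for d, n in zip(data, noise)]
    velocity = [d - n for d, n in zip(data, noise)]
    return x_t, velocity


def flow_matching_loss(predict_velocity, data, noise, t):
    """MSE between the predicted and target velocity at time t.

    predict_velocity: hypothetical model callback taking (x_t, t).
    """
    x_t, target = flow_matching_pair(data, noise, t)
    predicted = predict_velocity(x_t, t)
    return sum((p - v) ** 2 for p, v in zip(predicted, target)) / len(data)
```

Under this convention, a perfect velocity field carries any x_t to the data sample in one Euler step of size 1 - t, which is why straight paths are attractive for few-step sampling.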
Instruction Tuning (Editing)
Experimental Results (GenEval)
Experimental Results (CVTG-2K)
Experimental Results (GEdit-Bench)
Experimental Results (GSO)
Research Question
UniVideo: Unified Understanding, Generation, and Editing for Videos
Cong Wei, Quande Liu, Zixuan Ye, Qiulin Wang, Xintao Wang, Pengfei Wan, Kun Gai, Wenhu Chen
Unified Multimodal Model
Existing Options for Connector
Ours
Ablation of Different Connectors
Training Setup (Stage 1)
Training Setup (Stage 2)
Training Setup (Stage 3)
Training Dataset
Data Synthesis
Dataset Quality Control
Experiment 1: Foundational Tasks
Experiment 2: Seen Video Tasks
Experiment 2: In-context VideoGen
Experiment 2: In-context VideoEdit
Experiment 3: Unseen Video->Video Editing
Experiment 3: Findings
Qualitative Results
Takeaways