VideoPoet:
A Large Language Model for Zero-Shot Video Generation
Video Poets
Dan Kondratyuk^
Lijun Yu*
Xiuye Gu
José Lezama
Jonathan Huang*^
Grant Schindler
Rachel Hornung
Vighnesh Birodkar^
Jimmy Yan
Ming-Chang Chiu^
Krishna Somandepalli
Hassan Akbari^
Yair Alon
Yong Cheng
Josh Dillon^
Agrim Gupta^
Meera Hahn
Anja Hauth
David Hendon
Alonso Martinez^
David Minnen
Mikhail Sirotenko
Kihyuk Sohn^
Xuan Yang
Hartwig Adam
Ming-Hsuan Yang
Irfan Essa
Huisheng Wang
David A. Ross*
Bryan Seybold*
Lu Jiang*^
Equal contribution
* Corresponding authors
^ Work done while at Google
~10 Years of Video Generation
Space | Method | Model
Pixel | GAN | VGAN (16’), …
Pixel | Autoregressive | Video Transformer (20’), …
Pixel | Diffusion | VDM (22’), Imagen Video (22’), …
Latent (Discrete) | Autoregressive | VideoGPT (21’), CogVideo (22’), …
Latent (Discrete) | Masked | Phenaki (22’), MAGVIT (22’), MAGVIT-v2 (23’), …
Latent (Continuous) | Diffusion | VideoLDM (23’), VideoCrafter (23’), W.A.L.T (23’), Sora (24’), Veo (24’), Gen-3 (24’), and many more
VGAN (16’) examples (Golf, Baby): http://www.cs.columbia.edu/~vondrick/tinyvideo/
Latent diffusion has been dominating text to video
A skeleton drinking a glass of soda.
Is latent diffusion the only way to generate nice videos?
Is latent diffusion the only way to generate nice videos? Absolutely NOT!
Space | Method | Model
Pixel | GAN | VGAN (16’), …, DVD-GAN (19’), …
Pixel | Autoregressive | Video Transformer (20’), …
Pixel | Diffusion | VDM (22’), Imagen Video (22’), …
Image Token | Autoregressive | VideoGPT (21’), CogVideo (22’), …
Video Token | Masked | Phenaki (22’), MAGVIT (22’), MAGVIT-v2 (23’), …
Video Latent | Diffusion | VideoLDM (23’), VideoCrafter (23’), W.A.L.T (23’), Sora (24’), Veo (24’), Gen-3 (24’), and many more
Video Token | Autoregressive | VideoPoet (24’)
VideoPoet is an autoregressive LLM that synthesizes videos with high-fidelity motion and matching audio from a large variety of conditioning signals.
VideoPoet Framework
Modality-specific tokenizers
Raw signal
Token
➡️ Encode & Compress ➡️
⬅️ Decode & Decompress ⬅️
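The encode-and-compress / decode-and-decompress round trip can be illustrated with a toy vector-quantization tokenizer. This is a hypothetical sketch of the interface only, not the actual MAGVIT-v2 or SoundStream code: the codebook here is random, and `encode`/`decode` are simple nearest-neighbor lookups.

```python
import numpy as np

class ToyTokenizer:
    """Hypothetical sketch of a modality-specific tokenizer interface:
    encode compresses a raw signal into discrete token ids, decode inverts it."""

    def __init__(self, codebook_size=256, dim=4, seed=0):
        rng = np.random.default_rng(seed)
        # Random codebook of `codebook_size` vectors of dimension `dim`.
        self.codebook = rng.normal(size=(codebook_size, dim))

    def encode(self, signal):
        # signal: (n, dim) array of raw feature patches -> (n,) token ids,
        # via nearest-neighbor lookup in the codebook.
        d = ((signal[:, None, :] - self.codebook[None, :, :]) ** 2).sum(-1)
        return d.argmin(axis=1)

    def decode(self, tokens):
        # token ids -> reconstructed (n, dim) signal (codebook vectors).
        return self.codebook[tokens]

tok = ToyTokenizer()
x = np.random.default_rng(1).normal(size=(8, 4))
ids = tok.encode(x)      # the discrete "language" the LLM consumes
x_hat = tok.decode(ids)  # lossy reconstruction of the raw signal
```

Once every modality is mapped into token ids like this, a single transformer can model all of them with one vocabulary.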
MAGVIT-v2 Video Tokenizer
Yu et al. Language Model Beats Diffusion – Tokenizer is Key to Visual Generation. In ICLR 2024.
Defining the visual “language”
SoundStream Audio Tokenizer
Zeghidour et al. SoundStream: An End-to-End Neural Audio Codec. TASLP 2021.
Defining the audio “language”
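The core idea behind SoundStream's quantizer is residual vector quantization (RVQ): each stage quantizes the residual left by the previous stage, giving a coarse-to-fine code. Below is a minimal sketch with toy random codebooks, not the trained SoundStream model.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual vector quantization sketch: stage k quantizes the residual
    remaining after stages 1..k-1, returning one id array per stage."""
    ids, residual = [], x.copy()
    for cb in codebooks:
        d = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
        idx = d.argmin(axis=1)      # nearest codebook entry per frame
        ids.append(idx)
        residual = residual - cb[idx]
    return ids

def rvq_decode(ids, codebooks):
    # Sum the selected vectors from every stage.
    return sum(cb[idx] for cb, idx in zip(codebooks, ids))

rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(16, 8)) for _ in range(4)]  # 4 toy stages
x = rng.normal(size=(10, 8))                              # 10 toy audio frames
ids = rvq_encode(x, codebooks)
x_hat = rvq_decode(ids, codebooks)
```

Using more stages refines the reconstruction, which is what lets a neural codec trade bitrate against quality.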
VideoPoet Framework
Out-of-the-box LLM transformer on discrete token sequences
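The decoding loop of such a decoder-only model over discrete token sequences can be sketched as follows; `toy_logits` is a stand-in for the transformer, and the token values are arbitrary, but the greedy next-token loop is the standard autoregressive recipe.

```python
import numpy as np

def generate(logits_fn, prefix, n_new):
    """Minimal autoregressive decoding loop (greedy): the model extends a
    token sequence one token at a time, whatever modality the tokens encode."""
    seq = list(prefix)
    for _ in range(n_new):
        logits = logits_fn(seq)          # scores over the vocab for next token
        seq.append(int(np.argmax(logits)))
    return seq

VOCAB = 10

def toy_logits(seq):
    # Toy stand-in for the transformer: always favors (last_token + 1) % VOCAB.
    scores = np.zeros(VOCAB)
    scores[(seq[-1] + 1) % VOCAB] = 1.0
    return scores

print(generate(toy_logits, prefix=[3], n_new=4))
# -> [3, 4, 5, 6, 7]
```

Because the loop never inspects what the tokens mean, the same model can continue video, audio, or mixed sequences.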
Training Data
Mixture of pre-existing sources and formats, in two training phases
Source | 🎥 Video | 🔊 Audio | 🌁 Image | 🔠 Text | Sample Count | 🔥 Pretrain | 🔧 Adapt T2V |
A | ✅ | ✅ | | | ~170M | ✅ | |
B | ✅ | ◑ | | ◑ | ~50M | ✅ | |
C | ✅ | | | ✅ | ~50M | ✅ | ✅ |
D | | | ✅ | ✅ | ~1B images | ✅ | |
Training Tasks
Prefix ↓ \ Output → | Continue ▶️ | Video 🎥 | Audio 🔊 | 🎥🔊 | Image 🌁 | Style 🎨 |
🚫 Unconditional | | ✅ | ✅ | ✅ | ✅ | |
🎥 Video | ✅ | ✅➕➕ | ✅ | | | ✅ |
🔊 Audio | ✅ | ✅ | | | | |
🎥🔊 Video + Audio | ✅ | | | | | |
🌁 Image | | ✅ | | | | |
🔠 Text | | ✅ | | ✅ | ✅ | |
Self-supervised + Supervised
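One plausible way to pack such multi-task condition/output pairs into a single decoder-only token stream is with a task token followed by condition tokens and target tokens. The special-token names and layout below are illustrative assumptions, not the paper's exact format.

```python
# Hypothetical special tokens; real systems reserve ids outside the
# tokenizer vocabularies for markers like these.
SPECIAL = {"<bos>": 0, "<t2v>": 1, "<v2a>": 2, "<eos>": 3}

def pack_example(task, condition_tokens, target_tokens):
    """Build one training sequence plus a loss mask: in decoder-only
    training, loss is typically applied only on the target span."""
    seq = ([SPECIAL["<bos>"], SPECIAL[task]]
           + condition_tokens + target_tokens + [SPECIAL["<eos>"]])
    loss_mask = ([0] * (2 + len(condition_tokens))   # no loss on prefix
                 + [1] * len(target_tokens)          # loss on targets
                 + [1])                              # include <eos> in the loss
    return seq, loss_mask

seq, mask = pack_example("<t2v>", condition_tokens=[11, 12],
                         target_tokens=[21, 22, 23])
# seq  = [0, 1, 11, 12, 21, 22, 23, 3]
# mask = [0, 0, 0, 0, 1, 1, 1, 1]
```

Mixing many such task layouts in one training stream is what lets a single model serve all the cells of the table above.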
MAGVIT Super-Resolution
Masked transformer with non-autoregressive decoding
Yu et al. MAGVIT: Masked Generative Video Transformer. In CVPR 2023.
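Non-autoregressive masked decoding of this kind can be sketched as follows: start from a fully masked token grid and, at each of a few parallel steps, commit the most confident predictions. `toy_score` is a stand-in for the masked transformer, and the schedule is a simplified assumption of the MaskGIT-style procedure.

```python
import numpy as np

def masked_decode(score_fn, length, steps=4, mask_id=-1, seed=0):
    """Sketch of non-autoregressive masked decoding: each step predicts all
    masked positions in parallel and keeps only the most confident ones."""
    rng = np.random.default_rng(seed)
    tokens = np.full(length, mask_id)
    for step in range(steps):
        probs, preds = score_fn(tokens, rng)  # per-position confidence + guess
        # Only still-masked positions are candidates for unmasking.
        probs = np.where(tokens == mask_id, probs, -np.inf)
        # Linear schedule: after step k, about length*(k+1)/steps tokens are set.
        n_keep = int(np.ceil(length * (step + 1) / steps)) - (tokens != mask_id).sum()
        for i in np.argsort(-probs)[:max(n_keep, 1)]:
            tokens[i] = preds[i]
    return tokens

def toy_score(tokens, rng):
    # Toy stand-in: "predict" each position's own index, random confidence.
    return rng.random(len(tokens)), np.arange(len(tokens))

out = masked_decode(toy_score, length=8)
# All 8 positions are decoded in only 4 parallel steps.
```

Decoding many tokens per step is what makes this much faster than token-by-token autoregression, which matters for a super-resolution stage that must emit large high-resolution token grids.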
Automatic Benchmarks
Zero-shot text-to-video generation comparison with state-of-the-art
Human Evaluations
Zero-shot text-to-video generation comparison with prior and concurrent works
🔠 → 🎥 Text to Video
🌁 → 🎥 Image to Video
(and 3D)
🎥 → 🎨 Video Stylization
(and 3D)
Video Editing
A kite surfer riding a shark.
Applied to text-to-video generations, without the text prompt
🎥 → 🔊 Video to Audio
Future Research Directions
Context:
This is the logo of ICML
This is Lijun
Query:
Generate a video of Lijun dancing like this wearing a hoodie with the ICML logo.
This is a new dance
Future Research Directions
Query: Can you show me how to tie this shoe with a single hand? 🥾
Summary
VideoPoet demonstrates that autoregressive LLMs are a distinct, competitive approach to video generation
Thank you.