1 of 33

VideoPoet:

A Large Language Model for Zero-Shot Video Generation

ICML 2024 Best Paper

Presenter: Lijun Yu 于力军

https://sites.research.google/videopoet/

2 of 33

Video Poets

Dan Kondratyuk^

Lijun Yu*

Xiuye Gu

José Lezama

Jonathan Huang*^

Grant Schindler

Rachel Hornung

Vighnesh Birodkar^

Jimmy Yan

Ming-Chang Chiu^

Krishna Somandepalli

Hassan Akbari^

Yair Alon

Yong Cheng

Josh Dillon^

Agrim Gupta^

Meera Hahn

Anja Hauth

David Hendon

Alonso Martinez^

David Minnen

Mikhail Sirotenko

Kihyuk Sohn^

Xuan Yang

Hartwig Adam

Ming-Hsuan Yang

Irfan Essa

Huisheng Wang

David A. Ross*

Bryan Seybold*

Lu Jiang*^

Equal contribution

* Corresponding authors

^ Work done while at Google

3 of 33

~10 Years of Video Generation

Space | Method | Model
Pixel | GAN | VGAN (16’), …
Pixel | Autoregressive | Video Transformer (20’), …
Pixel | Diffusion | VDM (22’), Imagen Video (22’), …
Latent (discrete) | Autoregressive | VideoGPT (21’), CogVideo (22’), …
Latent (discrete) | Masked | Phenaki (22’), MAGVIT (22’), MAGVIT-v2 (23’), …
Latent (continuous) | Diffusion | VideoLDM (23’), VideoCrafter (23’), W.A.L.T (23’), Sora (24’), Veo (24’), Gen-3 (24’), and many more

VideoPoet (24’)

4 of 33

~10 Years of Video Generation

Space | Method | Model
Pixel | GAN | VGAN (16’), …, DVD-GAN (19’), …
Pixel | Autoregressive | Video Transformer (20’), …
Pixel | Diffusion | VDM (22’), Imagen Video (22’), …
Latent (discrete) | Autoregressive | VideoGPT (21’), CogVideo (22’), …
Latent (discrete) | Masked | Phenaki (22’), MAGVIT (22’), MAGVIT-v2 (23’), …
Latent (continuous) | Diffusion | VideoLDM (23’), VideoCrafter (23’), W.A.L.T (23’), Sora (24’), Veo (24’), Gen-3 (24’), and many more

VideoPoet (24’)

5 of 33

~10 Years of Video Generation

Space | Method | Model
Pixel | GAN | VGAN (16’), …, DVD-GAN (19’), …
Pixel | Autoregressive | Video Transformer (20’), …
Pixel | Diffusion | VDM (22’), Imagen Video (22’), …
Image Token | Autoregressive | VideoGPT (21’), CogVideo (22’), …
Video Token | Masked | Phenaki (22’), MAGVIT (22’), MAGVIT-v2 (23’), …
Latent (continuous) | Diffusion | VideoLDM (23’), VideoCrafter (23’), W.A.L.T (23’), Sora (24’), Veo (24’), Gen-3 (24’), and many more

6 of 33

~10 Years of Video Generation

Space | Method | Model
Pixel | GAN | VGAN (16’), …, DVD-GAN (19’), …
Pixel | Autoregressive | Video Transformer (20’), …
Pixel | Diffusion | VDM (22’), Imagen Video (22’), …
Image Token | Autoregressive | VideoGPT (21’), CogVideo (22’), …
Video Token | Masked | Phenaki (22’), MAGVIT (22’), MAGVIT-v2 (23’), …
Video Latent | Diffusion | VideoLDM (23’), VideoCrafter (23’), W.A.L.T (23’), Sora (24’), Veo (24’), Gen-3 (24’), and many more

Latent diffusion has been dominating text-to-video

7 of 33

~10 Years of Video Generation

Space | Method | Model
Pixel | GAN | VGAN (16’), …, DVD-GAN (19’), …
Pixel | Autoregressive | Video Transformer (20’), …
Pixel | Diffusion | VDM (22’), Imagen Video (22’), …
Image Token | Autoregressive | VideoGPT (21’), CogVideo (22’), …
Video Token | Masked | Phenaki (22’), MAGVIT (22’), MAGVIT-v2 (23’), …
Video Latent | Diffusion | VideoLDM (23’), VideoCrafter (23’), W.A.L.T (23’), Sora (24’), Veo (24’), Gen-3 (24’), and many more

A skeleton drinking a glass of soda.

Latent diffusion has been dominating text-to-video

Is latent diffusion the only way to generate nice videos?

8 of 33

~10 Years of Video Generation

A skeleton drinking a glass of soda.

Is latent diffusion the only way to generate nice videos? Absolutely NOT!

Space | Method | Model
Pixel | GAN | VGAN (16’), …, DVD-GAN (19’), …
Pixel | Autoregressive | Video Transformer (20’), …
Pixel | Diffusion | VDM (22’), Imagen Video (22’), …
Image Token | Autoregressive | VideoGPT (21’), CogVideo (22’), …
Video Token | Masked | Phenaki (22’), MAGVIT (22’), MAGVIT-v2 (23’), …
Video Latent | Diffusion | VideoLDM (23’), VideoCrafter (23’), W.A.L.T (23’), Sora (24’), Veo (24’), Gen-3 (24’), and many more
Video Token | Autoregressive | VideoPoet (24’)

9 of 33

10 of 33

VideoPoet is an autoregressive LLM that synthesizes videos with high-fidelity motion and matching audio, from a large variety of conditioning signals

11 of 33

VideoPoet Framework

12 of 33

VideoPoet Framework

Modality-specific tokenizers

Raw signal ➡️ Encode & Compress ➡️ Token

Raw signal ⬅️ Decode & Decompress ⬅️ Token

13 of 33

MAGVIT-v2 Video Tokenizer

Yu et al. Language Model Beats Diffusion – Tokenizer is Key to Visual Generation. In ICLR 2024.

Defining the visual “language”

  • Quantized VAE w/ temporally causal 3D CNN (sketched below)
    • Image as a prefix of video for joint training
    • Seamless support for long videos
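A minimal sketch of the temporally causal 3D convolution idea behind this design: padding only toward the past means frame t never depends on future frames, so a single image is just a 1-frame video. Shapes and hyperparameters here are illustrative, not MAGVIT-v2's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv3d(nn.Module):
    """3D convolution that is causal along the time axis only."""
    def __init__(self, in_ch, out_ch, kernel=(3, 3, 3)):
        super().__init__()
        kt, kh, kw = kernel
        self.time_pad = kt - 1  # pad only toward the past
        self.conv = nn.Conv3d(in_ch, out_ch, kernel,
                              padding=(0, kh // 2, kw // 2))

    def forward(self, x):  # x: (B, C, T, H, W)
        # Left-pad time so output frame t sees only input frames <= t.
        x = F.pad(x, (0, 0, 0, 0, self.time_pad, 0))
        return self.conv(x)

# An image is a 1-frame "video": its code is identical whether or not
# later frames follow, which is what enables joint image+video training.
video = torch.randn(1, 3, 17, 128, 128)
layer = CausalConv3d(3, 16)
assert torch.allclose(layer(video)[:, :, :1],
                      layer(video[:, :, :1]), atol=1e-5)
```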

14 of 33

MAGVIT-v2 Video Tokenizer

  • Scalable quantizer w/ a large 2^18 vocabulary for higher prediction bandwidth (see the LFQ sketch below)
  • Better compression than VVC with reconstructive and adversarial training

Yu et al. Language Model Beats Diffusion – Tokenizer is Key to Visual Generation. In ICLR 2024.

Defining the visual “language”

  • Quantized VAE w/ temporally causal 3D CNN
    • Image as a prefix of video for joint training
    • Seamless support for long videos
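A hedged sketch of the lookup-free quantization (LFQ) described in the MAGVIT-v2 paper: each 18-dimensional latent is binarized per dimension by its sign, and the bits directly form a token id in a 2^18 = 262,144 vocabulary, with no codebook lookup. Training losses and the straight-through estimator are omitted; this shows the forward mapping only.

```python
import torch

NUM_BITS = 18  # vocabulary size = 2**18 = 262,144

def lfq_tokenize(z):
    """z: (..., NUM_BITS) continuous latents -> integer token ids."""
    bits = (z > 0).long()                      # per-dimension sign -> {0, 1}
    weights = 2 ** torch.arange(NUM_BITS)
    return (bits * weights).sum(-1)            # id in [0, 2**18)

def lfq_detokenize(ids):
    """Integer token ids -> {-1, +1} latent corners."""
    weights = 2 ** torch.arange(NUM_BITS)
    bits = (ids.unsqueeze(-1) // weights) % 2
    return bits.float() * 2 - 1

ids = lfq_tokenize(torch.randn(4, 4, NUM_BITS))  # e.g. a 4x4 latent grid
assert int(ids.max()) < 2 ** NUM_BITS
```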

15 of 33

SoundStream Audio Tokenizer

  • Residual vector quantizer (sketched below)
  • Better compression than Opus

Zeghidour et al. SoundStream: An End-to-End Neural Audio Codec. TASLP 2021.

Defining the audio “language”

  • Quantized VAE w/ causal 1D CNN
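A minimal sketch of residual vector quantization as used in SoundStream: each stage quantizes the residual left by the previous stage, so several small codebooks compose into one fine-grained code. Codebook sizes and the stage count below are illustrative.

```python
import torch

def rvq_encode(x, codebooks):
    """x: (N, D) latent frames; codebooks: list of (K, D) tensors."""
    residual, indices = x, []
    for cb in codebooks:
        idx = torch.cdist(residual, cb).argmin(dim=1)  # nearest codeword
        indices.append(idx)
        residual = residual - cb[idx]  # pass the remainder to the next stage
    return indices

def rvq_decode(indices, codebooks):
    return sum(cb[idx] for idx, cb in zip(indices, codebooks))

codebooks = [torch.randn(1024, 64) for _ in range(4)]  # 4 stages, K=1024
codes = rvq_encode(torch.randn(100, 64), codebooks)
recon = rvq_decode(codes, codebooks)
```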

16 of 33

VideoPoet Framework

Out-of-the-box LLM transformer on discrete token sequences

17 of 33

VideoPoet Framework

Out-of-the-box LLM transformer on discrete token sequences

  • Flexibility: any-to-any task setup in a single model
  • Training efficiency: the entire sequence is learned in a single step with causal attention
  • Inference efficiency: various acceleration techniques such as caching, where full decoding FLOPs equal those of one full forward pass (see the sketch below)
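A toy single-head sketch of why this holds: during training, one causally masked forward pass yields a next-token prediction at every position, and during decoding a KV cache lets each new token attend over stored keys/values instead of recomputing the prefix. Real models add multiple heads, MLPs, and norms; this is illustrative only.

```python
import torch
import torch.nn.functional as F

def train_logits(x, wq, wk, wv, wo):
    """x: (T, D) embeddings -> next-token logits for every position at once."""
    q, k, v = x @ wq, x @ wk, x @ wv
    T, d = x.shape[0], k.shape[-1]
    mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
    att = (q @ k.T / d ** 0.5).masked_fill(mask, float("-inf"))
    return F.softmax(att, dim=-1) @ v @ wo

def decode_step(x_new, cache_k, cache_v, wq, wk, wv):
    """One new token: attend over cached K/V instead of the full prefix."""
    q, k, v = x_new @ wq, x_new @ wk, x_new @ wv
    cache_k = torch.cat([cache_k, k[None]], dim=0)
    cache_v = torch.cat([cache_v, v[None]], dim=0)
    att = F.softmax(q @ cache_k.T / k.shape[-1] ** 0.5, dim=-1)
    return att @ cache_v, cache_k, cache_v
```

Summed over all T decoding steps, the attention work matches the single training-style forward pass over the same sequence, which is the sense in which full decoding costs about one forward pass.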

18 of 33

Training Data

Mixture of pre-existing sources and formats, in two training phases

  • 🔥 Pretraining uses everything, including unlabeled and noisy data

[Table: training sources A (~170M samples), B (~50M), C (~50M), D (~1B images), spanning 🎥 video, 🔊 audio, 🌁 image, and 🔠 text modalities, with per-source usage marked for 🔥 pretraining and 🔧 T2V adaptation]

19 of 33

Training Data

Mixture of pre-existing sources and formats, in two training phases

  • 🔥 Pretraining uses everything, including unlabeled and noisy data
  • 🔧 Task adaptation uses task-specific, high-quality data

[Table: training sources A (~170M samples), B (~50M), C (~50M), D (~1B images), spanning 🎥 video, 🔊 audio, 🌁 image, and 🔠 text modalities, with per-source usage marked for 🔥 pretraining and 🔧 T2V adaptation]

20 of 33

Training Tasks

[Task matrix: prefixes (🚫 Unconditional, 🎥 Video, 🔊 Audio, 🎥🔊 Video + Audio, 🌁 Image, 🔠 Text) × outputs (Continue ▶️, 🎥 Video, 🔊 Audio, 🎥🔊 Video + Audio, 🌁 Image, 🎨 Style)]

Self-supervised

21 of 33

Training Tasks

[Task matrix as above, now with the self-supervised tasks for the 🎥 Video prefix row filled in]

Self-supervised

22 of 33

Training Tasks

[Task matrix as above, completed with the supervised tasks]

Self-supervised + Supervised
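A hedged sketch of how such a multi-task setup can be expressed as a single token sequence: conditioning tokens form the prefix and target tokens follow, with special tokens marking modality boundaries. The token names and layout below are hypothetical illustrations, not VideoPoet's actual vocabulary.

```python
from typing import List

# Hypothetical special tokens; VideoPoet's real vocabulary differs.
BOS_TEXT, EOS_TEXT, BOS_VIDEO, EOS_VIDEO = 0, 1, 2, 3

def build_sequence(task_token: int,
                   text_tokens: List[int],
                   video_tokens: List[int]) -> List[int]:
    """Condition (task + text) as prefix; video tokens as the target span."""
    prefix = [task_token, BOS_TEXT, *text_tokens, EOS_TEXT, BOS_VIDEO]
    target = [*video_tokens, EOS_VIDEO]
    return prefix + target  # the loss is applied only on the target span

# e.g. text-to-video: condition on text, predict video tokens
seq = build_sequence(task_token=10,
                     text_tokens=[42, 7, 99],
                     video_tokens=[5, 6, 8])
```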

23 of 33

MAGVIT Superresolution

Masked transformer with non-autoregressive decoding (sketched below)

  • Faster at small scale
  • Multi-axis windowed attention for long sequences

Yu et al. MAGVIT: Masked Generative Video Transformer. In CVPR 2023.
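A hedged sketch of MaskGIT-style non-autoregressive decoding, the mechanism behind masked transformers like MAGVIT: start fully masked, and at each step commit the most confident predictions while re-masking the rest on a cosine schedule. `model` is a stand-in for any network mapping a token grid to per-position logits.

```python
import math
import torch

MASK = -1  # sentinel for a masked position

def masked_decode(model, length: int, steps: int = 8):
    tokens = torch.full((length,), MASK)
    for step in range(steps):
        logits = model(tokens)                     # (length, vocab)
        conf, preds = logits.softmax(-1).max(-1)   # confidence + argmax
        still_masked = tokens == MASK
        tokens = torch.where(still_masked, preds, tokens)
        # cosine schedule: how many positions stay masked after this step
        n_mask = int(length * math.cos(math.pi / 2 * (step + 1) / steps))
        if n_mask > 0:
            # only freshly predicted positions are eligible for re-masking
            conf = conf.masked_fill(~still_masked, float("inf"))
            tokens[conf.topk(n_mask, largest=False).indices] = MASK
    return tokens
```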

24 of 33

Automatic Benchmarks

Zero-shot text-to-video generation compared with the state of the art

25 of 33

Human Evaluations

Zero-shot text-to-video generation comparison with prior and concurrent works

26 of 33

🔠 → 🎥 Text to Video

27 of 33

🌁 → 🎥 Image to Video (and 3D)

28 of 33

🌁 → 🎥 Image to Video (and 3D)

🎥 → 🎨 Video Stylization

Video Editing: A kite surfer riding a shark.

29 of 33

🎥 → 🔊 Video to Audio

On text-to-video generations, without a text prompt

30 of 33

Future Research Directions

Context:

  • This is the logo of ICML
  • This is Lijun
  • This is a new dance

Query: Generate a video of Lijun dancing like this, wearing a hoodie with the ICML logo.

31 of 33

Future Research Directions

  • Real-time streaming video generation: interactive neural gaming, neural user interfaces for OS / apps
  • Universal multimodal generative model: SOTA generation of text & video (& audio & …) and reasoning; c.f., human-level machine translation (~18’) → ChatGPT (23’)

Query: Can you show me how to tie this shoe with a single hand? 🥾

32 of 33

Summary

VideoPoet represents a distinct approach to video generation

  • State-of-the-art quality, challenging the diffusion monopoly
  • Multi-task flexibility, going beyond the text-to-video translation paradigm
  • Video-first foundation model, building upon LLM infrastructure for native integration

33 of 33

Thank you.