VideoPoet:
A Large Language Model for Zero-Shot Video Generation
Video Poets
Dan Kondratyuk^
Lijun Yu*
Xiuye Gu
José Lezama
Jonathan Huang*^
Grant Schindler
Rachel Hornung
Vighnesh Birodkar^
Jimmy Yan
Ming-Chang Chiu^
Krishna Somandepalli
Hassan Akbari^
Yair Alon
Yong Cheng
Josh Dillon^
Agrim Gupta^
Meera Hahn
Anja Hauth
David Hendon
Alonso Martinez^
David Minnen
Mikhail Sirotenko
Kihyuk Sohn^
Xuan Yang
Hartwig Adam
Ming-Hsuan Yang
Irfan Essa
Huisheng Wang
David A. Ross*
Bryan Seybold*
Lu Jiang*^
Equal contribution
* Corresponding authors
^ Work done while at Google
~10 Years of Video Generation
Space | Method | Model
Pixel | GAN | VGAN (16’), …
Pixel | Autoregressive | Video Transformer (20’), …
Pixel | Diffusion | VDM (22’), Imagen Video (22’), …
Latent (Discrete) | Autoregressive | VideoGPT (21’), CogVideo (22’), …
Latent (Discrete) | Masked | Phenaki (22’), MAGVIT (22’), MAGVIT-v2 (23’), …
Latent (Continuous) | Diffusion | VideoLDM (23’), VideoCrafter (23’), W.A.L.T (23’), Sora (24’), Veo (24’), Gen-3 (24’), and many more
VGAN (16’) examples (Golf, Baby): http://www.cs.columbia.edu/~vondrick/tinyvideo/
Latent diffusion has been dominating text to video
A skeleton drinking a glass of soda.
Is latent diffusion the only way to generate nice videos?
Is latent diffusion the only way to generate nice videos? Absolutely NOT!
Space | Method | Model
Pixel | GAN | VGAN (16’), …, DVD-GAN (19’), …
Pixel | Autoregressive | Video Transformer (20’), …
Pixel | Diffusion | VDM (22’), Imagen Video (22’), …
Image Token | Autoregressive | VideoGPT (21’), CogVideo (22’), …
Video Token | Masked | Phenaki (22’), MAGVIT (22’), MAGVIT-v2 (23’), …
Video Latent | Diffusion | VideoLDM (23’), VideoCrafter (23’), W.A.L.T (23’), Sora (24’), Veo (24’), Gen-3 (24’), and many more
Video Token | Autoregressive | VideoPoet (24’)
VideoPoet is an autoregressive LLM that synthesizes videos with high-fidelity motion and matching audio from a large variety of conditioning signals.
VideoPoet Framework
Modality-specific tokenizers
Raw signal
Token
➡️ Encode & Compress ➡️
⬅️ Decode & Decompress ⬅️
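The encode-and-compress / decode-and-decompress round trip can be illustrated with a toy vector-quantization tokenizer. This is a hypothetical sketch of the interface only, not the actual MAGVIT-v2 or SoundStream code: the codebook here is random, and `encode`/`decode` are simple nearest-neighbor lookups.

```python
import numpy as np

class ToyTokenizer:
    """Hypothetical sketch of a modality-specific tokenizer interface:
    encode compresses a raw signal into discrete token ids, decode inverts it."""

    def __init__(self, codebook_size=256, dim=4, seed=0):
        rng = np.random.default_rng(seed)
        # Random codebook of `codebook_size` vectors of dimension `dim`.
        self.codebook = rng.normal(size=(codebook_size, dim))

    def encode(self, signal):
        # signal: (n, dim) array of raw feature patches -> (n,) token ids,
        # via nearest-neighbor lookup in the codebook.
        d = ((signal[:, None, :] - self.codebook[None, :, :]) ** 2).sum(-1)
        return d.argmin(axis=1)

    def decode(self, tokens):
        # token ids -> reconstructed (n, dim) signal (codebook vectors).
        return self.codebook[tokens]

tok = ToyTokenizer()
x = np.random.default_rng(1).normal(size=(8, 4))
ids = tok.encode(x)      # the discrete "language" the LLM consumes
x_hat = tok.decode(ids)  # lossy reconstruction of the raw signal
```

Once every modality is mapped into token ids like this, a single transformer can model all of them with one vocabulary.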
MAGVIT-v2 Video Tokenizer
Yu et al. Language Model Beats Diffusion – Tokenizer is Key to Visual Generation. In ICLR 2024.
Defining the visual “language”
SoundStream Audio Tokenizer
Zeghidour et al. SoundStream: An End-to-End Neural Audio Codec. TASLP 2021.
Defining the audio “language”
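The core idea behind SoundStream's quantizer is residual vector quantization (RVQ): each stage quantizes the residual left by the previous stage, giving a coarse-to-fine code. Below is a minimal sketch with toy random codebooks, not the trained SoundStream model.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual vector quantization sketch: stage k quantizes the residual
    remaining after stages 1..k-1, returning one id array per stage."""
    ids, residual = [], x.copy()
    for cb in codebooks:
        d = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
        idx = d.argmin(axis=1)      # nearest codebook entry per frame
        ids.append(idx)
        residual = residual - cb[idx]
    return ids

def rvq_decode(ids, codebooks):
    # Sum the selected vectors from every stage.
    return sum(cb[idx] for cb, idx in zip(codebooks, ids))

rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(16, 8)) for _ in range(4)]  # 4 toy stages
x = rng.normal(size=(10, 8))                              # 10 toy audio frames
ids = rvq_encode(x, codebooks)
x_hat = rvq_decode(ids, codebooks)
```

Using more stages refines the reconstruction, which is what lets a neural codec trade bitrate against quality.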
VideoPoet Framework
Out-of-the-box LLM transformer on discrete token sequences
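The decoding loop of such a decoder-only model over discrete token sequences can be sketched as follows; `toy_logits` is a stand-in for the transformer, and the token values are arbitrary, but the greedy next-token loop is the standard autoregressive recipe.

```python
import numpy as np

def generate(logits_fn, prefix, n_new):
    """Minimal autoregressive decoding loop (greedy): the model extends a
    token sequence one token at a time, whatever modality the tokens encode."""
    seq = list(prefix)
    for _ in range(n_new):
        logits = logits_fn(seq)          # scores over the vocab for next token
        seq.append(int(np.argmax(logits)))
    return seq

VOCAB = 10

def toy_logits(seq):
    # Toy stand-in for the transformer: always favors (last_token + 1) % VOCAB.
    scores = np.zeros(VOCAB)
    scores[(seq[-1] + 1) % VOCAB] = 1.0
    return scores

print(generate(toy_logits, prefix=[3], n_new=4))
# -> [3, 4, 5, 6, 7]
```

Because the loop never inspects what the tokens mean, the same model can continue video, audio, or mixed sequences.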
Training Data
Mixture of pre-existing sources and formats, in two training phases
Source | 🎥 Video | 🔊 Audio | 🌁 Image | 🔠 Text | Sample Count | 🔥 Pretrain | 🔧 Adapt T2V |
A | ✅ | ✅ | | | ~170M | ✅ | |
B | ✅ | ◑ | | ◑ | ~50M | ✅ | |
C | ✅ | | | ✅ | ~50M | ✅ | ✅ |
D | | | ✅ | ✅ | ~1B images | ✅ | |
Training Tasks
Prefix ↓ \ Output → | Continue ▶️ | Video 🎥 | Audio 🔊 | 🎥🔊 | Image 🌁 | Style 🎨 |
🚫 Unconditional | | ✅ | ✅ | ✅ | ✅ | |
🎥 Video | ✅ | ✅➕➕ | ✅ | | | ✅ |
🔊 Audio | ✅ | ✅ | | | | |
🎥🔊 Video + Audio | ✅ | | | | | |
🌁 Image | | ✅ | | | | |
🔠 Text | | ✅ | | ✅ | ✅ | |
Self-supervised + Supervised
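One plausible way to pack such multi-task condition/output pairs into a single decoder-only token stream is with a task token followed by condition tokens and target tokens. The special-token names and layout below are illustrative assumptions, not the paper's exact format.

```python
# Hypothetical special tokens; real systems reserve ids outside the
# tokenizer vocabularies for markers like these.
SPECIAL = {"<bos>": 0, "<t2v>": 1, "<v2a>": 2, "<eos>": 3}

def pack_example(task, condition_tokens, target_tokens):
    """Build one training sequence plus a loss mask: in decoder-only
    training, loss is typically applied only on the target span."""
    seq = ([SPECIAL["<bos>"], SPECIAL[task]]
           + condition_tokens + target_tokens + [SPECIAL["<eos>"]])
    loss_mask = ([0] * (2 + len(condition_tokens))   # no loss on prefix
                 + [1] * len(target_tokens)          # loss on targets
                 + [1])                              # include <eos> in the loss
    return seq, loss_mask

seq, mask = pack_example("<t2v>", condition_tokens=[11, 12],
                         target_tokens=[21, 22, 23])
# seq  = [0, 1, 11, 12, 21, 22, 23, 3]
# mask = [0, 0, 0, 0, 1, 1, 1, 1]
```

Mixing many such task layouts in one training stream is what lets a single model serve all the cells of the table above.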
MAGVIT Super-Resolution
Masked transformer with non-autoregressive decoding
Yu et al. MAGVIT: Masked Generative Video Transformer. In CVPR 2023.
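Non-autoregressive masked decoding of this kind can be sketched as follows: start from a fully masked token grid and, at each of a few parallel steps, commit the most confident predictions. `toy_score` is a stand-in for the masked transformer, and the schedule is a simplified assumption of the MaskGIT-style procedure.

```python
import numpy as np

def masked_decode(score_fn, length, steps=4, mask_id=-1, seed=0):
    """Sketch of non-autoregressive masked decoding: each step predicts all
    masked positions in parallel and keeps only the most confident ones."""
    rng = np.random.default_rng(seed)
    tokens = np.full(length, mask_id)
    for step in range(steps):
        probs, preds = score_fn(tokens, rng)  # per-position confidence + guess
        # Only still-masked positions are candidates for unmasking.
        probs = np.where(tokens == mask_id, probs, -np.inf)
        # Linear schedule: after step k, about length*(k+1)/steps tokens are set.
        n_keep = int(np.ceil(length * (step + 1) / steps)) - (tokens != mask_id).sum()
        for i in np.argsort(-probs)[:max(n_keep, 1)]:
            tokens[i] = preds[i]
    return tokens

def toy_score(tokens, rng):
    # Toy stand-in: "predict" each position's own index, random confidence.
    return rng.random(len(tokens)), np.arange(len(tokens))

out = masked_decode(toy_score, length=8)
# All 8 positions are decoded in only 4 parallel steps.
```

Decoding many tokens per step is what makes this much faster than token-by-token autoregression, which matters for a super-resolution stage that must emit large high-resolution token grids.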
Automatic Benchmarks
Zero-shot text-to-video generation comparison with state-of-the-art
Human Evaluations
Zero-shot text-to-video generation comparison with prior and concurrent works
🔠 → 🎥 Text to Video
🌁 → 🎥 Image to Video
(and 3D)
🎥 → 🎨 Video Stylization
(and 3D)
Video Editing
A kite surfer riding a shark.
Applied to text-to-video generations, without the text prompt
🎥 → 🔊 Video to Audio
Future Research Directions
Context:
This is the logo of ICML
This is Lijun
Query:
Generate a video of Lijun dancing like this wearing a hoodie with the ICML logo.
This is a new dance
Future Research Directions
Query: Can you show me how to tie this shoe with a single hand? 🥾
Summary
VideoPoet demonstrates that autoregressive LLMs are a distinct, competitive approach to video generation
Thank you.