Video Understanding and Generation
First Part:
Papers:
1. MAKE-A-VIDEO: TEXT-TO-VIDEO GENERATION WITHOUT TEXT-VIDEO DATA
2. Lumiere: A Space-Time Diffusion Model for Video Generation
Make-A-Video
Motivations:
1. Progress in T2I models, driven by the abundance of image-text pairs on the internet.
2. Lack of paired text-video data at a comparable scale.
3. Processing video itself is computationally expensive.
Procedure:
1. Leverage a pretrained T2I model to learn the correlation between text and images.
2. Train the model on unsupervised (unlabeled) video to learn motion; a minimal sketch of this two-stage recipe follows below.
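A minimal PyTorch-style sketch of the two-stage idea (a conceptual illustration, not the paper's actual architecture; PseudoT2I and TemporalLayer are hypothetical placeholders):

import torch
import torch.nn as nn

class PseudoT2I(nn.Module):
    # placeholder for a T2I backbone pretrained on image-text pairs (stage 1)
    def __init__(self):
        super().__init__()
        self.spatial = nn.Conv2d(3, 3, 3, padding=1)
    def forward(self, x):                  # x: (B, 3, H, W)
        return self.spatial(x)

class TemporalLayer(nn.Module):
    # new layer that only mixes information along the time axis (stage 2)
    def __init__(self, channels=3):
        super().__init__()
        self.temporal = nn.Conv1d(channels, channels, 3, padding=1)
    def forward(self, x):                  # x: (B, T, 3, H, W)
        b, t, c, h, w = x.shape
        y = x.permute(0, 3, 4, 2, 1).reshape(b * h * w, c, t)
        y = self.temporal(y)
        return y.reshape(b, h, w, c, t).permute(0, 4, 3, 1, 2)

t2i = PseudoT2I()
for p in t2i.parameters():
    p.requires_grad = False                # freeze the text-image prior

temporal = TemporalLayer()                 # only these weights train on video
video = torch.randn(2, 8, 3, 32, 32)       # unlabeled clip: 2 samples, 8 frames
frames = t2i(video.flatten(0, 1)).reshape(2, 8, 3, 32, 32)
out = temporal(frames)
print(out.shape)                           # torch.Size([2, 8, 3, 32, 32])

The point of the sketch: only the temporal layers ever see video, so motion can be learned without any text-video pairs.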
Implementation and training:
Evaluation and Results:
Lumiere
Temporal consistency in generated video
Lumiere pipeline
Results:
MAGVIT
MAGVIT-v2
MAGVIT: Masked Generative Video Transformer
What is MAGVIT?
The MAGVIT (Masked Generative Video Transformer) model is a cutting-edge approach to video synthesis tasks that offers a unified solution for various video generation challenges.
What challenges does it tackle?
Efficiency: MAGVIT focuses on improving the efficiency of video generation tasks.
Multi-Task Video Generation: MAGVIT aims to provide a unified solution for multiple video generation tasks.
Quality and Fidelity: MAGVIT focuses on achieving high reconstruction fidelity.
Flexibility and Generalization: MAGVIT aims to synthesize videos with complex scenes and motion content.
Architecture step by step
At each training step, MAGVIT samples one of the tasks and builds its condition inputs by cropping and padding the raw video (in the paper's figure, green denotes valid pixels and white denotes padding).
(Note: the paper does not clearly explain how the input and the task prompt interact.)
Then it quantizes the condition inputs with the 3D-VQ encoder and selects the non-padding part as condition tokens.
The video VQ autoencoder is built upon the image VQGAN. The 3D-VQ encoder tokenizes a video clip V of T frames into discrete tokens, f_T : V → z ∈ Z^N, where Z is the codebook; the 3D-VQ decoder f_T^{-1} maps the latent tokens back to video pixels.
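A minimal sketch of this quantization step, assuming the encoder outputs have already been flattened into a grid of latent vectors (the real MAGVIT tokenizer is a 3D-CNN VQGAN; the shapes and codebook size here are made up):

import torch

def vq_quantize(latents, codebook):
    # latents: (N, D) continuous encoder outputs, codebook: (K, D) learned codes
    dists = torch.cdist(latents, codebook)    # (N, K) pairwise distances
    indices = dists.argmin(dim=1)             # discrete token ids z in {0, ..., K-1}
    quantized = codebook[indices]             # vectors the 3D-VQ decoder would consume
    return indices, quantized

codebook = torch.randn(1024, 256)              # K = 1024 codes of dimension D = 256
latents = torch.randn(2048, 256)               # N latent positions from one clip
tokens, quantized = vq_quantize(latents, codebook)
print(tokens.shape, quantized.shape)           # torch.Size([2048]) torch.Size([2048, 256])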
The masked token sequence combines condition tokens, [MASK] tokens, and target tokens, with a task prompt and a class token as the prefix.
Unlike plain masked-token-modeling (MTM) decoding, which starts from an all-[MASK] sequence, COMMIT performs a conditional generation process toward the output tokens while gradually replacing the interior condition tokens.
The bidirectional transformer learns to predict the target tokens through three objectives: refining condition tokens, predicting masked tokens, and reconstructing target tokens.
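A rough sketch of how such a masked input sequence might be assembled (the special-token ids, mask ratio, and shapes below are invented for illustration; this is not the exact COMMIT corruption schedule from the paper):

import torch

MASK_ID = 0                                    # hypothetical special-token ids
TASK_PROMPT_ID, CLASS_ID = 1, 2

def build_masked_sequence(target_tokens, condition_tokens, condition_pos, mask_ratio=0.7):
    # target_tokens: (L,) ground-truth video tokens
    # condition_tokens / condition_pos: interior condition values and their positions
    seq = target_tokens.clone()
    keep = torch.rand(seq.shape) >= mask_ratio  # target tokens the model still sees
    seq[~keep] = MASK_ID                        # the rest become [MASK]
    seq[condition_pos] = condition_tokens       # interior condition tokens override
    prefix = torch.tensor([TASK_PROMPT_ID, CLASS_ID])
    return torch.cat([prefix, seq])             # transformer input

target = torch.randint(3, 1027, (16,))          # 16 video tokens (toy size)
cond_pos = torch.tensor([0, 1, 2, 3])           # e.g. tokens from a given first frame
cond = torch.randint(3, 1027, (4,))
print(build_masked_sequence(target, cond, cond_pos).shape)   # torch.Size([18])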
Inference time
MAGVIT is 60x faster than the autoregressive video transformer and 4-16x more efficient than the contemporary non-autoregressive video transformer.
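A toy back-of-the-envelope comparison of where the speedup comes from (the token and step counts below are made up; only the ratio is the point):

# An autoregressive transformer needs one forward pass per generated token,
# while masked (non-autoregressive) decoding refines all tokens in a small,
# fixed number of parallel steps.
num_tokens = 16 * 16 * 4            # tokens for one latent video (toy figure)
ar_forward_passes = num_tokens      # one pass per token
nar_forward_passes = 12             # a fixed parallel refinement schedule
print(ar_forward_passes / nar_forward_passes)   # ~85x fewer passes in this toy setup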
Results and outputs:
MAGVIT-v2
A video generator driven by an LLM
Problem:
While Large Language Models (LLMs) are the dominant models for generative tasks in language, they do not perform as well as diffusion models on image and video generation.
Solution:
To use LLMs effectively for visual generation, one crucial component is the visual tokenizer, which maps pixel-space inputs to discrete tokens appropriate for LLM learning.
What is new?
This paper highlights two new designs: a lookup-free quantizer and a collection of enhancements to the tokenizer model.
The abstract idea of lookup-free quantization (LFQ)
In LFQ, the codebook is replaced with an integer set, and the latent space is decomposed into single-dimensional variables. Each dimension of the quantized representation is quantized independently to the closest value in a tiny per-dimension set, so no exhaustive search through codebook embeddings is needed.
Smaller code embedding → bigger vocabulary size → bigger codebook.
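A minimal sketch of LFQ with a binary per-dimension set {-1, 1}, which is the setup the paper emphasizes (shapes below are arbitrary):

import torch

def lfq(latents):
    # latents: (N, d). Each dimension is quantized independently by its sign,
    # so the token id is just the resulting binary code read as an integer.
    codes = torch.where(latents > 0, torch.ones_like(latents), -torch.ones_like(latents))
    bits = (codes > 0).long()                       # map {-1, 1} -> {0, 1}
    weights = 2 ** torch.arange(latents.shape[1])   # binary-to-integer weights
    ids = (bits * weights).sum(dim=1)               # vocabulary size = 2^d
    return codes, ids

latents = torch.randn(5, 10)                        # d = 10 -> 1024 possible tokens
codes, ids = lfq(latents)
print(ids)                                          # five integer token ids in [0, 1023]

Because no embedding table has to be searched, the code embedding can stay tiny while the vocabulary grows exponentially with the number of dimensions d.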
Entropy penalty:
An entropy penalty is added during training to encourage codebook utilization, improving the efficiency and effectiveness of the tokenizer by ensuring that more of the codebook actually gets used.
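One common way such an entropy penalty is written (a generic formulation, not necessarily the exact loss in MAGVIT-v2): per-sample entropy is pushed down so each latent commits to a code, while the entropy of the batch-averaged code distribution is pushed up so many different codes get used.

import torch
import torch.nn.functional as F

def entropy_penalty(logits):
    # logits: (N, K) affinities of N latents to K codes
    probs = F.softmax(logits, dim=-1)
    log_probs = F.log_softmax(logits, dim=-1)
    per_sample_entropy = -(probs * log_probs).sum(dim=-1).mean()      # want low
    avg_probs = probs.mean(dim=0)                                     # overall code usage
    codebook_entropy = -(avg_probs * (avg_probs + 1e-8).log()).sum()  # want high
    return per_sample_entropy - codebook_entropy    # added to the training loss

print(entropy_penalty(torch.randn(2048, 1024)))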
In my Opinion:
These two papers are small steps toward AGI.
First, multi-task models that can perform well across multiple tasks are good at generalization, so improving them is important, and studying them and their challenges gives us more general solutions.
Second, I think unifying vision and language in the same token space could set the stage for a true multimodal LLM that can understand, generate, and reason within our visual environment.
Overall:
The Multi-Task Masked Token Modeling approach in MAGVIT offers advantages in efficiency, flexibility, quality, generalization, scalability, and state-of-the-art performance, making it a powerful and effective solution for various video generation tasks.
MAGVIT-v2 shows that the proposed enhancements to the tokenizer improve generation quality, scalability, and efficiency in visual tasks, and it sets the stage for future research aimed at pushing the boundaries of visual generation and advancing the field of multimodal learning.