Video Understanding and Generation
First Part:
Papers:
1. MAKE-A-VIDEO: TEXT-TO-VIDEO GENERATION WITHOUT TEXT-VIDEO DATA
2. Lumiere: A Space-Time Diffusion Model for Video Generation
Make-A-Video
Motivations:
1. Progress in T2I models, driven by the abundance of image-text pairs on the internet.
2. Lack of paired text-video data at a comparable scale.
3. Processing video itself is computationally expensive.
Procedure:
1. Leverage a pretrained T2I model to learn the correlation between text and images.
2. Train the model on unsupervised (unlabeled) video to learn motion; a minimal sketch of this two-stage recipe follows below.
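A minimal PyTorch-style sketch of the two-stage idea (a conceptual illustration, not the paper's actual architecture; PseudoT2I and TemporalLayer are hypothetical placeholders):

import torch
import torch.nn as nn

class PseudoT2I(nn.Module):
    # placeholder for a T2I backbone pretrained on image-text pairs (stage 1)
    def __init__(self):
        super().__init__()
        self.spatial = nn.Conv2d(3, 3, 3, padding=1)
    def forward(self, x):                  # x: (B, 3, H, W)
        return self.spatial(x)

class TemporalLayer(nn.Module):
    # new layer that only mixes information along the time axis (stage 2)
    def __init__(self, channels=3):
        super().__init__()
        self.temporal = nn.Conv1d(channels, channels, 3, padding=1)
    def forward(self, x):                  # x: (B, T, 3, H, W)
        b, t, c, h, w = x.shape
        y = x.permute(0, 3, 4, 2, 1).reshape(b * h * w, c, t)
        y = self.temporal(y)
        return y.reshape(b, h, w, c, t).permute(0, 4, 3, 1, 2)

t2i = PseudoT2I()
for p in t2i.parameters():
    p.requires_grad = False                # freeze the text-image prior

temporal = TemporalLayer()                 # only these weights train on video
video = torch.randn(2, 8, 3, 32, 32)       # unlabeled clip: 2 samples, 8 frames
frames = t2i(video.flatten(0, 1)).reshape(2, 8, 3, 32, 32)
out = temporal(frames)
print(out.shape)                           # torch.Size([2, 8, 3, 32, 32])

The point of the sketch: only the temporal layers ever see video, so motion can be learned without any text-video pairs.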
Implementation and training:
Evaluation and Results:
Lumiere
Temporal consistency in generated video
Lumiere pipeline
Results:
MAGVIT
MAGVIT-v2
MAGVIT: Masked Generative Video Transformer
What is MAGVIT?
The MAGVIT (Masked Generative Video Transformer) model is a cutting-edge approach to video synthesis tasks that offers a unified solution for various video generation challenges.
What challenges does it tackle?
Efficiency: MAGVIT focuses on improving the efficiency of video generation tasks.
Multi-Task Video Generation: MAGVIT aims to provide a unified solution for multiple video generation tasks.
Quality and Fidelity: MAGVIT focuses on achieving high reconstruction fidelity.
Flexibility and Generalization: MAGVIT aims to synthesize videos with complex scenes and motion content.
Architecture step by step
At each training step, MAGVIT samples one of the tasks and builds its condition inputs by cropping and padding the raw video (in the paper's figure, green denotes valid pixels and white denotes padding).
(Note: the paper does not clearly explain how the input and the task prompt interact.)
Then it quantizes the condition inputs with the 3D-VQ encoder and selects the non-padding part as condition tokens.
The video VQ autoencoder is built upon the image VQGAN. The 3D-VQ encoder tokenizes a video clip V of T frames into discrete tokens, f_T : V → z ∈ Z^N, where Z is the codebook; the 3D-VQ decoder f_T^{-1} maps the latent tokens back to video pixels.
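A minimal sketch of this quantization step, assuming the encoder outputs have already been flattened into a grid of latent vectors (the real MAGVIT tokenizer is a 3D-CNN VQGAN; the shapes and codebook size here are made up):

import torch

def vq_quantize(latents, codebook):
    # latents: (N, D) continuous encoder outputs, codebook: (K, D) learned codes
    dists = torch.cdist(latents, codebook)    # (N, K) pairwise distances
    indices = dists.argmin(dim=1)             # discrete token ids z in {0, ..., K-1}
    quantized = codebook[indices]             # vectors the 3D-VQ decoder would consume
    return indices, quantized

codebook = torch.randn(1024, 256)              # K = 1024 codes of dimension D = 256
latents = torch.randn(2048, 256)               # N latent positions from one clip
tokens, quantized = vq_quantize(latents, codebook)
print(tokens.shape, quantized.shape)           # torch.Size([2048]) torch.Size([2048, 256])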
The masked token sequence combines condition tokens, [MASK] tokens, and target tokens, with a task prompt and a class token as the prefix.
Unlike plain masked-token-modeling (MTM) decoding, which starts from an all-[MASK] sequence, COMMIT performs a conditional generation process toward the output tokens while gradually replacing the interior condition tokens.
The bidirectional transformer learns to predict the target tokens through three objectives: refining condition tokens, predicting masked tokens, and reconstructing target tokens.
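A rough sketch of how such a masked input sequence might be assembled (the special-token ids, mask ratio, and shapes below are invented for illustration; this is not the exact COMMIT corruption schedule from the paper):

import torch

MASK_ID = 0                                    # hypothetical special-token ids
TASK_PROMPT_ID, CLASS_ID = 1, 2

def build_masked_sequence(target_tokens, condition_tokens, condition_pos, mask_ratio=0.7):
    # target_tokens: (L,) ground-truth video tokens
    # condition_tokens / condition_pos: interior condition values and their positions
    seq = target_tokens.clone()
    keep = torch.rand(seq.shape) >= mask_ratio  # target tokens the model still sees
    seq[~keep] = MASK_ID                        # the rest become [MASK]
    seq[condition_pos] = condition_tokens       # interior condition tokens override
    prefix = torch.tensor([TASK_PROMPT_ID, CLASS_ID])
    return torch.cat([prefix, seq])             # transformer input

target = torch.randint(3, 1027, (16,))          # 16 video tokens (toy size)
cond_pos = torch.tensor([0, 1, 2, 3])           # e.g. tokens from a given first frame
cond = torch.randint(3, 1027, (4,))
print(build_masked_sequence(target, cond, cond_pos).shape)   # torch.Size([18])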
Inference time
MAGVIT is 60x faster than the autoregressive video transformer and 4-16x more efficient than the contemporary non-autoregressive video transformer.
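A toy back-of-the-envelope comparison of where the speedup comes from (the token and step counts below are made up; only the ratio is the point):

# An autoregressive transformer needs one forward pass per generated token,
# while masked (non-autoregressive) decoding refines all tokens in a small,
# fixed number of parallel steps.
num_tokens = 16 * 16 * 4            # tokens for one latent video (toy figure)
ar_forward_passes = num_tokens      # one pass per token
nar_forward_passes = 12             # a fixed parallel refinement schedule
print(ar_forward_passes / nar_forward_passes)   # ~85x fewer passes in this toy setup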
Results and outputs:
MAGVIT-v2
A video generator driven by an LLM
Problem:
While Large Language Models (LLMs) are the dominant models for generative tasks in language, they do not perform as well as diffusion models on image and video generation.
Solution:
To use LLMs effectively for visual generation, one crucial component is the visual tokenizer, which maps pixel-space inputs to discrete tokens appropriate for LLM learning.
What is new?
This paper highlights two new designs: a lookup-free quantizer and a collection of enhancements to the tokenizer model.
The abstract idea of lookup-free quantization (LFQ)
In LFQ, the codebook is replaced with an integer set, and the latent space is decomposed into single-dimensional variables. Each dimension of the quantized representation is quantized independently to the closest value in a tiny per-dimension set, so no exhaustive search through codebook embeddings is needed.
Smaller code embedding → bigger vocabulary size → bigger codebook.
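A minimal sketch of LFQ with a binary per-dimension set {-1, 1}, which is the setup the paper emphasizes (shapes below are arbitrary):

import torch

def lfq(latents):
    # latents: (N, d). Each dimension is quantized independently by its sign,
    # so the token id is just the resulting binary code read as an integer.
    codes = torch.where(latents > 0, torch.ones_like(latents), -torch.ones_like(latents))
    bits = (codes > 0).long()                       # map {-1, 1} -> {0, 1}
    weights = 2 ** torch.arange(latents.shape[1])   # binary-to-integer weights
    ids = (bits * weights).sum(dim=1)               # vocabulary size = 2^d
    return codes, ids

latents = torch.randn(5, 10)                        # d = 10 -> 1024 possible tokens
codes, ids = lfq(latents)
print(ids)                                          # five integer token ids in [0, 1023]

Because no embedding table has to be searched, the code embedding can stay tiny while the vocabulary grows exponentially with the number of dimensions d.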
Entropy penalty:
An entropy penalty is added during training to encourage codebook utilization, improving the efficiency and effectiveness of the tokenizer by ensuring that more of the codebook actually gets used.
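One common way such an entropy penalty is written (a generic formulation, not necessarily the exact loss in MAGVIT-v2): per-sample entropy is pushed down so each latent commits to a code, while the entropy of the batch-averaged code distribution is pushed up so many different codes get used.

import torch
import torch.nn.functional as F

def entropy_penalty(logits):
    # logits: (N, K) affinities of N latents to K codes
    probs = F.softmax(logits, dim=-1)
    log_probs = F.log_softmax(logits, dim=-1)
    per_sample_entropy = -(probs * log_probs).sum(dim=-1).mean()      # want low
    avg_probs = probs.mean(dim=0)                                     # overall code usage
    codebook_entropy = -(avg_probs * (avg_probs + 1e-8).log()).sum()  # want high
    return per_sample_entropy - codebook_entropy    # added to the training loss

print(entropy_penalty(torch.randn(2048, 1024)))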
In my Opinion:
These two papers are small steps toward AGI.
First, multi-task models that can perform well across multiple tasks are good at generalization, so improving them is important, and studying them and their challenges gives us more general solutions.
Second, I think unifying vision and language in the same token space could set the stage for a true multimodal LLM that can understand, generate, and reason within our visual environment.
Overall:
The Multi-Task Masked Token Modeling approach in MAGVIT offers advantages in efficiency, flexibility, quality, generalization, scalability, and state-of-the-art performance, making it a powerful and effective solution for various video generation tasks.
MAGVIT-v2 shows that the proposed enhancements to the tokenizer improve generation quality, scalability, and efficiency in visual tasks, and it sets the stage for future research aimed at pushing the boundaries of visual generation and advancing the field of multimodal learning.