Evaluating the Alignment of T2I and T2V Generation
Yixin Fei, Kewen Wu, Pengliang Ji, Zhiqiu Lin, Deva Ramanan
Carnegie Mellon University, Robotics Institute
Motivation:
- Text-to-image models struggle to generate images from compositional text prompts.
- Automated evaluation metrics for alignment often behave like bag-of-words models.
- Text-to-video models struggle to understand camera aspects and 3D motions.
Goal:
- To provide reliable alignment scores for complex prompts without relying on expensive human feedback.
- To build a benchmark that covers essential visio-linguistic compositional reasoning skills, evaluating both text-to-visual generative models and vision-language alignment metrics.
- To build a high-quality text-video paired dataset that covers aspects such as shot composition, camera movements, and lighting effects.
Evaluating automated metrics: VQAScore shows significantly stronger agreement with human ratings than CLIPScore (one way to measure such agreement is sketched below).
Evaluating text-to-visual models: Advanced prompts that require complex visio-linguistic reasoning remain much harder.
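As an illustration only, the sketch below summarizes metric-human agreement as pairwise ranking accuracy: how often a metric orders two images the same way as their averaged 1-to-5 human Likert ratings. The function name and the choice of pairwise accuracy are assumptions for illustration; the benchmark's exact agreement protocol may differ.

```python
from itertools import combinations

def pairwise_agreement(metric_scores: list[float], human_ratings: list[float]) -> float:
    """Fraction of image pairs that the metric orders the same way as human ratings."""
    agree, total = 0, 0
    for i, j in combinations(range(len(metric_scores)), 2):
        if human_ratings[i] == human_ratings[j]:
            continue  # skip pairs that humans rate as equally aligned
        total += 1
        same_order = (metric_scores[i] - metric_scores[j]) * (human_ratings[i] - human_ratings[j]) > 0
        agree += int(same_order)
    return agree / total if total else 0.0
```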
VQAScore:
- Given an image and text, we compute the probability of a “Yes” answer to a simple question like “Does this figure show ‘{text}’? Please answer yes or no.” (see the sketch after this list).
- Replaced the Llama-2 language model in LLaVA-1.5 with the state-of-the-art bidirectional encoder-decoder FlanT5.
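A minimal sketch of the VQAScore idea, assuming a generic VQA backbone (e.g., the paper's CLIP-FlanT5) wrapped behind a hypothetical `answer_token_prob` interface; this interface is an illustrative placeholder, not the authors' actual API.

```python
from typing import Protocol
from PIL import Image

class VQAModel(Protocol):
    """Assumed interface: exposes the probability of a given first answer token."""
    def answer_token_prob(self, image: Image.Image, question: str, token: str) -> float:
        ...

def vqascore(model: VQAModel, image: Image.Image, text: str) -> float:
    # Templated yes/no question built from the candidate text prompt.
    question = f'Does this figure show "{text}"? Please answer yes or no.'
    # VQAScore = P("Yes" | image, question) under the VQA model.
    return model.answer_token_prob(image, question, "Yes")
```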
GenAI-Bench:
- Collected 1,600 compositional text prompts from professional designers who use tools such as Midjourney.
- Tagged each prompt with all relevant visio-linguistic compositional reasoning skills. (basic skills in gray, advanced skills in blue)
- Collected a total of 38,400 human alignment ratings on a 1-to-5 Likert scale.
Improving T2I Generation:
- Ranked a few candidate images with VQAScore and selected the highest-scoring one (see the reranking sketch below).
- Established a ranking benchmark by collecting 43,200 human ratings.
Ranking: Selecting the highest-VQAScore images significantly boosts the overall human alignment ratings. (left: DALLE-3, right: SDXL)
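The best-of-N reranking step can be sketched as follows; `generate_image` stands in for any text-to-image model call and `score` for an alignment metric such as the VQAScore sketch above (both names are hypothetical placeholders). Because only candidates are reranked, alignment improves without retraining the generator.

```python
from typing import Callable, Tuple
from PIL import Image

def select_best_image(
    prompt: str,
    generate_image: Callable[[str], Image.Image],  # placeholder for a T2I model call
    score: Callable[[Image.Image, str], float],    # e.g., the vqascore sketch above
    num_candidates: int = 4,
) -> Tuple[Image.Image, float]:
    """Generate several candidates, score each against the prompt, keep the best."""
    candidates = [generate_image(prompt) for _ in range(num_candidates)]
    scores = [score(img, prompt) for img in candidates]
    best = max(range(num_candidates), key=lambda i: scores[i])
    return candidates[best], scores[best]
```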
Building a benchmark that captures camera components:
Camera components, such as shot composition, movements, and lighting effects, are often overlooked in current video captions. However, incorporating these elements is essential for generative models to create visually consistent and realistic videos.