Evaluating the Alignment of Text-to-Visual Generation
Presenters: Yixin Fei, Kewen Wu, Pengliang Ji
Advisors: Deva Ramanan, Zhiqiu Lin
An avocado sitting in a therapist’s chair, saying ‘I just feel so empty inside’ with a pit-sized hole in its center. The therapist, a spoon, scribbles notes.
Motivation: generated visuals often fail to align with input text prompts
"A swan with a silver anklet on a crystal lake."
"A snowy owl perched to the right on a frost-covered branch."
"One content rabbit and six tired turtles."
"The dog with a leash sits quietly, the other without a leash runs wildly."
"'Jazz Night' flashing on a neon sign at the entrance to the Music Lounge."
Attribute
Relation
Counting
Negation
Typography
Motivation: how to evaluate alignment robustly?
FID Score
CLIPScore
PickScore
Goal
“the moon is over the cow”
temporal
Solution
Compute the alignment score with VQA models in an end-to-end way
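One way such an end-to-end score can look (a hedged sketch, not the exact method of any paper covered here): ask a generative VLM a yes/no question about the prompt and read off the probability of "yes". The BLIP-2 checkpoint and the question template below are illustrative assumptions.

```python
# Hedged sketch of an end-to-end VQA-based alignment score: ask a generative VLM
# whether the image shows the prompt and read off P("yes"). Model choice (BLIP-2)
# and the question template are illustrative assumptions, not the papers' exact setup.
import torch
from PIL import Image
from transformers import Blip2ForConditionalGeneration, Blip2Processor

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

def vqa_alignment_score(image: Image.Image, prompt: str) -> float:
    question = f'Question: Does this image show "{prompt}"? Answer yes or no. Answer:'
    inputs = processor(images=image, text=question, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # next-token distribution after the prompt
    probs = logits.softmax(dim=-1)
    yes_id = processor.tokenizer(" yes", add_special_tokens=False).input_ids[0]
    no_id = processor.tokenizer(" no", add_special_tokens=False).input_ids[0]
    return (probs[yes_id] / (probs[yes_id] + probs[no_id])).item()

# Example: vqa_alignment_score(Image.open("generated.png"), "one content rabbit and six tired turtles")
```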
When and Why Vision-Language Models Behave Like Bags-Of-Words, and What to Do About It?
Mert Yuksekgonul, Federico Bianchi, Pratyusha Kalluri, Dan Jurafsky, James Zou
Stanford University
ICLR 2023 Oral
✔️ “the white legs and the orange cat”
✖️ “the orange legs and the white cat”
✔️ “the large kite and the black ropes”
✖️ “the black kite and the large ropes”
✔️ “the sandwiches are on the plate”
✖️ “the plate is on the sandwiches”
✔️ “the horse is eating the grass”
✖️ “the grass is eating the horse”
Preliminary: CLIPScore
Contrastive Language-Image Pre-training (CLIP) model
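As a reference point, a minimal CLIPScore sketch: it assumes the openai/clip-vit-base-patch32 checkpoint from Hugging Face as the backbone and uses the original rescaling CLIPScore = 2.5 * max(cos(image, text), 0) from Hessel et al. (2021).

```python
# Minimal CLIPScore sketch: cosine similarity between CLIP image and text embeddings,
# rescaled as in the CLIPScore paper. The checkpoint below is a stand-in backbone.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, caption: str) -> float:
    inputs = processor(text=[caption], images=image, return_tensors="pt",
                       padding=True, truncation=True)  # CLIP truncates text at 77 tokens
    with torch.no_grad():
        out = model(**inputs)
    img_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    cos = (img_emb * txt_emb).sum(dim=-1).item()
    return 2.5 * max(cos, 0.0)

# Example: clip_score(Image.open("generated.png"), "a snowy owl perched on a frost-covered branch")
```

Because the score depends only on a single pooled embedding per modality, two captions with the same words in different orders get nearly identical scores, which is exactly the weakness probed next.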
Fine-grained Evaluation of VLMs
When do VLMs perform poorly?
ARO Benchmark
Attribution: Color, Size, Shape, Material, Gender, Emotion, Age, State
Relation: Prepositional Relation, Verb Relation, Part Relation
Order: Shuffle nouns and adjectives; Shuffle everything but nouns and adjectives; Shuffle trigrams; Shuffle words within each trigram
Attribution
Color
✔️ “the white legs and the orange cat”
✖️ “the orange legs and the white cat”
Size
✔️ “the small building and the wood fence”
✖️ “the wood building and the small fence”
Age
✔️ “the old man and the large bag”
✖️ “the large man and the old bag”
State
✔️ “the fresh sandwich and the sliced roast beef”
✖️ “the sliced roast sandwich and the fresh beef”
Relation
Preposition
Verb
Part
✔️ “the eggs is on top of the bread”
✖️ “the bread is on top of the eggs”
✔️ “the horse is eating the grass”
✖️ “the grass is eating the horse”
✔️ “the woman is wearing the hat”
✖️ “the hat is wearing the woman”
Order
✔️ “a man with a red helmet on a small moped on a dirt road”
✖️ “a man with a red road on a small moped on a helmet dirt” (shuffle nouns and adjectives)
✖️ “on man a moped red helmet a small a a on dirt road with” (shuffle words within each trigram)
✖️ “with a man red a helmet small a on on moped a dirt road” (shuffle everything but nouns and adjectives)
✖️ “on a small moped on a dirt road a red helmet a man with” (shuffle trigrams)
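For concreteness, a small sketch of the two trigram-based perturbations, assuming simple whitespace tokenization; shuffling only nouns and adjectives would additionally require a POS tagger (e.g., nltk or spaCy).

```python
# Sketch of ARO-style word-order perturbations used as negative captions.
import random

def shuffle_trigrams(caption: str, seed: int = 0) -> str:
    """Reorder the caption's non-overlapping trigram chunks."""
    words = caption.split()
    chunks = [words[i:i + 3] for i in range(0, len(words), 3)]
    random.Random(seed).shuffle(chunks)
    return " ".join(w for chunk in chunks for w in chunk)

def shuffle_within_trigrams(caption: str, seed: int = 0) -> str:
    """Keep trigram chunks in place but shuffle the words inside each chunk."""
    words = caption.split()
    rng = random.Random(seed)
    out = []
    for i in range(0, len(words), 3):
        chunk = words[i:i + 3]
        rng.shuffle(chunk)
        out.extend(chunk)
    return " ".join(out)

caption = "a man with a red helmet on a small moped on a dirt road"
print(shuffle_trigrams(caption))
print(shuffle_within_trigrams(caption))
```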
Experiments on fine-grained evaluation
VLMs fail to represent word order more broadly
A Critique of Contrastive Pretraining (CLIP)
Why does CLIPScore perform poorly?
“CLIP pre-trains for the task of image-text retrieval on our noisy web-scale dataset”
Large pre-training datasets: on noisy web-scale captions, image-text retrieval can be solved without modeling word order or composition, so the contrastive objective gives CLIP little incentive to learn either.
A Simple Fix with Hard Negatives
What to do to alleviate these issues?
Add composition-aware hard negatives (e.g., order-shuffled captions) to contrastive training to make CLIP sensitive to word order (a loss sketch follows below).
From contrastive learning to VQA models
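As a rough illustration (not the exact NegCLIP recipe, which also mines hard negative images from nearest neighbors), a minimal sketch of how shuffled captions can be appended to the text side of the batch so the contrastive loss must separate each image from its own scrambled caption:

```python
# Sketch of contrastive training with composition-aware hard negative captions:
# each image's shuffled caption is appended as an extra text column, so the
# image-to-text cross-entropy must prefer the correct caption over its shuffle.
import torch
import torch.nn.functional as F

def contrastive_loss_with_hard_negatives(img_emb, txt_emb, neg_txt_emb, temperature=0.07):
    """img_emb, txt_emb, neg_txt_emb: (B, D) L2-normalized embeddings."""
    all_txt = torch.cat([txt_emb, neg_txt_emb], dim=0)        # (2B, D): positives + hard negatives
    logits_i2t = img_emb @ all_txt.t() / temperature          # (B, 2B)
    logits_t2i = txt_emb @ img_emb.t() / temperature          # (B, B)
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    return 0.5 * (F.cross_entropy(logits_i2t, targets) + F.cross_entropy(logits_t2i, targets))
```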
TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with Question Answering
Yushi Hu, Benlin Liu, Jungo Kasai, Yizhong Wang, Mari Ostendorf, Ranjay Krishna, Noah A. Smith
Problems with previous automatic evaluation metrics
Image captioning: SPICE
Insufficient for capturing essential image information
Anderson, P., Fernando, B., Johnson, M., & Gould, S. (2016). Spice: Semantic propositional image caption evaluation. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part V 14 (pp. 382-398). Springer International Publishing.
Problems with previous automatic evaluation metrics
Object detection: DALL-Eval
Good at detecting object attributes and spatial relationships.
Misses background information.
Cho, J., Zala, A., & Bansal, M. (2023). Dall-eval: Probing the reasoning skills and social biases of text-to-image generation models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 3043-3054).
Method
Divide-and-Conquer approach: TIFA
Step 1: Question-Answer Generation (GPT-3)
Step 2: Question Filtering (UnifiedQA)
Step 3: Visual Question Answering (mPLUG-large performs best in this paper)
Method
Step 1: Question-Answer Generation
Element category classification
Question generation conditioned on elements
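A hedged sketch of Step 1; the paper prompts GPT-3 with in-context examples, which are not reproduced here, so the chat API call, model name, and JSON schema below are stand-in assumptions.

```python
# Sketch of TIFA Step 1: prompt an LLM to extract elements from the caption and
# write one multiple-choice question per element. Model and schema are stand-ins.
import json
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "Given an image caption, list its elements (object, attribute, relation, counting, ...) "
    "and write one multiple-choice question per element with its choices and correct answer. "
    'Return a JSON list of objects with keys "element", "category", "question", "choices", "answer".\n'
    "Caption: {caption}"
)

def generate_questions(caption: str) -> list[dict]:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # stand-in for the GPT-3 engine used in the paper
        messages=[{"role": "user", "content": PROMPT.format(caption=caption)}],
    )
    return json.loads(resp.choices[0].message.content)  # assumes the model returns valid JSON
```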
Method
Step 2: Question Filtering
UnifiedQA
Khashabi, D., Min, S., Khot, T., Sabharwal, A., Tafjord, O., Clark, P., & Hajishirzi, H. (2020). UnifiedQA: Crossing format boundaries with a single QA system. arXiv preprint arXiv:2005.00700.
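A sketch of Step 2, assuming the allenai/unifiedqa-t5-base checkpoint: UnifiedQA answers each generated question from the caption alone, and only questions whose answer matches the expected one are kept. The exact input formatting convention is an assumption.

```python
# Sketch of TIFA Step 2: keep a question only if UnifiedQA, reading just the caption,
# recovers the expected answer. Checkpoint and input formatting are assumptions.
from transformers import T5ForConditionalGeneration, T5Tokenizer

NAME = "allenai/unifiedqa-t5-base"
tokenizer = T5Tokenizer.from_pretrained(NAME)
model = T5ForConditionalGeneration.from_pretrained(NAME)

def unifiedqa_answer(question: str, context: str) -> str:
    text = f"{question} \n {context}".lower()  # UnifiedQA-style "question \n context" input
    ids = tokenizer(text, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=16)
    return tokenizer.decode(out[0], skip_special_tokens=True).strip()

def filter_questions(caption: str, qa_pairs: list[dict]) -> list[dict]:
    return [qa for qa in qa_pairs
            if unifiedqa_answer(qa["question"], caption).lower() == qa["answer"].lower()]
```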
Method
Step 3: Visual Question Answering
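A sketch of Step 3 and the final score: answer each filtered question on the generated image and report the fraction answered correctly. The paper's strongest VQA model is mPLUG-large; the ViLT checkpoint below is a lightweight stand-in.

```python
# Sketch of TIFA Step 3: the TIFA score is the fraction of filtered questions the
# VQA model answers correctly on the generated image. ViLT is a stand-in model.
from PIL import Image
from transformers import pipeline

vqa = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")

def tifa_score(image: Image.Image, qa_pairs: list[dict]) -> float:
    correct = 0
    for qa in qa_pairs:
        pred = vqa(image=image, question=qa["question"])[0]["answer"]
        correct += int(pred.strip().lower() == qa["answer"].strip().lower())
    return correct / max(len(qa_pairs), 1)
```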
Experiments
Shapes, counting, spatial relations, and multiple objects remain challenging.
Limitations
As a divide-and-conquer method, TIFA depends on the quality and coverage of the generated questions, and errors can propagate across its multi-step pipeline (question generation, filtering, and VQA).
Challenges: comprehensive evaluation with human feedback
Scaling up evaluation to video:
Human feedback: more subjective, more error-prone, and more expensive
Recall: Evaluation Metrics for Text-to-Video Generation
Observation:
1. Deficient in advanced understanding of static images.
2. Do FVD/IS truly possess the capability to reason about temporal dynamics?
3. Lacking features to assess the alignment between text and visual content.
Truth: The Real World Embraces Complex Symbolic Concepts.
Key Question: How can we integrate these essential video symbolic concepts?
Answer: Leveraging Visual-Question-Answering models with Point Tracking models
T2VScore: An Evaluation Pipeline with Visual Question Answering
Text Input
Generated Video
VQA Model
GPT Model
Temporal: auxiliary trajectories from Point Tracking Models
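A heavily simplified sketch of the text-alignment half only: sample frames from the generated video and average a per-frame VQA-based alignment score. The temporal-quality branch built on point-tracking trajectories is not shown.

```python
# Simplified sketch of frame-sampled text alignment for a generated video.
# Any per-frame scorer (e.g., the VQA-based sketch earlier) can be plugged in.
from typing import Callable
from PIL import Image

def video_text_alignment(frames: list[Image.Image], prompt: str,
                         frame_score: Callable[[Image.Image, str], float],
                         num_samples: int = 8) -> float:
    """Average a per-frame text-alignment score over uniformly sampled frames."""
    step = max(len(frames) // num_samples, 1)
    sampled = frames[::step][:num_samples]
    return sum(frame_score(f, prompt) for f in sampled) / max(len(sampled), 1)
```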
How to reasonably evaluate an automatic metric?
Automatic Metric Evaluation
Measuring the correlation between the predicted scores and human ratings, with the assistance of the TVGE dataset
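A minimal sketch of this meta-evaluation with SciPy; the numbers below are placeholders, not TVGE data.

```python
# Sketch of metric meta-evaluation: correlate metric scores with human ratings
# using Spearman's rho and Kendall's tau. Toy placeholder values only.
from scipy.stats import kendalltau, spearmanr

metric_scores = [0.82, 0.40, 0.65, 0.91, 0.33]   # placeholder metric outputs
human_ratings = [4.5, 2.0, 3.5, 5.0, 1.5]        # placeholder human ratings

rho, _ = spearmanr(metric_scores, human_ratings)
tau, _ = kendalltau(metric_scores, human_ratings)
print(f"Spearman rho = {rho:.3f}, Kendall tau = {tau:.3f}")
```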
Why does this evaluation pipeline with VQA models work better?
Conclusion
Inspired by these works, we can improve end-to-end VQA-based text-to-visual alignment evaluation.
Relationship of three papers
Conclusion
Evaluation pipeline using VQA:
[Diagram: the text prompt and the generated visual are turned into a question for a VQA model built on a large-model backbone, combining symbolic knowledge with neural understanding in a differentiable way.]
Neural: high-level understanding. Symbolism: low-level concepts.
Previously, neuro-symbolic thinking was difficult to incorporate into model training.
Insight of VQA-based model
Future Work
Thank you
Part 1:
Motivation: We found that current generative models (DALL·E 3, SDXL) cannot generate images that align well with text prompts. To improve generation performance, we first need evaluation metrics. However, we lack a robust metric that evaluates text-to-visual alignment.
Current evaluation methods: a) FID, IS (measure quality only)
b) CLIPScore (fails to produce reliable scores for complex prompts involving composition)
c) human evaluation (e.g., the Pick-a-Pic dataset): noisy, biased, and expensive
Problem: Our aim is to construct a general text-to-visual alignment metric covering both text-to-image and text-to-video models.
Solution: use recent generative VLMs trained for visual question answering (VQA), which can reason compositionally by generating answers based on images and questions.
End-to-end:
1) Generate a question from the text prompt.
2) Compare in the embedding space.
Part 2 Paper 1:
First, we dig into the most popular and widely used evaluation metric, CLIPScore. We chose this paper because we want to discover why CLIPScore performs poorly.
0. Preliminary: what is the CLIP model and what is CLIPScore?
Part 2 Paper 2:
Since CLIPScore lacks compositional ability, it aligns poorly with human preferences, while collecting true human feedback is both expensive and strongly biased. The researchers introduce a reference-free metric that evaluates the faithfulness of the generated image to the text input using a VQA-based method.
Part 2 Paper 3:
While human feedback is invaluable for evaluating text-to-visual alignment, obtaining and scaling it can be costly, especially with complex modalities like videos. The time-consuming and labor-intensive nature of human annotation often introduces subjective bias into the outcomes. To overcome these challenges, researchers have proposed automatic models to assess the quality of text-to-visual alignment.
0. A brief comparison of existing automatic metrics reveals their shortcomings in more complex and comprehensive modalities: they lack a thorough analysis for temporal assessment in videos.
1) Introduce the evaluation pipeline T2VScore.
2) How it scrutinizes the fidelity of the video in representing the given text description - the evaluation perspectives.
3) Where its cleverness over other methods lies - neuro-symbolic design - the potential of VQA in automated evaluation.
2. Automatic Metric Evaluation: How to reasonably evaluate an automatic metric? By measuring the correlation between the predicted scores and human ratings, with the assistance of the TVGE dataset.
Part 3:
Conclusion: