1 of 44

Evaluating the Alignment of Text-to-Visual Generation

Presenters: Yixin Fei, Kewen Wu, Pengliang Ji

Advisors: Deva Ramanan, Zhiqiu Lin

2 of 44

An avocado sitting in a therapist’s chair, saying ‘I just feel so empty inside’ with a pit-sized hole in its center. The therapist, a spoon, scribbles notes.

3 of 44

An avocado sitting in a therapist’s chair, saying ‘I just feel so empty inside’ with a pit-sized hole in its center. The therapist, a spoon, scribbles notes.

4 of 44

Motivation: generated images often fail to align with input text prompts

"A swan with a silver anklet on a crystal lake."

"A snowy owl perched to the right on a frost-covered branch."

"One content rabbit and six tired turtles."

"The dog with a leash sits quietly, the other without a leash runs wildly."

"'Jazz Night' flashing on a neon sign at the entrance to the Music Lounge."

Attribute

Relation

Counting

Negation

Typography

5 of 44

Motivation: how to evaluate alignment robustly?

  • FID: measures only the quality of generated images, not alignment
  • CLIPScore: fails to produce reliable scores for complex prompts
  • Human evaluation: expensive, noisy, and biased (e.g. PickScore[1])

FID Score

CLIPScore

PickScore

6 of 44

Goal

  • Aim: design a general alignment metric for text-to-visual generation, covering both text-to-image and text-to-video models
  • Text-image matching: higher scores reflect greater image-text similarity
  • Text-video matching: additionally focus on temporal aspects

“the moon is over the cow”

temporal

7 of 44

Solution

Compute the alignment score with VQA models in an end-to-end way (see the sketch below)

  • Convert the prompt into a simple question: "Does this figure show {text}?"
  • Expose answer likelihoods rather than generating multiple-choice answers
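A minimal sketch of this end-to-end scoring idea, assuming a generic generative VQA model; `vqa_model` and `tokenizer` are placeholders, not a specific library API:

```python
# Sketch: wrap the prompt in a yes/no question and read off the likelihood of
# "Yes" from an image-conditioned language model that exposes next-token logits.
import torch

QUESTION_TEMPLATE = 'Does this figure show "{text}"? Please answer yes or no.'

def alignment_score(image, text, vqa_model, tokenizer):
    question = QUESTION_TEMPLATE.format(text=text)
    logits = vqa_model(image=image, prompt=question)   # placeholder: (vocab_size,) next-token logits
    probs = torch.softmax(logits, dim=-1)
    yes_id = tokenizer.convert_tokens_to_ids("Yes")
    no_id = tokenizer.convert_tokens_to_ids("No")
    # Normalize over the two answer tokens so the score lies in [0, 1].
    return (probs[yes_id] / (probs[yes_id] + probs[no_id])).item()
```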

8 of 44

When and Why Vision-Language Models Behave Like Bags-Of-Words, and What to Do About It?

Mert Yuksekgonul, Federico Bianchi, Pratyusha Kalluri, Dan Jurafsky, James Zou

Stanford University

ICLR 2023 Oral

✔️ “the white legs and the orange cat”

✖️ “the orange legs and the white cat”

✔️ “the large kite and the black ropes”

✖️ “the black kite and the large ropes”

✔️ “the sandwiches are on the plate”

✖️ “the plate is on the sandwiches”

✔️ “the horse is eating the grass”

✖️ “the grass is eating the horse”

9 of 44

When and Why Vision-Language Models Behave Like Bags-Of-Words, and What to Do About It?

Mert Yuksekgonul, Federico Bianchi, Pratyusha Kalluri, Dan Jurafsky, James Zou

Stanford University

ICLR 2023 Oral

✔️ “the white legs and the orange cat”

✖️ “the orange legs and the white cat”

✔️ “the large kite and the black ropes”

✖️ “the black kite and the large ropes”

✔️ “the sandwiches are on the plate”

✖️ “the plate is on the sandwiches”

✔️ “the horse is eating the grass”

✖️ “the grass is eating the horse”

10 of 44

Preliminary: CLIPScore

Contrastive Language-Image Pre-training
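For reference, a minimal sketch of how CLIPScore is computed from the CLIP model, using the HuggingFace `transformers` checkpoint "openai/clip-vit-base-patch32"; the rescaling factor w = 2.5 follows Hessel et al. (2021):

```python
# CLIPScore = rescaled cosine similarity between CLIP image and text embeddings.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, text: str, w: float = 2.5) -> float:
    inputs = processor(text=[text], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    img_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
    txt_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
    cos = (img_emb * txt_emb).sum(dim=-1).item()
    return w * max(cos, 0.0)  # clip negative similarity to 0, then rescale
```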

11 of 44

Preliminary: CLIPScore

CLIP model

12 of 44

Fine-grained Evaluation of VLMs

When do VLMs perform poorly?

ARO Benchmark

  • Attribution: Color, Size, Shape, Material, Gender, Emotion, Age, State
  • Relation: Prepositional Relation, Verb Relation, Part Relation
  • Order: shuffle nouns and adjectives, shuffle everything but nouns and adjectives, shuffle trigrams, shuffle words within each trigram

13 of 44

Attribution

  • one image
  • one correct caption
  • one reordered caption
  • chance level: 50%

Color

✔️ “the white legs and the orange cat”

✖️ “the orange legs and the white cat”

Size

✔️ “the small building and the wood fence”

✖️ “the wood building and the small fence”

Age

✔️ “the old man and the large bag”

✖️ “the large man and the old bag”

State

✔️ “the fresh sandwich and the sliced roast beef”

✖️ “the sliced roast sandwich and the fresh beef”

14 of 44

Relation

  • one image
  • one correct caption
  • one reordered caption
  • chance level: 50%

Preposition

Verb

Part

✔️ “the eggs is on top of the bread”

✖️ “the bread is on top of the eggs”

✔️ “the horse is eating the grass”

✖️ “the grass is eating the horse”

✔️ “the woman is wearing the hat”

✖️ “the hat is wearing the woman”

15 of 44

Order

  • one image
  • one correct caption
  • four reordered captions
  • chance level: 20%

✔️ “a man with a red helmet on a small moped on a dirt road”

✖️ “a man with a red road on a small moped on a helmet dirt”

✖️ “on man a moped red helmet a small a a on dirt road with”

✖️ “with a man red a helmet small a on on moped a dirt road”

✖️ “on a small moped on a dirt road a red helmet a man with”

(shuffle nouns and adjectives)

(shuffle trigrams)

(shuffle words within each trigram)

(shuffle everything but nouns and adjectives)

16 of 44

Experiments on fine-grained evaluation

  • Most models perform near or below chance level
  • CLIP achieves only 56% on spatial relations and 63% on attributions

More broadly, VLMs fail to represent word order.
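A hedged sketch of how such accuracies are computed under the protocol on the previous slides: a model is counted correct only if it scores the true caption above every perturbed caption for the same image (`score_fn` is a placeholder, e.g. CLIP's image-text similarity):

```python
# ARO-style evaluation: chance is 50% with one negative caption, 20% with four.
def aro_accuracy(examples, score_fn):
    """examples: iterable of (image, true_caption, list_of_perturbed_captions)."""
    correct = 0
    total = 0
    for image, true_caption, perturbed in examples:
        true_score = score_fn(image, true_caption)
        if all(true_score > score_fn(image, neg) for neg in perturbed):
            correct += 1
        total += 1
    return correct / total
```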

17 of 44

A Critique of Contrastive Pretraining (CLIP)

Why does CLIPScore perform poorly?

18 of 44

A Critique of Contrastive Pretraining (CLIP)

Why does CLIPScore perform poorly?

CLIP pre-trains for the task of image-text retrieval on a noisy web-scale dataset

Large pre-training datasets:

  • cover a large conceptual and semantic space
  • rarely contain images whose captions share the same words in a different order

19 of 44

A Critique of Contrastive Pretraining (CLIP)

Why does CLIPScore perform poorly?

Large pre-training datasets:

  • cover a large conceptual and semantic space
  • rarely contain images whose captions share the same words in a different order
  • Behaving like a bag-of-words becomes a high-reward strategy
  • Neural networks are prone to exploiting shortcut strategies
  • CLIPScore is therefore not an ideal alignment metric

20 of 44

A Simple Fix with Hard Negatives

What can be done to alleviate these issues?

  • Motivation: make CLIP sensitive to word order
  • Methods:
    • generate hard negative captions (red) by swapping words within the original captions (white)
    • sample strong alternative images (blue) that are similar to the original images (white)
  • Limitation: still contrastive learning; this motivates the move to VQA models (see the sketch below)
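A minimal sketch (not the authors' exact code) of the hard-negative idea: each batch is augmented with word-swapped negative captions so the contrastive loss must separate a caption from its reordered version:

```python
import torch
import torch.nn.functional as F

def neg_clip_loss(image_emb, text_emb, neg_text_emb, temperature=0.07):
    """image_emb, text_emb, neg_text_emb: (B, D); neg_text_emb are word-swapped hard negatives."""
    image_emb = F.normalize(image_emb, dim=-1)
    all_text = F.normalize(torch.cat([text_emb, neg_text_emb], dim=0), dim=-1)  # (2B, D)
    logits_i2t = image_emb @ all_text.t() / temperature                          # (B, 2B)
    targets = torch.arange(image_emb.size(0), device=image_emb.device)           # correct caption = index i
    loss_i2t = F.cross_entropy(logits_i2t, targets)
    # Text-to-image direction uses only the original captions in this sketch.
    logits_t2i = F.normalize(text_emb, dim=-1) @ image_emb.t() / temperature
    loss_t2i = F.cross_entropy(logits_t2i, targets)
    return (loss_i2t + loss_t2i) / 2
```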

21 of 44

TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with Question Answering

Yushi Hu, Benlin Liu, Jungo Kasai, Yizhong Wang, Mari Ostendorf, Ranjay Krishna, Noah A. Smith

22 of 44

Problems in previous automatic evaluation metrics

Image captioning: SPICE

Insufficient for capturing essential image information

Anderson, P., Fernando, B., Johnson, M., & Gould, S. (2016). Spice: Semantic propositional image caption evaluation. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part V 14 (pp. 382-398). Springer International Publishing.

23 of 44

Problems in previous automatic evaluation metrics

Object Detection: DALL-Eval

Good at detecting object attributes and spatial relationships.

Missing background information.

Cho, J., Zala, A., & Bansal, M. (2023). Dall-eval: Probing the reasoning skills and social biases of text-to-image generation models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 3043-3054).

24 of 44

Method

Divide-and-Conquer approach: TIFA

Step 1: Question-Answer Generation (GPT-3)

Step 2: Question Filtering (UnifiedQA)

Step 3: Visual Question Answering (mPLUG-large performs best in this paper)

25 of 44

Method

Step 1: Question-Answer Generation

26 of 44

Method

Step 1: Question-Answer Generation

Element Category classification

27 of 44

Method

Step 1: Question-Answer Generation

Element Category classification

Question generation conditioned on elements
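A hypothetical sketch of this step; `llm` is a placeholder for a GPT-3-style completion function, and the prompt text is illustrative rather than the paper's exact few-shot prompt:

```python
import json

QG_PROMPT = """Extract the elements (object, attribute, activity, count, color,
spatial relation, ...) from the caption, then write one multiple-choice question
per element whose answer is verifiable from an image of the caption.
Return JSON: [{{"element": ..., "category": ..., "question": ...,
"choices": [...], "answer": ...}}]

Caption: {caption}
"""

def generate_qa_pairs(caption: str, llm) -> list[dict]:
    response = llm(QG_PROMPT.format(caption=caption))  # placeholder LLM call
    return json.loads(response)                        # question/answer dicts for later filtering
```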

28 of 44

Method

Step 2: Question Filtering

Khashabi, D., Min, S., Khot, T., Sabharwal, A., Tafjord, O., Clark, P., & Hajishirzi, H. (2020). Unifiedqa: Crossing format boundaries with a single qa system. arXiv preprint arXiv:2005.00700.

UnifiedQA
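A hedged sketch of the filtering idea: UnifiedQA answers each generated question from the caption text alone, and the question is kept only when that answer matches the expected one (`unified_qa` is a placeholder interface):

```python
def filter_questions(caption: str, qa_pairs: list[dict], unified_qa) -> list[dict]:
    kept = []
    for qa in qa_pairs:
        # Answer the question using only the caption text (no image involved).
        text_only_answer = unified_qa(question=qa["question"], context=caption)
        if text_only_answer.strip().lower() == str(qa["answer"]).strip().lower():
            kept.append(qa)  # the question is answerable from, and faithful to, the caption
    return kept
```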

29 of 44

Method

Step 3: Visual Question Answering
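A hypothetical sketch of the final scoring loop, assuming the filtered question-answer pairs from the previous steps; `vqa_model` and its `multiple_choice_answer` method are placeholders for any VQA model (the paper finds mPLUG-large works best):

```python
def tifa_score(image, qa_pairs, vqa_model):
    """qa_pairs: list of dicts with 'question', 'choices', 'answer' keys."""
    correct = 0
    for qa in qa_pairs:
        pred = vqa_model.multiple_choice_answer(image, qa["question"], qa["choices"])
        correct += int(pred == qa["answer"])
    return correct / len(qa_pairs)  # faithfulness score = VQA accuracy over all questions
```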

30 of 44

Experiments

31 of 44

Experiments

  • Generating images from captions vs. free-form text.

  • What elements are text-to-image models struggling with?

Shapes, counting, spatial relations, and multiple objects.

32 of 44

Limitations

As a Divide-and-Conquer method:

  • Time-consuming

  • Fails to generate meaningful questions for complex text prompts.

33 of 44

34 of 44

Challenges: comprehensive evaluation with human feedback

Scaling evaluation up to video:

  1. Exponential growth of visual content
  2. Significant domain gap with the real world
  3. Ambiguous temporal judgment

Human feedback: more subjective, more error-prone, and more expensive

35 of 44

Recall: Evaluation Metrics for Text-to-video Generation

Observations:

1. Deficient in advanced understanding of static images.

2. Do FVD/IS truly possess the capability to reason about temporal dynamics?

3. Lack of features to assess the alignment between text and visuals.

Truth: the real world embraces complex symbolic concepts.

Key question: how can we integrate these essential video symbolic concepts?

Answer: leveraging Visual-Question-Answering models together with Point Tracking models

36 of 44

T2VScore: An Evaluation Pipeline with Visual Question Answering

Pipeline (figure): the text input and the generated video are processed by a GPT model and a VQA model; the temporal dimension is assessed with auxiliary trajectories from Point Tracking Models.
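A loose, simplified sketch of the alignment branch of such a pipeline (the temporal branch with point-tracking trajectories is omitted); `generate_qa_pairs`, `sample_frames`, and `vqa_model` are placeholders:

```python
def text_to_video_alignment(text, video, llm, vqa_model, num_frames=8):
    qa_pairs = generate_qa_pairs(text, llm)       # LLM-generated questions from the prompt
    frames = sample_frames(video, num_frames)     # uniform temporal sampling (assumption)
    correct = 0
    for qa in qa_pairs:
        answers = [vqa_model(frame, qa["question"]) for frame in frames]
        # Majority vote across frames; a real pipeline can also use temporal cues
        # such as point-tracking trajectories, omitted in this sketch.
        majority = max(set(answers), key=answers.count)
        correct += int(majority == qa["answer"])
    return correct / len(qa_pairs)
```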

37 of 44

How to reasonably evaluate an automatic metric?

Automatic Metric Evaluation

Measuring the correlation between the predicted scores and human ratings, with the assistance of the TVGE dataset
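A minimal sketch of this meta-evaluation, assuming parallel lists of metric scores and human ratings:

```python
# Rank correlations between a metric's scores and human ratings: higher means
# the automatic metric agrees more closely with human judgment.
from scipy.stats import kendalltau, spearmanr

def meta_evaluate(metric_scores, human_ratings):
    rho, _ = spearmanr(metric_scores, human_ratings)
    tau, _ = kendalltau(metric_scores, human_ratings)
    return {"spearman": rho, "kendall": tau}
```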

38 of 44

Why does this evaluation pipeline with VQA models work better?

  • High-quality data drives the construction of powerful language models
  • Comprehensive evaluation perspectives drive robust performance metrics

39 of 44

Conclusion

Inspired by these works, we can improve end-to-end VQA-based text-to-visual alignment evaluation.

Relationship of the three papers

  • CLIPScore is a pioneering metric for assessing text-to-visual alignment, but it has a limited understanding of composition.
  • TIFA introduces a Divide-and-Conquer VQA-based text-to-image evaluation, but GPT struggles to produce meaningful questions covering complex perspectives.
  • T2VScore proposes a VQA-based pipeline for text-to-video alignment with rich perspectives, but its accuracy computation operates only at the post-decoder level.

40 of 44

Conclusion

Evaluation pipeline using VQA:

  • Large model backbone: deep understanding of visual information at the neural level.
  • Visual question answering process: comprehensive understanding of symbolic representation.

Diagram: insight of the VQA-based model. Text and the generated visual form a question for the VQA model; the large model backbone contributes neural understanding (high-level), while the question-answering process contributes symbolic knowledge (low-level), and the whole pipeline is differentiable. Previously, neuro-symbolic thinking was difficult to incorporate into model training.

41 of 44

Future Works

  • End-to-end differentiable evaluation with VQA models.
    • Embed text prompts within a simple question template.
    • Compute the distance between embeddings prior to the answer decoder.
  • Extending evaluations with VQA models to video or 3D generation.
  • Fine-tuning image/video generative models with the output of the evaluation models.

42 of 44

Thank you

43 of 44

44 of 44

Part 1:

Motivation: We found that current generative models (DALL-E 3, SDXL) cannot generate images that align well with text prompts. To improve generation performance, we first need evaluation metrics. However, we lack a robust metric that evaluates text-to-visual alignment.

Current evaluation methods: a) FID, IS (only measure quality)

b) CLIPScore (fails to produce reliable scores for complex prompts involving composition)

c) human evaluation (expensive and noisy): the Pick-a-Pic dataset is noisy, biased, and expensive to collect

Problem: Our aim is to construct a general alignment metric for text-to-visual generation, covering text-to-image and text-to-video models.

Solution: use recent generative VLMs trained for visual-question-answering (VQA), which can reason compositionally by generating answers based on images and questions

End-to-end:

1) Generate a question from the prompt

2) Compare in the embedding space

Part 2 Paper 1:

We first dig into the most popular and widely used evaluation metric, CLIPScore. We chose this paper because we want to understand why CLIPScore performs poorly.

0. Preliminary: what is the CLIP model and what is CLIPScore?

  1. Propose a new dataset for fine-grained evaluation of VLMs’ relation, attribution, and order understanding
  2. Experiment: many models fail to perform beyond chance level at simple tasks requiring compositional understanding (including CLIP)
  3. Critique of CLIP and CLIPScore: why they perform poorly on composition and how to improve them

Part2 Paper 2:

Since CLIPScore lacks compositional ability, it aligns poorly with human preferences, while collecting true human feedback is both expensive and strongly biased. The researchers therefore introduce a reference-free metric that evaluates the faithfulness of the image output to the text input with a VQA-based method.

Part 2 Paper 3:

While human feedback is invaluable for evaluating text-to-visual alignment, obtaining and scaling it can be costly, especially with complex modalities like videos. The time-consuming and labor-intensive nature of human annotation often introduces subjective bias into the outcomes. To overcome these challenges, researchers have proposed automatic models to assess the quality of text-to-visual alignment.

0. A brief comparison of existing automatic metrics reveals their shortcomings in more complex and comprehensive modalities: they lack a thorough analysis of temporal assessment in videos.

1. 1) Introduce the T2VScore evaluation pipeline.

2) How it scrutinizes the fidelity of the video in representing the given text description: the evaluation perspectives.

3) Where its cleverness over other methods lies: a neuro-symbolic design and the potential of VQA in automated evaluation.

2. Automatic Metric Evaluation: How can we reasonably evaluate an automatic metric? By measuring the correlation between the predicted scores and human ratings, with the assistance of the TVGE dataset.

Part3:

Conclusion: