1 of 30

Evaluation of Text to Image Models

Wissal and Moosa

2 of 30

Text-to-image generation models

3 of 30

Text-to-image generation models

4 of 30

Evaluation of Text-to-image Models

5 of 30

Challenges of Text-to-image Models

  • How can we validate the accuracy of image generation from text?

  • What's the best way to benchmark the performance of text-to-image models?

  • How can we reliably translate textual abstractions into visual representations?

6 of 30

Motivation

Existing evaluation metrics for text-to-image models:

  • CLIPScore — Limits: object counting, compositional reasoning
  • CLIP-R — Limits: works only on synthesized text
  • DALL-EVAL — Limits: measures faithfulness on limited axes (object, counting, color, spatial relation), missing elements such as material, shape, activities, and context
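For reference, CLIPScore reduces to a rescaled cosine similarity between CLIP image and text embeddings. A minimal sketch, assuming embeddings are already available (the vectors below are toy stand-ins; real scores use a CLIP encoder, and the 2.5 rescaling factor follows the original CLIPScore formulation):

```python
import numpy as np

def clipscore(image_emb: np.ndarray, text_emb: np.ndarray, w: float = 2.5) -> float:
    """CLIPScore: w * max(cos(image_emb, text_emb), 0).

    The embeddings are placeholders; in practice they come from a CLIP model.
    """
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_emb = text_emb / np.linalg.norm(text_emb)
    return max(w * float(image_emb @ text_emb), 0.0)

# Toy vectors standing in for real CLIP embeddings.
print(clipscore(np.array([1.0, 0.0]), np.array([1.0, 1.0])))
```

Because it collapses the whole caption and image into two vectors, a single score cannot say *which* part of the prompt was missed — which motivates the question-answering approaches below.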

7 of 30

TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with Question Answering

8 of 30

How does it work?

The TIFA framework employs Visual Question Answering (VQA) as a method to assess image faithfulness. By asking and answering questions about the content of the generated images, it quantitatively measures how well the images align with the original text descriptions.
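A minimal sketch of this scoring loop, assuming a hypothetical `vqa_model(image, question)` callable (in TIFA the question-answer pairs are generated from the caption by a language model and answered by a VQA model; the exact-match stub below is illustrative only):

```python
from typing import Callable

def tifa_score(image, qa_pairs, vqa_model: Callable) -> float:
    """Faithfulness as the fraction of text-derived questions the VQA
    model answers correctly on the generated image (TIFA-style)."""
    if not qa_pairs:
        return 0.0
    correct = sum(
        1 for question, answer in qa_pairs
        if vqa_model(image, question).strip().lower() == answer.strip().lower()
    )
    return correct / len(qa_pairs)

# Toy VQA stub: "sees" a motorcycle that is blue, but no doors.
stub = lambda img, q: {"is there a motorcycle?": "yes",
                       "what color is the motorcycle?": "blue"}.get(q.lower(), "no")
qa = [("Is there a motorcycle?", "yes"),
      ("What color is the motorcycle?", "blue"),
      ("Are there doors?", "yes")]
print(tifa_score(None, qa, stub))  # 2 of 3 questions answered correctly
```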

9 of 30

How does it work?

Why?

The benchmark serves to standardize the evaluation of text-to-image models.

10 of 30

How does it work?

11 of 30

Results

12 of 30

Thoughts

  • TIFA introduces a novel, interpretable evaluation method for text-to-image generation using question answering
  • The TIFA v1.0 benchmark provides a standardized set of diverse text inputs and question-answer pairs for consistent model evaluation across the community.

  • Dependence on VQA models could introduce biases or errors inherent in those models, potentially affecting the accuracy of TIFA evaluations.

13 of 30

Davidsonian Scene Graph

14 of 30

QG/A: New Paradigm in T2I Alignment Eval

15 of 30

Reliability Issues in Existing QG/A Methods

16 of 30

Solution: Implement QG steps as a DAG

17 of 30

Implement QG steps as a DAG

  • Nodes represent unique questions.
  • Edges represent semantic dependencies.

Four types of atomic propositions:

  • Entities (1-tuple)
  • Attributes (2-tuple)
  • Relationships (3-tuple)
  • Globals (1-tuple)
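The dependency-gated traversal can be sketched as follows. The DAG below is a hypothetical hand-built graph for the slide's example prompt; the node wording and scoring are illustrative, not the paper's exact implementation:

```python
# Hypothetical DSG for "A blue motorcycle parked by paint chipped doors".
# Nodes are atomic questions; an edge from parent to child means the child
# only makes sense if the parent holds (e.g. don't ask the motorcycle's
# color if there is no motorcycle).
dag = {
    "Is there a motorcycle?": ["Is the motorcycle blue?",
                               "Is the motorcycle parked by the doors?"],
    "Are there doors?": ["Is the paint on the doors chipped?",
                         "Is the motorcycle parked by the doors?"],
    "Is the motorcycle blue?": [],
    "Is the paint on the doors chipped?": [],
    "Is the motorcycle parked by the doors?": [],
}

def dsg_score(answers: dict) -> float:
    """Dependency-aware scoring sketch: count a child question only if
    every parent question was answered 'yes' (otherwise it is invalid)."""
    parents = {q: [p for p, kids in dag.items() if q in kids] for q in dag}
    asked, correct = 0, 0
    for q in dag:
        if any(answers.get(p) != "yes" for p in parents[q]):
            continue  # a dependency failed: skip this (invalid) question
        asked += 1
        correct += answers.get(q) == "yes"
    return correct / asked if asked else 0.0

# VQA says there are doors with chipped paint, but no motorcycle:
print(dsg_score({"Are there doors?": "yes",
                 "Is the paint on the doors chipped?": "yes"}))
```

The gating is the key design choice: without it, a model could be penalized (or rewarded) on questions like "Is the motorcycle blue?" that are meaningless when the motorcycle is absent.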

18 of 30

Implement QG steps as a DAG

19 of 30

Implement QG steps as a DAG

20 of 30

Implement QG steps as a DAG

“A blue motorcycle parked by paint chipped doors”

21 of 30

Dataset - DSG 1k

22 of 30

Summarized Results

  • DSG addresses the aforementioned reliability issues

  • VQA models still fall short within the QG/A framework in some semantic categories (e.g., text rendering)

23 of 30

Appendix 1

24 of 30

Element Extraction

25 of 30

Element Extraction with Different Models

26 of 30

Appendix 2

27 of 30

Evaluation of Generated Questions

28 of 30

Evaluation of different VQA models

29 of 30

VQA models work differently on different scenes

30 of 30

VQA models work differently on different scenes
