Evaluation of Text to Image Models
Wissal and Moosa
Text-to-image generation models
Evaluation of Text-to-image Models
Challenges of Text-to-image Models
Motivation
Limits of current models: object counting, compositional reasoning
Evaluation of text-to-image models
CLIPScore
CLIP-R
DALL-EVAL
Limits: works only on synthesized text; measures faithfulness on limited axes (object, counting, color, spatial relation), missing elements such as material, shape, activities, and context.
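For reference, the CLIPScore metric listed above can be sketched as follows. This is a minimal illustration of the published formula (a rescaled, clipped cosine similarity between CLIP image and text embeddings, with weight w = 2.5); the dummy vectors stand in for real CLIP features.

```python
import numpy as np

def clip_score(image_emb, text_emb, w=2.5):
    """CLIPScore = w * max(cos(image_emb, text_emb), 0)."""
    cos = np.dot(image_emb, text_emb) / (
        np.linalg.norm(image_emb) * np.linalg.norm(text_emb))
    return w * max(cos, 0.0)

# Dummy embeddings at a 45-degree angle, in place of real CLIP features.
v = np.array([1.0, 0.0])
c = np.array([1.0, 1.0])
print(round(clip_score(v, c), 3))  # 2.5 * cos(45°) ≈ 1.768
```

Because the score is a single global similarity, it cannot localize which part of the prompt the image gets wrong, which is the gap the QA-based metrics below address.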
TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with Question Answering
How does it work?
The TIFA framework employs Visual Question Answering (VQA) as a method to assess image faithfulness. By asking and answering questions about the content of the generated images, it quantitatively measures how well the images align with the original text descriptions.
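The scoring loop described above can be sketched as follows. The helper names are hypothetical (not the official TIFA implementation): QA pairs are generated from the prompt, a VQA model answers each question about the image, and the score is the fraction of matching answers.

```python
def tifa_score(qa_pairs, vqa_answer):
    """qa_pairs: list of (question, expected answer); vqa_answer: callable
    mapping a question string to the VQA model's answer string."""
    if not qa_pairs:
        return 0.0
    correct = sum(
        1 for question, expected in qa_pairs
        if vqa_answer(question).strip().lower() == expected.strip().lower()
    )
    return correct / len(qa_pairs)

# Toy stand-in for a real VQA model: the image shows a red motorcycle.
toy_vqa = {
    "What vehicle is shown?": "Motorcycle",
    "What color is the motorcycle?": "red",
}.__getitem__

qa = [
    ("What vehicle is shown?", "motorcycle"),
    ("What color is the motorcycle?", "blue"),
]
print(tifa_score(qa, toy_vqa))  # 0.5: one of the two answers matches
```

Unlike a single similarity score, each wrong answer points at the specific element of the prompt the image failed to render.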
Why?
The benchmark serves to standardize the evaluation of text-to-image models.
Results
Thoughts
Dependence on VQA models could introduce biases or errors inherent in those models, potentially affecting the accuracy of the TIFA evaluations.
Davidsonian Scene Graph
QG/A (Question Generation/Answering): A New Paradigm in T2I Alignment Evaluation
Reliability Issues in Existing QG/A Methods
Solution: Implement QG steps as a DAG
Worked example: the prompt below decomposes into 4 atomic propositions.
“A blue motorcycle parked by paint chipped doors”
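A minimal sketch of the DAG idea on this prompt, assuming hypothetical helper names (not the paper's code): each yes/no question may depend on a parent, and if the parent fails, the child is marked as failed without querying the VQA model, so it cannot receive a hallucinated "yes".

```python
# Question DAG for "A blue motorcycle parked by paint chipped doors":
# (id, question, parent id or None)
QUESTIONS = [
    ("q1", "Is there a motorcycle?", None),
    ("q2", "Is the motorcycle blue?", "q1"),
    ("q3", "Are there doors?", None),
    ("q4", "Is the paint on the doors chipped?", "q3"),
]

def dsg_score(questions, vqa_yes):
    """vqa_yes: callable mapping a question string to a yes/no bool.
    Returns the fraction of atomic propositions judged to hold."""
    answers = {}
    for qid, question, parent in questions:
        if parent is not None and not answers.get(parent, False):
            answers[qid] = False  # parent failed -> child fails automatically
        else:
            answers[qid] = vqa_yes(question)
    return sum(answers.values()) / len(answers)

# Toy VQA: the image shows a red motorcycle next to paint-chipped doors.
toy_vqa_yes = {
    "Is there a motorcycle?": True,
    "Is the motorcycle blue?": False,
    "Are there doors?": True,
    "Is the paint on the doors chipped?": True,
}.__getitem__

print(dsg_score(QUESTIONS, toy_vqa_yes))  # 0.75: 3 of 4 propositions hold
```

The dependency check is the point of the DAG: "Is the motorcycle blue?" is never posed to the VQA model when no motorcycle was detected, removing one source of unreliable answers.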
Dataset: DSG-1k
Summarized Results
Appendix 1
Element Extraction
Element extraction with different models
Appendix 2
Evaluation of Generated Questions
Evaluation of different VQA models
VQA models perform differently on different scene types