Why is Winoground Hard? Investigating Failures in Visuolinguistic Compositionality
Anuj Diwan*, Layne Berry*, Eunsol Choi, David Harwath, Kyle Mahowald
University of Texas at Austin
Talk Overview
Background: The Winoground Visuolinguistic Compositionality Benchmark
Caption C0: “An old person kisses a young person.”
Caption C1: “A young person kisses an old person.”
[Figure: each caption shown with its matching image, I0 and I1]
Background: The Winoground Visuolinguistic Compositionality Benchmark
Each example pairs two captions (C0, C1) with two images (I0, I1); both captions contain the same words, reordered. Given a model similarity score s(C, I):
Text Score = 1 iff s(C0, I0) > s(C1, I0) and s(C1, I1) > s(C0, I1)
Image Score = 1 iff s(C0, I0) > s(C0, I1) and s(C1, I1) > s(C1, I0)
Group Score = 1 iff Text Score and Image Score are both 1
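In code, these metrics reduce to six pairwise comparisons. A minimal sketch, assuming a hypothetical `score(caption, image)` similarity function standing in for s(C, I) (the function name and signature are illustrative, not from the paper):

```python
def winoground_scores(score, c0, c1, i0, i1):
    """Winoground metrics for one example, given a similarity
    function score(caption, image) -> float (hypothetical)."""
    # Text score: each image must be more similar to its own caption.
    text = score(c0, i0) > score(c1, i0) and score(c1, i1) > score(c0, i1)
    # Image score: each caption must be more similar to its own image.
    image = score(c0, i0) > score(c0, i1) and score(c1, i1) > score(c1, i0)
    # Group score: both criteria must hold at once.
    return {"text": int(text), "image": int(image), "group": int(text and image)}
```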
Talk Overview
Models of Interest
CLIP
151M parameters
400M image-caption pairs
(Radford & Kim et al., 2021)
LXMERT
207M parameters
0.18M images, 9.18M captions
(Tan & Bansal, 2019)
UNITER
86M parameters
4.2M images; 9.58M captions
(Chen, Li, & Yu et al., 2020)
SOTA VL Models Fail Miserably on Winoground
Talk Overview
Analyzing the dataset: New annotated tags!
“The cat on the left of the photo has its right paw ahead of its left.”
“The cat on the left of the photo has its left paw ahead of its right.”
Tag | This item
NonCompositional |
AmbiguouslyCorrect |
VisuallyDifficult | ✓
UnusualImage |
UnusualText |
ComplexReasoning | ✓
[(A) The original Winoground task; (B) the same item with new tags]
Non-Compositional Items (n=30)
“Shedding its leaves.”
“Leaves its shedding.”
Ambiguously Correct Items (n=46)
“The person with the kids is sitting.”
“The person is sitting with the kids.”
Visually Difficult Items (n=38)
“The person with hair to their shoulders has brown eyes and the other person’s eyes are blue.”
“The person with hair to their shoulders has blue eyes and the other person’s eyes are brown.”
Items with Unusual Images (n=56)
“The orange lollipop is sad and the red lollipop is surprised.”
“The orange lollipop is surprised and the red lollipop is sad.”
Items with Unusual Text (n=50)
“The brave in the face of fear.”
“Fear in the face of the brave.”
Items Requiring Complex Reasoning (n=78)
“The cup on the left is filled first and the cup on the right is filled second.”
“The cup on the left is filled second and the cup on the right is filled first.”
Items Directly Measuring Compositionality (n=171)
“There is a mug in some grass.”
“There is some grass in a mug.”
Talk Overview
Analyzing the evaluation criteria
We relax the evaluation criteria in two ways: (1) Recall @ k and (2) fine-tuning probes
Retrieval: Recall @ k
Recall @ k (T2I) = % of texts for which the correct image match is in the top k retrievals
Recall @ k (I2T) = % of images for which the correct text match is in the top k retrievals
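A minimal sketch of both directions, assuming the pairwise similarities are precomputed into a square matrix whose diagonal holds the ground-truth matches (a data-layout assumption, not the paper's code):

```python
import numpy as np

def recall_at_k(sim: np.ndarray, k: int):
    """Recall@k from a similarity matrix where sim[i, j] is the score
    between text i and image j, and image i is the match for text i."""
    n = sim.shape[0]
    top_t2i = np.argsort(-sim, axis=1)[:, :k]    # top-k images per text
    top_i2t = np.argsort(-sim.T, axis=1)[:, :k]  # top-k texts per image
    r_t2i = np.mean([i in top_t2i[i] for i in range(n)])
    r_i2t = np.mean([i in top_i2t[i] for i in range(n)])
    return r_t2i, r_i2t
```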
Training a probe on Winoground
Target task: Train a single non-linear binary classification probe that takes two inputs, the model's embeddings of the two candidate image-caption pairings, and must output the correct choice (class 0 here)
Control task ('Random baseline'): Same as above but trained with labels swapped for a random 50% of the dataset
Dataset: Winoground (400 examples) split into train set (300) and test set (100)
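A minimal sketch of this setup with scikit-learn; the embedding values, probe width, and iteration count are placeholders (the slide specifies only the 300/100 split, a non-linear probe, and the 50%-swap control):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 512))   # placeholder: embeddings of the two candidate pairings
y = rng.integers(0, 2, size=400)  # which slot holds the correct pairing

# Control task ("random baseline"): labels swapped for a random 50% of the
# dataset, so accuracy above control reflects signal, not memorization.
flip = rng.random(400) < 0.5
y_control = np.where(flip, 1 - y, y)

X_tr, X_te = X[:300], X[300:]
for name, labels in [("target", y), ("control", y_control)]:
    probe = MLPClassifier(hidden_layer_sizes=(256,), max_iter=500, random_state=0)
    probe.fit(X_tr, labels[:300])
    print(name, "accuracy:", probe.score(X_te, labels[300:]))
```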
Training a probe on Winoground: Results (11 trials)
[Results figure: Text Score Probe and Image Score Probe accuracies for LXMERT and UNITER]
Talk Overview
Analyzing the models
One hypothesis is that the text branch of V-L models is confused by these minimal textual pairs and cannot semantically distinguish them.
We test this by asking whether models can distinguish semantics-preserving augmentations of one caption from variants of the paired caption.
Semantics-preserving augmentations
Can models distinguish caption variants?
Per-item linear separability using SVMs
For each Winoground example (400 in total), learn a separate linear SVM classifier to separate the text embeddings of one caption's variants from the other caption's variants
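A sketch of the per-item check with scikit-learn, assuming a hypothetical `embed(texts)` text encoder that returns one vector per caption variant:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

def per_item_separability(variants_a, variants_b, embed):
    """Cross-validated linear-SVM accuracy at separating the text
    embeddings of one caption's variants from the other's."""
    X = np.vstack([embed(variants_a), embed(variants_b)])
    y = np.array([0] * len(variants_a) + [1] * len(variants_b))
    # High accuracy => the two captions' variants occupy linearly
    # separable regions of the model's text embedding space.
    return cross_val_score(LinearSVC(), X, y, cv=2).mean()
```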
Can models distinguish caption variants?
All-item non-linear separability using probes
Target: Train a single non-linear probe that is given three inputs: (a) two text embeddings of variants X and Y of the same caption, and (b) a text embedding of variant Z of a different caption. The probe must correctly choose Y over Z.
Control: Same as above, but trained with 50% of the matchings swapped
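This mirrors the earlier probing setup; a sketch with placeholder embeddings (the dimensions, probe size, and split are illustrative):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)
x = rng.normal(size=(400, 512))      # placeholder: embedding of variant X
same = rng.normal(size=(400, 512))   # variant Y of the same caption
other = rng.normal(size=(400, 512))  # variant Z of a different caption

y = rng.integers(0, 2, size=400)     # which slot (0 or 1) holds the true variant
slot0 = np.where(y[:, None] == 0, same, other)
slot1 = np.where(y[:, None] == 0, other, same)
X = np.hstack([x, slot0, slot1])

# Control: swap 50% of the matchings, as on the slide.
flip = rng.random(400) < 0.5
y_control = np.where(flip, 1 - y, y)

probe = MLPClassifier(hidden_layer_sizes=(256,), max_iter=500, random_state=0)
probe.fit(X[:300], y[:300])
print("target accuracy:", probe.score(X[300:], y[300:]))
```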
Using Caption Variants to Help Models
Two combination strategies: weight the original score with the mean, or with the max, of the variants' scores
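A minimal sketch of both strategies, again assuming a hypothetical `score(caption, image)` function and a list of semantics-preserving `variants` of the original caption (the `alpha` weight and `agg` switch are illustrative knobs):

```python
def combined_score(score, caption, variants, image, alpha=0.5, agg="mean"):
    """Blend the original caption's score with an aggregate of its
    variants' scores against the same image."""
    s_orig = score(caption, image)
    s_new = [score(v, image) for v in variants]
    s_agg = max(s_new) if agg == "max" else sum(s_new) / len(s_new)
    # alpha interpolates between trusting the original caption and the
    # ensemble of its paraphrases.
    return alpha * s_orig + (1 - alpha) * s_agg
```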
Talk Overview
Summary
Recommendations for the Future
arXiv: