1 of 43

Why is Winoground Hard? Investigating Failures in Visuolinguistic Compositionality

Anuj Diwan*, Layne Berry*, Eunsol Choi, David Harwath, Kyle Mahowald

University of Texas at Austin

1

2 of 43

Talk Overview

  1. Background: Winoground
  2. Models of Interest (CLIP, UNITER, LXMERT) and Winoground
  3. Analyzing the dataset
  4. Analyzing the evaluation criteria
  5. Analyzing the models

2

3 of 43

Talk Overview

  • Background: Winoground (Thrush et al., 2022)

3

4 of 43

Background: The Winoground Visuolinguistic Compositionality Benchmark

4

“An old person kisses a young person.”

“A young person kisses an old person.”

5 of 43

Background: The Winoground Visuolinguistic Compositionality Benchmark

Text Score = 1 if, for each image, the model scores the correct caption higher than the incorrect caption

Image Score = 1 if, for each caption, the model scores the correct image higher than the incorrect image

Group Score = 1 if Text Score and Image Score are both 1

5

“An old person kisses a young person.”

“A young person kisses an old person.”
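
A rough sketch of the three metrics defined above, assuming a model exposes a similarity function score(caption, image); the function names and signature are illustrative, not the official Winoground evaluation code.

    # Winoground metrics for one example with captions c0, c1 and images i0, i1,
    # where score(caption, image) is the model's image-text similarity.
    def text_score(score, c0, c1, i0, i1):
        # For each image, the correct caption must outscore the incorrect one.
        return int(score(c0, i0) > score(c1, i0) and score(c1, i1) > score(c0, i1))

    def image_score(score, c0, c1, i0, i1):
        # For each caption, the correct image must outscore the incorrect one.
        return int(score(c0, i0) > score(c0, i1) and score(c1, i1) > score(c1, i0))

    def group_score(score, c0, c1, i0, i1):
        # Both criteria must hold simultaneously.
        return int(text_score(score, c0, c1, i0, i1) and image_score(score, c0, c1, i0, i1))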

9 of 43

Talk Overview

  • Background: Winoground

  • Models of Interest (CLIP, UNITER, LXMERT) and Winoground

9

10 of 43

Models of Interest

10

CLIP: 151M parameters; pretrained on 400M image-caption pairs (Radford & Kim et al., 2021)

LXMERT: 207M parameters; pretrained on 0.18M images and 9.18M captions (Tan & Bansal, 2019)

UNITER: 86M parameters; pretrained on 4.2M images and 9.58M captions (Chen, Li, & Yu et al., 2020)

(Each model is illustrated scoring an image against the caption “A young person kisses an old person.”)

11 of 43

SOTA VL Models Fail Miserably on Winoground

11

12 of 43

Talk Overview

  • Background: Winoground
  • Models of Interest (CLIP, UNITER, LXMERT) and Winoground

  • Analyzing the dataset

12

13 of 43

Analyzing the dataset: New annotated tags!

13

Example item:
“the cat on the left of the photo has its right paw ahead of its left”
“the cat on the left of the photo has its left paw ahead of its right”

New tags: NonCompositional, AmbiguouslyCorrect, VisuallyDifficult, UnusualImage, UnusualText, ComplexReasoning

(A) The original Winoground task… (B) With new tags

14 of 43

Non-Compositional Items (n=30)

14

“Shedding its leaves.”

“Leaves its shedding.”

15 of 43

Ambiguously Correct Items (n=46)

15

“The person with the kids is sitting.”

“The person is sitting with the kids.”

16 of 43

Visually Difficult Items (n=38)

16

“The person with hair to their shoulders has brown eyes and the other person’s eyes are blue.”

“The person with hair to their shoulders has blue eyes and the other person’s eyes are brown.”

17 of 43

Items with Unusual Images (n=56)

17

“The orange lollipop is sad and the red lollipop is surprised.”

“The orange lollipop is surprised and the red lollipop is sad.”

18 of 43

Items with Unusual Text (n=50)

18

“The brave in the face of fear.”

“Fear in the face of the brave.”

19 of 43

Items Requiring Complex Reasoning (n=78)

19

“The cup on the left is filled first and the cup on the right is filled second.”

“The cup on the left is filled second and the cup on the right is filled first.”

20 of 43

Items Directly Measuring Compositionality (n=171)

20

“There is a mug in some grass.”

“There is some grass in a mug.”

21 of 43

Talk Overview

  • Background: Winoground
  • Models of Interest (CLIP, UNITER, LXMERT) and Winoground
  • Analyzing the dataset
    • Takeaway: the Winoground dataset measures harder/different abilities than compositionality alone

21

22 of 43

Talk Overview

  • Background: Winoground
  • Models of Interest (CLIP, UNITER, LXMERT) and Winoground
  • Analyzing the dataset

  • Analyzing the evaluation criteria

22

23 of 43

Analyzing the evaluation criteria

We relax the evaluation criteria in two ways: (1) Recall @ k and (2) training probes on Winoground.

  1. Instead of picking the correct image over the incorrect one conditioned on a caption ("Image Score"), can the model simply retrieve the correct image from the whole dataset, conditioned on that caption? (Recall @ k)
  2. Models see only one image-text pair at a time when outputting a score and cannot compare across pairs. Does training a probe on Winoground that does have such access help?

23

24 of 43

Retrieval: Recall @ k

Recall @ k (T2I) = % of texts for which the correct image match is in the top k retrievals

Recall @ k (I2T) = % of images for which the correct text match is in the top k retrievals
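
A minimal sketch of the Recall @ k computation, assuming the pairwise similarities have already been collected into a matrix whose diagonal holds the correct matches; the matrix size (800, for 400 examples with two captions/images each) and the variable names are assumptions.

    import numpy as np

    def recall_at_k(sim, k):
        # sim: (num_texts, num_images) similarity matrix with sim[i, i] the
        # correct match; pass sim.T for the image-to-text direction.
        ranks = np.argsort(-sim, axis=1)          # images sorted by descending score
        hits = [i in ranks[i, :k] for i in range(sim.shape[0])]
        return float(np.mean(hits))

    sim = np.random.rand(800, 800)                # placeholder scores, illustrative only
    print(recall_at_k(sim, k=5), recall_at_k(sim.T, k=5))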

24

25 of 43

Training a probe on Winoground

Target task: Train a single non-linear binary classification probe that takes two inputs:

  1. Joint embedding of the correct pair (e.g., caption 0 paired with image 0)
  2. Joint embedding of the incorrect pair (e.g., caption 1 paired with image 0)

and must output which pair is the correct one (class 0 here); see the sketch below.

Control task ('Random baseline'): The same as above, but trained with labels swapped for a random 50% of the dataset.

Dataset: Winoground (400 examples), split into a train set (300) and a test set (100).
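
A sketch of such a probe, assuming each image-text pair has already been encoded into a joint embedding of dimension d (e.g. by UNITER or LXMERT); the architecture and hyperparameters are illustrative, not the exact probe from the paper.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    d = 768  # assumed joint-embedding dimension

    probe = nn.Sequential(
        nn.Linear(2 * d, 256),   # input: [correct-pair embedding ; incorrect-pair embedding]
        nn.ReLU(),
        nn.Linear(256, 2),       # output: which of the two pairs is correct
    )

    def probe_loss(pair_a, pair_b, labels):
        # pair_a, pair_b: (batch, d) joint embeddings; labels: 0 if pair_a is
        # the correct pair, 1 if pair_b is. The control task trains the same
        # probe with labels flipped for a random 50% of items.
        logits = probe(torch.cat([pair_a, pair_b], dim=-1))
        return F.cross_entropy(logits, labels)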

25

26 of 43

Training a probe on Winoground: Results (11 trials)

26

(Figure: Text Score Probe and Image Score Probe results for LXMERT and UNITER over 11 trials.)

27 of 43

Talk Overview

  • Background: Winoground
  • Models of Interest (CLIP, UNITER, LXMERT) and Winoground
  • Analyzing the dataset
  • Analyzing the evaluation criteria
    • Takeaway 1: Relaxing the strict matching criterion in Winoground reveals new, interesting differences between models
    • Takeaway 2: Surprisingly, training probes on Winoground doesn't seem to help performance

27

28 of 43

Talk Overview

  • Background: Winoground
  • Models of Interest (CLIP, UNITER, LXMERT) and Winoground
  • Analyzing the dataset
  • Analyzing the evaluation criteria

  • Analyzing the models

28

29 of 43

Analyzing the models

One hypothesis is that the text branch of V-L models is confused by these minimal textual pairs and cannot semantically distinguish them.

By using semantics-preserving augmentations of each text, we found that

  1. The text branch actually can distinguish these pairs, but
  2. Explicitly using this information still doesn't help performance on Winoground

29

30 of 43

Semantics-preserving augmentations

  • We manually selected 9 augmentation strategies from NL-Augmenter (Dhole et al., 2021) that we found most likely to preserve caption semantics
  • Augmented captions (i.e., caption variants) are no longer minimal textual pairs; a rough illustration follows this list.
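
The nine strategies themselves come from NL-Augmenter; as a loose stand-in, the sketch below applies two simple perturbations that keep a caption's meaning intact. These transformations are illustrative only and are not the strategies used in the paper.

    import re

    CONTRACTIONS = {"isn't": "is not", "aren't": "are not", "it's": "it is"}

    def expand_contractions(caption):
        for short, full in CONTRACTIONS.items():
            caption = re.sub(re.escape(short), full, caption, flags=re.IGNORECASE)
        return caption

    def drop_final_period(caption):
        return caption.rstrip(".")

    def caption_variants(caption):
        # Every variant should mean the same thing as the original caption.
        return [caption, expand_contractions(caption), drop_final_period(caption)]

    print(caption_variants("An old person kisses a young person."))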

30

31 of 43

Can models distinguish caption variants?

Per-item linear separability using SVMs

For each Winoground example (400 in total), learn a separate linear SVM classifier to discriminate:

  • Target task: embeddings of caption 0 variants from embeddings of caption 1 variants
  • Control task: two random, disjoint subsets of the union of caption 0 and caption 1 variants (see the sketch below)
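
A sketch of the per-item separability test, assuming the caption variants have already been embedded by the model's text branch into arrays of shape (num_variants, d); the cross-validation setup is an illustrative choice.

    import numpy as np
    from sklearn.svm import LinearSVC
    from sklearn.model_selection import cross_val_score

    def separability(emb0, emb1, control=False):
        # emb0 / emb1: embeddings of caption 0 / caption 1 variants for one item.
        X = np.concatenate([emb0, emb1])
        y = np.array([0] * len(emb0) + [1] * len(emb1))
        if control:
            # Control task approximated here by shuffling labels, i.e. a random
            # partition of the pooled variants into two disjoint subsets.
            y = np.random.permutation(y)
        return cross_val_score(LinearSVC(), X, y, cv=3).mean()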

31

32 of 43

Can models distinguish caption variants?

All-item non-linear separability using probes

Target: Train a single non-linear probe that is given three inputs: (a) text embeddings of two variants X and Y of the same caption and (b) a text embedding of a variant Z of the other caption; the probe must correctly choose Y over Z as the match for X.

Control: The same as above, but trained with the matchings swapped for a random 50% of examples (see the sketch below).
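
A sketch of this probe over text embeddings, with an assumed embedding dimension; as above, the architecture is illustrative rather than the exact probe from the paper.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    d = 768  # assumed text-embedding dimension

    variant_probe = nn.Sequential(
        nn.Linear(3 * d, 256),
        nn.ReLU(),
        nn.Linear(256, 2),   # class 0: Y matches X, class 1: Z matches X
    )

    def variant_probe_loss(x, y, z, labels):
        # x, y: embeddings of variants of the same caption; z: a variant of the
        # other caption. The control task swaps the matchings for 50% of examples.
        logits = variant_probe(torch.cat([x, y, z], dim=-1))
        return F.cross_entropy(logits, labels)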

32

33 of 43

Using Caption Variants to Help Models

  • If models can tell caption variants apart, maybe that information can be used?
  • Use similarity scores between images and caption variants to aid models:
    • Given a caption and its variants, compute a new similarity score by weighting the original score with the mean/max of the variant scores (sketch after this list)

  • This doesn't change text/image/group scores by much, implying that good semantic distinguishability may not be sufficient to achieve good image-text matching
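
One way the variant scores could be folded into the original image-text score; the weighting alpha and the mean/max choice shown here are assumptions, since the slide only states that the original score is combined with the mean or max of the variant scores.

    import numpy as np

    def combined_score(original, variant_scores, alpha=0.5, reduce="mean"):
        # original: model's score for (image, original caption);
        # variant_scores: scores for (image, each caption variant).
        agg = np.mean(variant_scores) if reduce == "mean" else np.max(variant_scores)
        return alpha * original + (1 - alpha) * agg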

33

34 of 43

Talk Overview

  • Background: Winoground
  • Models of Interest (CLIP, UNITER, LXMERT) and Winoground
  • Analyzing the dataset
  • Analyzing the evaluation criteria
  • Analyzing the models
    • Takeaway 1: Models' text branches can semantically distinguish the minimal textual pairs, but
    • Takeaway 2: Models don't seem to be able to use this to do Winoground-style image-text matching

34

37 of 43

Summary

  • We created new annotations revealing that succeeding on Winoground requires more abilities than compositionality alone
  • We relaxed the evaluation criteria using (a) Recall @ k, which revealed interesting differences between the three models, and (b) trained probes, which did not help
  • Finally, using caption variants and linear/non-linear probes, we showed that models can semantically distinguish the two captions but are likely unable to use this knowledge to succeed on Winoground

37

43 of 43

Recommendations for the Future

  • To get a better idea of model performance, evaluate separately on each tag's subset of items
  • In our model analysis, we only showed that the text branch can encode the relevant semantic distinctions; outstanding questions include
    • Does the image branch encode these semantic distinctions?
    • Is the image-text matching score capable of making the fine-grained distinctions needed to succeed on Winoground?
    • How can we train these pretrained models to be better at Winoground?

43