1 of 43

Why is Winoground Hard? Investigating Failures in Visuolinguistic Compositionality

Anuj Diwan*, Layne Berry*, Eunsol Choi, David Harwath, Kyle Mahowald

University of Texas at Austin

1

2 of 43

Talk Overview

  1. Background: Winoground
  2. Models of Interest (CLIP, UNITER, LXMERT) and Winoground
  3. Analyzing the dataset
  4. Analyzing the evaluation criteria
  5. Analyzing the models

2

3 of 43

Talk Overview

  • Background: Winoground (Thrush et al., 2022)

3

4 of 43

Background: The Winoground Visuolinguistic Compositionality Benchmark

4

“An old person kisses a young person.”

“A young person kisses an old person.”

5 of 43

Background: The Winoground Visuolinguistic Compositionality Benchmark

Text Score = 1 if, for each image, the model scores the correct caption higher than the incorrect caption

Image Score = 1 if, for each caption, the model scores the correct image higher than the incorrect image

Group Score = 1 if Text Score and Image Score are both 1

5

“An old person kisses a young person.”

“A young person kisses an old person.”
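
A rough sketch of the three metrics defined above, assuming a model exposes a similarity function score(caption, image); the function names and signature are illustrative, not the official Winoground evaluation code.

    # Winoground metrics for one example with captions c0, c1 and images i0, i1,
    # where score(caption, image) is the model's image-text similarity.
    def text_score(score, c0, c1, i0, i1):
        # For each image, the correct caption must outscore the incorrect one.
        return int(score(c0, i0) > score(c1, i0) and score(c1, i1) > score(c0, i1))

    def image_score(score, c0, c1, i0, i1):
        # For each caption, the correct image must outscore the incorrect one.
        return int(score(c0, i0) > score(c0, i1) and score(c1, i1) > score(c1, i0))

    def group_score(score, c0, c1, i0, i1):
        # Both criteria must hold simultaneously.
        return int(text_score(score, c0, c1, i0, i1) and image_score(score, c0, c1, i0, i1))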

9 of 43

Talk Overview

  • Background: Winoground

  • Models of Interest (CLIP, UNITER, LXMERT) and Winoground

9

10 of 43

Models of Interest

10

CLIP: 151M parameters; pretrained on 400M image-caption pairs (Radford & Kim et al., 2021)

LXMERT: 207M parameters; pretrained on 0.18M images and 9.18M captions (Tan & Bansal, 2019)

UNITER: 86M parameters; pretrained on 4.2M images and 9.58M captions (Chen, Li, & Yu et al., 2020)

(Each model is illustrated scoring an image against the caption “A young person kisses an old person.”)

11 of 43

SOTA VL Models Fail Miserably on Winoground

11

12 of 43

Talk Overview

  • Background: Winoground
  • Models of Interest (CLIP, UNITER, LXMERT) and Winoground

  • Analyzing the dataset

12

13 of 43

Analyzing the dataset: New annotated tags!

13

Example item:
“the cat on the left of the photo has its right paw ahead of its left”
“the cat on the left of the photo has its left paw ahead of its right”

New tags: NonCompositional, AmbiguouslyCorrect, VisuallyDifficult, UnusualImage, UnusualText, ComplexReasoning

(A) The original Winoground task… (B) With new tags

14 of 43

Non-Compositional Items (n=30)

14

“Shedding its leaves.”

“Leaves its shedding.”

15 of 43

Ambiguously Correct Items (n=46)

15

“The person with the kids is sitting.”

“The person is sitting with the kids.”

16 of 43

Visually Difficult Items (n=38)

16

“The person with hair to their shoulders has brown eyes and the other person’s eyes are blue.”

“The person with hair to their shoulders has blue eyes and the other person’s eyes are brown.”

17 of 43

Items with Unusual Images (n=56)

17

“The orange lollipop is sad and the red lollipop is surprised.”

“The orange lollipop is surprised and the red lollipop is sad.”

18 of 43

Items with Unusual Text (n=50)

18

“The brave in the face of fear.”

“Fear in the face of the brave.”

19 of 43

Items Requiring Complex Reasoning (n=78)

19

“The cup on the left is filled first and the cup on the right is filled second.”

“The cup on the left is filled second and the cup on the right is filled first.”

20 of 43

Items Directly Measuring Compositionality (n=171)

20

“There is a mug in some grass.”

“There is some grass in a mug.”

21 of 43

Talk Overview

  • Background: Winoground
  • Models of Interest (CLIP, UNITER, LXMERT) and Winoground
  • Analyzing the dataset
    • Takeaway: the Winoground dataset measures harder/different abilities than compositionality alone

21

22 of 43

Talk Overview

  • Background: Winoground
  • Models of Interest (CLIP, UNITER, LXMERT) and Winoground
  • Analyzing the dataset

  • Analyzing the evaluation criteria

22

23 of 43

Analyzing the evaluation criteria

We relax the evaluation criteria in two ways: (1) Recall @ k and (2) training probes on Winoground.

  1. Instead of picking the correct image over the incorrect one conditioned on a caption ("Image Score"), can the model simply retrieve the correct image from the whole dataset, conditioned on that caption? (Recall @ k)
  2. Models see only one image-text pair at a time when outputting a score and cannot compare across pairs. Does training a probe on Winoground that does have such access help?

23

24 of 43

Retrieval: Recall @ k

Recall @ k (T2I) = % of texts for which the correct image match is in the top k retrievals

Recall @ k (I2T) = % of images for which the correct text match is in the top k retrievals
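
A minimal sketch of the Recall @ k computation, assuming the pairwise similarities have already been collected into a matrix whose diagonal holds the correct matches; the matrix size (800, for 400 examples with two captions/images each) and the variable names are assumptions.

    import numpy as np

    def recall_at_k(sim, k):
        # sim: (num_texts, num_images) similarity matrix with sim[i, i] the
        # correct match; pass sim.T for the image-to-text direction.
        ranks = np.argsort(-sim, axis=1)          # images sorted by descending score
        hits = [i in ranks[i, :k] for i in range(sim.shape[0])]
        return float(np.mean(hits))

    sim = np.random.rand(800, 800)                # placeholder scores, illustrative only
    print(recall_at_k(sim, k=5), recall_at_k(sim.T, k=5))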

24

25 of 43

Training a probe on Winoground

Target task: Train a single non-linear binary classification probe that takes two inputs:

  1. Joint embedding of the correct pair (e.g., caption 0 paired with image 0)
  2. Joint embedding of the incorrect pair (e.g., caption 1 paired with image 0)

and must output which pair is the correct one (class 0 here); see the sketch below.

Control task ('Random baseline'): The same as above, but trained with labels swapped for a random 50% of the dataset.

Dataset: Winoground (400 examples), split into a train set (300) and a test set (100).
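
A sketch of such a probe, assuming each image-text pair has already been encoded into a joint embedding of dimension d (e.g. by UNITER or LXMERT); the architecture and hyperparameters are illustrative, not the exact probe from the paper.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    d = 768  # assumed joint-embedding dimension

    probe = nn.Sequential(
        nn.Linear(2 * d, 256),   # input: [correct-pair embedding ; incorrect-pair embedding]
        nn.ReLU(),
        nn.Linear(256, 2),       # output: which of the two pairs is correct
    )

    def probe_loss(pair_a, pair_b, labels):
        # pair_a, pair_b: (batch, d) joint embeddings; labels: 0 if pair_a is
        # the correct pair, 1 if pair_b is. The control task trains the same
        # probe with labels flipped for a random 50% of items.
        logits = probe(torch.cat([pair_a, pair_b], dim=-1))
        return F.cross_entropy(logits, labels)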

25

26 of 43

Training a probe on Winoground: Results (11 trials)

26

(Figure: Text Score Probe and Image Score Probe results for LXMERT and UNITER over 11 trials.)

27 of 43

Talk Overview

  • Background: Winoground
  • Models of Interest (CLIP, UNITER, LXMERT) and Winoground
  • Analyzing the dataset
  • Analyzing the evaluation criteria
    • Takeaway 1: Relaxing the strict matching criterion in Winoground reveals new, interesting differences between models
    • Takeaway 2: Surprisingly, training probes on Winoground doesn't seem to help performance

27

28 of 43

Talk Overview

  • Background: Winoground
  • Models of Interest (CLIP, UNITER, LXMERT) and Winoground
  • Analyzing the dataset
  • Analyzing the evaluation criteria

  • Analyzing the models

28

29 of 43

Analyzing the models

One hypothesis is that the text branch of V-L models is confused by these minimal textual pairs and cannot semantically distinguish them.

By using semantics-preserving augmentations of each text, we found that

  1. The text branch actually can distinguish these pairs, but
  2. Explicitly using this information still doesn't help performance on Winoground

29

30 of 43

Semantics-preserving augmentations

  • We manually selected 9 augmentation strategies from NL-Augmenter (Dhole et al., 2021) that we found most likely to preserve caption semantics
  • Augmented captions (i.e., caption variants) are no longer minimal textual pairs; a rough illustration follows this list.
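
The nine strategies themselves come from NL-Augmenter; as a loose stand-in, the sketch below applies two simple perturbations that keep a caption's meaning intact. These transformations are illustrative only and are not the strategies used in the paper.

    import re

    CONTRACTIONS = {"isn't": "is not", "aren't": "are not", "it's": "it is"}

    def expand_contractions(caption):
        for short, full in CONTRACTIONS.items():
            caption = re.sub(re.escape(short), full, caption, flags=re.IGNORECASE)
        return caption

    def drop_final_period(caption):
        return caption.rstrip(".")

    def caption_variants(caption):
        # Every variant should mean the same thing as the original caption.
        return [caption, expand_contractions(caption), drop_final_period(caption)]

    print(caption_variants("An old person kisses a young person."))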

30

31 of 43

Can models distinguish caption variants?

Per-item linear separability using SVMs

For each Winoground example (400 in total), learn a separate linear SVM classifier to discriminate:

  • Target task: embeddings of caption 0 variants from embeddings of caption 1 variants
  • Control task: two random, disjoint subsets of the union of caption 0 and caption 1 variants (see the sketch below)
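
A sketch of the per-item separability test, assuming the caption variants have already been embedded by the model's text branch into arrays of shape (num_variants, d); the cross-validation setup is an illustrative choice.

    import numpy as np
    from sklearn.svm import LinearSVC
    from sklearn.model_selection import cross_val_score

    def separability(emb0, emb1, control=False):
        # emb0 / emb1: embeddings of caption 0 / caption 1 variants for one item.
        X = np.concatenate([emb0, emb1])
        y = np.array([0] * len(emb0) + [1] * len(emb1))
        if control:
            # Control task approximated here by shuffling labels, i.e. a random
            # partition of the pooled variants into two disjoint subsets.
            y = np.random.permutation(y)
        return cross_val_score(LinearSVC(), X, y, cv=3).mean()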

31

32 of 43

Can models distinguish caption variants?

All-item non-linear separability using probes

Target: Train a single non-linear probe that is given three inputs: (a) text embeddings of two variants X and Y of the same caption and (b) a text embedding of a variant Z of the other caption; the probe must correctly choose Y over Z as the match for X.

Control: The same as above, but trained with the matchings swapped for a random 50% of examples (see the sketch below).
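
A sketch of this probe over text embeddings, with an assumed embedding dimension; as above, the architecture is illustrative rather than the exact probe from the paper.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    d = 768  # assumed text-embedding dimension

    variant_probe = nn.Sequential(
        nn.Linear(3 * d, 256),
        nn.ReLU(),
        nn.Linear(256, 2),   # class 0: Y matches X, class 1: Z matches X
    )

    def variant_probe_loss(x, y, z, labels):
        # x, y: embeddings of variants of the same caption; z: a variant of the
        # other caption. The control task swaps the matchings for 50% of examples.
        logits = variant_probe(torch.cat([x, y, z], dim=-1))
        return F.cross_entropy(logits, labels)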

32

33 of 43

Using Caption Variants to Help Models

  • If models can tell caption variants apart, maybe that information can be used?
  • Use similarity scores between images and caption variants to aid models:
    • Given a caption and its variants, compute a new similarity score by weighting the original score with the mean/max of the variant scores (sketch after this list)

  • This doesn't change text/image/group scores by much, implying that good semantic distinguishability may not be sufficient to achieve good image-text matching
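
One way the variant scores could be folded into the original image-text score; the weighting alpha and the mean/max choice shown here are assumptions, since the slide only states that the original score is combined with the mean or max of the variant scores.

    import numpy as np

    def combined_score(original, variant_scores, alpha=0.5, reduce="mean"):
        # original: model's score for (image, original caption);
        # variant_scores: scores for (image, each caption variant).
        agg = np.mean(variant_scores) if reduce == "mean" else np.max(variant_scores)
        return alpha * original + (1 - alpha) * agg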

33

34 of 43

Talk Overview

  • Background: Winoground
  • Models of Interest (CLIP, UNITER, LXMERT) and Winoground
  • Analyzing the dataset
  • Analyzing the evaluation criteria
  • Analyzing the models
    • Takeaway 1: Models' text branches can semantically distinguish the minimal textual pairs, but
    • Takeaway 2: Models don't seem to be able to use this to do Winoground-style image-text matching

34

37 of 43

Summary

  • We created new annotations revealing that succeeding on Winoground requires more abilities than compositionality alone
  • We relaxed the evaluation criteria using (a) Recall @ k, which revealed interesting differences between the three models, and (b) trained probes, which did not help
  • Finally, using caption variants and linear/non-linear probes, we showed that models can semantically distinguish the two captions but are likely unable to use this knowledge to succeed on Winoground

37

43 of 43

Recommendations for the Future

  • To get a better idea of model performance, evaluate separately on each tag's subset of items
  • In our model analysis, we only showed that the text branch can encode the relevant semantic distinctions; outstanding questions include
    • Does the image branch encode these semantic distinctions?
    • Is the image-text matching score capable of making the fine-grained distinctions needed to succeed on Winoground?
    • How can we train these pretrained models to be better at Winoground?

43