  1. Demystifying CLIP Data @ICLR24
  2. Altogether: Image Captioning via Re-aligning Alt-text @EMNLP24

Xu et al., Meta + NYU & University of Washington

  1. Demystifying CLIP Data

Related work redux

  • How to go from raw internet data to good ImageNet zero-shot results
  • The only thing they change in this paper is the data; the model and training recipe stay fixed, so any change in results is attributable to curation
  • They attempt to replicate and improve upon the original CLIP’s dataset creation process

MetaCLIP

  • They aim to replicate Radford et al.’s data curation process
  • 500k queries, up to ~20k image-text pairs kept per query
  • 3.1 500k queries
  • 3.2 substring matching (see the sketch after this list)
  • 3.3 inverted indexing
  • 3.4 query and balancing
  • Results + ablations
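
A minimal sketch of steps 3.2 and 3.3, assuming a made-up three-entry metadata list and case-insensitive matching (both assumptions of this sketch; the real metadata has 500k entries drawn from WordNet synsets and Wikipedia text):

```python
# Toy illustration of 3.2 (substring matching) and 3.3 (inverted indexing).
metadata = ["photo", "golden retriever", "eiffel tower"]
texts = [
    "A photo of a golden retriever",
    "Sunset over the Eiffel Tower",
    "click here for deals",  # matches no entry -> dropped downstream
]

def match_entries(text):
    """Return ids of metadata entries that occur in `text` as substrings
    (case-insensitive, an assumption of this sketch)."""
    lowered = text.lower()
    return [i for i, entry in enumerate(metadata) if entry in lowered]

# Inverted index: entry id -> ids of the texts that matched it.
inverted_index = {}
for text_id, text in enumerate(texts):
    for entry_id in match_entries(text):
        inverted_index.setdefault(entry_id, []).append(text_id)

print(inverted_index)  # {0: [0], 1: [0], 2: [1]}
```

The per-entry counts needed for balancing (3.4) fall straight out of this index: entry_count[j] = len(inverted_index.get(j, [])).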

A synset (synonym set) is a set of words with the same part of speech that can be interchanged in a certain context.
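
For a concrete example, WordNet (one of the sources MetaCLIP draws its metadata entries from) exposes synsets through NLTK; this assumes nltk is installed and the wordnet corpus downloaded:

```python
# pip install nltk; then once: import nltk; nltk.download("wordnet")
from nltk.corpus import wordnet as wn

synset = wn.synsets("car")[0]
print(synset.name())         # car.n.01
print(synset.lemma_names())  # ['car', 'auto', 'automobile', 'machine', 'motorcar']
```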

What does the balancing even do?

  1. It reduces dominance and noise from head entries, like common web terms. E.g., out of 400M pairs, only 20k texts containing “photo” are kept (while there are 54M “photo” instances in the pool).
  2. It diversifies the data distribution and balances tail/head entries, leading to a more task-agnostic foundation.
  3. Sampling for each entry ensures that data points with more matched entries or denser information are prioritized for curation.
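
A minimal numpy sketch of the balancing step, in the spirit of the paper's pseudocode but not a verbatim copy (the function name and interface are made up): head entries with more than t matches are down-sampled to roughly t texts each, tail entries keep everything, and a text gets one chance per matched entry, so densely matched texts survive more often:

```python
import numpy as np

def balance(texts_entry_ids, entry_count, t=20_000, seed=0):
    """texts_entry_ids[i]: entry ids matched by text i (from substring matching).
    entry_count[j]: number of texts matching entry j (from the inverted index)."""
    rng = np.random.default_rng(seed)
    capped = np.maximum(entry_count.astype(float), t)  # tail entries -> prob 1.0
    entry_prob = t / capped                            # head entries -> t/count < 1
    curated = []
    for i, entry_ids in enumerate(texts_entry_ids):
        # Texts matching no entry are dropped (any() over empty is False);
        # one coin flip per matched entry prioritizes densely matched texts.
        if any(rng.random() < entry_prob[j] for j in entry_ids):
            curated.append(i)
    return curated
```

In expectation a head entry like “photo” contributes about t texts (54M × 20k/54M ≈ 20k), which is exactly the reduction described in point 1 above.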

2 Pools of data

  • Pool 1 (exact CLIP replica attempt): t=20k; 1.6B -> 400M image-text pairs
    • 15 Common Crawl (CC) snapshots, JAN21 to JAN23
  • Pool 2: 10.7B image-text pairs, after deduplication, English Language IDentification (LID), and substring matching
    • 90 CC snapshots, 2013 to APR23
  • Pool 2.1: t=170k version -> 2.5B image-text pairs
    • tail = 6% of total counts; same curated/pool ratio as CLIP: 400M/1.6B ≈ 2.5B/10.7B
  • Pool 2.2: t=20k -> 1B image-text pairs; the tail is fatter than in Pool 1 (rarely occurring queries make up a larger share of this bigger pool)
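
Quick check on that ratio claim: 400M / 1.6B = 25%, while 2.5B / 10.7B ≈ 23%, so choosing t=170k on Pool 2 does roughly preserve Pool 1’s curated-to-pool fraction.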

Experiments

Zero-shot ImageNet, believe it or not: Pool 1

Pool 2 shenanigans

A.1 They tried using DataComp data; it wasn’t “good”

400M vs 1B

A.2 Curation deets

  1. HTML parsing + language ID
  2. URL/text deduplication
  3. Image download (wget)
  4. NSFW filtering + a second deduplication pass
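
A toy sketch of step 2, URL/text deduplication, assuming exact-match hashing is enough for illustration (the real pipeline dedups billions of pairs, and the interface below is made up):

```python
import hashlib

def dedup(pairs):
    """Keep the first occurrence of each (url, alt_text) pair, keyed by hash."""
    seen, unique = set(), []
    for url, text in pairs:
        key = hashlib.sha1(f"{url}\x00{text}".encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append((url, text))
    return unique
```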

A.3 Human study on the effects of curation

We collect an evaluation set of 100 random image-text pairs for balanced and unbalanced data, respectively, and ask annotators to score image, text, and pair quality separately, on a scale of 1 to 5.

  • Takeaway: noise mitigation

Altogether: Image Captioning via Re-aligning Alt-text @EMNLP24
