  1. Demystifying CLIP Data @ICLR24
  2. Altogether: Image Captioning via Re-aligning Alt-text @EMNLP24

Xu et al., Meta + NYU & University of Washington

  1. Demystifying CLIP Data

Related work redux

  • How to go from raw internet data to good ImageNet zero-shot results
  • The only thing they change in this paper is the data; the model and training recipe stay fixed, so any change in results is attributable to curation
  • They attempt to replicate and improve upon the original CLIP’s dataset creation process

MetaCLIP

  • They aim to replicate Radford et al.’s data curation process
  • 500k queries, up to ~20k image-text pairs kept per query
  • 3.1 500k queries
  • 3.2 substring matching (see the sketch after this list)
  • 3.3 inverted indexing
  • 3.4 query and balancing
  • Results + ablations
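
A minimal sketch of steps 3.2 and 3.3, assuming a made-up three-entry metadata list and case-insensitive matching (both assumptions of this sketch; the real metadata has 500k entries drawn from WordNet synsets and Wikipedia text):

```python
# Toy illustration of 3.2 (substring matching) and 3.3 (inverted indexing).
metadata = ["photo", "golden retriever", "eiffel tower"]
texts = [
    "A photo of a golden retriever",
    "Sunset over the Eiffel Tower",
    "click here for deals",  # matches no entry -> dropped downstream
]

def match_entries(text):
    """Return ids of metadata entries that occur in `text` as substrings
    (case-insensitive, an assumption of this sketch)."""
    lowered = text.lower()
    return [i for i, entry in enumerate(metadata) if entry in lowered]

# Inverted index: entry id -> ids of the texts that matched it.
inverted_index = {}
for text_id, text in enumerate(texts):
    for entry_id in match_entries(text):
        inverted_index.setdefault(entry_id, []).append(text_id)

print(inverted_index)  # {0: [0], 1: [0], 2: [1]}
```

The per-entry counts needed for balancing (3.4) fall straight out of this index: entry_count[j] = len(inverted_index.get(j, [])).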

A synset (synonym set) is a set of words with the same part of speech that can be interchanged in a certain context.
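
For a concrete example, WordNet (one of the sources MetaCLIP draws its metadata entries from) exposes synsets through NLTK; this assumes nltk is installed and the wordnet corpus downloaded:

```python
# pip install nltk; then once: import nltk; nltk.download("wordnet")
from nltk.corpus import wordnet as wn

synset = wn.synsets("car")[0]
print(synset.name())         # car.n.01
print(synset.lemma_names())  # ['car', 'auto', 'automobile', 'machine', 'motorcar']
```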

What does the balancing even do?

  1. It reduces dominance and noise from head entries, like common web terms. E.g., out of 400M pairs, only 20k texts containing “photo” are kept (while there are 54M “photo” instances in the pool).
  2. It diversifies the data distribution and balances tail/head entries, leading to a more task-agnostic foundation.
  3. Sampling for each entry ensures that data points with more matched entries or denser information are prioritized for curation.
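
A minimal numpy sketch of the balancing step, in the spirit of the paper's pseudocode but not a verbatim copy (the function name and interface are made up): head entries with more than t matches are down-sampled to roughly t texts each, tail entries keep everything, and a text gets one chance per matched entry, so densely matched texts survive more often:

```python
import numpy as np

def balance(texts_entry_ids, entry_count, t=20_000, seed=0):
    """texts_entry_ids[i]: entry ids matched by text i (from substring matching).
    entry_count[j]: number of texts matching entry j (from the inverted index)."""
    rng = np.random.default_rng(seed)
    capped = np.maximum(entry_count.astype(float), t)  # tail entries -> prob 1.0
    entry_prob = t / capped                            # head entries -> t/count < 1
    curated = []
    for i, entry_ids in enumerate(texts_entry_ids):
        # Texts matching no entry are dropped (any() over empty is False);
        # one coin flip per matched entry prioritizes densely matched texts.
        if any(rng.random() < entry_prob[j] for j in entry_ids):
            curated.append(i)
    return curated
```

In expectation a head entry like “photo” contributes about t texts (54M × 20k/54M ≈ 20k), which is exactly the reduction described in point 1 above.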

2 Pools of data

  • Pool 1 (exact CLIP replica attempt): t=20k; 1.6B -> 400M image-text pairs
    • 15 Common Crawl (CC) snapshots, JAN21 to JAN23
  • Pool 2: 10.7B image-text pairs, after deduplication, English Language IDentification (LID), and substring matching
    • 90 CC snapshots, 2013 to APR23
  • Pool 2.1: t=170k version -> 2.5B image-text pairs
    • tail = 6% of total counts; same curated/pool ratio as CLIP: 400M/1.6B ≈ 2.5B/10.7B
  • Pool 2.2: t=20k -> 1B image-text pairs; the tail is fatter than in Pool 1 (rarely occurring queries make up a larger share of this bigger pool)
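
Quick check on that ratio claim: 400M / 1.6B = 25%, while 2.5B / 10.7B ≈ 23%, so choosing t=170k on Pool 2 does roughly preserve Pool 1’s curated-to-pool fraction.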

Experiments

Zero-shot ImageNet, believe it or not: Pool 1

Pool 2 shenanigans

A.1 They tried using DataComp data; it wasn’t “good”

400M vs 1B

A.2 Curation deets

  1. HTML parsing + language ID
  2. URL/text deduplication
  3. Image download (wget)
  4. NSFW filtering + a second deduplication pass
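
A toy sketch of step 2, URL/text deduplication, assuming exact-match hashing is enough for illustration (the real pipeline dedups billions of pairs, and the interface below is made up):

```python
import hashlib

def dedup(pairs):
    """Keep the first occurrence of each (url, alt_text) pair, keyed by hash."""
    seen, unique = set(), []
    for url, text in pairs:
        key = hashlib.sha1(f"{url}\x00{text}".encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append((url, text))
    return unique
```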

A.3 Human study on the effects of curation

We collect an evaluation set of 100 random image-text pairs for balanced and unbalanced data, respectively, and ask annotators to score image, text, and pair quality separately, on a scale of 1 to 5.

  • Takeaway: noise mitigation

Altogether: Image Captioning via Re-aligning Alt-text @EMNLP24
