
The Platonic Representation Hypothesis

Minyoung Huh, Brian Cheung, Tongzhou Wang, Phillip Isola

(ICML 2024)

Presenters: Ankur Sikarwar, Anirudh Buvanesh


Hypothesis

Different neural networks are converging toward the same way of representing the world.


Outline

  • Background
  • Convergence Experiments
  • What is driving convergence?
  • Implications and limitations


Rosetta Neurons

Dravid, Amil, et al. "Rosetta neurons: Mining the common units in a model zoo." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.

Evidence from Model Stitching


Bansal, Yamini, Preetum Nakkiran, and Boaz Barak. "Revisiting model stitching to compare neural representations." Advances in neural information processing systems 34 (2021): 225-236.

[Figure: model stitching, in which early layers trained with method/data A are combined with later layers trained with method/data B through a learned stitching layer; if performance survives, the two representations are compatible.]


Outline

  • Background
  • Convergence Experiments
  • What is driving convergence?
  • Implications and limitations


Characterizing Representations using Kernels

Focus on vector embeddings: a representation is a function f : X → R^n that maps each input to a vector.

Characterizing a representation in terms of its kernel: K(x_i, x_j) = ⟨f(x_i), f(x_j)⟩, which records how similar the representation considers each pair of inputs.

[Figure: the kernel K_vision of a vision model; entries are high for pairs of inputs the model represents as similar and low for dissimilar pairs.]

Slide credit: Phillip Isola
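To make the definition concrete, here is a minimal sketch (the random features below stand in for any model's embeddings):

```python
import numpy as np

def kernel(features: np.ndarray) -> np.ndarray:
    """Inner-product kernel K[i, j] = <f(x_i), f(x_j)> for row-wise features."""
    return features @ features.T

# Toy example: 4 inputs embedded in R^3 (random stand-ins for real features).
rng = np.random.default_rng(0)
F = rng.normal(size=(4, 3))
K = kernel(F)  # (4, 4) matrix encoding which inputs the model treats as similar
```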

Characterizing Representations using Kernels

[Figure: alignment as kernel similarity, sim(K_DINO, K_CLIP): do two models such as DINO and CLIP induce the same similarity structure over the same inputs?]

Slide credit: Phillip Isola

Nearest-neighbor kernel-alignment metric

What percentage of my nearest neighbors under representation f are also my nearest neighbors under representation g?

[Figure: a query point with its nearest-neighbor sets under f and under g; alignment is the overlap between the two sets.]

Slide credit: Phillip Isola
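A minimal sketch of this metric (the cosine-similarity choice and the value of k are implementation details assumed here; the paper's measure is a mutual nearest-neighbor metric of this flavor):

```python
import numpy as np

def mutual_knn_alignment(feats_f: np.ndarray, feats_g: np.ndarray, k: int = 10) -> float:
    """Average fraction of shared k-nearest neighbors between two representations.

    feats_f: (n, d_f) and feats_g: (n, d_g) are embeddings of the SAME n inputs.
    Returns a score in [0, 1]; 1 means identical neighborhood structure.
    """
    def knn_indices(feats: np.ndarray) -> np.ndarray:
        normed = feats / np.linalg.norm(feats, axis=1, keepdims=True)
        sims = normed @ normed.T
        np.fill_diagonal(sims, -np.inf)        # a point is not its own neighbor
        return np.argsort(-sims, axis=1)[:, :k]

    nn_f, nn_g = knn_indices(feats_f), knn_indices(feats_g)
    overlaps = [len(set(a) & set(b)) / k for a, b in zip(nn_f, nn_g)]
    return float(np.mean(overlaps))
```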

Experiment: Is alignment between vision models increasing as vision systems become stronger?

Hypothesis 1: There are many different ways one can represent the visual world, and each can be highly effective.

Hypothesis 2: All strong visual representations are alike.

Experiment: Is alignment between vision models increasing as vision systems become stronger?

  • 78 vision models: different architectures, objectives, and training data distributions.
  • Group models by performance on VTAB, and measure representational similarity within each group.


Experiment: Is alignment between vision models increasing as vision systems become stronger?


All strong representations are alike, each weak representation is weak in its own way.

Experiment: Is language-vision alignment increasing?

Hypothesis 1: As language models get better and better, they will become more and more specific to language, and start being less generally useful for vision.

Hypothesis 2: Better language models are better vision models.

Hypothesis 2+: The best language model is the best vision model. They converge to the same representation.

Experiment: Is language-vision alignment increasing?

[Figure: cross-modal alignment measured as sim(K_vision, K_text): the same concepts (“apple”, “orange”, “elephant”, “giraffe”) are embedded as images by a vision model and as words by a language model, and the two kernels are compared.]
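Cross-modal alignment can be scored the same way as before: embed paired data with each model and reuse the mutual nearest-neighbor metric. A hypothetical usage sketch (`image_encoder`, `text_encoder`, `images`, and `captions` are placeholders, not objects from the paper):

```python
# Hypothetical encoders and paired data: row i of `images` depicts the same
# concept that row i of `captions` describes.
img_feats = image_encoder(images)    # (n, d_vision): rows inducing K_vision
txt_feats = text_encoder(captions)   # (n, d_text):   rows inducing K_text
score = mutual_knn_alignment(img_feats, txt_feats, k=10)
print(f"language-vision alignment: {score:.3f}")
```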

Experiment: Is language-vision alignment increasing?

Strong models converge in representation.

[Figures: alignment trending towards convergence, and alignment versus downstream performance.]


Outline

  • Background
  • Convergence Experiments
  • What is driving convergence?
  • Implications and limitations


What is Driving Convergence?

  • Multi-task scaling hypothesis
  • Capacity hypothesis
  • Simplicity bias


Multi-Task Scaling Hypothesis

  • Pre-training objectives implicitly do multi-task learning: every task a model must solve constrains which representations are acceptable (see the toy count below).

[Figure: within the hypothesis space, the set of functions that solves task 1 overlaps the set that solves task 2; only the intersection solves both tasks, so each added task shrinks the feasible region.]
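A toy count that makes the picture concrete (a constructed example, not from the paper): on a 4-point input space there are 16 boolean functions; each task's constraints cut the feasible set down, and satisfying both tasks leaves fewer functions than either alone.

```python
from itertools import product

inputs = list(range(4))    # a 4-point input space: 2**4 = 16 boolean functions
task1 = {0: 1, 1: 0}       # task 1 pins the outputs on inputs 0 and 1
task2 = {1: 0, 2: 1}       # task 2 pins the outputs on inputs 1 and 2

all_fns = [dict(zip(inputs, outs)) for outs in product([0, 1], repeat=len(inputs))]

def solves(fn: dict, task: dict) -> bool:
    return all(fn[x] == y for x, y in task.items())

n_task1 = sum(solves(f, task1) for f in all_fns)
n_both = sum(solves(f, task1) and solves(f, task2) for f in all_fns)
print(n_task1, n_both)     # 4 2: each added task shrinks the feasible set
```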


Capacity Hypothesis

  • High-capacity networks can represent a bigger hypothesis space, so as scale grows, models with different architectures become more likely to find the same optimal representation.

[Figure: models of different architectures converging to the same solution as scale increases.]


Simplicity Bias

  • Deep networks are biased toward finding simple fits to the data.

[Figure: within the hypothesis space, among all functions that solve the task, the simplicity bias favors the simple ones.]

  • Regularization reinforces this bias (see the sketch below):
    • Weight decay
    • Dropout
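For reference, here is how the two regularizers named above typically appear in PyTorch (a generic sketch, not code from the paper):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(128, 256),
    nn.ReLU(),
    nn.Dropout(p=0.1),   # dropout: randomly zeroes activations during training
    nn.Linear(256, 10),
)

# Weight decay: an L2 penalty on parameters, applied inside the optimizer step.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
```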


What are we converging to?

  • World of objects of different colors
  • Different modalities are (lossy) projections of the world

[Figure: a world state, “A red sphere next to a blue cone”, observed through different modality projections.]

Question: What do different modalities have in common?

Hypothesis: Convergence towards co-occurrences of events in the world


Evidence of Convergence Towards Co-Occurrences

  • Colors that co-occur in the world are represented closer together in the embedding space (see the sketch below).
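A simplified sketch of how such a co-occurrence analysis can be run (the color binning and the PMI-plus-spectral embedding are choices made for this illustration, not necessarily the paper's exact procedure):

```python
import numpy as np

def color_cooccurrence_embedding(images: np.ndarray, n_bins: int = 8, dim: int = 3) -> np.ndarray:
    """Embed quantized colors from their within-image co-occurrence statistics.

    images: (n_images, H, W, 3) uint8 array. Returns (n_bins**3, dim) embeddings;
    colors that often appear in the same image should land close together.
    """
    n_colors = n_bins ** 3
    counts = np.zeros((n_colors, n_colors))
    for img in images:
        q = (img // (256 // n_bins)).reshape(-1, 3)            # quantize pixels
        ids = np.unique(q[:, 0] * n_bins**2 + q[:, 1] * n_bins + q[:, 2])
        counts[np.ix_(ids, ids)] += 1                          # co-present colors
    p = counts / counts.sum()
    marginal = p.sum(axis=1, keepdims=True)
    pmi = np.log((p + 1e-12) / (marginal @ marginal.T + 1e-12))
    vals, vecs = np.linalg.eigh(pmi)                           # spectral embedding
    return vecs[:, -dim:] * np.sqrt(np.abs(vals[-dim:]))
```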


Outline

  • Background
  • Convergence Experiments
  • What is driving convergence?
  • Implications and limitations


Implication: Sharing Data Between Modalities

  • Visual data can improve language models, and language data can improve vision models [1].

  • Transfer from vision-language models to embodied AI [2].


[1]: Radford, Alec, et al. "Learning transferable visual models from natural language supervision." ICML, 2021.

[2]: Black, Kevin, et al. "π0: A vision-language-action flow model for general robot control." arXiv, 2024.


Limitation: Why are we far from high alignment?

  • Different modalities may contain different information.

“An aurora is a breathtaking display of swirling, luminous colors that dance across the night sky.”


Increasing Mutual Information Improves Alignment

  • With denser captions, the image-to-text mapping becomes closer to bijective, improving language-vision alignment scores.


Limitation: Convergence towards biases?

  • Biases from data collection procedures, e.g., the recommendation feedback loop below:

  1. User’s intent: shock-absorbing badminton shoes.
  2. Top-K recommendations don’t capture the user’s interests.
  3. Relevant items don’t get clicks.
  4. Data for the next iteration is curated from click logs.
  5. Diversity is lost, and the system drifts further from the user’s interests.
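A toy simulation of this loop (a constructed example, not from the slides' source): a recommender that retrains on its own click logs quickly locks onto a handful of items and stops exploring.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, k = 50, 5
true_pref = rng.random(n_items)     # the user's actual interest in each item
clicks = np.ones(n_items)           # click counts the recommender trains on
ever_shown: set = set()

for _ in range(20):
    top_k = np.argsort(-clicks, kind="stable")[:k]   # recommend click leaders
    ever_shown.update(int(i) for i in top_k)
    clicked = rng.random(k) < true_pref[top_k]       # only shown items get clicks
    clicks[top_k[clicked]] += 1                      # next round trains on logs

print(f"{len(ever_shown)} of {n_items} items were ever recommended")
```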


Limitations

  • The focus is only on image and text modalities.
  • Other ways of measuring alignment?


Questions?
