
The Platonic Representation Hypothesis

Minyoung Huh, Brian Cheung, Tongzhou Wang, Phillip Isola

(ICML 2024)

Presenters: Ankur Sikarwar, Anirudh Buvanesh


Hypothesis

Different neural networks are converging toward the same way of representing the world.


Outline

  • Background
  • Convergence Experiments
  • What is driving convergence?
  • Implications and limitations


Rosetta Neurons

Dravid, Amil, et al. "Rosetta neurons: Mining the common units in a model zoo." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.

Evidence from Model Stitching


Bansal, Yamini, Preetum Nakkiran, and Boaz Barak. "Revisiting model stitching to compare neural representations." Advances in neural information processing systems 34 (2021): 225-236.

[Figure: model stitching, in which early layers trained with method/data A are combined with later layers trained with method/data B through a learned stitching layer; if performance survives, the two representations are compatible.]


Outline

  • Background
  • Convergence Experiments
  • What is driving convergence?
  • Implications and limitations


Characterizing Representations using Kernels

Focus on vector embeddings: a representation is a function f : X → R^n that maps each input to a vector.

Characterizing a representation in terms of its kernel: K(x_i, x_j) = ⟨f(x_i), f(x_j)⟩, which records how similar the representation considers each pair of inputs.

[Figure: the kernel K_vision of a vision model; entries are high for pairs of inputs the model represents as similar and low for dissimilar pairs.]

Slide credit: Phillip Isola
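To make the definition concrete, here is a minimal sketch (the random features below stand in for any model's embeddings):

```python
import numpy as np

def kernel(features: np.ndarray) -> np.ndarray:
    """Inner-product kernel K[i, j] = <f(x_i), f(x_j)> for row-wise features."""
    return features @ features.T

# Toy example: 4 inputs embedded in R^3 (random stand-ins for real features).
rng = np.random.default_rng(0)
F = rng.normal(size=(4, 3))
K = kernel(F)  # (4, 4) matrix encoding which inputs the model treats as similar
```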

Characterizing Representations using Kernels

[Figure: alignment as kernel similarity, sim(K_DINO, K_CLIP): do two models such as DINO and CLIP induce the same similarity structure over the same inputs?]

Slide credit: Phillip Isola

Nearest-neighbor kernel-alignment metric

What percentage of my nearest neighbors under representation f are also my nearest neighbors under representation g?

[Figure: a query point with its nearest-neighbor sets under f and under g; alignment is the overlap between the two sets.]

Slide credit: Phillip Isola
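A minimal sketch of this metric (the cosine-similarity choice and the value of k are implementation details assumed here; the paper's measure is a mutual nearest-neighbor metric of this flavor):

```python
import numpy as np

def mutual_knn_alignment(feats_f: np.ndarray, feats_g: np.ndarray, k: int = 10) -> float:
    """Average fraction of shared k-nearest neighbors between two representations.

    feats_f: (n, d_f) and feats_g: (n, d_g) are embeddings of the SAME n inputs.
    Returns a score in [0, 1]; 1 means identical neighborhood structure.
    """
    def knn_indices(feats: np.ndarray) -> np.ndarray:
        normed = feats / np.linalg.norm(feats, axis=1, keepdims=True)
        sims = normed @ normed.T
        np.fill_diagonal(sims, -np.inf)        # a point is not its own neighbor
        return np.argsort(-sims, axis=1)[:, :k]

    nn_f, nn_g = knn_indices(feats_f), knn_indices(feats_g)
    overlaps = [len(set(a) & set(b)) / k for a, b in zip(nn_f, nn_g)]
    return float(np.mean(overlaps))
```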

Experiment: Is alignment between vision models increasing as vision systems become stronger?

Hypothesis 1: There are many different ways one can represent the visual world, and each can be highly effective.

Hypothesis 2: All strong visual representations are alike.

Experiment: Is alignment between vision models increasing as vision systems become stronger?

  • 78 vision models: different architectures, objectives, and training data distributions.
  • Group models by performance on VTAB, and measure representational similarity within each group.


Experiment: Is alignment between vision models increasing as vision systems become stronger?


All strong representations are alike, each weak representation is weak in its own way.

Experiment: Is language-vision alignment increasing?

Hypothesis 1: As language models get better and better, they will become more and more specific to language, and start being less generally useful for vision.

Hypothesis 2: Better language models are better vision models.

Hypothesis 2+: The best language model is the best vision model. They converge to the same representation.

Experiment: Is language-vision alignment increasing?

[Figure: cross-modal alignment measured as sim(K_vision, K_text): the same concepts (“apple”, “orange”, “elephant”, “giraffe”) are embedded as images by a vision model and as words by a language model, and the two kernels are compared.]
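Cross-modal alignment can be scored the same way as before: embed paired data with each model and reuse the mutual nearest-neighbor metric. A hypothetical usage sketch (`image_encoder`, `text_encoder`, `images`, and `captions` are placeholders, not objects from the paper):

```python
# Hypothetical encoders and paired data: row i of `images` depicts the same
# concept that row i of `captions` describes.
img_feats = image_encoder(images)    # (n, d_vision): rows inducing K_vision
txt_feats = text_encoder(captions)   # (n, d_text):   rows inducing K_text
score = mutual_knn_alignment(img_feats, txt_feats, k=10)
print(f"language-vision alignment: {score:.3f}")
```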

Experiment: Is language-vision alignment increasing?

Strong models converge in representation.

[Figures: alignment trending towards convergence, and alignment versus downstream performance.]


Outline

  • Background
  • Convergence Experiments
  • What is driving convergence?
  • Implications and limitations


What is Driving Convergence?

  • Multi-task scaling hypothesis
  • Capacity hypothesis
  • Simplicity bias


Multi-Task Scaling Hypothesis

  • Pre-training objectives implicitly do multi-task learning: every task a model must solve constrains which representations are acceptable (see the toy count below).

[Figure: within the hypothesis space, the set of functions that solves task 1 overlaps the set that solves task 2; only the intersection solves both tasks, so each added task shrinks the feasible region.]
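A toy count that makes the picture concrete (a constructed example, not from the paper): on a 4-point input space there are 16 boolean functions; each task's constraints cut the feasible set down, and satisfying both tasks leaves fewer functions than either alone.

```python
from itertools import product

inputs = list(range(4))    # a 4-point input space: 2**4 = 16 boolean functions
task1 = {0: 1, 1: 0}       # task 1 pins the outputs on inputs 0 and 1
task2 = {1: 0, 2: 1}       # task 2 pins the outputs on inputs 1 and 2

all_fns = [dict(zip(inputs, outs)) for outs in product([0, 1], repeat=len(inputs))]

def solves(fn: dict, task: dict) -> bool:
    return all(fn[x] == y for x, y in task.items())

n_task1 = sum(solves(f, task1) for f in all_fns)
n_both = sum(solves(f, task1) and solves(f, task2) for f in all_fns)
print(n_task1, n_both)     # 4 2: each added task shrinks the feasible set
```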


Capacity Hypothesis

  • High-capacity networks can represent a bigger hypothesis space, so as scale grows, models with different architectures become more likely to find the same optimal representation.

[Figure: models of different architectures converging to the same solution as scale increases.]


Simplicity Bias

  • Deep networks are biased toward finding simple fits to the data.

[Figure: within the hypothesis space, among all functions that solve the task, the simplicity bias favors the simple ones.]

  • Regularization reinforces this bias (see the sketch below):
    • Weight decay
    • Dropout
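For reference, here is how the two regularizers named above typically appear in PyTorch (a generic sketch, not code from the paper):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(128, 256),
    nn.ReLU(),
    nn.Dropout(p=0.1),   # dropout: randomly zeroes activations during training
    nn.Linear(256, 10),
)

# Weight decay: an L2 penalty on parameters, applied inside the optimizer step.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
```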


What are we converging to?

  • World of objects of different colors
  • Different modalities are (lossy) projections of the world

[Figure: a world state, “A red sphere next to a blue cone”, observed through different modality projections.]

Question: What do different modalities have in common?

Hypothesis: Convergence towards co-occurrences of events in the world


Evidence of Convergence Towards Co-Occurrences

  • Colors that co-occur in the world are represented closer together in the embedding space (see the sketch below).
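A simplified sketch of how such a co-occurrence analysis can be run (the color binning and the PMI-plus-spectral embedding are choices made for this illustration, not necessarily the paper's exact procedure):

```python
import numpy as np

def color_cooccurrence_embedding(images: np.ndarray, n_bins: int = 8, dim: int = 3) -> np.ndarray:
    """Embed quantized colors from their within-image co-occurrence statistics.

    images: (n_images, H, W, 3) uint8 array. Returns (n_bins**3, dim) embeddings;
    colors that often appear in the same image should land close together.
    """
    n_colors = n_bins ** 3
    counts = np.zeros((n_colors, n_colors))
    for img in images:
        q = (img // (256 // n_bins)).reshape(-1, 3)            # quantize pixels
        ids = np.unique(q[:, 0] * n_bins**2 + q[:, 1] * n_bins + q[:, 2])
        counts[np.ix_(ids, ids)] += 1                          # co-present colors
    p = counts / counts.sum()
    marginal = p.sum(axis=1, keepdims=True)
    pmi = np.log((p + 1e-12) / (marginal @ marginal.T + 1e-12))
    vals, vecs = np.linalg.eigh(pmi)                           # spectral embedding
    return vecs[:, -dim:] * np.sqrt(np.abs(vals[-dim:]))
```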


Outline

  • Background
  • Convergence Experiments
  • What is driving convergence?
  • Implications and limitations


Implication: Sharing Data Between Modalities

  • Visual data can improve language models, and language data can improve vision models [1].

  • Transfer from vision-language models to embodied AI [2].


[1]: Radford, Alec, et al. "Learning transferable visual models from natural language supervision." ICML, 2021.

[2]: Black, Kevin, et al. "π0: A vision-language-action flow model for general robot control." arXiv, 2024.


Limitation: Why are we far from high alignment?

  • Different modalities may contain different information.

“An aurora is a breathtaking display of swirling, luminous colors that dance across the night sky.”


Increasing Mutual Information Improves Alignment

  • With denser captions, the image-to-text mapping becomes closer to bijective, improving language-vision alignment scores.


Limitation: Convergence towards biases?

  • Biases from data collection procedures, e.g., the recommendation feedback loop below:

  1. User’s intent: shock-absorbing badminton shoes.
  2. Top-K recommendations don’t capture the user’s interests.
  3. Relevant items don’t get clicks.
  4. Data for the next iteration is curated from click logs.
  5. Diversity is lost, and the system drifts further from the user’s interests.
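A toy simulation of this loop (a constructed example, not from the slides' source): a recommender that retrains on its own click logs quickly locks onto a handful of items and stops exploring.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, k = 50, 5
true_pref = rng.random(n_items)     # the user's actual interest in each item
clicks = np.ones(n_items)           # click counts the recommender trains on
ever_shown: set = set()

for _ in range(20):
    top_k = np.argsort(-clicks, kind="stable")[:k]   # recommend click leaders
    ever_shown.update(int(i) for i in top_k)
    clicked = rng.random(k) < true_pref[top_k]       # only shown items get clicks
    clicks[top_k[clicked]] += 1                      # next round trains on logs

print(f"{len(ever_shown)} of {n_items} items were ever recommended")
```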


Limitations

  • The focus is only on image and text modalities.
  • Other ways of measuring alignment?


Questions?
