The Platonic Representation Hypothesis
Minyoung Huh, Brian Cheung, Tongzhou Wang, Phillip Isola
(ICML 2024)
Presenters: Ankur Sikarwar, Anirudh Buvanesh
Hypothesis
Different neural networks are converging toward the same way of representing the world.
Outline
Rosetta Neurons
Dravid, Amil, et al. "Rosetta neurons: Mining the common units in a model zoo." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.
Evidence from Model Stitching
Bansal, Yamini, Preetum Nakkiran, and Boaz Barak. "Revisiting model stitching to compare neural representations." Advances in neural information processing systems 34 (2021): 225-236.
[Figure: stitching, with the bottom layers of a network trained with method/data A connected, through a learned stitch layer, to the top layers of a network trained with method/data B]
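Concretely, stitching trains only a small connector between two frozen networks: the bottom of A feeds, through a learned stitch layer, into the top of B, and high stitched accuracy suggests the two representations are interchangeable up to a simple map. A minimal PyTorch sketch under assumed conditions (two torchvision ResNet-18s standing in for models A and B, an illustrative split after layer2, and a 1x1-conv stitch); this is not the exact configuration of Bansal et al.:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

# Stand-ins for two independently trained networks ("method/data A" and "B").
model_a = resnet18(weights=None)
model_b = resnet18(weights=None)

class StitchedModel(nn.Module):
    """Bottom-of-A -> learned stitch -> top-of-B; only the stitch is trained."""
    def __init__(self, a, b):
        super().__init__()
        # Bottom: A's layers up to and including layer2 (output: 128 channels).
        self.bottom = nn.Sequential(a.conv1, a.bn1, a.relu, a.maxpool,
                                    a.layer1, a.layer2)
        # Stitch: a 1x1 conv mapping A's activation space into B's.
        self.stitch = nn.Conv2d(128, 128, kernel_size=1)
        # Top: the remainder of B.
        self.top = nn.Sequential(b.layer3, b.layer4, b.avgpool,
                                 nn.Flatten(), b.fc)
        for p in list(self.bottom.parameters()) + list(self.top.parameters()):
            p.requires_grad = False  # freeze both original networks

    def forward(self, x):
        return self.top(self.stitch(self.bottom(x)))

stitched = StitchedModel(model_a, model_b)
optimizer = torch.optim.Adam(stitched.stitch.parameters(), lr=1e-3)
# Train on the original task; if stitched accuracy matches the unstitched
# models, their intermediate representations are compatible.
```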
Outline
Characterizing Representations using Kernels
Focus on vector embeddings: a representation assigns each input a vector.
Characterizing a representation in terms of its kernel: the kernel records which pairs of inputs the representation treats as similar and which as dissimilar.
[Figure: K_vision, a kernel over images; similar pairs score high, dissimilar pairs score low]
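In symbols, the paper characterizes a representation by the inner-product kernel of its embedding (stated here for completeness; notation follows the paper):

```latex
% A representation is an embedding f; its kernel is the inner product of
% embeddings, measuring how similar the model considers two inputs to be.
f : \mathcal{X} \to \mathbb{R}^{n}, \qquad
K(x_i, x_j) = \bigl\langle f(x_i),\, f(x_j) \bigr\rangle
```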
Comparing two representations by comparing their kernels:
[Figure: sim(K_DINO, K_CLIP), comparing the kernel of a DINO vision model with the kernel of a CLIP vision model over the same images]
Slide credit: Phillip Isola
Nearest-neighbor kernel-alignment metric
What percentage of my nearest neighbors under representation f are also my nearest neighbors under representation g?
[Figure: a query point's nearest-neighbor set under f overlaid with its nearest-neighbor set under g]
Slide credit: Phillip Isola
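A minimal NumPy sketch of this metric as described (mutual nearest-neighbor overlap); the cosine similarity, the choice of k, and the toy embeddings are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def mutual_knn_alignment(feats_f, feats_g, k=10):
    """Average fraction of each point's k nearest neighbors under f that
    are also among its k nearest neighbors under g."""
    def knn_indices(feats):
        z = feats / np.linalg.norm(feats, axis=1, keepdims=True)
        sim = z @ z.T                       # cosine similarity, all pairs
        np.fill_diagonal(sim, -np.inf)      # exclude self-matches
        return np.argsort(-sim, axis=1)[:, :k]

    nn_f, nn_g = knn_indices(feats_f), knn_indices(feats_g)
    overlaps = [len(set(a) & set(b)) / k for a, b in zip(nn_f, nn_g)]
    return float(np.mean(overlaps))

# Toy usage: a representation and a linear re-embedding of the same inputs.
rng = np.random.default_rng(0)
x_f = rng.normal(size=(100, 64))
x_g = x_f @ rng.normal(size=(64, 32))
print(mutual_knn_alignment(x_f, x_g, k=10))  # substantial overlap expected
```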
Experiment: Is alignment between vision models increasing as vision systems become stronger?
Hypothesis 1: There are many different ways one can represent the visual world, and each can be highly effective.
Hypothesis 2: All strong visual representations are alike.
Finding: All strong representations are alike, each weak representation is weak in its own way.
Experiment: Is language-vision alignment increasing?
Hypothesis 1: As language models get better and better, they will become more and more specific to language, and start being less generally useful for vision.
Hypothesis 2: Better language models are better vision models.
Hypothesis 2+: The best language model is the best vision model; they converge to the same representation.
[Figure: sim(K_vision, K_text), comparing the vision kernel over images of an apple, an orange, an elephant, and a giraffe with the text kernel over the words “apple”, “orange”, “elephant”, “giraffe”]
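A sketch of how this comparison could be run, reusing the mutual_knn_alignment helper from the earlier sketch; the random arrays below are stand-ins for row-aligned image and caption embeddings (row i of each describes the same concept):

```python
import numpy as np

rng = np.random.default_rng(1)
image_embeddings = rng.normal(size=(200, 768))  # stand-in vision features
text_embeddings = rng.normal(size=(200, 512))   # stand-in language features

# Alignment: mutual k-NN overlap between the vision and text kernels.
score = mutual_knn_alignment(image_embeddings, text_embeddings, k=10)
print(f"language-vision alignment: {score:.3f}")
```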
Finding: strong models converge in representation.
[Figure: “Towards convergence”: language-vision alignment rises as models get stronger]
[Figure: “Alignment to downstream performance”: language models that align better with vision also score better on downstream tasks]
Outline
What is Driving Convergence?
Multi-Task Scaling Hypothesis
Claim: as models are trained to solve more tasks at once, fewer representations remain that can solve them all.
[Figure: within the hypothesis space, the functions that solve task 1 overlap the functions that solve task 2; a model that solves both tasks must land in the intersection]
Capacity Hypothesis
Claim: larger models are more likely to contain, and to find, the shared optimal representation, across scale and architectures.
[Figure: as scale grows, the hypothesis spaces of different architectures all come to cover the same optimum]
Simplicity Bias
Claim: among all functions that fit the data, deep networks prefer simple ones, so independently trained models settle on similar simple solutions.
[Figure: within the hypothesis space, simplicity bias selects the intersection of the functions that solve the task with the simple functions]
What are we converging to?
Example: an image of a red sphere next to a blue cone and the caption “a red sphere next to a blue cone” are two projections of the same underlying event.
Question: What do different modalities have in common?
Hypothesis: representations converge toward a shared statistical model of the co-occurrences of events in the world.
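The paper's theory section argues that an idealized contrastive learner over such co-occurrences converges to a kernel given by the pointwise mutual information (PMI) of events. A toy NumPy sketch of that target kernel, assuming co-occurrence counts can be tallied directly:

```python
import numpy as np

def pmi_kernel(cooccurrence_counts):
    """PMI kernel over events: K[a, b] = log( P(a, b) / (P(a) * P(b)) )."""
    counts = np.asarray(cooccurrence_counts, dtype=float)
    p_joint = counts / counts.sum()        # joint P(a, b)
    p_marginal = p_joint.sum(axis=1)       # marginal P(a)
    return np.log(p_joint / np.outer(p_marginal, p_marginal))

# Toy world with three events; the first two co-occur often.
counts = np.array([[50, 30,  2],
                   [30, 40,  3],
                   [ 2,  3, 60]])
print(pmi_kernel(counts))
```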
Evidence of Convergence Towards Co-Occurrences
Outline
Implication: Sharing Data Between Modalities
[1]: Radford, Alec, et al. "Learning transferable visual models from natural language supervision." ICML, 2021.
[2]: Black, Kevin, et al. "π0: A vision-language-action flow model for general robot control." arXiv preprint, 2024.
Limitation: Why are we far from high alignment?
“An aurora is a breathtaking display of swirling, luminous colors that dance across the night sky.”
Increasing Mutual Information Improves Alignment
Limitation: Convergence towards biases?
User’s intent: shock-absorbing badminton shoes.
Top-K recommendations don’t capture the user’s interests.
Relevant items don’t get clicks.
Data for the next iteration is curated from click logs.
Result: loss of diversity, and the system drifts further from the user’s interests.
Limitations
Questions?