1 of 53

CSE 5524: Representation learning


2 of 53

HW 3

  • Due: 4/8/2025
  • Focus on the region proposal network (RPN).
  • A bit more implementation, so start earlier.

3 of 53

Final project (30%)

  • Project proposal: 4/1 (3%)
  • Instructions will be released soon

4 of 53

Today (Chapter 30)

  • Representation learning


5 of 53

Representation learning

  • So far, we have learned about several neural network architectures

  • They can all output a “vector” to represent an image

6 of 53

The usefulness of representation


Retrieval, image-to-image search
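Retrieval can be made concrete: embed the query and every database image with the encoder, then rank by similarity in the embedding space. A minimal numpy sketch, where the random vectors stand in for encoder outputs and `retrieve` is a hypothetical helper name:

```python
import numpy as np

def retrieve(query_z, database_z, topk=3):
    """Image-to-image search: rank database embeddings by cosine similarity."""
    q = query_z / np.linalg.norm(query_z)
    db = database_z / np.linalg.norm(database_z, axis=1, keepdims=True)
    sims = db @ q                      # cosine similarity of each entry to the query
    return np.argsort(-sims)[:topk]    # indices of the best matches

# Toy database of 10 "image embeddings"; the query is a noisy copy of entry 5.
rng = np.random.default_rng(0)
database = rng.normal(size=(10, 16))
query = database[5] + 0.01 * rng.normal(size=16)
best = retrieve(query, database)       # best[0] should be index 5
```

The same ranking step underlies real systems; at scale the brute-force `argsort` is replaced by an approximate nearest-neighbor index.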

7 of 53

The representation learning setup

  • How can we teach the neural network encoder to capture image relationships?

[Figure: an image is fed into a neural net encoder, producing a representation vector z]

8 of 53

The representation learning setup

  • What properties should z possess?
  • What is the objective function to learn f to achieve it?

9 of 53

The representation learning setup

  • z has lower dimensionality
  • P(z) has a simple structure (a simple distribution)
  • Dimensions are disentangled

10 of 53

What makes a good representation?

  • Compression:
    • Less memory
    • Invariant to nuisance
    • Occam’s razor: Less is more

  • Prediction:
    • Predictive of the future/past
    • Predictive of any question you can ask about the image


11 of 53

An overview

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]

12 of 53

Autoencoder

Objective function: minimize the reconstruction error ‖x − g(f(x))‖², where z = f(x) is the code and g is the decoder

Design principle: z with lower dimensionality

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]
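The objective above can be illustrated with a tiny linear autoencoder trained by gradient descent on the reconstruction loss. This is a hedged numpy sketch under toy assumptions (data lying near a 3-D subspace of a 10-D space; all variable names are illustrative), not the book's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 points in 10-D that lie near a 3-D subspace.
basis = rng.normal(size=(3, 10))
X = rng.normal(size=(200, 3)) @ basis + 0.01 * rng.normal(size=(200, 10))

# Linear autoencoder: encoder f(x) = x @ W_e (10 -> 3 bottleneck),
# decoder g(z) = z @ W_d (3 -> 10).
W_e = 0.1 * rng.normal(size=(10, 3))
W_d = 0.1 * rng.normal(size=(3, 10))

loss0 = np.mean((X - (X @ W_e) @ W_d) ** 2)  # reconstruction error before training

lr = 0.01
for _ in range(500):
    Z = X @ W_e                  # code z: lower-dimensional than x
    err = Z @ W_d - X            # reconstruction residual g(f(x)) - x
    # Gradient steps on the mean squared reconstruction loss.
    W_d -= lr * (Z.T @ err) / len(X)
    W_e -= lr * (X.T @ (err @ W_d.T)) / len(X)

loss = np.mean((X - (X @ W_e) @ W_d) ** 2)   # much smaller than loss0
```

With a linear encoder/decoder this objective recovers (the span of) the top principal components; replacing the matrix multiplies with deep nonlinear networks gives the autoencoders shown on the slides.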

13 of 53

Autoencoder

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]

14 of 53

Example

Training data

Nearest neighbor search

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]

15 of 53

Predictive encoding

Pretext tasks:

A good representation should solve these tasks well

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]

19 of 53

Predictive encoding

Encoder

[Noroozi et al., 2016]

20 of 53

Predictive encoding

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]

Object detectors emerge from scene recognition

21 of 53

Self-supervised learning

  • Come up with a “pretext” target from the raw data itself

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]

22 of 53

Self-supervised learning: context encoder

[Pathak et al., CVPR 2016]

23 of 53

Self-supervised learning: masked autoencoder (MAE)

[He et al., CVPR 2022]
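The core MAE recipe can be sketched in a few lines: split the image into patches, hide a large fraction of them, and compute the loss only on the hidden ones. In this hedged numpy sketch the "prediction" is a trivial stand-in (the mean visible patch) just to show where the mask enters the loss; the actual MAE uses ViT encoder/decoder networks:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "image": 16 patches, each flattened to an 8-D vector.
patches = rng.normal(size=(16, 8))

# Mask a large fraction of the patches (MAE uses ~75%).
mask_ratio = 0.75
n_mask = int(mask_ratio * len(patches))
masked_idx = rng.choice(len(patches), size=n_mask, replace=False)
visible_idx = np.setdiff1d(np.arange(len(patches)), masked_idx)

# The encoder sees only the visible patches ...
visible = patches[visible_idx]
# ... and the decoder must predict the masked ones. As a stand-in
# "prediction", use the mean visible patch at every masked position.
pred = np.tile(visible.mean(axis=0), (n_mask, 1))

# Crucially, the loss is computed only on the masked patches.
loss = np.mean((pred - patches[masked_idx]) ** 2)
```

Processing only the ~25% visible patches in the encoder is also what makes MAE pre-training cheap relative to full-image encoders.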

24 of 53

Imputation

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]

26 of 53

Clustering

  • So far, we have considered “continuous” output vectors

  • We can also learn a “discrete” (“one-hot”) representation

    • Lightweight, abstract representation
    • More aligned with language

27 of 53

Clustering

  • Classification is “supervised” clustering

  • Without labels, we need some other way to group the data

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]

28 of 53

K-means

  • Iterative clustering:
    • Initialize “cluster center”
    • Assign each data point to its closest center
    • Re-calculate “cluster means”

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]

32 of 53

K-means

  • Algorithm

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]
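The iterative steps above can be written out directly. A minimal numpy sketch, assuming Euclidean distance and random initialization from the data points (a common but not unique choice):

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain k-means: alternate assignment and cluster-mean updates."""
    rng = np.random.default_rng(seed)
    # Initialize cluster centers at k distinct data points.
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        # Assign each data point to its closest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Recalculate the cluster means (skip empty clusters).
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    # Final assignment with the converged centers.
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
    return centers, dists.argmin(axis=1)

# Two well-separated blobs of 50 points each: k-means recovers them.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, size=(50, 2)),
               rng.normal(5.0, 0.1, size=(50, 2))])
centers, labels = kmeans(X, k=2)
```

Each iteration can only decrease the within-cluster squared distance, so the algorithm converges, though possibly to a local optimum; in practice one runs it from several random initializations.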

33 of 53

K-means + deep learning

  • Deep Clustering for Unsupervised Learning of Visual Features [Caron et al., ECCV 2018]

34 of 53

K-means + deep learning

  • Neural Discrete Representation Learning [van den Oord et al., NeurIPS 2017]
    • Algorithm known as VQ-VAE
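The discretization step at the heart of VQ-VAE is a nearest-neighbor lookup into a learned codebook: each encoder output is snapped to its closest codebook vector, and the index becomes the discrete code. A hedged numpy sketch of that step alone (the full algorithm also needs the straight-through gradient estimator and codebook/commitment losses, omitted here):

```python
import numpy as np

def quantize(z, codebook):
    """VQ step: snap each encoder output to its nearest codebook entry
    and return the discrete indices plus the quantized vectors."""
    dists = np.linalg.norm(z[:, None, :] - codebook[None, :, :], axis=-1)
    idx = dists.argmin(axis=1)     # discrete ("one-hot") code per vector
    return idx, codebook[idx]

# Toy codebook of 4 entries in 2-D; the two inputs land on entries 1 and 3.
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
z = np.array([[0.9, 0.1], [1.1, 0.8]])
idx, z_q = quantize(z, codebook)   # idx is the lightweight discrete code
```

The decoder then reconstructs the input from `z_q`, so the codebook indices alone form the compressed, language-like representation of the data.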

35 of 53

Question: what do different algorithms learn?

36 of 53

Contrastive learning

  • Make the representation “invariant” to certain transformations (views): f(T(x)) ≈ f(x)

    • f is the neural network encoder
    • T is a certain transformation (such as image translation)

37 of 53

Data transformation


[Chen et al., A Simple Framework for Contrastive Learning of Visual Representations, ICML 2020]

38 of 53

Contrastive learning

  • Make the representation “invariant” to certain transformations (views): f(T(x)) ≈ f(x)

    • f is the neural network encoder
    • T is a certain transformation (such as image translation)

  • Question: there is a trivial solution, in which f always outputs 0 (a collapsed representation)

39 of 53

Contrastive learning: transformation

  • Solution:
    • Positive pairs (two views of the same image): pull their representations together

    • Negative pairs (views of different images): push their representations apart

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]
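The positive/negative-pair idea is commonly implemented with an InfoNCE-style loss, as in SimCLR's NT-Xent. Below is a simplified one-directional numpy sketch (the real NT-Xent symmetrizes over both views and excludes self-similarities; names here are illustrative):

```python
import numpy as np

def info_nce(z1, z2, tau=0.5):
    """InfoNCE-style loss: z1[i] and z2[i] are embeddings of two
    transformed views of image i (a positive pair); every other
    pairing in the batch serves as a negative pair."""
    # L2-normalize so the dot product is cosine similarity.
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    sim = (z1 @ z2.T) / tau                  # temperature-scaled similarities
    # Cross-entropy where the "correct match" for row i is column i.
    return np.mean(np.log(np.exp(sim).sum(axis=1)) - np.diag(sim))

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
loss_aligned = info_nce(z, z)                        # perfect positive pairs
loss_random = info_nce(z, rng.normal(size=(8, 16)))  # unrelated "views"
```

Because the loss also pushes apart the off-diagonal (negative) pairs, the trivial constant-output solution no longer minimizes it, which is exactly why the negatives are needed.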

40 of 53

Contrastive learning: co-occurrence

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]

41 of 53

Contrastive learning

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]

42 of 53

Contrastive learning: exemplar losses

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]

43 of 53


[Google AI blog, Advancing Self-Supervised and Semi-Supervised Learning with SimCLR]

44 of 53

Potential final project

  • Design your transformation for invariance

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]

See Foundations of Computer Vision, Section 30.10.2, Figure 30.13

45 of 53

Research example

46 of 53

Rethinking pre-training for object detectors

Goal: Bridge the input point cloud and output bounding boxes

Our idea: Help the backbone cluster points into object instances or parts

[Figure: a feature backbone followed by a detection head]

47 of 53

Our proposal: leverage color information

Motivation: each object instance or part often possesses a coherent color and has a sharp contrast to the background

Hypothesis: Learning to colorize LiDAR point clouds would equip the backbone with the semantic cues to segment points

[Figures: region growing [Preetha et al., 2012] and superpixels [Liu et al., 2011]]

48 of 53

Our solution: “grounded” point colorization (GPC)

Provide colors of a subset of points as hints


50 of 53

GPC Pre-training + fine-tuning

[Figure: N input points pass through the backbone; the resulting features are concatenated with color hints and fed to a color decoder]

Output embeddings indicate which points should be colored/segmented together

Pre-training = Point-wise color regression problem

51 of 53

GPC Pre-training + fine-tuning

[Figure: the same colorization pipeline, with the pre-trained backbone now also feeding a detection head]

52 of 53

GPC Pre-training + fine-tuning

[Figure: during pre-training, the backbone feeds the color decoder (with concatenated hints); during fine-tuning, the decoder is replaced by a detection head]

Pre-Training LiDAR-Based 3D Object Detectors Through Colorization, ICLR 2024

53 of 53

What does GPC really learn?

Predict the exact color of each point

Predict which points possess the same color and are segmented together
