1 of 53

CSE 5524: Representation learning


2 of 53

HW 3

  • Due: 4/8/2025
  • Focus on the region proposal network (RPN).
  • A bit more implementation, so start earlier.

3 of 53

Final project (30%)

  • Project proposal: 4/1 (3%)
  • Instructions will be released soon

4 of 53

Today (Chapter 30)

  • Representation learning


5 of 53

Representation learning

  • So far, we have learned about several neural network architectures

  • They can all output a “vector” to represent an image

6 of 53

The usefulness of representation


Retrieval, image-to-image search
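Retrieval can be made concrete: embed the query and every database image with the encoder, then rank by similarity in the embedding space. A minimal numpy sketch, where the random vectors stand in for encoder outputs and `retrieve` is a hypothetical helper name:

```python
import numpy as np

def retrieve(query_z, database_z, topk=3):
    """Image-to-image search: rank database embeddings by cosine similarity."""
    q = query_z / np.linalg.norm(query_z)
    db = database_z / np.linalg.norm(database_z, axis=1, keepdims=True)
    sims = db @ q                      # cosine similarity of each entry to the query
    return np.argsort(-sims)[:topk]    # indices of the best matches

# Toy database of 10 "image embeddings"; the query is a noisy copy of entry 5.
rng = np.random.default_rng(0)
database = rng.normal(size=(10, 16))
query = database[5] + 0.01 * rng.normal(size=16)
best = retrieve(query, database)       # best[0] should be index 5
```

The same ranking step underlies real systems; at scale the brute-force `argsort` is replaced by an approximate nearest-neighbor index.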

7 of 53

The representation learning setup

  • How can we teach the neural network encoder to capture image relationships?

[Figure: an image is fed into a neural net encoder, producing a representation vector z]

8 of 53

The representation learning setup

  • What properties should z possess?
  • What is the objective function to learn f to achieve it?

9 of 53

The representation learning setup

  • z has lower dimensionality
  • P(z) has a simple structure (a simple distribution)
  • Dimensions are disentangled

10 of 53

What makes a good representation?

  • Compression:
    • Less memory
    • Invariant to nuisance
    • Occam’s razor: Less is more

  • Prediction:
    • Predictive of the future/past
    • Predictive of any question you can ask about the image


11 of 53

An overview

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]

12 of 53

Autoencoder

Objective function: minimize the reconstruction error ‖x − g(f(x))‖², where z = f(x) is the code and g is the decoder

Design principle: z with lower dimensionality

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]
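The objective above can be illustrated with a tiny linear autoencoder trained by gradient descent on the reconstruction loss. This is a hedged numpy sketch under toy assumptions (data lying near a 3-D subspace of a 10-D space; all variable names are illustrative), not the book's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 points in 10-D that lie near a 3-D subspace.
basis = rng.normal(size=(3, 10))
X = rng.normal(size=(200, 3)) @ basis + 0.01 * rng.normal(size=(200, 10))

# Linear autoencoder: encoder f(x) = x @ W_e (10 -> 3 bottleneck),
# decoder g(z) = z @ W_d (3 -> 10).
W_e = 0.1 * rng.normal(size=(10, 3))
W_d = 0.1 * rng.normal(size=(3, 10))

loss0 = np.mean((X - (X @ W_e) @ W_d) ** 2)  # reconstruction error before training

lr = 0.01
for _ in range(500):
    Z = X @ W_e                  # code z: lower-dimensional than x
    err = Z @ W_d - X            # reconstruction residual g(f(x)) - x
    # Gradient steps on the mean squared reconstruction loss.
    W_d -= lr * (Z.T @ err) / len(X)
    W_e -= lr * (X.T @ (err @ W_d.T)) / len(X)

loss = np.mean((X - (X @ W_e) @ W_d) ** 2)   # much smaller than loss0
```

With a linear encoder/decoder this objective recovers (the span of) the top principal components; replacing the matrix multiplies with deep nonlinear networks gives the autoencoders shown on the slides.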

13 of 53

Autoencoder

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]

14 of 53

Example

Training data

Nearest neighbor search

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]

15 of 53

Predictive encoding

Pretext tasks:

A good representation should solve these tasks well

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]

19 of 53

Predictive encoding

Encoder

[Noroozi et al., 2016]

20 of 53

Predictive encoding

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]

Object detectors emerge from scene recognition

21 of 53

Self-supervised learning

  • Come up with a “pretext” target from the raw data itself

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]

22 of 53

Self-supervised learning: context encoder

[Pathak et al., CVPR 2016]

23 of 53

Self-supervised learning: masked autoencoder (MAE)

[He et al., CVPR 2022]
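The core MAE recipe can be sketched in a few lines: split the image into patches, hide a large fraction of them, and compute the loss only on the hidden ones. In this hedged numpy sketch the "prediction" is a trivial stand-in (the mean visible patch) just to show where the mask enters the loss; the actual MAE uses ViT encoder/decoder networks:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "image": 16 patches, each flattened to an 8-D vector.
patches = rng.normal(size=(16, 8))

# Mask a large fraction of the patches (MAE uses ~75%).
mask_ratio = 0.75
n_mask = int(mask_ratio * len(patches))
masked_idx = rng.choice(len(patches), size=n_mask, replace=False)
visible_idx = np.setdiff1d(np.arange(len(patches)), masked_idx)

# The encoder sees only the visible patches ...
visible = patches[visible_idx]
# ... and the decoder must predict the masked ones. As a stand-in
# "prediction", use the mean visible patch at every masked position.
pred = np.tile(visible.mean(axis=0), (n_mask, 1))

# Crucially, the loss is computed only on the masked patches.
loss = np.mean((pred - patches[masked_idx]) ** 2)
```

Processing only the ~25% visible patches in the encoder is also what makes MAE pre-training cheap relative to full-image encoders.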

24 of 53

Imputation

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]

26 of 53

Clustering

  • So far, we have considered “continuous” output vectors

  • We can also learn a “discrete” (“one-hot”) representation

    • Lightweight, abstract representation
    • More aligned with language

27 of 53

Clustering

  • Classification is “supervised” clustering

  • Without labels, we need some other way to group the data

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]

28 of 53

K-means

  • Iterative clustering:
    • Initialize “cluster center”
    • Assign each data point to its closest center
    • Re-calculate “cluster means”

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]

32 of 53

K-means

  • Algorithm

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]
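The iterative steps above can be written out directly. A minimal numpy sketch, assuming Euclidean distance and random initialization from the data points (a common but not unique choice):

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain k-means: alternate assignment and cluster-mean updates."""
    rng = np.random.default_rng(seed)
    # Initialize cluster centers at k distinct data points.
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        # Assign each data point to its closest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Recalculate the cluster means (skip empty clusters).
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    # Final assignment with the converged centers.
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
    return centers, dists.argmin(axis=1)

# Two well-separated blobs of 50 points each: k-means recovers them.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, size=(50, 2)),
               rng.normal(5.0, 0.1, size=(50, 2))])
centers, labels = kmeans(X, k=2)
```

Each iteration can only decrease the within-cluster squared distance, so the algorithm converges, though possibly to a local optimum; in practice one runs it from several random initializations.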

33 of 53

K-means + deep learning

  • Deep Clustering for Unsupervised Learning of Visual Features [Caron et al., ECCV 2018]

34 of 53

K-means + deep learning

  • Neural Discrete Representation Learning [van den Oord et al., NeurIPS 2017]
    • Algorithm known as VQ-VAE
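The discretization step at the heart of VQ-VAE is a nearest-neighbor lookup into a learned codebook: each encoder output is snapped to its closest codebook vector, and the index becomes the discrete code. A hedged numpy sketch of that step alone (the full algorithm also needs the straight-through gradient estimator and codebook/commitment losses, omitted here):

```python
import numpy as np

def quantize(z, codebook):
    """VQ step: snap each encoder output to its nearest codebook entry
    and return the discrete indices plus the quantized vectors."""
    dists = np.linalg.norm(z[:, None, :] - codebook[None, :, :], axis=-1)
    idx = dists.argmin(axis=1)     # discrete ("one-hot") code per vector
    return idx, codebook[idx]

# Toy codebook of 4 entries in 2-D; the two inputs land on entries 1 and 3.
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
z = np.array([[0.9, 0.1], [1.1, 0.8]])
idx, z_q = quantize(z, codebook)   # idx is the lightweight discrete code
```

The decoder then reconstructs the input from `z_q`, so the codebook indices alone form the compressed, language-like representation of the data.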

35 of 53

Question: what do different algorithms learn?

36 of 53

Contrastive learning

  • Make the representation “invariant” to certain transformations (views): f(T(x)) ≈ f(x)

    • f is the neural network encoder
    • T is a certain transformation (such as image translation)

37 of 53

Data transformation


[Chen et al., A Simple Framework for Contrastive Learning of Visual Representations, ICML 2020]

38 of 53

Contrastive learning

  • Make the representation “invariant” to certain transformations (views): f(T(x)) ≈ f(x)

    • f is the neural network encoder
    • T is a certain transformation (such as image translation)

  • Question: there is a trivial solution, in which f always outputs 0 (a collapsed representation)

39 of 53

Contrastive learning: transformation

  • Solution:
    • Positive pairs (two views of the same image): pull their representations together

    • Negative pairs (views of different images): push their representations apart

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]
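The positive/negative-pair idea is commonly implemented with an InfoNCE-style loss, as in SimCLR's NT-Xent. Below is a simplified one-directional numpy sketch (the real NT-Xent symmetrizes over both views and excludes self-similarities; names here are illustrative):

```python
import numpy as np

def info_nce(z1, z2, tau=0.5):
    """InfoNCE-style loss: z1[i] and z2[i] are embeddings of two
    transformed views of image i (a positive pair); every other
    pairing in the batch serves as a negative pair."""
    # L2-normalize so the dot product is cosine similarity.
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    sim = (z1 @ z2.T) / tau                  # temperature-scaled similarities
    # Cross-entropy where the "correct match" for row i is column i.
    return np.mean(np.log(np.exp(sim).sum(axis=1)) - np.diag(sim))

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
loss_aligned = info_nce(z, z)                        # perfect positive pairs
loss_random = info_nce(z, rng.normal(size=(8, 16)))  # unrelated "views"
```

Because the loss also pushes apart the off-diagonal (negative) pairs, the trivial constant-output solution no longer minimizes it, which is exactly why the negatives are needed.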

40 of 53

Contrastive learning: co-occurrence

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]

41 of 53

Contrastive learning

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]

42 of 53

Contrastive learning: exemplar losses

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]

43 of 53


[Google AI blog, Advancing Self-Supervised and Semi-Supervised Learning with SimCLR]

44 of 53

Potential final project

  • Design your transformation for invariance

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]

See Foundations of Computer Vision, Section 30.10.2, Figure 30.13

45 of 53

Research example

46 of 53

Rethinking pre-training for object detectors

Goal: Bridge the input point cloud and output bounding boxes

Our idea: Help the backbone cluster points into object instances or parts

[Figure: a feature backbone followed by a detection head]

47 of 53

Our proposal: leverage color information

Motivation: each object instance or part often possesses a coherent color and has a sharp contrast to the background

Hypothesis: Learning to colorize LiDAR point clouds would equip the backbone with the semantic cues to segment points

[Figures: region growing [Preetha et al., 2012] and superpixels [Liu et al., 2011]]

48 of 53

Our solution: “grounded” point colorization (GPC)

Provide colors of a subset of points as hints


50 of 53

GPC Pre-training + fine-tuning

[Figure: N input points pass through the backbone; the resulting features are concatenated with color hints and fed to a color decoder]

Output embeddings indicate which points should be colored/segmented together

Pre-training = Point-wise color regression problem

51 of 53

GPC Pre-training + fine-tuning

[Figure: the same colorization pipeline, with the pre-trained backbone now also feeding a detection head]

52 of 53

GPC Pre-training + fine-tuning

[Figure: during pre-training, the backbone feeds the color decoder (with concatenated hints); during fine-tuning, the decoder is replaced by a detection head]

Pre-Training LiDAR-Based 3D Object Detectors Through Colorization, ICLR 2024

53 of 53

What does GPC really learn?

Predict the exact color of each point

Predict which points possess the same color and are segmented together
