Join at slido.com #3715895


Course Evaluation

The final course survey is intended to help us improve future offerings of the course.

If you complete our internal anonymous survey AND the official Course Evaluation (http://course-evaluations.berkeley.edu), you will receive 0.5 extra-credit percentage points on your grade. If 90% of the class also submits both, then everyone who submitted both will receive an additional 0.5 percentage points on their final grade.

Internal course survey link: https://bit.ly/cs189-eval


Self-Supervised Learning

Lecture 24

Data understanding using self-supervised representation learning

EECS 189/289, Fall 2025 @ UC Berkeley

Joseph E. Gonzalez and Narges Norouzi


Roadmap

  • Self-Supervised Learning
  • Transfer Learning
  • Self-Supervised Learning Approaches
  • Generative
    • Autoencoders
    • Colorization
    • Cross-Channel Prediction
    • Context-Encoders (Inpainting)
    • Image Super-Resolution
  • Discriminative
    • Rotation
    • Jigsaw
    • Clustering
    • Contrastive Learning


Self-Supervised Learning

  • Self-Supervised Learning
  • Transfer Learning
  • Self-Supervised Learning Approaches
  • Generative
    • Autoencoders
    • Colorization
    • Cross-Channel Prediction
    • Context-Encoders (Inpainting)
    • Image Super-Resolution
  • Discriminative
    • Rotation
    • Jigsaw
    • Clustering
    • Contrastive Learning


Supervised vs. Unsupervised Learning

  • Supervised learning: learning with labeled data

    • Approach: Collect a large dataset, label the data, train a model, and deploy.
    • Costly
  • Unsupervised learning: learning with unlabeled data
    • Approach: Discover patterns in the data via clustering similar instances, density estimation, or dimensionality reduction, etc.

Let's build methods that learn from "raw" data with no annotations required.

None of these represent how humans learn.


Self-Supervised Learning

  • Unsupervised Learning: Model isn’t told what to predict.

  • Self-Supervised Learning: Representation learning with unlabeled data
    • Learn useful feature representations from unlabeled data through pre-text tasks.
    • The term “self-supervised” means the model creates its own supervision signal (i.e., no human-provided labels are required).


Self-Supervised Learning Example

  • Pre-text task: Train a model to predict the rotation degree of rotated images of cats and dogs (millions of images can be collected; no labeling is required).

  • Downstream task: Use transfer learning and fine-tune the learned model from the pre-text task for classification of cat vs. dog with very few labeled examples.


How? Transfer Learning

  • Self-Supervised Learning
  • Transfer Learning
  • Self-Supervised Learning Approaches
  • Generative
    • Autoencoders
    • Colorization
    • Cross-Channel Prediction
    • Context-Encoders (Inpainting)
    • Image Super-Resolution
  • Discriminative
    • Rotation
    • Jigsaw
    • Clustering
    • Contrastive Learning


Transfer Learning

  • A shortcut in training neural networks for recognition tasks.
  • The idea is to start with an off-the-shelf image recognition network that comes with fully trained weights.
  • We can re-purpose the trained network for our particular recognition task.
  • What the network learned in its early layers serves as useful features for recognizing many kinds of objects in images.
  • Most of the well-known models (with or without pre-trained weights) exist on TorchVision: https://docs.pytorch.org/vision/main/models.html


Transfer Learning with CNN

Cat

Trained feature extractor

Linear classifier


Transfer Learning with CNN

  • What do we do for a new image classification problem?
  • Key idea:
    • Freeze parameters in feature extractor
    • Re-train classifier

Trained feature extractor

Linear classifier
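The freeze-and-retrain recipe above can be sketched in PyTorch. This is a minimal sketch: the tiny convolutional backbone is a stand-in for a real pre-trained network (in practice you would load one of the TorchVision models linked earlier, e.g. `torchvision.models.resnet18` with ImageNet weights).

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a pre-trained feature extractor.
feature_extractor = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)

# Freeze the pre-trained feature extractor.
for p in feature_extractor.parameters():
    p.requires_grad = False

# New linear classifier head for the downstream 2-class task (cat vs. dog).
classifier = nn.Linear(8, 2)

model = nn.Sequential(feature_extractor, classifier)

# Only the classifier's parameters are trainable.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-3)
```

Training then proceeds as usual with cross-entropy loss; only the head's weights change.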


Given a small labeled dataset for a new classification task, what is the most effective initial approach to leverage a pretrained CNN?


Transfer Learning with CNN


Fine-tuning

Cat


Fine-tuning

Bakery

Initialize with pre-trained, then train with low learning rate
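Fine-tuning is often done with per-layer learning rates: a low rate for the pre-trained backbone and a higher rate for the freshly initialized head. A sketch (module names are illustrative; the tiny backbone is a stand-in for a pre-trained network):

```python
import torch
import torch.nn as nn

# Illustrative stand-ins for a pre-trained backbone and a new head.
backbone = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten())
head = nn.Linear(8, 10)  # e.g., a new 10-class task

# Low learning rate for pre-trained weights, higher for the new head.
optimizer = torch.optim.SGD(
    [
        {"params": backbone.parameters(), "lr": 1e-4},
        {"params": head.parameters(), "lr": 1e-2},
    ],
    lr=1e-3,       # default, overridden by the per-group values above
    momentum=0.9,
)
```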


Why Pretext Learning Is Important

Figure credit: Goodfellow


Self-Supervised Learning Challenges

  • Challenges for self-supervised learning
    • How to select a suitable pre-text task for an application?
    • There is no gold standard for comparison of learned feature representations.
    • Selecting a suitable loss function, since there is no single objective like test-set accuracy in supervised learning.


Self-Supervised Learning Approaches

  • Self-Supervised Learning
  • Transfer Learning
  • Self-Supervised Learning Approaches
  • Generative
    • Autoencoders
    • Colorization
    • Cross-Channel Prediction
    • Context-Encoders (Inpainting)
    • Image Super-Resolution
  • Discriminative
    • Rotation
    • Jigsaw
    • Clustering
    • Contrastive Learning


Approaches

Generative: Predict part of the input signal

  • Autoencoders
  • GANs
    • Super-resolution
  • Colorization
  • Inpainting

Discriminative: Predict something about the input signal

  • Context Prediction
  • Rotation
  • Jigsaw
  • Clustering
  • Contrastive

Multimodal: Use some additional signal in addition to RGB images.

  • Video
  • 3D
  • Sound
  • Language


Generative: Autoencoders

  • Self-Supervised Learning
  • Transfer Learning
  • Self-Supervised Learning Approaches
  • Generative
    • Autoencoders
    • Colorization
    • Cross-Channel Prediction
    • Context-Encoders (Inpainting)
    • Image Super-Resolution
  • Discriminative
    • Rotation
    • Jigsaw
    • Clustering
    • Contrastive Learning


Autoencoders

  • Autoencoders are designed to reproduce their input, especially for images.
    • The key point is to reproduce the input from a learned, compressed encoding.

Encoder

Decoder

Code/Feature
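A minimal autoencoder in PyTorch illustrating the encoder → code → decoder structure (a sketch; the dimensions here are arbitrary, e.g. 784 for flattened 28×28 images):

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, in_dim=784, code_dim=32):
        super().__init__()
        # Encoder compresses the input to a low-dimensional code.
        self.encoder = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                     nn.Linear(128, code_dim))
        # Decoder reconstructs the input from the code.
        self.decoder = nn.Sequential(nn.Linear(code_dim, 128), nn.ReLU(),
                                     nn.Linear(128, in_dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder()
x = torch.randn(16, 784)
loss = nn.functional.mse_loss(model(x), x)  # reconstruction error
```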


Autoencoder Idea

  • Learn an encoder f and a decoder g so that the reconstruction g(f(x)) matches the input x.
    • Objective: minimize the reconstruction error L(x) = ‖x − g(f(x))‖²
    • The code z = f(x) is a compressed representation of the input (the information bottleneck).


Convolutional Autoencoder

Information Bottleneck


Denoising Autoencoders

  • We try to undo the effect of a corruption process stochastically applied to the input.
  • We still aim to encode the input, NOT to mimic the identity function.

Encoder

Decoder

Latent space representation

Denoised Input

Noisy Input
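The corrupt-then-reconstruct objective can be sketched as follows (Gaussian noise is one common corruption; masking random pixels is another; the tiny linear network is a stand-in for a real autoencoder):

```python
import torch
import torch.nn as nn

# Any autoencoder works; a tiny linear one for illustration.
dae = nn.Sequential(nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 784))

x = torch.rand(16, 784)                  # clean inputs
x_noisy = x + 0.1 * torch.randn_like(x)  # stochastic corruption

# Key point: reconstruct the CLEAN input from the corrupted one,
# so the network cannot simply learn the identity function.
loss = nn.functional.mse_loss(dae(x_noisy), x)
loss.backward()
```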


Denoising Autoencoders

  • Corrupt the input: x̃ = x + noise (e.g., Gaussian noise, or randomly masked pixels).
  • Train the autoencoder to reconstruct the clean input from the corrupted one: minimize ‖x − g(f(x̃))‖².

What are valid training objectives for a denoising autoencoder?


Denoising Autoencoders - Process

 

Apply Noise

 


Denoising Autoencoders - Process

 

Encode And Decode

DAE

 

DAE


Denoising Autoencoders - Process

 

DAE

 

DAE


Denoising Autoencoders - Process

 

Compare

 


Demo


Generative: Colorization

  • Self-Supervised Learning
  • Transfer Learning
  • Self-Supervised Learning Approaches
  • Generative
    • Autoencoders
    • Colorization
    • Cross-Channel Prediction
    • Context-Encoders (Inpainting)
    • Image Super-Resolution
  • Discriminative
    • Rotation
    • Jigsaw
    • Clustering
    • Contrastive Learning


Image Colorization

  • Image colorization
  • Training data: Pairs of color and grayscale images.
  • Pretext task: Predict the colors of the objects in grayscale images.
    • The model needs to understand the objects in images and paint them with a suitable color.
    • Right image: Learn that the sky is blue, cloud is white, and mountain is green.

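The colorization pretext task can be sketched as a per-pixel regression from grayscale to color. This is a minimal sketch with a toy network; real colorization systems typically predict quantized color classes rather than raw RGB values.

```python
import torch
import torch.nn as nn

# Toy colorization network: grayscale (1 channel) -> color (3 channels).
colorizer = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 3, 3, padding=1), nn.Sigmoid(),
)

rgb = torch.rand(8, 3, 32, 32)        # unlabeled color images come for free
gray = rgb.mean(dim=1, keepdim=True)  # synthesize the grayscale input

# Pretext objective: predict the colors back from the grayscale version.
loss = nn.functional.mse_loss(colorizer(gray), rgb)
```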


Generative: Cross-Channel Prediction

  • Self-Supervised Learning
  • Transfer Learning
  • Self-Supervised Learning Approaches
  • Generative
    • Autoencoders
    • Colorization
    • Cross-Channel Prediction
    • Context-Encoders (Inpainting)
    • Image Super-Resolution
  • Discriminative
    • Rotation
    • Jigsaw
    • Clustering
    • Contrastive Learning


Cross-Channel Prediction

  • Split-brain autoencoder or cross-channel prediction
  • Training data: Remove some of the image color channels.
  • Pretext task: Predict the missing channel from the other image channels.


Cross-Channel Prediction

  • The input image is split into grayscale and color channels.
    • Two encoder-decoders are trained: F1 predicts the color channels from the gray channel, and F2 predicts the gray channel from the color channels.
    • The two predicted images are combined to reconstruct the original image.
      • A loss function (e.g., cross-entropy) is used to minimize the reconstruction loss.



Generative: Context-Encoders (Inpainting)

  • Self-Supervised Learning
  • Transfer Learning
  • Self-Supervised Learning Approaches
  • Generative
    • Autoencoders
    • Colorization
    • Cross-Channel Prediction
    • Context-Encoders (Inpainting)
    • Image Super-Resolution
  • Discriminative
    • Rotation
    • Jigsaw
    • Clustering
    • Contrastive Learning


Context Encoders

  • Predict missing pieces, aka context encoders, or inpainting
  • Training data: Remove a random region in images.
  • Pretext task: Fill in a missing piece in the image.
    • The model needs to understand the content of the entire image and produce a plausible replacement for the missing piece.
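Generating training pairs for inpainting just requires masking a random region. A sketch (the original context-encoder work additionally adds an adversarial loss on top of the reconstruction loss):

```python
import torch

def mask_random_block(images, size=8):
    """Zero out a random square region; return masked images and the mask."""
    x = images.clone()
    n, _, h, w = x.shape
    mask = torch.zeros(n, 1, h, w)
    for i in range(n):
        top = torch.randint(0, h - size + 1, (1,)).item()
        left = torch.randint(0, w - size + 1, (1,)).item()
        x[i, :, top:top + size, left:left + size] = 0.0
        mask[i, :, top:top + size, left:left + size] = 1.0
    return x, mask

images = torch.rand(4, 3, 32, 32)
masked, mask = mask_random_block(images)
# A generator would be trained to reconstruct `images` from `masked`,
# with the loss typically weighted toward the masked region (`mask`).
```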



Context Encoders

  • Additional examples comparing the loss functions used.
    • The joint (reconstruction + adversarial) loss produces the most realistic images.


Context Encoders

  • The removed regions can have different shapes.
  • Random-region and random-block masks produced better features than the central-region mask.


Context Encoders

  • Evaluation on PASCAL VOC for several downstream tasks.
    • The features learned by the context encoder are not as good as supervised features, but they are comparable to other unsupervised methods and outperform randomly initialized models.
      • E.g., over 10% improvement for segmentation over random initialization


Generative: Image Super Resolution

  • Self-Supervised Learning
  • Transfer Learning
  • Self-Supervised Learning Approaches
  • Generative
    • Autoencoders
    • Colorization
    • Cross-Channel Prediction
    • Context-Encoders (Inpainting)
    • Image Super-Resolution
  • Discriminative
    • Rotation
    • Jigsaw
    • Clustering
    • Contrastive Learning


Image Super-Resolution

  • Image Super-Resolution
  • Training data: Pairs of regular and down-sampled low-resolution images.
  • Pretext task: Predict a high-resolution image that corresponds to a down-sampled low-resolution image.
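Low-resolution training inputs are free to generate by down-sampling the originals. A sketch using bicubic interpolation (4× down-sampling is an assumption for illustration):

```python
import torch
import torch.nn.functional as F

high_res = torch.rand(4, 3, 64, 64)  # original images

# Down-sample 4x to create the paired low-resolution inputs.
low_res = F.interpolate(high_res, scale_factor=0.25, mode="bicubic",
                        align_corners=False)

# A super-resolution model is then trained to map low_res back to high_res.
```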


Image Super-Resolution

  • SRGAN (Super-Resolution GAN) is a variant of GAN for producing super-resolution images
    • The generator takes a low-resolution image and outputs a high-resolution image using a fully convolutional network.
    • The generator is trained with a perceptual loss that combines an adversarial loss and a content loss; the discriminator learns to distinguish real high-resolution images from generated (fake) ones.
      • The content loss compares deep features (e.g., from a pre-trained VGG network) of the actual and predicted images.


Discriminative: Rotation

  • Self-Supervised Learning
  • Transfer Learning
  • Self-Supervised Learning Approaches
  • Generative
    • Autoencoders
    • Colorization
    • Cross-Channel Prediction
    • Context-Encoders (Inpainting)
    • Image Super-Resolution
  • Discriminative
    • Rotation
    • Jigsaw
    • Clustering
    • Contrastive Learning


Image Rotation

  • Geometric transformation recognition
  • Training data: Images rotated by a multiple of 90° at random.
    • This corresponds to four rotated images at 0°, 90°, 180°, and 270°.
  • Pretext task: Train a model to predict the rotation degree that was applied
    • Therefore, it is a 4-class classification problem
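Rotation labels come for free via `torch.rot90`. A minimal sketch of constructing the 4-class pretext dataset:

```python
import torch

def make_rotation_batch(images):
    """Rotate each image by 0/90/180/270 degrees; labels are 0..3."""
    rotated, labels = [], []
    for k in range(4):  # k quarter-turns
        rotated.append(torch.rot90(images, k, dims=(2, 3)))
        labels.append(torch.full((images.shape[0],), k, dtype=torch.long))
    return torch.cat(rotated), torch.cat(labels)

images = torch.rand(8, 3, 32, 32)
x, y = make_rotation_batch(images)
# A ConvNet is then trained with 4-way cross-entropy on (x, y).
```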


Image Rotation

  • A single ConvNet model is used to predict one of the four rotations.
    • The model needs to understand the location and type of the objects in images to understand the rotation degree.


Image Rotation - Generalizability


ImageNet Top-1 Classification Results


With non-linear layers


Task Generalization


With linear layers


Relative Patch Position

  • Relative patch position for context prediction
  • Training data: Multiple patches extracted from images.
  • Pretext task: Train a model to predict the relationship between the patches.
    • E.g., predict the relative position of the selected patch below (i.e., position # 7)
      • For the center patch, there are 8 possible neighbor patches (8 possible classes)


Relative Patch Position

  • The patches are fed into two ConvNets with shared weights.
    • The features learned by the ConvNets are concatenated.
    • Classification is performed over 8 classes (8 possible neighbor positions).
  • The model needs to understand the spatial context of images to predict the relative positions of the patches.
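A sketch of the shared-weight two-patch architecture (the small feature extractor is a stand-in; the original work uses an AlexNet-style network):

```python
import torch
import torch.nn as nn

class RelativePatchNet(nn.Module):
    def __init__(self):
        super().__init__()
        # One ConvNet, applied to both patches (shared weights).
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Concatenated features -> 8-way classification (8 neighbor positions).
        self.classifier = nn.Linear(2 * 16, 8)

    def forward(self, center, neighbor):
        f1 = self.features(center)
        f2 = self.features(neighbor)
        return self.classifier(torch.cat([f1, f2], dim=1))

net = RelativePatchNet()
logits = net(torch.rand(4, 3, 16, 16), torch.rand(4, 3, 16, 16))
```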



Discriminative: Jigsaw

  • Self-Supervised Learning
  • Transfer Learning
  • Self-Supervised Learning Approaches
  • Generative
    • Autoencoders
    • Colorization
    • Cross-Channel Prediction
    • Context-Encoders (Inpainting)
    • Image Super-Resolution
  • Discriminative
    • Rotation
    • Jigsaw
    • Clustering
    • Contrastive Learning



Image Jigsaw Puzzle

  • A ConvNet model passes the individual patches through the same Conv layers that have shared weights.
  • The features are combined and passed through fully-connected layers.
  • Output is the permutation, i.e., the positions of the patches.
    • The patches are shuffled according to one of 64 predefined permutations.
      • For 9 patches there are 9! = 362,880 possible puzzles in total.
      • The authors used a small set of 64 permutations chosen to have maximal Hamming distance from one another.
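Selecting a maximally diverse subset of permutations can be done greedily by Hamming distance. A sketch of the selection idea at toy scale (4 patches and 6 permutations instead of 9 and 64):

```python
import itertools

def hamming(p, q):
    """Number of positions where two permutations differ."""
    return sum(a != b for a, b in zip(p, q))

def select_permutations(n_patches, n_select):
    """Greedily pick permutations maximizing the minimum Hamming distance."""
    all_perms = list(itertools.permutations(range(n_patches)))
    chosen = [all_perms[0]]
    while len(chosen) < n_select:
        # Pick the permutation farthest (in min-distance) from those chosen.
        best = max(all_perms,
                   key=lambda p: min(hamming(p, c) for c in chosen))
        chosen.append(best)
    return chosen

perms = select_permutations(n_patches=4, n_select=6)
```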


Discriminative: Clustering

  • Self-Supervised Learning
  • Transfer Learning
  • Self-Supervised Learning Approaches
  • Generative
    • Autoencoders
    • Colorization
    • Cross-Channel Prediction
    • Context-Encoders (Inpainting)
    • Image Super-Resolution
  • Discriminative
    • Rotation
    • Jigsaw
    • Clustering
    • Contrastive Learning

3715895

67 of 76

Deep Clustering

  • Deep clustering of images
  • Training data: Clusters of images based on the content
    • E.g., clusters on mountains, temples, etc.
  • Pretext task: Predict the cluster to which an image belongs


Deep Clustering

  • This self-supervised architecture is called deep clustering (DeepCluster).
    • The model treats each cluster as a separate class.
    • The output is the cluster label assigned to the input image.
    • The authors used k-means for clustering.
  • The model needs to learn the content of the images to assign them to the corresponding cluster.
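The alternating scheme — cluster the features, then use the cluster indices as pseudo-labels — can be sketched with a tiny k-means step. NumPy only, as a sketch: the real method clusters ConvNet features and then trains the ConvNet to predict the pseudo-labels, repeating both steps.

```python
import numpy as np

def kmeans_pseudo_labels(features, k=3, iters=10, seed=0):
    """One k-means run; returns a pseudo-label (cluster index) per sample."""
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), k, replace=False)]
    for _ in range(iters):
        # Assign each sample to its nearest center.
        d = np.linalg.norm(features[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        # Recompute centers (keep the old center if a cluster is empty).
        for j in range(k):
            if (labels == j).any():
                centers[j] = features[labels == j].mean(axis=0)
    return labels

features = np.random.default_rng(1).normal(size=(100, 16))
pseudo_labels = kmeans_pseudo_labels(features)
# A classifier would now be trained to predict `pseudo_labels` from images.
```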


Contrastive Representation Learning

  • Self-Supervised Learning
  • Transfer Learning
  • Self-Supervised Learning Approaches
  • Generative
    • Autoencoders
    • Colorization
    • Cross-Channel Prediction
    • Context-Encoders (Inpainting)
    • Image Super-Resolution
  • Discriminative
    • Rotation
    • Jigsaw
    • Clustering
    • Contrastive Learning


Contrastive Representation Learning

  • Same object → pull the representations together (positive pairs).
  • Different object → push the representations apart (negative pairs).

Formulation for Contrastive Learning

  • Given an anchor x, a positive sample x⁺ (e.g., an augmented view of x), and negative samples x⁻ (other images):
    • Learn an encoder f and a score function s such that
      s(f(x), f(x⁺)) ≫ s(f(x), f(x⁻))


Contrastive Loss

  • Given 1 positive sample and N − 1 negative samples, the contrastive (InfoNCE) loss is:

    L = −log [ exp(s(x, x⁺)) / ( exp(s(x, x⁺)) + Σ_{j=1}^{N−1} exp(s(x, x⁻_j)) ) ]

  • This is the cross-entropy loss of an N-way softmax classifier that tries to identify the positive sample.
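The loss can be implemented directly as an N-way cross-entropy where class 0 is the positive. A sketch using cosine similarity as the score function (tensor shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE: cross-entropy of an N-way classifier picking the positive.

    anchor, positive: (B, D); negatives: (B, N-1, D).
    """
    a = F.normalize(anchor, dim=-1)
    pos = F.normalize(positive, dim=-1)
    neg = F.normalize(negatives, dim=-1)
    # Cosine-similarity scores: the positive first, then the N-1 negatives.
    s_pos = (a * pos).sum(-1, keepdim=True)         # (B, 1)
    s_neg = torch.einsum("bd,bnd->bn", a, neg)      # (B, N-1)
    logits = torch.cat([s_pos, s_neg], dim=1) / temperature
    target = torch.zeros(len(a), dtype=torch.long)  # positive is class 0
    return F.cross_entropy(logits, target)

loss = contrastive_loss(torch.randn(8, 32), torch.randn(8, 32),
                        torch.randn(8, 15, 32))
```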


SimCLR: A Simple Framework for Contrastive Learning

  • Cosine similarity as the score function
  • Use a projection network g(·) to project features to a space where contrastive learning is applied.
  • Generate positive samples through data augmentation:
    • Random cropping, random color distortion, and random blur.


Momentum Contrastive Learning (MoCo)

Decouples the number of negative samples from the minibatch size: a queue of encoded keys and a momentum-updated key encoder provide many negatives without requiring the very large batches that SimCLR uses.
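The two MoCo ingredients — a momentum-updated key encoder and a queue of negatives — can be sketched as follows (a minimal sketch; the linear encoders and queue size are placeholders):

```python
import torch
import torch.nn as nn

q_enc = nn.Linear(32, 16)          # query encoder (trained by backprop)
k_enc = nn.Linear(32, 16)          # key encoder (momentum copy)
k_enc.load_state_dict(q_enc.state_dict())

queue = torch.randn(4096, 16)      # queue of encoded keys (the negatives)

@torch.no_grad()
def momentum_update(m=0.999):
    # The key encoder slowly tracks the query encoder.
    for pk, pq in zip(k_enc.parameters(), q_enc.parameters()):
        pk.mul_(m).add_(pq, alpha=1 - m)

@torch.no_grad()
def enqueue(keys):
    global queue
    # Newest keys replace the oldest entries (FIFO queue).
    queue = torch.cat([keys, queue])[: len(queue)]

momentum_update()
enqueue(k_enc(torch.randn(64, 32)))
```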


Contrastive Language Image Pre-Training (CLIP)

  • Learns a joint image–text embedding by contrastively matching images with their text captions; the learned representations enable zero-shot image classification.


Self-Supervised Learning

Lecture 24

Reading: Partially covered in Chapter 19 of Bishop's Deep Learning textbook.