Join at slido.com #3715895


Course Evaluation

The final course survey is intended to help us improve future offerings of the course.

If you complete our internal anonymous survey AND the official Course Evaluation (http://course-evaluations.berkeley.edu), you will receive 0.5 extra-credit percentage points on your grade. If 90% of the class also submits both, then everyone who submitted both will receive an additional 0.5 percentage points on their final grade.

Internal course survey link: https://bit.ly/cs189-eval


Self-Supervised Learning

Lecture 24

Data understanding using self-supervised representation learning

EECS 189/289, Fall 2025 @ UC Berkeley

Joseph E. Gonzalez and Narges Norouzi


Roadmap

  • Self-Supervised Learning
  • Transfer Learning
  • Self-Supervised Learning Approaches
  • Generative
    • Autoencoders
    • Colorization
    • Cross-Channel Prediction
    • Context-Encoders (Inpainting)
    • Image Super-Resolution
  • Discriminative
    • Rotation
    • Jigsaw
    • Clustering
    • Contrastive Learning


Self-Supervised Learning

  • Self-Supervised Learning
  • Transfer Learning
  • Self-Supervised Learning Approaches
  • Generative
    • Autoencoders
    • Colorization
    • Cross-Channel Prediction
    • Context-Encoders (Inpainting)
    • Image Super-Resolution
  • Discriminative
    • Rotation
    • Jigsaw
    • Clustering
    • Contrastive Learning


Supervised vs. Unsupervised Learning

  • Supervised learning: learning with labeled data

    • Approach: Collect a large dataset, label the data, train a model, and deploy.
    • Costly
  • Unsupervised learning: learning with unlabeled data
    • Approach: Discover patterns in the data via clustering similar instances, density estimation, or dimensionality reduction, etc.

Let's build methods that learn from "raw" data with no annotations required.

None of these represent how humans learn.


Self-Supervised Learning

  • Unsupervised Learning: Model isn’t told what to predict.

  • Self-Supervised Learning: Representation learning with unlabeled data
    • Learn useful feature representations from unlabeled data through pre-text tasks.
    • The term “self-supervised” means the model creates its own supervision signal (i.e., no human-provided labels are required).


Self-Supervised Learning Example

  • Pre-text task: Train a model to predict the rotation degree of rotated images of cats and dogs (millions of images can be collected; no labeling is required).

  • Downstream task: Use transfer learning and fine-tune the learned model from the pre-text task for classification of cat vs. dog with very few labeled examples.


How? Transfer Learning

  • Self-Supervised Learning
  • Transfer Learning
  • Self-Supervised Learning Approaches
  • Generative
    • Autoencoders
    • Colorization
    • Cross-Channel Prediction
    • Context-Encoders (Inpainting)
    • Image Super-Resolution
  • Discriminative
    • Rotation
    • Jigsaw
    • Clustering
    • Contrastive Learning


Transfer Learning

  • A shortcut in training neural networks for recognition tasks.
  • The idea is to start with an off-the-shelf image recognition network that comes with fully trained weights.
  • We can re-purpose the trained network for our particular recognition task.
  • What the network learned in its early layers serves as useful features for recognizing many kinds of objects in images.
  • Most of the well-known models (with or without pre-trained weights) exist on TorchVision: https://docs.pytorch.org/vision/main/models.html


Transfer Learning with CNN

Cat

Trained feature extractor

Linear classifier


Transfer Learning with CNN

  • What do we do for a new image classification problem?
  • Key idea:
    • Freeze parameters in feature extractor
    • Re-train classifier

Trained feature extractor

Linear classifier
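The freeze-and-retrain recipe above can be sketched in PyTorch. This is a minimal sketch: the tiny convolutional backbone is a stand-in for a real pre-trained network (in practice you would load one of the TorchVision models linked earlier, e.g. `torchvision.models.resnet18` with ImageNet weights).

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a pre-trained feature extractor.
feature_extractor = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)

# Freeze the pre-trained feature extractor.
for p in feature_extractor.parameters():
    p.requires_grad = False

# New linear classifier head for the downstream 2-class task (cat vs. dog).
classifier = nn.Linear(8, 2)

model = nn.Sequential(feature_extractor, classifier)

# Only the classifier's parameters are trainable.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-3)
```

Training then proceeds as usual with cross-entropy loss; only the head's weights change.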


Given a small labeled dataset for a new classification task, what is the most effective initial approach to leverage a pretrained CNN?


Transfer Learning with CNN


Fine-tuning

Cat


Fine-tuning

Bakery

Initialize with pre-trained, then train with low learning rate
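Fine-tuning is often done with per-layer learning rates: a low rate for the pre-trained backbone and a higher rate for the freshly initialized head. A sketch (module names are illustrative; the tiny backbone is a stand-in for a pre-trained network):

```python
import torch
import torch.nn as nn

# Illustrative stand-ins for a pre-trained backbone and a new head.
backbone = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten())
head = nn.Linear(8, 10)  # e.g., a new 10-class task

# Low learning rate for pre-trained weights, higher for the new head.
optimizer = torch.optim.SGD(
    [
        {"params": backbone.parameters(), "lr": 1e-4},
        {"params": head.parameters(), "lr": 1e-2},
    ],
    lr=1e-3,       # default, overridden by the per-group values above
    momentum=0.9,
)
```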


Why Pretext Learning Is Important

Figure credit: Goodfellow


Self-Supervised Learning Challenges

  • Challenges for self-supervised learning
    • How to select a suitable pre-text task for an application?
    • There is no gold standard for comparison of learned feature representations.
    • Selecting a suitable loss function, since there is no single objective like test-set accuracy in supervised learning.


Self-Supervised Learning Approaches

  • Self-Supervised Learning
  • Transfer Learning
  • Self-Supervised Learning Approaches
  • Generative
    • Autoencoders
    • Colorization
    • Cross-Channel Prediction
    • Context-Encoders (Inpainting)
    • Image Super-Resolution
  • Discriminative
    • Rotation
    • Jigsaw
    • Clustering
    • Contrastive Learning


Approaches

Generative: Predict part of the input signal

  • Autoencoders
  • GANs
    • Super-resolution
  • Colorization
  • Inpainting

Discriminative: Predict something about the input signal

  • Context Prediction
  • Rotation
  • Jigsaw
  • Clustering
  • Contrastive

Multimodal: Use some additional signal in addition to RGB images.

  • Video
  • 3D
  • Sound
  • Language


Generative: Autoencoders

  • Self-Supervised Learning
  • Transfer Learning
  • Self-Supervised Learning Approaches
  • Generative
    • Autoencoders
    • Colorization
    • Cross-Channel Prediction
    • Context-Encoders (Inpainting)
    • Image Super-Resolution
  • Discriminative
    • Rotation
    • Jigsaw
    • Clustering
    • Contrastive Learning


Autoencoders

  • Autoencoders are designed to reproduce their input, especially for images.
    • The key point is to reproduce the input from a learned, compressed encoding.

Encoder

Decoder

Code/Feature
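A minimal autoencoder in PyTorch illustrating the encoder → code → decoder structure (a sketch; the dimensions here are arbitrary, e.g. 784 for flattened 28×28 images):

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, in_dim=784, code_dim=32):
        super().__init__()
        # Encoder compresses the input to a low-dimensional code.
        self.encoder = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                     nn.Linear(128, code_dim))
        # Decoder reconstructs the input from the code.
        self.decoder = nn.Sequential(nn.Linear(code_dim, 128), nn.ReLU(),
                                     nn.Linear(128, in_dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder()
x = torch.randn(16, 784)
loss = nn.functional.mse_loss(model(x), x)  # reconstruction error
```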


Autoencoder Idea

  • Learn an encoder f and a decoder g so that the reconstruction g(f(x)) matches the input x.
    • Objective: minimize the reconstruction error L(x) = ‖x − g(f(x))‖²
    • The code z = f(x) is a compressed representation of the input (the information bottleneck).


Convolutional Autoencoder

Information Bottleneck


Denoising Autoencoders

  • We try to undo the effect of a corruption process stochastically applied to the input.
  • We still aim to encode the input, NOT to mimic the identity function.

Encoder

Decoder

Latent space representation

Denoised Input

Noisy Input
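The corrupt-then-reconstruct objective can be sketched as follows (Gaussian noise is one common corruption; masking random pixels is another; the tiny linear network is a stand-in for a real autoencoder):

```python
import torch
import torch.nn as nn

# Any autoencoder works; a tiny linear one for illustration.
dae = nn.Sequential(nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 784))

x = torch.rand(16, 784)                  # clean inputs
x_noisy = x + 0.1 * torch.randn_like(x)  # stochastic corruption

# Key point: reconstruct the CLEAN input from the corrupted one,
# so the network cannot simply learn the identity function.
loss = nn.functional.mse_loss(dae(x_noisy), x)
loss.backward()
```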


Denoising Autoencoders

  • Corrupt the input: x̃ = x + noise (e.g., Gaussian noise, or randomly masked pixels).
  • Train the autoencoder to reconstruct the clean input from the corrupted one: minimize ‖x − g(f(x̃))‖².

What are valid training objectives for a denoising autoencoder?


Denoising Autoencoders - Process

 

Apply Noise

 


Denoising Autoencoders - Process

 

Encode And Decode

DAE

 

DAE


Denoising Autoencoders - Process

 

DAE

 

DAE


Denoising Autoencoders - Process

 

Compare

 


Demo


Generative: Colorization

  • Self-Supervised Learning
  • Transfer Learning
  • Self-Supervised Learning Approaches
  • Generative
    • Autoencoders
    • Colorization
    • Cross-Channel Prediction
    • Context-Encoders (Inpainting)
    • Image Super-Resolution
  • Discriminative
    • Rotation
    • Jigsaw
    • Clustering
    • Contrastive Learning


Image Colorization

  • Image colorization
  • Training data: Pairs of color and grayscale images.
  • Pretext task: Predict the colors of the objects in grayscale images.
    • The model needs to understand the objects in images and paint them with a suitable color.
    • Right image: Learn that the sky is blue, cloud is white, and mountain is green.

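The colorization pretext task can be sketched as a per-pixel regression from grayscale to color. This is a minimal sketch with a toy network; real colorization systems typically predict quantized color classes rather than raw RGB values.

```python
import torch
import torch.nn as nn

# Toy colorization network: grayscale (1 channel) -> color (3 channels).
colorizer = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 3, 3, padding=1), nn.Sigmoid(),
)

rgb = torch.rand(8, 3, 32, 32)        # unlabeled color images come for free
gray = rgb.mean(dim=1, keepdim=True)  # synthesize the grayscale input

# Pretext objective: predict the colors back from the grayscale version.
loss = nn.functional.mse_loss(colorizer(gray), rgb)
```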


Generative: Cross-Channel Prediction

  • Self-Supervised Learning
  • Transfer Learning
  • Self-Supervised Learning Approaches
  • Generative
    • Autoencoders
    • Colorization
    • Cross-Channel Prediction
    • Context-Encoders (Inpainting)
    • Image Super-Resolution
  • Discriminative
    • Rotation
    • Jigsaw
    • Clustering
    • Contrastive Learning


Cross-Channel Prediction

  • Split-brain autoencoder or cross-channel prediction
  • Training data: Remove some of the image color channels.
  • Pretext task: Predict the missing channel from the other image channels.


Cross-Channel Prediction

  • The input image is split into grayscale and color channels.
    • Two encoder-decoders are trained: F1 predicts the color channels from the gray channel, and F2 predicts the gray channel from the color channels.
    • The two predicted images are combined to reconstruct the original image.
      • A loss function (e.g., cross-entropy) is used to minimize the reconstruction loss.



Generative: Context-Encoders (Inpainting)

  • Self-Supervised Learning
  • Transfer Learning
  • Self-Supervised Learning Approaches
  • Generative
    • Autoencoders
    • Colorization
    • Cross-Channel Prediction
    • Context-Encoders (Inpainting)
    • Image Super-Resolution
  • Discriminative
    • Rotation
    • Jigsaw
    • Clustering
    • Contrastive Learning


Context Encoders

  • Predict missing pieces, aka context encoders, or inpainting
  • Training data: Remove a random region in images.
  • Pretext task: Fill in a missing piece in the image.
    • The model needs to understand the content of the entire image and produce a plausible replacement for the missing piece.
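Generating training pairs for inpainting just requires masking a random region. A sketch (the original context-encoder work additionally adds an adversarial loss on top of the reconstruction loss):

```python
import torch

def mask_random_block(images, size=8):
    """Zero out a random square region; return masked images and the mask."""
    x = images.clone()
    n, _, h, w = x.shape
    mask = torch.zeros(n, 1, h, w)
    for i in range(n):
        top = torch.randint(0, h - size + 1, (1,)).item()
        left = torch.randint(0, w - size + 1, (1,)).item()
        x[i, :, top:top + size, left:left + size] = 0.0
        mask[i, :, top:top + size, left:left + size] = 1.0
    return x, mask

images = torch.rand(4, 3, 32, 32)
masked, mask = mask_random_block(images)
# A generator would be trained to reconstruct `images` from `masked`,
# with the loss typically weighted toward the masked region (`mask`).
```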



Context Encoders

  • Additional examples comparing the loss functions used.
    • The joint (reconstruction + adversarial) loss produces the most realistic images.


Context Encoders

  • The removed regions can have different shapes.
  • Random-region and random-block masks produced better features than the central-region mask.


Context Encoders

  • Evaluation on PASCAL VOC for several downstream tasks.
    • The features learned by the context encoder are not as good as supervised features, but they are comparable to other unsupervised methods and outperform randomly initialized models.
      • E.g., over 10% improvement for segmentation over random initialization


Generative: Image Super Resolution

  • Self-Supervised Learning
  • Transfer Learning
  • Self-Supervised Learning Approaches
  • Generative
    • Autoencoders
    • Colorization
    • Cross-Channel Prediction
    • Context-Encoders (Inpainting)
    • Image Super-Resolution
  • Discriminative
    • Rotation
    • Jigsaw
    • Clustering
    • Contrastive Learning


Image Super-Resolution

  • Image Super-Resolution
  • Training data: Pairs of regular and down-sampled low-resolution images.
  • Pretext task: Predict a high-resolution image that corresponds to a down-sampled low-resolution image.
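Low-resolution training inputs are free to generate by down-sampling the originals. A sketch using bicubic interpolation (4× down-sampling is an assumption for illustration):

```python
import torch
import torch.nn.functional as F

high_res = torch.rand(4, 3, 64, 64)  # original images

# Down-sample 4x to create the paired low-resolution inputs.
low_res = F.interpolate(high_res, scale_factor=0.25, mode="bicubic",
                        align_corners=False)

# A super-resolution model is then trained to map low_res back to high_res.
```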


Image Super-Resolution

  • SRGAN (Super-Resolution GAN) is a variant of GAN for producing super-resolution images
    • The generator takes a low-resolution image and outputs a high-resolution image using a fully convolutional network.
    • The generator is trained with a perceptual loss that combines an adversarial loss and a content loss; the discriminator learns to distinguish real high-resolution images from generated (fake) ones.
      • The content loss compares deep features (e.g., from a pre-trained VGG network) of the actual and predicted images.


Discriminative: Rotation

  • Self-Supervised Learning
  • Transfer Learning
  • Self-Supervised Learning Approaches
  • Generative
    • Autoencoders
    • Colorization
    • Cross-Channel Prediction
    • Context-Encoders (Inpainting)
    • Image Super-Resolution
  • Discriminative
    • Rotation
    • Jigsaw
    • Clustering
    • Contrastive Learning


Image Rotation

  • Geometric transformation recognition
  • Training data: Images rotated by a multiple of 90° at random.
    • This corresponds to four rotated images at 0°, 90°, 180°, and 270°.
  • Pretext task: Train a model to predict the rotation degree that was applied
    • Therefore, it is a 4-class classification problem
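Rotation labels come for free via `torch.rot90`. A minimal sketch of constructing the 4-class pretext dataset:

```python
import torch

def make_rotation_batch(images):
    """Rotate each image by 0/90/180/270 degrees; labels are 0..3."""
    rotated, labels = [], []
    for k in range(4):  # k quarter-turns
        rotated.append(torch.rot90(images, k, dims=(2, 3)))
        labels.append(torch.full((images.shape[0],), k, dtype=torch.long))
    return torch.cat(rotated), torch.cat(labels)

images = torch.rand(8, 3, 32, 32)
x, y = make_rotation_batch(images)
# A ConvNet is then trained with 4-way cross-entropy on (x, y).
```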


Image Rotation

  • A single ConvNet model is used to predict one of the four rotations.
    • The model needs to understand the location and type of the objects in images to understand the rotation degree.


Image Rotation - Generalizability


ImageNet Top-1 Classification Results


With non-linear layers


Task Generalization


With linear layers


Relative Patch Position

  • Relative patch position for context prediction
  • Training data: Multiple patches extracted from images.
  • Pretext task: Train a model to predict the relationship between the patches.
    • E.g., predict the relative position of the selected patch below (i.e., position # 7)
      • For the center patch, there are 8 possible neighbor patches (8 possible classes)


Relative Patch Position

  • The patches are fed into two ConvNets with shared weights.
    • The features learned by the ConvNets are concatenated.
    • Classification is performed over 8 classes (8 possible neighbor positions).
  • The model needs to understand the spatial context of images to predict the relative positions of the patches.
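A sketch of the shared-weight two-patch architecture (the small feature extractor is a stand-in; the original work uses an AlexNet-style network):

```python
import torch
import torch.nn as nn

class RelativePatchNet(nn.Module):
    def __init__(self):
        super().__init__()
        # One ConvNet, applied to both patches (shared weights).
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Concatenated features -> 8-way classification (8 neighbor positions).
        self.classifier = nn.Linear(2 * 16, 8)

    def forward(self, center, neighbor):
        f1 = self.features(center)
        f2 = self.features(neighbor)
        return self.classifier(torch.cat([f1, f2], dim=1))

net = RelativePatchNet()
logits = net(torch.rand(4, 3, 16, 16), torch.rand(4, 3, 16, 16))
```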



Discriminative: Jigsaw

  • Self-Supervised Learning
  • Transfer Learning
  • Self-Supervised Learning Approaches
  • Generative
    • Autoencoders
    • Colorization
    • Cross-Channel Prediction
    • Context-Encoders (Inpainting)
    • Image Super-Resolution
  • Discriminative
    • Rotation
    • Jigsaw
    • Clustering
    • Contrastive Learning



Image Jigsaw Puzzle

  • A ConvNet model passes the individual patches through the same Conv layers that have shared weights.
  • The features are combined and passed through fully-connected layers.
  • Output is the permutation, i.e., the positions of the patches.
    • The patches are shuffled according to one of 64 predefined permutations.
      • For 9 patches there are 9! = 362,880 possible puzzles in total.
      • The authors used a small set of 64 permutations chosen to have maximal Hamming distance from one another.
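Selecting a maximally diverse subset of permutations can be done greedily by Hamming distance. A sketch of the selection idea at toy scale (4 patches and 6 permutations instead of 9 and 64):

```python
import itertools

def hamming(p, q):
    """Number of positions where two permutations differ."""
    return sum(a != b for a, b in zip(p, q))

def select_permutations(n_patches, n_select):
    """Greedily pick permutations maximizing the minimum Hamming distance."""
    all_perms = list(itertools.permutations(range(n_patches)))
    chosen = [all_perms[0]]
    while len(chosen) < n_select:
        # Pick the permutation farthest (in min-distance) from those chosen.
        best = max(all_perms,
                   key=lambda p: min(hamming(p, c) for c in chosen))
        chosen.append(best)
    return chosen

perms = select_permutations(n_patches=4, n_select=6)
```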


Discriminative: Clustering

  • Self-Supervised Learning
  • Transfer Learning
  • Self-Supervised Learning Approaches
  • Generative
    • Autoencoders
    • Colorization
    • Cross-Channel Prediction
    • Context-Encoders (Inpainting)
    • Image Super-Resolution
  • Discriminative
    • Rotation
    • Jigsaw
    • Clustering
    • Contrastive Learning

3715895

67 of 76

Deep Clustering

  • Deep clustering of images
  • Training data: Clusters of images based on the content
    • E.g., clusters on mountains, temples, etc.
  • Pretext task: Predict the cluster to which an image belongs


Deep Clustering

  • This self-supervised architecture is called deep clustering (DeepCluster).
    • The model treats each cluster as a separate class.
    • The output is the cluster label assigned to the input image.
    • The authors used k-means for clustering.
  • The model needs to learn the content of the images to assign them to the corresponding cluster.
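The alternating scheme — cluster the features, then use the cluster indices as pseudo-labels — can be sketched with a tiny k-means step. NumPy only, as a sketch: the real method clusters ConvNet features and then trains the ConvNet to predict the pseudo-labels, repeating both steps.

```python
import numpy as np

def kmeans_pseudo_labels(features, k=3, iters=10, seed=0):
    """One k-means run; returns a pseudo-label (cluster index) per sample."""
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), k, replace=False)]
    for _ in range(iters):
        # Assign each sample to its nearest center.
        d = np.linalg.norm(features[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        # Recompute centers (keep the old center if a cluster is empty).
        for j in range(k):
            if (labels == j).any():
                centers[j] = features[labels == j].mean(axis=0)
    return labels

features = np.random.default_rng(1).normal(size=(100, 16))
pseudo_labels = kmeans_pseudo_labels(features)
# A classifier would now be trained to predict `pseudo_labels` from images.
```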


Contrastive Representation Learning

  • Self-Supervised Learning
  • Transfer Learning
  • Self-Supervised Learning Approaches
  • Generative
    • Autoencoders
    • Colorization
    • Cross-Channel Prediction
    • Context-Encoders (Inpainting)
    • Image Super-Resolution
  • Discriminative
    • Rotation
    • Jigsaw
    • Clustering
    • Contrastive Learning


Contrastive Representation Learning

  • Same object → pull the representations together (positive pairs).
  • Different object → push the representations apart (negative pairs).

Formulation for Contrastive Learning

  • Given an anchor x, a positive sample x⁺ (e.g., an augmented view of x), and negative samples x⁻ (other images):
    • Learn an encoder f and a score function s such that
      s(f(x), f(x⁺)) ≫ s(f(x), f(x⁻))


Contrastive Loss

  • Given 1 positive sample and N − 1 negative samples, the contrastive (InfoNCE) loss is:

    L = −log [ exp(s(x, x⁺)) / ( exp(s(x, x⁺)) + Σ_{j=1}^{N−1} exp(s(x, x⁻_j)) ) ]

  • This is the cross-entropy loss of an N-way softmax classifier that tries to identify the positive sample.
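The loss can be implemented directly as an N-way cross-entropy where class 0 is the positive. A sketch using cosine similarity as the score function (tensor shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE: cross-entropy of an N-way classifier picking the positive.

    anchor, positive: (B, D); negatives: (B, N-1, D).
    """
    a = F.normalize(anchor, dim=-1)
    pos = F.normalize(positive, dim=-1)
    neg = F.normalize(negatives, dim=-1)
    # Cosine-similarity scores: the positive first, then the N-1 negatives.
    s_pos = (a * pos).sum(-1, keepdim=True)         # (B, 1)
    s_neg = torch.einsum("bd,bnd->bn", a, neg)      # (B, N-1)
    logits = torch.cat([s_pos, s_neg], dim=1) / temperature
    target = torch.zeros(len(a), dtype=torch.long)  # positive is class 0
    return F.cross_entropy(logits, target)

loss = contrastive_loss(torch.randn(8, 32), torch.randn(8, 32),
                        torch.randn(8, 15, 32))
```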


SimCLR: A Simple Framework for Contrastive Learning

  • Cosine similarity as the score function
  • Use a projection network g(·) to project features to a space where contrastive learning is applied.
  • Generate positive samples through data augmentation:
    • Random cropping, random color distortion, and random blur.


Momentum Contrastive Learning (MoCo)

Decouples the number of negative samples from the minibatch size: a queue of encoded keys and a momentum-updated key encoder provide many negatives without requiring the very large batches that SimCLR uses.
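The two MoCo ingredients — a momentum-updated key encoder and a queue of negatives — can be sketched as follows (a minimal sketch; the linear encoders and queue size are placeholders):

```python
import torch
import torch.nn as nn

q_enc = nn.Linear(32, 16)          # query encoder (trained by backprop)
k_enc = nn.Linear(32, 16)          # key encoder (momentum copy)
k_enc.load_state_dict(q_enc.state_dict())

queue = torch.randn(4096, 16)      # queue of encoded keys (the negatives)

@torch.no_grad()
def momentum_update(m=0.999):
    # The key encoder slowly tracks the query encoder.
    for pk, pq in zip(k_enc.parameters(), q_enc.parameters()):
        pk.mul_(m).add_(pq, alpha=1 - m)

@torch.no_grad()
def enqueue(keys):
    global queue
    # Newest keys replace the oldest entries (FIFO queue).
    queue = torch.cat([keys, queue])[: len(queue)]

momentum_update()
enqueue(k_enc(torch.randn(64, 32)))
```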


Contrastive Language Image Pre-Training (CLIP)

  • Learns a joint image–text embedding by contrastively matching images with their text captions; the learned representations enable zero-shot image classification.


Self-Supervised Learning

Lecture 24

Reading: Partially covered in Chapter 19 of Bishop's Deep Learning textbook.