1 of 318

CS294-158 Deep Unsupervised Learning

Lecture 7 Self-Supervised Learning

Pieter Abbeel, Wilson Yan, Kevin Frans, Philipp Wu

2 of 318

Reminder: Representations Matter

2

Goodfellow

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

3 of 318

Depth often refines representations

3

Goodfellow

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

4 of 318

Today

  • Goal: representation learning
    • i.e. pre-train a NN so it can be finetuned to good performance on a downstream task with limited downstream data
  • How about generative models we covered so far?
    • e.g. AR, Flow, VAE, GAN, Diffusion
    • Yes, they can also achieve this
  • Today: alternative approaches to representation learning, which do not involve a generative model

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

5 of 318

What is Self-Supervised Learning?

  • A version of unsupervised learning where data provides the supervision

  • In general, withhold some part of the data and task a neural network with predicting it from the remaining parts

  • The details determine what proxy loss or pretext task the network solves; with a well-designed task, good semantic features can be obtained without any actual labels
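A minimal sketch of this recipe in PyTorch (toy shapes and architecture, purely illustrative): hide a random subset of the input dimensions and train a network to predict the hidden part from the visible part.

```python
import torch
import torch.nn as nn

# Toy mask-and-predict pretext task: withhold part of the data, predict it from the rest.
net = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 784))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

x = torch.rand(64, 784)                 # a batch of "data", e.g. flattened images
mask = torch.rand_like(x) < 0.25        # withhold 25% of the dimensions
x_visible = x.masked_fill(mask, 0.0)    # corrupted / partial input

pred = net(x_visible)
loss = ((pred - x)[mask] ** 2).mean()   # score only the withheld part
loss.backward()
opt.step()
```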

5

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

6 of 318

Motivation: LeCake

6

Yann LeCun’s cake

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

7 of 318

Motivation

7

Yann LeCun’s cake

Slide: LeCun

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

8 of 318

Outline

  • Reconstruct from a corrupted (or partial) version
    • Denoising AutoEncoder / Diffusion
    • In-painting / Masked AutoEncoder: MAE, VideoMAE, Audio-MAE, BeIT, M3AE, MultiMAE, SiamMAE
    • Colorization, Split-Brain AutoEncoder
  • Visual common sense tasks
    • Relative patch prediction
    • Jigsaw puzzles
    • Rotation
  • Contrastive Learning
    • Contrastive Predictive Coding (CPC)
    • Instance Discrimination: SimCLR, MoCo-v1,2,3, BYOL
  • Feature Prediction: DINO/DINOv2/iBOT, JEPA, I-JEPA, V-JEPA
  • Text-Image: CLIP, LiT, SigLIP, FLIP, SLIP, CoCa, BLIP/BLIP-2, ImageBind
  • RL and Control: R3M, CURL, MVP, MTM, Multi-View MAE and Masked World Models for Visual Control
  • Language
    • Word2vec and Glove
    • BERT, RoBERTa, T5, UL2

8

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

9 of 318

Denoising Autoencoder

9

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

10 of 318

Denoising Autoencoder

10

Vincent et al 2010

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

11 of 318

Denoising Autoencoder

11

Vincent et al 2010

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

12 of 318

Denoising Autoencoder

12

Vincent et al 2010

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

13 of 318

Emphasizing corrupted dimensions

13

Vincent et al 2010

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

14 of 318

Stacked Denoising Autoencoder

14

Vincent et al 2010

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

15 of 318

Denoising Autoencoder

15

Vincent et al 2010

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

16 of 318

Diffusion Models

EmerDiff: pixel-level segmentation masks from diffusion models – query vectors as features (see Lecture 6)

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

17 of 318

Predict missing pieces

17

Pathak et al 2016

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

18 of 318

Context Encoders

18

Pathak et al 2016

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

19 of 318

Context Encoders

19

Pathak et al 2016

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

20 of 318

Context Encoders

20

Pathak et al 2016

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

21 of 318

Context Encoders

21

Pathak et al 2016

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

22 of 318

Context Encoders

22

Pathak et al 2016

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

23 of 318

Context Encoders

23

Pathak et al 2016

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

24 of 318

Predicting one view from another

24

Slide: Richard Zhang

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

25 of 318

Predicting one view from another

25

Slide: Richard Zhang

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

26 of 318

Predicting one view from another

26

Slide: Richard Zhang

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

27 of 318

Predicting one view from another

27

Slide: Richard Zhang

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

28 of 318

Predicting one view from another

28

[Figure: Ground Truth vs. L2 regression vs. Pixelwise classification]

Slide: Richard Zhang

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

29 of 318

Predicting one view from another

29

Slide: Richard Zhang

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

30 of 318

Predicting one view from another

30

Slide: Richard Zhang

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

31 of 318

Predicting one view from another

31

Slide: Richard Zhang

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

32 of 318

Predicting one view from another

32

Slide: Richard Zhang

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

33 of 318

Temporal coherence of color

33

Slide: Zisserman

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

34 of 318

Tracking emerges from colorization

34

GIFs from Google AI Blog post

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

35 of 318

MAE

Nov, 2021

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

36 of 318

MAE

Architecture: Vision Transformer (ViT)

Encoder: BIG; decoder: small
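A minimal sketch of an MAE-style training step under these assumptions (hypothetical tensor shapes; `encoder` and `decoder` stand in for the large ViT encoder and the small transformer decoder; positional embeddings and unshuffling are omitted):

```python
import torch
import torch.nn as nn

B, N, D = 8, 196, 768                    # batch, patches per image, embedding dim
mask_ratio = 0.75                        # MAE masks a large fraction of patches

# Stand-ins for the asymmetric pair: a BIG encoder and a small decoder.
encoder = nn.TransformerEncoder(nn.TransformerEncoderLayer(D, 12, batch_first=True), 12)
decoder = nn.TransformerEncoder(nn.TransformerEncoderLayer(D, 4, batch_first=True), 2)
to_pixels = nn.Linear(D, 16 * 16 * 3)    # reconstruct each 16x16 RGB patch
mask_token = nn.Parameter(torch.zeros(1, 1, D))

patches = torch.randn(B, N, D)           # embedded image patches (after patchify + projection)
ids = torch.rand(B, N).argsort(dim=1)    # random shuffle per image
n_keep = int(N * (1 - mask_ratio))
keep_ids, drop_ids = ids[:, :n_keep], ids[:, n_keep:]

# The encoder sees ONLY the visible patches: big model, short sequence.
visible = torch.gather(patches, 1, keep_ids.unsqueeze(-1).expand(-1, -1, D))
latent = encoder(visible)

# The decoder gets encoded visible tokens plus learned mask tokens, then predicts pixels.
dec_in = torch.cat([latent, mask_token.expand(B, N - n_keep, D)], dim=1)
pred = to_pixels(decoder(dec_in))[:, n_keep:]        # predictions at the masked positions
target = torch.randn(B, N - n_keep, 16 * 16 * 3)     # stand-in for the true pixels at drop_ids
loss = ((pred - target) ** 2).mean()                 # reconstruction loss on masked patches only
```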

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

37 of 318

MAE on ImageNet validation images

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

38 of 318

MAE on COCO validation images

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

39 of 318

Masking Ratio

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

40 of 318

Comparison with Prior SOTA

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

41 of 318

MAE Cousins / Derivatives

  • BeIT
  • VideoMAE
  • SiamMAE
  • Audio-MAE
  • M3AE
  • MultiMAE
  • Multi-View / Masked World Models for Visual Control (covered in 2nd half of lecture)

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

42 of 318

BEIT

[June 2021 / Sep 2022]

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

43 of 318

BEIT Architecture

Discrete VAE (dVAE) tokenizer

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

44 of 318

VideoMAE

[Oct, 2022]

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

45 of 318

VideoMAE Architecture

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

46 of 318

Observations

– High masking ratio: 90% to 95%

– Impressive results even on very small datasets, e.g. 3k videos

– Data quality is more important than data quantity for self-supervised video pre-training. Domain shift between pre-training and target datasets is an important factor.

– VideoMAE with the vanilla ViT backbone can achieve 87.4% on Kinetics-400, 75.4% on Something-Something V2, 91.3% on UCF101, and 62.6% on HMDB51, without using any extra data.

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

47 of 318

Experiments on Something-Something V2

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

48 of 318

Experiments on Kinetics 400

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

49 of 318

Siam MAE

[May 2023]

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

50 of 318

SiamMAE: Architecture

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

51 of 318

SiamMAE: Key idea

– By masking a large fraction (95%) of patches in the future frame while leaving the past frame unchanged, SiamMAE encourages the network to focus on object motion and learn object-centric representations.

– SiamMAE outperforms state-of-the-art self-supervised methods on video object segmentation, pose keypoint propagation, and semantic part propagation tasks
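A minimal sketch of that asymmetric masking (hypothetical shapes; only the input construction, not the full siamese encoder / cross-attention decoder):

```python
import torch

B, N, D = 4, 196, 768              # batch, patches per frame, embedding dim
past = torch.randn(B, N, D)        # patch embeddings of frame t, kept fully visible
future = torch.randn(B, N, D)      # patch embeddings of a future frame t+k

mask_ratio = 0.95                  # mask almost all of the future frame
n_keep = int(N * (1 - mask_ratio))
keep_ids = torch.rand(B, N).argsort(dim=1)[:, :n_keep]

future_visible = torch.gather(future, 1, keep_ids.unsqueeze(-1).expand(-1, -1, D))
# The network conditions on all past-frame patches plus the few visible future patches,
# and must reconstruct the masked future patches -- which forces it to track motion.
context = torch.cat([past, future_visible], dim=1)   # shape (B, N + n_keep, D)
```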

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning


58 of 318

Audio-MAE

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

59 of 318

Audio-MAE

– Encodes audio spectrogram patches with a high masking ratio, feeding only the non-masked tokens through encoder layers.

– The decoder then re-orders and decodes the encoded context padded with mask tokens, in order to reconstruct the input spectrogram.

– Local window attention in the decoder, as audio spectrograms are highly correlated in local time and frequency bands.

– Fine-tune the encoder with a lower masking ratio on target datasets.

– Empirically, Audio-MAE sets new state-of-the-art performance on six audio and speech classification tasks, outperforming other recent models that use external supervised pre-training.

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

60 of 318

Audio-MAE: Architecture

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

61 of 318

Audio-MAE: Architecture

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

62 of 318

MultiMAE

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning


65 of 318

MultiMAE observations

– like MAE, encoder only processes non-masked tokens

– like MAE, shallow decoders

– pseudolabels for non-RGB modalities

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

66 of 318

MultiMAE Experiments

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

67 of 318

MultiMAE Experiments

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

68 of 318

M3AE: MultiModal MAE

[Oct 2022]

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

69 of 318

M3AE Contributions

– Until M3AE: dominant multi-modal representation learning paradigm was contrastive learning (CLIP, ALIGN)

– Downside of cross-modal contrastive: only works with paired data

– We find that multimodal pretraining of M3AE on CC12M achieves significantly higher performance on the ImageNet-1k linear classification benchmark [33] compared to pre-training on images only (MAE).

– M3AE performs best when we apply a high mask ratio (75%) on language, while in contrast, language models like BERT conventionally use a low mask ratio (15%)

– Encoder: image patches and language tokens, ViT

– Decoder: lightweight, following MAE
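A minimal sketch of assembling the masked multimodal input described above (hypothetical shapes; real M3AE also adds positional and modality-type embeddings):

```python
import torch

B, P, T, D = 4, 196, 32, 768            # batch, image patches, text tokens, embedding dim
img = torch.randn(B, P, D)              # embedded image patches
txt = torch.randn(B, T, D)              # embedded language tokens

def keep_indices(n, ratio, batch):
    ids = torch.rand(batch, n).argsort(dim=1)
    return ids[:, : int(n * (1 - ratio))]

img_keep = keep_indices(P, 0.75, B)     # high mask ratio on image patches, as in MAE
txt_keep = keep_indices(T, 0.75, B)     # M3AE also uses a high (75%) mask ratio on text

def gather(x, idx):
    return torch.gather(x, 1, idx.unsqueeze(-1).expand(-1, -1, x.size(-1)))

# One shared ViT encoder processes the joint visible sequence; a lightweight decoder
# then reconstructs the masked patches and the masked tokens.
encoder_input = torch.cat([gather(img, img_keep), gather(txt, txt_keep)], dim=1)
```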

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

70 of 318

M3AE: Architecture

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

71 of 318

Comparison with MAE

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

72 of 318

Multiview MWM

[covered in a later section of lecture]

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

73 of 318

Outline

  • Reconstruct from a corrupted (or partial) version
    • Denoising AutoEncoder / Diffusion
    • In-painting / Masked AutoEncoder: MAE, VideoMAE, Audio-MAE, BeIT, M3AE, MultiMAE, SiamMAE
    • Colorization, Split-Brain AutoEncoder
  • Visual common sense tasks
    • Relative patch prediction
    • Jigsaw puzzles
    • Rotation
  • Contrastive Learning
    • Contrastive Predictive Coding (CPC)
    • Instance Discrimination: SimCLR, MoCo-v1,2,3, BYOL, DINO/DINOv2, JEPA, I-JEPA, V-JEPA
    • Text-Image: CLIP, LiT, SigLIP, FLIP, SLIP, CoCa, BLIP/BLIP-2, ImageBind
  • RL and Control
    • R3M, CURL, MVP, MTM, Multi-View MAE and Masked World Models for Visual Control
  • Language
    • Word2vec and Glove
    • BERT, RoBERTa, T5, UL2

73

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

74 of 318

Relative Position of Image Patches

74

Task: Predict the relative position of the second patch with respect to the first
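A minimal sketch of generating training pairs for this pretext task (the patch size and gap are illustrative; the image is assumed large enough, roughly 3x(patch+gap) pixels per side):

```python
import torch

def sample_relative_patch_pair(image, patch=96, gap=16):
    """image: (3, H, W) tensor. Returns (center_patch, neighbor_patch, label in 0..7)."""
    # The 8 neighbor offsets around the center cell of a 3x3 grid, skipping (0, 0).
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]
    label = torch.randint(0, 8, ()).item()
    dy, dx = offsets[label]

    step = patch + gap                         # a gap discourages trivial edge-continuity cues
    _, H, W = image.shape
    cy = torch.randint(step, H - 2 * step, ()).item()
    cx = torch.randint(step, W - 2 * step, ()).item()

    center = image[:, cy:cy + patch, cx:cx + patch]
    ny, nx = cy + dy * step, cx + dx * step
    neighbor = image[:, ny:ny + patch, nx:nx + patch]
    return center, neighbor, label

# A shared CNN embeds both patches; a small head does the 8-way relative-position classification.
```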

Slide: Zisserman

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

75 of 318

Relative Position of Image Patches

75

Slide: Zisserman

Doersch, Gupta, Efros

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

76 of 318

Relative Position of Image Patches

76

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

77 of 318

Relative Position of Image Patches

77

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

78 of 318

Solving Jigsaw Puzzles

78

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

79 of 318

Solving Jigsaw Puzzles

79

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

80 of 318

Rotation

80

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

81 of 318

Rotation

81

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

82 of 318

Rotation

82

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

83 of 318

Rotation

83

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

84 of 318

Rotation

84

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

85 of 318

Rotation

85

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

86 of 318

Outline

  • Reconstruct from a corrupted (or partial) version
    • Denoising AutoEncoder / Diffusion
    • In-painting / Masked AutoEncoder: MAE, VideoMAE, Audio-MAE, BeIT, M3AE, MultiMAE, SiamMAE
    • Colorization, Split-Brain AutoEncoder
  • Visual common sense tasks
    • Relative patch prediction
    • Jigsaw puzzles
    • Rotation
  • Contrastive Learning
    • Contrastive Predictive Coding (CPC)
    • Instance Discrimination: SimCLR, MoCo-v1,2,3, BYOL
  • Feature Prediction: DINO/DINOv2/iBOT, JEPA, I-JEPA, V-JEPA
  • Text-Image: CLIP, LiT, SigLIP, FLIP, SLIP, CoCa, BLIP/BLIP-2, ImageBind
  • RL and Control: R3M, CURL, MVP, MTM, Multi-View MAE and Masked World Models for Visual Control
  • Language
    • Word2vec and Glove
    • BERT, RoBERTa, T5, UL2

86

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

87 of 318

Contrastive Predictive Coding

87

July 2018

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

88 of 318

Contrastive Predictive Coding

88

Figure from Alex Graves

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

89 of 318

Contrastive Predictive Coding

89

Figure from Alex Graves

Don't directly predict x

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

90 of 318

Contrastive Predictive Coding

90

Figure from Alex Graves

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

91 of 318

Contrastive Predictive Coding

91

Figure from Alex Graves

Bilinear dot product

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

92 of 318

Contrastive Predictive Coding

92

Figure from Alex Graves

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

93 of 318

Contrastive Predictive Coding

93

Figure from Alex Graves

InfoNCE

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

94 of 318

Contrastive Predictive Coding

94

Figure from Alex Graves

InfoNCE

Can be viewed as categorical cross-entropy of classifying the positive sample correctly
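A minimal sketch of the InfoNCE loss with a bilinear score, matching this classification view (hypothetical shapes; `z` are encodings of candidate future samples, `c` are context vectors):

```python
import torch
import torch.nn.functional as F

B, Dz, Dc = 32, 128, 256                     # batch of positive (context, future) pairs
z = torch.randn(B, Dz)                       # encodings of the true future samples
c = torch.randn(B, Dc)                       # context vectors from the autoregressive model
W = torch.randn(Dz, Dc, requires_grad=True)  # bilinear weight: score(z, c) = z^T W c

# logits[i, j]: score of candidate z_j for context c_i. The true future sits on the diagonal;
# the other entries in each row act as negatives.
logits = c @ W.t() @ z.t()                   # (B, B)
labels = torch.arange(B)
loss = F.cross_entropy(logits, labels)       # categorical CE of picking the positive sample
```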

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

95 of 318

Contrastive Predictive Coding

95

Figure from Alex Graves

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

96 of 318

Contrastive Predictive Coding

96

Figure from Alex Graves

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

97 of 318

Contrastive Predictive Coding

97

Figure from Alex Graves

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

98 of 318

Contrastive Predictive Coding

98

Figure from Alex Graves

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

99 of 318

Contrastive Predictive Coding

99

Figure from Alex Graves

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

100 of 318

CPC - Speech

100

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

101 of 318

CPC - Speech

101

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

102 of 318

CPC - ImageNet

102

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

103 of 318

CPC - ImageNet

103

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

104 of 318

CPC - ImageNet

104

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

105 of 318

CPC - ImageNet

105

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

106 of 318

CPC - Natural Language Processing

106

Oord, Li, Vinyals 2018

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

107 of 318

CPC - Reinforcement Learning

107

Figure from Aaron Van den Oord

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

108 of 318

CPCv2 - Large Scale CPC on ImageNet

108

May 2019

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

109 of 318

CPCv2 - Large Scale CPC on ImageNet

109

Figure from Aaron Van den Oord

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

110 of 318

CPCv2 - Large Scale CPC on ImageNet

110

Figure from Aaron Van den Oord

ResNet-161

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

111 of 318

CPCv2 - Large Scale CPC on ImageNet

111

Figure from Aaron Van den Oord

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

112 of 318

CPCv2 - Large Scale CPC on ImageNet

112

Figure from Aaron Van den Oord

ResNet-161

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

113 of 318

CPCv2 - Large Scale CPC on ImageNet

113

Figure from Aaron Van den Oord

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

114 of 318

CPCv2 - Large Scale CPC on ImageNet

114

Figure from Aaron Van den Oord

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

115 of 318

CPCv2 - Large Scale CPC on ImageNet

115

Figure from Aaron Van den Oord

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

116 of 318

CPCv2 - Large Scale CPC on ImageNet

116

Figure from Aaron Van den Oord

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

117 of 318

CPCv2 - Large Scale CPC on ImageNet

117

Figure from Aaron Van den Oord

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

118 of 318

CPCv2 - Large Scale CPC on ImageNet

118

Figure from Aaron Van den Oord

Negatives: (1) other patches within the image, (2) patches from other images

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

119 of 318

CPCv2 - Large Scale CPC on ImageNet

119

Negatives: (1) other patches within the image, (2) patches from other images

InfoNCE Loss

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

120 of 318

CPCv2 - Large Scale CPC on ImageNet

120

Negatives: (1) other patches within the image, (2) patches from other images

InfoNCE Loss

Parallel Implementation

with PixelCNN (masked conv) and 1x1 conv

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

121 of 318

CPCv2 - Large Scale CPC on ImageNet

121

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

122 of 318

CPCv2 - Large Scale CPC on ImageNet

122

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

123 of 318

CPCv2 - Linear Classification

Linear Classifier Score (ImageNet)

123

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

124 of 318

CPCv1 ---> CPCv2

124

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

125 of 318

CPCv2 - Data-Efficient Image Recognition

125

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

126 of 318

Instance Discrimination

attract

repel

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

127 of 318

Instance Discrimination

attract

repel

  1. MoCo
  2. SimCLR
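A minimal sketch of the attract/repel objective in the SimCLR style (NT-Xent over two augmented views of the same images; shapes and temperature are illustrative):

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.1):
    """z1, z2: (B, D) projections of two augmentations of the same B images."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)    # (2B, D), unit norm
    sim = z @ z.t() / temperature                          # pairwise cosine similarities
    sim.fill_diagonal_(float('-inf'))                      # a view is never its own positive
    B = z1.size(0)
    # The positive for row i is its other view: i+B for the first half, i-B for the second.
    targets = torch.cat([torch.arange(B) + B, torch.arange(B)])
    return F.cross_entropy(sim, targets)                   # attract positives, repel the rest

z1, z2 = torch.randn(256, 128), torch.randn(256, 128)      # projector outputs for two views
loss = nt_xent(z1, z2)
```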

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

128 of 318

Momentum Contrast (MoCo)

Nov 2019

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

129 of 318

Momentum Contrast (MoCo)

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

130 of 318

Momentum Contrast (MoCo)

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

131 of 318

Momentum Contrast (MoCo)

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

132 of 318

Momentum Contrast (MoCo)

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

133 of 318

Momentum Contrast (MoCo)

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

134 of 318

Momentum Contrast (MoCo)

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

135 of 318

SimCLR

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

136 of 318

SimCLR

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

137 of 318

SimCLR

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

138 of 318

SimCLR

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

139 of 318

SimCLR

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

140 of 318

MoCov2 vs SimCLR

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

141 of 318

MoCov2 vs SimCLR

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

142 of 318

MoCov2 vs SimCLR

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

143 of 318

MoCo v3

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

144 of 318

MoCo v3

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

145 of 318

MoCo v3

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

146 of 318

MoCo v3

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

147 of 318

MoCo v3

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

148 of 318

BYOL

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

149 of 318

BYOL

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

150 of 318

BYOL

Normalize features

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

151 of 318

BYOL

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

152 of 318

BYOL

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

153 of 318

BYOL

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

154 of 318

BYOL

Another perspective

  • Batch norm needed to prevent mode collapse
  • Implicit contrastive learning: the common mode between examples in the minibatch is removed
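A minimal sketch of the BYOL objective under these assumptions (normalized MSE between the online predictor output and the target projection; the target network follows the online network by EMA, not by gradients):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D = 256
online_proj = nn.Linear(512, D)          # stand-in for the online encoder + projector
predictor = nn.Linear(D, D)              # extra predictor head, only on the online branch
target_proj = nn.Linear(512, D)          # stand-in for the target encoder + projector
target_proj.load_state_dict(online_proj.state_dict())
for p in target_proj.parameters():
    p.requires_grad_(False)              # the target network receives no gradients

h1, h2 = torch.randn(64, 512), torch.randn(64, 512)   # features of two views (placeholders)
p1 = F.normalize(predictor(online_proj(h1)), dim=1)    # online prediction of the target
z2 = F.normalize(target_proj(h2), dim=1).detach()      # target projection, stop-gradient
loss = (2 - 2 * (p1 * z2).sum(dim=1)).mean()           # normalized MSE == 2 - 2*cosine

# After each optimizer step, the target slowly tracks the online network:
tau = 0.99
with torch.no_grad():
    for pt, po in zip(target_proj.parameters(), online_proj.parameters()):
        pt.mul_(tau).add_((1 - tau) * po)
```

In the full method the loss is symmetrized by also predicting view 1's target projection from view 2.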

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

155 of 318

Summary

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

156 of 318

Outline

  • Reconstruct from a corrupted (or partial) version
    • Denoising AutoEncoder / Diffusion
    • In-painting / Masked AutoEncoder: MAE, VideoMAE, Audio-MAE, BeIT, M3AE, MultiMAE, SiamMAE
    • Colorization, Split-Brain AutoEncoder
  • Visual common sense tasks
    • Relative patch prediction
    • Jigsaw puzzles
    • Rotation
  • Contrastive Learning
    • Contrastive Predictive Coding (CPC)
    • Instance Discrimination: SimCLR, MoCo-v1,2,3, BYOL
  • Feature Prediction: DINO/DINOv2/iBOT, JEPA, I-JEPA, V-JEPA
  • Text-Image: CLIP, LiT, SigLIP, FLIP, SLIP, CoCa, BLIP/BLIP-2, ImageBind
  • RL and Control: R3M, CURL, MVP, MTM, Multi-View MAE and Masked World Models for Visual Control
  • Language
    • Word2vec and Glove
    • BERT, RoBERTa, T5, UL2

156

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

157 of 318

DINO

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

158 of 318

DINO

Consider knowledge distillation

  • Student network g_s tries to match a teacher network g_t
  • Minimize the cross entropy of the distributions

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

159 of 318

DINO

Self supervised learning as knowledge distillation

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

160 of 318

DINO

Apply centering to the teacher outputs to avoid collapse; the center is an EMA, so it works across different batch sizes
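A minimal sketch of the centering-plus-sharpening step described above (hypothetical shapes; both the center and the teacher weights are EMA-updated):

```python
import torch
import torch.nn.functional as F

K = 4096                                  # output (prototype) dimension
center = torch.zeros(1, K)                # running center of teacher outputs
tau_s, tau_t, m = 0.1, 0.04, 0.9          # student temp, teacher temp, center momentum

student_out = torch.randn(64, K)          # student logits on one view (placeholder)
teacher_out = torch.randn(64, K)          # teacher logits on another view (placeholder)

# Teacher targets: centering fights collapse to a single dominant dimension,
# sharpening (low temperature) fights collapse to the uniform distribution.
t = F.softmax((teacher_out - center) / tau_t, dim=1).detach()
s = F.log_softmax(student_out / tau_s, dim=1)
loss = -(t * s).sum(dim=1).mean()         # cross-entropy between teacher and student

# EMA update of the center; the teacher's weights are likewise an EMA of the student's.
center = m * center + (1 - m) * teacher_out.mean(dim=0, keepdim=True)
```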

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

161 of 318

DINO

Threshold the attention map to get a mask

Compare its similarity to the ground-truth mask

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

162 of 318

DINO

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

163 of 318

DINO

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

164 of 318

iBOT

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

165 of 318

iBOT

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

166 of 318

iBOT

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

167 of 318

DINO - V2

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

168 of 318

DINO - V2

ViT-L

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

169 of 318

DINO-V2

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

170 of 318

DINO - V2

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

171 of 318

DINO-V2

Feature matching

Image retrieval

Segmentation

Depth prediction

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

172 of 318

JEPA

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

173 of 318

I-JEPA

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

174 of 318

I-JEPA

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

175 of 318

I-JEPA

Context Encoder, Target Encoder and Predictor are ViTs

Predictor

  • Transformer encoder
  • Concat context tokens
  • Have masked tokens for prediction patches
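A minimal sketch of the prediction step under these assumptions (the predictor sees the context tokens plus learned mask tokens at the target locations and regresses target-encoder features, i.e. the loss lives in representation space, not pixel space):

```python
import torch
import torch.nn as nn

B, D = 8, 768
n_ctx, n_tgt = 120, 40                                   # context tokens, target-position tokens

predictor = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(D, 12, batch_first=True), num_layers=6)
mask_token = nn.Parameter(torch.zeros(1, 1, D))

ctx = torch.randn(B, n_ctx, D)                           # context-encoder outputs (placeholder)
with torch.no_grad():
    target = torch.randn(B, n_tgt, D)                    # target-encoder outputs (EMA copy, placeholder)

# Concatenate the context tokens with mask tokens standing in for the target patches.
pred_in = torch.cat([ctx, mask_token.expand(B, n_tgt, D)], dim=1)
pred = predictor(pred_in)[:, n_ctx:]                     # predictions at the target positions
loss = ((pred - target) ** 2).mean()                     # L2 loss on features, not pixels
```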

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

176 of 318

I-JEPA

Context and Target Selection

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

177 of 318

I-JEPA

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

178 of 318

Freeze context encoder and predictor

Train an RCDM (representation-conditioned diffusion model) to visualize predictions

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

179 of 318

V-JEPA

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

180 of 318

V-JEPA

  • Performs well on downstream video/image tasks
  • Better than pixel-prediction approaches when the backbone is frozen
  • Competitive with full fine-tuning
  • Shorter training schedules

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

181 of 318

V-JEPA

Short-range masks: union of 8 randomly sampled target blocks covering 15% of each frame

Long-range masks: union of 2 randomly sampled target blocks covering 70% of each frame

~90% mask ratio

Trained on a large dataset of 2 million videos drawn from publicly available datasets

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

182 of 318

V-JEPA

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

183 of 318

V-JEPA

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

184 of 318

V-JEPA

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

185 of 318

V-JEPA

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning


187 of 318

Outline

  • Reconstruct from a corrupted (or partial) version
    • Denoising AutoEncoder / Diffusion
    • In-painting / Masked AutoEncoder: MAE, VideoMAE, Audio-MAE, BeIT, M3AE, MultiMAE, SiamMAE
    • Colorization, Split-Brain AutoEncoder
  • Visual common sense tasks
    • Relative patch prediction
    • Jigsaw puzzles
    • Rotation
  • Contrastive Learning
    • Contrastive Predictive Coding (CPC)
    • Instance Discrimination: SimCLR, MoCo-v1,2,3, BYOL
  • Feature Prediction: DINO/DINOv2/iBOT, JEPA, I-JEPA, V-JEPA
  • Text-Image: CLIP, LiT, SigLIP, FLIP, SLIP, CoCa, BLIP/BLIP-2, ImageBind
  • RL and Control: R3M, CURL, MVP, MTM, Multi-View MAE and Masked World Models for Visual Control
  • Language
    • Word2vec and Glove
    • BERT, RoBERTa, T5, UL2

187

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

188 of 318

CLIP

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

189 of 318

CLIP

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

190 of 318

CLIP

Dataset

  • Existing text-annotated image datasets at the time were relatively small
  • YFCC100M
    • Text metadata quality is low, some captions are automatically generated file names like “20160716 113957.JPG”
  • Constructed dataset of 400M image-text pairs
  • Images searched with one of 500K generated queries
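A minimal sketch of the CLIP-style symmetric contrastive objective trained on such image-text pairs (hypothetical embeddings; the temperature is a learned parameter in the real model):

```python
import torch
import torch.nn.functional as F

B, D = 128, 512
img_emb = F.normalize(torch.randn(B, D), dim=1)   # image-encoder outputs for a batch of pairs
txt_emb = F.normalize(torch.randn(B, D), dim=1)   # text-encoder outputs for the matching captions
logit_scale = 1 / 0.07                             # temperature (learned in the real model)

logits = logit_scale * img_emb @ txt_emb.t()       # (B, B): pair (i, i) is the positive
labels = torch.arange(B)
# Symmetric cross-entropy: pick the right caption per image and the right image per caption.
loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
```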

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning


192 of 318

CLIP

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

193 of 318

CLIP

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

194 of 318

CLIP

CLIP learns features useful for other models

unCLIP

LLaVA

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

195 of 318

LiT

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

196 of 318

LiT

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

197 of 318

LiT

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

198 of 318

LiT

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

199 of 318

SigLIP

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

200 of 318

SigLIP

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

201 of 318

SigLIP

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

202 of 318

SigLIP

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

203 of 318

FLIP

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

204 of 318

FLIP

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

205 of 318

FLIP

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

206 of 318

FLIP

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

207 of 318

SLIP

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

208 of 318

SLIP

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

209 of 318

SLIP

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

210 of 318

SLIP

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

211 of 318

CoCa

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

212 of 318

CoCa

  • Contrastive learning combined with a conventional encoder-decoder transformer
  • Achieves 91% top-1 ImageNet accuracy after fine-tuning

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

213 of 318

CoCa

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

214 of 318

CoCa

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

215 of 318

CoCa

  • Trained from scratch on the JFT-3B dataset and the ALIGN dataset
  • JFT is an (internal) Google classification benchmark
    • Randomly sample a caption from a templated prompt
    • i.e. “a photo of the cat, animal”

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

216 of 318

CoCa

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

217 of 318

CoCa

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

218 of 318

CoCa

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

219 of 318

ImageBind

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

220 of 318

ImageBind

Train with (Image, Modality) pairs

Transformer for all modality encoders

Train with InfoNCE loss
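A minimal sketch of binding one non-image modality to the image embedding space with InfoNCE, as described above (hypothetical embeddings; in the real model every modality has its own transformer encoder and is always paired against images):

```python
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    a, b = F.normalize(a, dim=1), F.normalize(b, dim=1)
    logits = a @ b.t() / temperature        # (B, B); matching (image, modality) pairs on the diagonal
    labels = torch.arange(a.size(0))
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

img = torch.randn(64, 1024)    # image-encoder embeddings
aud = torch.randn(64, 1024)    # e.g. audio-encoder embeddings for the paired clips
loss = info_nce(img, aud)      # repeated per modality, with images as the common anchor
```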

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

221 of 318

ImageBind

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

222 of 318

BLIP-2

  • Goal: Efficient, zero-shot capabilities in vision-language tasks
  • Pitfalls:
    • end-to-end training on vision-language models is expensive
    • Finetuning from pretrained LLMs or ViTs can result in catastrophic forgetting
    • Aligning image and language modalities is hard

30 Jan 2023

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

223 of 318

BLIP-2

  • Solution: Q-Former
    • Learn the image-language modality alignment, then finetune for language generation
  • 2-stage pretraining: vision-language representation learning, vision-to-language generative learning

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

224 of 318

BLIP-2

  • Representation learning just focuses on this block
  • Three objectives: Image-Grounded Text Generation (ITG), Image-Text Matching (ITM), Image-Text Contrastive Learning (ITC)
  • Queries cross-attend to the frozen image encoder's output (in alternating blocks); queries and text interact through self-attention
  • Denote the image X and text Y of an image-text pair (here, image + caption)

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

225 of 318

BLIP-2

[Q-Former figure: frozen image encoder on X feeds an image transformer over learned input queries; a text transformer encodes the CLS token and text tokens Y; max similarity is computed between the output query embeddings and the CLS/text embedding]

ITC (Image-Text Contrastive Learning)

  • Maximum similarity computed among queries and taken as text-image similarity
  • Similarities contrasted between positive image-text pairs and negatives in-batch
  • Result: aligning relevant image-text modalities
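A minimal sketch of the max-over-queries similarity used in ITC under these assumptions (hypothetical shapes and temperature):

```python
import torch
import torch.nn.functional as F

B, Q, D = 32, 32, 256                                    # batch, learned queries, embedding dim
query_emb = F.normalize(torch.randn(B, Q, D), dim=-1)    # output query embeddings per image
text_cls = F.normalize(torch.randn(B, D), dim=-1)        # text [CLS] embeddings per caption

# Score every image's queries against every caption, then take the max over queries.
sim = torch.einsum('iqd,jd->ijq', query_emb, text_cls)   # (B_img, B_txt, Q)
sim = sim.max(dim=-1).values                             # image-text similarity matrix (B, B)
labels = torch.arange(B)                                  # in-batch negatives off the diagonal
loss = (F.cross_entropy(sim / 0.07, labels) + F.cross_entropy(sim.t() / 0.07, labels)) / 2
```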

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

226 of 318

BLIP-2

[Q-Former figure: frozen image encoder on X feeds the image transformer over input queries; the text transformer starts from a DEC token and produces autoregressive text output Y]

ITG (Image-Grounded Text Generation)

  • Queries attend on one another, separate from text transformer
  • Text transformer takes in decoder token and conditions on queries to do text generation (using as a target the same text label as in ITC)

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

227 of 318

BLIP-2

[Q-Former figure: frozen image encoder on X feeds the image transformer over input queries; the text transformer starts from a DEC token and generates autoregressive text Y]

ITG (Image-Grounded Text Generation)

  • Text labels are generated autoregressively - resulting attention mask shown above
  • Train on text generation loss
  • Result: creating informative vision-language tokens

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

228 of 318

BLIP-2

[Q-Former figure: frozen image encoder on X feeds the image transformer over input queries; the text transformer encodes text tokens Y; output queries pass through a classifier and are averaged]

ITM (Image-Text Matching)

  • Queries and text attend on one another unrestricted
  • Output queries classified and averaged, trained on classifying (X,Y) as a pair or not
  • Hard negative mining utilized
  • Result: fine-grained representation learning

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

229 of 318

BLIP-2

  • Second pretraining stage: use X as input to encoder, Y as output from LLM Decoder
  • Q-Former is finetuned on the same data, along with a projection layer between Q-Former and the LLM Decoder (language modeling loss)

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

230 of 318

BLIP-2

Training

  • Trained on 129 million image-caption pairs
    • COCO, Visual Genome, LAION400M, etc.
  • Used CapFilt to generate synthetic captions from BLIP-1 and filter on caption similarity with CLIP embeddings
  • Random augmentations: cropping, horizontal flipping

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

231 of 318

BLIP-2

  • BLIP-2 shows notable improvements, especially in zero-shot visual question answering
  • Image-text retrieval doesn't utilize the language-generation stage (so only the BLIP-2 visual encoder gets used)

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

232 of 318

BLIP-2

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

233 of 318

Outline

  • Reconstruct from a corrupted (or partial) version
    • Denoising AutoEncoder / Diffusion
    • In-painting / Masked AutoEncoder: MAE, VideoMAE, Audio-MAE, BeIT, M3AE, MultiMAE, SiamMAE
    • Colorization, Split-Brain AutoEncoder
  • Visual common sense tasks
    • Relative patch prediction
    • Jigsaw puzzles
    • Rotation
  • Contrastive Learning
    • Contrastive Predictive Coding (CPC)
    • Instance Discrimination: SimCLR, MoCo-v1,2,3, BYOL
  • Feature Prediction: DINO/DINOv2/iBOT, JEPA, I-JEPA, V-JEPA
  • Text-Image: CLIP, LiT, SigLIP, FLIP, SLIP, CoCa, BLIP/BLIP-2, ImageBind
  • RL and Control: R3M, CURL, MVP, MTM, Multi-View MAE and Masked World Models for Visual Control
  • Language
    • Word2vec and Glove
    • BERT, RoBERTa, T5, UL2

233

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

234 of 318

CURL

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

235 of 318

CURL

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

236 of 318

CURL

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

237 of 318

CURL

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

238 of 318

R3M

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

239 of 318

R3M

Motivation: plug-and-play, general visual representations for robotics must contain three main ingredients…

  1. Temporal dynamics of the scene, i.e. how states might transition to other states
  2. A prior over semantic relevance: should focus on task-relevant features like objects and their relationships
  3. Be compact, excluding features irrelevant to the previous two criteria, such as backgrounds

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

240 of 318

R3M

Ego4D is diverse, in-the-wild, and language-annotated

Contains 3,500 hours of data from 70 locations across the globe

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

241 of 318

R3M

Time Contrastive Learning encodes temporal dynamics into the representation (a toy loss is sketched after this list)

  1. Frames closer in time are more perceptually similar than frames farther apart in time
  2. Frames from the same video are more perceptually similar than frames from other videos
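
A minimal sketch of such a time-contrastive (InfoNCE-style) loss; the encoder f, batch layout, and temperature are illustrative assumptions rather than the exact R3M implementation:

import torch
import torch.nn.functional as F

def time_contrastive_loss(f, anchor, positive, negatives, tau=0.1):
    # anchor, positive: frames close in time from the same video, shape (B, C, H, W)
    # negatives: frames far in time and/or from other videos, shape (B, K, C, H, W)
    # f: image encoder mapping a batch of frames to embeddings of size D
    B, K = negatives.shape[:2]
    za = F.normalize(f(anchor), dim=-1)                                  # (B, D)
    zp = F.normalize(f(positive), dim=-1)                                # (B, D)
    zn = F.normalize(f(negatives.flatten(0, 1)), dim=-1).view(B, K, -1)  # (B, K, D)
    pos = (za * zp).sum(-1, keepdim=True) / tau                          # (B, 1)
    neg = torch.einsum('bd,bkd->bk', za, zn) / tau                       # (B, K)
    logits = torch.cat([pos, neg], dim=-1)                               # (B, 1 + K)
    labels = torch.zeros(B, dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)   # the temporally-close positive sits at index 0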

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

242 of 318

R3M

Video-language alignment encourages F to capture semantically relevant features

  • Video-language alignment should increase over the course of the video
  • Frames from a captioned video should be more aligned with that language than frames from another video

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

243 of 318

R3M

Joint optimization - simple regularizers encourage sparsity of representations (a toy combination of the terms is sketched below)

L1 reduces representations to only the critical features

L2 likely acts mostly as an additional regularizer

  • ResNet18, ResNet34, and ResNet50 architectures, optimized with Adam, are all released
  • Random cropping is applied at the video level
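
A toy sketch of how the terms could be combined; the weights and the mean-based penalties are illustrative placeholders, not the paper's exact values:

def r3m_style_loss(z, tcn_loss, lang_loss, w_tcn=1.0, w_lang=1.0, w_l1=1e-5, w_l2=1e-5):
    # z: batch of frame embeddings, shape (B, D); tcn_loss / lang_loss computed elsewhere
    l1_penalty = z.abs().mean()     # pushes the representation toward sparsity
    l2_penalty = z.pow(2).mean()    # mild norm regularization
    return w_tcn * tcn_loss + w_lang * lang_loss + w_l1 * l1_penalty + w_l2 * l2_penalty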

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

244 of 318

R3M

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

245 of 318

R3M

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

246 of 318

R3M

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

247 of 318

MVP

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

248 of 318

MVP

Use MAE-learned features for robotic control

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

249 of 318

MVP

Establishes a benchmark dataset and evaluation suite

Train on a Human-Object Interaction dataset of 700k images:

  1. Epic Kitchens
  2. YouTube 100 Days of Hands
  3. Something-Something

Evaluate on the new PixMC benchmark - train with PPO

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

250 of 318

MVP

Evaluate with highly parallelized PPO (a toy actor sketch follows this list)

  • MLP policy
  • Critic has the same architecture and learns from the same representations
  • State is the MAE representation + proprioception
  • Action space is position control in joint-angle space
  • Learn from hand-crafted dense rewards
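
A toy sketch of such an actor; the layer sizes and the mae_encoder interface are illustrative assumptions, not the MVP codebase:

import torch
import torch.nn as nn

class FrozenMAEPolicy(nn.Module):
    # Frozen MAE visual features, concatenated with proprioception, fed to an MLP actor.
    def __init__(self, mae_encoder, feat_dim, proprio_dim, action_dim, hidden=256):
        super().__init__()
        self.encoder = mae_encoder
        for p in self.encoder.parameters():   # keep the pretrained visual features frozen
            p.requires_grad = False
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + proprio_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),    # joint-angle position targets
        )

    def forward(self, image, proprio):
        with torch.no_grad():
            feat = self.encoder(image)        # e.g. a ViT [CLS] embedding
        return self.mlp(torch.cat([feat, proprio], dim=-1))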

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

251 of 318

MVP

Self-supervised pretraining on large data is better than supervised pretraining on smaller data (ImageNet)

Oracle has access to hand-engineered state: object locations, 3D poses, direction-to-goal vectors

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

252 of 318

MVP

Representations are robust to distractors and generalize across different object types

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

253 of 318

Egocentric human videos (e.g. Ego4D, Epic-Kitchens)
  + Large volume and diversity
  - Lacks modalities important for embodied AI (e.g. proprioception, actions)

Robot execution trajectories (e.g. BAIR Robot Dataset, BC-Z)
  + Matching modality and embodiments
  - Lacks volume and diversity (physical system, lab setup)

Simulators (e.g. Habitat, MuJoCo)
  + Side information available (e.g. joint sensors, object poses)
  - Inaccurate physics for transfer

Goal: Develop a unified learning paradigm for trajectories that are multi-modal and heterogeneous

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

254 of 318

MTM

A trajectory is a generic sequence of elements. Each element x_t (at time t) spans multiple modalities - joint sensors, vision (RGB), vision (depth), action - which are tokenized into tokens q_t^1, q_t^2, ..., q_t^K and lifted to a common embedding space with modality-specific encoders.
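
A minimal sketch of this tokenization step; the modality names, dimensions, and mask handling are illustrative assumptions rather than the MTM implementation:

import torch
import torch.nn as nn

class TrajectoryTokenizer(nn.Module):
    # Lift heterogeneous trajectory modalities into a shared token space.
    def __init__(self, modality_dims, d_model=256):
        super().__init__()
        self.encoders = nn.ModuleDict(
            {name: nn.Linear(dim, d_model) for name, dim in modality_dims.items()}
        )
        self.mask_token = nn.Parameter(torch.zeros(d_model))

    def forward(self, trajectory, mask):
        # trajectory: dict name -> (B, T, dim); mask: dict name -> (B, T) bool,
        # True where the token is masked (or where the modality is missing entirely).
        tokens = []
        for name, x in trajectory.items():
            tok = self.encoders[name](x)                       # (B, T, d_model)
            m = mask[name].unsqueeze(-1)                       # (B, T, 1)
            tok = torch.where(m, self.mask_token.expand_as(tok), tok)
            tokens.append(tok)
        return torch.cat(tokens, dim=1)   # one token sequence for the transformer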

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

255 of 318

MTM

Missing modalities ⇔ masking as a constraint: modalities absent from the data are simply treated as masked tokens

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

256 of 318

MTM

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

257 of 318

MTM

Summary

  • Random autoregressive masking helps downstream prediction tasks

  • Competitive return-conditioned behavior cloning (RCBC) on continuous control tasks
  • Can train on heteromodal datasets (mixed / missing modalities)
  • Useful representation learning

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

258 of 318

MTM

  • Setup: only a small fraction (~1%) of the dataset has full (s, R, a) trajectories; the remainder of the dataset is missing actions.
  • Baselines: can train only on the labelled subset of the data.
  • Heteromodal MTM: train on the mixed dataset, with missing modalities treated as if they were masked out.

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

259 of 318

MTM

Setup: (1) Pretrain the MTM model on an offline dataset. (2) Use the state encoder of MTM and feed it to a standard RL algorithm (TD3)

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

260 of 318

Masked World Models for Visual Control

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

261 of 318

Masked World Models for Visual Control

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

262 of 318

Masked World Models for Visual Control

Main idea: Decouple visual representation learning and dynamics learning

Visual Representation Learning

  • Training an autoencoder with convolutional feature masking
  • Reward prediction to encode task-relevant information

Dynamics learning

  • Training a recurrent state-space model (RSSM) that reconstructs frozen autoencoder representations

(From Younggyo)

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

263 of 318

Masked World Models for Visual Control

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

264 of 318

Masked World Models for Visual Control

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

265 of 318

  • MWM outperforms DreamerV2 on challenging Meta-world tasks

(From Younggyo)

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

266 of 318

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

267 of 318

Multi-View MAE

  • Multiple cameras have often been used for visual robotic manipulation

[Akkaya et al., 2019]

[Jangir et al., 2022]

(From Younggyo)

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

268 of 318

Multi-View MAE

Main Idea: Reconstruct masked viewpoints to learn cross-view information

(From Younggyo)

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

269 of 318

Multi-View MAE

MV-MAE can extract both multi-view and single-view representations

Visual robotic manipulation with multi-view or single-view data

(From Younggyo)

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

270 of 318

Multi-View MAE

  • RLBench [James et al., 2020] with front and wrist cameras
    • Widely-used camera configuration

(From Younggyo)

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

271 of 318

Multi-View MAE

  • MV-MWM outperforms both single-view and multi-view baselines

(From Younggyo)

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

272 of 318

Multi-View MAE

  • MV-MWM also outperforms baselines in the imitation learning setup

(From Younggyo)

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

273 of 318

Multi-View MAE

Motivation:

Camera calibration is a tedious procedure

  • Solution: train a viewpoint-robust policy with viewpoint randomization

Viewpoint randomization

(From Younggyo)

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

274 of 318

Multi-View MAE

  • Step 1: Multi-view representation learning with viewpoint randomization
  • Step 2: Learn a world model for viewpoint-robust control

(From Younggyo)

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

275 of 318

Multi-View MAE

  • MV-MWM learns a policy with aggressive viewpoint randomization

(From Younggyo)

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

276 of 318

Multi-View MAE

  • MV-MWM learns a policy with aggressive viewpoint randomization

(From Younggyo)

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

277 of 318

Multi-View MAE

Camera motions evaluated: Rotation, Shake, Translation, Zoom

  • Zero-Shot Sim2Real Transfer with Hand-held Cameras
    • Without proprioceptive states, depth, and adaptation

(From Younggyo)

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

278 of 318

Outline

  • Reconstruct from a corrupted (or partial) version
    • Denoising AutoEncoder / Diffusion
    • In-painting / Masked AutoEncoder: MAE, VideoMAE, Audio-MAE, BeIT, M3AE, MultiMAE, SiamMAE
    • Colorization, Split-Brain AutoEncoder
  • Visual common sense tasks
    • Relative patch prediction
    • Jigsaw puzzles
    • Rotation
  • Contrastive Learning
    • Contrastive Predictive Coding (CPC)
    • Instance Discrimination: SimCLR, MoCo-v1,2,3, BYOL
  • Feature Prediction: DINO/DINOv2/iBOT, JEPA, I-JEPA, V-JEPA
  • Text-Image: CLIP, LiT, SigLIP, FLIP, SLIP, CoCa, BLIP/BLIP-2, ImageBind
  • RL and Control: R3M, CURL, MVP, MTM, Multi-View MAE and Masked World Models for Visual Control
  • Language
    • Word2vec and Glove
    • BERT, RoBERTa, T5, UL2

278

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

279 of 318

Predicting neighbouring context

279

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

280 of 318

Word Embeddings

280

(From 224n Stanford)

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

281 of 318

Word Embeddings

281

(From 224n Stanford)

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

282 of 318

Word Embeddings

282

(From 224n Stanford)

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

283 of 318

Word Embeddings

The SVD approach suffers from several issues (a toy pipeline is sketched after this list):

  • Sparsity
  • SVD computation costs
  • Infrequent words
  • Noise from frequent words
  • There are hacks to fix some of these (e.g. TF-IDF weighting), but they remain unreliable
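
For concreteness, a toy count-and-SVD pipeline (purely illustrative; real pipelines would use sparse matrices and truncated SVD):

import numpy as np

def svd_word_vectors(corpus, window=2, dim=50):
    # Build a word-word co-occurrence matrix, then take a (truncated) SVD as embeddings.
    vocab = sorted({w for sent in corpus for w in sent})
    idx = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for sent in corpus:
        for i, w in enumerate(sent):
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if j != i:
                    counts[idx[w], idx[sent[j]]] += 1
    U, S, _ = np.linalg.svd(counts)
    return vocab, U[:, :dim] * S[:dim]

vocab, vecs = svd_word_vectors([["the", "cat", "sat"], ["the", "dog", "sat"]])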

283

(From 224n Stanford)

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

284 of 318

n-gram Language Models

284

(From 224n Stanford)

Unigram

Bigram

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

285 of 318

word2vec

285

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

286 of 318

word2vec

286

(From 224n Stanford)

Continuous Bag Of Words (CBOW)

Skip Gram

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

287 of 318

word2vec - CBOW

287

(From 224n Stanford)

Continuous Bag Of Words (CBOW)

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

288 of 318

word2vec - Skip Gram

288

(From 224n Stanford)

Skip Gram

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

289 of 318

word2vec - Skip Gram

289

Skip-gram model

We don’t need the normalizing denominator over all words in the vocabulary

  • Can use negative sampling instead (a toy loss is sketched below)
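
A minimal sketch of the skip-gram-with-negative-sampling loss; the embedding tables and index layout are assumptions for illustration:

import torch
import torch.nn.functional as F

def sgns_loss(center_vecs, context_vecs, center, pos_ctx, neg_ctx):
    # center_vecs, context_vecs: embedding tables of shape (V, D)
    # center: (B,) indices of center words; pos_ctx: (B,) observed context words
    # neg_ctx: (B, K) indices of sampled negative words
    v = center_vecs[center]                               # (B, D)
    u_pos = context_vecs[pos_ctx]                         # (B, D)
    u_neg = context_vecs[neg_ctx]                         # (B, K, D)
    pos_score = (v * u_pos).sum(-1)                       # (B,)
    neg_score = torch.einsum('bd,bkd->bk', v, u_neg)      # (B, K)
    # Maximize log sigma(v·u_pos) + sum_k log sigma(-v·u_neg_k)
    return -(F.logsigmoid(pos_score) + F.logsigmoid(-neg_score).sum(-1)).mean()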

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

290 of 318

word2vec

290

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

291 of 318

word2vec

291

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

292 of 318

GloVe

Consider counting-based statistical approaches

Word co-occurrence counts X_ij, where X_ij is the number of times word j occurs in the context of word i

Ratios of co-occurrence probabilities can encode meaning

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

293 of 318

GloVe

Train word vectors so that their dot product approximates the (log) likelihood of their co-occurrence
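
In standard GloVe notation, this is a weighted least-squares fit of the dot products to the log co-occurrence counts:

J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2,
\qquad
f(x) = \begin{cases} (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\ 1 & \text{otherwise} \end{cases}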

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

294 of 318

GloVe

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

295 of 318

BERT

Oct 2018

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

296 of 318

  • Task 1 - Masked Language Model (a toy masking routine is sketched after this list):
    • 15% mask ratio
    • 80% of the time the selected token is replaced with [MASK]
    • 10% of the time it is replaced with a random token
    • 10% of the time it is left unchanged
    • Loss only on masked tokens
  • Task 2 - Next Sentence Prediction
    • 50/50: the second sentence directly follows vs. a random sentence from the dataset
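
A toy version of the masking routine (whole-word masking, special tokens, and padding are omitted for brevity):

import random

def bert_mask(tokens, vocab, mask_prob=0.15):
    # Select ~15% of tokens; replace 80% with [MASK], 10% with a random token, keep 10%.
    inputs, targets = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            targets.append(tok)                      # loss is computed here only
            r = random.random()
            if r < 0.8:
                inputs.append("[MASK]")              # 80%: replace with [MASK]
            elif r < 0.9:
                inputs.append(random.choice(vocab))  # 10%: random token
            else:
                inputs.append(tok)                   # 10%: keep unchanged
        else:
            inputs.append(tok)
            targets.append(None)                     # ignored by the loss
    return inputs, targets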

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

297 of 318

BERT

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

298 of 318

BERT

Pre-training data:

  • BookCorpus (800M words)
  • English Wikipedia (2500M words)

Fine-Tuning

  • For each task, task-specific inputs and outputs are given to BERT
  • Pretraining enables
    • Sentence A/B type tasks, i.e. sentence pairs in paraphrasing, hypothesis-premise pairs, question-passage pairs
    • Token-level tasks by looking at the per-token features
    • [CLS] for whole-sentence-level tasks
  • Evaluated on GLUE and other benchmarks - 11 NLP tasks in total

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

299 of 318

BERT

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

300 of 318

BERT

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

301 of 318

BERT

Feature-based

  • Extract frozen features
  • Learn a classifier for the Named Entity Recognition task

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

302 of 318

RoBERTa

Jul 2019

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

303 of 318

RoBERTa

Greatly simplifies and improves the BERT training recipe

  • Dynamic masking
    • Original BERT performed masking once at the data-preprocessing step; RoBERTa re-samples the mask each time a sequence is seen
  • Next-sentence-prediction loss not needed

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

304 of 318

RoBERTa

  • Training with large batches
  • Text encoding with BPE (50K vocab size)

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

305 of 318

T5

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

306 of 318

T5

Cast all tasks as language input, language output (text-to-text)

Explore different architectures and pre-training tasks

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

307 of 318

T5

Finetuning

Fine-tune with the same input-output format. Add a task-specific text prefix to the input, e.g.

Input: translate English to German: That is good.

Output: Das ist gut.

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

308 of 318

T5

Denoising Objective

Sentinel tokens delineate the removed spans

(unique IDs that are added to the token vocabulary)
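
A toy sketch of span corruption with sentinels; the <extra_id_i> names follow the common T5 convention, and the example spans are chosen just for illustration:

def span_corrupt(tokens, spans):
    # spans: list of non-overlapping, sorted (start, end) index pairs to drop.
    # Returns (input_tokens, target_tokens) delimited by sentinel tokens.
    inp, tgt, prev = [], [], 0
    for i, (s, e) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        inp += tokens[prev:s] + [sentinel]
        tgt += [sentinel] + tokens[s:e]
        prev = e
    inp += tokens[prev:]
    tgt += [f"<extra_id_{len(spans)}>"]   # final sentinel closes the target
    return inp, tgt

# span_corrupt("Thank you for inviting me to your party last week .".split(), [(2, 4), (8, 9)])
# input : Thank you <extra_id_0> me to your party <extra_id_1> week .
# target: <extra_id_0> for inviting <extra_id_1> last <extra_id_2>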

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

309 of 318

T5

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

310 of 318

T5

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

311 of 318

T5

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

312 of 318

T5

T5’s effect in Imagen, which uses T5’s frozen text encoder

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

313 of 318

T5

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

314 of 318

UL2

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

315 of 318

UL2

Does training on a mixture of different pre-training tasks help?

Formulate 3 types of pretraining denoisers (a toy mixture config is sketched after this list)

  • Extreme denoising (X-denoiser: long spans and/or high corruption rates)
  • Low corruption (R-denoiser: regular span corruption)
  • Sequential denoising (S-denoiser: prefix language modeling)
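
A hypothetical sketch of such a mixture-of-denoisers configuration; the span lengths and corruption rates are illustrative placeholders, not the paper's exact settings:

import random

DENOISERS = [
    {"name": "R", "kind": "span_corruption", "mean_span": 3,  "corrupt_rate": 0.15},  # low corruption
    {"name": "X", "kind": "span_corruption", "mean_span": 32, "corrupt_rate": 0.50},  # extreme denoising
    {"name": "S", "kind": "prefix_lm"},  # sequential: predict the suffix given a prefix
]

def sample_denoiser():
    # Each training example is corrupted by one randomly chosen denoiser,
    # so a single model sees all three objectives during pretraining.
    return random.choice(DENOISERS)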

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

316 of 318

UL2

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

317 of 318

UL2

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning

318 of 318

Outline

  • Reconstruct from a corrupted (or partial) version
    • Denoising AutoEncoder / Diffusion
    • In-painting / Masked AutoEncoder: MAE, VideoMAE, Audio-MAE, BeIT, M3AE, MultiMAE, SiamMAE
    • Colorization, Split-Brain AutoEncoder
  • Visual common sense tasks
    • Relative patch prediction
    • Jigsaw puzzles
    • Rotation
  • Contrastive Learning
    • Contrastive Predictive Coding (CPC)
    • Instance Discrimination: SimCLR, MoCo-v1,2,3, BYOL
  • Feature Prediction: DINO/DINOv2/iBOT, JEPA, I-JEPA, V-JEPA
  • Text-Image: CLIP, LiT, SigLIP, FLIP, SLIP, CoCa, BLIP/BLIP-2, ImageBind
  • RL and Control: R3M, CURL, MVP, MTM, Multi-View MAE and Masked World Models for Visual Control
  • Language
    • Word2vec and Glove
    • BERT, RoBERTa, T5, UL2

318

UC Berkeley -- Spring 2024 -- Deep Unsupervised Learning -- Pieter Abbeel, Kevin Frans, Philipp Wu, Wilson Yan -- L7 Self-Supervised Learning