Unsupervised Learning
Tyler, Arjun (4/20/22)
Outline
Introduction
What about the number of samples?
Bits per sample to learn from: Supervised, Unsupervised & RL…
Some Cool Examples
Self-Supervised Learning
Self-Supervised Learning
Autoregressive Models
P(x1, x2, x3, …, xN) = P(x1) * P(x2 | x1) * … * P(xN | xN−1, …, x1)
Use the same network to model each conditional likelihood
Predict value given previous values
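A minimal sketch of this factorization as a training loss (in PyTorch, with `model` standing in for any network that maps a prefix to next-value logits — a hypothetical stand-in, not a specific architecture): maximizing the factorized likelihood is just summing one cross-entropy term per position.

```python
import torch
import torch.nn.functional as F

def autoregressive_nll(model, tokens):
    """Negative log-likelihood of a sequence under the chain-rule factorization.

    tokens: LongTensor of shape (seq_len,).
    model(prefix) is assumed to return logits of shape (vocab_size,) for the
    next token given the prefix -- a stand-in for any autoregressive network.
    """
    nll = 0.0
    for t in range(1, len(tokens)):
        logits = model(tokens[:t])  # unnormalized P(x_t | x_1..x_{t-1})
        nll = nll + F.cross_entropy(logits.unsqueeze(0), tokens[t].unsqueeze(0))
    return nll  # minimizing this maximizes P(x1) * P(x2|x1) * ... * P(xN|x<N)
```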
Language Models (e.g. GPT)
[Figure: the prefix "ML is cool" fed into a Transformer language model — after "ML" it predicts {is: 0.78, for: 0.14, …}, and after "ML is" it predicts {great: 0.25, cool: 0.21, …}]
P(word | previous words)
Maximize the probability of the training data!
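Generation then just samples from P(word | previous words) one step at a time. A minimal sketch, again assuming a hypothetical `model` that returns next-token logits:

```python
import torch

@torch.no_grad()
def generate(model, prefix_ids, num_new_tokens=10):
    """Autoregressive sampling: append one token at a time from P(next | previous)."""
    tokens = list(prefix_ids)
    for _ in range(num_new_tokens):
        logits = model(torch.tensor(tokens))    # next-token logits given the prefix
        probs = torch.softmax(logits, dim=-1)   # P(word | previous words)
        next_id = torch.multinomial(probs, 1).item()
        tokens.append(next_id)
    return tokens
```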
Language Models (e.g. GPT)
[Figure: same setup with the prefix "ML for fun" — after "ML" the model predicts {is: 0.78, for: 0.14, …}, and after "ML for" it predicts {fun: 0.47, memes: 0.36, …}]
Frame Prediction
Granularity
PixelCNN (Autoregressive Pixel Model)
Treat the pixels as a sequence!
We could just throw a sequence model at it, but that's really inefficient…
PixelCNN
[Figure: a masked convolution slides over a grid of pixel values; each pixel is predicted only from already-generated pixels above it and to its left]
NOTE: After the first layer center square need not be masked!
PixelCNN
[Figure: stacking masked convolutions — the receptive field grows from the first layer to the second layer, covering more of the previously generated pixels]
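A common way to implement this masking (a sketch, not the original PixelCNN code) is to zero out the "future" half of each convolution kernel. Mask type "A" also hides the center pixel for the first layer, while type "B" keeps it for later layers:

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """PixelCNN-style masked convolution (illustrative sketch).

    Mask 'A' (first layer) hides the center pixel and everything after it in
    raster order; mask 'B' (later layers) may include the center pixel.
    """
    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        assert mask_type in ("A", "B")
        _, _, kh, kw = self.weight.shape
        mask = torch.ones(kh, kw)
        mask[kh // 2, kw // 2 + (mask_type == "B"):] = 0  # right of center (and center for 'A')
        mask[kh // 2 + 1:, :] = 0                         # all rows below center
        self.register_buffer("mask", mask[None, None])

    def forward(self, x):
        self.weight.data *= self.mask                     # zero out "future" pixels
        return super().forward(x)

# First layer masks the center; later layers need not (as noted above).
layer1 = MaskedConv2d("A", in_channels=1, out_channels=16, kernel_size=3, padding=1)
layer2 = MaskedConv2d("B", in_channels=16, out_channels=16, kernel_size=3, padding=1)
out = layer2(torch.relu(layer1(torch.randn(1, 1, 8, 8))))
```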
Completion Models
Autoregressive models with a different generation order
Predict value given previous values and subsequent values
GPT ⇒ BERT
I ate a [MASK] yesterday.
Should I wear a [MASK]?
Predict masked values
Predict a binary next-sentence label
Excellent pre-training for downstream tasks!
When would you want to finetune GPT vs BERT?
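A minimal sketch of the masked-prediction setup (hypothetical helper; real BERT pipelines also randomly keep or swap some of the selected tokens):

```python
import torch

def mask_tokens(token_ids, mask_token_id, mask_prob=0.15):
    """BERT-style masked language modeling targets (simplified sketch).

    Randomly replaces ~15% of tokens with [MASK]; the labels keep the original
    IDs at masked positions and -100 (ignored by cross-entropy) elsewhere.
    """
    token_ids = token_ids.clone()
    masked = torch.rand(token_ids.shape) < mask_prob
    labels = torch.where(masked, token_ids, torch.full_like(token_ids, -100))
    token_ids[masked] = mask_token_id
    return token_ids, labels

inputs, labels = mask_tokens(torch.randint(0, 30000, (1, 12)), mask_token_id=103)
# A model would then predict the original IDs at the masked positions:
# loss = F.cross_entropy(model(inputs).view(-1, vocab_size), labels.view(-1), ignore_index=-100)
```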
Frame Prediction ⇒ Super SloMo
PixelCNN ⇒ SuperRes
PixelCNN ⇒ Inpainting
More SSL for Vision?
Self-supervised learning == good for language. What about vision?
PixelCNN!
Seeing 20/20
Even better: use patches instead of pixels and transformers as our seq2seq architecture
Vision Transformer (ViT)
Attention unleashed (BERT in CV)
Masked Autoencoders
Pipeline: random masking → encode visible patches → add mask tokens → reconstruct
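A minimal sketch of the random-masking step, with `encoder` and `decoder` left as stand-ins for the ViT blocks:

```python
import torch

def random_masking(patches, mask_ratio=0.75):
    """MAE-style random masking (sketch): keep a random 25% of patch tokens.

    patches: (batch, num_patches, dim). Returns the visible patches and the
    indices needed to scatter mask tokens back in before the decoder.
    """
    b, n, d = patches.shape
    n_keep = int(n * (1 - mask_ratio))
    shuffle = torch.rand(b, n).argsort(dim=1)       # random permutation per example
    keep_idx = shuffle[:, :n_keep]                  # indices of visible patches
    visible = torch.gather(patches, 1, keep_idx[:, :, None].expand(-1, -1, d))
    return visible, keep_idx

# Pipeline sketch: encode only the visible patches, then add mask tokens and reconstruct.
patches = torch.randn(2, 196, 768)                  # e.g. 14x14 patches from a ViT
visible, keep_idx = random_masking(patches)
# latent = encoder(visible); full = add_mask_tokens(latent, keep_idx); recon = decoder(full)
```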
Example Reconstructions
Representation Learning
Why do representations matter?
210 / 6 = ?
CCX / VI = ?
Goodfellow’s thoughts:
“In the context of machine learning, what makes one representation better than another? Generally speaking, a good representation is one that makes a subsequent learning task easier. The choice of representation will usually depend on the choice of the subsequent learning task” - Chapter 15, Representation Learning, of deeplearningbook.org
Transfer Learning
Goal: Pre-train a backbone without labels!
How?
Recall…
Devise our own task based on the data: self-supervision!
Pretext Tasks
Tasks created for the purpose of obtaining good learned representations
A Simple Task: Rotations
input image
A Simple Task: Rotations
rotate it four different ways and keep the type of rotation as the label
A Simple Task: Rotations
train model to predict the rotation
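A minimal sketch of building this rotation dataset with `torch.rot90` (the label is just the number of quarter-turns):

```python
import torch

def make_rotation_batch(images):
    """Rotation pretext task (sketch): rotate each image 0/90/180/270 degrees.

    images: (batch, channels, height, width). Returns the rotated images and
    the rotation index in {0, 1, 2, 3} as the self-supervised label.
    """
    rotated, labels = [], []
    for k in range(4):                                   # k quarter-turns
        rotated.append(torch.rot90(images, k, dims=(2, 3)))
        labels.append(torch.full((images.shape[0],), k, dtype=torch.long))
    return torch.cat(rotated), torch.cat(labels)

x, y = make_rotation_batch(torch.randn(8, 3, 32, 32))
# Train any classifier to predict y from x:
# loss = F.cross_entropy(model(x), y)
```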
Slightly Harder: Jigsaw Puzzle
Learn to solve a jigsaw puzzle of the input!
Why do we jitter the boxes?
What makes a label?
Slightly Harder: Jigsaw Puzzle
Similar to rotations: learn to predict an index in [0, …, 9!−1] identifying which permutation was applied
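A minimal sketch of building a jigsaw example; for illustration only a small, arbitrary subset of the 9! permutations is indexed (in practice a fixed subset of permutations is chosen ahead of time):

```python
import torch
from itertools import islice, permutations

# Illustrative subset: the first 100 permutations of the 9 patches.
PERMS = list(islice(permutations(range(9)), 100))

def make_jigsaw_example(image, patch=32):
    """Jigsaw pretext task (sketch): shuffle a 3x3 grid of patches.

    image: (channels, 3*patch, 3*patch). Returns the shuffled patches and the
    index of the permutation used, which is the label to predict.
    """
    c = image.shape[0]
    # Cut the image into 9 patches in row-major order.
    patches = image.unfold(1, patch, patch).unfold(2, patch, patch)    # (c, 3, 3, patch, patch)
    patches = patches.reshape(c, 9, patch, patch).permute(1, 0, 2, 3)  # (9, c, patch, patch)
    label = torch.randint(len(PERMS), (1,)).item()
    shuffled = patches[list(PERMS[label])]
    return shuffled, label

tiles, label = make_jigsaw_example(torch.randn(3, 96, 96))
```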
Contrastive Learning
Key Idea: Encourage representations for +’s to attract and -’s to repel
https://ml.berkeley.edu/blog/posts/contrastive_learning/ (img source: Tony Zhao)
Timeline of Contrastive Learning
NPID, 2018
MoCo, 2019
SimCLR, 2020
BYOL, 2020
CPC, 2018
CLIP, 2021
…
The Loss Function
https://ml.berkeley.edu/blog/posts/contrastive_learning/ (img source: Tony Zhao)
Metric learning!
SimCLR
A Simple Framework for Contrastive Learning of Visual Representations
We have our loss term:
How do we create (+) and (-) samples?
Data Augmentation!
SimCLR
A Simple Framework for Contrastive Learning of Visual Representations
https://ml.berkeley.edu/blog/posts/contrastive_learning/ (img source: Tony Zhao)
SimCLR
A Simple Framework for Contrastive Learning of Visual Representations
calculating the loss
Pain point: Need large batch size/many negative examples! (fixed by MoCo)
https://ml.berkeley.edu/blog/posts/contrastive_learning/ (img source: Tony Zhao)
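A minimal sketch of the loss SimCLR optimizes (NT-Xent), assuming `z1` and `z2` are the projected embeddings of two augmented views of the same batch:

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    """SimCLR's NT-Xent loss (sketch): the two views of each image are positives,
    every other embedding in the batch acts as a negative."""
    n = z1.shape[0]
    z = F.normalize(torch.cat([z1, z2]), dim=1)   # (2n, d), unit-norm
    sim = z @ z.t() / temperature                 # cosine similarities
    sim.fill_diagonal_(float("-inf"))             # never contrast an example with itself
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])  # i's positive is i+n, and vice versa
    return F.cross_entropy(sim, targets)

loss = nt_xent(torch.randn(256, 128), torch.randn(256, 128))
```

Note how the batch size directly controls the number of negatives — exactly the pain point mentioned above.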
Momentum Contrast (MoCo)
SimCLR pain point: Need large batch size/many negative examples!
Idea: Keep a memory bank of negative examples to draw from (NPID too)
Momentum Contrast (MoCo)
Rephrase contrastive learning: q = query example, k = corresponding key.
→ Standard CL: want to maximize <q, k> if k is (+), minimize if k is (-)
→ Difference is how we find examples of k
Momentum Contrast (MoCo)
MoCo uses a dynamic dictionary lookup
dict = a large FIFO queue of encoded representations
(since newer is better!)
Momentum Contrast (MoCo)
Gradients can’t flow back to keys stored in the queue, so the key encoder’s parameters θk are instead updated as a momentum (exponential moving average) of the query encoder’s parameters θq.
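A minimal sketch of the momentum update and the FIFO queue (hypothetical helper names, not MoCo's actual code):

```python
import torch

@torch.no_grad()
def momentum_update(encoder_q, encoder_k, m=0.999):
    """MoCo-style momentum update (sketch): the key encoder trails the query
    encoder as an exponential moving average instead of receiving gradients."""
    for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1 - m)   # theta_k <- m*theta_k + (1-m)*theta_q

@torch.no_grad()
def update_queue(queue, new_keys, max_size=65536):
    """FIFO dictionary of encoded keys: enqueue the newest batch, drop the oldest."""
    queue = torch.cat([new_keys, queue], dim=0)
    return queue[:max_size]
```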
CLIP
Use natural language + images as supervision w/ contrastive pre-training
CLIP
Pre-train a model to learn good text and image representations
Key idea: Using a contrastive loss w/ (image, text) pairs to have text inform image info
Contrastive loss over the dense cosine-similarity matrix between all (image, text) pairs in the batch
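A minimal sketch of this objective, assuming `image_emb` and `text_emb` are the batch outputs of the image and text encoders:

```python
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, temperature=0.07):
    """CLIP-style contrastive loss (sketch): cosine similarity between every
    (image, text) pair in the batch; matching pairs sit on the diagonal."""
    image_emb = F.normalize(image_emb, dim=1)
    text_emb = F.normalize(text_emb, dim=1)
    logits = image_emb @ text_emb.t() / temperature   # (batch, batch) similarity matrix
    targets = torch.arange(len(logits))
    # Symmetric cross-entropy: pick the right text for each image and vice versa.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
```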
CLIP
Interesting takeaways:
Predicting a bag of words >> predicting exact text with a transformer LM
Contrastive objective >> Exact caption prediction
CLIP
Use pre-trained encoders to create embeddings
For each image, create candidate captions and pick the best one to do zero-shot prediction
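A minimal sketch of this zero-shot recipe, with `text_encoder` standing in for the pre-trained text tower and the caption template ("a photo of a …") only an illustrative choice:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(image_emb, class_names, text_encoder):
    """Zero-shot prediction (sketch): embed one caption per class and pick the
    caption whose embedding is closest to the image embedding."""
    captions = [f"a photo of a {name}" for name in class_names]
    text_emb = F.normalize(text_encoder(captions), dim=1)   # (num_classes, d)
    image_emb = F.normalize(image_emb, dim=1)               # (batch, d)
    return (image_emb @ text_emb.t()).argmax(dim=1)         # best caption per image
```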
CLIP
Masked Autoencoders
Note the encoder/decoder asymmetry: no mask tokens go into the encoder
MAEs outperform… everything
BERT-like scalability
Original supervised training overfits
Supervised training w/ strong regularization saturates!
MAE pre-training generalizes much better (4%!)
Follows a similar trend to JFT-300M (300x more data)
Deep Feature Clustering
DeepCluster
Autoencoders
[Figure: input x → encoder → latent code z → decoder → reconstruction x’]
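A minimal sketch of the x → z → x’ pipeline with linear layers (the sizes are arbitrary):

```python
import torch
import torch.nn as nn

# A minimal autoencoder: x -> encoder -> z -> decoder -> x'
encoder = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 32))
decoder = nn.Sequential(nn.Linear(32, 256), nn.ReLU(), nn.Linear(256, 784))

x = torch.randn(16, 784)        # e.g. flattened 28x28 images
z = encoder(x)                  # low-dimensional latent code
x_recon = decoder(z)            # reconstruction x'
loss = nn.functional.mse_loss(x_recon, x)
```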
Generative Modeling
Decoder from an autoencoder?
Latent Space
Variational Autoencoder
Reconstruction Loss
“Regularization” Term
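A minimal sketch of these two terms plus the reparameterization trick, assuming the encoder outputs the `mu` and `log_var` of the approximate posterior:

```python
import torch

def vae_loss(x, x_recon, mu, log_var):
    """VAE objective (sketch): reconstruction loss plus the KL "regularization"
    term pulling q(z|x) = N(mu, sigma^2) toward the unit Gaussian prior."""
    recon = torch.nn.functional.mse_loss(x_recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + kl

def reparameterize(mu, log_var):
    """Sample z = mu + sigma * eps so gradients flow through mu and log_var."""
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * log_var) * eps
```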
VAE Latent Math!
[Figure: example images combined in latent space as z1 + z2 − z3, then decoded]
Arithmetic in latent space then decode!
Conditional VAE
[Figure: the decoder takes both the abstract latent z and a class label c (e.g. c=dog or c=cat), so samples can be conditioned on the class]
GANs
VAEs -> GANs
[Figure: noise z → Generator (same architecture as a VAE decoder!) → fake images; real images from the dataset and generated fakes are fed to a Discriminator that predicts real or fake — but what is the loss?]
GAN Training Objectives
Generator: Generate images that fool discriminator.
Discriminator: Classify real vs generated/fake images.
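A minimal sketch of these two objectives in the standard (non-saturating) binary cross-entropy form, where the `d_*_logits` are discriminator outputs:

```python
import torch
import torch.nn.functional as F

def discriminator_loss(d_real_logits, d_fake_logits):
    """Discriminator: classify real images as 1 and generated images as 0."""
    real = F.binary_cross_entropy_with_logits(d_real_logits, torch.ones_like(d_real_logits))
    fake = F.binary_cross_entropy_with_logits(d_fake_logits, torch.zeros_like(d_fake_logits))
    return real + fake

def generator_loss(d_fake_logits):
    """Generator: fool the discriminator into labeling fakes as real."""
    return F.binary_cross_entropy_with_logits(d_fake_logits, torch.ones_like(d_fake_logits))
```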
GANs
Training
Cycle-GAN
Example of a self-supervised solution to a task (image style transfer)!
Cycle-GAN
Goal: Transfer an image in style X to an image in style Y
Problem: We don’t have pairs of images in (X, Y) (they might not exist!)
Ideas? Do something like back translation, but for images!
Cycle Consistency Loss
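A minimal sketch of that idea, with `G_xy` and `G_yx` standing in for the two generators (the full CycleGAN objective also adds adversarial losses for each domain):

```python
import torch
import torch.nn.functional as F

def cycle_consistency_loss(x, y, G_xy, G_yx):
    """Cycle consistency (sketch): translating X->Y->X (and Y->X->Y) should
    recover the original image, since no paired (X, Y) data exists."""
    forward_cycle = F.l1_loss(G_yx(G_xy(x)), x)    # x -> fake y -> reconstructed x
    backward_cycle = F.l1_loss(G_xy(G_yx(y)), y)   # y -> fake x -> reconstructed y
    return forward_cycle + backward_cycle
```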
ML Art! CLIP+VQ-VAE
https://ml.berkeley.edu/blog/posts/clip-art/ (img source: Charlie Snell)
“a cityscape at night”
“an abstract painting of a planet ruled by little castles”
“a studio ghibli landscape”
Resources
Self-Supervised Learning
Contrastive Learning
Non-contrastive Learning
GANs