1 of 116

Unsupervised Learning

Tyler, Arjun (4/20/22)

2 of 116

Outline

  • Introduction
  • Self Supervised Learning
    • Autoregressive & Completion Models
    • Recent CV applications
  • Representation Learning
    • (Non-) Contrastive Learning
    • Clustering
  • GANs + Generative Modeling
    • (Variational) Autoencoders
    • Vanilla & CycleGAN
  • Conclusion

3 of 116

Introduction

What about the number of samples?

How many bits per sample do we get to learn from? Compare supervised learning (a label), unsupervised learning (the data itself), and RL (a scalar reward)…

4 of 116

Some Cool Examples

5 of 116

Self-Supervised Learning

6 of 116

Self Supervised Learning

  • Classic Supervised Learning: Data -> Label
    • Data ∈ {features, images, text, videos, etc.}
    • Label ∈ {classes, positions, etc.}
  • Self Supervised Learning: Data -> _?
    • Some part of Data -> other part of Data!
  • …but why?
    • data generation
    • likelihood estimation or anomaly detection
    • feature learning
    • pre-training

7 of 116

Autoregressive Models

8 of 116

Autoregressive Models

  • Data points in the form of {x1, x2, x3, …, xN}
    • Supervised Sequence Learning -> y
  • Train a model that maximizes the likelihood of the data!
  • How to decompose P(x1, x2, x3, …, xN)?

P(x1, x2, x3, …, xN) = P(x1) * P(x2|x1) * … * P(xN|xN-1, …, x1)

Use the same network to model each conditional likelihood

Predict value given previous values
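A minimal sketch of what "maximize the likelihood" looks like in code (any causally-masked sequence model would do; `model` is a hypothetical module returning per-position vocabulary logits): the chain-rule factorization above becomes a sum of per-step log-likelihoods.

```python
import torch
import torch.nn.functional as F

def autoregressive_nll(model, x):
    """Negative log-likelihood of sequences x under an autoregressive model.

    x: LongTensor of token ids, shape (batch, N).
    model(inputs) is assumed to return logits of shape (batch, N-1, vocab),
    where position t only sees inputs[:, :t+1] (causal masking).
    """
    inputs, targets = x[:, :-1], x[:, 1:]              # predict x_t from x_<t
    logits = model(inputs)
    logp = F.log_softmax(logits, dim=-1)
    # sum_t log P(x_t | x_<t)  — the chain-rule decomposition from the slide
    token_logp = logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return -token_logp.sum(dim=1).mean()               # average NLL per sequence
```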

9 of 116

Language Models (e.g. GPT)

[Figure: inputs "ML", "is" go into a Language Model (e.g. Transformer); predicted next-word distributions: {is 0.78, for 0.14, … 0.08} and {great 0.25, cool 0.21, … 0.54}; next word: "cool"]

P(word | previous words)

Maximize P of the training data!

10 of 116

Language Models (e.g. GPT)

[Figure: inputs "ML", "for" go into the Language Model; predicted next-word distributions: {is 0.78, for 0.14, … 0.08} and {fun 0.47, memes 0.36, … 0.17}; next word: "fun"]

11 of 116

12 of 116

13 of 116

Frame Prediction

  • Document ⇒ Video
  • Words ⇒ Frames
  • How to input words?
    • (Sub)word vectors learned and put in a lookup table
  • How to input frames?
    • Image CNN embeddings
  • How to output words?
    • Cosine sim. of <output vector, (sub)word vectors> (sketched after this list)
  • How to output frames?
    • We’ll talk about some architectures here later!
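A small sketch of the word input/output pieces described above (lookup-table embeddings in, cosine similarity against the (sub)word vectors out); sizes and names are illustrative, not from the slides.

```python
import torch
import torch.nn.functional as F

vocab_size, dim = 10_000, 256
word_table = torch.nn.Embedding(vocab_size, dim)     # learned (sub)word lookup table

def embed_words(token_ids):                          # "how to input words"
    return word_table(token_ids)                     # (batch, seq, dim)

def predict_word(output_vector):                     # "how to output words"
    # cosine similarity between the model's output vector and every word vector
    sims = F.cosine_similarity(
        output_vector.unsqueeze(1),                  # (batch, 1, dim)
        word_table.weight.unsqueeze(0),              # (1, vocab, dim)
        dim=-1,
    )
    return sims.argmax(dim=-1)                       # most similar word id per example
```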

14 of 116

Frame Prediction

  • For self driving
    • Environment prediction
  • Weather Forecasting
    • Shoutout ClimateHack!
  • Youtube videos?
    • Mainly as an auxiliary or pre-training task

15 of 116

Granularity

  • At what level do we split up the xi’s?
  • Text
    • Characters
    • Subwords
    • Words
    • Phrases
  • Video
    • Pixels
    • Patches
    • Images
    • Snippets

16 of 116

PixelCNN (Autoregressive Pixel Model)

Treat the pixels as a sequence!

We could just throw a sequence model at it, but that would be really inefficient…

17 of 116

PixelCNN

[Figure: a CNN with a masked convolution filter sliding over the pixel grid, predicting each pixel from the already-generated ones]

18 of 116

PixelCNN


19 of 116

PixelCNN


NOTE: After the first layer, the center square need not be masked!
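A minimal sketch of such a masked convolution in PyTorch: mask type "A" for the first layer (center masked) and type "B" afterwards (center visible), matching the note above.

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """Conv2d that only sees pixels above / to the left of the current one.

    mask_type 'A' (first layer): the center pixel itself is masked out.
    mask_type 'B' (later layers): the center pixel is allowed.
    """
    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        assert mask_type in ("A", "B")
        mask = torch.ones_like(self.weight)                       # (out, in, kH, kW)
        kH, kW = self.kernel_size
        mask[:, :, kH // 2, kW // 2 + (mask_type == "B"):] = 0    # center row, right of center
        mask[:, :, kH // 2 + 1:, :] = 0                           # all rows below center
        self.register_buffer("mask", mask)

    def forward(self, x):
        self.weight.data *= self.mask                             # zero out "future" weights
        return super().forward(x)

# first layer masks the center pixel; later layers need not (as noted above)
layer1 = MaskedConv2d("A", in_channels=1, out_channels=32, kernel_size=7, padding=3)
layer2 = MaskedConv2d("B", in_channels=32, out_channels=32, kernel_size=3, padding=1)
```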

20 of 116

PixelCNN

[Figure: composing masked convolutions — first-layer output conv second-layer filter = a larger receptive field]

21 of 116

Completion Models

Autoregressive models with a different generation order

Predict value given previous values and subsequent values

22 of 116

GPT ⇒ BERT

I ate a [MASK] yesterday.

Should I wear a [MASK]?

Predict the masked values

Predict (binary) whether the second sentence follows the first

Excellent pre-training for downstream tasks!

When would you want to fine-tune GPT vs. BERT?
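A minimal sketch of the masked-value objective (just the corruption step; real BERT also sometimes keeps or randomly replaces the chosen tokens, which is omitted here).

```python
import torch

def mask_tokens(token_ids, mask_id, mask_prob=0.15):
    """Randomly replace ~15% of tokens with [MASK]; the originals become labels.

    token_ids: LongTensor (batch, seq). Labels are -100 (ignored) except at
    masked positions, matching the usual cross-entropy `ignore_index` convention.
    """
    labels = token_ids.clone()
    masked = torch.rand(token_ids.shape) < mask_prob
    labels[~masked] = -100                # only score the masked positions
    corrupted = token_ids.clone()
    corrupted[masked] = mask_id           # e.g. "I ate a [MASK] yesterday."
    return corrupted, labels
```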

23 of 116

Frame Prediction ⇒ Super SloMo

24 of 116

PixelCNN ⇒ SuperRes

25 of 116

PixelCNN ⇒ Inpainting

26 of 116

More SSL for Vision?

Self-supervised learning works great for language. What about vision?

PixelCNN!

27 of 116

Seeing 20/20

Even better: use patches instead of pixels and transformers as our seq2seq architecture

28 of 116

Vision Transformer (ViT)

29 of 116

Attention unleashed (BERT in CV)

random masking

30 of 116

Masked Autoencoders

random masking

31 of 116

Masked Autoencoders

encode visible patches

32 of 116

Masked Autoencoders

add mask tokens

33 of 116

Masked Autoencoders

reconstruct
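A simplified sketch of that pipeline (hypothetical `encoder`/`decoder` modules, a learnable `mask_token` of shape (1, 1, D), and equal encoder/decoder widths; the real MAE also unshuffles tokens to their original positions and adds positional embeddings).

```python
import torch

def mae_forward(patches, encoder, decoder, mask_token, mask_ratio=0.75):
    """patches: (batch, num_patches, D). encoder/decoder: sequence-to-sequence modules."""
    B, N, D = patches.shape
    keep = int(N * (1 - mask_ratio))
    # random masking: shuffle patch indices, keep the first `keep`
    idx = torch.rand(B, N).argsort(dim=1)
    visible_idx, masked_idx = idx[:, :keep], idx[:, keep:]
    visible = torch.gather(patches, 1, visible_idx.unsqueeze(-1).expand(-1, -1, D))

    latent = encoder(visible)                         # encode visible patches only
    mask_tokens = mask_token.expand(B, N - keep, D)   # add mask tokens for missing patches
    full = torch.cat([latent, mask_tokens], dim=1)
    recon = decoder(full)                             # reconstruct all patches

    # loss on the masked patches only (assumes the decoder preserves this ordering)
    target = torch.gather(patches, 1, masked_idx.unsqueeze(-1).expand(-1, -1, D))
    pred = recon[:, keep:]
    return ((pred - target) ** 2).mean()
```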

34 of 116

Example Reconstructions

35 of 116

Representation Learning

36 of 116

Why do representations matter?

210 / 6 = ?

CCX / VI = ?

37 of 116

“In the context of machine learning, what makes one representation better than another? Generally speaking, a good representation is one that makes a subsequent learning task easier. The choice of representation will usually depend on the choice of the subsequent learning task” - Chapter 15, Representation Learning, of deeplearningbook.org

38 of 116

Goodfellow’s thoughts:

39 of 116

Transfer Learning

40 of 116

Transfer Learning

Goal: Pre-train a backbone without labels!

How?

41 of 116

Recall…

Devise our own task based on the data: self-supervision!

42 of 116

Pretext Tasks

Tasks created for the purpose of obtaining good learned representations

43 of 116

A Simple Task: Rotations

input image

44 of 116

A Simple Task: Rotations

rotate it four different ways and keep the type of rotation as the label

45 of 116

A Simple Task: Rotations

train a model to predict the rotation
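A minimal sketch of generating the rotated images and their self-supervised labels; any classifier can then be trained on them with plain cross-entropy.

```python
import torch

def rotation_batch(images):
    """images: (batch, C, H, W). Returns 4x the batch, rotated 0/90/180/270 degrees,
    with the rotation index (0-3) as the label."""
    rotated, labels = [], []
    for k in range(4):                                 # k quarter-turns
        rotated.append(torch.rot90(images, k, dims=(2, 3)))
        labels.append(torch.full((images.shape[0],), k, dtype=torch.long))
    return torch.cat(rotated), torch.cat(labels)
```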

46 of 116

Slightly Harder: Jigsaw Puzzle

Learn to solve a jigsaw puzzle of the input!

47 of 116

Slightly Harder: Jigsaw Puzzle

48 of 116

Slightly Harder: Jigsaw Puzzle

49 of 116

Slightly Harder: Jigsaw Puzzle

50 of 116

Slightly Harder: Jigsaw Puzzle

51 of 116

Slightly Harder: Jigsaw Puzzle

52 of 116

Slightly Harder: Jigsaw Puzzle

Why do we jitter the boxes?

What makes a label?

53 of 116

Slightly Harder: Jigsaw Puzzle

Similar to rotations: predict an index in [0, …, 9!−1] identifying the permutation
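A rough sketch of building those labels. The original jigsaw paper uses a fixed subset of permutations rather than all 9!; the hypothetical `PERMS` below imitates that with a small random subset.

```python
import random
from itertools import permutations

# a fixed subset of permutations stands in for the paper's hand-picked set
PERMS = random.sample(list(permutations(range(9))), k=100)

def jigsaw_example(patches):
    """patches: list of 9 image tiles (row-major 3x3 grid).
    Returns the shuffled tiles and the permutation index to predict."""
    label = random.randrange(len(PERMS))
    order = PERMS[label]
    shuffled = [patches[i] for i in order]
    return shuffled, label
```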

54 of 116

Contrastive Learning

Key Idea: Encourage representations for +’s to attract and -’s to repel

55 of 116

Timeline of Contrastive Learning

NPID, 2018

MoCo, 2019

SimCLR, 2020

BYOL, 2020

CPC, 2018

CLIP, 2021

56 of 116

The Loss Function

57 of 116

The Loss Function

Metric learning!

58 of 116

SimCLR

A Simple Framework for Contrastive Learning of Visual Representations

We have our loss term:

How do we create (+) and (-) samples?

59 of 116

SimCLR

A Simple Framework for Contrastive Learning of Visual Representations

We have our loss term:

How do we create (+) and (-) samples?

Data Augmentation!

60 of 116

SimCLR

A Simple Framework for Contrastive Learning of Visual Representations

61 of 116

SimCLR

A Simple Framework for Contrastive Learning of Visual Representations

  • Project representations into a smaller space for calculating the loss

Pain point: Need large batch size/many negative examples! (fixed by MoCo)
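A minimal sketch of the SimCLR recipe just described: two augmentations per image, a projection head `g` on top of the encoder `f`, and the NT-Xent loss, where the other 2N−2 examples in the batch act as negatives. `f`, `g`, and `augment` are hypothetical names.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    """z1, z2: (N, d) projections of two augmented views of the same N images."""
    z = F.normalize(torch.cat([z1, z2]), dim=1)            # (2N, d)
    sim = z @ z.t() / temperature                          # cosine similarities
    sim.fill_diagonal_(float("-inf"))                      # ignore self-similarity
    N = z1.shape[0]
    # the positive for example i is its other view: i <-> i+N
    targets = torch.cat([torch.arange(N) + N, torch.arange(N)])
    return F.cross_entropy(sim, targets)

# per training step (hypothetical encoder f, projection head g, augment function):
#   x1, x2 = augment(batch), augment(batch)    # two random augmentations
#   loss = nt_xent(g(f(x1)), g(f(x2)))
```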

62 of 116

Momentum Contrast (MoCo)

SimCLR pain point: Need large batch size/many negative examples!

Idea: Keep a memory bank of negative examples to draw from (NPID too)

63 of 116

Momentum Contrast (MoCo)

Rephrase contrastive learning: q = query example, k = corresponding key.

→ Standard CL: want to maximize <q, k> if k is (+), minimize if k is (-)

→ Difference is how we find examples of k

64 of 116

Momentum Contrast (MoCo)

Rephrase contrastive learning: q = query example, k = corresponding key.

→ Standard CL: want to maximize <q, k> if k is (+), minimize if k is (-)

→ Difference is how we find examples of k

65 of 116

Momentum Contrast (MoCo)

MoCo uses a dynamic dictionary lookup

dict = a large FIFO queue of encoded representations

(the oldest encodings are the most outdated, so they’re dequeued first!)

66 of 116

Momentum Contrast (MoCo)

We can’t backpropagate through the queue to the key encoder, so its parameters 𝚹k are instead updated as a momentum (exponential moving average) of the query encoder’s parameters 𝚹q: 𝚹k ← m·𝚹k + (1−m)·𝚹q.
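A minimal sketch of those two pieces, the FIFO queue and the momentum update. The real MoCo stores the queue as a single tensor for efficiency; the queue size (65536) and momentum (0.999) follow the paper.

```python
import torch
from collections import deque

m = 0.999                      # momentum coefficient
queue = deque(maxlen=65536)    # FIFO dictionary of encoded keys (oldest dropped first)

@torch.no_grad()
def momentum_update(encoder_q, encoder_k):
    # theta_k <- m * theta_k + (1 - m) * theta_q   (no gradients flow into encoder_k)
    for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1 - m)

@torch.no_grad()
def enqueue(keys):             # keys: (batch, d) encodings from the momentum encoder
    for k in keys:
        queue.append(k)
```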

67 of 116

CLIP

68 of 116

CLIP

Use natural language + images as supervision w/ contrastive pre-training

69 of 116

CLIP

Use natural language + images as supervision w/ contrastive pre-training

70 of 116

CLIP

Pre-train a model to learn good text and image representations

Key idea: Using a contrastive loss w/ (image, text) pairs to have text inform image info

Cosine similarity between every (image, text) pair in the batch; maximize it for matched pairs, minimize it for mismatched ones
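A minimal sketch of this symmetric contrastive loss, close in spirit to the pseudocode in the CLIP paper (simplified: the temperature here is fixed rather than learned).

```python
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, temperature=0.07):
    """image_emb, text_emb: (N, d) embeddings of N matched (image, text) pairs."""
    image_emb = F.normalize(image_emb, dim=1)
    text_emb = F.normalize(text_emb, dim=1)
    logits = image_emb @ text_emb.t() / temperature      # (N, N) cosine similarities
    targets = torch.arange(len(logits))                  # matched pairs on the diagonal
    # symmetric cross-entropy: image -> text and text -> image
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```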

71 of 116

CLIP

72 of 116

CLIP

Interesting takeaways:

Bag of Words >> Transformers for text encoding

Contrastive objective >> Exact caption prediction

73 of 116

CLIP

Use pre-trained encoders to create embeddings

For each image, create candidate captions and pick the best one to do zero-shot prediction
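A rough sketch of that zero-shot step, assuming hypothetical pre-trained `image_encoder`, `text_encoder`, and `tokenize` functions.

```python
import torch
import torch.nn.functional as F

def zero_shot_predict(image, class_names, image_encoder, text_encoder, tokenize):
    """Pick the best candidate caption for one image (C, H, W)."""
    captions = [f"a photo of a {name}" for name in class_names]
    with torch.no_grad():
        img = F.normalize(image_encoder(image.unsqueeze(0)), dim=-1)      # (1, d)
        txt = F.normalize(text_encoder(tokenize(captions)), dim=-1)       # (C, d)
    sims = (img @ txt.t()).squeeze(0)                                     # (C,)
    return class_names[sims.argmax().item()]
```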

74 of 116

CLIP

75 of 116

Masked Autoencoders

note the encoder/decoder asymmetry: mask tokens are not fed into the encoder

76 of 116

MAEs outperform… everything

77 of 116

BERT-like scalability

78 of 116

BERT-like scalability

Original supervised training overfits

79 of 116

BERT-like scalability

Supervised training w/ strong regularization saturates!

80 of 116

BERT-like scalability

MAE pre-training generalizes much better

4%!

81 of 116

BERT-like scalability

Follows a similar trend to JFT-300M (300x more data)

82 of 116

83 of 116

84 of 116

85 of 116

86 of 116

87 of 116

88 of 116

Deep Feature Clustering

  • K-means!
    • Assign points to the nearest mean
    • Adjust each mean to its assigned points
    • Need to define d(Image 1, Image 2)
  • What’s a good distance metric?
    • Average pixel difference?
    • Feature vector difference?
  • Where do we get the features?
    • Could do any of the previous things
    • Or fake it till you make it!

89 of 116

DeepCluster

  • K means feature representations
  • Label based on clusters
  • Update model to classify based on pseudo labels
  • Repeat!
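The sketch referenced above. `model.features` and `model.train_classifier` are hypothetical stand-ins for a CNN backbone and a supervised training step; the real DeepCluster adds details (PCA-whitening, a re-initialized classifier head each epoch) omitted here.

```python
from sklearn.cluster import KMeans

def deepcluster_epoch(model, images, k=100):
    """One round of DeepCluster: cluster features, then train on the pseudo-labels."""
    feats = model.features(images)                    # (N, d) array of feature vectors
    pseudo_labels = KMeans(n_clusters=k).fit_predict(feats)
    model.train_classifier(images, pseudo_labels)     # classify into the k clusters
    # repeat: new features -> new clusters -> new pseudo-labels -> ...
```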

90 of 116

Autoencoders

  • How many dimensions is a 224 x 224 color image?
    • 224 x 224 x 3 = 150,528!
  • Can probably preserve almost all the information using far fewer dimensions…

z

x

x’

91 of 116

Autoencoders

  • Encoder: standard CNN
  • Decoder: upsampling instead of downsampling
  • At least a reconstruction loss
  • Going beyond: regularizers, denoising autoencoders (minimal sketch below)
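A minimal sketch of such an autoencoder for 224 x 224 x 3 images; the layer sizes are illustrative, not from the slides.

```python
import torch.nn as nn

encoder = nn.Sequential(                                # standard CNN, downsampling
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),    # 224 -> 112
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),   # 112 -> 56
)
decoder = nn.Sequential(                                # upsampling instead of downsampling
    nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),    # 56 -> 112
    nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1), nn.Sigmoid(),  # 112 -> 224
)

def reconstruction_loss(x):
    z = encoder(x)          # compressed representation z
    x_hat = decoder(z)      # reconstruction x'
    return ((x_hat - x) ** 2).mean()
```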

92 of 116

Generative Modeling

93 of 116

Generative Modeling

  • Typically want to generate new data that has some desired properties.
  • Text
    • Autoregressive model + prompt works great!
  • Images
    • Hard to provide “prompt”
  • Let’s discuss some other approaches!

94 of 116

Decoder from an autoencoder?

  • Passing into the decoder…
  • An encoding from a real image yields…
    • Best approximation for the real image
  • A random vector yields…
    • ???
    • How do we even sample the random vector?

95 of 116

Decoder from an autoencoder?

Latent Space

96 of 116

Variational Autoencoder

97 of 116

Variational Autoencoder

  • First, a purely applied perspective
  • Instead of Encoder: x → z
    • Encoder: x → μz, 𝝈z², both of the same hidden dimension
    • z ~ N(μz, 𝝈z²·I); reparameterize: z = μz + 𝝈z·ε
    • ε ~ N(0, I)
  • Additional loss term pulling μz and 𝝈z² toward the 0 and 1 vectors resp. (sketched below)
    • ½ · sum(μz² + 𝝈z² − log(𝝈z²) − 1)
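The sketch referenced above: reparameterization plus the reconstruction + KL loss, with hypothetical `encoder` and `decoder` modules (the encoder returns μ and log 𝝈²).

```python
import torch

def vae_step(x, encoder, decoder):
    """encoder(x) -> (mu, logvar); decoder(z) -> reconstruction."""
    mu, logvar = encoder(x)
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    z = mu + std * eps                                   # reparameterization trick
    x_hat = decoder(z)

    recon = ((x_hat - x) ** 2).sum()                     # reconstruction loss
    # KL(N(mu, sigma^2) || N(0, I)) = 0.5 * sum(mu^2 + sigma^2 - log sigma^2 - 1)
    kl = 0.5 * torch.sum(mu ** 2 + logvar.exp() - logvar - 1)
    return recon + kl
```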

98 of 116

Variational Autoencoder

99 of 116

Variational Autoencoder

Reconstruction Loss

“Regularization” Term

100 of 116

VAE Latent Math!

[Figure: latent vectors combined as A + B − C, then decoded into a new image]

Arithmetic in latent space, then decode!

101 of 116

Conditional VAE

[Figure: latent z captures abstract features; class label c=dog is provided alongside z]

102 of 116

Conditional VAE

[Figure: the same latent z (abstract), now with class label c=cat]

103 of 116

GANs

  • VAEs are cool, but they’re never actually trained on novel image generation, so we’re relying primarily on generalization.
  • Enter Generative Adversarial Networks
  • Key Idea: Train a generator capable of generating realistic images, as opposed to reconstructing provided ones.

104 of 116

VAEs -> GANs

[Figure: a latent vector z]

105 of 116

VAEs -> GANs

[Figure: z → Generator → fake image; Dataset → real image; Discriminator → Real or Fake? Loss?]

The Generator has the same architecture as the VAE decoder!

106 of 116

GAN Training Objectives

Generator: Generate images that fool discriminator.

Discriminator: Classify real vs generated/fake images.
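A minimal sketch of these two objectives, using the common non-saturating generator loss; `G`, `D`, and the (N, 1) discriminator-logit shape are assumptions.

```python
import torch
import torch.nn.functional as F

def gan_losses(G, D, real_images, z):
    """Hypothetical generator G and discriminator D; D outputs (N, 1) logits."""
    fake_images = G(z)

    # Discriminator: classify real vs. generated/fake images
    d_loss = (F.binary_cross_entropy_with_logits(D(real_images),
                                                 torch.ones(len(real_images), 1)) +
              F.binary_cross_entropy_with_logits(D(fake_images.detach()),
                                                 torch.zeros(len(fake_images), 1)))

    # Generator: generate images the discriminator labels as real (i.e. fool it)
    g_loss = F.binary_cross_entropy_with_logits(D(fake_images),
                                                torch.ones(len(fake_images), 1))
    return d_loss, g_loss
```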

107 of 116

GANs

Training

108 of 116

GANs

109 of 116

GANs

110 of 116

Cycle-GAN

Example of a self-supervised solution to a task (image style transfer)!

111 of 116

Cycle-GAN

Goal: Transfer an image in style X to an image in style Y

Problem: We don’t have pairs of images in (X, Y) (they might not exist!)

Ideas?

112 of 116

Cycle-GAN

Goal: Transfer an image in style X to an image in style Y

Problem: We don’t have pairs of images in (X, Y) (they might not exist!)

Ideas? Do something like back-translation, but for images!
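A minimal sketch of that idea as a cycle-consistency loss, with hypothetical generators `G_xy` (X → Y) and `G_yx` (Y → X); the full CycleGAN adds adversarial losses on both generators, which are not shown here.

```python
def cycle_consistency_loss(G_xy, G_yx, x, y):
    """No paired (x, y) needed: only require x -> Y -> back to x, and vice versa."""
    x_cycled = G_yx(G_xy(x))           # x -> fake y -> reconstructed x
    y_cycled = G_xy(G_yx(y))           # y -> fake x -> reconstructed y
    return (x_cycled - x).abs().mean() + (y_cycled - y).abs().mean()   # L1 on both cycles
```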

113 of 116

Cycle Consistency Loss

114 of 116

Cycle Consistency Loss

115 of 116

ML Art! CLIP+VQ-VAE

“a cityscape at night”

“an abstract painting of a planet ruled by little castles”

“a studio ghibli landscape”

116 of 116

Resources

Self-Supervised Learning

Contrastive Learning

Non-contrastive Learning

GANs