Unsupervised Learning
Tyler, Arjun (4/20/22)
Outline
Introduction
What about the number of samples?
Bits per sample to learn from: Supervised, Unsupervised & RL…
Some Cool Examples
Self-Supervised Learning
Self-Supervised Learning
Autoregressive Models
P(x1, x2, x3, …, xN) = P(x1) * P(x2 | x1) * … * P(xN | xN−1, …, x1)
Use the same network to model each conditional likelihood
Predict value given previous values
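A minimal sketch of this factorization as a training loss (in PyTorch, with `model` standing in for any network that maps a prefix to next-value logits — a hypothetical stand-in, not a specific architecture): maximizing the factorized likelihood is just summing one cross-entropy term per position.

```python
import torch
import torch.nn.functional as F

def autoregressive_nll(model, tokens):
    """Negative log-likelihood of a sequence under the chain-rule factorization.

    tokens: LongTensor of shape (seq_len,).
    model(prefix) is assumed to return logits of shape (vocab_size,) for the
    next token given the prefix -- a stand-in for any autoregressive network.
    """
    nll = 0.0
    for t in range(1, len(tokens)):
        logits = model(tokens[:t])  # unnormalized P(x_t | x_1..x_{t-1})
        nll = nll + F.cross_entropy(logits.unsqueeze(0), tokens[t].unsqueeze(0))
    return nll  # minimizing this maximizes P(x1) * P(x2|x1) * ... * P(xN|x<N)
```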
Language Models (e.g. GPT)
[Figure: the prefix "ML is cool" fed into a Transformer language model — after "ML" it predicts {is: 0.78, for: 0.14, …}, and after "ML is" it predicts {great: 0.25, cool: 0.21, …}]
P(word | previous words)
Maximize the probability of the training data!
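Generation then just samples from P(word | previous words) one step at a time. A minimal sketch, again assuming a hypothetical `model` that returns next-token logits:

```python
import torch

@torch.no_grad()
def generate(model, prefix_ids, num_new_tokens=10):
    """Autoregressive sampling: append one token at a time from P(next | previous)."""
    tokens = list(prefix_ids)
    for _ in range(num_new_tokens):
        logits = model(torch.tensor(tokens))    # next-token logits given the prefix
        probs = torch.softmax(logits, dim=-1)   # P(word | previous words)
        next_id = torch.multinomial(probs, 1).item()
        tokens.append(next_id)
    return tokens
```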
Language Models (e.g. GPT)
[Figure: same setup with the prefix "ML for fun" — after "ML" the model predicts {is: 0.78, for: 0.14, …}, and after "ML for" it predicts {fun: 0.47, memes: 0.36, …}]
Frame Prediction
Granularity
PixelCNN (Autoregressive Pixel Model)
Treat the pixels as a sequence!
We could just throw a sequence model at it, but that's really inefficient…
PixelCNN
[Figure: a masked convolution slides over a grid of pixel values; each pixel is predicted only from already-generated pixels above it and to its left]
NOTE: After the first layer center square need not be masked!
PixelCNN
[Figure: stacking masked convolutions — the receptive field grows from the first layer to the second layer, covering more of the previously generated pixels]
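A common way to implement this masking (a sketch, not the original PixelCNN code) is to zero out the "future" half of each convolution kernel. Mask type "A" also hides the center pixel for the first layer, while type "B" keeps it for later layers:

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """PixelCNN-style masked convolution (illustrative sketch).

    Mask 'A' (first layer) hides the center pixel and everything after it in
    raster order; mask 'B' (later layers) may include the center pixel.
    """
    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        assert mask_type in ("A", "B")
        _, _, kh, kw = self.weight.shape
        mask = torch.ones(kh, kw)
        mask[kh // 2, kw // 2 + (mask_type == "B"):] = 0  # right of center (and center for 'A')
        mask[kh // 2 + 1:, :] = 0                         # all rows below center
        self.register_buffer("mask", mask[None, None])

    def forward(self, x):
        self.weight.data *= self.mask                     # zero out "future" pixels
        return super().forward(x)

# First layer masks the center; later layers need not (as noted above).
layer1 = MaskedConv2d("A", in_channels=1, out_channels=16, kernel_size=3, padding=1)
layer2 = MaskedConv2d("B", in_channels=16, out_channels=16, kernel_size=3, padding=1)
out = layer2(torch.relu(layer1(torch.randn(1, 1, 8, 8))))
```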
Completion Models
Autoregressive models with a different generation order
Predict value given previous values and subsequent values
GPT ⇒ BERT
I ate a [MASK] yesterday.
Should I wear a [MASK]?
Predict masked values
Predict a binary next-sentence label
Excellent pre-training for downstream tasks!
When would you want to finetune GPT vs BERT?
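A minimal sketch of the masked-prediction setup (hypothetical helper; real BERT pipelines also randomly keep or swap some of the selected tokens):

```python
import torch

def mask_tokens(token_ids, mask_token_id, mask_prob=0.15):
    """BERT-style masked language modeling targets (simplified sketch).

    Randomly replaces ~15% of tokens with [MASK]; the labels keep the original
    IDs at masked positions and -100 (ignored by cross-entropy) elsewhere.
    """
    token_ids = token_ids.clone()
    masked = torch.rand(token_ids.shape) < mask_prob
    labels = torch.where(masked, token_ids, torch.full_like(token_ids, -100))
    token_ids[masked] = mask_token_id
    return token_ids, labels

inputs, labels = mask_tokens(torch.randint(0, 30000, (1, 12)), mask_token_id=103)
# A model would then predict the original IDs at the masked positions:
# loss = F.cross_entropy(model(inputs).view(-1, vocab_size), labels.view(-1), ignore_index=-100)
```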
Frame Prediction ⇒ Super SloMo
PixelCNN ⇒ SuperRes
PixelCNN ⇒ Inpainting
More SSL for Vision?
Self-supervised learning == good for language. What about vision?
PixelCNN!
Seeing 20/20
Even better: use patches instead of pixels and transformers as our seq2seq architecture
Vision Transformer (ViT)
Attention unleashed (BERT in CV)
Masked Autoencoders
Pipeline: random masking → encode visible patches → add mask tokens → reconstruct
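A minimal sketch of the random-masking step, with `encoder` and `decoder` left as stand-ins for the ViT blocks:

```python
import torch

def random_masking(patches, mask_ratio=0.75):
    """MAE-style random masking (sketch): keep a random 25% of patch tokens.

    patches: (batch, num_patches, dim). Returns the visible patches and the
    indices needed to scatter mask tokens back in before the decoder.
    """
    b, n, d = patches.shape
    n_keep = int(n * (1 - mask_ratio))
    shuffle = torch.rand(b, n).argsort(dim=1)       # random permutation per example
    keep_idx = shuffle[:, :n_keep]                  # indices of visible patches
    visible = torch.gather(patches, 1, keep_idx[:, :, None].expand(-1, -1, d))
    return visible, keep_idx

# Pipeline sketch: encode only the visible patches, then add mask tokens and reconstruct.
patches = torch.randn(2, 196, 768)                  # e.g. 14x14 patches from a ViT
visible, keep_idx = random_masking(patches)
# latent = encoder(visible); full = add_mask_tokens(latent, keep_idx); recon = decoder(full)
```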
Example Reconstructions
Representation Learning
Why do representations matter?
210 / 6 = ?
CCX / VI = ?
Goodfellow’s thoughts:
“In the context of machine learning, what makes one representation better than another? Generally speaking, a good representation is one that makes a subsequent learning task easier. The choice of representation will usually depend on the choice of the subsequent learning task” - Chapter 15, Representation Learning, of deeplearningbook.org
Transfer Learning
Goal: Pre-train a backbone without labels!
How?
Recall…
Devise our own task based on the data: self-supervision!
Pretext Tasks
Tasks created for the purpose of obtaining good learned representations
A Simple Task: Rotations
input image
A Simple Task: Rotations
rotate it four different ways and keep the type of rotation as the label
A Simple Task: Rotations
train model to predict the rotation
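A minimal sketch of building this rotation dataset with `torch.rot90` (the label is just the number of quarter-turns):

```python
import torch

def make_rotation_batch(images):
    """Rotation pretext task (sketch): rotate each image 0/90/180/270 degrees.

    images: (batch, channels, height, width). Returns the rotated images and
    the rotation index in {0, 1, 2, 3} as the self-supervised label.
    """
    rotated, labels = [], []
    for k in range(4):                                   # k quarter-turns
        rotated.append(torch.rot90(images, k, dims=(2, 3)))
        labels.append(torch.full((images.shape[0],), k, dtype=torch.long))
    return torch.cat(rotated), torch.cat(labels)

x, y = make_rotation_batch(torch.randn(8, 3, 32, 32))
# Train any classifier to predict y from x:
# loss = F.cross_entropy(model(x), y)
```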
Slightly Harder: Jigsaw Puzzle
Learn to solve a jigsaw puzzle of the input!
Why do we jitter the boxes?
What makes a label?
Slightly Harder: Jigsaw Puzzle
Similar to rotations: learn to predict an index in [0, …, 9!−1] identifying which permutation was applied
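A minimal sketch of building a jigsaw example; for illustration only a small, arbitrary subset of the 9! permutations is indexed (in practice a fixed subset of permutations is chosen ahead of time):

```python
import torch
from itertools import islice, permutations

# Illustrative subset: the first 100 permutations of the 9 patches.
PERMS = list(islice(permutations(range(9)), 100))

def make_jigsaw_example(image, patch=32):
    """Jigsaw pretext task (sketch): shuffle a 3x3 grid of patches.

    image: (channels, 3*patch, 3*patch). Returns the shuffled patches and the
    index of the permutation used, which is the label to predict.
    """
    c = image.shape[0]
    # Cut the image into 9 patches in row-major order.
    patches = image.unfold(1, patch, patch).unfold(2, patch, patch)    # (c, 3, 3, patch, patch)
    patches = patches.reshape(c, 9, patch, patch).permute(1, 0, 2, 3)  # (9, c, patch, patch)
    label = torch.randint(len(PERMS), (1,)).item()
    shuffled = patches[list(PERMS[label])]
    return shuffled, label

tiles, label = make_jigsaw_example(torch.randn(3, 96, 96))
```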
Contrastive Learning
Key Idea: Encourage representations for +’s to attract and -’s to repel
https://ml.berkeley.edu/blog/posts/contrastive_learning/ (img source: Tony Zhao)
Timeline of Contrastive Learning
NPID, 2018
MoCo, 2019
SimCLR, 2020
BYOL, 2020
CPC, 2018
CLIP, 2021
…
The Loss Function
https://ml.berkeley.edu/blog/posts/contrastive_learning/ (img source: Tony Zhao)
Metric learning!
SimCLR
A Simple Framework for Contrastive Learning of Visual Representations
We have our loss term:
How do we create (+) and (-) samples?
Data Augmentation!
SimCLR
A Simple Framework for Contrastive Learning of Visual Representations
https://ml.berkeley.edu/blog/posts/contrastive_learning/ (img source: Tony Zhao)
SimCLR
A Simple Framework for Contrastive Learning of Visual Representations
calculating the loss
Pain point: Need large batch size/many negative examples! (fixed by MoCo)
https://ml.berkeley.edu/blog/posts/contrastive_learning/ (img source: Tony Zhao)
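A minimal sketch of the loss SimCLR optimizes (NT-Xent), assuming `z1` and `z2` are the projected embeddings of two augmented views of the same batch:

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    """SimCLR's NT-Xent loss (sketch): the two views of each image are positives,
    every other embedding in the batch acts as a negative."""
    n = z1.shape[0]
    z = F.normalize(torch.cat([z1, z2]), dim=1)   # (2n, d), unit-norm
    sim = z @ z.t() / temperature                 # cosine similarities
    sim.fill_diagonal_(float("-inf"))             # never contrast an example with itself
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])  # i's positive is i+n, and vice versa
    return F.cross_entropy(sim, targets)

loss = nt_xent(torch.randn(256, 128), torch.randn(256, 128))
```

Note how the batch size directly controls the number of negatives — exactly the pain point mentioned above.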
Momentum Contrast (MoCo)
SimCLR pain point: Need large batch size/many negative examples!
Idea: Keep a memory bank of negative examples to draw from (NPID too)
Momentum Contrast (MoCo)
Rephrase contrastive learning: q = query example, k = corresponding key.
→ Standard CL: want to maximize <q, k> if k is (+), minimize if k is (-)
→ Difference is how we find examples of k
Momentum Contrast (MoCo)
MoCo uses a dynamic dictionary lookup
dict = a large FIFO queue of encoded representations
(since newer is better!)
Momentum Contrast (MoCo)
Gradients can’t flow back to keys stored in the queue, so the key encoder’s parameters θk are instead updated as a momentum (exponential moving average) of the query encoder’s parameters θq.
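A minimal sketch of the momentum update and the FIFO queue (hypothetical helper names, not MoCo's actual code):

```python
import torch

@torch.no_grad()
def momentum_update(encoder_q, encoder_k, m=0.999):
    """MoCo-style momentum update (sketch): the key encoder trails the query
    encoder as an exponential moving average instead of receiving gradients."""
    for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1 - m)   # theta_k <- m*theta_k + (1-m)*theta_q

@torch.no_grad()
def update_queue(queue, new_keys, max_size=65536):
    """FIFO dictionary of encoded keys: enqueue the newest batch, drop the oldest."""
    queue = torch.cat([new_keys, queue], dim=0)
    return queue[:max_size]
```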
CLIP
Use natural language + images as supervision w/ contrastive pre-training
CLIP
Pre-train a model to learn good text and image representations
Key idea: Using a contrastive loss w/ (image, text) pairs to have text inform image info
Contrastive loss over the dense cosine-similarity matrix between all (image, text) pairs in the batch
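A minimal sketch of this objective, assuming `image_emb` and `text_emb` are the batch outputs of the image and text encoders:

```python
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, temperature=0.07):
    """CLIP-style contrastive loss (sketch): cosine similarity between every
    (image, text) pair in the batch; matching pairs sit on the diagonal."""
    image_emb = F.normalize(image_emb, dim=1)
    text_emb = F.normalize(text_emb, dim=1)
    logits = image_emb @ text_emb.t() / temperature   # (batch, batch) similarity matrix
    targets = torch.arange(len(logits))
    # Symmetric cross-entropy: pick the right text for each image and vice versa.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
```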
CLIP
Interesting takeaways:
Predicting a bag of words >> predicting exact text with a transformer LM
Contrastive objective >> Exact caption prediction
CLIP
Use pre-trained encoders to create embeddings
For each image, create candidate captions and pick the best one to do zero-shot prediction
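A minimal sketch of this zero-shot recipe, with `text_encoder` standing in for the pre-trained text tower and the caption template ("a photo of a …") only an illustrative choice:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(image_emb, class_names, text_encoder):
    """Zero-shot prediction (sketch): embed one caption per class and pick the
    caption whose embedding is closest to the image embedding."""
    captions = [f"a photo of a {name}" for name in class_names]
    text_emb = F.normalize(text_encoder(captions), dim=1)   # (num_classes, d)
    image_emb = F.normalize(image_emb, dim=1)               # (batch, d)
    return (image_emb @ text_emb.t()).argmax(dim=1)         # best caption per image
```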
CLIP
Masked Autoencoders
Note the encoder/decoder asymmetry: no mask tokens go into the encoder
MAEs outperform… everything
BERT-like scalability
Original supervised training overfits
Supervised training w/ strong regularization saturates!
MAE pre-training generalizes much better (4%!)
Follows a similar trend to JFT-300M (300x more data)
Deep Feature Clustering
DeepCluster
Autoencoders
[Figure: input x → encoder → latent code z → decoder → reconstruction x’]
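A minimal sketch of the x → z → x’ pipeline with linear layers (the sizes are arbitrary):

```python
import torch
import torch.nn as nn

# A minimal autoencoder: x -> encoder -> z -> decoder -> x'
encoder = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 32))
decoder = nn.Sequential(nn.Linear(32, 256), nn.ReLU(), nn.Linear(256, 784))

x = torch.randn(16, 784)        # e.g. flattened 28x28 images
z = encoder(x)                  # low-dimensional latent code
x_recon = decoder(z)            # reconstruction x'
loss = nn.functional.mse_loss(x_recon, x)
```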
Generative Modeling
Decoder from an autoencoder?
Latent Space
Variational Autoencoder
Reconstruction Loss
“Regularization” Term
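A minimal sketch of these two terms plus the reparameterization trick, assuming the encoder outputs the `mu` and `log_var` of the approximate posterior:

```python
import torch

def vae_loss(x, x_recon, mu, log_var):
    """VAE objective (sketch): reconstruction loss plus the KL "regularization"
    term pulling q(z|x) = N(mu, sigma^2) toward the unit Gaussian prior."""
    recon = torch.nn.functional.mse_loss(x_recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + kl

def reparameterize(mu, log_var):
    """Sample z = mu + sigma * eps so gradients flow through mu and log_var."""
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * log_var) * eps
```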
VAE Latent Math!
[Figure: example images combined in latent space as z1 + z2 − z3, then decoded]
Arithmetic in latent space then decode!
Conditional VAE
[Figure: the decoder takes both the abstract latent z and a class label c (e.g. c=dog or c=cat), so samples can be conditioned on the class]
GANs
VAEs -> GANs
[Figure: noise z → Generator (same architecture as a VAE decoder!) → fake images; real images from the dataset and generated fakes are fed to a Discriminator that predicts real or fake — but what is the loss?]
GAN Training Objectives
Generator: Generate images that fool discriminator.
Discriminator: Classify real vs generated/fake images.
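A minimal sketch of these two objectives in the standard (non-saturating) binary cross-entropy form, where the `d_*_logits` are discriminator outputs:

```python
import torch
import torch.nn.functional as F

def discriminator_loss(d_real_logits, d_fake_logits):
    """Discriminator: classify real images as 1 and generated images as 0."""
    real = F.binary_cross_entropy_with_logits(d_real_logits, torch.ones_like(d_real_logits))
    fake = F.binary_cross_entropy_with_logits(d_fake_logits, torch.zeros_like(d_fake_logits))
    return real + fake

def generator_loss(d_fake_logits):
    """Generator: fool the discriminator into labeling fakes as real."""
    return F.binary_cross_entropy_with_logits(d_fake_logits, torch.ones_like(d_fake_logits))
```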
GANs
Training
Cycle-GAN
Example of a self-supervised solution to a task (image style transfer)!
Cycle-GAN
Goal: Transfer an image in style X to an image in style Y
Problem: We don’t have pairs of images in (X, Y) (they might not exist!)
Ideas? Do something like back translation, but for images!
Cycle Consistency Loss
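A minimal sketch of that idea, with `G_xy` and `G_yx` standing in for the two generators (the full CycleGAN objective also adds adversarial losses for each domain):

```python
import torch
import torch.nn.functional as F

def cycle_consistency_loss(x, y, G_xy, G_yx):
    """Cycle consistency (sketch): translating X->Y->X (and Y->X->Y) should
    recover the original image, since no paired (X, Y) data exists."""
    forward_cycle = F.l1_loss(G_yx(G_xy(x)), x)    # x -> fake y -> reconstructed x
    backward_cycle = F.l1_loss(G_xy(G_yx(y)), y)   # y -> fake x -> reconstructed y
    return forward_cycle + backward_cycle
```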
ML Art! CLIP+VQ-VAE
https://ml.berkeley.edu/blog/posts/clip-art/ (img source: Charlie Snell)
“a cityscape at night”
“an abstract painting of a planet ruled by little castles”
“a studio ghibli landscape”
Resources
Self-Supervised Learning
Contrastive Learning
Non-contrastive Learning
GANs