
Masked Autoencoders Are Scalable Vision Learners

Paper Discussion 13.07.2022


Transformers Rule!!!

Give Convs a chance

It was another day at the office in ML City for Huggy.

He just finished creating a nice new space and was ready for his lunch in the park.


He was looking forward to a nice sandwich out on a sunny summer day.

But he soon realized that it was getting way too hot.

38°C

That was the moment he decided to have a nice beach vacation.


He had heard about this nice place - Neural Network Island, where ML folks go with their image pets to have a good time.


But when he got to the beach he was extremely disappointed - it was packed with images.

He wondered why they were all hanging around at this one beach.

Then he saw a sign:

"Google ViT Training today - Bring your image for free TPU credits!!!"


Luckily Huggy got a tip from a local about another nearby beach.

When he got there, he was relieved - way fewer images.

But something was strange about them.

He looked around and saw a sign, again:

"FAIR MAE Training today - Bring your image for free Instagram followers!!!"


Then he saw a booth offering free decoder glasses. He put them on and was amazed.


With his new decoder glasses, Huggy spent a nice day at the beach.

In the afternoon there was a classification competition between the vanilla ViT and MAE.

Huggy was sure that MAE had no chance.

It was a close result, but in the end MAE managed to outperform vanilla ViT.


Vision Transformer (ViT) Recap

  • Published in 2020 by Google
  • First Transformer to really succeed on vision tasks, competitive with state-of-the-art CNNs
  • Needs huge datasets (e.g. JFT-300M) for good results


Masked Autoencoders

  • Not a new Vision Transformer model
  • A framework for effectively pre-training existing ViT(-like) transformers
  • Self-supervised (like MoCo, BEiT, DINO, SimMIM) - see the training-step sketch below
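At a high level, one pre-training step works as in the sketch below - a rough PyTorch-style outline, not the authors' code. `patchify`, `random_masking`, `encoder`, and `decoder` are assumed helpers here; the following slides sketch the masking, encoder, and decoder pieces individually.

```python
def mae_step(images, patchify, random_masking, encoder, decoder):
    """One MAE pre-training step (illustrative outline, not the official code)."""
    # 1. Split each image into non-overlapping patches: (B, N, patch_dim).
    patches = patchify(images)

    # 2. Randomly mask a large fraction of the patches (75% by default).
    visible, mask, ids_restore = random_masking(patches, mask_ratio=0.75)

    # 3. Encode only the visible patches - this is what makes MAE cheap to train.
    latent = encoder(visible)

    # 4. Decode latent tokens + mask tokens into pixel predictions for all patches.
    pred = decoder(latent, ids_restore)

    # 5. Mean squared error, averaged over the masked patches only.
    loss = (((pred - patches) ** 2).mean(dim=-1) * mask).sum() / mask.sum()
    return loss
```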


Masked Autoencoders

[Diagram: the classic autoencoder layout - Encoder → Latent → Decoder]


MAE Architecture


Masking

  • Create patches + positional embeddings
  • Randomly shuffle all patches
  • Mask (drop) the last x% of the shuffled patches - 75% by default; see the sketch below
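In code, the shuffle-and-drop masking comes down to a single `argsort` over random noise. Below is an illustrative PyTorch version (close in spirit to the paper's approach, not copied from the official repo):

```python
import torch

def random_masking(patches, mask_ratio=0.75):
    """Keep a random (1 - mask_ratio) subset of patches per image.

    patches: (B, N, D) patch embeddings, positional embeddings already added.
    Returns the visible patches, a binary mask (1 = masked), and the
    permutation needed to restore the original patch order in the decoder.
    """
    B, N, D = patches.shape
    len_keep = int(N * (1 - mask_ratio))

    noise = torch.rand(B, N, device=patches.device)  # one random score per patch
    ids_shuffle = noise.argsort(dim=1)               # random permutation
    ids_restore = ids_shuffle.argsort(dim=1)         # its inverse

    ids_keep = ids_shuffle[:, :len_keep]             # first (1 - x)% survive
    visible = torch.gather(
        patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))

    mask = torch.ones(B, N, device=patches.device)   # 1 = masked, 0 = visible
    mask[:, :len_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)        # back to original order
    return visible, mask, ids_restore
```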


Encoder

[Diagram: visible patches → Linear Projection → stacked Transformer encoder blocks → Layer Normalization]
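A minimal encoder sketch: a standard ViT that only ever sees the visible patches. The depth/width settings below are illustrative defaults, not the paper's exact configuration:

```python
import torch.nn as nn

class MAEEncoder(nn.Module):
    """Standard ViT encoder applied to the visible patches only."""

    def __init__(self, patch_dim, dim=768, depth=12, heads=12):
        super().__init__()
        self.proj = nn.Linear(patch_dim, dim)      # linear patch projection
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.norm = nn.LayerNorm(dim)              # final layer normalization

    def forward(self, visible_patches):            # (B, N_visible, patch_dim)
        x = self.proj(visible_patches)             # embed visible patches
        # (positional-embedding handling is omitted for brevity)
        x = self.blocks(x)
        return self.norm(x)                        # (B, N_visible, dim)
```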


Decoder

[Diagram: encoded tokens + mask tokens → stacked Transformer blocks → Layer Normalization → pixel predictions]
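The decoder is deliberately lightweight: it fills the masked positions with a shared, learned mask token, restores the original patch order, and predicts raw pixels. A sketch, assuming `ids_restore` from the masking step above (dimensions are illustrative):

```python
import torch
import torch.nn as nn

class MAEDecoder(nn.Module):
    """Lightweight decoder that reconstructs pixels for every patch."""

    def __init__(self, enc_dim=768, dim=512, depth=8, heads=16, patch_dim=768):
        super().__init__()
        self.embed = nn.Linear(enc_dim, dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))  # shared, learned
        # MAE's decoder is architecturally just a stack of standard
        # Transformer blocks, so encoder layers are used here.
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, patch_dim)        # predict raw pixel values

    def forward(self, latent, ids_restore):          # latent: (B, N_vis, enc_dim)
        x = self.embed(latent)
        B, N_vis, D = x.shape
        N = ids_restore.shape[1]
        # Append one mask token per masked position, then undo the shuffle so
        # every token sits at its original patch location again.
        mask_tokens = self.mask_token.expand(B, N - N_vis, -1)
        x = torch.cat([x, mask_tokens], dim=1)
        x = torch.gather(x, 1, ids_restore.unsqueeze(-1).expand(-1, -1, D))
        # (decoder positional embeddings are omitted for brevity)
        x = self.blocks(x)
        return self.head(self.norm(x))               # (B, N, patch_dim)
```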


Reconstruction

[Figure: original image vs. MAE reconstruction; the loss is the Mean Squared Error between them]
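The loss is plain MSE between predicted and true pixels, computed only on the masked patches. The sketch below also includes the per-patch pixel normalization of the targets that the paper found to improve results (`mask` as returned by the masking sketch above):

```python
def mae_loss(pred, target, mask, norm_pix=True):
    """MSE over masked patches only. pred/target: (B, N, D), mask: (B, N)."""
    if norm_pix:
        # Normalize each target patch by its own pixel mean and variance.
        mean = target.mean(dim=-1, keepdim=True)
        var = target.var(dim=-1, keepdim=True)
        target = (target - mean) / (var + 1e-6).sqrt()
    loss = ((pred - target) ** 2).mean(dim=-1)       # per-patch MSE: (B, N)
    return (loss * mask).sum() / mask.sum()          # average over masked only
```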


Classification

[Diagram: Encoder → Latent → Classification Head; the decoder is only needed for pre-training and is discarded afterwards]
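For downstream classification, the pre-trained decoder is thrown away and a classification head is attached to the encoder output. A minimal fine-tuning sketch, reusing the `MAEEncoder` from above (the mean-pooling choice is illustrative):

```python
import torch.nn as nn

class ViTClassifier(nn.Module):
    """Pre-trained MAE encoder plus a linear head for classification."""

    def __init__(self, pretrained_encoder, dim=768, num_classes=1000):
        super().__init__()
        self.encoder = pretrained_encoder     # weights from MAE pre-training
        self.head = nn.Linear(dim, num_classes)

    def forward(self, patches):               # all patches - no masking now
        tokens = self.encoder(patches)        # (B, N, dim)
        return self.head(tokens.mean(dim=1))  # global average pooling
```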

Reconstruction Examples

[Figures: example image reconstructions]

Performance

[Plots: results under Linear Probing and Fine-Tuning]


Performance - Comparison


Performance - Mask Sampling


Performance - Masking Ratio


Performance - Decoder Size


Performance - Augmentation


Performance - Decoder Size


Object Detection and Segmentation


Semantic Segmentation


Thank you!
