
Masked Autoencoders Are Scalable Vision Learners

Paper Discussion 13.07.2022


Transformers Rule!!!

Give Convs a chance

It was another day at the office in ML City for Huggy.

He just finished creating a nice new space and was ready for his lunch in the park.


He was looking forward to a nice sandwich out on a sunny summer day.

But he soon realized that it was getting way too hot.

38°C

That was the moment he decided to have a nice beach vacation.


He had heard about this nice place - Neural Network Island, where ML folks go with their image pets to have a good time.


But when he got to the beach he was extremely disappointed - it was packed with images.

He wondered why they were all hanging around at this one beach.

Then he saw a sign:

"Google ViT Training today - Bring your image for free TPU credits!!!"


Luckily Huggy got a tip from a local about another nearby beach.

When he got there, he was relieved - way fewer images.

But something was strange about them.

He looked around and saw a sign, again:

"FAIR MAE Training today - Bring your image for free Instagram followers!!!"


Then he saw a booth offering free decoder glasses. He put them on and was amazed.


With his new decoder glasses, Huggy spent a nice day at the beach.

In the afternoon there was a classification competition between the vanilla ViT and MAE.

Huggy was sure that MAE had no chance.

It was a close result, but in the end MAE managed to outperform vanilla ViT.


Vision Transformer (ViT) Recap

  • Published in 2020 by Google
  • First Transformer to really succeed on vision tasks, competitive with state-of-the-art CNNs
  • Needs huge datasets (e.g. JFT-300M) for good results


Masked Autoencoders

  • Not a new Vision Transformer model
  • A framework for effectively pre-training existing ViT(-like) transformers
  • Self-supervised (like MoCo, BEiT, DINO, SimMIM) - see the training-step sketch below
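At a high level, one pre-training step works as in the sketch below - a rough PyTorch-style outline, not the authors' code. `patchify`, `random_masking`, `encoder`, and `decoder` are assumed helpers here; the following slides sketch the masking, encoder, and decoder pieces individually.

```python
def mae_step(images, patchify, random_masking, encoder, decoder):
    """One MAE pre-training step (illustrative outline, not the official code)."""
    # 1. Split each image into non-overlapping patches: (B, N, patch_dim).
    patches = patchify(images)

    # 2. Randomly mask a large fraction of the patches (75% by default).
    visible, mask, ids_restore = random_masking(patches, mask_ratio=0.75)

    # 3. Encode only the visible patches - this is what makes MAE cheap to train.
    latent = encoder(visible)

    # 4. Decode latent tokens + mask tokens into pixel predictions for all patches.
    pred = decoder(latent, ids_restore)

    # 5. Mean squared error, averaged over the masked patches only.
    loss = (((pred - patches) ** 2).mean(dim=-1) * mask).sum() / mask.sum()
    return loss
```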


Masked Autoencoders

[Diagram: the classic autoencoder layout - Encoder → Latent → Decoder]


MAE Architecture


Masking

  • Create patches + positional embeddings
  • Randomly shuffle all patches
  • Mask (drop) the last x% of the shuffled patches - 75% by default; see the sketch below
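In code, the shuffle-and-drop masking comes down to a single `argsort` over random noise. Below is an illustrative PyTorch version (close in spirit to the paper's approach, not copied from the official repo):

```python
import torch

def random_masking(patches, mask_ratio=0.75):
    """Keep a random (1 - mask_ratio) subset of patches per image.

    patches: (B, N, D) patch embeddings, positional embeddings already added.
    Returns the visible patches, a binary mask (1 = masked), and the
    permutation needed to restore the original patch order in the decoder.
    """
    B, N, D = patches.shape
    len_keep = int(N * (1 - mask_ratio))

    noise = torch.rand(B, N, device=patches.device)  # one random score per patch
    ids_shuffle = noise.argsort(dim=1)               # random permutation
    ids_restore = ids_shuffle.argsort(dim=1)         # its inverse

    ids_keep = ids_shuffle[:, :len_keep]             # first (1 - x)% survive
    visible = torch.gather(
        patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))

    mask = torch.ones(B, N, device=patches.device)   # 1 = masked, 0 = visible
    mask[:, :len_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)        # back to original order
    return visible, mask, ids_restore
```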


Encoder

[Diagram: visible patches → Linear Projection → stacked Transformer encoder blocks → Layer Normalization]
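A minimal encoder sketch: a standard ViT that only ever sees the visible patches. The depth/width settings below are illustrative defaults, not the paper's exact configuration:

```python
import torch.nn as nn

class MAEEncoder(nn.Module):
    """Standard ViT encoder applied to the visible patches only."""

    def __init__(self, patch_dim, dim=768, depth=12, heads=12):
        super().__init__()
        self.proj = nn.Linear(patch_dim, dim)      # linear patch projection
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.norm = nn.LayerNorm(dim)              # final layer normalization

    def forward(self, visible_patches):            # (B, N_visible, patch_dim)
        x = self.proj(visible_patches)             # embed visible patches
        # (positional-embedding handling is omitted for brevity)
        x = self.blocks(x)
        return self.norm(x)                        # (B, N_visible, dim)
```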


Decoder

[Diagram: encoded tokens + mask tokens → stacked Transformer blocks → Layer Normalization → pixel predictions]
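The decoder is deliberately lightweight: it fills the masked positions with a shared, learned mask token, restores the original patch order, and predicts raw pixels. A sketch, assuming `ids_restore` from the masking step above (dimensions are illustrative):

```python
import torch
import torch.nn as nn

class MAEDecoder(nn.Module):
    """Lightweight decoder that reconstructs pixels for every patch."""

    def __init__(self, enc_dim=768, dim=512, depth=8, heads=16, patch_dim=768):
        super().__init__()
        self.embed = nn.Linear(enc_dim, dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))  # shared, learned
        # MAE's decoder is architecturally just a stack of standard
        # Transformer blocks, so encoder layers are used here.
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, patch_dim)        # predict raw pixel values

    def forward(self, latent, ids_restore):          # latent: (B, N_vis, enc_dim)
        x = self.embed(latent)
        B, N_vis, D = x.shape
        N = ids_restore.shape[1]
        # Append one mask token per masked position, then undo the shuffle so
        # every token sits at its original patch location again.
        mask_tokens = self.mask_token.expand(B, N - N_vis, -1)
        x = torch.cat([x, mask_tokens], dim=1)
        x = torch.gather(x, 1, ids_restore.unsqueeze(-1).expand(-1, -1, D))
        # (decoder positional embeddings are omitted for brevity)
        x = self.blocks(x)
        return self.head(self.norm(x))               # (B, N, patch_dim)
```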


Reconstruction

[Figure: original image vs. MAE reconstruction; the loss is the Mean Squared Error between them]
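The loss is plain MSE between predicted and true pixels, computed only on the masked patches. The sketch below also includes the per-patch pixel normalization of the targets that the paper found to improve results (`mask` as returned by the masking sketch above):

```python
def mae_loss(pred, target, mask, norm_pix=True):
    """MSE over masked patches only. pred/target: (B, N, D), mask: (B, N)."""
    if norm_pix:
        # Normalize each target patch by its own pixel mean and variance.
        mean = target.mean(dim=-1, keepdim=True)
        var = target.var(dim=-1, keepdim=True)
        target = (target - mean) / (var + 1e-6).sqrt()
    loss = ((pred - target) ** 2).mean(dim=-1)       # per-patch MSE: (B, N)
    return (loss * mask).sum() / mask.sum()          # average over masked only
```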


Classification

[Diagram: Encoder → Latent → Classification Head; the decoder is only needed for pre-training and is discarded afterwards]
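For downstream classification, the pre-trained decoder is thrown away and a classification head is attached to the encoder output. A minimal fine-tuning sketch, reusing the `MAEEncoder` from above (the mean-pooling choice is illustrative):

```python
import torch.nn as nn

class ViTClassifier(nn.Module):
    """Pre-trained MAE encoder plus a linear head for classification."""

    def __init__(self, pretrained_encoder, dim=768, num_classes=1000):
        super().__init__()
        self.encoder = pretrained_encoder     # weights from MAE pre-training
        self.head = nn.Linear(dim, num_classes)

    def forward(self, patches):               # all patches - no masking now
        tokens = self.encoder(patches)        # (B, N, dim)
        return self.head(tokens.mean(dim=1))  # global average pooling
```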

Reconstruction Examples

[Figures: example image reconstructions]

Performance

[Plots: results under Linear Probing and Fine-Tuning]


Performance - Comparison


Performance - Mask Sampling


Performance - Masking Ratio


Performance - Decoder Size


Performance - Augmentation


Performance - Decoder Size


Object Detection and Segmentation


Semantic Segmentation


Thank you!
