Transformers in Computer Vision

Seyed Iman Mirzadeh

Introduction

  • Architectural progress in deep learning is very slow.
  • We are still mostly using MLPs, CNNs, and RNNs from the 1990s.
  • In 2017, Transformers arrived and became the de facto standard in NLP (almost no one uses RNNs in NLP anymore).
  • It is interesting to see whether they are useful in computer vision, too.

Background: Transformers
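
At the core of every Transformer block is scaled dot-product self-attention: each token produces query, key, and value vectors, and its output is an average of the values, weighted by query-key similarity. A minimal single-head sketch in PyTorch (the dimensions and initialization here are illustrative assumptions, not values from any paper):

    import torch
    import torch.nn.functional as F

    def self_attention(x, w_q, w_k, w_v):
        # x: (batch, seq_len, d_model); w_*: (d_model, d_k) projection weights
        q, k, v = x @ w_q, x @ w_k, x @ w_v
        scores = q @ k.transpose(-2, -1) / (k.shape[-1] ** 0.5)  # (batch, seq, seq)
        weights = F.softmax(scores, dim=-1)   # each token attends to every token
        return weights @ v                    # (batch, seq_len, d_k)

    d_model = 64
    w_q, w_k, w_v = (torch.randn(d_model, d_model) / d_model ** 0.5 for _ in range(3))
    tokens = torch.randn(2, 10, d_model)          # batch of 2 sequences, 10 tokens each
    out = self_attention(tokens, w_q, w_k, w_v)   # -> (2, 10, 64)

A full block adds multiple heads, residual connections, layer normalization, and a feed-forward MLP, but attention is the piece that lets every token look at every other token in a single step.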

Vision Transformer (ViT)

How does ViT Work?
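
In short: split the image into fixed-size patches, flatten and linearly project each patch into a token embedding, add learned position embeddings, prepend a learnable [class] token, run the sequence through a standard Transformer encoder, and classify from the [class] token’s output. A rough PyTorch sketch (sizes are ViT-Base-like assumptions; PyTorch’s stock encoder layer also differs in details, e.g. post-norm and ReLU instead of the paper’s pre-norm and GELU):

    import torch
    import torch.nn as nn

    class TinyViT(nn.Module):
        def __init__(self, image_size=224, patch=16, dim=768, depth=12,
                     heads=12, num_classes=1000):
            super().__init__()
            num_patches = (image_size // patch) ** 2
            # A strided conv == "flatten each patch + shared linear projection".
            self.to_tokens = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
            self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
            self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
            layer = nn.TransformerEncoderLayer(dim, heads, dim * 4,
                                               batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, depth)
            self.head = nn.Linear(dim, num_classes)

        def forward(self, images):                    # (B, 3, 224, 224)
            x = self.to_tokens(images)                # (B, dim, 14, 14)
            x = x.flatten(2).transpose(1, 2)          # (B, 196, dim)
            cls = self.cls_token.expand(x.size(0), -1, -1)
            x = torch.cat([cls, x], dim=1) + self.pos_embed
            x = self.encoder(x)                       # (B, 197, dim)
            return self.head(x[:, 0])                 # logits from [class] token

    logits = TinyViT()(torch.randn(1, 3, 224, 224))   # shape (1, 1000)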

Wait!

Couldn’t anyone have thought of breaking images into patches?

Yes. But that’s not the magic of ViT. The magic comes from pre-training large models on ridiculously gigantic datasets!

The model is pre-trained on multiple large-scale datasets (deduplicated against the downstream test sets) and then fine-tuned on smaller downstream tasks (a sketch of the fine-tuning step follows the list):

  • ILSVRC-2012 ImageNet dataset with 1k classes and 1.3M images
  • ImageNet-21k with 21k classes and 14M images
  • JFT with 18k classes and 303M high-resolution images
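
For the downstream step, the paper removes the pre-trained prediction head and attaches a zero-initialized linear layer sized for the new label set, then fine-tunes the whole network. Assuming the TinyViT sketch from earlier (the class counts and learning rate here are illustrative):

    # Pre-trained on a large source dataset (e.g., ImageNet-21k, ~21k classes)
    model = TinyViT(num_classes=21843)
    # ... load pre-trained weights here ...

    # Swap in a zero-initialized head for a 10-class downstream task.
    model.head = nn.Linear(768, 10)
    nn.init.zeros_(model.head.weight)
    nn.init.zeros_(model.head.bias)

    # Fine-tune the whole network with SGD + momentum, as in the paper.
    optimizer = torch.optim.SGD(model.parameters(), lr=3e-3, momentum=0.9)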

Results

Results (2)

Punchline of the Paper

  • In the over-parameterized training regime, with pre-training on ridiculously large datasets, it doesn’t matter whether we use CNNs or Transformers.

Why did I choose this paper?

  • I personally believe Transformers are not the answer, and an architecture closer to CNNs will be better for vision tasks.

  • But this is a win-win scenario for me in the long run:
    • Either I’m wrong, and it’s interesting to find out how researchers will overcome the challenges to make Transformers better than CNNs.
    • Or I’m right, and I learn how researchers [unintentionally] sell their wrong ideas! (might be a useful skill!)

ViT on Smaller Datasets

Transformers in Vision: Other Attempts

DETR: Results