Transformers in Computer Vision

Seyed Iman Mirzadeh

Introduction

  • Architectural progress in deep learning is very slow.
  • We are still mostly using MLPs, CNNs, and RNNs from the 1990s.
  • In 2017, Transformers arrived and became the de facto standard in NLP (almost no one uses RNNs in NLP anymore).
  • It is interesting to see whether they are useful in computer vision, too.

Background: Transformers
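
At the core of every Transformer block is scaled dot-product self-attention: each token produces query, key, and value vectors, and its output is an average of the values, weighted by query-key similarity. A minimal single-head sketch in PyTorch (the dimensions and initialization here are illustrative assumptions, not values from any paper):

    import torch
    import torch.nn.functional as F

    def self_attention(x, w_q, w_k, w_v):
        # x: (batch, seq_len, d_model); w_*: (d_model, d_k) projection weights
        q, k, v = x @ w_q, x @ w_k, x @ w_v
        scores = q @ k.transpose(-2, -1) / (k.shape[-1] ** 0.5)  # (batch, seq, seq)
        weights = F.softmax(scores, dim=-1)   # each token attends to every token
        return weights @ v                    # (batch, seq_len, d_k)

    d_model = 64
    w_q, w_k, w_v = (torch.randn(d_model, d_model) / d_model ** 0.5 for _ in range(3))
    tokens = torch.randn(2, 10, d_model)          # batch of 2 sequences, 10 tokens each
    out = self_attention(tokens, w_q, w_k, w_v)   # -> (2, 10, 64)

A full block adds multiple heads, residual connections, layer normalization, and a feed-forward MLP, but attention is the piece that lets every token look at every other token in a single step.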

Vision Transformer (ViT)

How does ViT Work?
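
In short: split the image into fixed-size patches, flatten and linearly project each patch into a token embedding, add learned position embeddings, prepend a learnable [class] token, run the sequence through a standard Transformer encoder, and classify from the [class] token’s output. A rough PyTorch sketch (sizes are ViT-Base-like assumptions; PyTorch’s stock encoder layer also differs in details, e.g. post-norm and ReLU instead of the paper’s pre-norm and GELU):

    import torch
    import torch.nn as nn

    class TinyViT(nn.Module):
        def __init__(self, image_size=224, patch=16, dim=768, depth=12,
                     heads=12, num_classes=1000):
            super().__init__()
            num_patches = (image_size // patch) ** 2
            # A strided conv == "flatten each patch + shared linear projection".
            self.to_tokens = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
            self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
            self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
            layer = nn.TransformerEncoderLayer(dim, heads, dim * 4,
                                               batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, depth)
            self.head = nn.Linear(dim, num_classes)

        def forward(self, images):                    # (B, 3, 224, 224)
            x = self.to_tokens(images)                # (B, dim, 14, 14)
            x = x.flatten(2).transpose(1, 2)          # (B, 196, dim)
            cls = self.cls_token.expand(x.size(0), -1, -1)
            x = torch.cat([cls, x], dim=1) + self.pos_embed
            x = self.encoder(x)                       # (B, 197, dim)
            return self.head(x[:, 0])                 # logits from [class] token

    logits = TinyViT()(torch.randn(1, 3, 224, 224))   # shape (1, 1000)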

Wait!

Couldn’t anyone have thought of breaking images into patches?

Yes. But that’s not the magic of ViT. The magic comes from pre-training large models on ridiculously gigantic datasets!

The model is pre-trained on multiple large-scale datasets (deduplicated against the downstream test sets) and then fine-tuned on smaller downstream tasks (a sketch of the fine-tuning step follows the list):

  • ILSVRC-2012 ImageNet dataset with 1k classes and 1.3M images
  • ImageNet-21k with 21k classes and 14M images
  • JFT with 18k classes and 303M high-resolution images
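
For the downstream step, the paper removes the pre-trained prediction head and attaches a zero-initialized linear layer sized for the new label set, then fine-tunes the whole network. Assuming the TinyViT sketch from earlier (the class counts and learning rate here are illustrative):

    # Pre-trained on a large source dataset (e.g., ImageNet-21k, ~21k classes)
    model = TinyViT(num_classes=21843)
    # ... load pre-trained weights here ...

    # Swap in a zero-initialized head for a 10-class downstream task.
    model.head = nn.Linear(768, 10)
    nn.init.zeros_(model.head.weight)
    nn.init.zeros_(model.head.bias)

    # Fine-tune the whole network with SGD + momentum, as in the paper.
    optimizer = torch.optim.SGD(model.parameters(), lr=3e-3, momentum=0.9)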

Results

Results (2)

Punchline of the Paper

  • In the over-parameterized training regime, with pre-training on ridiculously large datasets, it doesn’t matter whether we use CNNs or Transformers.

Why did I choose this paper?

  • I personally believe Transformers are not the answer, and an architecture closer to CNNs will be better for vision tasks.

  • But this is a win-win scenario for me in the long run:
    • Either I’m wrong, and it’s interesting to find out how researchers will overcome the challenges to make Transformers better than CNNs.
    • Or I’m right, and I learn how researchers [unintentionally] sell their wrong ideas! (might be a useful skill!)

ViT on Smaller Datasets

Transformers in Vision: Other Attempts

DETR: Results