Transformers
in Computer Vision
Seyed Iman Mirzadeh
Introduction
Background: Transformers
Vision Transformer (ViT)
How does ViT Work?
Wait!
Couldn’t anyone else have thought of breaking images into patches?
Yes. But that’s not the magic of ViT. The magic comes from pre-training large models on ridiculously gigantic datasets!
The model is pre-trained on multiple large-scale datasets (with deduplication), then fine-tuned on smaller datasets for downstream tasks.
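To make the "breaking images into patches" idea concrete, here is a minimal sketch of ViT-style patch embedding: split an image into non-overlapping patches, flatten each one, and apply a linear projection to get a sequence of tokens. The function name, patch size, and the randomly initialized projection are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def image_to_patch_embeddings(image, patch_size=16, embed_dim=8, seed=0):
    """Split an (H, W, C) image into non-overlapping patches, flatten each,
    and apply an (illustrative, randomly initialized) linear projection."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    # Rearrange into a grid of patches: (num_patches, patch_size*patch_size*c)
    patches = (image
               .reshape(h // patch_size, patch_size, w // patch_size, patch_size, c)
               .transpose(0, 2, 1, 3, 4)
               .reshape(-1, patch_size * patch_size * c))
    # Stand-in for the learned projection; in ViT this is trained end to end.
    rng = np.random.default_rng(seed)
    projection = rng.standard_normal((patches.shape[1], embed_dim))
    return patches @ projection  # shape: (num_patches, embed_dim)

img = np.zeros((224, 224, 3))
tokens = image_to_patch_embeddings(img)
print(tokens.shape)  # (196, 8): a 14x14 grid of patch tokens
```

The resulting token sequence (plus positional embeddings and a class token, omitted here) is what the standard Transformer encoder consumes.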
Results
Results (2)
Punchline of the Paper
Why did I choose this paper?
ViT on Smaller Datasets
Transformers in Vision: Other Attempts
DETR: Results