1 of 12

Reporter : Bor-Kai Pan

Advisor : Dr. Yuan-Kai Wang

Intelligent System Laboratory of Electrical Engineering Department,

Fu Jen Catholic University

2024/06/19

Vision Transformer

Alexey Dosovitskiy∗,†, Lucas Beyer∗, Alexander Kolesnikov∗, Dirk Weissenborn∗,

Xiaohua Zhai∗, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer,

Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby∗,†

∗equal technical contribution, †equal advising

Google Research, Brain Team

ICLR 2021

2 of 12

Outline

  • Introduction
  • Method
  • Experiments
  • Conclusions


3 of 12

I. Introduction

  • The Transformer architecture has become the standard for natural language processing, but its applications in computer vision remain limited. In vision, attention mechanisms are usually either combined with convolutional networks or used to replace certain components while keeping the overall structure intact.
  • This study shows that a pure Transformer applied directly to sequences of image patches can perform very well on image classification tasks.
  • When pre-trained on large datasets and transferred to mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
  • The study splits each image into patches and feeds the sequence of linear embeddings of these patches to the Transformer (a minimal patch-embedding sketch follows at the end of this slide).
  • When trained on mid-sized datasets, these models reach modest accuracies slightly below ResNets of comparable size.
  • However, when trained on larger datasets (14M-300M images), ViT's performance improves significantly.
  • Pre-trained on large datasets such as ImageNet-21k or JFT-300M, ViT approaches or surpasses state-of-the-art performance on multiple image recognition benchmarks.
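
To make the patch-embedding pipeline concrete, here is a minimal PyTorch sketch (not the authors' code; the 224x224 input, 16x16 patches, 768-dimensional embeddings, and module names are illustrative assumptions roughly matching ViT-Base):

import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, image_size=224, patch_size=16, in_channels=3, dim=768):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # A convolution with kernel = stride = patch size is equivalent to
        # flattening each patch and applying one shared linear projection.
        self.proj = nn.Conv2d(in_channels, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))

    def forward(self, x):                        # x: (B, C, H, W)
        x = self.proj(x)                         # (B, dim, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)         # (B, N, dim): sequence of patch embeddings
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)           # prepend a learnable [class] token
        return x + self.pos_embed                # add position embeddings

# The resulting token sequence is fed to a standard Transformer encoder.
tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
encoder_layer = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
print(encoder(tokens).shape)                     # torch.Size([2, 197, 768])

In the paper, this token sequence (patch embeddings plus class token and position embeddings) is exactly what the Transformer encoder consumes, and the class token's final representation is used for classification.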


4 of 12

I. Introduction

  • Transformers were originally proposed by Vaswani et al. (2017) for machine translation and have become the state-of-the-art method in many NLP tasks.


5 of 12

I. Introduction

  • Transfer Learning
  • Pretraining is the act of training a model from scratch: the weights are randomly initialized, and training starts without any prior knowledge.
  • Fine-tuning is training done after a model has been pretrained. Because the pretrained model was already trained on a dataset that shares similarities with the fine-tuning dataset, fine-tuning can take advantage of the knowledge acquired during pretraining (for instance, in NLP the pretrained model will have some statistical understanding of the language used in your task).
  • Since the pretrained model was already trained on a lot of data, fine-tuning requires far less data to get decent results (a fine-tuning sketch follows at the end of this slide).
  • For the same reason, the amount of time and resources needed to get good results is much lower.
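
As a concrete illustration of this workflow, the sketch below loads an ImageNet-pretrained ViT-B/16 from torchvision, replaces its classification head, and fine-tunes it on a smaller 100-class dataset; the model choice, hyperparameters, and single training pass are illustrative assumptions, not the procedure from these slides:

import torch
import torch.nn as nn
from torchvision import datasets
from torchvision.models import vit_b_16, ViT_B_16_Weights

# Load ViT-B/16 weights pretrained on ImageNet-1k.
weights = ViT_B_16_Weights.IMAGENET1K_V1
model = vit_b_16(weights=weights)

# Replace the classification head for a 100-class target task (e.g. CIFAR-100).
model.heads.head = nn.Linear(model.heads.head.in_features, 100)

# Reuse the preprocessing that matches the pretrained weights.
preprocess = weights.transforms()
train_set = datasets.CIFAR100(root="./data", train=True, download=True, transform=preprocess)
loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

# Fine-tune: every weight starts from the pretrained checkpoint, so far less
# data and compute are needed than training from scratch.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()
model.train()
for images, labels in loader:      # one step shown; run several epochs in practice
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    break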


6 of 12

II. Method


7 of 12

II. Method


8 of 12

II. Method


9 of 12

II. Method


10 of 12

III. Experiments

  • Environment:
  • OS: Ubuntu 18.04.6
  • Python: 3.9.10
  • PyTorch: torch 2.3.1+cu121, torchvision 0.18.1, vit-pytorch 1.6.9
  • CPU: AMD Ryzen 7 3700X 8-Core Processor
  • GPU: NVIDIA GeForce RTX 3060
  • Dataset:
  • The CIFAR-10 dataset consists of 60,000 32x32 colour images in 10 classes, with 6,000 images per class; there are 50,000 training images and 10,000 test images.
  • The CIFAR-100 dataset is just like CIFAR-10, except it has 100 classes containing 600 images each; there are 500 training images and 100 test images per class.
  • ImageNet-1k (used for pre-training) spans 1,000 object classes and contains 1,281,167 training images, 50,000 validation images, and 100,000 test images (a setup sketch follows this list).
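
As a rough sketch of how this environment might be exercised, the snippet below loads CIFAR-10 with torchvision and builds a small ViT with vit-pytorch; the patch size, depth, and other hyperparameters are illustrative assumptions, not the exact configuration used in these experiments:

import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
from vit_pytorch import ViT  # vit-pytorch 1.6.9

# CIFAR-10: 60,000 32x32 colour images in 10 classes (50,000 train / 10,000 test).
transform = transforms.ToTensor()
train_set = datasets.CIFAR10(root="./data", train=True, download=True, transform=transform)
test_set = datasets.CIFAR10(root="./data", train=False, download=True, transform=transform)
train_loader = DataLoader(train_set, batch_size=128, shuffle=True)

# A small ViT sized for 32x32 inputs (assumed hyperparameters).
device = "cuda" if torch.cuda.is_available() else "cpu"   # e.g. the RTX 3060 above
model = ViT(
    image_size=32,
    patch_size=4,      # (32/4)^2 = 64 patches per image
    num_classes=10,
    dim=256,
    depth=6,
    heads=8,
    mlp_dim=512,
    dropout=0.1,
    emb_dropout=0.1,
).to(device)

print(sum(p.numel() for p in model.parameters()), "trainable parameters")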


11 of 12

III. Experiments


12 of 12

IV. Conclusions

  • Investigated the direct application of Transformers to image recognition.
  • Interpreted images as sequences of patches and processed them with a standard Transformer encoder, as used in NLP.
  • This simple yet scalable strategy works surprisingly well when coupled with pre-training on large datasets.
  • Vision Transformer matches or exceeds the state of the art on many image classification datasets while being relatively cheap to pre-train.
  • Remaining challenges:
  • Applying ViT to other computer vision tasks such as detection and segmentation.
  • Continuing to explore self-supervised pre-training methods: initial experiments showed improvement from self-supervised pre-training, but a large gap remains compared to large-scale supervised pre-training.
  • Further scaling of ViT would likely lead to improved performance.
