1 of 12

Reporter : Bor-Kai Pan

Advisor : Dr. Yuan-Kai Wang

Intelligent System Laboratory of Electrical Engineering Department,

Fu Jen Catholic University

2024/06/19

Vision Transformer

Alexey Dosovitskiy∗,†, Lucas Beyer∗, Alexander Kolesnikov∗, Dirk Weissenborn∗,

Xiaohua Zhai∗, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer,

Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby∗,†

∗equal technical contribution, †equal advising

Google Research, Brain Team

ICLR 2021

2 of 12

Outline

  • Introduction
  • Method
  • Experiments
  • Conclusions


3 of 12

I. Introduction

  • The Transformer architecture has become the standard for natural language processing, but its applications in computer vision remain limited. In vision, attention mechanisms are usually either combined with convolutional networks or used to replace certain components while keeping the overall structure intact.
  • This study shows that a pure Transformer applied directly to sequences of image patches can perform very well on image classification tasks.
  • When pre-trained on large datasets and transferred to mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
  • The study splits each image into patches and feeds the sequence of linear embeddings of these patches to the Transformer (a minimal patch-embedding sketch follows at the end of this slide).
  • When trained on mid-sized datasets, these models reach modest accuracies slightly below ResNets of comparable size.
  • However, when trained on larger datasets (14M-300M images), ViT's performance improves significantly.
  • Pre-trained on large datasets such as ImageNet-21k or JFT-300M, ViT approaches or surpasses state-of-the-art performance on multiple image recognition benchmarks.
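
To make the patch-embedding pipeline concrete, here is a minimal PyTorch sketch (not the authors' code; the 224x224 input, 16x16 patches, 768-dimensional embeddings, and module names are illustrative assumptions roughly matching ViT-Base):

import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, image_size=224, patch_size=16, in_channels=3, dim=768):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # A convolution with kernel = stride = patch size is equivalent to
        # flattening each patch and applying one shared linear projection.
        self.proj = nn.Conv2d(in_channels, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))

    def forward(self, x):                        # x: (B, C, H, W)
        x = self.proj(x)                         # (B, dim, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)         # (B, N, dim): sequence of patch embeddings
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)           # prepend a learnable [class] token
        return x + self.pos_embed                # add position embeddings

# The resulting token sequence is fed to a standard Transformer encoder.
tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
encoder_layer = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
print(encoder(tokens).shape)                     # torch.Size([2, 197, 768])

In the paper, this token sequence (patch embeddings plus class token and position embeddings) is exactly what the Transformer encoder consumes, and the class token's final representation is used for classification.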


4 of 12

I. Introduction

  • Transformers were originally proposed by Vaswani et al. (2017) for machine translation and have become the state-of-the-art method in many NLP tasks.


5 of 12

I. Introduction

  • Transfer Learning
  • Pretraining is the act of training a model from scratch: the weights are randomly initialized, and training starts without any prior knowledge.
  • Fine-tuning is training done after a model has been pretrained. Because the pretrained model was already trained on a dataset that shares similarities with the fine-tuning dataset, fine-tuning can take advantage of the knowledge acquired during pretraining (for instance, in NLP the pretrained model will have some statistical understanding of the language used in your task).
  • Since the pretrained model was already trained on a lot of data, fine-tuning requires far less data to get decent results (a fine-tuning sketch follows at the end of this slide).
  • For the same reason, the amount of time and resources needed to get good results is much lower.
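
As a concrete illustration of this workflow, the sketch below loads an ImageNet-pretrained ViT-B/16 from torchvision, replaces its classification head, and fine-tunes it on a smaller 100-class dataset; the model choice, hyperparameters, and single training pass are illustrative assumptions, not the procedure from these slides:

import torch
import torch.nn as nn
from torchvision import datasets
from torchvision.models import vit_b_16, ViT_B_16_Weights

# Load ViT-B/16 weights pretrained on ImageNet-1k.
weights = ViT_B_16_Weights.IMAGENET1K_V1
model = vit_b_16(weights=weights)

# Replace the classification head for a 100-class target task (e.g. CIFAR-100).
model.heads.head = nn.Linear(model.heads.head.in_features, 100)

# Reuse the preprocessing that matches the pretrained weights.
preprocess = weights.transforms()
train_set = datasets.CIFAR100(root="./data", train=True, download=True, transform=preprocess)
loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

# Fine-tune: every weight starts from the pretrained checkpoint, so far less
# data and compute are needed than training from scratch.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()
model.train()
for images, labels in loader:      # one step shown; run several epochs in practice
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    break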


6 of 12

II. Method


7 of 12

II. Method


8 of 12

II. Method


9 of 12

II. Method


10 of 12

III. Experiments

  • Environment:
  • OS: Ubuntu 18.04.6
  • Python: 3.9.10
  • PyTorch: torch 2.3.1+cu121, torchvision 0.18.1, vit-pytorch 1.6.9
  • CPU: AMD Ryzen 7 3700X 8-Core Processor
  • GPU: NVIDIA GeForce RTX 3060
  • Dataset:
  • The CIFAR-10 dataset consists of 60,000 32x32 colour images in 10 classes, with 6,000 images per class; there are 50,000 training images and 10,000 test images.
  • The CIFAR-100 dataset is just like CIFAR-10, except it has 100 classes containing 600 images each; there are 500 training images and 100 test images per class.
  • ImageNet-1k (used for pre-training) spans 1,000 object classes and contains 1,281,167 training images, 50,000 validation images, and 100,000 test images (a setup sketch follows this list).
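
As a rough sketch of how this environment might be exercised, the snippet below loads CIFAR-10 with torchvision and builds a small ViT with vit-pytorch; the patch size, depth, and other hyperparameters are illustrative assumptions, not the exact configuration used in these experiments:

import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
from vit_pytorch import ViT  # vit-pytorch 1.6.9

# CIFAR-10: 60,000 32x32 colour images in 10 classes (50,000 train / 10,000 test).
transform = transforms.ToTensor()
train_set = datasets.CIFAR10(root="./data", train=True, download=True, transform=transform)
test_set = datasets.CIFAR10(root="./data", train=False, download=True, transform=transform)
train_loader = DataLoader(train_set, batch_size=128, shuffle=True)

# A small ViT sized for 32x32 inputs (assumed hyperparameters).
device = "cuda" if torch.cuda.is_available() else "cpu"   # e.g. the RTX 3060 above
model = ViT(
    image_size=32,
    patch_size=4,      # (32/4)^2 = 64 patches per image
    num_classes=10,
    dim=256,
    depth=6,
    heads=8,
    mlp_dim=512,
    dropout=0.1,
    emb_dropout=0.1,
).to(device)

print(sum(p.numel() for p in model.parameters()), "trainable parameters")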


11 of 12

III. Experiments


12 of 12

IV. Conclusions

  • Investigated the direct application of Transformers to image recognition.
  • Interpreted images as sequences of patches and processed them with a standard Transformer encoder, as used in NLP.
  • This simple yet scalable strategy works surprisingly well when coupled with pre-training on large datasets.
  • Vision Transformer matches or exceeds the state of the art on many image classification datasets while being relatively cheap to pre-train.
  • Remaining challenges:
  • Applying ViT to other computer vision tasks such as detection and segmentation.
  • Continuing to explore self-supervised pre-training methods: initial experiments showed improvement from self-supervised pre-training, but a large gap remains compared to large-scale supervised pre-training.
  • Further scaling of ViT would likely lead to improved performance.
