1 of 18

Understanding The Robustness in Vision Transformers

Daquan Zhou, Zhiding Yu, Enze Xie

Chaowei Xiao, Anima Anandkumar, Jiashi Feng and Jose M. Alvarez

2 of 18

Advances in Visual Recognition

  • Larger Models
  • Faster Computing
  • Bigger Data

3 of 18

Standard Visual Recognition Is Getting Saturated

[Figure: top-performing models on ImageNet-1K]

4 of 18

Challenge – Real World Data Are Imperfect

  • Domain shift
  • Data noise
  • Imbalanced distribution
  • Occlusion
  • Clutter
  • Ambiguity
  • Deceptive appearance
  • …

5 of 18

More Challenging Scenarios

[Figure: sample corruptions from Corrupted ImageNet (ImageNet-C) and COCO-C/Cityscapes-C]

Hendrycks et al., Benchmarking Neural Network Robustness to Common Corruptions and Perturbations, ICLR19
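These benchmarks apply a fixed set of common corruptions (noise, blur, weather, digital artifacts) at graded severities to standard test sets. As a rough illustration only (not the official benchmark code; the per-severity noise scales below are invented for the sketch), a minimal Python example of one such corruption, Gaussian noise:

```python
import numpy as np

def gaussian_noise(image: np.ndarray, severity: int = 1) -> np.ndarray:
    """Add Gaussian noise to a float image in [0, 1].

    The per-severity sigmas here are illustrative placeholders; the real
    ImageNet-C pipeline defines its own constants for each of its 15
    corruption types and 5 severity levels.
    """
    sigmas = [0.04, 0.06, 0.08, 0.09, 0.10]
    noisy = image + np.random.normal(0.0, sigmas[severity - 1], image.shape)
    return np.clip(noisy, 0.0, 1.0)

# Usage: corrupt a toy "image" at severity 3.
img = np.random.rand(224, 224, 3)
corrupted = gaussian_noise(img, severity=3)
```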

6 of 18

How Well Do Current DNNs Perform?

[Figure: ResNet results on image classification and semantic segmentation]

7 of 18

ViTs Are Robust Learners

Bai et al., Are Transformers More Robust Than CNNs? NeurIPS21

Naseer et al., Intriguing Properties of Vision Transformers, NeurIPS21

Mao et al., RVT: Towards Robust Vision Transformer, CVPR22 

Zhang et al., Delving Deep into the Generalization of Vision Transformers under Distribution Shifts, CVPR22 

8 of 18

Delving Deeper into ViT’s Robustness

9 of 18

Visual Grouping and Information Bottleneck

“I stand at the window and see a house, trees, sky. Theoretically I might say there were 327 brightnesses and nuances of colour. Do I have ‘327’? No. I have sky, house, and trees.”

——Max Wertheimer

Visual Grouping

Information Bottleneck (IB)

“Information bottlenecks are extremely interesting. I have to listen to it ten thousand times to really understand it. It's hard to hear such original ideas today. Maybe it's the key to the puzzle.”

——Geoffrey Hinton
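For reference, the IB principle of Tishby et al. seeks a compressed representation Z of the input X that remains predictive of the target Y. Its standard Lagrangian form (a known result, not specific to this talk):

```latex
% Information Bottleneck Lagrangian: compress X into Z while retaining
% information about Y; \beta trades compression against prediction.
\min_{p(z \mid x)} \; \mathcal{L}_{\mathrm{IB}} \;=\; I(X; Z) \;-\; \beta \, I(Z; Y)
```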

10 of 18

Visual Grouping

11 of 18

Spectral Clustering vs. Self-Attention

Image Credit: Jay Alammar, The Illustrated Transformer.

Image Credit: Spectral Clustering for Molecular Emission Segmentation.
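To make the comparison concrete: both methods start from a pairwise affinity matrix over tokens and then normalize it. Self-attention row-normalizes a dot-product similarity with a softmax and aggregates values with it, while spectral clustering symmetrically normalizes a kernel similarity and partitions with the eigenvectors of the resulting graph Laplacian. A minimal NumPy sketch (the toy data and random projections are assumptions for illustration):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))  # 6 toy tokens with 4 features each

# Self-attention: affinity = QK^T / sqrt(d), row-normalized by softmax.
Wq, Wk = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
A_attn = softmax((X @ Wq) @ (X @ Wk).T / np.sqrt(4))

# Spectral clustering: Gaussian-kernel affinity, symmetrically normalized
# into a graph Laplacian whose leading eigenvectors define the clusters.
W = np.exp(-np.sum((X[:, None] - X[None, :]) ** 2, axis=-1))
D_inv_sqrt = np.diag(1.0 / np.sqrt(W.sum(axis=1)))
L = np.eye(6) - D_inv_sqrt @ W @ D_inv_sqrt
eigvals, eigvecs = np.linalg.eigh(L)
```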

12 of 18

Emerging Properties in ViTs

Caron et al., Emerging Properties in Self-Supervised Vision Transformers, ICCV21

Correlation between grouping and robustness over network blocks

13 of 18

The Trinity among Grouping, IB and Robust Generalization

14 of 18

MHSA as a Mixture of IBs
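For reference, the standard multi-head self-attention (MHSA) computation that this view reinterprets; the formula below is the textbook definition, while reading each head as one information-bottleneck component of a mixture is the interpretation this slide develops:

```latex
% Standard multi-head self-attention; each head is interpreted
% as one information-bottleneck component of a mixture.
\mathrm{head}_i = \mathrm{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}}\right) V_i,
\qquad
\mathrm{MHSA}(X) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^{O}
```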

15 of 18

Fully Attentional Network

  • Further deploying the attention mechanism reinforces the clustering phenomenon
      • Foreground objects are better captured
  • Directly applying self-attention along the channel dimension has two drawbacks (see the sketch after this list)
      • Large computational overhead
      • Low parameter efficiency
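A minimal PyTorch-style sketch of the naive channel self-attention critiqued above, which makes the C x C cost explicit (this illustrates the drawback only and is not FAN's actual module; the efficient channel attention is in the NVlabs/FAN repo):

```python
import torch

def naive_channel_attention(x: torch.Tensor) -> torch.Tensor:
    """Self-attention applied along the channel axis of (B, N, C) tokens.

    Treating channels as queries/keys yields a (C x C) attention map per
    sample, quadratic in channel width C, matching the computational and
    parameter-efficiency drawbacks listed above.
    """
    xc = x.transpose(1, 2)                                   # (B, C, N)
    attn = torch.softmax(xc @ xc.transpose(1, 2)
                         / xc.shape[-1] ** 0.5, dim=-1)      # (B, C, C)
    return (attn @ xc).transpose(1, 2)                       # (B, N, C)

out = naive_channel_attention(torch.randn(2, 196, 384))
```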

16 of 18

Main Results – Image Classification

17 of 18

Main Results – Downstream Tasks

18 of 18

Code Available

https://github.com/NVlabs/FAN