1 of 29

Efficient Visual Self-Attention

Efficiency, Generalization, and Paradigm Shift

Shen Zhuoran, Nov. 1, 2021

2 of 29

Overview

  • Efficient attention: from quadratic to linear complexities.
  • Global context module: efficient streaming attention for videos.
  • Global self-attention network: going all-in on attention.

3 of 29

Motivation for attention

  • Long-range dependency modeling.
  • Content-adaptive connections.

4 of 29

Dot-product attention

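A minimal numpy sketch of standard (scaled) dot-product attention; the 1/sqrt(d_k) scaling follows the Transformer convention and is omitted in some non-local-module variants:

```python
import numpy as np

def dot_product_attention(q, k, v):
    """Standard dot-product attention.

    q, k: (n, d_k) queries and keys, v: (n, d_v) values, where n is the
    number of positions. The n x n attention map makes both memory and
    computation quadratic in n.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])         # (n, n) pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over key positions
    return weights @ v                              # (n, d_v) aggregated values
```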

5 of 29

Efficient attention

6 of 29

Efficient attention
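A minimal numpy sketch of the linear-complexity idea with softmax normalization, as I understand the published efficient-attention formulation (queries normalized over the key dimension, keys over positions, and the key-value aggregation done first):

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def efficient_attention(q, k, v):
    """Linear-complexity attention.

    q, k: (n, d_k), v: (n, d_v). Instead of forming the n x n map
    softmax(q k^T), normalize q and k separately and aggregate keys with
    values first; the cost is O(n * d_k * d_v) instead of O(n^2).
    """
    q_norm = softmax(q, axis=-1)  # per position, over the d_k channels
    k_norm = softmax(k, axis=0)   # per channel, over the n positions
    context = k_norm.T @ v        # (d_k, d_v) global context vectors
    return q_norm @ context       # (n, d_v)
```

Each column of k_norm acts as a global attention map, so context holds d_k global context vectors that every query position mixes according to its normalized query; this is the interpretation behind the linear cost.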

7 of 29

Interpretation


8 of 29

Empirical comparison vs. the non-local module

9 of 29

Additional results

Stereo depth estimation (Scene Flow)

Temporal action localization (THUMOS14)

10 of 29

Global context module

11 of 29

Semi-supervised video object segmentation

12 of 29

Deep learning approaches

  • Online learning
    • Accurate but slow
  • Offline learning
    • Template matching
      • Uses the first frame as a template; prone to failure under appearance changes
    • Mask propagation
      • Propagates the mask from the previous frame; prone to error accumulation
    • Hybrid methods

13 of 29

Space-time memory module
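A rough numpy sketch of the memory read in an STM-style module; the published module also concatenates the read-out with the query frame's own value features, which is omitted here:

```python
import numpy as np

def stm_read(mem_k, mem_v, query_k):
    """Space-time memory read.

    mem_k: (T*H*W, d_k) keys of all memorized frames, mem_v: (T*H*W, d_v)
    their values, query_k: (H*W, d_k) keys of the current frame. Every query
    position attends to every space-time memory position, so both the stored
    memory and the per-frame computation grow with the number of frames T.
    """
    logits = query_k @ mem_k.T                                  # (H*W, T*H*W)
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)              # softmax over memory positions
    return weights @ mem_v                                      # (H*W, d_v) read-out
```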

14 of 29

Global context module
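A rough numpy sketch of a global-context-style memory with constant size; the normalization choices here are assumptions and may differ from the published GC module:

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class GlobalContext:
    """Fixed-size memory for streaming video.

    Instead of storing keys and values of every past frame, accumulate their
    (d_k, d_v) outer-product summary. Memory stays constant regardless of how
    many frames are added, and reading a frame costs O(H*W * d_k * d_v).
    """
    def __init__(self, d_k, d_v):
        self.context = np.zeros((d_k, d_v))
        self.frames = 0

    def update(self, k, v):
        """Fold a newly segmented frame in. k: (H*W, d_k), v: (H*W, d_v)."""
        k_norm = softmax(k, axis=0)      # attention over this frame's positions
        self.context += k_norm.T @ v     # constant-size update
        self.frames += 1

    def read(self, q):
        """q: (H*W, d_k) of the current frame -> (H*W, d_v) context features."""
        return (q @ self.context) / max(self.frames, 1)  # averaging over frames is an assumption
```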

15 of 29

GC module vs. STM module

Complexity

  • Memory: STM stores keys and values for every memorized frame, so its memory grows with video length; GC keeps a fixed-size context, so its memory is constant.
  • Computation: STM's per-frame read also grows with the number of memorized frames; GC's per-frame cost is constant.

16 of 29

Empirical results

17 of 29

Visualizations

18 of 29

Global Self-Attention Networks

19 of 29

Motivations for fully-attentional modeling

  • Elimination of semanticity-resolution trade-off
  • Representation learning
  • Ability to simulate convolution (Attention vs. CNN)

20 of 29

Goal

Bottleneck block

Attention bottleneck block
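A hypothetical PyTorch sketch of the goal: keep the ResNet bottleneck structure but let an attention layer replace the 3x3 spatial convolution. The injected attention_layer is a placeholder (e.g., a GSA-style module) assumed to preserve the channel count and spatial size:

```python
import torch.nn as nn

class AttentionBottleneck(nn.Module):
    """ResNet-style bottleneck with attention in place of the 3x3 convolution."""

    def __init__(self, channels, bottleneck_channels, attention_layer):
        super().__init__()
        self.reduce = nn.Conv2d(channels, bottleneck_channels, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(bottleneck_channels)
        self.attend = attention_layer   # replaces the 3x3 spatial convolution
        self.bn2 = nn.BatchNorm2d(bottleneck_channels)
        self.expand = nn.Conv2d(bottleneck_channels, channels, 1, bias=False)
        self.bn3 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.reduce(x)))
        out = self.relu(self.bn2(self.attend(out)))   # global spatial mixing
        out = self.bn3(self.expand(out))
        return self.relu(out + x)                     # residual connection
```

Stacking such blocks in place of the convolutional bottlenecks is what turns a ResNet-style backbone into a fully attentional, GSA-Net-style model.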

21 of 29

Efficient attention

  • Limitation: efficient attention by itself captures no spatial (positional) information.


22 of 29

Encoding positional information



23 of 29

Encoding relative positions for self-attention

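A hedged numpy sketch of one common way to encode relative positions in self-attention: learned relative-position embeddings added to the attention logits, shown here for a 1D sequence. The exact form used in the model discussed here may differ:

```python
import numpy as np

def relative_position_attention_1d(q, k, v, rel_emb):
    """1D self-attention with relative-position logits.

    q, k: (n, d), v: (n, d_v).
    rel_emb: (2*n - 1, d) embeddings for offsets -(n-1) .. (n-1);
    rel_emb[offset + n - 1] encodes relative offset `offset`.
    The logit between positions i and j is q_i . k_j + q_i . r_{j-i}.
    """
    n, d = q.shape
    content = q @ k.T                                          # (n, n) content term
    offsets = np.arange(n)[None, :] - np.arange(n)[:, None]    # (n, n) values of j - i
    rel = rel_emb[offsets + n - 1]                             # (n, n, d) gathered r_{j-i}
    positional = np.einsum('id,ijd->ij', q, rel)               # (n, n) positional term
    logits = (content + positional) / np.sqrt(d)
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```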


24 of 29

Axial attention

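A simplified numpy sketch of axial attention over a 2D feature map: attend along columns, then along rows. Real designs recompute queries and keys for each pass and add relative-position terms; both are omitted here:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def axial_attention(q, k, v):
    """q, k: (H, W, d), v: (H, W, d_v).

    Full 2D attention costs O((H*W)^2); attending along each column and then
    along each row costs O(H*W*(H + W)) while still letting every position
    influence every other position after the two passes.
    """
    scale = np.sqrt(q.shape[-1])
    # Column pass: each pixel attends to the pixels in its column.
    col_logits = np.einsum('hwd,gwd->hwg', q, k) / scale       # (H, W, H)
    col_out = np.einsum('hwg,gwe->hwe', softmax(col_logits), v)
    # Row pass: each pixel attends to the pixels in its row.
    row_logits = np.einsum('hwd,hud->hwu', q, k) / scale       # (H, W, W)
    return np.einsum('hwu,hue->hwe', softmax(row_logits), col_out)
```

As I understand the published GSA design, such axial attention with relative-position embeddings forms the positional branch of the GSA module, alongside a content branch based on efficient attention.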


25 of 29

GSA module


26 of 29

GSA-Net


Bottleneck block

GSA-Bottleneck block

27 of 29

Results on ImageNet


28 of 29

Discussion

  • Attentional networks exhibit higher capacity.
  • Attentional networks learn faster.
  • Axial attention is slow for batched inference on accelerators.


29 of 29

Thank you