1 of 29

Efficient Visual Self-Attention

Efficiency, Generalization, and Paradigm Shift

Shen Zhuoran, Nov. 1, 2021

2 of 29

Overview

  • Efficient attention: from quadratic to linear complexities.
  • Global context module: efficient streaming attention for videos.
  • Global self-attention network: going all-in on attention.

3 of 29

Motivation for attention

  • Long-range dependency modeling.
  • Content-adaptive connections.

4 of 29

Dot-product attention

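A minimal numpy sketch of standard (scaled) dot-product attention; the 1/sqrt(d_k) scaling follows the Transformer convention and is omitted in some non-local-module variants:

```python
import numpy as np

def dot_product_attention(q, k, v):
    """Standard dot-product attention.

    q, k: (n, d_k) queries and keys, v: (n, d_v) values, where n is the
    number of positions. The n x n attention map makes both memory and
    computation quadratic in n.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])         # (n, n) pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over key positions
    return weights @ v                              # (n, d_v) aggregated values
```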

5 of 29

Efficient attention

6 of 29

Efficient attention
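A minimal numpy sketch of the linear-complexity idea with softmax normalization, as I understand the published efficient-attention formulation (queries normalized over the key dimension, keys over positions, and the key-value aggregation done first):

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def efficient_attention(q, k, v):
    """Linear-complexity attention.

    q, k: (n, d_k), v: (n, d_v). Instead of forming the n x n map
    softmax(q k^T), normalize q and k separately and aggregate keys with
    values first; the cost is O(n * d_k * d_v) instead of O(n^2).
    """
    q_norm = softmax(q, axis=-1)  # per position, over the d_k channels
    k_norm = softmax(k, axis=0)   # per channel, over the n positions
    context = k_norm.T @ v        # (d_k, d_v) global context vectors
    return q_norm @ context       # (n, d_v)
```

Each column of k_norm acts as a global attention map, so context holds d_k global context vectors that every query position mixes according to its normalized query; this is the interpretation behind the linear cost.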

7 of 29

Interpretation


8 of 29

Empirical comparison vs. the non-local module

9 of 29

Additional results

Stereo depth estimation (Scene Flow)

Temporal action localization (THUMOS14)

10 of 29

Global context module

11 of 29

Semi-supervised video object segmentation

12 of 29

Deep learning approaches

  • Online learning
    • Accurate but slow
  • Offline learning
    • Template matching
      • Uses the first frame as a template; prone to failure under appearance changes
    • Mask propagation
      • Propagates the mask from the previous frame; prone to error accumulation
    • Hybrid methods

13 of 29

Space-time memory module
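A rough numpy sketch of the memory read in an STM-style module; the published module also concatenates the read-out with the query frame's own value features, which is omitted here:

```python
import numpy as np

def stm_read(mem_k, mem_v, query_k):
    """Space-time memory read.

    mem_k: (T*H*W, d_k) keys of all memorized frames, mem_v: (T*H*W, d_v)
    their values, query_k: (H*W, d_k) keys of the current frame. Every query
    position attends to every space-time memory position, so both the stored
    memory and the per-frame computation grow with the number of frames T.
    """
    logits = query_k @ mem_k.T                                  # (H*W, T*H*W)
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)              # softmax over memory positions
    return weights @ mem_v                                      # (H*W, d_v) read-out
```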

14 of 29

Global context module
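A rough numpy sketch of a global-context-style memory with constant size; the normalization choices here are assumptions and may differ from the published GC module:

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class GlobalContext:
    """Fixed-size memory for streaming video.

    Instead of storing keys and values of every past frame, accumulate their
    (d_k, d_v) outer-product summary. Memory stays constant regardless of how
    many frames are added, and reading a frame costs O(H*W * d_k * d_v).
    """
    def __init__(self, d_k, d_v):
        self.context = np.zeros((d_k, d_v))
        self.frames = 0

    def update(self, k, v):
        """Fold a newly segmented frame in. k: (H*W, d_k), v: (H*W, d_v)."""
        k_norm = softmax(k, axis=0)      # attention over this frame's positions
        self.context += k_norm.T @ v     # constant-size update
        self.frames += 1

    def read(self, q):
        """q: (H*W, d_k) of the current frame -> (H*W, d_v) context features."""
        return (q @ self.context) / max(self.frames, 1)  # averaging over frames is an assumption
```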

15 of 29

GC module vs. STM module

Complexity

  • Memory: STM stores keys and values for every memorized frame, so its memory grows with video length; GC keeps a fixed-size context, so its memory is constant.
  • Computation: STM's per-frame read also grows with the number of memorized frames; GC's per-frame cost is constant.

16 of 29

Empirical results

17 of 29

Visualizations

18 of 29

Global Self-Attention Networks

19 of 29

Motivations for fully-attentional modeling

  • Elimination of semanticity-resolution trade-off
  • Representation learning
  • Ability to simulate convolution (Attention vs. CNN)

20 of 29

Goal

Bottleneck block

Attention bottleneck block
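A hypothetical PyTorch sketch of the goal: keep the ResNet bottleneck structure but let an attention layer replace the 3x3 spatial convolution. The injected attention_layer is a placeholder (e.g., a GSA-style module) assumed to preserve the channel count and spatial size:

```python
import torch.nn as nn

class AttentionBottleneck(nn.Module):
    """ResNet-style bottleneck with attention in place of the 3x3 convolution."""

    def __init__(self, channels, bottleneck_channels, attention_layer):
        super().__init__()
        self.reduce = nn.Conv2d(channels, bottleneck_channels, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(bottleneck_channels)
        self.attend = attention_layer   # replaces the 3x3 spatial convolution
        self.bn2 = nn.BatchNorm2d(bottleneck_channels)
        self.expand = nn.Conv2d(bottleneck_channels, channels, 1, bias=False)
        self.bn3 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.reduce(x)))
        out = self.relu(self.bn2(self.attend(out)))   # global spatial mixing
        out = self.bn3(self.expand(out))
        return self.relu(out + x)                     # residual connection
```

Stacking such blocks in place of the convolutional bottlenecks is what turns a ResNet-style backbone into a fully attentional, GSA-Net-style model.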

21 of 29

Efficient attention

  • Limitation: efficient attention by itself captures no spatial (positional) information.


22 of 29

Encoding positional information



23 of 29

Encoding relative positions for self-attention

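A hedged numpy sketch of one common way to encode relative positions in self-attention: learned relative-position embeddings added to the attention logits, shown here for a 1D sequence. The exact form used in the model discussed here may differ:

```python
import numpy as np

def relative_position_attention_1d(q, k, v, rel_emb):
    """1D self-attention with relative-position logits.

    q, k: (n, d), v: (n, d_v).
    rel_emb: (2*n - 1, d) embeddings for offsets -(n-1) .. (n-1);
    rel_emb[offset + n - 1] encodes relative offset `offset`.
    The logit between positions i and j is q_i . k_j + q_i . r_{j-i}.
    """
    n, d = q.shape
    content = q @ k.T                                          # (n, n) content term
    offsets = np.arange(n)[None, :] - np.arange(n)[:, None]    # (n, n) values of j - i
    rel = rel_emb[offsets + n - 1]                             # (n, n, d) gathered r_{j-i}
    positional = np.einsum('id,ijd->ij', q, rel)               # (n, n) positional term
    logits = (content + positional) / np.sqrt(d)
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```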


24 of 29

Axial attention

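A simplified numpy sketch of axial attention over a 2D feature map: attend along columns, then along rows. Real designs recompute queries and keys for each pass and add relative-position terms; both are omitted here:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def axial_attention(q, k, v):
    """q, k: (H, W, d), v: (H, W, d_v).

    Full 2D attention costs O((H*W)^2); attending along each column and then
    along each row costs O(H*W*(H + W)) while still letting every position
    influence every other position after the two passes.
    """
    scale = np.sqrt(q.shape[-1])
    # Column pass: each pixel attends to the pixels in its column.
    col_logits = np.einsum('hwd,gwd->hwg', q, k) / scale       # (H, W, H)
    col_out = np.einsum('hwg,gwe->hwe', softmax(col_logits), v)
    # Row pass: each pixel attends to the pixels in its row.
    row_logits = np.einsum('hwd,hud->hwu', q, k) / scale       # (H, W, W)
    return np.einsum('hwu,hue->hwe', softmax(row_logits), col_out)
```

As I understand the published GSA design, such axial attention with relative-position embeddings forms the positional branch of the GSA module, alongside a content branch based on efficient attention.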


25 of 29

GSA module


26 of 29

GSA-Net


Bottleneck block

GSA-Bottleneck block

27 of 29

Results on ImageNet


28 of 29

Discussion

  • Attentional networks exhibit higher capacity.
  • Attentional networks learn faster.
  • Axial attention is slow for batched inference on accelerators.


29 of 29

Thank you