1 of 18

Shuangrui Ding

Oct. 2023

Prune Spatio-temporal Tokens by

Semantic-aware Temporal Accumulation

Master's student, SJTU EE Department

2 of 18

Outline

  • Motivation

  • Semantic-aware Temporal Accumulation Score

  • Experimental Results

  • Take-away

3 of 18

Motivation

Vision Transformers have become the de facto choice for computer vision tasks!

(Image source: Dosovitskiy et al., 2021)

4 of 18

Motivation

However, the global attention design incurs a quadratic computation cost, which is not friendly for real-world deployment.

(Image source: Dosovitskiy et al., 2021)

5 of 18

Motivation

What about Video Transformers?

The computational cost is even larger due to the extra temporal dimension!

e.g., TimeSformer requires 7.14 TFLOPs to achieve 80.7% accuracy on the Kinetics-400 (K400) benchmark.

(Image source: Arnab et al., 2021)

6 of 18

Motivation

Utilizing Transformers for video tasks has boosted performance significantly, but the computational expense has become prohibitive due to the three-dimensional video input.

How can we keep computational costs affordable while preserving the Transformer's performance advantages?

7 of 18

Motivation


Token pruning is a feasible approach to accelerate Transformer, as it can handle a variable number of tokens.

(Image source: Rao et al., 2021)

8 of 18

Motivation

There is little existing work on spatio-temporal token pruning. Our work fills this gap!

Motivated by two interesting phenomena, high temporal redundancy and sparse semantic contribution, we propose the Semantic-aware Temporal Accumulation (STA) score to prune spatio-temporal tokens.

9 of 18

Semantic-aware Temporal Accumulation Score

  • The temporal redundancy of spatio-temporal tokens is high.

We define a temporal redundancy value S, where a larger value indicates higher redundancy.

10 of 18

Semantic-aware Temporal Accumulation Score

  • Tokens containing high semantic information are sparse.

11 of 18

Semantic-aware Temporal Accumulation Score

Based on the high temporal redundancy and low semantic density, we propose the Semantic-aware Temporal Accumulation score (STA) to determine whether to discard each token.

12 of 18

Semantic-aware Temporal Accumulation Score

For high temporal redundancy, we build a simple Markov chain:

  • Define the Temporal Accumulation Score;

  • Compare the similarity of tokens between consecutive frames;

  • Transfer the redundancy score from the previous frame's tokens to the tokens in the subsequent frame in proportion to their similarity;

  • Prioritize discarding tokens with high Temporal Accumulation Scores.
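The steps above can be sketched in NumPy as follows. This is a minimal illustration of the Markov-chain idea only: the cosine-similarity measure, the clipping, and the exact accumulation rule are our assumptions, not the paper's precise formulation.

```python
import numpy as np

def temporal_accumulation_scores(tokens):
    """Propagate redundancy scores frame to frame.

    tokens: array of shape (T, N, D) -- T frames, N tokens per frame,
    D-dimensional embeddings. Returns per-token scores of shape (T, N);
    a larger score means the token is more temporally redundant and
    should be discarded first.
    """
    T, N, _ = tokens.shape
    # Normalize embeddings so dot products become cosine similarities.
    normed = tokens / (np.linalg.norm(tokens, axis=-1, keepdims=True) + 1e-8)
    scores = np.zeros((T, N))
    for t in range(1, T):
        # Similarity of each token in frame t to every token in frame t-1.
        sim = np.clip(normed[t] @ normed[t - 1].T, 0.0, None)  # (N, N)
        # Transition weights proportional to similarity (rows sum to 1).
        w = sim / (sim.sum(axis=1, keepdims=True) + 1e-8)
        # Inherit the previous frame's scores plus one unit of redundancy,
        # scaled by how closely this token matches its best predecessor.
        scores[t] = (w @ (scores[t - 1] + 1.0)) * sim.max(axis=1)
    return scores
```

For a static clip (identical frames) the scores grow frame by frame, while for frames with mutually dissimilar tokens they stay near zero, matching the intuition that only temporally repeated content accumulates redundancy.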

13 of 18

Semantic-aware Temporal Accumulation Score

We assign a semantic importance score to each token through attention maps.

Using this score, we reweight the temporal accumulation scores.
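As a sketch, the reweighting and the final pruning step might look like the following. Using the class token's attention row as the semantic signal and a multiplicative down-weighting are our illustrative assumptions; the paper's exact reweighting function may differ.

```python
import numpy as np

def semantic_reweight(ta_scores, cls_attn):
    """Down-weight the temporal accumulation score of semantically
    important tokens so they are less likely to be pruned.

    cls_attn: attention weights from the class token to each patch token
    (assumed here to act as the semantic importance score).
    """
    sem = cls_attn / (cls_attn.max() + 1e-8)   # normalize to [0, 1]
    return ta_scores * (1.0 - sem)             # high semantics => low score

def prune_tokens(tokens, final_scores, keep_ratio=0.5):
    """Keep the tokens with the lowest final scores; discard the rest."""
    k = int(len(final_scores) * keep_ratio)
    keep = np.sort(np.argsort(final_scores)[:k])  # preserve original order
    return tokens[keep]
```

Note that a token with high temporal redundancy can still survive pruning if its attention weight is large, which is exactly the "semantic-aware" behavior the reweighting is meant to provide.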

14 of 18

Semantic-aware Temporal Accumulation Score

In summary, STA has several advantages:

  • The temporal accumulation design makes the scoring motion-aware, eliminating truly redundant tokens with low semantic content.
  • It's plug-and-play, requiring no additional parameters and no need to retrain the video Transformer.
  • With low computational complexity, it allows for parallel computation and is suitable for modern GPU devices.

STA is efficient and easy to deploy, making it an ideal token pruning solution.

15 of 18

Experimental Results

Kinetics-400

Something-something V2

16 of 18

Experimental Results

Our pruning algorithm preserves semantically rich regions well.

17 of 18

Take-away

  • Spatio-temporal tokens display high temporal redundancy and low semantic density.

  • Our proposed scoring mechanism significantly reduces computation overhead with a negligible accuracy drop.

  • Future work can explore the pruning technique during the training phase.

18 of 18

Many Thanks

Q & A