Shuangrui Ding
Oct. 2023
Prune Spatio-temporal Tokens by
Semantic-aware Temporal Accumulation
Master's student, SJTU EE Department
Outline
Motivation
Vision Transformers have become the de facto choice for computer vision tasks!
(Image source: Dosovitskiy et al., 2021)
Motivation
However, the global attention design incurs a quadratic computation cost, which makes real-world deployment difficult.
(Image source: Dosovitskiy et al., 2021)
Motivation
What about Video Transformers?
The computational cost is even larger due to the extra temporal dimension!
e.g., TimeSformer requires 7.14 TFLOPs to achieve 80.7% accuracy on the K400 benchmark.
(Image source: Arnab et al., 2021)
Motivation
Utilizing Transformers for video tasks boosts performance significantly, but the computational expense becomes prohibitive due to the three-dimensional video input.
How can we keep the computational cost affordable while retaining the Transformer's performance advantages?
Motivation
Token pruning is a feasible approach to accelerating Transformers, since they can handle a variable number of tokens.
(Image source: Rao et al., 2021)
Motivation
There is little prior work on spatio-temporal token pruning. Our work fills this gap!
Motivated by two interesting phenomena, high temporal redundancy and sparse semantic contribution, we propose the Semantic-aware Temporal Accumulation (STA) score to prune spatio-temporal tokens.
Semantic-aware Temporal Accumulation Score
We first compute a temporal redundancy value S for each token, where a larger value indicates higher redundancy.
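As a concrete illustration (an assumed formulation, not necessarily the exact one used in STA), one simple way to obtain such a redundancy value is the cosine similarity between each token and the co-located token in the previous frame. The sketch below assumes patch embeddings of shape (B, T, N, D).

```python
# Minimal sketch (illustrative assumption, not the paper's exact definition):
# temporal redundancy as cosine similarity with the co-located token in the
# previous frame.
import torch
import torch.nn.functional as F

def temporal_redundancy(tokens: torch.Tensor) -> torch.Tensor:
    """tokens: (B, T, N, D) patch embeddings for T frames with N spatial tokens.
    Returns S of shape (B, T, N); larger values mean a token mostly repeats what
    the previous frame already contained. The first frame gets zero redundancy."""
    prev = tokens[:, :-1]                              # frames 0 .. T-2
    curr = tokens[:, 1:]                               # frames 1 .. T-1
    sim = F.cosine_similarity(curr, prev, dim=-1)      # (B, T-1, N)
    first = torch.zeros_like(sim[:, :1])               # frame 0 has no predecessor
    return torch.cat([first, sim], dim=1)              # (B, T, N)
```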
Semantic-aware Temporal Accumulation Score
Based on the high temporal redundancy and low semantic density, we propose the Semantic-aware Temporal Accumulation (STA) score to determine whether each token should be discarded.
Semantic-aware Temporal Accumulation Score
To exploit the high temporal redundancy, we build a simple Markov chain that accumulates redundancy along the temporal dimension; a sketch is given below.
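A minimal sketch of what such an accumulation could look like: a first-order recursion in which each frame's state depends only on the previous frame's state (the Markov property), so redundancy that persists over many frames keeps growing. The specific transition rule and decay constant below are illustrative assumptions, not the paper's exact definition.

```python
# Hypothetical Markov-style accumulation: A_t = S_t * (1 + decay * A_{t-1}).
# Both the recursion and the decay value are illustrative choices.
import torch

def accumulate_redundancy(S: torch.Tensor, decay: float = 0.9) -> torch.Tensor:
    """S: (B, T, N) per-frame redundancy. Returns accumulated scores A of the
    same shape; tokens that stay redundant frame after frame accumulate large A."""
    A = torch.zeros_like(S)
    A[:, 0] = S[:, 0]
    for t in range(1, S.shape[1]):
        A[:, t] = S[:, t] * (1.0 + decay * A[:, t - 1])
    return A
```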
Semantic-aware Temporal Accumulation Score
We assign a semantic importance score to each token through attention maps.
Using this score, we reweight the temporal accumulation scores.
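A minimal sketch of this reweighting, assuming the semantic score of a token is its attention weight from the class token (a common proxy; the paper's exact definition may differ) and assuming tokens with the highest combined score are the ones pruned:

```python
# Hypothetical combination of accumulated redundancy and semantic importance.
# cls_attn is assumed to be the class-token attention to every spatio-temporal
# token; the scoring rule below is an illustrative choice.
import torch

def sta_keep_mask(A: torch.Tensor, cls_attn: torch.Tensor,
                  keep_ratio: float = 0.5, eps: float = 1e-6) -> torch.Tensor:
    """A: (B, T, N) accumulated redundancy; cls_attn: (B, T, N) semantic scores.
    Returns a boolean mask of tokens to keep (True = keep)."""
    sem = cls_attn / (cls_attn.max(dim=-1, keepdim=True).values + eps)  # scale to [0, 1] per frame
    score = A * (1.0 - sem)          # high redundancy & low semantics => pruned first
    B, T, N = score.shape
    flat = score.reshape(B, T * N)
    k = int(keep_ratio * T * N)      # number of tokens kept per clip
    keep_idx = flat.topk(k, dim=-1, largest=False).indices  # lowest scores survive
    mask = torch.zeros_like(flat, dtype=torch.bool)
    mask.scatter_(1, keep_idx, True)
    return mask.reshape(B, T, N)
```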
Semantic-aware Temporal Accumulation Score
In summary, STA has several advantages: it is efficient and easy to deploy, making it an ideal token pruning solution.
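For illustration, a hypothetical end-to-end usage of the sketches above (random tensors stand in for real features; shapes follow a ViT-style backbone). Because the same number of tokens is kept for every clip, the pruned batch stays rectangular and can be fed straight into the next Transformer block.

```python
# Usage sketch building on temporal_redundancy, accumulate_redundancy and
# sta_keep_mask defined above; all inputs here are random stand-ins.
import torch

B, T, N, D = 2, 8, 196, 768
tokens = torch.randn(B, T, N, D)         # patch embeddings from the backbone
cls_attn = torch.rand(B, T, N)           # class-token attention (stand-in)

S = temporal_redundancy(tokens)          # per-frame redundancy
A = accumulate_redundancy(S)             # Markov-style accumulation over time
mask = sta_keep_mask(A, cls_attn, keep_ratio=0.5)

k = int(0.5 * T * N)                     # tokens kept per clip
kept = tokens.reshape(B, T * N, D)[mask.reshape(B, -1)].reshape(B, k, D)
print(kept.shape)                        # torch.Size([2, 784, 768])
```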
Experimental Results
Kinetics-400
Something-Something V2
Experimental Results
Our pruning algorithm preserves semantically rich regions well.
Take-away
Many Thanks
Q & A