1 of 13

Representation Shift: Unifying Token Compression with FlashAttention

Joonmyung Choi¹*, Sanghyeok Lee¹*, Byungoh Ko¹, Eunseo Kim¹, Jihyung Kil², Hyunwoo J. Kim³

¹Korea University ²Adobe Research ³KAIST

2 of 13

1. Motivation

1. Token Pruning

[Figure] Token pruning: an attention map assigns an importance score to each patch token (legend: patch, CLS token); an input of 16 tokens is reduced to 10 output tokens (38% ↓).
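For context, a minimal sketch of the attention-based criterion the figure illustrates: each patch token is scored by the attention it receives from the CLS token, and only the top-scoring patches are kept. The function name `prune_by_cls_attention` and the tensor shapes are illustrative, not from the paper.

```python
import torch

def prune_by_cls_attention(tokens, attn, keep=10):
    """Prune patch tokens using CLS attention as the importance score.

    tokens: (B, N, D) token embeddings; token 0 is the CLS token.
    attn:   (B, H, N, N) attention map from the same layer.
    keep:   number of patch tokens to retain.
    """
    # Importance of each patch = attention paid to it by CLS, averaged over heads.
    importance = attn[:, :, 0, 1:].mean(dim=1)               # (B, N-1)
    idx = importance.topk(keep, dim=-1).indices              # most-attended patches
    idx = idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))  # (B, keep, D)
    patches = tokens[:, 1:].gather(dim=1, index=idx)
    return torch.cat([tokens[:, :1], patches], dim=1)        # CLS + surviving patches

# The figure's example: 16 patch tokens in, 10 out (roughly 38% fewer).
x = torch.randn(1, 17, 64)                                   # CLS + 16 patches
a = torch.softmax(torch.randn(1, 8, 17, 17), dim=-1)
print(prune_by_cls_attention(x, a).shape)                    # torch.Size([1, 11, 64])
```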

3 of 13

1. Motivation

1. Token Pruning

PuMER (ACL ’23)

H : # of heads

T : # of text tokens

→ Requires the attention map
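As a hedged reading of the H/T notation above: in attention-based pruning for vision-language models such as PuMER, an image token's importance is typically the cross-attention it receives from text tokens, averaged over heads and text tokens. The sketch below only illustrates that dependence on the attention map; it is not PuMER's exact formulation.

```python
import torch

def text_informed_importance(attn_t2i):
    """Score image tokens by the attention they receive from text tokens.

    attn_t2i: (B, H, T, N) text-to-image attention,
              H = # of heads, T = # of text tokens, N = # of image tokens.
    Returns:  (B, N) importance per image token.
    """
    # Average over heads (H) and text tokens (T); higher means keep the token.
    return attn_t2i.mean(dim=(1, 2))
```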

4 of 13

1. Motivation

2. Fused Kernel Attention (FlashAttention)

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (NeurIPS ’22)

5 of 13

1. Motivation

2. Fused Kernel Attention (FlashAttention)

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (NeurIPS ’22)

→ Cannot store the attention map
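A small PyTorch sketch of why this matters: a fused kernel (e.g., via `torch.nn.functional.scaled_dot_product_attention`, which can dispatch to a FlashAttention backend) returns only the attention output, so there is no attention map to read importance scores from; obtaining one requires the naive O(N²)-memory path.

```python
import torch
import torch.nn.functional as F

B, H, N, D = 1, 8, 197, 64
q, k, v = (torch.randn(B, H, N, D) for _ in range(3))

# Fused attention computes softmax(QK^T / sqrt(D)) V tile by tile in on-chip memory
# and returns only the (B, H, N, D) output; the (N x N) map is never written out.
out = F.scaled_dot_product_attention(q, k, v)

# The map only exists on the naive path, which costs O(N^2) memory and bandwidth:
attn = torch.softmax(q @ k.transpose(-2, -1) / D ** 0.5, dim=-1)   # (B, H, N, N)
```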

6 of 13

1. Motivation

2. Fused Kernel Attention (FlashAttention)

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (NeurIPS ’22)

→ Cannot store the attention map

Research Question

How can we estimate token importance without the attention map?

[Figure] Speedups of ×2.7 and ×1.5.

7 of 13

2. Method

Overall architecture and representation shift
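A minimal sketch of the idea as presented on this slide: a token's importance is measured by how much its representation changes across a block (here an L2 norm of the change; the exact distance and any normalization follow the paper). Because the score needs only the block's input and output, no attention map is required. Function names are illustrative.

```python
import torch

def representation_shift(x_in, x_out):
    """Attention-free token importance.

    x_in:  (B, N, D) tokens entering a block.
    x_out: (B, N, D) tokens leaving the same block.
    Returns (B, N): per-token shift; a larger shift marks a more informative token.
    """
    # Sketch: L2 norm of the change the block induces on each token.
    return (x_out - x_in).norm(dim=-1)

def prune_by_shift(x_in, x_out, keep):
    """Keep the `keep` patch tokens whose representations shifted the most."""
    score = representation_shift(x_in, x_out)[:, 1:]           # exclude CLS at index 0
    idx = score.topk(keep, dim=-1).indices.unsqueeze(-1)
    idx = idx.expand(-1, -1, x_out.size(-1))
    patches = x_out[:, 1:].gather(dim=1, index=idx)
    return torch.cat([x_out[:, :1], patches], dim=1)
```

Since the score is a drop-in replacement for attention-based importance, the pruning step above works unchanged with any fused attention kernel.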

8 of 13

3. Experiments

1. Attention vs Ours (Representation Shift)

9 of 13

3. Experiments

1. Attention vs Ours (Representation Shift)

[Figure] Attention-based scores vs. ours (Representation Shift) on video-text retrieval, video question answering, and image classification: gains of +6.3, +3.9, +9.7, +9.1, +5.4, +8.7, and +7.3 (avg +7.2), at a ×2.9 speedup.

10 of 13

3. Experiments

2. Extension to Other Pruning Methods (vid-TLDR)

Applied to vid-TLDR

11 of 13

3. Experiments

3. Extension to Other Architectures (CNN & SSM)

Applied to ResNet

Applied to Mamba

Visualization in ResNet-50
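Because representation shift needs only a block's input and output, the same score can be read off a CNN stage. Below is a hedged sketch on torchvision's ResNet-50 that produces a per-location shift map; the specific block choice and the plain L2 distance are illustrative assumptions, not the paper's exact setup.

```python
import torch
import torchvision

def block_shift_map(block, feat):
    """Per-location representation shift for a residual CNN block.

    block: a residual block whose input/output channels match (e.g., a Bottleneck).
    feat:  (B, C, H, W) input feature map.
    Returns (B, H, W): L2 change per spatial location, usable as a saliency map.
    """
    return (block(feat) - feat).norm(dim=1)

# Illustrative example: the last block of ResNet-50's third stage.
model = torchvision.models.resnet50(weights=None).eval()
x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    stem = model.maxpool(model.relu(model.bn1(model.conv1(x))))
    feat = model.layer3[:-1](model.layer2(model.layer1(stem)))   # (1, 1024, 14, 14)
    shift = block_shift_map(model.layer3[-1], feat)
print(shift.shape)   # torch.Size([1, 14, 14])
```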

12 of 13

4. Summary

  • We propose a novel training-free, model-agnostic token importance criterion based on representation shift.

  • Unlike conventional methods, it operates independently of attention maps, allowing seamless integration with FlashAttention.

  • Moreover, its applicability extends beyond Transformers to CNNs and SSMs.

  • We qualitatively demonstrate that our approach detects foreground objects in early and middle layers more effectively than existing methods.

13 of 13

Thank You