1 of 13

Representation Shift: Unifying Token Compression with FlashAttention

Joonmyung Choi¹*, Sanghyeok Lee¹*, Byungoh Ko¹, Eunseo Kim¹, Jihyung Kil², Hyunwoo J. Kim³

¹Korea University ²Adobe Research ³KAIST

2 of 13

1. Motivation

1. Token Pruning

[Figure] Token pruning: an attention map assigns an importance score to each patch token (legend: patch, CLS token); an input of 16 tokens is reduced to 10 output tokens (38% ↓).
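For context, a minimal sketch of the attention-based criterion the figure illustrates: each patch token is scored by the attention it receives from the CLS token, and only the top-scoring patches are kept. The function name `prune_by_cls_attention` and the tensor shapes are illustrative, not from the paper.

```python
import torch

def prune_by_cls_attention(tokens, attn, keep=10):
    """Prune patch tokens using CLS attention as the importance score.

    tokens: (B, N, D) token embeddings; token 0 is the CLS token.
    attn:   (B, H, N, N) attention map from the same layer.
    keep:   number of patch tokens to retain.
    """
    # Importance of each patch = attention paid to it by CLS, averaged over heads.
    importance = attn[:, :, 0, 1:].mean(dim=1)               # (B, N-1)
    idx = importance.topk(keep, dim=-1).indices              # most-attended patches
    idx = idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))  # (B, keep, D)
    patches = tokens[:, 1:].gather(dim=1, index=idx)
    return torch.cat([tokens[:, :1], patches], dim=1)        # CLS + surviving patches

# The figure's example: 16 patch tokens in, 10 out (roughly 38% fewer).
x = torch.randn(1, 17, 64)                                   # CLS + 16 patches
a = torch.softmax(torch.randn(1, 8, 17, 17), dim=-1)
print(prune_by_cls_attention(x, a).shape)                    # torch.Size([1, 11, 64])
```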

3 of 13

1. Motivation

1. Token Pruning

PuMER (ACL ’23)

H : # of heads

T : # of text tokens

→ Requires the attention map
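As a hedged reading of the H/T notation above: in attention-based pruning for vision-language models such as PuMER, an image token's importance is typically the cross-attention it receives from text tokens, averaged over heads and text tokens. The sketch below only illustrates that dependence on the attention map; it is not PuMER's exact formulation.

```python
import torch

def text_informed_importance(attn_t2i):
    """Score image tokens by the attention they receive from text tokens.

    attn_t2i: (B, H, T, N) text-to-image attention,
              H = # of heads, T = # of text tokens, N = # of image tokens.
    Returns:  (B, N) importance per image token.
    """
    # Average over heads (H) and text tokens (T); higher means keep the token.
    return attn_t2i.mean(dim=(1, 2))
```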

4 of 13

1. Motivation

2. Fused Kernel Attention (FlashAttention)

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (NeurIPS ’22)

5 of 13

1. Motivation

2. Fused Kernel Attention (FlashAttention)

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (NeurIPS ’22)

→ Cannot store the attention map
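A small PyTorch sketch of why this matters: a fused kernel (e.g., via `torch.nn.functional.scaled_dot_product_attention`, which can dispatch to a FlashAttention backend) returns only the attention output, so there is no attention map to read importance scores from; obtaining one requires the naive O(N²)-memory path.

```python
import torch
import torch.nn.functional as F

B, H, N, D = 1, 8, 197, 64
q, k, v = (torch.randn(B, H, N, D) for _ in range(3))

# Fused attention computes softmax(QK^T / sqrt(D)) V tile by tile in on-chip memory
# and returns only the (B, H, N, D) output; the (N x N) map is never written out.
out = F.scaled_dot_product_attention(q, k, v)

# The map only exists on the naive path, which costs O(N^2) memory and bandwidth:
attn = torch.softmax(q @ k.transpose(-2, -1) / D ** 0.5, dim=-1)   # (B, H, N, N)
```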

6 of 13

1. Motivation

2. Fused Kernel Attention (FlashAttention)

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (NeurIPS ’22)

→ Cannot store the attention map

Research Question

How can we estimate token importance without the attention map?

[Figure] Speedups of ×2.7 and ×1.5.

7 of 13

2. Method

Overall architecture and representation shift
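A minimal sketch of the idea as presented on this slide: a token's importance is measured by how much its representation changes across a block (here an L2 norm of the change; the exact distance and any normalization follow the paper). Because the score needs only the block's input and output, no attention map is required. Function names are illustrative.

```python
import torch

def representation_shift(x_in, x_out):
    """Attention-free token importance.

    x_in:  (B, N, D) tokens entering a block.
    x_out: (B, N, D) tokens leaving the same block.
    Returns (B, N): per-token shift; a larger shift marks a more informative token.
    """
    # Sketch: L2 norm of the change the block induces on each token.
    return (x_out - x_in).norm(dim=-1)

def prune_by_shift(x_in, x_out, keep):
    """Keep the `keep` patch tokens whose representations shifted the most."""
    score = representation_shift(x_in, x_out)[:, 1:]           # exclude CLS at index 0
    idx = score.topk(keep, dim=-1).indices.unsqueeze(-1)
    idx = idx.expand(-1, -1, x_out.size(-1))
    patches = x_out[:, 1:].gather(dim=1, index=idx)
    return torch.cat([x_out[:, :1], patches], dim=1)
```

Since the score is a drop-in replacement for attention-based importance, the pruning step above works unchanged with any fused attention kernel.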

8 of 13

3. Experiments

1. Attention vs Ours (Representation Shift)

9 of 13

3. Experiments

1. Attention vs Ours (Representation Shift)

[Figure] Attention-based scores vs. ours (Representation Shift) on video-text retrieval, video question answering, and image classification: gains of +6.3, +3.9, +9.7, +9.1, +5.4, +8.7, and +7.3 (avg +7.2), at a ×2.9 speedup.

10 of 13

3. Experiments

2. Extension to Other Pruning Methods (vid-TLDR)

Applied to vid-TLDR

11 of 13

3. Experiments

3. Extension to Other Architectures (CNN & SSM)

Applied to ResNet

Applied to Mamba

Visualization in ResNet-50
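Because representation shift needs only a block's input and output, the same score can be read off a CNN stage. Below is a hedged sketch on torchvision's ResNet-50 that produces a per-location shift map; the specific block choice and the plain L2 distance are illustrative assumptions, not the paper's exact setup.

```python
import torch
import torchvision

def block_shift_map(block, feat):
    """Per-location representation shift for a residual CNN block.

    block: a residual block whose input/output channels match (e.g., a Bottleneck).
    feat:  (B, C, H, W) input feature map.
    Returns (B, H, W): L2 change per spatial location, usable as a saliency map.
    """
    return (block(feat) - feat).norm(dim=1)

# Illustrative example: the last block of ResNet-50's third stage.
model = torchvision.models.resnet50(weights=None).eval()
x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    stem = model.maxpool(model.relu(model.bn1(model.conv1(x))))
    feat = model.layer3[:-1](model.layer2(model.layer1(stem)))   # (1, 1024, 14, 14)
    shift = block_shift_map(model.layer3[-1], feat)
print(shift.shape)   # torch.Size([1, 14, 14])
```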

12 of 13

4. Summary

  • We propose a novel training-free, model-agnostic token importance criterion based on representation shift.

  • Unlike conventional methods, it operates independently of attention maps, allowing seamless integration with FlashAttention.

  • Moreover, its applicability extends beyond Transformers to CNNs and SSMs.

  • We qualitatively demonstrate that our approach detects foreground objects in early and middle layers more effectively than existing methods.

13 of 13

Thank You