Representation Shift: Unifying Token Compression with FlashAttention
Joonmyung Choi¹*, Sanghyeok Lee¹*, Byungoh Ko¹, Eunseo Kim¹, Jihyung Kil², Hyunwoo J. Kim³
¹Korea University ²Adobe Research ³KAIST
1. Motivation
1. Token Pruning
[Figure] Token pruning: the attention map between the CLS token and the patch tokens gives an importance score per token, and low-importance tokens are dropped (input: 16 tokens → output: 10 tokens, a 38% reduction).
PuMER (ACL ’23)
H: number of heads
T: number of text tokens
→ Token pruning requires the attention map
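To make the dependence concrete, below is a minimal sketch of attention-map-based token pruning; the head-averaged CLS-to-patch attention score and the helper names are illustrative assumptions, not PuMER's exact formulation.

```python
import torch

def attention_importance(attn: torch.Tensor) -> torch.Tensor:
    """Score tokens from a materialized attention map.

    attn: (B, H, N, N) attention probabilities, where H is the number of
    heads and token 0 is the CLS token. Each patch token is scored by the
    head-averaged attention it receives from CLS (illustrative criterion).
    """
    cls_attn = attn[:, :, 0, 1:]        # (B, H, N-1): CLS -> patch attention
    return cls_attn.mean(dim=1)         # (B, N-1): average over heads


def prune_tokens(x: torch.Tensor, attn: torch.Tensor, keep: int) -> torch.Tensor:
    """Keep the `keep` highest-scoring patch tokens (plus CLS)."""
    scores = attention_importance(attn)                  # (B, N-1)
    idx = scores.topk(keep, dim=-1).indices + 1          # shift past CLS
    idx, _ = idx.sort(dim=-1)                            # preserve token order
    cls_tok = x[:, :1]                                   # (B, 1, D)
    patches = torch.gather(x, 1, idx.unsqueeze(-1).expand(-1, -1, x.size(-1)))
    return torch.cat([cls_tok, patches], dim=1)          # e.g., 16 -> 10 tokens
```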
2. Fused Kernel Attention (FlashAttention)
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (NeurIPS ’22)
→ The attention map is never materialized, so it cannot be read out for importance scoring
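A small PyTorch illustration of the issue, assuming `torch.nn.functional.scaled_dot_product_attention` (which can dispatch to FlashAttention on GPU) as the fused kernel: only the attention output is returned, so there is no attention map to score tokens with. Shapes are illustrative.

```python
import torch
import torch.nn.functional as F

B, H, N, D = 2, 12, 197, 64                      # batch, heads, tokens, head dim
device = "cuda" if torch.cuda.is_available() else "cpu"
q, k, v = (torch.randn(B, H, N, D, device=device) for _ in range(3))

# The fused kernel computes softmax(QK^T / sqrt(D)) V tile by tile on-chip.
# It returns only the (B, H, N, D) output; the (B, H, N, N) attention map
# is never written to memory, so attention-based importance scores are
# unavailable.
out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)                                 # torch.Size([2, 12, 197, 64])
```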
Research Question
How can we measure token importance without the attention map?
2. Method
Overall architecture and representation shift
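A minimal sketch of the idea, assuming token importance is measured as the per-token norm of the change a block induces on its input; the module choice, norm, and function names are illustrative, with the exact formulation in the paper. Only the block's input and output are needed, so it composes with fused kernels that never expose the attention map.

```python
import torch
import torch.nn as nn

def representation_shift(block: nn.Module, x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Score tokens without an attention map.

    x: (B, N, D) input tokens. Each token is scored by the norm of the
    change its representation undergoes inside the block (a sketch; the
    choice of module and norm is an assumption). Works with fused
    attention kernels because only inputs and outputs are used.
    """
    y = block(x)                          # (B, N, D), FlashAttention may run inside
    shift = (y - x).norm(dim=-1)          # (B, N): per-token representation shift
    return y, shift


def compress(y: torch.Tensor, shift: torch.Tensor, keep: int) -> torch.Tensor:
    """Keep the `keep` tokens with the largest representation shift."""
    idx = shift.topk(keep, dim=-1).indices.sort(dim=-1).values
    return torch.gather(y, 1, idx.unsqueeze(-1).expand(-1, -1, y.size(-1)))
```

In a ViT, `block` could be the (FlashAttention-backed) self-attention sub-layer; the retained tokens are then passed on to the next layer.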
3. Experiments
1. Attention vs Ours (Representation Shift)
[Figure] Results on video-text retrieval, video question answering, and image classification; gains of +6.3, +3.9, +9.7, +9.1, +5.4, +8.7, and +7.3 (avg +7.2), with a ×2.9 speedup.
2. Extension to Other Pruning Methods (vid-TLDR)
Applied to vid-TLDR
3. Extension to Other Architectures (CNN & SSM)
Applied to ResNet
Applied to Mamba
Visualization in ResNet-50
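As a rough illustration of the CNN case, treating each spatial location as a token, the sketch below measures the per-location change across a shape-preserving ResNet-50 bottleneck block; the block choice and the channel-wise norm are assumptions for illustration.

```python
import torch
import torchvision

# Shape-preserving bottleneck inside layer3 of ResNet-50: input and output
# are both (1, 1024, 14, 14) for a 224x224 image, so the change at each
# spatial location is directly measurable.
model = torchvision.models.resnet50(weights=None).eval()
block = model.layer3[1]

x = torch.randn(1, 1024, 14, 14)          # feature map entering the block
with torch.no_grad():
    y = block(x)

shift = (y - x).norm(dim=1)               # (1, 14, 14): per-location shift
print(shift.shape)
```

The same per-location score can be upsampled and overlaid on the input image as a saliency map, as in the ResNet-50 visualization above.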
4. Summary
Thank You