Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling
Liliang Ren, Yang Liu, Yadong Lu, Yelong Shen, Chen Liang, Weizhu Chen
Microsoft GenAI
UIUC
Context length of LLMs
(Figure from the Gemini 1.5 blog)
How to support infinite context length?
Mamba Layer
A closer look at the Mamba layer
[1] Mamba: Linear-Time Sequence Modeling with Selective State Spaces
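A sketch of the per-step recurrence of the selective SSM at the core of the Mamba layer, following [1] (the discretization shown is the simplified form used there):

```latex
% Selective SSM recurrence (per time step t), following [1].
% \Delta_t, B_t, C_t are computed from the input x_t; A is a learned state matrix.
\begin{aligned}
  \bar{A}_t &= \exp(\Delta_t A), \qquad \bar{B}_t = \Delta_t B_t \\
  h_t &= \bar{A}_t\, h_{t-1} + \bar{B}_t\, x_t \\
  y_t &= C_t\, h_t
\end{aligned}
```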
Why add Sliding Window Attention (SWA)?
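A minimal sketch of sliding window attention: each query attends only to itself and the previous positions inside the window. This is an illustrative implementation, not the paper's code; the default `window_size=2048` matches the window reported for Samba but is an assumption here.

```python
import math
import torch

def sliding_window_attention(q, k, v, window_size=2048):
    """Causal attention restricted to a sliding window of recent positions.

    q, k, v: tensors of shape (batch, heads, seq_len, head_dim).
    """
    seq_len = q.size(-2)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))

    # Band mask: query i attends to key j only if i - window_size < j <= i.
    idx = torch.arange(seq_len, device=q.device)
    mask = (idx[None, :] <= idx[:, None]) & (idx[:, None] - idx[None, :] < window_size)
    scores = scores.masked_fill(~mask, float("-inf"))

    return torch.softmax(scores, dim=-1) @ v
```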
Samba Architecture
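A sketch of how the hybrid stack can be wired. `MambaLayer`, `SlidingWindowAttention`, and `MLP` are assumed placeholders defined elsewhere, not the paper's modules; the Mamba → MLP → SWA → MLP interleaving and the pre-norm residual wrapping reflect the layer-wise hybridization described in the paper, with other details treated as assumptions.

```python
import torch.nn as nn

class SambaBlock(nn.Module):
    """One hybrid block: Mamba -> MLP -> sliding window attention -> MLP,
    each sub-layer wrapped in pre-LayerNorm with a residual connection."""

    def __init__(self, d_model, window_size=2048):
        super().__init__()
        # MambaLayer, SlidingWindowAttention and MLP are assumed to be defined
        # elsewhere; only the interleaving order is the point of this sketch.
        self.layers = nn.ModuleList([
            MambaLayer(d_model),
            MLP(d_model),
            SlidingWindowAttention(d_model, window_size),
            MLP(d_model),
        ])
        self.norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in self.layers)

    def forward(self, x):
        for norm, layer in zip(self.norms, self.layers):
            x = x + layer(norm(x))  # pre-norm residual
        return x
```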
Samba 1.7B trained on 230B tokens of Phi-2 data
Samba 3.8B trained on 3.2T tokens of Phi-3 data
Efficient Length Extrapolation
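Part of why decoding cost stays flat at arbitrary lengths: the Mamba layers carry a fixed-size recurrent state, and each attention layer only needs to cache the last `window_size` key/value pairs. A minimal sketch of such a sliding KV cache (illustrative only; class and method names are assumptions, not the paper's code):

```python
import torch

class SlidingKVCache:
    """Keeps only the most recent `window_size` key/value pairs,
    so per-layer decoding memory stays constant for any sequence length."""

    def __init__(self, window_size=2048):
        self.window_size = window_size
        self.k = None  # (batch, heads, cached_len, head_dim)
        self.v = None

    def update(self, k_new, v_new):
        """Append the new step's keys/values and evict entries outside the window."""
        if self.k is None:
            self.k, self.v = k_new, v_new
        else:
            self.k = torch.cat([self.k, k_new], dim=-2)
            self.v = torch.cat([self.v, v_new], dim=-2)
        self.k = self.k[..., -self.window_size:, :]
        self.v = self.v[..., -self.window_size:, :]
        return self.k, self.v
```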
Samba can memorize long-term information
Training Curves for Instruction Tuning
Samba is good at long-context summarization
What about training on open-source data?
Models trained on SlimPajama
Ablation: How to train models with SWA?
Why not hybridize with full attention?
How to allocate parameters for attention?
Why does the hybrid work?
Effect of Short Convolution
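For context, the short convolution here is a depthwise causal Conv1d applied along the sequence dimension (kernel size 4 in Mamba [1]). A minimal sketch, with the exact kernel size and placement treated as assumptions:

```python
import torch.nn as nn

class ShortConv(nn.Module):
    """Depthwise causal 1-D convolution over the sequence dimension."""

    def __init__(self, d_model, kernel_size=4):
        super().__init__()
        self.conv = nn.Conv1d(
            d_model, d_model,
            kernel_size=kernel_size,
            groups=d_model,           # depthwise: one filter per channel
            padding=kernel_size - 1,  # left context only, to keep the conv causal
        )

    def forward(self, x):
        # x: (batch, seq_len, d_model); Conv1d expects (batch, d_model, seq_len).
        y = self.conv(x.transpose(1, 2))
        y = y[..., : x.size(1)]       # trim the extra right-side outputs from padding
        return y.transpose(1, 2)
```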
Conclusion
Future Directions
Thanks for your time!