Ring Attention & Friends
How Gemini 1.5 Scaled To 10 Million Tokens of Context
2nd March - Yannic's Discord
Contents
Key Points Will Be…
Chapters
Some Context on Context
Vanilla Transformer
Row-wise softmax normalisation constant
O(N²) compute and memory requirements
What's the bottleneck?
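To make the bottleneck concrete, here is a minimal single-head attention sketch (my own PyTorch illustration, not from the talk): the full N×N score matrix is materialised, and each row is divided by its own softmax normalisation constant.

```python
import torch

def naive_attention(Q, K, V):
    """Single-head attention that materialises the full N x N score matrix.

    Q, K, V: (N, d) tensors for one head. Compute and memory are O(N^2)
    because `scores` holds one entry per query/key pair.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / d ** 0.5        # (N, N) -- the quadratic bottleneck
    # Row-wise softmax: each query row gets its own normalisation constant,
    # which couples every key along the sequence dimension.
    weights = torch.softmax(scores, dim=-1)
    return weights @ V                 # (N, d)

N, d = 1024, 64
Q, K, V = (torch.randn(N, d) for _ in range(3))
print(naive_attention(Q, K, V).shape)  # torch.Size([1024, 64])
```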
Flash Attention - Naive Attention
We can split Q into Qi tiles pretty easily, but we don't seem to be able to split K, V because they're coupled along the sequence dimension by the normalisation constant.
Flash Attention - 1st Attempt at Tiling
Red before Blue
Flash Attention - Tiling with the Softmax Rescaling Trick
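A rough sketch of the rescaling trick, simplified from how FlashAttention actually tiles work across SRAM (no tiling of Q, no causal mask): keep a running row-wise max and normaliser per query, and rescale the partial output whenever a new K/V tile raises the max.

```python
import torch

def tiled_attention(Q, K, V, tile=256):
    """Attention over K/V tiles using the online-softmax rescaling trick.

    Never materialises the full N x N score matrix; instead keeps, per query
    row, a running max `m`, running normaliser `l`, and an unnormalised
    output accumulator `acc`, rescaling the old state whenever `m` grows.
    """
    N, d = Q.shape
    scale = d ** -0.5
    m = torch.full((N, 1), float("-inf"))   # running row-wise max
    l = torch.zeros(N, 1)                   # running softmax denominator
    acc = torch.zeros(N, d)                 # running (unnormalised) output

    for j in range(0, K.shape[0], tile):
        Kj, Vj = K[j:j + tile], V[j:j + tile]
        s = (Q @ Kj.T) * scale                          # (N, tile)
        m_new = torch.maximum(m, s.max(dim=-1, keepdim=True).values)
        p = torch.exp(s - m_new)                        # safe exponentials
        corr = torch.exp(m - m_new)                     # rescale old state
        l = l * corr + p.sum(dim=-1, keepdim=True)
        acc = acc * corr + p @ Vj
        m = m_new

    return acc / l                                      # normalise at the end

N, d = 1024, 64
Q, K, V = (torch.randn(N, d) for _ in range(3))
ref = torch.softmax(Q @ K.T / d ** 0.5, dim=-1) @ V
print((tiled_attention(Q, K, V) - ref).abs().max())  # tiny numerical difference
```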
Flash Attention - Tiling Video
Flash Attention - Upshot
Enables 4x larger context window
Also trains GPT-3 >2x faster than previous methods
Reduces memory footprint of the attention computation from O(N²) to O(N)
Ring Attention
Suppose we want to increase the sequence length to 100 million tokens.
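A single-process simulation of the idea (illustrative only, not the authors' JAX implementation): each simulated device owns one Q block plus the same online-softmax state as in the tiling trick above, and the K/V blocks are rotated one hop around a ring per step, so no device ever holds more than one K/V block at a time.

```python
import torch

def ring_attention_sim(Q, K, V, n_devices=4):
    """Single-process simulation of Ring Attention (illustration only).

    The sequence is split into `n_devices` blocks. Each simulated device keeps
    its own query block plus the online-softmax state (m, l, acc); the K/V
    blocks are passed around a ring, one hop per step, so every device sees
    every K/V block exactly once without storing more than one of them.
    """
    N, d = Q.shape
    B = N // n_devices
    scale = d ** -0.5
    Qs = [Q[i * B:(i + 1) * B] for i in range(n_devices)]
    kv = [(K[i * B:(i + 1) * B], V[i * B:(i + 1) * B]) for i in range(n_devices)]
    m = [torch.full((B, 1), float("-inf")) for _ in range(n_devices)]
    l = [torch.zeros(B, 1) for _ in range(n_devices)]
    acc = [torch.zeros(B, d) for _ in range(n_devices)]

    for _ in range(n_devices):                 # one ring rotation per step
        for i in range(n_devices):             # "simultaneously" on each device
            Kj, Vj = kv[i]
            s = (Qs[i] @ Kj.T) * scale
            m_new = torch.maximum(m[i], s.max(dim=-1, keepdim=True).values)
            p = torch.exp(s - m_new)
            corr = torch.exp(m[i] - m_new)
            l[i] = l[i] * corr + p.sum(dim=-1, keepdim=True)
            acc[i] = acc[i] * corr + p @ Vj
            m[i] = m_new
        kv = kv[-1:] + kv[:-1]                 # rotate K/V blocks one hop

    return torch.cat([a / n for a, n in zip(acc, l)], dim=0)

N, d = 1024, 64
Q, K, V = (torch.randn(N, d) for _ in range(3))
ref = torch.softmax(Q @ K.T / d ** 0.5, dim=-1) @ V
print((ring_attention_sim(Q, K, V) - ref).abs().max())  # tiny numerical difference
```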
Striped Attention
Can we make Ring Attention even faster?
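A small sketch of the layout difference (my own illustration): Ring Attention hands each device a contiguous block of tokens, while Striped Attention deals tokens out round-robin, which balances the unmasked causal-attention work each device sees at every ring step.

```python
import torch

def causal_work(q_idx, k_idx):
    """Number of unmasked (query, key) pairs when these token sets meet."""
    return (k_idx[None, :] <= q_idx[:, None]).sum().item()

n_tokens, n_devices = 16, 4
idx = torch.arange(n_tokens)

# Ring Attention layout: device i owns one contiguous block of tokens.
blocks = list(idx.chunk(n_devices))
# Striped Attention layout: token t is dealt to device t % n_devices.
stripes = [idx[i::n_devices] for i in range(n_devices)]

for name, parts in [("contiguous", blocks), ("striped", stripes)]:
    work = [[causal_work(q, k) for k in parts] for q in parts]
    print(name, work)
# Off-diagonal contiguous tiles are either fully masked (0 pairs) or fully
# unmasked (16 pairs), so some devices sit idle at a given ring step; every
# striped tile has 6-10 unmasked pairs out of 16, i.e. the work is balanced.
```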
Large World Models
Recap of Key Points
References & Marginalia
Core Papers:
Other resources: