1 of 31

Ring Attention & Friends

How Gemini 1.5 Scaled To 10 Million Tokens of Context

2nd March - Yannic’s Discord

2 of 31

3 of 31

Contents

  1. Background
    1. Some context on context
  2. The technical work in a 4-part journey
    • Vanilla Transformer Attention
    • Flash Attention
    • Ring Attention
    • Striped Attention
  3. What long context enables:
    • World Models
    • Gemini 1.5
  4. Questions and After Hours

4 of 31

🔑 Key Points Will Be…

  • Reducing Memory Footprint is unreasonably effective for a lot of problems
    • To that end, tiling is the trick that seems to work
  • Play to the hardware - Always think about GPUs
  • Avoid premature optimisation - look at the actual bottleneck
  • Sometimes small ideas make the largest difference

5 of 31

Chapters

  • Background
    • Some context on context
  • The technical work in a 4-part journey
    • Vanilla Transformer
    • Flash Attention
    • Ring Attention
    • Striped Attention
  • What long context enables:
    • World Models
    • Gemini 1.5
  • Questions and After Hours

6 of 31

Some Context on Context

7 of 31

Some Context on Context

8 of 31

Chapters

  • ✅ Background
    • ✅ Some context on context
  • The technical work in a 4-part journey
    • 🤖 Vanilla Transformer Attention
    • Flash Attention
    • Ring Attention
    • Striped Attention
  • What long context enables:
    • World Models
    • Gemini 1.5
  • Questions and After Hours

9 of 31

🤖 Vanilla Transformer

10 of 31

🤖 Vanilla Transformer

11 of 31

🤖 Vanilla Transformer

Row-wise softmax normalisation constant
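
For reference, with scores sᵢⱼ = qᵢ·kⱼ/√d (notation here is mine, not from the slide), the row-wise constant is the sum of exponentials along each query's row of the score matrix:

Z_i = \sum_{j=1}^{N} \exp(s_{ij}), \qquad \mathrm{Attention}(Q, K, V)_i = \frac{1}{Z_i} \sum_{j=1}^{N} \exp(s_{ij})\, v_j

Every key position j contributes to every Zᵢ (in practice the row max is subtracted inside the exponentials for numerical stability), and this coupling across the whole sequence is exactly what the tiling slides later have to work around.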

12 of 31

🤖 Vanilla Transformer

O(N²) compute and memory requirements
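
To make the quadratic cost concrete, here is a minimal sketch (my own, in JAX; not code from the talk) of naive attention that materialises the full N×N score matrix:

import jax
import jax.numpy as jnp

def naive_attention(q, k, v):
    # q, k, v: (N, d). The scores form an explicit (N, N) matrix: O(N^2) memory.
    d = q.shape[-1]
    scores = q @ k.T / jnp.sqrt(d)              # (N, N)
    weights = jax.nn.softmax(scores, axis=-1)   # row-wise softmax over keys
    return weights @ v                          # (N, d)

# Example: N = 1024 tokens, head dimension 64.
q, k, v = jax.random.normal(jax.random.PRNGKey(0), (3, 1024, 64))
out = naive_attention(q, k, v)                  # (1024, 64)

At N = 1,000,000 tokens that score matrix alone would have 10¹² entries (roughly 2 TB in fp16), which is why the rest of the journey is about never materialising it.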

13 of 31

🤖 Vanilla Transformer

What’s the bottleneck?

14 of 31

Chapters

  • ✅ Background
    • ✅ Some context on context
  • The technical work in a 4-part journey
    • ✅ Vanilla Transformer
    • ✨ Flash Attention
    • Ring Attention
    • Striped Attention
  • What long context enables:
    • World Models
    • Gemini 1.5
  • Questions and After Hours

15 of 31

✨ Flash Attention - Naive Attention

We can split Q into Qᵢ tiles pretty easily, but we don’t seem to be able to split K, V, because they’re coupled along the sequence dimension by the softmax normalisation constant 🤔

16 of 31

✨ Flash Attention - 1st Attempt at Tiling

Red before Blue

17 of 31

✨ Flash Attention - Tiling with the Softmax Rescaling Trick
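
In rough terms (a sketch of my own, reusing the imports and naive_attention from earlier; the paper ships this as a fused GPU kernel, not a Python loop): keep a running row max and running denominator per query, process K/V one tile at a time, and rescale the partial output whenever a new tile raises that max.

def flash_style_attention(q, k, v, block=128):
    # Streams K/V in tiles of size `block`, keeping only O(N) running statistics:
    # m = running row max, l = running softmax denominator, o = unnormalised output.
    n, d = q.shape
    m = jnp.full((n, 1), -jnp.inf)
    l = jnp.zeros((n, 1))
    o = jnp.zeros((n, d))
    for start in range(0, k.shape[0], block):
        k_blk, v_blk = k[start:start + block], v[start:start + block]
        s = q @ k_blk.T / jnp.sqrt(d)                        # (N, block) partial scores
        m_new = jnp.maximum(m, s.max(axis=-1, keepdims=True))
        p = jnp.exp(s - m_new)                               # this tile's exponentials
        scale = jnp.exp(m - m_new)                           # rescale earlier partials
        l = l * scale + p.sum(axis=-1, keepdims=True)
        o = o * scale + p @ v_blk
        m = m_new
    return o / l

Up to floating-point error this matches naive_attention (e.g. jnp.allclose(flash_style_attention(q, k, v), naive_attention(q, k, v), atol=1e-5)), but the N×N score matrix is never stored.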

18 of 31

✨ Flash Attention - Tiling Video

19 of 31

✨ Flash Attention - Upshot

Enables a 4× larger context window

Also trains GPT-3 >2× faster than previous methods

Reduces the memory footprint of attention from O(N²) to O(N) in sequence length

20 of 31

Chapters

  • ✅ Background
    • ✅ Some context on context
  • The technical work in a 4-part journey
    • ✅ Vanilla Transformer
    • ✅ Flash Attention
    • 💍 Ring Attention
    • Striped Attention
  • What long context enables:
    • World Models
    • Gemini 1.5
  • Questions and After Hours

21 of 31

💍 Ring Attention

Suppose we want to increase the sequence length to 100 million tokens.

  • What’s the new bottleneck?

22 of 31

💍 Ring Attention

23 of 31

💍 Ring Attention
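
As a rough single-process simulation of the idea (mine; the real implementation shards across devices with JAX collectives): each simulated device owns one Q shard and one K/V shard, the K/V shards rotate around the ring, and each device accumulates with the same rescaling statistics as FlashAttention, so no single device ever needs the whole sequence.

def ring_attention_sim(q_shards, kv_shards):
    # q_shards: list of (n_p, d) query shards, one per simulated device.
    # kv_shards: list of (k_shard, v_shard) pairs, one per simulated device.
    num_devices = len(q_shards)
    outputs = []
    for dev in range(num_devices):
        q = q_shards[dev]
        n, d = q.shape
        m = jnp.full((n, 1), -jnp.inf)   # running row max
        l = jnp.zeros((n, 1))            # running softmax denominator
        o = jnp.zeros((n, d))            # unnormalised output
        for step in range(num_devices):
            # Work on the K/V shard this device currently "holds", then pass it on
            # to the neighbour (here just an index rotation rather than a real send).
            k_blk, v_blk = kv_shards[(dev + step) % num_devices]
            s = q @ k_blk.T / jnp.sqrt(d)
            m_new = jnp.maximum(m, s.max(axis=-1, keepdims=True))
            p = jnp.exp(s - m_new)
            scale = jnp.exp(m - m_new)
            l = l * scale + p.sum(axis=-1, keepdims=True)
            o = o * scale + p @ v_blk
            m = m_new
        outputs.append(o / l)
    return jnp.concatenate(outputs, axis=0)

In the actual algorithm that hand-off is a collective permute overlapped with the block computation, so as long as each block takes at least as long to compute as it takes to transfer, the communication cost is hidden.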

24 of 31

Chapters

  • ✅ Background
    • ✅ Some context on context
  • The technical work in a 4-part journey
    • ✅ Vanilla Transformer
    • ✅ Flash Attention
    • ✅ Ring Attention
    • 🦓 Striped Attention
  • What long context enables:
    • World Models
    • Gemini 1.5
  • Questions and After Hours

25 of 31

🦓 Striped Attention

Can we make Ring Attention even faster?

  • If we’re doing causal (autoregressive) attention, then apparently yes! (See the sketch below.)
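
The observation, paraphrased: with a causal mask, Ring Attention’s contiguous sharding is unbalanced, since in each round the device holding the earliest positions masks out almost all of its scores while the device holding the latest positions does nearly all the work. Striped Attention assigns tokens to devices round-robin instead, so every device’s share of the causal mask is roughly equal. A tiny layout sketch (hypothetical helper names, same JAX setup as before):

def contiguous_partition(num_tokens, num_devices):
    # Ring Attention layout: device p owns the contiguous block [p*B, (p+1)*B).
    block = num_tokens // num_devices
    return [jnp.arange(p * block, (p + 1) * block) for p in range(num_devices)]

def striped_partition(num_tokens, num_devices):
    # Striped Attention layout: device p owns tokens p, p+P, p+2P, ..., so each
    # device holds a mix of early and late positions and causal work is balanced.
    return [jnp.arange(p, num_tokens, num_devices) for p in range(num_devices)]

# 16 tokens on 4 devices:
# contiguous -> [0..3], [4..7], [8..11], [12..15]
# striped    -> [0,4,8,12], [1,5,9,13], [2,6,10,14], [3,7,11,15]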

26 of 31

Chapters

  • ✅ Background
    • ✅ Some context on context
  • ✅ The technical work in a 4-part journey
    • ✅ Vanilla Transformer
    • ✅ Flash Attention
    • ✅ Ring Attention
    • ✅ Striped Attention
  • What long context enables:
    • World Models
    • Gemini 1.5
  • Questions and After Hours

27 of 31

🌍 Large World Models

28 of 31

Chapters

  • ✅ Background
    • ✅ Some context on context
  • ✅ The technical work in a 4-part journey
    • ✅ Vanilla Transformer
    • ✅ Flash Attention
    • ✅ Ring Attention
    • ✅ Striped Attention
  • What long context enables:
    • ✅ World Models
    • Gemini 1.5
  • Questions and After Hours

29 of 31

🔑 Recap of Key Points

  • Reducing Memory Footprint is unreasonably effective for a lot of problems
    • To that end, tiling is the trick that seems to work
  • Play to the hardware - Always think about GPUs
  • Avoid premature optimisation - look at the actual bottleneck
  • Sometimes small ideas make the largest difference

30 of 31

Chapters

  • ✅ Background
    • ✅ Some context on context
  • ✅ The technical work in a 4-part journey
    • ✅ Vanilla Transformer
    • ✅ Flash Attention
    • ✅ Ring Attention
    • ✅ Striped Attention
  • ✅ What long context enables:
    • ✅ World Models
    • ✅ Gemini 1.5
  • Questions, References and After Hours

31 of 31

References & Marginalia

Core Papers: