1 of 31

Ring Attention & Friends

How Gemini 1.5 Scaled To 10 Million Tokens of Context

2nd March - Yannic’s Discord

2 of 31

3 of 31

Contents

  1. Background
    1. Some context on context
  2. The technical work in a 4-part journey
    • Vanilla Transformer Attention
    • Flash Attention
    • Ring Attention
    • Striped Attention
  3. What long context enables:
    • World Models
    • Gemini 1.5
  4. Questions and After Hours

4 of 31

🔑 Key Points Will Be…

  • Reducing Memory Footprint is unreasonably effective for a lot of problems
    • To that end, tiling is the trick that seems to work
  • Play to the hardware - Always think about GPUs
  • Avoid premature optimisation - look at the actual bottleneck
  • Sometimes small ideas make the largest difference

5 of 31

Chapters

  • Background
    • Some context on context
  • The technical work in a 4-part journey
    • Vanilla Transformer
    • Flash Attention
    • Ring Attention
    • Striped Attention
  • What long context enables:
    • World Models
    • Gemini 1.5
  • Questions and After Hours

6 of 31

Some Context on Context

7 of 31

Some Context on Context

8 of 31

Chapters

  • ✅ Background
    • ✅ Some context on context
  • The technical work in a 4-part journey
    • 🤖 Vanilla Transformer Attention
    • Flash Attention
    • Ring Attention
    • Striped Attention
  • What long context enables:
    • World Models
    • Gemini 1.5
  • Questions and After Hours

9 of 31

🤖 Vanilla Transformer

10 of 31

🤖 Vanilla Transformer

11 of 31

🤖 Vanilla Transformer

Row-wise softmax normalisation constant
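
For reference, with scores sᵢⱼ = qᵢ·kⱼ/√d (notation here is mine, not from the slide), the row-wise constant is the sum of exponentials along each query's row of the score matrix:

Z_i = \sum_{j=1}^{N} \exp(s_{ij}), \qquad \mathrm{Attention}(Q, K, V)_i = \frac{1}{Z_i} \sum_{j=1}^{N} \exp(s_{ij})\, v_j

Every key position j contributes to every Zᵢ (in practice the row max is subtracted inside the exponentials for numerical stability), and this coupling across the whole sequence is exactly what the tiling slides later have to work around.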

12 of 31

🤖 Vanilla Transformer

O(N²) compute and memory requirements
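
To make the quadratic cost concrete, here is a minimal sketch (my own, in JAX; not code from the talk) of naive attention that materialises the full N×N score matrix:

import jax
import jax.numpy as jnp

def naive_attention(q, k, v):
    # q, k, v: (N, d). The scores form an explicit (N, N) matrix: O(N^2) memory.
    d = q.shape[-1]
    scores = q @ k.T / jnp.sqrt(d)              # (N, N)
    weights = jax.nn.softmax(scores, axis=-1)   # row-wise softmax over keys
    return weights @ v                          # (N, d)

# Example: N = 1024 tokens, head dimension 64.
q, k, v = jax.random.normal(jax.random.PRNGKey(0), (3, 1024, 64))
out = naive_attention(q, k, v)                  # (1024, 64)

At N = 1,000,000 tokens that score matrix alone would have 10¹² entries (roughly 2 TB in fp16), which is why the rest of the journey is about never materialising it.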

13 of 31

🤖 Vanilla Transformer

What’s the bottleneck?

14 of 31

Chapters

  • ✅ Background
    • ✅ Some context on context
  • The technical work in a 4-part journey
    • ✅ Vanilla Transformer
    • ✨ Flash Attention
    • Ring Attention
    • Striped Attention
  • What long context enables:
    • World Models
    • Gemini 1.5
  • Questions and After Hours

15 of 31

✨ Flash Attention - Naive Attention

We can split Q into Qᵢ tiles pretty easily, but we don’t seem to be able to split K, V, because they’re coupled along the sequence dimension by the softmax normalisation constant 🤔

16 of 31

✨ Flash Attention - 1st Attempt at Tiling

Red before Blue

17 of 31

✨ Flash Attention - Tiling with the Softmax Rescaling Trick
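
In rough terms (a sketch of my own, reusing the imports and naive_attention from earlier; the paper ships this as a fused GPU kernel, not a Python loop): keep a running row max and running denominator per query, process K/V one tile at a time, and rescale the partial output whenever a new tile raises that max.

def flash_style_attention(q, k, v, block=128):
    # Streams K/V in tiles of size `block`, keeping only O(N) running statistics:
    # m = running row max, l = running softmax denominator, o = unnormalised output.
    n, d = q.shape
    m = jnp.full((n, 1), -jnp.inf)
    l = jnp.zeros((n, 1))
    o = jnp.zeros((n, d))
    for start in range(0, k.shape[0], block):
        k_blk, v_blk = k[start:start + block], v[start:start + block]
        s = q @ k_blk.T / jnp.sqrt(d)                        # (N, block) partial scores
        m_new = jnp.maximum(m, s.max(axis=-1, keepdims=True))
        p = jnp.exp(s - m_new)                               # this tile's exponentials
        scale = jnp.exp(m - m_new)                           # rescale earlier partials
        l = l * scale + p.sum(axis=-1, keepdims=True)
        o = o * scale + p @ v_blk
        m = m_new
    return o / l

Up to floating-point error this matches naive_attention (e.g. jnp.allclose(flash_style_attention(q, k, v), naive_attention(q, k, v), atol=1e-5)), but the N×N score matrix is never stored.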

18 of 31

✨ Flash Attention - Tiling Video

19 of 31

✨ Flash Attention - Upshot

Enables a 4× larger context window

Also trains GPT-3 >2× faster than previous methods

Reduces the memory footprint of attention from O(N²) to O(N) in sequence length

20 of 31

Chapters

  • ✅ Background
    • ✅ Some context on context
  • The technical work in a 4-part journey
    • ✅ Vanilla Transformer
    • ✅ Flash Attention
    • 💍 Ring Attention
    • Striped Attention
  • What long context enables:
    • World Models
    • Gemini 1.5
  • Questions and After Hours

21 of 31

💍 Ring Attention

Suppose we want to increase the sequence length to 100 million tokens.

  • What’s the new bottleneck?

22 of 31

💍 Ring Attention

23 of 31

💍 Ring Attention
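
As a rough single-process simulation of the idea (mine; the real implementation shards across devices with JAX collectives): each simulated device owns one Q shard and one K/V shard, the K/V shards rotate around the ring, and each device accumulates with the same rescaling statistics as FlashAttention, so no single device ever needs the whole sequence.

def ring_attention_sim(q_shards, kv_shards):
    # q_shards: list of (n_p, d) query shards, one per simulated device.
    # kv_shards: list of (k_shard, v_shard) pairs, one per simulated device.
    num_devices = len(q_shards)
    outputs = []
    for dev in range(num_devices):
        q = q_shards[dev]
        n, d = q.shape
        m = jnp.full((n, 1), -jnp.inf)   # running row max
        l = jnp.zeros((n, 1))            # running softmax denominator
        o = jnp.zeros((n, d))            # unnormalised output
        for step in range(num_devices):
            # Work on the K/V shard this device currently "holds", then pass it on
            # to the neighbour (here just an index rotation rather than a real send).
            k_blk, v_blk = kv_shards[(dev + step) % num_devices]
            s = q @ k_blk.T / jnp.sqrt(d)
            m_new = jnp.maximum(m, s.max(axis=-1, keepdims=True))
            p = jnp.exp(s - m_new)
            scale = jnp.exp(m - m_new)
            l = l * scale + p.sum(axis=-1, keepdims=True)
            o = o * scale + p @ v_blk
            m = m_new
        outputs.append(o / l)
    return jnp.concatenate(outputs, axis=0)

In the actual algorithm that hand-off is a collective permute overlapped with the block computation, so as long as each block takes at least as long to compute as it takes to transfer, the communication cost is hidden.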

24 of 31

Chapters

  • ✅ Background
    • ✅ Some context on context
  • The technical work in a 4-part journey
    • ✅ Vanilla Transformer
    • ✅ Flash Attention
    • ✅ Ring Attention
    • 🦓 Striped Attention
  • What long context enables:
    • World Models
    • Gemini 1.5
  • Questions and After Hours

25 of 31

🦓 Striped Attention

Can we make Ring Attention even faster?

  • If we’re doing causal (autoregressive) attention, then apparently yes! (See the sketch below.)
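
The observation, paraphrased: with a causal mask, Ring Attention’s contiguous sharding is unbalanced, since in each round the device holding the earliest positions masks out almost all of its scores while the device holding the latest positions does nearly all the work. Striped Attention assigns tokens to devices round-robin instead, so every device’s share of the causal mask is roughly equal. A tiny layout sketch (hypothetical helper names, same JAX setup as before):

def contiguous_partition(num_tokens, num_devices):
    # Ring Attention layout: device p owns the contiguous block [p*B, (p+1)*B).
    block = num_tokens // num_devices
    return [jnp.arange(p * block, (p + 1) * block) for p in range(num_devices)]

def striped_partition(num_tokens, num_devices):
    # Striped Attention layout: device p owns tokens p, p+P, p+2P, ..., so each
    # device holds a mix of early and late positions and causal work is balanced.
    return [jnp.arange(p, num_tokens, num_devices) for p in range(num_devices)]

# 16 tokens on 4 devices:
# contiguous -> [0..3], [4..7], [8..11], [12..15]
# striped    -> [0,4,8,12], [1,5,9,13], [2,6,10,14], [3,7,11,15]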

26 of 31

Chapters

  • ✅ Background
    • ✅ Some context on context
  • ✅ The technical work in a 4-part journey
    • ✅ Vanilla Transformer
    • ✅ Flash Attention
    • ✅ Ring Attention
    • ✅ Striped Attention
  • What long context enables:
    • World Models
    • Gemini 1.5
  • Questions and After Hours

27 of 31

🌍 Large World Models

28 of 31

Chapters

  • ✅ Background
    • ✅ Some context on context
  • ✅ The technical work in a 4-part journey
    • ✅ Vanilla Transformer
    • ✅ Flash Attention
    • ✅ Ring Attention
    • ✅ Striped Attention
  • What long context enables:
    • ✅ World Models
    • Gemini 1.5
  • Questions and After Hours

29 of 31

🔑 Recap of Key Points

  • Reducing Memory Footprint is unreasonably effective for a lot of problems
    • To that end, tiling is the trick that seems to work
  • Play to the hardware - Always think about GPUs
  • Avoid premature optimisation - look at the actual bottleneck
  • Sometimes small ideas make the largest difference

30 of 31

Chapters

  • ✅ Background
    • ✅ Some context on context
  • ✅ The technical work in a 4-part journey
    • ✅ Vanilla Transformer
    • ✅ Flash Attention
    • ✅ Ring Attention
    • ✅ Striped Attention
  • ✅ What long context enables:
    • ✅ World Models
    • ✅ Gemini 1.5
  • Questions, References and After Hours

31 of 31

References & Marginalia

Core Papers: