10-605 / 10-805
Machine Learning from Large Datasets
Announcements
Outline and Relation to Prior Lectures
KEY-VALUE CACHING FOR TRANSFORMERS
A quick recap of KV caching
Masked self-attention is used in the decoder
In masked self-attention, the query for the token at position i can only attend to the keys for tokens at positions 0, 1, …, i
This is done with a mask
The prediction for the token at position i is not affected by "future" tokens at positions i+1, i+2, …
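To make the masking concrete, here is a minimal single-head NumPy sketch of masked (causal) self-attention; the function and array names are illustrative, not taken from the lecture.

```python
import numpy as np

def masked_self_attention(Q, K, V):
    """Single-head causal self-attention (illustrative sketch).

    Q, K, V: arrays of shape (T, d). The mask sets the score of query i
    against key j to -inf whenever j > i, so after the softmax those
    "future" positions receive zero weight.
    """
    T, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                      # (T, T)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)   # True above the diagonal
    scores[mask] = -np.inf                             # hide positions j > i
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V                                 # (T, d)
```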
Recap: masked self-attention
GPT
Recap: masked self-attention is always used in decoder-only Transformers
Observation
[Figure: at step 3, positions 1–2 are input tokens; positions 3 and 4 are generated at steps 3 and 4]
Third output depends only on the first two tokens
Key-Value Caching
[Figure: at step 4, positions 1–3 are now inputs and position 4 is being generated; quantities computed for the earlier positions are shown in green boxes]
Third output depends only on the first two tokens
Fourth output depends only on the first three tokens
Stuff in green boxes has not changed since step 3
The keys and values can be cached and re-used in the forward pass
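A minimal sketch of one auto-regressive decoding step with a KV cache, assuming a single head and pre-computed projection matrices Wq, Wk, Wv (the names and shapes are illustrative):

```python
import numpy as np

def decode_step(x_t, kv_cache, Wq, Wk, Wv):
    """One decoding step with a KV cache (illustrative sketch).

    x_t: hidden vector of the newest token, shape (d_model,).
    kv_cache: dict with arrays 'K' and 'V' of shape (t, d_head),
    e.g. produced during prefilling. Only the new token's key and
    value are computed; keys and values of earlier tokens are
    re-used from the cache instead of being recomputed.
    """
    q = x_t @ Wq
    kv_cache["K"] = np.vstack([kv_cache["K"], x_t @ Wk])  # append new key
    kv_cache["V"] = np.vstack([kv_cache["V"], x_t @ Wv])  # append new value
    K, V = kv_cache["K"], kv_cache["V"]
    scores = K @ q / np.sqrt(K.shape[1])  # new query attends to all cached
    w = np.exp(scores - scores.max())     # positions; no mask needed, since
    w /= w.sum()                          # no "future" token exists yet
    return w @ V, kv_cache
```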
How important is KV caching?
Prefilling: system prompt for Claude 3.7
… about 5 more pages …
How important is KV caching?
A “Chain of Thought” Prompt
one of three “in context” examples
How important is KV caching?
Time to first token
Prefilling
All the inputs needed in prefilling are immediately available.
There are opportunities for parallelism
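A sketch of how prefilling exploits this: because every prompt token is known up front, all keys and values can be computed in one batched matrix multiply (highly parallel on a GPU) rather than one token at a time as in decoding. Shapes and names are illustrative.

```python
import numpy as np

def prefill(X_prompt, Wk, Wv):
    """Prefill phase (illustrative sketch): compute the whole prompt's
    keys and values in a single batched matmul to populate the cache."""
    return {"K": X_prompt @ Wk,   # (T_prompt, d_head) in one matmul
            "V": X_prompt @ Wv}

# Toy usage with made-up sizes: a 512-token prompt, d_model=64, d_head=16.
rng = np.random.default_rng(0)
cache = prefill(rng.normal(size=(512, 64)),
                rng.normal(size=(64, 16)),
                rng.normal(size=(64, 16)))
```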
How important is KV caching?
GPT
Recap: generation is auto-regressive
How important is KV caching?
Architecture impacts KV caching
Recall: we can re-use keys and values in “groups”
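Assuming the "groups" refer to grouped-query attention (several query heads sharing one key/value head), here is a back-of-the-envelope sketch of the effect on cache size; the configuration numbers are made up for illustration.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Approximate KV-cache size per sequence (sketch).

    With grouped-query attention, n_kv_heads < n_query_heads because a group
    of query heads shares one key/value head, shrinking the cache
    proportionally. The leading factor 2 counts both keys and values.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Illustrative configuration: 32 layers, head_dim 128, 8k context, fp16.
full = kv_cache_bytes(32, 32, 128, 8192)  # one KV head per query head
gqa  = kv_cache_bytes(32, 8, 128, 8192)   # 4 query heads share each KV head
print(full / 2**30, "GiB vs", gqa / 2**30, "GiB")  # ~4 GiB vs ~1 GiB
```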
SPEEDUPS BASED ON KEY-VALUE CACHING
Why is this a lecture?
Approximating quadratic attention with KV caching
Outline and Relation to Prior Lectures
Sparse Attention in Transformers
Approximating quadratic attention
shortest paths between nodes in a random Erdős–Rényi graph are about log(n) / log(np)
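A quick empirical check of this estimate with networkx (the parameters are illustrative):

```python
import math
import networkx as nx

# Typical shortest-path length in a sparse Erdős–Rényi graph vs. the
# log(n)/log(np) estimate (illustrative only).
n, p = 2000, 0.005                       # expected degree np = 10
G = nx.gnp_random_graph(n, p, seed=0)
giant = G.subgraph(max(nx.connected_components(G), key=len))
print("average shortest path:", nx.average_shortest_path_length(giant))
print("log(n)/log(np) estimate:", math.log(n) / math.log(n * p))
```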
Approximating quadratic attention
Outline and Relation to Prior Lectures
KEY-VALUE CACHING FOR TRANSFORMERS: TECHNIQUES
Approximating quadratic attention
Finding “Heavy Hitters” in KV Caches
2023
CM Sketch Structure
[Figure: a d × w array of counters with hash functions h1(s), …, hd(s); an update ⟨s, +c⟩ adds c to the counter hj(s) in each row j; d = log(1/δ) rows and w = 2/ε columns, so the guarantees below hold with probability > 1 − δ]
(from: Minos Garofalakis, A Quick Intro to Data Stream Algorithmics – CS262)
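A minimal Python sketch of the structure above; the hashing scheme here is an illustrative stand-in for the pairwise-independent hash functions the analysis assumes.

```python
import math
import numpy as np

class CountMinSketch:
    """Count-Min sketch (minimal sketch of the structure on the slide).

    d = ceil(log2(1/delta)) rows, w = ceil(2/epsilon) columns; an update
    <s, +c> adds c to one counter per row, chosen by that row's hash of s.
    """
    def __init__(self, epsilon=0.01, delta=0.01, seed=0):
        self.d = math.ceil(math.log2(1.0 / delta))
        self.w = math.ceil(2.0 / epsilon)
        self.counts = np.zeros((self.d, self.w), dtype=np.int64)
        rng = np.random.default_rng(seed)
        self.salts = rng.integers(1, 2**31, size=self.d)  # per-row hash salt

    def _cols(self, s):
        return [hash((int(salt), s)) % self.w for salt in self.salts]

    def update(self, s, c=1):            # process the stream update <s, +c>
        for row, col in enumerate(self._cols(s)):
            self.counts[row, col] += c

    def query(self, s):                  # (over-)estimate of s's total count
        return min(self.counts[row, col]
                   for row, col in enumerate(self._cols(s)))
```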
RECAP
CM Sketch Guarantees
(from: Minos Garofalakis, A Quick Intro to Data Stream Algorithmics – CS262)
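The standard Count-Min guarantee, restated with the slide's parameters (a paraphrase of the cited material, not a new result):

```latex
% Estimate for item s: \hat{f}(s) = \min_{j} \mathrm{count}[j, h_j(s)].
% It never undercounts, and with d = \log(1/\delta) rows and w = 2/\epsilon columns:
f(s) \;\le\; \hat{f}(s) \;\le\; f(s) + \epsilon \,\lVert f \rVert_1
\qquad \text{with probability } > 1 - \delta ,
% where \lVert f \rVert_1 is the total count of all updates in the stream.
```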
Application: finding “Heavy Hitters” in streaming data
A plot of the frequency of each word as a function of its frequency rank for two English language texts: Culpeper's Complete Herbal (1652) and H. G. Wells's The War of the Worlds (1898) in a log-log scale.
62690 the
36043 of
27952 and
25725 to
22000 a
19581 in
10328 that
9969 is
9770 was
8833 for
RECAP
Finding “Heavy Hitters” in KV Caches
sparsity: fraction with < 1% of max attention score
Finding “Heavy Hitters” in KV Caches
[Figure: red scatter = total attention summed over all decoding steps; x-axis = "co-occurrence" frequency (same token appears ≥ 2×)]
Finding "Heavy Hitters" in KV Caches
[Figure: heavy-hitter identification using streaming counts vs. full counts]
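In the spirit of this heavy-hitter idea (a sketch of the idea, not the paper's exact algorithm): keep the tokens whose accumulated attention mass is largest, plus a window of recent tokens, and evict the rest of the cache.

```python
import numpy as np

def select_kv_to_keep(acc_attention, t, budget, recent=32):
    """Heavy-hitter style cache eviction (illustrative sketch).

    acc_attention[j]: attention mass token j has received, accumulated over
    the decoding steps so far (a streaming statistic, not full counts).
    Keep the `recent` most recent positions plus the positions with the
    largest accumulated attention, up to `budget` total cache slots.
    """
    keep = set(range(max(0, t - recent), t))          # recent window
    by_score = np.argsort(acc_attention[:t])[::-1]    # heavy hitters first
    for j in by_score:
        if len(keep) >= budget:
            break
        keep.add(int(j))
    return sorted(keep)
```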
Outline and Relation to Prior Lectures
Other ways of predicting attention
2024
Why are initial positions getting so much attention?
Attention sinks
StreamingLLM
Note: I’m skipping some details involving relative positional encoding
[Figure legend: red = StreamingLLM, orange = sliding-window attention, blue = dense attention]
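A sketch of which cache positions StreamingLLM retains, assuming a few initial "attention sink" tokens plus a sliding window of recent tokens (the positional re-indexing details are skipped here too, and the parameter values are illustrative):

```python
def streaming_llm_keep(t, n_sinks=4, window=1024):
    """Positions kept in the KV cache at decoding step t (sketch).

    Retain the first `n_sinks` tokens (the "attention sinks") plus a
    sliding window of the most recent `window` tokens; everything in
    between is evicted.
    """
    sinks = list(range(min(n_sinks, t)))
    recent = list(range(max(n_sinks, t - window), t))
    return sinks + recent
```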
StreamingLLM vs H2O
Outline and Relation to Prior Lectures
Predictive KV Caching: SnapKV
2024
SnapKV: Details
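A sketch of my understanding of SnapKV's selection rule (the window, budget, and pooling parameters are illustrative): attention from an observation window at the end of the prompt votes on which prefix positions to keep, with pooling so that clusters of neighbouring positions are retained together.

```python
import numpy as np

def snapkv_select(attn, window=32, budget=1024, pool=7):
    """SnapKV-style selection of prompt KV entries (illustrative sketch).

    attn: (T_prompt, T_prompt) attention weights for one head at prefill.
    The last `window` queries act as an observation window: their attention
    over the earlier prefix is summed, smoothed with a small 1-D max-pool,
    and the top-`budget` prefix positions are kept along with the window.
    """
    T = attn.shape[0]
    prefix_len = T - window
    votes = attn[-window:, :prefix_len].sum(axis=0)        # (prefix_len,)
    half = pool // 2
    pooled = np.array([votes[max(0, i - half): i + half + 1].max()
                       for i in range(prefix_len)])        # 1-D max pooling
    top = np.argsort(pooled)[::-1][:budget]
    return sorted(set(top.tolist()) | set(range(prefix_len, T)))
```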
SnapKV: Experiment
Needle in a Haystack: Background
Needle in a Haystack: SnapKV results
Outline
Predictive KV Caching: FastGen
2024
Prior analysis of BERT showed it often attends to separators
Prior work (SepLLM) showed separators + initial tokens are useful for KV-caching
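A sketch of the SepLLM-style rule mentioned above: retain the first few tokens plus all separator tokens. Which token ids count as separators depends on the tokenizer; sep_ids here is a placeholder.

```python
def separators_plus_initial(token_ids, sep_ids, n_initial=4):
    """Keep initial tokens plus separator tokens (SepLLM-style sketch).

    token_ids: the token ids seen so far; sep_ids: ids of separator tokens
    (e.g. '.', ',', newline, however the tokenizer encodes them).
    Returns the positions whose keys/values would be retained.
    """
    keep = set(range(min(n_initial, len(token_ids))))
    keep.update(i for i, tid in enumerate(token_ids) if tid in sep_ids)
    return sorted(keep)
```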
Which works best? It depends….
Different heads at layer 20
Which works best? It depends….
Same head different layers
Predictive KV Caching: FastGen Results