10-605 / 10-805
Machine Learning from Large Datasets
Announcements
Outline
KV CACHING RECAP
Recap: Key-Value Caching
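Below is a minimal single-head sketch of the mechanism being recapped (plain NumPy; all names are illustrative): at each decoding step the new key/value pair is appended to a cache, so earlier tokens never need to be re-encoded.

```python
import numpy as np

def decode_step(q, k_new, v_new, cache):
    """One decoding step: append the new key/value, then attend over the whole cache."""
    cache["K"].append(k_new)              # keys and values of past tokens are stored...
    cache["V"].append(v_new)              # ...so they never need to be recomputed
    K, V = np.stack(cache["K"]), np.stack(cache["V"])    # (t, d) each
    scores = K @ q / np.sqrt(len(q))                     # attention logits over the cache
    w = np.exp(scores - scores.max()); w /= w.sum()      # softmax
    return w @ V                                         # attention output for the new token

d, cache = 8, {"K": [], "V": []}
for step in range(5):                     # toy autoregressive loop (single head)
    q, k, v = np.random.randn(d), np.random.randn(d), np.random.randn(d)
    out = decode_step(q, k, v, cache)
print(len(cache["K"]))                    # the cache grows by one K,V pair per generated token
```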
Recap: “Heavy Hitters” in KV Caches
Recap: StreamingLLM
Why are initial positions getting so much attention?
Recap: StreamingLLM
[Plot legend: red = StreamingLLM, orange = sliding-window attention, blue = dense attention]
Recap: SnapKV
Recap: SnapKV Details
Recap: Needle in a Haystack: Background
Recap: Needle in a Haystack: SnapKV
Recap: Predictive KV Caching: FastGen
Which works best? It depends….
Different heads at layer 20
Recap: Predictive KV Caching: FastGen
Outline
RECENT KV-CACHING PAPERS
NeurIPS 2024 – 80 cites
Recap: What architectural changes impact KV caching?
Recall: we can re-use keys and values in “groups”
Idea: some layers re-use KV cache from a previous layer
Like GQA except grouping is across layers instead of inside a single layer
Good results on a 1B-sized model
Improvements less clear at 3B
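A toy sketch of the cross-layer grouping idea above (layer count, group size, and names are illustrative, not the paper's code): only the first layer of each group writes a KV cache, and the other layers in the group read it, analogous to GQA but across depth.

```python
# Toy sketch of cross-layer KV sharing (hypothetical layout, not the paper's code):
# with group size 2, layer 1 reads the KV cache written by layer 0, layer 3 reads layer 2's, etc.
num_layers, group_size = 8, 2
owner = {layer: (layer // group_size) * group_size for layer in range(num_layers)}

kv_cache = {}                                   # one cache per *owner* layer only
for layer in range(num_layers):
    src = owner[layer]
    if src == layer:                            # first layer of the group computes and stores K,V
        kv_cache[src] = f"K,V from layer {layer}"
    keys_values = kv_cache[src]                 # later layers in the group reuse that cache
    # ... attention for `layer` would consume `keys_values` here ...

print(len(kv_cache), "KV caches for", num_layers, "layers")   # memory cut by the group size
```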
Outline
Feb 2025 – Kimi/Tsinghua – 140 cites
Some takeaways
rebatch frequently
Inference workflow
MoonCake architecture
TTFT = “time to first token”
TBT = “time between tokens”
SLO = “service level objectives”
MFU = “model FLOP utilization”
DRAM (CPU) vs. VRAM (GPU)
10 prefill nodes in cluster, each has 3M token cache
KV CACHING COMPARED TO….
Outline
How should we think about these?
Heuristic: find a survey paper!
How should we think about these?
TACL 2024
To reduce the cost of the weight-update process required by SparseGPT, Wanda (Sun et al., 2024) achieves model sparsity by pruning the weights with the smallest magnitudes multiplied by the norm of the corresponding input activations, without the need for retraining or weight updates.
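A small NumPy sketch of that scoring rule (variable names are mine): each weight is scored by its magnitude times the norm of its input feature over a calibration batch, and the lowest-scoring half of each output row is zeroed, with no retraining.

```python
import numpy as np

def wanda_prune(W, X, sparsity=0.5):
    """W: (out, in) weight matrix; X: (tokens, in) calibration activations."""
    feat_norm = np.linalg.norm(X, axis=0)            # ||X_j||_2 per input feature
    score = np.abs(W) * feat_norm                    # |W_ij| * ||X_j||_2
    k = int(W.shape[1] * sparsity)                   # weights to drop per output row
    cutoff = np.partition(score, k - 1, axis=1)[:, k - 1:k]
    mask = score > cutoff                            # keep the higher-scoring weights
    return W * mask                                  # no retraining or weight update needed

W = np.random.randn(4, 16)
X = np.random.randn(32, 16)
print((wanda_prune(W, X) == 0).mean())               # ~0.5 of weights pruned per row
```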
Recap: making a large model smaller
Quantization, Pruning, and KV Cache Management are really learning problems … with constraints on the model learned
Optimization with discrete constraints is harder—so in practice people approximate
Recap: making a large model smaller
Old-school ML in practice:
Recap: making a large model smaller
Quantization, Pruning, and KV Cache Management are really learning problems … with constraints on the model learned
Optimization with discrete constraints is harder—you need to find the best way to satisfy the discrete constraints and the best parameters
Work often looks like searching the discrete part of the space manually (or greedily or …)
Quantization: Beyond 16 bits
Recap
Optimize parameter values via gradient descent (as usual) subject to constraint:
Quantization: LLM.int8()
Hypothesis: outlier features are why int8 stops working!
Recap
Quantization: Beyond 16 bits
Recap
Optimize parameter values via gradient descent (as usual) subject to constraint:
subset to be quantized is chosen with manually fixed rules (based on activation values)
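A rough sketch of that rule (the threshold, shapes, and the simple absmax scheme here are illustrative): input dimensions containing activation outliers stay in higher precision, while the rest go through an int8 matmul with per-row/per-column scales.

```python
import numpy as np

def mixed_int8_matmul(X, W, outlier_threshold=6.0):
    """X: (tokens, d_in) activations, W: (d_in, d_out) weights.
    Outlier input dimensions (any |x| > threshold) stay in fp16;
    the remaining dimensions go through an int8 x int8 matmul."""
    outlier_cols = np.any(np.abs(X) > outlier_threshold, axis=0)

    # higher-precision path for outlier dimensions
    out_fp = X[:, outlier_cols].astype(np.float16) @ W[outlier_cols, :].astype(np.float16)

    # int8 path for the rest: absmax quantization per activation row / weight column
    Xr, Wr = X[:, ~outlier_cols], W[~outlier_cols, :]
    sx = np.abs(Xr).max(axis=1, keepdims=True) / 127 + 1e-8
    sw = np.abs(Wr).max(axis=0, keepdims=True) / 127 + 1e-8
    Xq = np.round(Xr / sx).astype(np.int8)
    Wq = np.round(Wr / sw).astype(np.int8)
    out_int8 = (Xq.astype(np.int32) @ Wq.astype(np.int32)) * sx * sw

    return out_fp.astype(np.float32) + out_int8

X = np.random.randn(4, 64); X[:, 3] = 20.0            # plant one outlier feature
W = np.random.randn(64, 8)
print(np.abs(mixed_int8_matmul(X, W) - X @ W).max())  # error stays small relative to X @ W
```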
Quantization: Performance Tradeoffs
2024
Recap
Quantization: Beyond 4 bits
Recap
Recap: making a large model smaller
simple quant rules
complex rules, learning
“32 bits everywhere”
“16 bits except for bias”
LLM.int8() except for outliers, defined as …
jointly optimize codebook/codes and where to quant while doing gradient updates on weights
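For the "complex" end of the spectrum, here is a toy version of codebook quantization (illustrative only, not any specific paper's algorithm): weights are snapped to a small learned codebook, which could be re-fit between gradient updates.

```python
import numpy as np

def fit_codebook(w, bits=2, iters=10):
    """Tiny k-means codebook: 2**bits centroids, codes index into them."""
    centroids = np.quantile(w, np.linspace(0, 1, 2 ** bits))   # init spread over the weights
    for _ in range(iters):
        codes = np.abs(w[:, None] - centroids[None, :]).argmin(axis=1)
        for c in range(len(centroids)):
            if np.any(codes == c):
                centroids[c] = w[codes == c].mean()
    return centroids, codes

w = np.random.randn(1024)
centroids, codes = fit_codebook(w)
w_hat = centroids[codes]                     # dequantized weights used in the forward pass
print(len(centroids), np.abs(w - w_hat).mean())
# In training, a straight-through estimator would pass gradients to w between codebook re-fits.
```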
Recap: making a large model smaller
Unstructured pruning
Recap
Optimize parameters while minimizing the number of nonzero (NNZ) weights by alternating
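A minimal sketch of that alternation on a toy linear model (hyperparameters are arbitrary): gradient steps on the masked weights alternate with pruning the smallest-magnitude entries.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 32))
w_true = rng.normal(size=32) * (rng.random(32) > 0.7)   # sparse ground truth
y = X @ w_true

w, mask = rng.normal(size=32), np.ones(32)
for step in range(200):
    grad = X.T @ (X @ (w * mask) - y) / len(X)          # gradient step on the masked weights
    w -= 0.05 * grad
    if step % 50 == 49:                                  # alternate: prune smallest magnitudes
        keep = int(0.5 * 32)
        thresh = np.sort(np.abs(w * mask))[-keep]
        mask = (np.abs(w * mask) >= thresh).astype(float)

print("nonzero weights:", int(mask.sum()))
```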
Unstructured pruning: Wanda
Recap
Optimize parameters while minimizing the number of nonzero (NNZ) weights by alternating
What sparse operations are supported?
Recap
What sparse operations are supported?
Speeds for Wanda using semi-structured N:M pruning
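A tiny worked example of 2:4 semi-structured sparsity, the pattern that GPU sparse kernels can actually accelerate: within every group of four consecutive weights, only the two largest-magnitude entries are kept.

```python
import numpy as np

def prune_2_of_4(w):
    """Keep the 2 largest-magnitude weights in each group of 4 (2:4 pattern)."""
    groups = w.reshape(-1, 4)
    idx = np.argsort(np.abs(groups), axis=1)[:, :2]      # two smallest per group
    pruned = groups.copy()
    np.put_along_axis(pruned, idx, 0.0, axis=1)
    return pruned.reshape(w.shape)

w = np.array([0.9, -0.1, 0.05, -0.7,   0.2, 0.3, -0.4, 0.01])
print(prune_2_of_4(w))
# -> [ 0.9  0.   0.  -0.7   0.   0.3 -0.4  0. ]  (50% sparsity with a regular pattern)
```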
Recap
Optimize parameters while minimizing the number of nonzero (NNZ) weights by alternating
Structured Pruning: Shortened Llama
Recap
Optimize parameters while reducing # layers:
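One way to read that objective as a greedy procedure (the scoring function here is a stand-in; the paper uses calibration-based importance scores followed by LoRA healing): score each block by how much the loss grows when it is skipped, and drop the blocks that matter least.

```python
# Greedy depth-pruning sketch (illustrative; `model_loss_without` is a stand-in
# for evaluating loss on calibration data with one transformer block skipped).
def pick_layers_to_drop(num_layers, model_loss_without, n_drop):
    importance = {l: model_loss_without(l) for l in range(num_layers)}
    # layers whose removal changes the loss least are dropped first
    return sorted(importance, key=importance.get)[:n_drop]

# toy stand-in: pretend middle layers matter least
fake_loss = lambda l: 2.0 + abs(l - 16) * 0.01
print(pick_layers_to_drop(32, fake_loss, n_drop=4))   # -> [16, 15, 17, 14]
# After dropping, the smaller model would be healed with a short LoRA fine-tune.
```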
Structured Pruning: Sheared Llama
Recap
Structured Pruning: Sheared Llama
Recap
Use gradients with Lagrange multipliers to jointly optimize perplexity and distance to a valid mask
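Schematically, the objective has roughly this form (not the paper's exact notation): the language-modeling loss is optimized jointly with relaxed mask variables, and Lagrange multipliers push the mask toward a valid target architecture.

```latex
\min_{\theta,\, z}\;\max_{\lambda}\quad
  \mathcal{L}_{\mathrm{LM}}(\theta, z)
  \;+\; \sum_{j} \lambda_j \, \bigl( s_j(z) - t_j \bigr)
```

Here $z$ are continuous mask variables over layers, heads, and hidden dimensions, $s_j(z)$ is the masked size of structure $j$, and $t_j$ is its target size in the pruned architecture.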
Recap: making a large model smaller
simple optimization
complex optimization
Magnitude: greedy weight pruning alternating with gradients
Wanda: greedy weight pruning
Shortened Llama: greedy layer pruning + LoRA
Sheared Llama: jointly optimize weights and continuous approximation to mask.
Important note:
complicated ≠ good!
Recap: making a large model smaller
H2O: “Heavy Hitters” in KV Caches
Recap
Greedy pruning (eviction) based on running estimate of cumulative attention
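A sketch of that eviction rule (budget and recency window are illustrative): every cached position accumulates the attention it receives, and once the budget is exceeded the lowest-scoring position outside the most recent tokens is evicted.

```python
import numpy as np

class H2OCache:
    """Toy heavy-hitter eviction: drop the cached position with the lowest accumulated attention."""
    def __init__(self, budget=8, recent=2):
        self.budget, self.recent = budget, recent
        self.positions = []          # token positions whose K/V are still cached
        self.acc_attn = {}           # running sum of attention each position has received

    def update(self, new_pos, attn_weights):
        # attn_weights[i] = attention the newly generated token paid to self.positions[i]
        for pos, w in zip(self.positions, attn_weights):
            self.acc_attn[pos] += w
        self.positions.append(new_pos)
        self.acc_attn[new_pos] = 0.0
        if len(self.positions) > self.budget:
            candidates = self.positions[:-self.recent]       # never evict the newest tokens
            victim = min(candidates, key=self.acc_attn.get)
            self.positions.remove(victim)
            del self.acc_attn[victim]

cache = H2OCache()
for t in range(20):
    attn = np.random.dirichlet(np.ones(len(cache.positions))) if cache.positions else []
    cache.update(t, attn)
print(sorted(cache.positions))        # a mix of "heavy hitter" positions and recent ones
```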
StreamingLLM
[Plot legend: red = StreamingLLM, orange = sliding-window attention, blue = dense attention]
Recap
Eviction is FIFO with an exception for first few tokens
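A sketch of that policy (sink count and window size are illustrative): the first few "attention sink" tokens are never evicted, and everything else is evicted FIFO so only a recent window remains.

```python
def streaming_llm_keep(num_tokens, n_sink=4, window=8):
    """Return the token positions a StreamingLLM-style cache would retain."""
    sinks = list(range(min(n_sink, num_tokens)))
    recent = list(range(max(n_sink, num_tokens - window), num_tokens))
    return sinks + recent

print(streaming_llm_keep(6))    # -> [0, 1, 2, 3, 4, 5]  (nothing evicted yet)
print(streaming_llm_keep(50))   # -> [0, 1, 2, 3, 42, 43, 44, 45, 46, 47, 48, 49]
```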
StreamingLLM and H2O
Recap
Eviction is H2O rule with an exception for first few tokens
SnapKV: Details
Recap
Eviction for a prompt is based on a score learned from the observation window for that prompt
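A sketch of the selection step (window size and budget are illustrative; the paper additionally pools the votes): attention from the last few prompt tokens, the observation window, votes for which earlier positions to keep before decoding starts.

```python
import numpy as np

def snapkv_select(attn, window=4, budget=16):
    """attn: (prompt_len, prompt_len) prefill attention (rows attend to columns).
    Keep the observation window plus the prefix positions it attends to most."""
    prompt_len = attn.shape[0]
    obs = attn[-window:, : prompt_len - window]        # how the window attends to the prefix
    votes = obs.sum(axis=0)                            # score per prefix position
    keep_prefix = np.argsort(votes)[-budget:]          # top-scoring prefix positions
    keep = np.concatenate([np.sort(keep_prefix),
                           np.arange(prompt_len - window, prompt_len)])
    return keep

attn = np.random.rand(64, 64)
print(snapkv_select(attn))    # positions whose KV entries are retained for decoding
```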
Predictive KV Caching: FastGen
2024
Recap
Predictive KV Caching: FastGen
Prior analysis of BERT showed it often attends to separators
Also SepLLM showed separators + initial tokens are useful for KV-caching
Recap
Predictive KV Caching: FastGen
Recap
Eviction strategy for a prompt is learned based on the strategy’s effectiveness in prefilling
…and head identity and layer are also features (i.e., each can have a different strategy)
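A sketch of the per-head profiling step (the policy set and threshold here are simplified stand-ins for the paper's): the cheapest policy that recovers most of the head's prefill attention mass is selected and then applied during decoding.

```python
import numpy as np

def choose_policy_for_head(attn_row, positions, special, recency_window=8, eps=0.05):
    """attn_row: prefill attention mass one head puts on each position in `positions`.
    Try cheaper policies first; pick the first one that keeps >= 1-eps of the mass."""
    newest = positions[-1]
    policies = {
        "special_only":   [p for p in positions if p in special],
        "special+recent": [p for p in positions if p in special or p >= newest - recency_window],
        "full":           list(positions),
    }
    for name, kept in policies.items():
        kept_set = set(kept)
        kept_mass = sum(w for p, w in zip(positions, attn_row) if p in kept_set)
        if kept_mass >= 1 - eps:
            return name, kept
    return "full", list(positions)

positions = list(range(32))
attn = np.random.dirichlet(np.ones(32))
attn[0] += 30.0
attn /= attn.sum()                    # this head puts almost all its attention on token 0
print(choose_policy_for_head(attn, positions, special={0})[0])   # -> "special_only"
```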
Recap: making a large model smaller
simple optimization
complex optimization
Score by recency (FIFO)
+ start token
+ separator
H2O: score by running accumulated attention
SnapKV: Adaptively pick specific weights based on observation window in prefilling
FastGen: Pick strategy per head and layer based on prefilling.
Recap: making a large model smaller
Old-school ML in practice:
What’s next?
RETRIEVAL AUGMENTED LLMS: INTRO
Anthropic Claude 3.7
Gemini Advanced 2.0 Flash
Anthropic Claude 3.7
Mostly wrong
Gemini Advanced 2.0 Flash
Gemini Advanced 2.0 Flash
Irrelevant from 2017
https://www.cs.cmu.edu/~wcohen/ has all this information
copy of ICML 2008 info
from 2007
Questions
Is the source of information reliable?
What if documents contradict each other?
What if information needed is spread across many documents?