Efficient Foundation Models via Quantization
Tim Dettmers
1
Language models grew 100x in compute requirements in a few years
2
This lecture …
6
Using foundation models
Finetuning foundation models
Quantization: companies vs users
Typical user groups for each case
7
Using foundation models
Finetuning foundation models
State of the art in quantization
Accessibility challenges of foundation models
10
State of the art in quantization
Using foundation models
Finetuning foundation models
Challenges of accessible use of foundation models
13
Reduced memory footprint
Generation / prediction speed
Maintain prediction/generation quality
Compress 16-bit foundation models to 8-bit and 4-bit
8-bit Foundation Models Fail at Scale
15
Our LLM.int8() method is the first method that works at scale
16
Accessibility challenges of foundation models
17
State of the art in quantization
Using foundation models
Finetuning foundation models
Evolution of scale of protein models
19
AlphaFold
21 Million
Parameters
July 2021
ESM-2
15 billion
Parameters
August 2022
ESM-1
650 Million
Parameters
November 2021
Finetuning is expensive due to GPU memory requirements
20
QLoRA: Finetuning large models on a single GPU.
21
QLoRA
(4-bit finetuning)
Accessibility challenges of foundation models
22
State of the art in quantization
Using foundation models
Finetuning foundation models
Quantization: companies vs users
23
Four steps to making foundation models accessible
27
Understand models
Improve model efficiency
Design and build systems
Open-source and make accessible
The bitsandbytes library implements all my research algorithms.
One of the most popular machine learning libraries, growing at 1.7M installations per month.
Widely used in industry.
28
Usage of bitsandbytes outside of computer science after 2 years
Clinical research: Veen et al., 2023; Nerella et al., 2023; Shoham & Rappoport, 2023; Liu et al., 2023; Gosh et al., 2024; Fan et al., 2024; Han et al., 2023; Yang et al., 2023; Schlegel et al., 2023; An et al., 2023
Biomedical: Ateia et al., 2023; Wang et al., 2023; Li et al., 2023; Wang et al., 2023; Delmas et al., 2023; Robinson et al., 2023; Ateia et al., 2023; Hong et al., 2023; Amara et al., 2023; Fries et al., 2022; He et al., 2023
Humanities: Fok et al., 2023; Kuzman et al., 2023; Han et al., 2023; Deng et al., 2024
Education: Zeilikman et al., 2023; Sonkar et al., 2023
Political science: Linegar et al., 2023; He et al., 2023; Bornheim et al., 2023; Gesnouin et al., 2024; Allaham et al., 2024
Social science: Attanasio et al., 2023; Hu et al., 2023; Weld et al., 2024
Manufacturing: Freire et al., 2024; Zhang et al., 2024; Momodu, 2023
Other fields: Kraus et al., 2023; Hadi et al., 2023; Zelikman et al., 2023; Jiang, 2023; Freudenberg, 2023; Wang et al., 2023; Chu et al., 2024; Buehler et al., 2023; Saben & Chandrasekar, 2024
29
Accessibility challenges of foundation models
30
Quantization: companies vs users
Using foundation models
Finetuning foundation models
Background
31
32
Transformers: The backbone of foundation models
34
60%
35%
Transformers are mostly matrix multiplication
37
Matrix multiplication consumes:
2. Neural networks
38
Background: neural networks. A sequence of layers.
42
Layer
Inputs
Weight matrix
Weighted sum
Where are resources used in neural networks?
44
Layer
Inputs
Outputs
Weight matrix
Background: neural networks. A sequence of layers.
46
Model dim
4k - 100k
Layers
50 - 100
Matrix multiplication is quite optimal in terms of hardware and software
…
As such we need to find good approximations to gain efficiency
52
Approximations need to be faithful
55
Approximation through low-rank projection
Speed
Memory
Quality
Approximations need to be useful in practice
58
Approximation through sparsification
Speed
Memory
Quality
Quantization has challenges, but we can overcome them!
61
Approximation through 8-bit computation
16-bit
Speed
Memory
Quality
3. Quantization
62
Quantizing 16-bit normally distributed data to 4-bit integers
63-68
[Figure: normally distributed 16-bit float data on an axis from -65520 to 65520 is mapped to 4-bit integers on an axis from -7 to 7; in the last steps the 16-bit float axis is also rescaled to -7 to 7.]
Integer quantization is similar to histogram binning
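The histogram analogy can be made concrete with a minimal sketch. It assumes simple absmax scaling into the 4-bit integer range -7 to 7 shown on the slides; the function names and the random test data are illustrative and not taken from the lecture.

```python
import numpy as np

def quantize_absmax_int4(x):
    """Map floats to integers in [-7, 7] by scaling with the absolute maximum."""
    scale = np.max(np.abs(x)) / 7.0          # width of one integer bin in float units
    q = np.clip(np.round(x / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Every value that fell into a bin is reconstructed as that bin's center."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
# Normally distributed data, rounded through float16 to mimic 16-bit inputs.
x = rng.standard_normal(10_000).astype(np.float16).astype(np.float32)

q, scale = quantize_absmax_int4(x)
x_hat = dequantize(q, scale)

# The occupancy of each integer value is a histogram with at most 15 bins.
bins, counts = np.unique(q, return_counts=True)
max_error = np.max(np.abs(x - x_hat))
```

Each integer code acts as one histogram bin, so the reconstruction error of any value is at most half a bin width.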
Quantizing 16-bit normally distributed data to 4-bit integers
70-71
[Figure: the same mapping with two outliers marked in the 16-bit float data; axes: 16-bit float (-65520 to 65520, then rescaled to -7 to 7) and 4-bit integer (-7 to 7).]
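A small sketch of why outliers hurt, assuming absmax scaling to the range -7 to 7: a single large value stretches the scale so that nearly all regular values collapse into very few bins. The outlier value of 100 and the helper name are hypothetical choices for illustration.

```python
import numpy as np

def absmax_int4_roundtrip(x):
    """Quantize to integers in [-7, 7] with absmax scaling and reconstruct."""
    scale = np.max(np.abs(x)) / 7.0
    q = np.clip(np.round(x / scale), -7, 7)
    return q, q * scale

rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)

q_clean, x_hat_clean = absmax_int4_roundtrip(x)

x_outlier = x.copy()
x_outlier[0] = 100.0                          # hypothetical single outlier
q_out, x_hat_out = absmax_int4_roundtrip(x_outlier)

bins_used_clean = len(np.unique(q_clean))     # many of the 15 bins are occupied
bins_used_outlier = len(np.unique(q_out))     # almost everything lands in bin 0
err_clean = np.mean(np.abs(x - x_hat_clean))
err_outlier = np.mean(np.abs(x_outlier - x_hat_out))
```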
What do outliers in quantization look like?
Accessibility challenges of foundation models
73
Quantization: companies vs users
Using foundation models
Finetuning foundation models
8-bit Foundation Models Fail at Scale
74
Outlier patterns in small neural networks (350M parameters)
75
Outlier at 6 sigma
Outlier patterns in small neural networks (1.3B parameters)
79
Outlier at 6 sigma
Outlier patterns in small neural networks (2.7B parameters)
82
Outlier at 6 sigma
Outlier patterns in large neural networks (6.7B parameters)
85
Outlier at 6 sigma
Outlier patterns in large neural networks (13B parameters)
89
Outlier at 6 sigma
Emergent outliers vs language model performance
90
Emergent outliers vs outlier magnitude
93
Mixed precision decomposition
Matrix multiply outliers (0.1%) in 16-bit.
Matrix multiply other values (99.9%) in 8-bit.
94
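The decomposition can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the library implementation: the magnitude threshold of 6, the per-row and per-column int8 scales, and all function names are choices made for this example.

```python
import numpy as np

def quantize_int8(a, axis):
    """Absmax-quantize `a` to int8 with one scale per slice along `axis`."""
    scale = np.max(np.abs(a), axis=axis, keepdims=True, initial=0.0) / 127.0
    scale = np.where(scale == 0, 1.0, scale)   # avoid division by zero
    return np.round(a / scale).astype(np.int8), scale

def mixed_precision_matmul(x, w, threshold=6.0):
    """Compute x @ w with outlier feature dimensions kept in 16-bit."""
    # A feature dimension counts as an outlier if any entry reaches the threshold.
    outlier_cols = np.any(np.abs(x) >= threshold, axis=0)

    # Outlier part: few dimensions, multiplied in 16-bit floats.
    x_out = x[:, outlier_cols].astype(np.float16)
    w_out = w[outlier_cols, :].astype(np.float16)
    out_16 = (x_out @ w_out).astype(np.float32)

    # Regular part: all other dimensions, multiplied with int8 inputs.
    xq, sx = quantize_int8(x[:, ~outlier_cols], axis=1)   # one scale per row
    wq, sw = quantize_int8(w[~outlier_cols, :], axis=0)   # one scale per column
    acc = xq.astype(np.int32) @ wq.astype(np.int32)       # int32 accumulation
    out_8 = acc * sx * sw                                  # rescale to floats

    return out_16 + out_8, outlier_cols

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 64)).astype(np.float32)
x[:, 3] *= 20.0                                # hypothetical outlier feature dimension
w = rng.standard_normal((64, 16)).astype(np.float32) * 0.1

y_ref = x @ w
y_mixed, outlier_cols = mixed_precision_matmul(x, w)

# Baseline: quantize everything to int8, outliers included.
xq, sx = quantize_int8(x, axis=1)
wq, sw = quantize_int8(w, axis=0)
y_int8 = (xq.astype(np.int32) @ wq.astype(np.int32)) * sx * sw

err_mixed = np.abs(y_mixed - y_ref).mean()
err_int8 = np.abs(y_int8 - y_ref).mean()
```

Pulling the outlier dimension out of the int8 path keeps the per-row scales small, so the remaining values retain more of their precision.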
8-bit Foundation Models Fail at Scale
95
Our LLM.int8() method is the first method that works at scale
96
How can we maximize performance density per bit?
97
Maximizing performance density in foundation models
100
Parameters | Precision | Total bits |
10B | 8-bit | 80B |
20B | 4-bit | 80B |
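The arithmetic behind the comparison is a one-liner: both configurations use the same number of bits, so the question is which one performs better at a fixed memory budget. The helper names below are illustrative.

```python
def total_bits(num_parameters, bits_per_parameter):
    """Total model size in bits."""
    return num_parameters * bits_per_parameter

def gigabytes(bits):
    """Convert bits to gigabytes (10**9 bytes)."""
    return bits / 8 / 1e9

model_a = total_bits(10_000_000_000, 8)   # 10B parameters at 8-bit
model_b = total_bits(20_000_000_000, 4)   # 20B parameters at 4-bit

same_budget = model_a == model_b          # both are 80B bits
size_gb = gigabytes(model_a)              # 10 GB of weights in either case
```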
Maximizing performance density in foundation models
101
Integer quantization is similar to histogram binning
What do outliers in quantization look like?
Block-wise quantization
What helps to improve scaling? Block size
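A minimal sketch of block-wise quantization, assuming absmax scaling to integers in -7 to 7 with one scale per block. The block sizes, the injected outlier, and the function names are hypothetical and chosen only to show the effect of block size.

```python
import numpy as np

def quantize_blockwise(x, block_size=64, levels=7):
    """Absmax-quantize a 1-D array in independent blocks of `block_size`."""
    pad = (-len(x)) % block_size
    blocks = np.pad(x, (0, pad)).reshape(-1, block_size)
    scales = np.max(np.abs(blocks), axis=1, keepdims=True) / levels
    scales = np.where(scales == 0, 1.0, scales)           # guard all-zero blocks
    q = np.clip(np.round(blocks / scales), -levels, levels).astype(np.int8)
    return q, scales

def dequantize_blockwise(q, scales, length):
    """Rescale every block with its own scale and drop the padding."""
    return (q * scales).reshape(-1)[:length]

rng = np.random.default_rng(0)
x = rng.standard_normal(4096)
x[100] = 100.0                                            # hypothetical outlier

errors = {}
for block_size in (4096, 256, 64):
    q, scales = quantize_blockwise(x, block_size)
    x_hat = dequantize_blockwise(q, scales, len(x))
    errors[block_size] = np.mean(np.abs(x - x_hat))
```

With smaller blocks an outlier only degrades its own block, at the cost of storing more scales.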
Hardware-based block-wise quantization with Blackwell B100/B200
110
Take-away
Fundamental insights into how foundation models process information enable efficiency and accessibility
111
Accessibility challenges of foundation models
112
Quantization: companies vs users
Using foundation models
Finetuning foundation models
Finetuning is expensive due to GPU memory requirements
113
How to finetune a model
122
Error
Weight gradients
Background: How to finetune a model
123
Update the weights
Quality
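The finetuning loop from the preceding slides (forward pass, error, weight gradients, weight update) can be sketched for a single linear layer. The toy regression task, the learning rate, and the sizes are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 8))           # inputs: batch of 32, model dim 8
w_true = rng.standard_normal((8, 4))
y = x @ w_true                             # targets of the toy task
w = np.zeros((8, 4))                       # weight matrix to finetune
learning_rate = 0.1

losses = []
for step in range(100):
    y_pred = x @ w                         # forward pass: weighted sum of inputs
    error = y_pred - y                     # error at the output
    losses.append(np.mean(error ** 2))
    weight_grad = x.T @ error / len(x)     # weight gradients from backpropagation
    w -= learning_rate * weight_grad       # update the weights
```

The weight gradient has the same shape as the weight matrix, which is why full finetuning needs additional memory on top of the model itself.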
Finetuning a 4-bit model with our insights …
124
16-bit
4-bit
Quality
Finetuning a 4-bit model with our insights …
125
4-bit Error
4-bit computation
Finetuning a 4-bit model with our insights …
126
4-bit Error
4-bit Weight gradients
4-bit backpropagation
Finetuning a 4-bit model with our insights …
127
Update the weights
Quality
Low-rank Adaptation (LoRA)
131
Adapters
Not updated
Only update adapter weights
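A minimal LoRA sketch in NumPy, assuming a single linear layer with two small adapter matrices added to a frozen weight matrix. The rank, initialization, learning rate, and toy target are illustrative choices, not values from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank, lr = 16, 16, 2, 0.05

w = rng.standard_normal((d_in, d_out))                    # pretrained weights: not updated
a = rng.standard_normal((d_in, rank)) / np.sqrt(d_in)     # adapter A: trainable
b = np.zeros((rank, d_out))                               # adapter B: trainable, starts at zero

def forward(x):
    """Frozen path plus low-rank adapter path."""
    return x @ w + (x @ a) @ b

# Toy task: the target differs from the pretrained layer by a low-rank shift.
x = rng.standard_normal((64, d_in))
delta = 0.1 * rng.standard_normal((d_in, rank)) @ rng.standard_normal((rank, d_out))
y = x @ (w + delta)

w_before = w.copy()
losses = []
for step in range(300):
    error = forward(x) - y
    losses.append(np.mean(error ** 2))
    grad_out = error / len(x)
    grad_b = (x @ a).T @ grad_out          # gradient for adapter B
    grad_a = x.T @ (grad_out @ b.T)        # gradient for adapter A
    a -= lr * grad_a                       # only the adapter weights are updated
    b -= lr * grad_b

trainable = a.size + b.size                # 64 adapter parameters
frozen = w.size                            # 256 frozen parameters
```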
Quantized Low-rank Adaptation (QLoRA)
132
Quality
16-bit
4-bit
Quantized Low-rank Adaptation (QLoRA)
133
4-bit
4-bit
Add adapters
16-bit adapters
Quantized Low-rank Adaptation (QLoRA)
134
4-bit model
16-bit adapters
4-bit Error
Quantized Low-rank Adaptation (QLoRA)
137
4-bit model
16-bit adapters
Quality
What 4-bit data type is information-theoretically optimal?
138
-7
7
-7
7
4-bit ???
16-bit float
4-bit NormalFloat (NF4): an information-theoretically optimal data type for normally distributed data
139
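The idea of a quantile-based data type can be sketched with the standard library. This is a simplified illustration and not the exact NF4 table: it places 16 levels at evenly spaced quantiles of a standard normal and rescales them to the range -1 to 1. The `offset` parameter and the function names are assumptions; the published NF4 construction additionally guarantees an exact zero, which this sketch omits.

```python
from statistics import NormalDist

def normal_quantile_levels(num_levels=16, offset=0.02):
    """Levels at evenly spaced quantiles of N(0, 1), rescaled to [-1, 1]."""
    normal = NormalDist()
    # Probabilities stay away from 0 and 1, where the quantile is infinite.
    step = (1 - 2 * offset) / (num_levels - 1)
    probs = [offset + i * step for i in range(num_levels)]
    levels = [normal.inv_cdf(p) for p in probs]
    largest = max(abs(v) for v in levels)
    return [v / largest for v in levels]

def quantize(value, levels):
    """Index of the nearest level; 16 levels fit into a 4-bit code."""
    return min(range(len(levels)), key=lambda i: abs(levels[i] - value))

levels = normal_quantile_levels()
code = quantize(0.3, levels)               # input assumed normalized to [-1, 1]
reconstructed = levels[code]
```

Because the levels follow the quantiles, they are dense near zero, where most normally distributed values lie, and sparse in the tails.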
QLoRA systems contributions
140
Results
141
QLoRA recovers lost performance through fine-tuning
142
4-bit Guanaco: A ChatGPT-quality 4-bit chatbot finetuned in 24h on a single GPU
143
Take-away
4-bit finetuning is possible by passing gradients through a 4-bit neural network to 16-bit adapters.
144
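The take-away can be sketched with a two-layer toy network: both weight matrices are stored as 4-bit integers plus scales and stay frozen, while the gradient flows back through the dequantized second layer to 16-bit adapters on the first layer. Plain block-wise absmax quantization stands in for NF4 here, and all sizes, names, and the toy target are assumptions.

```python
import numpy as np

def quantize_4bit(w, block_size=64):
    """Block-wise absmax quantization to integers in [-7, 7] (stand-in for NF4)."""
    blocks = w.reshape(-1, block_size)
    scales = np.max(np.abs(blocks), axis=1, keepdims=True) / 7.0
    q = np.round(blocks / scales).astype(np.int8)
    return q, scales, w.shape

def dequantize(packed):
    """Rebuild a float32 matrix from 4-bit codes and per-block scales."""
    q, scales, shape = packed
    return (q * scales).reshape(shape).astype(np.float32)

rng = np.random.default_rng(0)
d, rank, lr = 16, 2, 0.05
w1 = quantize_4bit(rng.standard_normal((d, d)) / np.sqrt(d))   # frozen, 4-bit
w2 = quantize_4bit(rng.standard_normal((d, d)) / np.sqrt(d))   # frozen, 4-bit
a = (rng.standard_normal((d, rank)) / np.sqrt(d)).astype(np.float16)  # 16-bit adapter
b = np.zeros((rank, d), dtype=np.float16)                             # 16-bit adapter

# Toy task: the target needs a low-rank change to the first layer.
x = rng.standard_normal((64, d)).astype(np.float32)
delta = 0.1 * (rng.standard_normal((d, rank)) / np.sqrt(d)) @ rng.standard_normal((rank, d))
y = (x @ (dequantize(w1) + delta)) @ dequantize(w2)

losses = []
for step in range(300):
    w1_f, w2_f = dequantize(w1), dequantize(w2)     # dequantize on the fly
    a32, b32 = a.astype(np.float32), b.astype(np.float32)
    h = x @ w1_f + (x @ a32) @ b32                  # 4-bit layer plus adapters
    error = h @ w2_f - y                            # second 4-bit layer, then error
    losses.append(float(np.mean(error ** 2)))
    grad_h = (error / len(x)) @ w2_f.T              # gradient passes through the 4-bit layer
    grad_b = (x @ a32).T @ grad_h
    grad_a = x.T @ (grad_h @ b32.T)
    a = (a32 - lr * grad_a).astype(np.float16)      # only the adapters are updated
    b = (b32 - lr * grad_b).astype(np.float16)
```

No gradient is ever stored for the 4-bit weights; they are only read during the forward and backward passes.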
Accessibility challenges of foundation models
145
Quantization: companies vs users
Using foundation models
Finetuning foundation models
State of quantization 2026
146
147
Quantization precision optimality depends on data per parameter
148
149
Vector quantization
Quantize multiple elements to a single index. For example, 2 elements to one 5-bit index (a codebook with 2^5 = 32 entries, 2.5 bits per element).
150
Inputs
0.1 | 0.3 | -1.1 | -2.0 |
0.0 | 1.0 | -1.0 | 4.3 |
-0.1 | 0.2 | 5.0 | -0.3 |
Codebook (pair of values -> index)
0.1 | 0.2 | -> | 0 |
0.0 | 1.0 | -> | 2 |
-0.5 | -1.0 | -> | 1 |
… | … | -> | … |
3.5 | -0.1 | -> | 31 |
Inputs + Codebook = Indices
0 | 2 |
1 | 7 |
… | … |
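A minimal sketch of the lookup, assuming nearest-neighbor assignment under squared Euclidean distance. The codebook here is random and purely hypothetical (in practice it would be learned or designed), so the resulting indices will not match the ones on the slide.

```python
import numpy as np

def vector_quantize(x, codebook):
    """Replace each group of adjacent elements with the index of its nearest codebook row."""
    pairs = x.reshape(-1, codebook.shape[1])
    # Squared Euclidean distance from every pair to every codebook entry.
    dists = ((pairs[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1).reshape(x.shape[0], -1)

def dequantize(indices, codebook):
    """Look up each index and lay the pairs back out along the row."""
    return codebook[indices].reshape(indices.shape[0], -1)

rng = np.random.default_rng(0)
codebook = rng.standard_normal((32, 2))    # 2**5 = 32 entries, so an index needs 5 bits

x = np.array([[0.1, 0.3, -1.1, -2.0],
              [0.0, 1.0, -1.0, 4.3],
              [-0.1, 0.2, 5.0, -0.3]])

indices = vector_quantize(x, codebook)     # one index per pair of elements
x_hat = dequantize(indices, codebook)
bits_per_element = 5 / 2                   # 5-bit index shared by 2 elements
```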
Hadamard rotations: removing outliers through rotations
151
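The effect of a rotation on outliers can be sketched as follows, assuming an orthonormal Hadamard matrix built with the Sylvester construction and absmax quantization to integers in -7 to 7. The vector size, the outlier value, and the function names are illustrative.

```python
import numpy as np

def hadamard(n):
    """Orthonormal n x n Hadamard matrix via the Sylvester construction."""
    assert n > 0 and (n & (n - 1)) == 0, "n must be a power of two"
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)

def quantize_roundtrip(v, levels=7):
    """Absmax-quantize to integers in [-levels, levels] and reconstruct."""
    scale = np.max(np.abs(v)) / levels
    return np.round(v / scale) * scale

n = 64
rng = np.random.default_rng(0)
x = rng.standard_normal(n)
x[5] = 50.0                                # hypothetical outlier

h = hadamard(n)
x_rot = h @ x                              # the outlier's energy is spread over all entries
x_back = h.T @ x_rot                       # the rotation is orthogonal, so it is invertible

err_orig = np.linalg.norm(x - quantize_roundtrip(x))
err_rot = np.linalg.norm(x - h.T @ quantize_roundtrip(x_rot))
```

After the rotation no single entry dominates the scale, so the same number of bits covers the values far more evenly.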
152
Hadamard rotations + a Lloyd-Max codebook yield optimal perplexity in theory. Empirically, this holds up to ~3-bit.
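A Lloyd-Max codebook can be sketched as a one-dimensional Lloyd iteration: assign every sample to its nearest level, then move each level to the mean of its samples. The sample size, the 3-bit codebook size, the iteration count, and the uniform baseline are assumptions made for this example.

```python
import numpy as np

def lloyd_max(samples, num_levels=8, iterations=50):
    """Fit scalar quantization levels that reduce mean squared error on `samples`."""
    # Start from evenly spaced quantiles of the data.
    levels = np.quantile(samples, (np.arange(num_levels) + 0.5) / num_levels)
    for _ in range(iterations):
        # Assignment step: nearest level for every sample.
        assign = np.abs(samples[:, None] - levels[None, :]).argmin(axis=1)
        # Update step: each level moves to the mean of its samples.
        for k in range(num_levels):
            members = samples[assign == k]
            if members.size:
                levels[k] = members.mean()
    return np.sort(levels)

def mse(samples, levels):
    """Mean squared error when every sample is replaced by its nearest level."""
    nearest = levels[np.abs(samples[:, None] - levels[None, :]).argmin(axis=1)]
    return np.mean((samples - nearest) ** 2)

rng = np.random.default_rng(0)
samples = rng.standard_normal(20_000)      # e.g. weights after a Hadamard rotation

codebook = lloyd_max(samples)              # 8 levels correspond to 3 bits
uniform = np.linspace(samples.min(), samples.max(), 8)   # uniform grid for comparison

mse_lloyd = mse(samples, codebook)
mse_uniform = mse(samples, uniform)
```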
Optimal precision for pretraining/post-training quantization
153
Circle: 8B
Triangle: 70B
Star: 405B
154
Quantization is biased.
Because we normalize by the maximum, the largest-magnitude values are always quantized to exactly 1 or -1. This introduces a bias: in expectation, the quantized values underestimate the original values.
RaBitQ accounts for this by using a codebook in the unit ball (QuIP) and removing the bias analytically.
155
RaBitQ + HIGGS but instead of analytic unbiasedness it uses a 1-bit Johnson-Lindenstrauss transform.
156
Best biased method: GPTQ learned “Hadamard” rotations.
157