1 of 157

Efficient Foundation Models via Quantization

Tim Dettmers

1

2 of 157

Language models grew 100x in compute requirements in a few years

2

6 of 157

This lecture …

6

Using foundation models

Finetuning foundation models

Quantization: companies vs users

7 of 157

Typical user groups for each case

7

Using foundation models

Finetuning foundation models

State of the art in quantization

10 of 157

Accessibility challenges of foundation models

10

State of the art in quantization

Using foundation models

Finetuning foundation models

13 of 157

Challenges of accessible use of foundation models

13

Reduced memory footprint

Generation / prediction speed

Maintain prediction/generation quality

14 of 157

Compress 16-bit foundation models to 8 bit and 4 bit
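As a rough illustration of what compressing from 16-bit to 8-bit and 4-bit means for weight storage, here is a minimal sketch. The 70-billion-parameter model size is a hypothetical value chosen for the example, and the calculation covers the weights only (no activations, caches, or quantization constants).

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory needed to store the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

# Hypothetical 70-billion-parameter model, used only for illustration.
NUM_PARAMS = 70e9

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gb(NUM_PARAMS, bits):.0f} GB")
```

Halving the number of bits per parameter halves the weight memory, which is why the precision of the stored weights matters so much for accessibility.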

15 of 157

8-bit Foundation Models Fail at Scale

15

16 of 157

Our LLM.int8() method is the first method that works at scale

16

17 of 157

Accessibility challenges of foundation models

17

State of the art in quantization

Using foundation models

Finetuning foundation models

19 of 157

Evolution of scale of protein models

19

AlphaFold: 21 million parameters (July 2021)

ESM-1: 650 million parameters (November 2021)

ESM-2: 15 billion parameters (August 2022)

20 of 157

Finetuning is expensive due to GPU memory requirements

20

21 of 157

QLoRA: Finetuning large models on a single GPU.

21

QLoRA

(4-bit finetuning)

22 of 157

Accessibility challenges of foundation models

22

State of the art in quantization

Using foundation models

Finetuning foundation models

23 of 157

Quantization: companies vs users

23

27 of 157

Four steps to making foundation models accessible

27

Understand models

Improve model efficiency

Design and build systems

Open-source and make accessible

28 of 157

The bitsandbytes library implements all my research algorithms.

One of the most popular machine learning libraries, growing at 1.7M installations per month.

Widely used in industry.

28

29 of 157

Usage of bitsandbytes outside of computer science after 2 years

Clinical research: Veen et al., 2023; Nerella et al., 2023; Shoham & Rappoport, 2023; Liu et al., 2023; Gosh et al., 2024; Fan et al., 2024; Han et al., 2023; Yang et al., 2023; Schlegel et al., 2023; An et al., 2023

Biomedical: Ateia et al., 2023; Wang et al., 2023; Li et al., 2023; Wang et al., 2023; Delmas et al., 2023; Robinson et al., 2023; Ateia et al., 2023; Hong et al., 2023; Amara et al., 2023; Fries et al., 2022; He et al., 2023

Humanities: Fok et al., 2023; Kuzman et al., 2023; Han et al., 2023; Deng et al., 2024

Education: Zeilikman et al., 2023; Sonkar et al., 2023

Political science: Linegar et al., 2023; He et al., 2023; Bornheim et al., 2023; Gesnouin et al., 2024; Allaham et al., 2024

Social science: Attanasio et al., 2023; Hu et al., 2023; Weld et al., 2024

Manufacturing: Freire et al., 2024; Zhang et al., 2024; Momodu, 2023

Other fields: Kraus et al., 2023; Hadi et al., 2023; Zelikman et al., 2023; Jiang, 2023; Freudenberg, 2023; Wang et al., 2023; Chu et al., 2024; Buehler et al., 2023; Saben & Chandrasekar, 2024

29

30 of 157

Accessibility challenges of foundation models

30

Quantization: companies vs users

Using foundation models

Finetuning foundation models

31 of 157

  1. Resource use in neural networks
  2. Neural networks
  3. Quantization

Background

31

32 of 157

  1. Resource use in neural networks

32

34 of 157

Transformers: The backbone of foundation models

34

60%

35%

37 of 157

Transformers are mostly matrix multiplication

37

Matrix multiplication consumes:

  • 95% Memory
  • 95% Computation

38 of 157

2. Neural networks

38

42 of 157

Background: neural networks. A sequence of layers.

42

Layer

Inputs

Weight matrix

Weighted sum
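To make the "weighted sum" concrete, here is a minimal sketch of a single layer as a matrix multiplication between the inputs and the weight matrix. The sizes are toy values chosen for the example, and bias terms and nonlinearities are left out.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes for illustration; real model dimensions are in the thousands.
batch, d_in, d_out = 2, 8, 4
x = rng.standard_normal((batch, d_in))   # inputs
W = rng.standard_normal((d_in, d_out))   # weight matrix

# One layer: every output is a weighted sum of all inputs.
y = x @ W

# The same computation written as explicit weighted sums.
y_manual = np.array([
    [sum(x[b, i] * W[i, j] for i in range(d_in)) for j in range(d_out)]
    for b in range(batch)
])

print(y.shape)  # (batch, d_out)
```

Stacking many such layers is what makes matrix multiplication dominate both memory and computation.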

44 of 157

Where are resources used in neural networks?

44

Layer

Inputs

Outputs

Weight matrix

46 of 157

Background: neural networks. A sequence of layers.

46

Model dim: 4k - 100k

Layers: 50 - 100

52 of 157

Matrix multiplication is already close to optimal in terms of hardware and software

As such, we need to find good approximations to gain efficiency

52

55 of 157

Approximations need to be faithful

55

Approximation through low-rank projection

Speed

Memory

Quality

58 of 157

Approximations need to be useful in practice

58

Approximation through sparsification

Speed

Memory

Quality

61 of 157

Quantization has challenges, but we can overcome them!

61

Approximation through 8-bit computation

16-bit

Speed

Memory

Quality

62 of 157

3. Quantization

62

64 of 157

Quantizing 16-bit normally distributed data to 4-bit integers

Figure: 16-bit float axis from -65520 to 65520; 4-bit integer axis from -7 to 7.

67 of 157

Quantizing 16-bit normally distributed data to 4-bit integers

Figure: 16-bit float axis from -7 to 7; 4-bit integer axis from -7 to 7.

69 of 157

Integer quantization is similar to histogram binning
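The binning analogy can be sketched with absmax quantization to 4-bit integers in [-7, 7]. This is a minimal illustration, not the implementation used in any particular library; the function names are hypothetical, and the code assumes the input is not all zeros.

```python
import numpy as np

def quantize_absmax_int4(x):
    """Scale x so its largest magnitude maps to 7, then round to integers in [-7, 7]."""
    scale = np.max(np.abs(x)) / 7.0          # assumes x is not all zeros
    q = np.clip(np.round(x / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map integer codes back to approximate float values."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
x = rng.standard_normal(10_000).astype(np.float32)

q, scale = quantize_absmax_int4(x)
x_hat = dequantize(q, scale)

# Each of the 15 integer values acts like one histogram bin.
counts = np.bincount(q.astype(np.int64) + 7, minlength=15)
for level, count in zip(range(-7, 8), counts):
    print(f"{level:>3}: {count}")
```

For normally distributed inputs, most values land in the bins near zero and the outer bins are nearly empty, which is the inefficiency that later slides address.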

70 of 157

Quantizing 16-bit normally distributed data to 4-bit integers

Figure: 16-bit float axis from -65520 to 65520; 4-bit integer axis from -7 to 7; two points labeled "Outlier".
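A small experiment shows why outliers are a problem for absmax quantization. The outlier value of 60.0 is an arbitrary choice for illustration, and the helper name is hypothetical.

```python
import numpy as np

def int4_codes(x):
    """Absmax-quantize x to integer codes in [-7, 7]; returns (codes, scale)."""
    scale = np.max(np.abs(x)) / 7.0
    return np.clip(np.round(x / scale), -7, 7).astype(np.int8), scale

rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)

# Same data with a single artificial outlier (value chosen for illustration).
x_outlier = x.copy()
x_outlier[0] = 60.0

codes_clean, s_clean = int4_codes(x)
codes_out, s_out = int4_codes(x_outlier)

# How many distinct integer levels do the ordinary (non-outlier) values use?
levels_clean = np.unique(codes_clean).size
levels_out = np.unique(codes_out[1:]).size

# Reconstruction error on the ordinary values only.
mse_clean = np.mean((x - codes_clean * s_clean) ** 2)
mse_out = np.mean((x_outlier[1:] - codes_out[1:] * s_out) ** 2)

print(f"levels used: {levels_clean} without outlier, {levels_out} with outlier")
print(f"MSE: {mse_clean:.4f} without outlier, {mse_out:.4f} with outlier")
```

The single outlier stretches the scale so far that almost all ordinary values collapse into the bin at zero.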

71 of 157

Quantizing 16-bit normally distributed data to 4-bit integers

Figure: 16-bit float axis from -7 to 7; 4-bit integer axis from -7 to 7.

72 of 157

What do outliers in quantization look like?

73 of 157

Accessibility challenges of foundation models

73

Quantization: companies vs users

Using foundation models

Finetuning foundation models

74 of 157

8-bit Foundation Models Fail at Scale

74

75 of 157

Outlier patterns in small neural networks (350M parameters)

75

Outlier at 6 sigma

79 of 157

Outlier patterns in small neural networks (1.3B parameters)

79

Outlier at 6 sigma

82 of 157

Outlier patterns in small neural networks (2.7B parameters)

82

Outlier at 6 sigma

85 of 157

Outlier patterns in large neural networks (6.7B parameters)

85

Outlier at 6 sigma

89 of 157

Outlier patterns in large neural networks (13B parameters)

89

Outlier at 6 sigma

90 of 157

Emergent outliers vs language model performance

90

93 of 157

Emergent outliers vs outlier magnitude

93

94 of 157

Mixed precision decomposition

Matrix multiply outliers (0.1%) in 16-bit.

Matrix multiply other values (99.9%) in 8-bit.
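The decomposition can be sketched as follows. This is a simplified illustration under stated assumptions, not the bitsandbytes implementation: the threshold of 6.0, the row-wise and column-wise absmax scaling, and the function names are all choices made for this example.

```python
import numpy as np

def int8_absmax(a, axis):
    """Quantize to int8 with one absmax scale per row (axis=1) or per column (axis=0)."""
    scale = np.max(np.abs(a), axis=axis, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)
    return np.round(a / scale).astype(np.int8), scale

def mixed_precision_matmul(X, W, threshold=6.0):
    """Compute X @ W with outlier feature dimensions in 16-bit and the rest in 8-bit."""
    # Feature dimensions (columns of X) containing at least one large value.
    outlier_cols = np.any(np.abs(X) >= threshold, axis=0)

    # 16-bit path for the few outlier dimensions.
    out_hi = X[:, outlier_cols].astype(np.float16) @ W[outlier_cols, :].astype(np.float16)
    if outlier_cols.all():
        return out_hi.astype(np.float32)

    # 8-bit path for everything else, accumulated in int32 and then rescaled.
    Xq, sx = int8_absmax(X[:, ~outlier_cols], axis=1)
    Wq, sw = int8_absmax(W[~outlier_cols, :], axis=0)
    out_lo = (Xq.astype(np.int32) @ Wq.astype(np.int32)) * sx * sw

    return out_hi.astype(np.float32) + out_lo.astype(np.float32)

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 64))
W = rng.standard_normal((64, 8))
X[0, 3] = 40.0  # one artificial outlier, for illustration

exact = X @ W
err_mixed = np.mean(np.abs(mixed_precision_matmul(X, W) - exact))
err_int8 = np.mean(np.abs(mixed_precision_matmul(X, W, threshold=np.inf) - exact))
print(f"mean abs error: mixed={err_mixed:.4f}, pure 8-bit={err_int8:.4f}")
```

Setting the threshold to infinity disables the 16-bit path, so the second error corresponds to plain 8-bit quantization of the same inputs.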

94

95 of 157

8-bit Foundation Models Fail at Scale

95

96 of 157

Our LLM.int8() method is the first method that works at scale

96

97 of 157

How can we maximize performance density per bit?

97

100 of 157

Maximizing performance density in foundation models

100

Parameters: 10B vs 20B

Precision: 8-bit vs 4-bit

Total bits: 80B vs 80B
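The comparison on this slide is simple arithmetic: both configurations use the same total number of bits, so the question is which one performs better at that budget. A minimal sketch:

```python
def total_bits(num_params: int, bits_per_param: int) -> int:
    """Total number of bits needed to store the weights."""
    return num_params * bits_per_param

bits_small_8bit = total_bits(10_000_000_000, 8)   # 10B parameters at 8-bit
bits_large_4bit = total_bits(20_000_000_000, 4)   # 20B parameters at 4-bit

print(bits_small_8bit, bits_large_4bit)
```

Both products come out to 80 billion bits.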

108 of 157

Block-wise quantization
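A minimal sketch of block-wise quantization, assuming absmax scaling to 4-bit integer levels and a block size of 64. Both are illustrative choices, the function names are hypothetical, and the input length must be divisible by the block size.

```python
import numpy as np

def blockwise_quantize(x, block_size=64, levels=7):
    """Split x into blocks and absmax-quantize each block with its own scale."""
    blocks = x.reshape(-1, block_size)
    scales = np.max(np.abs(blocks), axis=1, keepdims=True) / levels
    scales = np.where(scales == 0, 1.0, scales)
    q = np.round(blocks / scales).astype(np.int8)
    return q, scales

def blockwise_dequantize(q, scales):
    """Undo blockwise_quantize, returning a flat float array."""
    return (q * scales).reshape(-1)

rng = np.random.default_rng(0)
x = rng.standard_normal(4096)
x[0] = 60.0  # one artificial outlier, for illustration

def roundtrip_mse(block_size):
    q, scales = blockwise_quantize(x, block_size=block_size)
    return np.mean((x - blockwise_dequantize(q, scales)) ** 2)

err_single = roundtrip_mse(4096)  # one scale for the whole tensor
err_block = roundtrip_mse(64)     # one scale per 64 values
print(f"MSE: single block={err_single:.4f}, block size 64={err_block:.4f}")
```

With independent scales per block, the outlier only degrades the one block that contains it. The cost is one extra scale value per block.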

109 of 157

What helps to improve scaling? Block size

110 of 157

Hardware-based block-wise quantization with Blackwell B100/B200

110

111 of 157

Take-away

Fundamental insights into how foundation models process information enable efficiency and accessibility

111

112 of 157

Accessibility challenges of foundation models

112

Quantization: companies vs users

Using foundation models

Finetuning foundation models

113 of 157

Finetuning is expensive due to GPU memory requirements

113

121 of 157

How to finetune a model

121

Error

Weight gradients

123 of 157

Background: How to finetune a model

123

Update the weights

Quality

124 of 157

Finetuning a 4-bit model with our insights …

124

16-bit

4-bit

Quality

125 of 157

Finetuning a 4-bit model with our insights …

125

4-bit Error

4-bit computation

126 of 157

Finetuning a 4-bit model with our insights …

126

4-bit Error

4-bit Weight gradients

4-bit backpropagation

127 of 157

Finetuning a 4-bit model with our insights …

127

Update the weights

Quality

131 of 157

Low-rank Adaptation (LoRA)

131

Adapters

Not updated

Only update adapter weights
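A minimal sketch of the LoRA forward pass, assuming a frozen weight matrix `W` and two small trainable adapter matrices `A` and `B`. The sizes, the rank, the initialization, and the `scale` parameter are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes for illustration; the rank is a hypothetical choice.
d_in, d_out, rank = 16, 16, 2

W = rng.standard_normal((d_in, d_out))         # pretrained weights: not updated
A = rng.standard_normal((d_in, rank)) * 0.01   # adapter: trainable
B = np.zeros((rank, d_out))                    # adapter: trainable, zero so training starts from W

def lora_forward(x, W, A, B, scale=1.0):
    """Frozen path plus low-rank adapter path."""
    return x @ W + scale * (x @ A) @ B

x = rng.standard_normal((4, d_in))
y = lora_forward(x, W, A, B)

trainable = A.size + B.size
print(f"trainable adapter parameters: {trainable} of {W.size + trainable} total")
```

Only `A` and `B` receive gradient updates, so gradient and optimizer memory scale with the adapter size rather than the full weight matrix.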

132 of 157

Quantized Low-rank Adaptation (QLoRA)

132

Quality

16-bit

4-bit

133 of 157

Quantized Low-rank Adaptation (QLoRA)

133

4-bit

4-bit

Add adapters

16-bit adapters

134 of 157

Quantized Low-rank Adaptation (QLoRA)

134

4-bit model

16-bit adapters

4-bit Error

137 of 157

Quantized Low-rank Adaptation (QLoRA)

137

4-bit model

16-bit adapters

Quality

138 of 157

What 4-bit data type is information-theoretically optimal?

Figure: 16-bit float axis from -7 to 7; 4-bit "???" axis from -7 to 7.

139 of 157

4-bit NormalFloat (NF4): an information-theoretically optimal data type for normal distributions
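The idea behind a data type matched to normal distributions can be sketched with quantile quantization: place the 16 code values so that each covers an equal share of the probability mass of a standard normal. This is an illustrative approximation only. It does not reproduce the exact NF4 code values, which are constructed differently (for example, to include an exact zero); the function names are hypothetical.

```python
from statistics import NormalDist

import numpy as np

def normal_quantile_codebook(bits=4):
    """Code values at the midpoints of 2**bits equal-probability bins of N(0, 1), scaled to [-1, 1]."""
    k = 2 ** bits
    probs = (np.arange(k) + 0.5) / k
    values = np.array([NormalDist().inv_cdf(float(p)) for p in probs])
    return values / np.max(np.abs(values))

def quantize_to_codebook(x, codebook):
    """Normalize by absmax, then store the index of the nearest code value."""
    absmax = np.max(np.abs(x))
    idx = np.argmin(np.abs(x[:, None] / absmax - codebook[None, :]), axis=1)
    return idx.astype(np.uint8), absmax

def dequantize_from_codebook(idx, absmax, codebook):
    return codebook[idx] * absmax

codebook = normal_quantile_codebook(bits=4)

rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)
idx, absmax = quantize_to_codebook(x, codebook)
x_hat = dequantize_from_codebook(idx, absmax, codebook)

print(np.round(codebook, 3))
```

Unlike the evenly spaced integer grid, these code values are dense near zero, where normally distributed weights concentrate.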

139

140 of 157

QLoRA systems contributions

140

141 of 157

Results

141

142 of 157

QLoRA recovers lost performance through fine-tuning

142

143 of 157

4-bit Guanaco: A ChatGPT-quality 4-bit chatbot finetuned in 24h on a single GPU

143

144 of 157

Take-away

4-bit finetuning is possible by passing gradients through a 4-bit neural network to 16-bit adapters.
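The take-away can be sketched for a single layer: the base weights are stored as frozen 4-bit codes, dequantized for the forward pass, and only the adapters receive gradients. This is a simplified illustration with manual gradients for a squared-error loss. The plain absmax integer quantization stands in for the actual 4-bit data type, and all sizes and names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n = 8, 2, 4  # toy sizes for illustration

# Frozen base weights, stored as integer codes in [-7, 7] plus one scale.
W_full = rng.standard_normal((d, d))
scale = np.max(np.abs(W_full)) / 7.0
W_q = np.round(W_full / scale).astype(np.int8)

# Trainable adapters kept in higher precision.
A = rng.standard_normal((d, r)) * 0.1
B = rng.standard_normal((r, d)) * 0.1

x = rng.standard_normal((n, d))
target = rng.standard_normal((n, d))

def forward(A, B):
    W = W_q * scale                  # dequantize the frozen weights
    return x @ W + (x @ A) @ B

def loss(A, B):
    return 0.5 * np.sum((forward(A, B) - target) ** 2)

# Backward pass: gradients exist only for the adapters.
err = forward(A, B) - target             # dL/dy
grad_B = (x @ A).T @ err                 # dL/dB
grad_A = x.T @ (err @ B.T)               # dL/dA
# Gradient passed on to earlier layers flows through the dequantized weights.
grad_x = err @ (W_q * scale + A @ B).T

# One gradient step on the adapters; W_q is never modified.
lr = 1e-3
loss_before = loss(A, B)
A_new, B_new = A - lr * grad_A, B - lr * grad_B
loss_after = loss(A_new, B_new)
print(f"loss before: {loss_before:.4f}, after one step: {loss_after:.4f}")
```

The quantized weights take part in both the forward and the backward pass but are never updated themselves.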

144

145 of 157

Accessibility challenges of foundation models

145

Quantization: companies vs users

Using foundation models

Finetuning foundation models

146 of 157

State of quantization 2026

146

147 of 157

147

148 of 157

Quantization precision optimality depends on data per parameter

148

149 of 157

149

150 of 157

Vector quantization

Quantize multiple elements to a single number. For example, 2 elements to 5-bit.

150

Figure: input values (0.1, 0.3, -1.1, -2.0, 0.0, 1.0, -1.0, 4.3, -0.1, 0.2, 5.0, -0.3) and a codebook that maps pairs of values to indices, e.g. (0.1, 0.2) -> 0, (-0.5, -1.0) -> 1, (0.0, 1.0) -> 2.
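A minimal sketch of vector quantization with pairs of values. The input values and the four codebook rows are taken from the slide, but the way the inputs are grouped into pairs, the index of the last codebook row, and the nearest-neighbor rule are assumptions made for this example. A 5-bit code would use a codebook with 32 entries; only four are shown here.

```python
import numpy as np

def vq_encode(x, codebook):
    """Group x into vectors and store the index of the nearest codebook entry for each."""
    vectors = x.reshape(-1, codebook.shape[1])
    sq_dist = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return np.argmin(sq_dist, axis=1)

def vq_decode(idx, codebook):
    """Replace each index by its codebook entry and flatten."""
    return codebook[idx].reshape(-1)

# Toy codebook with 4 entries of 2 elements each.
codebook = np.array([
    [0.1, 0.2],
    [-0.5, -1.0],
    [0.0, 1.0],
    [3.5, -0.1],
])

x = np.array([0.1, 0.3, -1.1, -2.0, 0.0, 1.0, -1.0, 4.3, -0.1, 0.2, 5.0, -0.3])

idx = vq_encode(x, codebook)
x_hat = vq_decode(idx, codebook)
print(idx)
print(x_hat)
```

Twelve input values are stored as six indices, so the cost per element is half the number of bits per index.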

151 of 157

Hadamard rotations: removing outliers through rotations
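The effect of a rotation on outliers can be sketched with an orthonormal Hadamard matrix built by Sylvester's construction. This is a minimal illustration: the vector length, the outlier value, and the use of a plain (non-randomized) Hadamard matrix are assumptions for this example.

```python
import numpy as np

def hadamard(n):
    """Orthonormal Hadamard matrix via Sylvester's construction; n must be a power of two."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

rng = np.random.default_rng(0)
n = 256
x = rng.standard_normal(n)
x[5] = 50.0  # one artificial outlier, for illustration

H = hadamard(n)
x_rot = H @ x          # rotate: the outlier's energy is spread over all coordinates
x_back = H.T @ x_rot   # the rotation is orthogonal, so it can be undone exactly

print(f"max |x| before rotation: {np.max(np.abs(x)):.2f}")
print(f"max |x| after rotation:  {np.max(np.abs(x_rot)):.2f}")
```

The rotation preserves the norm of the vector but spreads the outlier across all coordinates, so the rotated values are much easier to quantize with a single scale.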

151

152 of 157

152

Hadamard rotations combined with a Lloyd-Max codebook yield optimal perplexity in theory. Empirically, this holds up to ~3-bit.

153 of 157

Optimal precision for pretraining/post-training quantization

153

Circle: 8B

Triangle: 70B

Star: 405B

154 of 157

154

Quantization is biased.

Because we normalize by the maximum, the largest-magnitude values are always quantized to exactly 1 or -1. This leads to a bias where the quantized values are underestimated in expectation.

RaBitQ accounts for this by using a codebook in the unit ball (QuIP) and removing the bias analytically.

155 of 157

155

RaBitQ + HIGGS, but instead of an analytic bias correction it uses a 1-bit Johnson-Lindenstrauss transform.

156 of 157

  • The Lloyd-Max optimizer is block-size dependent: smaller blocks have lower bias.
  • Bias can be moved into the codebook.
  • Essentially HIGGS + RaBitQ for different block sizes with a multiplicative bias factor.

156

157 of 157

Best biased method: GPTQ learned “Hadamard” rotations.

157