
Spectra: A Comprehensive Study of Ternary, Quantized and FP16 Language Models 

Will ternary models outshine half-precision and quantised models?

22-07-2024

Introduction

Compute (FLOPs) grows faster than memory capacity and bandwidth

Background:

  • Large Language Model deployment is bottlenecked by:
    • Model size
    • Memory usage
    • Data transfer
  • Token generation speed (latency) is limited by memory bandwidth

How are these Memory Bottlenecks in LLMs addressed?

  • Exceeding Chinchilla’s compute-optimal regime for small models?
    • Requires an extremely large amount of data (>=15 trillion tokens)
    • Highly compute-inefficient due to low parameter counts (<=8 billion)
  • Post-Training Quantization?
    • Using 4-bit precision is nearly always optimal.
    • Significant performance degradation observed below 4 bits.
    • Sensitive to the calibration dataset.
  • Training neural networks with low effective bit-widths?
    • Unlike post-training quantization, this requires training from scratch.

Memory Bottlenecks and Low-Bitwidth Language Modelling

Deployment: Memory Capacity over peak TFLOPs

Consider recent microarchitectures:

  • Nvidia: Volta, Ampere, Hopper & Blackwell
  • AMD: MI200 and MI300 series
  • Intel: Gaudi 2 and Gaudi 3
  • Google: TPU v3, v4 and v5

  • Downward slope shows that memory capacity grows slower than compute (FLOPs)

Memory Capacity and Low-Bitwidth Modelling

  • TriLMs are also better for edge deployment, where devices have less than 8 GB of RAM.

A single H100 can easily fit:

  • >34B FloatLM
  • >70B QuantLM
  • >300B TriLM

(We don’t consider the overhead of KV cache, activations, and compilation incurred during model deployment.)
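The fit claims above are weights-only arithmetic. A back-of-envelope sketch, assuming an 80 GB H100 and ~1.58 effective bits per ternary weight (`max_params` is an illustrative helper, not from the paper):

```python
# Weights-only estimate of how many parameters fit in 80 GB of HBM.
# KV cache, activations and compilation overhead are ignored, as on the slide.
HBM_BYTES = 80e9  # assumed H100 memory capacity

def max_params(bits_per_param):
    """Parameters that fit when each weight takes bits_per_param bits."""
    return HBM_BYTES / (bits_per_param / 8)

for name, bits in [
    ("FloatLM (FP16)", 16),
    ("QuantLM (4-bit)", 4),
    ("TriLM (ternary, ~1.58 bits)", 1.58),
]:
    print(f"{name}: ~{max_params(bits) / 1e9:.0f}B parameters")
```

FP16 gives ~40B and ternary ~405B, consistent with the ">34B FloatLM" and ">300B TriLM" figures once real-world overheads are subtracted.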

Latency: Memory Bandwidth over FLOPs

Consider recent microarchitectures:

  • Nvidia: Volta, Ampere, Hopper & Blackwell
  • AMD: MI200 and MI300 series
  • Intel: Gaudi 2 and Gaudi 3
  • Google: TPU v3, v4 and v5

  • Downward slope shows that memory bandwidth grows slower than compute (FLOPs)

Latency and Low-Bitwidth Language Modelling

  • At 7B parameters, TriLMs are more than 4x faster than FloatLMs and 2x faster than QuantLMs.
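Because decoding is bandwidth-bound, the speedup follows from bytes moved per token. A roofline-style sketch with an assumed ~3.35 TB/s of HBM bandwidth (`tokens_per_second` is an illustrative helper; real kernels fall short of this ideal, which is why the measured gap is ~4x rather than the bit-width ratio of ~10x):

```python
# Ideal decode throughput when every generated token must stream all weights
# from memory: tokens/s = bandwidth / bytes-per-token.
BANDWIDTH = 3.35e12  # assumed HBM bandwidth in bytes/s

def tokens_per_second(n_params, bits_per_param):
    bytes_per_token = n_params * bits_per_param / 8
    return BANDWIDTH / bytes_per_token

fp16 = tokens_per_second(7e9, 16)       # FloatLM 7B
ternary = tokens_per_second(7e9, 1.58)  # TriLM 7B
print(f"ideal speedup: {ternary / fp16:.1f}x")  # equals the bit-width ratio
```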

TriLM (Language Modelling with Ternary Weights)

Architecture of TriLM

Key Architectural Features:

  • LLaMa-style Transformer
  • Ternary weights in linear layers
  • RMSNorm
  • RoPE
  • No bias terms

Linear Layers

  • In TriLMs, the linear-layer weights are ternary {-1, 0, +1}, with a shared floating-point scale.
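A minimal sketch of what ternarization with a shared scale can look like, assuming absmean-style rounding (the function name `ternarize` and the exact rounding rule are illustrative, not taken from the paper):

```python
def ternarize(weights, eps=1e-8):
    """Map float weights to {-1, 0, +1} with one shared float scale."""
    # Shared scale: mean absolute value of the weights.
    scale = sum(abs(w) for w in weights) / len(weights)
    # Round each normalized weight to the nearest value in {-1, 0, +1}.
    quantized = [max(-1, min(1, round(w / (scale + eps)))) for w in weights]
    return quantized, scale

q, s = ternarize([0.5, -2.0, 0.01, 1.0])
print(q)  # [1, -1, 0, 1]; the dequantized weights are s * q
```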

Computational Flow

  • The computational flow of the forward, backward and inference processes in a TriLM linear layer with N-way model parallelism.
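One reason this flow is cheap at inference time: with ternary weights, the matrix multiply reduces to additions and subtractions plus a single scale multiply per output. A hedged sketch (`ternary_linear` is illustrative; production kernels pack weights and vectorize):

```python
def ternary_linear(x, w_q, scale):
    """y = scale * (w_q @ x) where every entry of w_q is in {-1, 0, +1}."""
    out = []
    for row in w_q:
        acc = 0.0
        for w, xi in zip(row, x):
            if w == 1:     # +1 weight: add the activation
                acc += xi
            elif w == -1:  # -1 weight: subtract it; 0 contributes nothing
                acc -= xi
        out.append(scale * acc)
    return out

print(ternary_linear([1.0, 2.0, 3.0], [[1, 0, -1], [0, 1, 1]], 0.5))
# [-1.0, 2.5]
```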

TriLM vs BitNet

Relative performance across architectures

Key Highlights:

  • Our BitNet replication achieves performance between the 700M and 1.3B models.
  • TriLM 1.1B outperforms BitNet, including the larger 1.3B model.
  • TriLM 1.1B does not achieve parity with FloatLM 1.1B at this scale.

Optimisation Schedule

Training loss over 100B tokens for different optimization interventions: both L2 regularization and peak LR, only L2 regularization, only peak LR, and neither.

Spectra Suite: Spanning Parameters & Bitwidth

Overview of the Suite

The suite includes three model families:

  1. TriLMs (ternary language models)
  2. FloatLMs (FP16 language models)
  3. QuantLMs (FloatLMs quantized to 3, 4, 6 & 8 bits)

Key Properties of our Suite:

  • Scale: Spans across parameters and bit-widths.
  • Uniform Training: Identical training data sequence.
  • Public Accessibility: Training data is public.
  • Consistent Model Size Mapping: one-to-one mapping for parameter count.

About FloatLM (Float16 LM)

  • Architecture: LLaMa-style, similar to TriLM
  • Parameters: represented as FP16/BF16
  • Optimization:
    • Cosine decay scheduling
    • Weight decay
    • Learning rate warmup

About QuantLM (Quantized LM)

  • Quantization technique: GPTQ (post-training quantization)
  • Precision levels: 3, 4, 6 and 8 bits
  • Details:
    • All transformer layer weights are quantized
    • 3-bit and 4-bit quantization use a group size of 128
    • Effective bits: 3.25 and 4.25 bits per parameter
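The 3.25 / 4.25 effective-bit figures follow from group-wise accounting: with a group size of 128, the per-group metadata amortizes to a quarter bit per weight. A sketch assuming a 16-bit scale and a 16-bit zero-point per group (`effective_bits` and that metadata split are illustrative assumptions, not stated on the slide):

```python
def effective_bits(base_bits, group_size=128, scale_bits=16, zero_bits=16):
    """Bits per parameter once per-group quantization metadata is amortized."""
    return base_bits + (scale_bits + zero_bits) / group_size

print(effective_bits(3))  # 3.25
print(effective_bits(4))  # 4.25
```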

Training Dynamics and Scaling Laws

  • Training cross-entropy loss across steps for the TriLM family of models.
  • At the halfway point (150B tokens), lowering the peak learning rate produces a sudden drop in training loss.
  • At the two-thirds point, removing weight decay leads to faster convergence.

Final Validation Loss across Size and Parameters

  • At the size of TriLM 3.9B, ternary models start offering better performance than models more than five times their size.
  • TriLMs of increasing size offer much better performance than FloatLMs of the same number of bits, and the gap in validation perplexity closes at large scale.

Advancing Research via Open Access

  • We open-source over 500 intermediate checkpoints across the training of TriLMs and FloatLMs in the Spectra suite.

Results

Commonsense & Reasoning

  • At 3B+ scales, TriLMs demonstrate better performance for their size than QuantLMs, and competitive performance with FloatLMs of the same parameter count.

Knowledge

Conclusion and Discussion

Conclusion

  • We introduce the Spectra suite.
  • TriLMs offer better performance for their size than quantized models.
  • TriLM 3.9B achieves competitive performance with the larger FloatLM 3.9B across various commonsense, reasoning, and knowledge-based benchmarks.

Broader Impact:

  • Reduced training costs
  • Environmental benefits and resource efficiency
  • Benefits on specialised hardware like Groq and Cerebras

Thanks

Tejas Pandey (Nolano AI, IIT Kharagpur)
Tejas Vaidhya (Nolano AI, MILA, University of Montreal)
Ayush Kaushal (Nolano AI, University of Montreal)
Aaryan Bhagat (UC Riverside)
Irina Rish (Nolano AI, MILA, University of Montreal)

  • Models are available at https://huggingface.co/SpectraSuite

Thank you <3

  • Read our paper: https://arxiv.org/pdf/2210.17323
  • SpectraSuite models: https://huggingface.co/SpectraSuite
  • GitHub: https://github.com/NolanoOrg/SpectraSuite