1 of 35

Put LLMs on device? Challenges and new opportunities

Zechun Liu, Meta Reality Labs

2 of 35

On-device deployment

  • Portability and computational-cost considerations motivate deploying language models on smartphones and other mobile devices.
  • A new line of research has emerged that downsizes LMs to enable on-device inference.

3 of 35

In this talk

  • MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases

  • LLM-QAT: Data-Free Quantization Aware Training for Large Language Models (arXiv)

  • LLM-FP4: 4-Bit Floating-Point Quantized Transformers (EMNLP 2023)

4 of 35

MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases

Zechun Liu, Changsheng Zhao, Forrest Iandola, Chen Lai, Yuandong Tian, Igor Fedorov, Yunyang Xiong, Ernie Chang, Yangyang Shi, Raghuraman Krishnamoorthi, Liangzhen Lai, Vikas Chandra

5 of 35

Questions:

  • Why do we want to put LLMs on device?

  • What are the constraints for on-device LLM deployment?

  • What are the capabilities of small models?

6 of 35

Motivation

7 of 35

MobileLLM

Zero-shot commonsense reasoning

8 of 35

Design choices

9 of 35

1. SwiGLU

  • By changing the vanilla FFN (FC → ReLU → FC) to SwiGLU, the average performance on zero-shot reasoning tasks is boosted from 42.6 to 43.9 for the 125M model.
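A minimal PyTorch sketch of the two FFN variants (module and layer names here are illustrative, not taken from the MobileLLM code):

import torch.nn as nn
import torch.nn.functional as F

class VanillaFFN(nn.Module):
    """Baseline FFN: FC -> ReLU -> FC."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, dim)

    def forward(self, x):
        return self.fc2(F.relu(self.fc1(x)))

class SwiGLUFFN(nn.Module):
    """SwiGLU FFN: gate with SiLU(x W1), multiply by x W3, project back with W2."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # gate projection
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)  # value projection
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # output projection

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))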

10 of 35

2. Deep and thin architecture

11 of 35

3. Embedding Sharing

In sub-billion scale language models, the embedding layers constitute a significant portion of the parameter count: 20% in a 125M model.

Therefore, we revisit the embedding-sharing method proposed and implemented in the OPT models, in which the input embedding and the output classifier share one weight matrix.
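A minimal PyTorch sketch of the weight tying this refers to (the vocabulary size and embedding dimension below are illustrative):

import torch.nn as nn

class TiedEmbeddingLM(nn.Module):
    """Embedding sharing: the output classifier reuses the input-embedding
    matrix, so the (vocab_size x dim) table is stored only once.
    With a 32k vocabulary and 512-dim embeddings, that table alone is
    ~16M parameters -- a large fraction of a 125M-parameter model."""
    def __init__(self, vocab_size=32000, dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.lm_head = nn.Linear(dim, vocab_size, bias=False)
        self.lm_head.weight = self.embed.weight  # tie: one shared matrix

    def forward(self, token_ids, hidden_states):
        inputs = self.embed(token_ids)        # (B, T, dim) input embeddings
        logits = self.lm_head(hidden_states)  # (B, T, vocab) output logits
        return inputs, logits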

12 of 35

4. Grouped query attention
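A minimal sketch of grouped-query attention, where several query heads share each key/value head to shrink the KV projections and the KV cache (the head counts below are illustrative, not MobileLLM's actual configuration):

import torch.nn as nn
import torch.nn.functional as F

class GroupedQueryAttention(nn.Module):
    """GQA: many query heads attend over a smaller set of shared KV heads."""
    def __init__(self, dim=512, n_heads=8, n_kv_heads=2):
        super().__init__()
        assert n_heads % n_kv_heads == 0
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        self.q_proj = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(n_heads * self.head_dim, dim, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        # Repeat each KV head so every group of query heads can attend to it.
        rep = self.n_heads // self.n_kv_heads
        k = k.repeat_interleave(rep, dim=1)
        v = v.repeat_interleave(rep, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(b, t, -1))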

13 of 35

5. Layer sharing

14 of 35

5. Layer sharing
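A minimal sketch of the layer-sharing idea: executing each transformer block more than once, so effective depth grows without adding parameters. The immediate block-wise repetition shown here is one possible sharing pattern, chosen only for illustration:

import torch.nn as nn

class SharedLayerTransformer(nn.Module):
    """Run each block `repeats` times in a row, reusing its weights.

    The number of executed layers grows while the parameter count
    (and thus the weight-storage footprint) stays the same."""
    def __init__(self, blocks: nn.ModuleList, repeats: int = 2):
        super().__init__()
        self.blocks = blocks
        self.repeats = repeats

    def forward(self, x):
        for block in self.blocks:
            for _ in range(self.repeats):  # same weights, executed repeatedly
                x = block(x)
        return x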

15 of 35

Final results – zero-shot reasoning

16 of 35

Final results – latency

17 of 35

Final results – chat

18 of 35

Final results – chat example

19 of 35

Final results – chat example

20 of 35

LLM-QAT: Data-Free Quantization Aware Training for Large Language Models

Zechun Liu*, Barlas Oguz*, Changsheng Zhao, Ernie Chang, Pierre Stock, Yashar Mehdad, Yangyang Shi, Raghuraman Krishnamoorthi, Vikas Chandra

21 of 35

Previous works in LLM quantization

  • Mainly focused on post-training quantization
    • GPTQ W4A32
    • LLM.int8() W8A8 + W16A16
    • SmoothQuant W8A8

  • Performance at lower bit widths, and when quantizing weights, activations, and the KV cache together, remains relatively low.

22 of 35

QAT vs PTQ

Advantages:

  • Higher inference performance and compression ratio.

Challenges → Solutions:

  • Requires a representative dataset → generate data from the pretrained language model itself.

  • Longer training time → we fine-tune for only 100,000 iterations.
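A minimal sketch of the data-generation step using Hugging Face Transformers; the model name, sequence length, and sampling settings are assumptions, and the exact sampling strategy used in LLM-QAT may differ:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical model name; any pretrained causal LM works the same way.
name = "huggyllama/llama-7b"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16).cuda()

@torch.no_grad()
def generate_sample(max_len=1024):
    # Start from a random first token so the generations cover the model's
    # own output distribution rather than any external corpus.
    start = torch.randint(0, tok.vocab_size, (1, 1), device="cuda")
    out = model.generate(start, max_length=max_len, do_sample=True,
                         top_k=0, temperature=1.0)
    return out[0]  # token ids, used directly as QAT fine-tuning data

synthetic_data = [generate_sample() for _ in range(4)]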

23 of 35

Overview

24 of 35

Make data available for QAT

Publicly available datasets:

WikiText-2/103: too small → overfitting

C4: larger, but still does not preserve the pre-training data distribution → drop in zero-shot accuracy

25 of 35

Quantization basics

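As a reference point, a minimal sketch of per-tensor symmetric min-max (absmax) quantization, the scheme the takeaways later refer to (the function name and 4-bit default are illustrative):

import torch

def minmax_symmetric_quantize(x: torch.Tensor, n_bits: int = 4):
    """Per-tensor symmetric min-max (absmax) quantization.

    The scale is set from the largest-magnitude value, so outliers are
    represented rather than clipped -- the property highlighted in the
    takeaways for LLM weights and activations."""
    qmax = 2 ** (n_bits - 1) - 1                  # e.g. 7 for signed 4-bit
    scale = x.abs().max().clamp(min=1e-8) / qmax  # scale from the max magnitude
    q = torch.clamp(torch.round(x / scale), -qmax, qmax)
    return q * scale                              # dequantized ("fake-quantized") tensor

x = torch.randn(4, 8)
print(minmax_symmetric_quantize(x, n_bits=4))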

26 of 35

Final performance

8-bit settings: all quantization methods perform well.

27 of 35

Final performance

4-bit settings: PTQ methods result in accuracy loss, whereas LLM-QAT holds up much better.

28 of 35

Final performance

At similar model size (memory footprint): LLaMA-30B at 4 bits vs. LLaMA-13B at 8 bits vs. LLaMA-7B at 16 bits

  • 4-bit 30B LLaMA > 8-bit 13B LLaMA > 16-bit 7B LLaMA

At similar model size: LLaMA-13B at 4 bits vs. LLaMA-7B at 8 bits

  • 4-bit 13B LLaMA (LLM-QAT) > 8-bit 7B LLaMA

We recommend 4-bit quantization for a good accuracy-memory trade-off!
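A back-of-the-envelope check of why these pairings are at comparable model size (weight storage only; KV cache and quantization metadata are ignored):

def weight_memory_gb(params_billion: float, bits: int) -> float:
    # Approximate weight footprint: #params x (bits / 8) bytes.
    return params_billion * 1e9 * bits / 8 / 1e9

for name, params, bits in [("LLaMA-7B  @ 16-bit", 7, 16),
                           ("LLaMA-13B @  8-bit", 13, 8),
                           ("LLaMA-30B @  4-bit", 30, 4)]:
    print(f"{name}: ~{weight_memory_gb(params, bits):.0f} GB")
# ~14 GB, ~13 GB, ~15 GB -- roughly the same memory budget,
# so the larger-but-lower-bit model wins on accuracy.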

29 of 35

Takeaways

  • For LLM quantization fine-tuning, generated data > real data.

  • For LLMs, min-max quantization that preserves the outliers is preferable.

  • At the same model size, a 4-bit quantized model > 8-bit > 16-bit.

30 of 35

LLM-FP4: 4-Bit Floating-Point Quantized Transformers

Shih-yang Liu*, Zechun Liu*, Xijie Huang, Pingcheng Dong, Kwang-Ting Cheng

31 of 35

Motivation

  • Existing PTQ solutions for transformer models are primarily integer-based and struggle with bit widths below 8 bits.

  • FP8 has emerged as the default choice on various hardware platforms, including the NVIDIA H100, and a growing body of work has studied low-bit FP formats [1,2].

32 of 35

Challenge: channel-wise outliers

  • Outlier channels dominate the quantization range of the whole tensor, leaving less representation precision for channels with smaller magnitudes.

  • Per-channel activation quantization is not a solution → per-channel scales along the inner dimension cannot be factored out of the accumulation, so efficient matrix-multiplication kernels no longer apply.

33 of 35

Pre-Shifted Exponent Bias
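A simplified view of the reparameterization idea, written as a per-channel rescaling (an assumption-level sketch: s denotes per-input-channel activation scales, and in the paper this factor is realized as a shift of the floating-point exponent bias rather than an explicit multiplication):

\[
Y = XW
  = \bigl(X\,\mathrm{diag}(s)^{-1}\bigr)\bigl(\mathrm{diag}(s)\,W\bigr)
  = \hat{X}\,\widehat{W},
\qquad \widehat{W} = \mathrm{diag}(s)\,W .
\]

Since \(\widehat{W}\) can be folded into the weights offline, the runtime kernel still computes a plain matrix product on the outlier-suppressed \(\hat{X}\), avoiding the per-channel rescaling that would otherwise break efficient matrix multiplication.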

34 of 35

Results

35 of 35

Thanks
Q&A

Zechun Liu