1 of 35

Put LLMs on device? Challenges and new opportunities

Zechun Liu, Meta Reality Labs

2 of 35

On-device deployment

  • Portability and computational-cost considerations motivate deploying language models on smartphones and other mobile devices.
  • A new line of research has emerged that downsizes LMs to enable on-device inference.

3 of 35

In this talk

  • MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases

  • LLM-QAT: Data-Free Quantization Aware Training for Large Language Models (arXiv)

  • LLM-FP4: 4-Bit Floating-Point Quantized Transformers (EMNLP 2023)

4 of 35

MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases

Zechun Liu, Changsheng Zhao, Forrest Iandola, Chen Lai, Yuandong Tian, Igor Fedorov, Yunyang Xiong, Ernie Chang, Yangyang Shi, Raghuraman Krishnamoorthi, Liangzhen Lai, Vikas Chandra

5 of 35

Questions:

  • Why do we want to put LLMs on device?

  • What are the constraints for on-device LLM deployment?

  • What are the capabilities of small models?

6 of 35

Motivation

7 of 35

MobileLLM

Zero-shot commonsense reasoning

8 of 35

Design choices

9 of 35

1. SwiGLU

  • By changing the vanilla FFN (FC → ReLU → FC) to SwiGLU, the average performance on zero-shot reasoning tasks is boosted from 42.6 to 43.9 for the 125M model.
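A minimal PyTorch sketch of the two FFN variants (module and layer names here are illustrative, not taken from the MobileLLM code):

import torch.nn as nn
import torch.nn.functional as F

class VanillaFFN(nn.Module):
    """Baseline FFN: FC -> ReLU -> FC."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, dim)

    def forward(self, x):
        return self.fc2(F.relu(self.fc1(x)))

class SwiGLUFFN(nn.Module):
    """SwiGLU FFN: gate with SiLU(x W1), multiply by x W3, project back with W2."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # gate projection
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)  # value projection
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # output projection

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))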

10 of 35

2. Deep and thin architecture

11 of 35

3. Embedding Sharing

In sub-billion scale language models, the embedding layers constitute a significant portion of the parameter count: 20% in a 125M model.

Therefore, we revisit the embedding-sharing method proposed and implemented in the OPT models, in which the input embedding and the output classifier share one weight matrix.
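A minimal PyTorch sketch of the weight tying this refers to (the vocabulary size and embedding dimension below are illustrative):

import torch.nn as nn

class TiedEmbeddingLM(nn.Module):
    """Embedding sharing: the output classifier reuses the input-embedding
    matrix, so the (vocab_size x dim) table is stored only once.
    With a 32k vocabulary and 512-dim embeddings, that table alone is
    ~16M parameters -- a large fraction of a 125M-parameter model."""
    def __init__(self, vocab_size=32000, dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.lm_head = nn.Linear(dim, vocab_size, bias=False)
        self.lm_head.weight = self.embed.weight  # tie: one shared matrix

    def forward(self, token_ids, hidden_states):
        inputs = self.embed(token_ids)        # (B, T, dim) input embeddings
        logits = self.lm_head(hidden_states)  # (B, T, vocab) output logits
        return inputs, logits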

12 of 35

4. Grouped query attention
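A minimal sketch of grouped-query attention, where several query heads share each key/value head to shrink the KV projections and the KV cache (the head counts below are illustrative, not MobileLLM's actual configuration):

import torch.nn as nn
import torch.nn.functional as F

class GroupedQueryAttention(nn.Module):
    """GQA: many query heads attend over a smaller set of shared KV heads."""
    def __init__(self, dim=512, n_heads=8, n_kv_heads=2):
        super().__init__()
        assert n_heads % n_kv_heads == 0
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        self.q_proj = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(n_heads * self.head_dim, dim, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        # Repeat each KV head so every group of query heads can attend to it.
        rep = self.n_heads // self.n_kv_heads
        k = k.repeat_interleave(rep, dim=1)
        v = v.repeat_interleave(rep, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(b, t, -1))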

13 of 35

5. Layer sharing

14 of 35

5. Layer sharing
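A minimal sketch of the layer-sharing idea: executing each transformer block more than once, so effective depth grows without adding parameters. The immediate block-wise repetition shown here is one possible sharing pattern, chosen only for illustration:

import torch.nn as nn

class SharedLayerTransformer(nn.Module):
    """Run each block `repeats` times in a row, reusing its weights.

    The number of executed layers grows while the parameter count
    (and thus the weight-storage footprint) stays the same."""
    def __init__(self, blocks: nn.ModuleList, repeats: int = 2):
        super().__init__()
        self.blocks = blocks
        self.repeats = repeats

    def forward(self, x):
        for block in self.blocks:
            for _ in range(self.repeats):  # same weights, executed repeatedly
                x = block(x)
        return x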

15 of 35

Final results – zero-shot reasoning

16 of 35

Final results – latency

17 of 35

Final results – chat

18 of 35

Final results – chat example

19 of 35

Final results – chat example

20 of 35

LLM-QAT: Data-Free Quantization Aware Training for Large Language Models

Zechun Liu*, Barlas Oguz*, Changsheng Zhao, Ernie Chang, Pierre Stock, Yashar Mehdad, Yangyang Shi, Raghuraman Krishnamoorthi, Vikas Chandra

21 of 35

Previous works in LLM quantization

  • Mainly focused on post-training quantization
    • GPTQ W4A32
    • LLM.int8() W8A8 + W16A16
    • SmoothQuant W8A8

  • Performance at lower bit widths, and when quantizing weights, activations, and the KV cache together, remains relatively low.

22 of 35

QAT vs PTQ

Advantages:

  • Higher inference performance and compression ratio.

Challenges → Solutions:

  • Requires a representative dataset → generate data from the pretrained language model itself.

  • Longer training time → we fine-tune for only 100,000 iterations.
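A minimal sketch of the data-generation step using Hugging Face Transformers; the model name, sequence length, and sampling settings are assumptions, and the exact sampling strategy used in LLM-QAT may differ:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical model name; any pretrained causal LM works the same way.
name = "huggyllama/llama-7b"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16).cuda()

@torch.no_grad()
def generate_sample(max_len=1024):
    # Start from a random first token so the generations cover the model's
    # own output distribution rather than any external corpus.
    start = torch.randint(0, tok.vocab_size, (1, 1), device="cuda")
    out = model.generate(start, max_length=max_len, do_sample=True,
                         top_k=0, temperature=1.0)
    return out[0]  # token ids, used directly as QAT fine-tuning data

synthetic_data = [generate_sample() for _ in range(4)]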

23 of 35

Overview

24 of 35

Make data available for QAT

Publicly available datasets:

WikiText-2/103: too small → overfitting

C4: larger, but still does not preserve the pre-training data distribution → drop in zero-shot accuracy

25 of 35

Quantization basics

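As a reference point, a minimal sketch of per-tensor symmetric min-max (absmax) quantization, the scheme the takeaways later refer to (the function name and 4-bit default are illustrative):

import torch

def minmax_symmetric_quantize(x: torch.Tensor, n_bits: int = 4):
    """Per-tensor symmetric min-max (absmax) quantization.

    The scale is set from the largest-magnitude value, so outliers are
    represented rather than clipped -- the property highlighted in the
    takeaways for LLM weights and activations."""
    qmax = 2 ** (n_bits - 1) - 1                  # e.g. 7 for signed 4-bit
    scale = x.abs().max().clamp(min=1e-8) / qmax  # scale from the max magnitude
    q = torch.clamp(torch.round(x / scale), -qmax, qmax)
    return q * scale                              # dequantized ("fake-quantized") tensor

x = torch.randn(4, 8)
print(minmax_symmetric_quantize(x, n_bits=4))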

26 of 35

Final performance

8-bit settings: all quantization methods perform well.

27 of 35

Final performance

4-bit settings: PTQ methods result in accuracy loss, whereas LLM-QAT holds up much better.

28 of 35

Final performance

At similar model size (memory footprint): LLaMA-30B at 4 bits vs. LLaMA-13B at 8 bits vs. LLaMA-7B at 16 bits

  • 4-bit 30B LLaMA > 8-bit 13B LLaMA > 16-bit 7B LLaMA

At similar model size: LLaMA-13B at 4 bits vs. LLaMA-7B at 8 bits

  • 4-bit 13B LLaMA (LLM-QAT) > 8-bit 7B LLaMA

We recommend 4-bit quantization for a good accuracy-memory trade-off!
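A back-of-the-envelope check of why these pairings are at comparable model size (weight storage only; KV cache and quantization metadata are ignored):

def weight_memory_gb(params_billion: float, bits: int) -> float:
    # Approximate weight footprint: #params x (bits / 8) bytes.
    return params_billion * 1e9 * bits / 8 / 1e9

for name, params, bits in [("LLaMA-7B  @ 16-bit", 7, 16),
                           ("LLaMA-13B @  8-bit", 13, 8),
                           ("LLaMA-30B @  4-bit", 30, 4)]:
    print(f"{name}: ~{weight_memory_gb(params, bits):.0f} GB")
# ~14 GB, ~13 GB, ~15 GB -- roughly the same memory budget,
# so the larger-but-lower-bit model wins on accuracy.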

29 of 35

Takeaways

  • For LLM quantization fine-tuning, generated data > real data.

  • For LLMs, min-max quantization that preserves the outliers is preferable.

  • At the same model size, a 4-bit quantized model > 8-bit > 16-bit.

30 of 35

LLM-FP4: 4-Bit Floating-Point Quantized Transformers

Shih-yang Liu*, Zechun Liu*, Xijie Huang, Pingcheng Dong, Kwang-Ting Cheng

31 of 35

Motivation

  • Existing PTQ solutions for transformer models are primarily integer-based and struggle with bit widths below 8 bits.

  • FP8 has emerged as the default choice on various hardware platforms, including the NVIDIA H100, and a growing body of work has studied low-bit FP formats [1,2].

32 of 35

Challenge: channel-wise outliers

  • Outlier channels dominate the quantization range of the whole tensor, leaving less representation precision for channels with smaller magnitudes.

  • Per-channel activation quantization is not a solution → per-channel scales along the inner dimension cannot be factored out of the accumulation, so efficient matrix-multiplication kernels no longer apply.

33 of 35

Pre-Shifted Exponent Bias
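A simplified view of the reparameterization idea, written as a per-channel rescaling (an assumption-level sketch: s denotes per-input-channel activation scales, and in the paper this factor is realized as a shift of the floating-point exponent bias rather than an explicit multiplication):

\[
Y = XW
  = \bigl(X\,\mathrm{diag}(s)^{-1}\bigr)\bigl(\mathrm{diag}(s)\,W\bigr)
  = \hat{X}\,\widehat{W},
\qquad \widehat{W} = \mathrm{diag}(s)\,W .
\]

Since \(\widehat{W}\) can be folded into the weights offline, the runtime kernel still computes a plain matrix product on the outlier-suppressed \(\hat{X}\), avoiding the per-channel rescaling that would otherwise break efficient matrix multiplication.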

34 of 35

Results

35 of 35

Thanks
Q&A

Zechun Liu