Put LLMs on device? Challenges and new opportunities
Zechun Liu, Meta Reality Labs
On-device deployment
In this talk
MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases
Zechun Liu, Changsheng Zhao, Forrest Iandola, Chen Lai, Yuandong Tian, Igor Fedorov, Yunyang Xiong, Ernie Chang, Yangyang Shi, Raghuraman Krishnamoorthi, Liangzhen Lai, Vikas Chandra
Questions:
Motivation
MobileLLM
Zero-shot commonsense reasoning
Design choices
1. SwiGLU
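As a rough illustration, a SwiGLU feed-forward block computes down(SiLU(gate(x)) ⊙ up(x)); the module and dimension names below are mine, not MobileLLM's exact code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """Feed-forward block with a SwiGLU gate: down(SiLU(gate(x)) * up(x))."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.gate_proj = nn.Linear(dim, hidden_dim, bias=False)
        self.up_proj = nn.Linear(dim, hidden_dim, bias=False)
        self.down_proj = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))
```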
2. Deep and thin architecture
3. Embedding Sharing
In sub-billion-scale language models, the embedding layers account for a significant fraction of the parameters: roughly 20% of a 125M-parameter model.
Therefore, we revisit the input-output embedding sharing method originally proposed and used in the OPT models.
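In practice, embedding sharing amounts to tying the LM head's projection matrix to the token-embedding matrix. A minimal sketch, with illustrative sizes rather than the exact MobileLLM configuration:

```python
import torch.nn as nn

vocab_size, dim = 32000, 576            # illustrative sizes, not the real config

tok_embed = nn.Embedding(vocab_size, dim)
lm_head = nn.Linear(dim, vocab_size, bias=False)

# Tie the output projection to the input embedding: the vocab_size x dim
# matrix is stored (and trained) once instead of twice.
lm_head.weight = tok_embed.weight
```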
4. Grouped query attention
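Grouped-query attention lets several query heads share one key/value head, shrinking the KV cache. A minimal sketch of the KV-head expansion, with hypothetical shapes:

```python
import torch

n_heads, n_kv_heads, head_dim = 12, 4, 64      # hypothetical head configuration
q = torch.randn(2, 16, n_heads, head_dim)      # (batch, seq, query_heads, head_dim)
k = torch.randn(2, 16, n_kv_heads, head_dim)   # fewer KV heads -> smaller KV cache

# Each KV head serves n_heads // n_kv_heads query heads: repeat the KV heads
# so the shapes line up for standard scaled-dot-product attention.
k_expanded = k.repeat_interleave(n_heads // n_kv_heads, dim=2)
assert k_expanded.shape == q.shape
```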
5. Layer sharing
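Layer sharing reuses a transformer block's weights more than once in the forward pass, increasing effective depth without adding parameters. A minimal sketch of an immediate repeat-per-block scheme (one of the variants studied in the paper), with a placeholder block module:

```python
import torch.nn as nn

class SharedDepthModel(nn.Module):
    """Run each stored block `repeat` times: effective depth = len(blocks) * repeat."""
    def __init__(self, blocks: nn.ModuleList, repeat: int = 2):
        super().__init__()
        self.blocks = blocks
        self.repeat = repeat

    def forward(self, x):
        for block in self.blocks:
            for _ in range(self.repeat):   # reuse the same weights immediately
                x = block(x)
        return x
```

Repeating a block immediately, rather than looping over the whole stack twice, also tends to be friendlier to on-device weight caching, since each block's weights stay resident while they are reused.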
Final results – zero-shot reasoning
Final results – latency
Final results – chat
Final results – chat example
LLM-QAT: Data-Free Quantization Aware Training for Large Language Models
Zechun Liu*, Barlas Oguz*, Changsheng Zhao, Ernie Chang, Pierre Stock, Yashar Mehdad, Yangyang Shi, Raghuraman Krishnamoorthi, Vikas Chandra
Previous work on LLM quantization
QAT vs PTQ
Advantage:
Challenges:
--> Generate data from the pretrained language model itself
--> We only use 100,000 iterations for fine-tuning.
Solutions
Overview
Make data available for QAT
Publicly available datasets
WikiText-2/103: too small → overfitting
C4: larger, but still cannot preserve the pre-training data distribution → drop in zero-shot accuracy
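Since public corpora fall short, the data-free alternative is to let the full-precision model generate its own fine-tuning data by next-token sampling. A minimal sketch using Hugging Face transformers; the checkpoint name and sampling settings are placeholders, and the actual LLM-QAT generation procedure mixes deterministic and stochastic sampling rather than being purely stochastic as here.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; any pretrained causal LM works for this sketch.
name = "facebook/opt-125m"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

@torch.no_grad()
def synthesize(start_token_id: int, length: int = 128) -> torch.Tensor:
    """Let the full-precision model write one synthetic training sequence."""
    ids = torch.tensor([[start_token_id]])
    for _ in range(length):
        logits = model(ids).logits[:, -1, :]
        probs = torch.softmax(logits, dim=-1)
        nxt = torch.multinomial(probs, num_samples=1)  # stochastic next-token sampling
        ids = torch.cat([ids, nxt], dim=1)
    return ids  # used as training data for the quantized student

sample = synthesize(tok.bos_token_id or tok.eos_token_id)
```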
Quantization basics
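As a reference point, here is a minimal sketch of symmetric absmax (min-max) linear "fake" quantization with a straight-through estimator, the kind of quantizer typically applied per-channel to weights during QAT; the function name and per-tensor scale are simplifications of mine.

```python
import torch

def fake_quantize_symmetric(x: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    """Quantize x to n_bits with a symmetric (absmax) scale, then dequantize."""
    qmax = 2 ** (n_bits - 1) - 1                   # e.g. 7 for 4-bit
    scale = x.abs().max().clamp(min=1e-8) / qmax   # per-tensor here; per-channel in practice
    q = torch.clamp(torch.round(x / scale), -qmax, qmax)
    x_q = q * scale                                # dequantized ("fake quantized") values
    # Straight-through estimator: forward returns x_q, backward treats the
    # rounding as identity so gradients flow to the full-precision weights.
    return x + (x_q - x).detach()
```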
Final performance
8-bit settings: all quantization methods perform well.
Final performance
4-bit settings: PTQ methods result in accuracy loss, whereas LLM-QAT holds up much better.
Final performance
LLaMA-7B 16-bit
LLaMA-13B 8-bit
LLaMA-30B 4-bit
With similar model size, 4-bit 30B LLaMA > 8-bit 13B LLaMA > 16-bit 7B LLaMA.
LLaMA-13B 4-bit
LLaMA-7B 8-bit
With similar model size, 4-bit 13B LLaMA (LLM-QAT) > 8-bit 7B LLaMA.
We recommend using 4-bit quantization to achieve a good accuracy-memory trade-off!
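Back-of-the-envelope weight memory, assuming the weights dominate: 30B × 4 bit ≈ 15 GB, 13B × 8 bit ≈ 13 GB, and 7B × 16 bit ≈ 14 GB; likewise 13B × 4 bit ≈ 6.5 GB versus 7B × 8 bit ≈ 7 GB. So the comparisons above are made at roughly matched memory footprints.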
Takeaways
LLM-FP4: 4-Bit Floating-Point Quantized Transformers
Shih-yang Liu*, Zechun Liu*, Xijie Huang, Pingcheng Dong, Kwang-Ting Cheng
Motivation
Challenge: channel-wise outliers
Pre-Shifted Exponent Bias
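Very loosely, the underlying reparameterization idea is that per-(input-)channel scaling factors derived from activation outliers can be folded into the weights of the matmul that consumes those activations, leaving the activation quantizer itself per-tensor. The sketch below only demonstrates this mathematical equivalence with an arbitrary float scale; in LLM-FP4 the folded quantity is a per-channel exponent bias computed offline from calibration data.

```python
import torch

def fold_activation_scales(w: torch.Tensor, s: torch.Tensor) -> torch.Tensor:
    """Given y = x @ w.T, return w' so that (x / s) @ w'.T == x @ w.T.

    w: (out_features, in_features) weight
    s: (in_features,) per-input-channel scale (e.g. from activation outliers)
    """
    return w * s  # broadcast over the in_features dimension

# Tiny numerical check of the equivalence (shapes are arbitrary).
x = torch.randn(4, 8)
w = torch.randn(16, 8)
s = x.abs().amax(dim=0).clamp(min=1e-8)   # per-channel activation maxima
y_ref = x @ w.t()
y_folded = (x / s) @ fold_activation_scales(w, s).t()
assert torch.allclose(y_ref, y_folded, atol=1e-5)
```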
Results
Thanks! Q&A
Zechun Liu