Practical Guide to Pretrained Language Models
Taiwei Shi
LLM Training Paradigm
Pretraining
Pretraining is the act of training a model from scratch: the weights are randomly initialized, and the training starts without any prior knowledge.
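A minimal sketch of what "from scratch" means in code, assuming the Hugging Face transformers library and GPT-2 as an illustrative architecture: instantiate the architecture with random weights and train on the next-token prediction loss.

```python
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
config = AutoConfig.from_pretrained("gpt2")        # architecture definition only
model = AutoModelForCausalLM.from_config(config)   # weights are randomly initialized

batch = tokenizer("Language models learn from raw web text.", return_tensors="pt")
# For causal LM, labels are the input ids; the model shifts them internally.
loss = model(**batch, labels=batch["input_ids"]).loss
loss.backward()   # one step of next-token prediction (optimizer and data loop omitted)
```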
Post-training (Finetuning + Preference Optimization)
Post-training, on the other hand, is any further training applied after a model has been pretrained, typically instruction finetuning followed by preference optimization.
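For contrast with the pretraining sketch above, post-training starts from the pretrained checkpoint rather than a random initialization (same illustrative GPT-2 choice; the later SFT and preference-optimization steps are only indicated in comments).

```python
from transformers import AutoModelForCausalLM

# Post-training loads pretrained weights, not a random init.
model = AutoModelForCausalLM.from_pretrained("gpt2")
# ...then continue training on instruction data (SFT), followed by
# preference optimization such as RLHF or DPO.
```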
Transformers for causal language modeling
Transformers for masked language modeling (MLM)
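A quick illustration of the two objectives via the transformers pipeline API; the model choices (GPT-2 for causal LM, BERT for MLM) are illustrative, not prescribed here.

```python
from transformers import pipeline

# Causal LM: predicts the next token from left context only.
generator = pipeline("text-generation", model="gpt2")
print(generator("The Transformer architecture", max_new_tokens=10))

# Masked LM: predicts a masked token using context on both sides.
unmasker = pipeline("fill-mask", model="bert-base-uncased")
print(unmasker("The Transformer [MASK] was introduced in 2017."))
```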
Transformers are big models
Encoder vs Decoder
Three Categories of Transformers
Examples of Encoder-only Models (MLM)
Examples of Decoder-only Models
Can we get something that is even close to ChatGPT by just doing next-token prediction on the web???
NO
Original GPT-3
GPT-3 after Instruction Tuning and RLHF
Technical Evolution of GPT-series Models
LLM Training Paradigm
HuggingFace
HuggingFace Datasets → https://huggingface.co/datasets
HuggingFace Models → https://huggingface.co/models
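A minimal sketch of pulling a dataset and a model from the Hub; the specific ids (imdb, gpt2) are illustrative placeholders, any public Hub id works.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

dataset = load_dataset("imdb", split="train")            # dataset from the Hub
tokenizer = AutoTokenizer.from_pretrained("gpt2")        # model + tokenizer from the Hub
model = AutoModelForCausalLM.from_pretrained("gpt2")

print(len(dataset), dataset[0]["text"][:80])
```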
LLM Inference Framework
vLLM (https://github.com/vllm-project/vllm)
Up to 24x higher throughput than a vanilla PyTorch / HuggingFace Transformers serving setup!
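A minimal offline-inference sketch with vLLM's Python API; facebook/opt-125m is just a small demo model, swap in any supported checkpoint.

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")                     # loads the model into vLLM's engine
params = SamplingParams(temperature=0.8, max_tokens=64)
outputs = llm.generate(["The key idea behind PagedAttention is"], params)
print(outputs[0].outputs[0].text)
```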
LLM Training Framework
LLaMA Factory (https://github.com/hiyouga/LLaMA-Factory)
OpenRLHF (https://github.com/OpenRLHF/OpenRLHF)
LLM Hardware Requirement
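A rough back-of-the-envelope estimate of GPU memory, assuming fp16/bf16 weights for inference and the common ~16 bytes-per-parameter rule of thumb for full fine-tuning with mixed-precision Adam (activations and KV cache excluded).

```python
# Rule of thumb: ~2 bytes/param for fp16 inference; ~16 bytes/param for full
# fine-tuning (fp16 weights + gradients, fp32 master weights + Adam moments).
def estimate_memory_gib(params_billion: float, training: bool = False) -> float:
    bytes_per_param = 16 if training else 2
    return params_billion * 1e9 * bytes_per_param / 1024**3

print(f"7B model, fp16 inference : ~{estimate_memory_gib(7):.0f} GiB")        # ~13 GiB
print(f"7B model, full fine-tune : ~{estimate_memory_gib(7, True):.0f} GiB")  # ~104 GiB
```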
Scaling Laws for Train-Time Compute
The test loss of a Transformer trained to autoregressively model language can be predicted by a power law when performance is bottlenecked by only one of three factors: the number of non-embedding parameters N, the dataset size D, or the optimally allocated compute budget Cmin.
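The single-factor power laws take the form below; the exponents are the approximate fitted values reported in Kaplan et al. (2020) and depend on the tokenization and experimental setup.

```latex
\begin{align*}
  L(N)        &= \left(\tfrac{N_c}{N}\right)^{\alpha_N},
                 & \alpha_N &\approx 0.076, \\
  L(D)        &= \left(\tfrac{D_c}{D}\right)^{\alpha_D},
                 & \alpha_D &\approx 0.095, \\
  L(C_{\min}) &= \left(\tfrac{C_c^{\min}}{C_{\min}}\right)^{\alpha_C^{\min}},
                 & \alpha_C^{\min} &\approx 0.050.
\end{align*}
```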
Scaling Laws for Test-Time Compute
The performance of transformers consistently improves with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute).