1 of 24

Practical Guide to Pretrained Language Models

Taiwei Shi

2 of 24

LLM Training Paradigm

  1. Pretraining
  2. Post-training
    1. Supervised Fine-Tuning (SFT)
    2. Preference Optimization (e.g., Reinforcement Learning from Human Feedback - RLHF)

3 of 24

Pretraining

Pretraining is the act of training a model from scratch: the weights are randomly initialized, and the training starts without any prior knowledge.
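
A minimal sketch of what "from scratch" looks like in code, assuming the Hugging Face transformers library; the tiny model configuration below is illustrative only.

  # Sketch: randomly initialized GPT-2-style model trained with the
  # next-token-prediction (causal LM) objective. Sizes are illustrative.
  import torch
  from transformers import GPT2Config, GPT2LMHeadModel, GPT2TokenizerFast

  tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")  # reuse an existing tokenizer
  config = GPT2Config(n_layer=4, n_head=4, n_embd=256)   # a tiny model, for illustration
  model = GPT2LMHeadModel(config)                        # weights are RANDOMLY initialized

  batch = tokenizer(["Pretraining starts from random weights."], return_tensors="pt")
  # For causal LM pretraining, the labels are the input ids themselves;
  # the model shifts them internally so each position predicts the next token.
  out = model(**batch, labels=batch["input_ids"])
  print(out.loss)      # high at the start of pretraining: the model knows nothing yet
  out.loss.backward()  # one step of the (very long) pretraining loop would follow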

4 of 24

Post-training (Finetuning + Preference Optimization)

Post-training, on the other hand, is the training done after a model has been pretrained.

  • Transfer learning: the pretrained model has already been trained on data that shares structure with the fine-tuning data, so much of what it learned carries over to the new task.
  • Efficiency: because the pretrained model has already seen lots of data, fine-tuning needs far less data and compute to reach good results (see the SFT sketch below).
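
A minimal sketch of supervised fine-tuning with the Hugging Face Trainer, assuming a small pretrained checkpoint (distilgpt2, purely for illustration) and a toy instruction-style dataset; the data and hyperparameters are not real training settings.

  # Sketch: supervised fine-tuning (SFT) of a small pretrained causal LM.
  from transformers import (AutoModelForCausalLM, AutoTokenizer,
                            Trainer, TrainingArguments,
                            DataCollatorForLanguageModeling)
  from datasets import Dataset

  tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
  tokenizer.pad_token = tokenizer.eos_token
  model = AutoModelForCausalLM.from_pretrained("distilgpt2")  # starts from pretrained weights

  # Toy instruction-response pairs (illustrative only).
  examples = [
      {"text": "### Instruction: Say hello.\n### Response: Hello there!"},
      {"text": "### Instruction: Name a color.\n### Response: Blue."},
  ]
  ds = Dataset.from_list(examples).map(
      lambda ex: tokenizer(ex["text"], truncation=True, max_length=128), batched=True)

  trainer = Trainer(
      model=model,
      args=TrainingArguments(output_dir="sft-demo", per_device_train_batch_size=2,
                             num_train_epochs=1, report_to=[]),
      train_dataset=ds,
      data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
  )
  trainer.train()  # far less data/compute than pretraining, thanks to transfer learning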

5 of 24

Transformers for causal language modeling
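
A minimal sketch of the causal LM setup, assuming Hugging Face transformers and the public gpt2 checkpoint: at each position the model predicts the next token from the left context only.

  # Sketch: causal (left-to-right) language modeling with a pretrained GPT-2.
  # At every position the model outputs a distribution over the NEXT token,
  # conditioned only on tokens to its left.
  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer

  tokenizer = AutoTokenizer.from_pretrained("gpt2")
  model = AutoModelForCausalLM.from_pretrained("gpt2")

  inputs = tokenizer("The capital of France is", return_tensors="pt")
  with torch.no_grad():
      logits = model(**inputs).logits              # shape: (1, seq_len, vocab_size)

  next_token_id = int(logits[0, -1].argmax())      # greedy choice for the next token
  print(tokenizer.decode([next_token_id]))         # likely " Paris" for a well-trained model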

6 of 24

Transformers for masked language modeling (MLM)
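
A minimal sketch of MLM inference with a fill-mask pipeline, assuming Hugging Face transformers and the bert-base-uncased checkpoint: the model uses context on both sides of the mask.

  # Sketch: masked language modeling (MLM) with a BERT-style encoder.
  # The model sees the WHOLE sentence and fills in the masked position
  # using context from both the left and the right.
  from transformers import pipeline

  fill_mask = pipeline("fill-mask", model="bert-base-uncased")
  for pred in fill_mask("The capital of France is [MASK]."):
      print(pred["token_str"], round(pred["score"], 3))  # e.g. "paris" with a high score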

7 of 24

Transformers are big models
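
One way to see how big they are is simply to count parameters; a minimal sketch, assuming Hugging Face transformers (the ~110M figure for BERT-base is approximate).

  # Sketch: counting parameters to see how big these models get.
  # BERT-base has on the order of 110M parameters; GPT-3 has 175B.
  from transformers import AutoModel

  model = AutoModel.from_pretrained("bert-base-uncased")
  n_params = sum(p.numel() for p in model.parameters())
  print(f"{n_params / 1e6:.0f}M parameters")   # roughly 110M for BERT-base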

8 of 24

Encoder vs Decoder

  • Encoder (left): The encoder receives an input and builds a representation of it (its features). This means that the model is optimized to acquire understanding from the input.
  • Decoder (right): The decoder uses the encoder’s representation (features) along with other inputs to generate a target sequence. This means that the model is optimized for generating outputs.

9 of 24

Three Categories of Transformers

  1. GPT-like (also called auto-regressive Transformer models)
  2. BERT-like (also called auto-encoding Transformer models)
  3. BART/T5-like (also called sequence-to-sequence Transformer models)
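
In Hugging Face terms, the three families map onto three auto-classes; a minimal sketch (the checkpoint names are just common public examples).

  # Sketch: the three Transformer families and their Hugging Face auto-classes.
  from transformers import (AutoModelForCausalLM,   # GPT-like: decoder-only, auto-regressive
                            AutoModelForMaskedLM,   # BERT-like: encoder-only, auto-encoding
                            AutoModelForSeq2SeqLM)  # BART/T5-like: encoder-decoder, seq2seq

  gpt_like = AutoModelForCausalLM.from_pretrained("gpt2")
  bert_like = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
  seq2seq = AutoModelForSeq2SeqLM.from_pretrained("t5-small")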

10 of 24

Examples of Encoder-only Models (MLM)

  • BERT (Bidirectional Encoder Representations from Transformers)
    • BERT builds bidirectional contextual representations of text (it was famously deployed to understand search queries). It significantly improved tasks like question answering and sentiment analysis.
  • RoBERTa (A Robustly Optimized BERT Pretraining Approach)
    • RoBERTa is trained with more data and optimized techniques
  • DistilBERT
    • A distilled version of BERT that is smaller, faster, and lighter
  • XLNet
    • XLNet integrates bidirectional context into autoregressive models
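
A minimal sketch of an encoder-only model in use, assuming Hugging Face transformers; the checkpoint is DistilBERT fine-tuned on SST-2 for sentiment analysis.

  # Sketch: an encoder-only model fine-tuned for sentiment classification.
  from transformers import pipeline

  classifier = pipeline("sentiment-analysis",
                        model="distilbert-base-uncased-finetuned-sst-2-english")
  print(classifier("I really enjoyed this lecture!"))
  # e.g. [{'label': 'POSITIVE', 'score': 0.99...}]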

11 of 24

Examples of Decoder-only Models

  • Proprietary (only API access)
    • ChatGPT, Gemini, Claude
  • Open-weight (no data/code access, just the model weights)
    • Meta LLaMA, Google Gemma, Mistral, Microsoft Phi, Alibaba Qwen, …
  • Open-source (access to everything including data and code)
    • AI2 OLMo, EleutherAI GPT-Neo/Pythia/GPT-J
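
A minimal sketch of loading one of the open models and generating text, assuming Hugging Face transformers; Pythia-70M is chosen only because it is small and fully open (weights, training data, and code).

  # Sketch: loading an open decoder-only model and generating text.
  from transformers import AutoModelForCausalLM, AutoTokenizer

  name = "EleutherAI/pythia-70m"                 # tiny member of the Pythia suite
  tokenizer = AutoTokenizer.from_pretrained(name)
  model = AutoModelForCausalLM.from_pretrained(name)

  inputs = tokenizer("Large language models are", return_tensors="pt")
  output_ids = model.generate(**inputs, max_new_tokens=20, do_sample=False)
  print(tokenizer.decode(output_ids[0], skip_special_tokens=True))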

12 of 24

Can we get something that is even close to ChatGPT by just doing next-token prediction on the web???

13 of 24

Can we get something that is even close to ChatGPT by just doing next-token prediction on the web???

NO

14 of 24

Original GPT-3 vs. GPT-3 after Instruction Tuning and RLHF

15 of 24

Technical Evolution of GPT-series Models

16 of 24

LLM Training Paradigm

  • Pretraining
  • Supervised Fine-Tuning (SFT)
  • Preference Optimization (e.g., Reinforcement Learning from Human Feedback - RLHF)

17 of 24

HuggingFace

  • One-stop shop for modern NLP
    • Datasets
    • Pretrained models
    • Training scripts
    • Evaluation metrics
    • Efficient training
      • Parameter-Efficient Fine-Tuning → PEFT
      • Multi-GPU or multi-node training → Accelerate
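
A minimal sketch of PEFT with LoRA adapters, assuming the peft and transformers libraries; the rank and target module below are illustrative choices for GPT-2.

  # Sketch: parameter-efficient fine-tuning (PEFT) with LoRA adapters.
  # Only the small adapter matrices are trained; the base model stays frozen.
  from transformers import AutoModelForCausalLM
  from peft import LoraConfig, get_peft_model

  model = AutoModelForCausalLM.from_pretrained("gpt2")
  lora_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                           target_modules=["c_attn"],   # GPT-2 attention projection
                           task_type="CAUSAL_LM")
  model = get_peft_model(model, lora_config)
  model.print_trainable_parameters()   # e.g. well under 1% of all parameters are trainable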

18 of 24

HuggingFace Datasets → https://huggingface.co/datasets
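
A minimal sketch of pulling a dataset from the Hub, assuming the datasets library; IMDB is used only as a familiar example.

  # Sketch: loading a dataset from the Hugging Face Hub.
  from datasets import load_dataset

  dataset = load_dataset("imdb")           # movie-review sentiment dataset
  print(dataset)                           # DatasetDict with "train" / "test" splits
  print(dataset["train"][0]["text"][:80])  # peek at the first training example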

19 of 24

HuggingFace Models → https://huggingface.co/models

20 of 24

LLM Inference Framework

vLLM (https://github.com/vllm-project/vllm)

Up to 24× faster than inference with vanilla PyTorch.
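
A minimal sketch of offline batched inference with vLLM, following its basic LLM/SamplingParams API; the model name is just a small public example.

  # Sketch: offline batched generation with vLLM.
  from vllm import LLM, SamplingParams

  llm = LLM(model="facebook/opt-125m")     # small model, just for demonstration
  params = SamplingParams(temperature=0.8, max_tokens=64)

  outputs = llm.generate(["The key idea behind efficient LLM serving is"], params)
  print(outputs[0].outputs[0].text)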

21 of 24

LLM Training Framework

LLaMA Factory (https://github.com/hiyouga/LLaMA-Factory)

OpenRLHF (https://github.com/OpenRLHF/OpenRLHF)

22 of 24

LLM Hardware Requirement
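
A back-of-the-envelope sketch of memory requirements, assuming fp16/bf16 weights for serving and mixed-precision Adam for full fine-tuning; activations, KV cache, and framework overhead are ignored, so real numbers will be higher.

  # Sketch: back-of-the-envelope GPU memory estimate for an LLM.
  # Rough rules of thumb (activations, KV cache, overhead excluded):
  #   inference in fp16/bf16: ~2 bytes per parameter
  #   full fine-tuning with Adam in mixed precision: ~16 bytes per parameter
  #     (fp16 weights + fp16 grads + fp32 master weights + Adam moment estimates)
  def memory_gb(n_params: float, bytes_per_param: float) -> float:
      return n_params * bytes_per_param / 1e9

  for n in (7e9, 70e9):
      print(f"{n/1e9:.0f}B params: "
            f"~{memory_gb(n, 2):.0f} GB to serve in fp16, "
            f"~{memory_gb(n, 16):.0f} GB to fully fine-tune with Adam")
  # 7B:  ~14 GB to serve,  ~112 GB to fine-tune
  # 70B: ~140 GB to serve, ~1120 GB to fine-tune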

23 of 24

Scaling Laws for Train-Time Compute

The test loss of a Transformer trained to autoregressively model language can be predicted with a power law when performance is bottlenecked by only one of three resources: the number of non-embedding parameters N, the dataset size D, or the optimally allocated compute budget C_min.
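
In symbols, the fitted power laws from Kaplan et al. (2020) take the form below; the exponents are the approximate values reported in that paper.

  % Kaplan et al. (2020), approximate fitted power laws
  L(N)        = (N_c / N)^{\alpha_N},                        \qquad \alpha_N \approx 0.076
  L(D)        = (D_c / D)^{\alpha_D},                        \qquad \alpha_D \approx 0.095
  L(C_{\min}) = (C_c^{\min} / C_{\min})^{\alpha_C^{\min}},   \qquad \alpha_C^{\min} \approx 0.050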

24 of 24

Scaling Laws for Test-Time Compute

The performance of transformers consistently improves with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute).