1 of 24

Practical Guide to Pretrained Language Models

Taiwei Shi

2 of 24

LLM Training Paradigm

  1. Pretraining
  2. Post-training
    1. Supervised Fine-Tuning (SFT)
    2. Preference Optimization (e.g., Reinforcement Learning from Human Feedback - RLHF)

3 of 24

Pretraining

Pretraining is the act of training a model from scratch: the weights are randomly initialized, and the training starts without any prior knowledge.
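
A minimal sketch of what "from scratch" looks like in code, assuming the Hugging Face transformers library; the tiny model configuration below is illustrative only.

  # Sketch: randomly initialized GPT-2-style model trained with the
  # next-token-prediction (causal LM) objective. Sizes are illustrative.
  import torch
  from transformers import GPT2Config, GPT2LMHeadModel, GPT2TokenizerFast

  tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")  # reuse an existing tokenizer
  config = GPT2Config(n_layer=4, n_head=4, n_embd=256)   # a tiny model, for illustration
  model = GPT2LMHeadModel(config)                        # weights are RANDOMLY initialized

  batch = tokenizer(["Pretraining starts from random weights."], return_tensors="pt")
  # For causal LM pretraining, the labels are the input ids themselves;
  # the model shifts them internally so each position predicts the next token.
  out = model(**batch, labels=batch["input_ids"])
  print(out.loss)      # high at the start of pretraining: the model knows nothing yet
  out.loss.backward()  # one step of the (very long) pretraining loop would follow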

4 of 24

Post-training (Finetuning + Preference Optimization)

Post-training, on the other hand, is the training done after a model has been pretrained.

  • Transfer learning: the pretrained model has already been trained on data that shares structure with the fine-tuning data, so much of what it learned carries over to the new task.
  • Efficiency: because the pretrained model has already seen lots of data, fine-tuning needs far less data and compute to reach good results (see the SFT sketch below).
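
A minimal sketch of supervised fine-tuning with the Hugging Face Trainer, assuming a small pretrained checkpoint (distilgpt2, purely for illustration) and a toy instruction-style dataset; the data and hyperparameters are not real training settings.

  # Sketch: supervised fine-tuning (SFT) of a small pretrained causal LM.
  from transformers import (AutoModelForCausalLM, AutoTokenizer,
                            Trainer, TrainingArguments,
                            DataCollatorForLanguageModeling)
  from datasets import Dataset

  tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
  tokenizer.pad_token = tokenizer.eos_token
  model = AutoModelForCausalLM.from_pretrained("distilgpt2")  # starts from pretrained weights

  # Toy instruction-response pairs (illustrative only).
  examples = [
      {"text": "### Instruction: Say hello.\n### Response: Hello there!"},
      {"text": "### Instruction: Name a color.\n### Response: Blue."},
  ]
  ds = Dataset.from_list(examples).map(
      lambda ex: tokenizer(ex["text"], truncation=True, max_length=128), batched=True)

  trainer = Trainer(
      model=model,
      args=TrainingArguments(output_dir="sft-demo", per_device_train_batch_size=2,
                             num_train_epochs=1, report_to=[]),
      train_dataset=ds,
      data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
  )
  trainer.train()  # far less data/compute than pretraining, thanks to transfer learning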

5 of 24

Transformers for causal language modeling
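
A minimal sketch of the causal LM setup, assuming Hugging Face transformers and the public gpt2 checkpoint: at each position the model predicts the next token from the left context only.

  # Sketch: causal (left-to-right) language modeling with a pretrained GPT-2.
  # At every position the model outputs a distribution over the NEXT token,
  # conditioned only on tokens to its left.
  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer

  tokenizer = AutoTokenizer.from_pretrained("gpt2")
  model = AutoModelForCausalLM.from_pretrained("gpt2")

  inputs = tokenizer("The capital of France is", return_tensors="pt")
  with torch.no_grad():
      logits = model(**inputs).logits              # shape: (1, seq_len, vocab_size)

  next_token_id = int(logits[0, -1].argmax())      # greedy choice for the next token
  print(tokenizer.decode([next_token_id]))         # likely " Paris" for a well-trained model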

6 of 24

Transformers for masked language modeling (MLM)
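
A minimal sketch of MLM inference with a fill-mask pipeline, assuming Hugging Face transformers and the bert-base-uncased checkpoint: the model uses context on both sides of the mask.

  # Sketch: masked language modeling (MLM) with a BERT-style encoder.
  # The model sees the WHOLE sentence and fills in the masked position
  # using context from both the left and the right.
  from transformers import pipeline

  fill_mask = pipeline("fill-mask", model="bert-base-uncased")
  for pred in fill_mask("The capital of France is [MASK]."):
      print(pred["token_str"], round(pred["score"], 3))  # e.g. "paris" with a high score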

7 of 24

Transformers are big models
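
One way to see how big they are is simply to count parameters; a minimal sketch, assuming Hugging Face transformers (the ~110M figure for BERT-base is approximate).

  # Sketch: counting parameters to see how big these models get.
  # BERT-base has on the order of 110M parameters; GPT-3 has 175B.
  from transformers import AutoModel

  model = AutoModel.from_pretrained("bert-base-uncased")
  n_params = sum(p.numel() for p in model.parameters())
  print(f"{n_params / 1e6:.0f}M parameters")   # roughly 110M for BERT-base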

8 of 24

Encoder vs Decoder

  • Encoder (left): The encoder receives an input and builds a representation of it (its features). This means that the model is optimized to acquire understanding from the input.
  • Decoder (right): The decoder uses the encoder’s representation (features) along with other inputs to generate a target sequence. This means that the model is optimized for generating outputs.

9 of 24

Three Categories of Transformers

  1. GPT-like (also called auto-regressive Transformer models)
  2. BERT-like (also called auto-encoding Transformer models)
  3. BART/T5-like (also called sequence-to-sequence Transformer models)
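
In Hugging Face terms, the three families map onto three auto-classes; a minimal sketch (the checkpoint names are just common public examples).

  # Sketch: the three Transformer families and their Hugging Face auto-classes.
  from transformers import (AutoModelForCausalLM,   # GPT-like: decoder-only, auto-regressive
                            AutoModelForMaskedLM,   # BERT-like: encoder-only, auto-encoding
                            AutoModelForSeq2SeqLM)  # BART/T5-like: encoder-decoder, seq2seq

  gpt_like = AutoModelForCausalLM.from_pretrained("gpt2")
  bert_like = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
  seq2seq = AutoModelForSeq2SeqLM.from_pretrained("t5-small")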

10 of 24

Examples of Encoder-only Models (MLM)

  • BERT (Bidirectional Encoder Representations from Transformers)
    • BERT builds bidirectional contextual representations of text (it was famously deployed to understand search queries). It significantly improved tasks like question answering and sentiment analysis.
  • RoBERTa (A Robustly Optimized BERT Pretraining Approach)
    • RoBERTa is trained with more data and optimized techniques
  • DistilBERT
    • A distilled version of BERT that is smaller, faster, and lighter
  • XLNet
    • XLNet integrates bidirectional context into autoregressive models
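
A minimal sketch of an encoder-only model in use, assuming Hugging Face transformers; the checkpoint is DistilBERT fine-tuned on SST-2 for sentiment analysis.

  # Sketch: an encoder-only model fine-tuned for sentiment classification.
  from transformers import pipeline

  classifier = pipeline("sentiment-analysis",
                        model="distilbert-base-uncased-finetuned-sst-2-english")
  print(classifier("I really enjoyed this lecture!"))
  # e.g. [{'label': 'POSITIVE', 'score': 0.99...}]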

11 of 24

Examples of Decoder-only Models

  • Proprietary (only API access)
    • ChatGPT, Gemini, Claude
  • Open-weight (no data/code access, just the model weights)
    • Meta LLaMA, Google Gemma, Mistral, Microsoft Phi, Alibaba Qwen, …
  • Open-source (access to everything including data and code)
    • AI2 OLMo, EleutherAI GPT-Neo/Pythia/GPT-J
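
A minimal sketch of loading one of the open models and generating text, assuming Hugging Face transformers; Pythia-70M is chosen only because it is small and fully open (weights, training data, and code).

  # Sketch: loading an open decoder-only model and generating text.
  from transformers import AutoModelForCausalLM, AutoTokenizer

  name = "EleutherAI/pythia-70m"                 # tiny member of the Pythia suite
  tokenizer = AutoTokenizer.from_pretrained(name)
  model = AutoModelForCausalLM.from_pretrained(name)

  inputs = tokenizer("Large language models are", return_tensors="pt")
  output_ids = model.generate(**inputs, max_new_tokens=20, do_sample=False)
  print(tokenizer.decode(output_ids[0], skip_special_tokens=True))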

12 of 24

Can we get something that is even close to ChatGPT by just doing next-token prediction on the web???

13 of 24

Can we get something that is even close to ChatGPT by just doing next-token prediction on the web???

NO

14 of 24

Original GPT-3 vs. GPT-3 after Instruction Tuning and RLHF

15 of 24

Technical Evolution of GPT-series Models

16 of 24

LLM Training Paradigm

  • Pretraining
  • Supervised Fine-Tuning (SFT)
  • Preference Optimization (e.g., Reinforcement Learning from Human Feedback - RLHF)

17 of 24

HuggingFace

  • One-stop shop for modern NLP
    • Datasets
    • Pretrained models
    • Training scripts
    • Evaluation metrics
    • Efficient training
      • Parameter-Efficient Fine-Tuning → PEFT
      • Multi-GPU or multi-node training → Accelerate
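
A minimal sketch of PEFT with LoRA adapters, assuming the peft and transformers libraries; the rank and target module below are illustrative choices for GPT-2.

  # Sketch: parameter-efficient fine-tuning (PEFT) with LoRA adapters.
  # Only the small adapter matrices are trained; the base model stays frozen.
  from transformers import AutoModelForCausalLM
  from peft import LoraConfig, get_peft_model

  model = AutoModelForCausalLM.from_pretrained("gpt2")
  lora_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                           target_modules=["c_attn"],   # GPT-2 attention projection
                           task_type="CAUSAL_LM")
  model = get_peft_model(model, lora_config)
  model.print_trainable_parameters()   # e.g. well under 1% of all parameters are trainable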

18 of 24

HuggingFace Datasets → https://huggingface.co/datasets
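
A minimal sketch of pulling a dataset from the Hub, assuming the datasets library; IMDB is used only as a familiar example.

  # Sketch: loading a dataset from the Hugging Face Hub.
  from datasets import load_dataset

  dataset = load_dataset("imdb")           # movie-review sentiment dataset
  print(dataset)                           # DatasetDict with "train" / "test" splits
  print(dataset["train"][0]["text"][:80])  # peek at the first training example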

19 of 24

HuggingFace Models → https://huggingface.co/models

20 of 24

LLM Inference Framework

vLLM (https://github.com/vllm-project/vllm)

Up to 24× faster than inference with vanilla PyTorch.
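
A minimal sketch of offline batched inference with vLLM, following its basic LLM/SamplingParams API; the model name is just a small public example.

  # Sketch: offline batched generation with vLLM.
  from vllm import LLM, SamplingParams

  llm = LLM(model="facebook/opt-125m")     # small model, just for demonstration
  params = SamplingParams(temperature=0.8, max_tokens=64)

  outputs = llm.generate(["The key idea behind efficient LLM serving is"], params)
  print(outputs[0].outputs[0].text)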

21 of 24

LLM Training Framework

LLaMA Factory (https://github.com/hiyouga/LLaMA-Factory)

OpenRLHF (https://github.com/OpenRLHF/OpenRLHF)

22 of 24

LLM Hardware Requirement
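
A back-of-the-envelope sketch of memory requirements, assuming fp16/bf16 weights for serving and mixed-precision Adam for full fine-tuning; activations, KV cache, and framework overhead are ignored, so real numbers will be higher.

  # Sketch: back-of-the-envelope GPU memory estimate for an LLM.
  # Rough rules of thumb (activations, KV cache, overhead excluded):
  #   inference in fp16/bf16: ~2 bytes per parameter
  #   full fine-tuning with Adam in mixed precision: ~16 bytes per parameter
  #     (fp16 weights + fp16 grads + fp32 master weights + Adam moment estimates)
  def memory_gb(n_params: float, bytes_per_param: float) -> float:
      return n_params * bytes_per_param / 1e9

  for n in (7e9, 70e9):
      print(f"{n/1e9:.0f}B params: "
            f"~{memory_gb(n, 2):.0f} GB to serve in fp16, "
            f"~{memory_gb(n, 16):.0f} GB to fully fine-tune with Adam")
  # 7B:  ~14 GB to serve,  ~112 GB to fine-tune
  # 70B: ~140 GB to serve, ~1120 GB to fine-tune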

23 of 24

Scaling Laws for Train-Time Compute

The test loss of a Transformer trained to autoregressively model language can be predicted with a power law when performance is bottlenecked by only one of three resources: the number of non-embedding parameters N, the dataset size D, or the optimally allocated compute budget C_min.
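
In symbols, the fitted power laws from Kaplan et al. (2020) take the form below; the exponents are the approximate values reported in that paper.

  % Kaplan et al. (2020), approximate fitted power laws
  L(N)        = (N_c / N)^{\alpha_N},                        \qquad \alpha_N \approx 0.076
  L(D)        = (D_c / D)^{\alpha_D},                        \qquad \alpha_D \approx 0.095
  L(C_{\min}) = (C_c^{\min} / C_{\min})^{\alpha_C^{\min}},   \qquad \alpha_C^{\min} \approx 0.050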

24 of 24

Scaling Laws for Test-Time Compute

The performance of transformers consistently improves with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute).