1 of 35

CSCI-SHU 376: Natural Language Processing

Hua Shen

2026-03-05

Spring 2026

Lecture 10: LLM Pretraining and Instruction Finetuning

2 of 35

Recap: Pre-training and Fine-Tuning

  • Pre-train on a large dataset for task X
  • Fine-tune on a (smaller) dataset for task Y
  • Goal: Learn neural representations from X that benefit Y


3 of 35

Recap: ELMo

  • The idea of ELMo:
    • Train two stacked LSTM-based language models (a forward and a backward LM) on a large corpus
    • Use the hidden states of the LSTMs for each token to compute a vector representation of each word
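Below is a minimal PyTorch sketch of how ELMo-style embeddings are assembled from the per-layer hidden states: a task-specific, softmax-normalized weighting over layers plus a scalar. The layer count, dimensions, and weight initialization are illustrative assumptions, not ELMo's exact configuration.

```python
import torch

def elmo_embedding(layer_states, layer_weights, gamma=1.0):
    """Combine per-layer biLM hidden states into one vector per token.

    layer_states:  list of (seq_len, hidden_dim) tensors, one per layer
                   (including the lowest, context-independent layer).
    layer_weights: (num_layers,) tensor of task-specific layer weights.
    gamma:         task-specific scalar that rescales the final vector.
    """
    weights = torch.softmax(layer_weights, dim=0)       # normalize layer weights
    stacked = torch.stack(layer_states, dim=0)          # (num_layers, seq_len, hidden)
    return gamma * (weights[:, None, None] * stacked).sum(dim=0)

# Toy usage: 3 layers, 5 tokens, 8-dim hidden states
states = [torch.randn(5, 8) for _ in range(3)]
print(elmo_embedding(states, torch.zeros(3)).shape)     # torch.Size([5, 8])
```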


4 of 35

Recap: BERT

  • Deep Bidirectional Encoder
  • Learn Representations based on Bidirectional Contexts
  • Pre-training Tasks:
    • Masked Language Modeling (MLM)
    • Next Sentence Prediction (NSP)
  • Use BERT for fine-tuning, instead of word embeddings!
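A minimal sketch of the MLM input corruption: sample roughly 15% of positions, and of those replace 80% with [MASK], 10% with a random token, and keep 10% unchanged. The vocabulary size and mask token id below are placeholders.

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    """Corrupt input ids for MLM: sample ~15% of positions; of those,
    80% -> [MASK], 10% -> random token, 10% -> left unchanged."""
    labels = input_ids.clone()
    selected = torch.rand(input_ids.shape) < mlm_prob
    labels[~selected] = -100                             # loss only on selected positions

    masked = selected & (torch.rand(input_ids.shape) < 0.8)
    input_ids[masked] = mask_token_id                    # 80%: replace with [MASK]

    random_tok = selected & ~masked & (torch.rand(input_ids.shape) < 0.5)
    rand_ids = torch.randint(vocab_size, input_ids.shape)
    input_ids[random_tok] = rand_ids[random_tok]         # 10%: replace with random token

    return input_ids, labels                             # remaining 10%: kept as-is

ids = torch.randint(0, 30000, (1, 12))
corrupted, labels = mask_tokens(ids.clone(), mask_token_id=103, vocab_size=30000)
```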


5 of 35

Today’s Plan

  • LLM Pretraining
  • Instruction Finetuning

6 of 35

Pretraining Data


7 of 35

Pretraining Data Factors

  • Quantity: How much data do I have?

  • Quality: Is it beneficial for training?

  • Coverage: Does the data cover the domains I care about, and in the right proportions?


8 of 35

Pretraining Data Quantities


9 of 35

Pretraining Data: Common Crawl

  • Large snapshot of web pages

    • Extraction: HTML to Text

    • Filtering: filter out unwanted pages

    • Deduplication: remove the many duplicate web pages


10 of 35

Pretraining Data: Extraction

  • Extraction: HTML to Text

    • Remove boilerplate

    • Retain LaTeX, code, etc.
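A rough sketch of HTML-to-text extraction with BeautifulSoup, dropping obvious boilerplate tags; the tag list is an illustrative assumption, and production pipelines typically rely on dedicated extractors such as trafilatura or resiliparse.

```python
from bs4 import BeautifulSoup

def html_to_text(html):
    """Very rough HTML -> text: drop scripts, styles, and navigation
    chrome, then keep the remaining visible text line by line."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "nav", "header", "footer", "aside"]):
        tag.decompose()                                   # remove boilerplate elements
    lines = (line.strip() for line in soup.get_text("\n").splitlines())
    return "\n".join(line for line in lines if line)

print(html_to_text("<html><body><nav>menu</nav><p>E = mc^2</p></body></html>"))
# -> E = mc^2
```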


11 of 35

Pretraining Data: Filtering

  • Filter out unwanted text

    • Language filter

    • Repetitions

    • Too many short lines
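A minimal quality-filter sketch implementing the three heuristics above; the thresholds and the langdetect dependency are illustrative assumptions (real filters, e.g. those used for C4 or Gopher, apply many more rules).

```python
from collections import Counter
from langdetect import detect   # assumed third-party language detector

def keep_page(text, lang="en", max_dup_line_frac=0.3, max_short_line_frac=0.5):
    """Return True if a page passes simple language / repetition /
    short-line heuristics (thresholds are illustrative)."""
    lines = [l.strip() for l in text.splitlines() if l.strip()]
    if not lines:
        return False
    if detect(text) != lang:                              # language filter
        return False
    counts = Counter(lines)                               # repetition filter
    dup_frac = sum(c for c in counts.values() if c > 1) / len(lines)
    if dup_frac > max_dup_line_frac:
        return False
    short_frac = sum(len(l.split()) < 3 for l in lines) / len(lines)
    if short_frac > max_short_line_frac:                  # too many short lines
        return False
    return True
```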


12 of 35

Pretraining Data: Deduplication

  • Remove duplicate content
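A minimal exact-deduplication sketch that keeps the first copy of each whitespace-normalized document; large pipelines also apply near-duplicate detection (e.g. MinHash/LSH), which this sketch omits.

```python
import hashlib

def deduplicate(docs):
    """Keep only the first occurrence of each (whitespace-normalized) document."""
    seen, unique = set(), []
    for doc in docs:
        key = hashlib.sha256(" ".join(doc.lower().split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

print(len(deduplicate(["Hello  world", "hello world", "another page"])))  # 2
```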


13 of 35

Pretraining Data Mixture

  • In practice, training data is a mixture of different sources
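A minimal sketch of sampling a training batch according to source mixture weights; the source names and weights here are purely illustrative.

```python
import random

# Hypothetical mixture weights over data sources (must sum to 1)
mixture = {"web": 0.6, "code": 0.2, "books": 0.1, "wikipedia": 0.1}
corpora = {name: [f"{name}-doc-{i}" for i in range(1000)] for name in mixture}

def sample_batch(batch_size, rng=random.Random(0)):
    """Draw a batch whose documents' sources follow the mixture weights."""
    sources = rng.choices(list(mixture), weights=list(mixture.values()), k=batch_size)
    return [rng.choice(corpora[s]) for s in sources]

print(sample_batch(4))
```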


14 of 35

Pretraining and Compute

 


15 of 35

Scaling laws

  • Scaling up compute leads to a better model!
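One common way this is made precise is a Chinchilla-style scaling law: loss falls predictably as a power law in parameters N and training tokens D, with training compute roughly C ≈ 6ND FLOPs. The functional form below is a sketch; the constants E, A, B, α, β are fitted empirically and not reproduced here.

```latex
% Chinchilla-style scaling law (functional form only; E, A, B, alpha, beta are fit to data)
L(N, D) \;\approx\; E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},
\qquad C \;\approx\; 6\,N\,D \ \text{training FLOPs}
```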


16 of 35

Today’s Plan

  • LLM Pretraining
  • Instruction Finetuning

17 of 35

Language modeling != assisting users

  • Language models are not aligned with user intent


18 of 35

Language modeling != assisting users

  • Language models are not aligned with user intent

Fine-tuning to the rescue!


19 of 35

Recap: Pre-training and Fine-Tuning

  • Pre-train on a large dataset for task X
  • Fine-tune on a (smaller) dataset for task Y
  • Goal: Learn neural representations from X that benefit Y


20 of 35

Scaling up finetuning

  • Fine-tune on many tasks


21 of 35

Instruction finetuning

Chung, H. W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., ... & Wei, J. (2024). Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25(70), 1-53.


22 of 35

Instruction finetuning

Wei, J., Bosma, M., Zhao, V. Y., Guu, K., Yu, A. W., Lester, B., Du, N., Dai, A. M., & Le, Q. V. (2021). Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652.


23 of 35

Instruction finetuning

  • Collect examples of (instruction, output) pairs across many tasks and fine-tune an LM (see the formatting sketch below)
  • Evaluate on unseen tasks
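A minimal sketch of turning an (instruction, output) pair into a training example where the loss is applied only to the output tokens; the prompt template, the -100 ignore index, and the HuggingFace-style tokenizer interface are assumptions from common practice, not the exact FLAN recipe.

```python
def build_example(instruction, output, tokenizer):
    """Tokenize an (instruction, output) pair so that only the output tokens
    contribute to the next-token loss (ignore index -100 elsewhere).

    `tokenizer` is assumed to be a HuggingFace-style tokenizer exposing
    .encode() and .eos_token_id."""
    prompt = f"Instruction: {instruction}\nResponse: "        # assumed template
    prompt_ids = tokenizer.encode(prompt)
    output_ids = tokenizer.encode(output) + [tokenizer.eos_token_id]
    input_ids = prompt_ids + output_ids
    labels = [-100] * len(prompt_ids) + output_ids            # mask out the prompt
    return {"input_ids": input_ids, "labels": labels}
```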


24 of 35

Instruction finetuning (Flan-T5)


25 of 35

Instruction finetuning (Flan-T5)


26 of 35

MMLU: a new benchmark for multitask LMs

  • Massive Multitask Language Understanding (MMLU)
  • A new benchmark for measuring LM performance on 57 diverse knowledge-intensive tasks


27 of 35

MMLU: Examples


28 of 35

MMLU: Rapid Progress


29 of 35

Instruction finetuning results

  • Flan-T5: T5 models finetuned on 1.8K additional tasks


30 of 35

Chat Tuning

  • Ultimately, format a chat as a sequence of tokens
    • System prompt
    • [user, assistant, user, assistant, …]

  • Instruction + input are implicitly in the conversation
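A minimal sketch of flattening a chat (system prompt plus alternating user/assistant turns) into one training string; the role markers are illustrative assumptions, and real chat templates (e.g. ChatML or the Llama formats) differ in their exact special tokens.

```python
def format_chat(system_prompt, turns):
    """Flatten a chat (system prompt + alternating user/assistant turns)
    into a single training string with role markers."""
    parts = [f"<|system|>\n{system_prompt}"]
    for role, text in turns:                  # role is "user" or "assistant"
        parts.append(f"<|{role}|>\n{text}")
    return "\n".join(parts) + "\n<|end|>"

print(format_chat(
    "You are a helpful assistant.",
    [("user", "Translate 'bonjour' to English."), ("assistant", "Hello.")],
))
```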


31 of 35

Knowledge Distillation

  • Goal: use a strong (teacher) model to generate training data or targets for another (student) model


32 of 35

Supervised Distillation


33 of 35

Supervised Distillation

  • Kullback–Leibler (KL) divergence:
    • measures how much a model distribution differs from a true (reference) distribution
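For two distributions over the same discrete support, with p the reference (true) distribution and q the model distribution:

```latex
% KL divergence of model q from reference (true) distribution p
D_{\mathrm{KL}}(p \,\|\, q) \;=\; \sum_{x} p(x)\,\log\frac{p(x)}{q(x)} \;\ge\; 0,
\qquad \text{with equality iff } p = q
```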


34 of 35

Token-level Knowledge Distillation

  • Minimize the KL divergence between the teacher's and student's next-token distributions at each position (sketch below)
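A minimal PyTorch sketch of a token-level distillation loss: KL between teacher and student next-token distributions, averaged over positions. The softening temperature and its squared rescaling follow the common Hinton-style convention and are an assumption, not something specified on the slide.

```python
import torch
import torch.nn.functional as F

def token_kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) over next-token distributions, averaged over
    all token positions. Logits have shape (batch, seq_len, vocab_size)."""
    vocab = student_logits.size(-1)
    t_logprobs = F.log_softmax(teacher_logits / temperature, dim=-1).reshape(-1, vocab)
    s_logprobs = F.log_softmax(student_logits / temperature, dim=-1).reshape(-1, vocab)
    kl = F.kl_div(s_logprobs, t_logprobs, reduction="batchmean", log_target=True)
    return kl * temperature ** 2       # conventional rescaling when softening with T

loss = token_kd_loss(torch.randn(2, 5, 100), torch.randn(2, 5, 100))
```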


35 of 35

Sequence-level Knowledge Distillation

  • Generate with a teacher model

  • Student model fine-tunes on the generated data

  • Minimize KL between teacher and student
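A minimal sketch of one sequence-level distillation step, under the assumption of HuggingFace-style causal LMs (a .generate() method and a forward pass returning .logits); padding and attention masks are omitted for brevity.

```python
import torch
import torch.nn.functional as F

def sequence_kd_step(student, teacher, prompt_ids, optimizer, max_new_tokens=64):
    """One sequence-level KD step: the teacher decodes a target sequence for
    each prompt; the student is trained with ordinary next-token cross-entropy
    on the teacher-generated sequences (no padding/masking handled here)."""
    with torch.no_grad():
        targets = teacher.generate(prompt_ids, max_new_tokens=max_new_tokens)
    logits = student(targets).logits                       # (batch, seq_len, vocab)
    loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),       # predict token t+1 from t
        targets[:, 1:].reshape(-1),
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```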
