1 of 35

CSCI-SHU 376: Natural Language Processing

Hua Shen

2026-03-05

Spring 2026

Lecture 10: LLM Pretraining and Instruction Finetuning

2 of 35

Recap: Pre-training and Fine-Tuning

  • Pre-train on a large dataset for task X
  • Fine-tune on a (smaller) dataset for task Y
  • Goal: Learn neural representations from X that benefit Y


3 of 35

Recap: ELMo

  • The idea of ELMo:
    • Train two stacked LSTM-based language models (a forward and a backward LM) on a large corpus
    • Use the hidden states of the LSTMs for each token to compute a vector representation of each word
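Below is a minimal PyTorch sketch of how ELMo-style embeddings are assembled from the per-layer hidden states: a task-specific, softmax-normalized weighting over layers plus a scalar. The layer count, dimensions, and weight initialization are illustrative assumptions, not ELMo's exact configuration.

```python
import torch

def elmo_embedding(layer_states, layer_weights, gamma=1.0):
    """Combine per-layer biLM hidden states into one vector per token.

    layer_states:  list of (seq_len, hidden_dim) tensors, one per layer
                   (including the lowest, context-independent layer).
    layer_weights: (num_layers,) tensor of task-specific layer weights.
    gamma:         task-specific scalar that rescales the final vector.
    """
    weights = torch.softmax(layer_weights, dim=0)       # normalize layer weights
    stacked = torch.stack(layer_states, dim=0)          # (num_layers, seq_len, hidden)
    return gamma * (weights[:, None, None] * stacked).sum(dim=0)

# Toy usage: 3 layers, 5 tokens, 8-dim hidden states
states = [torch.randn(5, 8) for _ in range(3)]
print(elmo_embedding(states, torch.zeros(3)).shape)     # torch.Size([5, 8])
```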


4 of 35

Recap: BERT

  • Deep Bidirectional Encoder
  • Learn Representations based on Bidirectional Contexts
  • Pre-training Tasks:
    • Masked Language Modeling (MLM)
    • Next Sentence Prediction (NSP)
  • Use BERT for fine-tuning, instead of word embeddings!
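A minimal sketch of the MLM input corruption: sample roughly 15% of positions, and of those replace 80% with [MASK], 10% with a random token, and keep 10% unchanged. The vocabulary size and mask token id below are placeholders.

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    """Corrupt input ids for MLM: sample ~15% of positions; of those,
    80% -> [MASK], 10% -> random token, 10% -> left unchanged."""
    labels = input_ids.clone()
    selected = torch.rand(input_ids.shape) < mlm_prob
    labels[~selected] = -100                             # loss only on selected positions

    masked = selected & (torch.rand(input_ids.shape) < 0.8)
    input_ids[masked] = mask_token_id                    # 80%: replace with [MASK]

    random_tok = selected & ~masked & (torch.rand(input_ids.shape) < 0.5)
    rand_ids = torch.randint(vocab_size, input_ids.shape)
    input_ids[random_tok] = rand_ids[random_tok]         # 10%: replace with random token

    return input_ids, labels                             # remaining 10%: kept as-is

ids = torch.randint(0, 30000, (1, 12))
corrupted, labels = mask_tokens(ids.clone(), mask_token_id=103, vocab_size=30000)
```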


5 of 35

Today’s Plan

  • LLM Pretraining
  • Instruction Finetuning

6 of 35

Pretraining Data


7 of 35

Pretraining Data Factors

  • Quantity: How much data do I have?

  • Quality: Is it beneficial for training?

  • Coverage: Does the data cover the domains I care about, and in the right proportions?


8 of 35

Pretraining Data Quantities


9 of 35

Pretraining Data: Common Crawl

  • Large snapshot of web pages

    • Extraction: HTML to Text

    • Filtering: filter out unwanted pages

    • Deduplication: remove the many duplicate web pages


10 of 35

Pretraining Data: Extraction

  • Extraction: HTML to Text

    • Remove boilerplate

    • Retain LaTeX, code, etc.
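A rough sketch of HTML-to-text extraction with BeautifulSoup, dropping obvious boilerplate tags; the tag list is an illustrative assumption, and production pipelines typically rely on dedicated extractors such as trafilatura or resiliparse.

```python
from bs4 import BeautifulSoup

def html_to_text(html):
    """Very rough HTML -> text: drop scripts, styles, and navigation
    chrome, then keep the remaining visible text line by line."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "nav", "header", "footer", "aside"]):
        tag.decompose()                                   # remove boilerplate elements
    lines = (line.strip() for line in soup.get_text("\n").splitlines())
    return "\n".join(line for line in lines if line)

print(html_to_text("<html><body><nav>menu</nav><p>E = mc^2</p></body></html>"))
# -> E = mc^2
```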


11 of 35

Pretraining Data: Filtering

  • Filter out unwanted text

    • Language filter

    • Repetitions

    • Too many short lines
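A minimal quality-filter sketch implementing the three heuristics above; the thresholds and the langdetect dependency are illustrative assumptions (real filters, e.g. those used for C4 or Gopher, apply many more rules).

```python
from collections import Counter
from langdetect import detect   # assumed third-party language detector

def keep_page(text, lang="en", max_dup_line_frac=0.3, max_short_line_frac=0.5):
    """Return True if a page passes simple language / repetition /
    short-line heuristics (thresholds are illustrative)."""
    lines = [l.strip() for l in text.splitlines() if l.strip()]
    if not lines:
        return False
    if detect(text) != lang:                              # language filter
        return False
    counts = Counter(lines)                               # repetition filter
    dup_frac = sum(c for c in counts.values() if c > 1) / len(lines)
    if dup_frac > max_dup_line_frac:
        return False
    short_frac = sum(len(l.split()) < 3 for l in lines) / len(lines)
    if short_frac > max_short_line_frac:                  # too many short lines
        return False
    return True
```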


12 of 35

Pretraining Data: Deduplication

  • Remove duplicate content
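A minimal exact-deduplication sketch that keeps the first copy of each whitespace-normalized document; large pipelines also apply near-duplicate detection (e.g. MinHash/LSH), which this sketch omits.

```python
import hashlib

def deduplicate(docs):
    """Keep only the first occurrence of each (whitespace-normalized) document."""
    seen, unique = set(), []
    for doc in docs:
        key = hashlib.sha256(" ".join(doc.lower().split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

print(len(deduplicate(["Hello  world", "hello world", "another page"])))  # 2
```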


13 of 35

Pretraining Data Mixture

  • In practice, training data is a mixture of different sources
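A minimal sketch of sampling a training batch according to source mixture weights; the source names and weights here are purely illustrative.

```python
import random

# Hypothetical mixture weights over data sources (must sum to 1)
mixture = {"web": 0.6, "code": 0.2, "books": 0.1, "wikipedia": 0.1}
corpora = {name: [f"{name}-doc-{i}" for i in range(1000)] for name in mixture}

def sample_batch(batch_size, rng=random.Random(0)):
    """Draw a batch whose documents' sources follow the mixture weights."""
    sources = rng.choices(list(mixture), weights=list(mixture.values()), k=batch_size)
    return [rng.choice(corpora[s]) for s in sources]

print(sample_batch(4))
```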


14 of 35

Pretraining and Compute

 


15 of 35

Scaling laws

  • Scaling up compute leads to a better model!
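One common way this is made precise is a Chinchilla-style scaling law: loss falls predictably as a power law in parameters N and training tokens D, with training compute roughly C ≈ 6ND FLOPs. The functional form below is a sketch; the constants E, A, B, α, β are fitted empirically and not reproduced here.

```latex
% Chinchilla-style scaling law (functional form only; E, A, B, alpha, beta are fit to data)
L(N, D) \;\approx\; E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},
\qquad C \;\approx\; 6\,N\,D \ \text{training FLOPs}
```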


16 of 35

Today’s Plan

  • LLM Pretraining
  • Instruction Finetuning

17 of 35

Language modeling != assisting users

  • Language models are not aligned with user intent


18 of 35

Language modeling != assisting users

  • Language models are not aligned with user intent

Fine-tuning to the rescue!


19 of 35

Recap: Pre-training and Fine-Tuning

  • Pre-train on a large dataset for task X
  • Fine-tune on a (smaller) dataset for task Y
  • Goal: Learn neural representations from X that benefit Y


20 of 35

Scaling up finetuning

  • Fine-tune on many tasks


21 of 35

Instruction finetuning

Chung, H. W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., ... & Wei, J. (2024). Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25(70), 1-53.


22 of 35

Instruction finetuning

Wei, J., Bosma, M., Zhao, V. Y., Guu, K., Yu, A. W., Lester, B., Du, N., Dai, A. M., & Le, Q. V. (2021). Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652.


23 of 35

Instruction finetuning

  • Collect examples of (instruction, output) pairs across many tasks and fine-tune an LM (see the formatting sketch below)
  • Evaluate on unseen tasks
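A minimal sketch of turning an (instruction, output) pair into a training example where the loss is applied only to the output tokens; the prompt template, the -100 ignore index, and the HuggingFace-style tokenizer interface are assumptions from common practice, not the exact FLAN recipe.

```python
def build_example(instruction, output, tokenizer):
    """Tokenize an (instruction, output) pair so that only the output tokens
    contribute to the next-token loss (ignore index -100 elsewhere).

    `tokenizer` is assumed to be a HuggingFace-style tokenizer exposing
    .encode() and .eos_token_id."""
    prompt = f"Instruction: {instruction}\nResponse: "        # assumed template
    prompt_ids = tokenizer.encode(prompt)
    output_ids = tokenizer.encode(output) + [tokenizer.eos_token_id]
    input_ids = prompt_ids + output_ids
    labels = [-100] * len(prompt_ids) + output_ids            # mask out the prompt
    return {"input_ids": input_ids, "labels": labels}
```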


24 of 35

Instruction finetuning (Flan-T5)


25 of 35

Instruction finetuning (Flan-T5)


26 of 35

MMLU: a new benchmark for multitask LMs

  • Massive Multitask Language Understanding (MMLU)
  • A new benchmark for measuring LM performance on 57 diverse knowledge-intensive tasks


27 of 35

MMLU: Examples


28 of 35

MMLU: Rapid Progress


29 of 35

Instruction finetuning results

  • Flan-T5: T5 models finetuned on 1.8K additional tasks


30 of 35

Chat Tuning

  • Ultimately, format a chat as a sequence of tokens
    • System prompt
    • [user, assistant, user, assistant, …]

  • Instruction + input are implicitly in the conversation
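A minimal sketch of flattening a chat (system prompt plus alternating user/assistant turns) into one training string; the role markers are illustrative assumptions, and real chat templates (e.g. ChatML or the Llama formats) differ in their exact special tokens.

```python
def format_chat(system_prompt, turns):
    """Flatten a chat (system prompt + alternating user/assistant turns)
    into a single training string with role markers."""
    parts = [f"<|system|>\n{system_prompt}"]
    for role, text in turns:                  # role is "user" or "assistant"
        parts.append(f"<|{role}|>\n{text}")
    return "\n".join(parts) + "\n<|end|>"

print(format_chat(
    "You are a helpful assistant.",
    [("user", "Translate 'bonjour' to English."), ("assistant", "Hello.")],
))
```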


31 of 35

Knowledge Distillation

  • Goal: use a strong (teacher) model to generate training data or targets for another (student) model


32 of 35

Supervised Distillation


33 of 35

Supervised Distillation

  • Kullback–Leibler (KL) divergence:
    • measures how much a model distribution differs from a true (reference) distribution
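For two distributions over the same discrete support, with p the reference (true) distribution and q the model distribution:

```latex
% KL divergence of model q from reference (true) distribution p
D_{\mathrm{KL}}(p \,\|\, q) \;=\; \sum_{x} p(x)\,\log\frac{p(x)}{q(x)} \;\ge\; 0,
\qquad \text{with equality iff } p = q
```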


34 of 35

Token-level Knowledge Distillation

  • Minimize the KL divergence between the teacher's and student's next-token distributions at each position (sketch below)
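A minimal PyTorch sketch of a token-level distillation loss: KL between teacher and student next-token distributions, averaged over positions. The softening temperature and its squared rescaling follow the common Hinton-style convention and are an assumption, not something specified on the slide.

```python
import torch
import torch.nn.functional as F

def token_kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) over next-token distributions, averaged over
    all token positions. Logits have shape (batch, seq_len, vocab_size)."""
    vocab = student_logits.size(-1)
    t_logprobs = F.log_softmax(teacher_logits / temperature, dim=-1).reshape(-1, vocab)
    s_logprobs = F.log_softmax(student_logits / temperature, dim=-1).reshape(-1, vocab)
    kl = F.kl_div(s_logprobs, t_logprobs, reduction="batchmean", log_target=True)
    return kl * temperature ** 2       # conventional rescaling when softening with T

loss = token_kd_loss(torch.randn(2, 5, 100), torch.randn(2, 5, 100))
```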


35 of 35

Sequence-level Knowledge Distillation

  • Generate with a teacher model

  • Student model fine-tunes on the generated data

  • Minimize KL between teacher and student
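A minimal sketch of one sequence-level distillation step, under the assumption of HuggingFace-style causal LMs (a .generate() method and a forward pass returning .logits); padding and attention masks are omitted for brevity.

```python
import torch
import torch.nn.functional as F

def sequence_kd_step(student, teacher, prompt_ids, optimizer, max_new_tokens=64):
    """One sequence-level KD step: the teacher decodes a target sequence for
    each prompt; the student is trained with ordinary next-token cross-entropy
    on the teacher-generated sequences (no padding/masking handled here)."""
    with torch.no_grad():
        targets = teacher.generate(prompt_ids, max_new_tokens=max_new_tokens)
    logits = student(targets).logits                       # (batch, seq_len, vocab)
    loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),       # predict token t+1 from t
        targets[:, 1:].reshape(-1),
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```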
