
Pretrained Models

Tour Guide: Ting-Yun Chang


Outline

  • Pretraining w/o Labels
    • Masked Token Prediction: BERT/MAE
    • Next Token Prediction: GPT/iGPT
  • GPT2 code demo
  • Scaling Laws and Beyond


Pretraining w/o Labels

  • There is a huge amount of data in the world, but only a small fraction of it is labeled
  • Idea: first pretrain a model on all the data we have (e.g., data crawled from the entire internet), and then transfer it to our downstream applications
    • Usually a large model with a Transformer backbone
  • But the data are unlabeled, so how do we define the loss function?
    • Masked token prediction
    • Next token prediction
  • After the pretraining, how can we use the model?
    • Fine-tuning (need labeled data of the downstream task)
    • Prompting
    • Feature Extraction (see the sketch below)
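
As a concrete example of the last option, here is a minimal sketch of feature extraction, assuming the HuggingFace transformers library and the bert-base-uncased checkpoint (both choices are illustrative): the pretrained encoder stays frozen and its [CLS] hidden state is used as a feature vector for a downstream classifier.

    # Feature extraction with a frozen pretrained encoder (assumes torch + transformers)
    import torch
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")
    model.eval()  # no fine-tuning; the pretrained weights stay fixed

    sentences = ["To be or not to be", "Pretrained models are useful"]
    inputs = tokenizer(sentences, padding=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # (batch, seq_len, 768)

    # use the [CLS] vector of each sentence as a fixed feature
    features = hidden[:, 0, :]                        # (batch, 768)
    print(features.shape)                             # torch.Size([2, 768])

A small classifier (e.g., logistic regression) can then be trained on these features using the labeled data of the downstream task.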


Masked Token Prediction - BERT

[Figure: BERT (bidirectional) reads the masked input "[CLS] To [Mask] or not to be" and outputs a probability distribution over the vocabulary (be, do, go, ...) at the masked position; objective: maximize the probability of outputting "be"]

  • Cloze test
  • randomly mask some tokens in the data
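
A minimal sketch of the cloze test with a pretrained BERT, assuming the HuggingFace transformers library and the bert-base-uncased checkpoint; it fills in the [MASK] position from the figure's example.

    # Masked token prediction with a pretrained BERT (assumes torch + transformers)
    import torch
    from transformers import BertTokenizer, BertForMaskedLM

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertForMaskedLM.from_pretrained("bert-base-uncased")
    model.eval()

    inputs = tokenizer("To [MASK] or not to be", return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits               # (1, seq_len, vocab_size)

    # find the masked position and take the most likely vocabulary entry
    mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
    pred_id = logits[0, mask_pos].argmax(dim=-1)
    print(tokenizer.decode(pred_id))                  # expected to print "be"

During pretraining, the same model is trained with a cross-entropy loss computed only on the randomly masked positions.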

Masked Patch Prediction - MAE

Masked Autoencoder (MAE)

  • divide an image into non-overlapping patches and randomly mask a large fraction of them
  • objective: reconstruct the masked patches (pixel-wise MSE; see the sketch below)
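
A minimal sketch of the masked-patch objective in plain PyTorch; the 16x16 patch size, the ~75% mask ratio, and the random tensor standing in for the decoder output are illustrative assumptions, not the actual MAE implementation.

    # Masked patch prediction sketch (assumes torch; the model itself is omitted)
    import torch

    def patchify(imgs, p=16):
        # (B, C, H, W) -> (B, num_patches, p*p*C) non-overlapping patches
        B, C, H, W = imgs.shape
        x = imgs.reshape(B, C, H // p, p, W // p, p)
        return x.permute(0, 2, 4, 3, 5, 1).reshape(B, (H // p) * (W // p), p * p * C)

    imgs = torch.randn(2, 3, 224, 224)
    patches = patchify(imgs)                          # (2, 196, 768)

    # mask roughly 75% of the patches at random (MAE masks a large fraction)
    mask = torch.rand(patches.shape[:2]) < 0.75       # True = masked patch

    # a real MAE encodes only the visible patches and decodes the full set;
    # this random tensor is a placeholder for the decoder's reconstruction
    reconstruction = torch.randn_like(patches)
    loss = ((reconstruction - patches) ** 2)[mask].mean()   # MSE on masked patches only
    print(loss.item())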


Next Token Prediction - GPT

Source: https://jalammar.github.io/illustrated-gpt2/

objective: maximize the probability of outputting the correct next token (unidirectional: each position attends only to earlier tokens)
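
A minimal sketch of next-token prediction with GPT-2 (the "GPT2 code demo" from the outline), assuming the HuggingFace transformers library and the small gpt2 checkpoint; passing labels=input_ids makes the model compute the shifted next-token cross-entropy internally.

    # Next token prediction with GPT-2 (assumes torch + transformers)
    import torch
    from transformers import GPT2Tokenizer, GPT2LMHeadModel

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    model.eval()

    inputs = tokenizer("To be or not to", return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, labels=inputs["input_ids"])

    # average negative log-likelihood of each next token (the pretraining loss)
    print(out.loss.item())

    # greedy prediction of the single next token
    next_id = out.logits[0, -1].argmax().item()
    print(tokenizer.decode([next_id]))                # likely " be"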


Next Pixel Prediction - Image GPT

Raster order: flatten the pixels left to right, top to bottom, then predict the next pixel autoregressively (see the sketch below)
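
A minimal sketch of how raster-order flattening turns an image into a sequence for next-pixel prediction; the 32x32 grayscale image and 256-value palette are toy assumptions (iGPT actually downsamples images and uses a reduced color palette).

    # Raster-order flattening for autoregressive pixel modeling (assumes torch)
    import torch

    img = torch.randint(0, 256, (32, 32))     # toy image, pixel values act as "tokens"

    # raster order: left to right, then top to bottom
    seq = img.flatten()                        # shape (1024,)

    # next-pixel prediction is ordinary next-token prediction on this sequence:
    # predict pixel t from pixels 0..t-1
    inputs, targets = seq[:-1], seq[1:]
    print(inputs.shape, targets.shape)         # torch.Size([1023]) torch.Size([1023])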


Scaling Laws

  • The NLP community is reaching for the moon by pretraining ever-larger models on ever-larger datasets
  • Pretraining + Large Language Models can do some magic
    • In 2017, I was doing modeling
      • designing complicated RNNs
    • In 2019, I was fine-tuning BERT
      • Training for 3 epochs (about 10 minutes) worked well on many classification tasks
    • In 2022, I was writing prompts
      • No training at all!

2023??


So we just keep scaling up?

  • BLOOM language models of different sizes
  • Task: trivia questions from the Natural Questions dataset
  • closed-book QA evaluation
    • Use a few QA pairs as the prompt (in-context learning; see the sketch below)
  • larger models are better at recalling long-tail knowledge
  • would we need a model with ~10^20 parameters to reach human performance??
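
A minimal sketch of the closed-book, in-context setup, assuming the HuggingFace transformers library and the small bigscience/bloom-560m checkpoint (an illustrative choice of size; the QA pairs and test question are made up). The model sees only a few QA pairs as the prompt, with no retrieved documents.

    # Few-shot closed-book QA with a small BLOOM model (assumes torch + transformers)
    from transformers import AutoTokenizer, AutoModelForCausalLM

    model_name = "bigscience/bloom-560m"       # illustrative choice of model size
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    # in-context learning: a few QA pairs as the prompt, then the test question
    prompt = (
        "Q: Who wrote Hamlet?\nA: William Shakespeare\n"
        "Q: What is the capital of France?\nA: Paris\n"
        "Q: Who painted the Mona Lisa?\nA:"
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=5)

    # decode only the newly generated tokens (the model's answer)
    print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:]))

A 560M-parameter model will often miss such questions, which is exactly the slide's point: recall of long-tail facts improves as the model is scaled up.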


Compute-Optimal Large Language Models?

  • Given a fixed compute budget, more data or larger model?
  • DeepMind: more data (scale the number of training tokens together with model size)
    • Chinchilla model: 70B parameters trained on ~1.4T tokens outperforms much larger models (e.g., the 280B Gopher); see the rough rule of thumb below

[Figure: results on the MMLU Benchmark]
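
A back-of-the-envelope sketch of the Chinchilla finding, using the commonly cited approximation of roughly 20 training tokens per parameter (70B parameters ↔ ~1.4T tokens); the helper function below is just for illustration.

    # Rough "tokens per parameter" rule of thumb implied by the Chinchilla results
    def compute_optimal_tokens(n_params, tokens_per_param=20):
        """Approximate number of training tokens for a compute-optimal model."""
        return n_params * tokens_per_param

    for n_params in [1e9, 70e9, 175e9]:
        tokens = compute_optimal_tokens(n_params)
        print(f"{n_params / 1e9:.0f}B params -> ~{tokens / 1e12:.2f}T tokens")
    # 70B params -> ~1.40T tokens, roughly what Chinchilla was trained on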