Pretrained Models
Tour Guide: Ting-Yun Chang
Outline
Pretraining w/o Labels
Masked Token Prediction - BERT
[Figure: BERT (bidirectional) reads the masked input "[CLS] To [MASK] or not to be …" and outputs a probability distribution over the vocabulary (be, do, go, …) at the masked position.]
Objective: maximize the probability of outputting "be" at the [MASK] position.
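To make the objective concrete, here is a minimal sketch using the Hugging Face fill-mask pipeline (the checkpoint name bert-base-uncased is one common choice, not something this deck prescribes):

    # Masked token prediction with a pretrained BERT.
    # Assumes the Hugging Face `transformers` library is installed.
    from transformers import pipeline

    unmasker = pipeline("fill-mask", model="bert-base-uncased")

    # BERT reads the whole sentence (bidirectional context) and returns
    # a distribution over the vocabulary at the [MASK] position.
    for pred in unmasker("To [MASK] or not to be"):
        print(pred["token_str"], round(pred["score"], 3))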
Masked Patch Prediction - MAE
Masked Autoencoder (MAE)
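A toy sketch of the MAE objective, with plain linear layers standing in for the real ViT encoder and lightweight decoder (the 75% mask ratio follows the MAE paper; everything else here is simplified for illustration):

    import torch
    import torch.nn as nn

    patch_dim, hidden, n_patches = 16 * 16 * 3, 256, 196  # 224x224 image, 16x16 patches
    encoder = nn.Linear(patch_dim, hidden)   # stand-in for the ViT encoder
    decoder = nn.Linear(hidden, patch_dim)   # stand-in for the light decoder
    mask_token = nn.Parameter(torch.zeros(hidden))

    patches = torch.randn(1, n_patches, patch_dim)

    # Randomly mask ~75% of the patches.
    num_masked = int(0.75 * n_patches)
    perm = torch.randperm(n_patches)
    masked_idx, visible_idx = perm[:num_masked], perm[num_masked:]

    # The encoder sees only the visible patches (this is what makes MAE cheap).
    latent = encoder(patches[:, visible_idx])

    # The decoder gets encoded visible patches plus a shared mask token
    # at every masked position, and reconstructs pixels everywhere.
    full = mask_token.expand(1, n_patches, hidden).clone()
    full[:, visible_idx] = latent
    recon = decoder(full)

    # Objective: mean squared error on the masked patches only.
    loss = ((recon[:, masked_idx] - patches[:, masked_idx]) ** 2).mean()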
Next Token Prediction - GPT
Source: https://jalammar.github.io/illustrated-gpt2/
Objective: maximize the probability of outputting the correct next token. Unlike BERT, GPT is unidirectional: each position attends only to the tokens before it.
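The loss itself is a cross-entropy with the targets shifted one position left; a minimal sketch, with random tensors standing in for a real model's output (the vocabulary size shown is GPT-2's):

    import torch
    import torch.nn.functional as F

    vocab_size = 50257                               # GPT-2's vocabulary size
    tokens = torch.randint(0, vocab_size, (1, 8))    # stand-in for real token ids
    logits = torch.randn(1, 8, vocab_size)           # stand-in for model output

    # Each position is trained to predict its successor, so the targets
    # are the input tokens shifted left by one.
    loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, vocab_size),  # predictions for positions 0..T-2
        tokens[:, 1:].reshape(-1),               # the actual next tokens
    )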
Next Pixel Prediction - Image GPT
Raster order: pixels are predicted one at a time, left to right within each row, rows from top to bottom.
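Raster-order flattening turns the 2D pixel grid into a 1D sequence, so the next-token objective above carries over unchanged; a small sketch (the 4x4 example is illustrative only):

    import numpy as np

    # A tiny 4x4 grayscale "image" of pixel intensities.
    img = np.arange(16).reshape(4, 4)

    # Raster order: left to right within a row, rows from top to bottom.
    sequence = img.reshape(-1)   # array([0, 1, 2, ..., 15])

    # Image GPT then treats `sequence` like text: predict pixel t+1
    # from pixels 0..t (pixel values quantized to a small palette).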
Scaling Laws
So we just keep scaling up?
Compute-Optimal Large Language Models?
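A back-of-the-envelope sketch of the compute-optimal trade-off, using two standard approximations (neither is stated on this slide): training FLOPs C ≈ 6·N·D, and the Chinchilla finding that tokens should scale with parameters, roughly D ≈ 20·N:

    # Chinchilla-style compute-optimal sizing.
    # Assumes C ~ 6 * N * D (training FLOPs) and D ~ 20 * N at the optimum.

    def compute_optimal(C):
        """Given a FLOP budget C, return (parameters N, training tokens D)."""
        N = (C / (6 * 20)) ** 0.5   # solve C = 6 * N * (20 * N) for N
        D = 20 * N
        return N, D

    # Chinchilla's own budget (~5.76e23 FLOPs) gives ~70B params, ~1.4T tokens.
    N, D = compute_optimal(5.76e23)
    print(f"N = {N:.2e} params, D = {D:.2e} tokens")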
MMLU (Massive Multitask Language Understanding) Benchmark