
Pretrained Models

Tour Guide: Ting-Yun Chang


Outline

  • Pretraining w/o Labels
    • Masked Token Prediction: BERT/MAE
    • Next Token Prediction: GPT/iGPT
  • GPT2 code demo
  • Scaling Laws and Beyond


Pretraining w/o Labels

  • There is a huge amount of data in the world, but only a small fraction of it is labeled
  • Idea: first pretrain a model on all the data we have (e.g., data crawled from the entire internet), and then transfer it to our downstream applications
    • Usually a large model with a Transformer backbone
  • But the data are unlabeled, so how do we define the loss function?
    • Masked token prediction
    • Next token prediction
  • After the pretraining, how can we use the model?
    • Fine-tuning (need labeled data of the downstream task)
    • Prompting
    • Feature Extraction (see the sketch below)
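
As a concrete example of the last option, here is a minimal sketch of feature extraction, assuming the HuggingFace transformers library and the bert-base-uncased checkpoint (both choices are illustrative): the pretrained encoder stays frozen and its [CLS] hidden state is used as a feature vector for a downstream classifier.

    # Feature extraction with a frozen pretrained encoder (assumes torch + transformers)
    import torch
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")
    model.eval()  # no fine-tuning; the pretrained weights stay fixed

    sentences = ["To be or not to be", "Pretrained models are useful"]
    inputs = tokenizer(sentences, padding=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # (batch, seq_len, 768)

    # use the [CLS] vector of each sentence as a fixed feature
    features = hidden[:, 0, :]                        # (batch, 768)
    print(features.shape)                             # torch.Size([2, 768])

A small classifier (e.g., logistic regression) can then be trained on these features using the labeled data of the downstream task.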


Masked Token Prediction - BERT

[Figure: BERT (bidirectional) reads the masked input "[CLS] To [Mask] or not to be" and outputs a probability distribution over the vocabulary (be, do, go, ...) at the masked position; objective: maximize the probability of outputting "be"]

  • Cloze test
  • randomly mask some tokens in the data
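
A minimal sketch of the cloze test with a pretrained BERT, assuming the HuggingFace transformers library and the bert-base-uncased checkpoint; it fills in the [MASK] position from the figure's example.

    # Masked token prediction with a pretrained BERT (assumes torch + transformers)
    import torch
    from transformers import BertTokenizer, BertForMaskedLM

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertForMaskedLM.from_pretrained("bert-base-uncased")
    model.eval()

    inputs = tokenizer("To [MASK] or not to be", return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits               # (1, seq_len, vocab_size)

    # find the masked position and take the most likely vocabulary entry
    mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
    pred_id = logits[0, mask_pos].argmax(dim=-1)
    print(tokenizer.decode(pred_id))                  # expected to print "be"

During pretraining, the same model is trained with a cross-entropy loss computed only on the randomly masked positions.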

Masked Patch Prediction - MAE

Masked Autoencoder (MAE)

  • divide an image into non-overlapping patches and randomly mask a large fraction of them
  • objective: reconstruct the masked patches (pixel-wise MSE; see the sketch below)
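
A minimal sketch of the masked-patch objective in plain PyTorch; the 16x16 patch size, the ~75% mask ratio, and the random tensor standing in for the decoder output are illustrative assumptions, not the actual MAE implementation.

    # Masked patch prediction sketch (assumes torch; the model itself is omitted)
    import torch

    def patchify(imgs, p=16):
        # (B, C, H, W) -> (B, num_patches, p*p*C) non-overlapping patches
        B, C, H, W = imgs.shape
        x = imgs.reshape(B, C, H // p, p, W // p, p)
        return x.permute(0, 2, 4, 3, 5, 1).reshape(B, (H // p) * (W // p), p * p * C)

    imgs = torch.randn(2, 3, 224, 224)
    patches = patchify(imgs)                          # (2, 196, 768)

    # mask roughly 75% of the patches at random (MAE masks a large fraction)
    mask = torch.rand(patches.shape[:2]) < 0.75       # True = masked patch

    # a real MAE encodes only the visible patches and decodes the full set;
    # this random tensor is a placeholder for the decoder's reconstruction
    reconstruction = torch.randn_like(patches)
    loss = ((reconstruction - patches) ** 2)[mask].mean()   # MSE on masked patches only
    print(loss.item())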


Next Token Prediction - GPT

Source: https://jalammar.github.io/illustrated-gpt2/

objective: maximize the probability of outputting the correct next token (unidirectional: each position attends only to earlier tokens)
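
A minimal sketch of next-token prediction with GPT-2 (the "GPT2 code demo" from the outline), assuming the HuggingFace transformers library and the small gpt2 checkpoint; passing labels=input_ids makes the model compute the shifted next-token cross-entropy internally.

    # Next token prediction with GPT-2 (assumes torch + transformers)
    import torch
    from transformers import GPT2Tokenizer, GPT2LMHeadModel

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    model.eval()

    inputs = tokenizer("To be or not to", return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, labels=inputs["input_ids"])

    # average negative log-likelihood of each next token (the pretraining loss)
    print(out.loss.item())

    # greedy prediction of the single next token
    next_id = out.logits[0, -1].argmax().item()
    print(tokenizer.decode([next_id]))                # likely " be"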


Next Pixel Prediction - Image GPT

Raster order: flatten the pixels left to right, top to bottom, then predict the next pixel autoregressively (see the sketch below)
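
A minimal sketch of how raster-order flattening turns an image into a sequence for next-pixel prediction; the 32x32 grayscale image and 256-value palette are toy assumptions (iGPT actually downsamples images and uses a reduced color palette).

    # Raster-order flattening for autoregressive pixel modeling (assumes torch)
    import torch

    img = torch.randint(0, 256, (32, 32))     # toy image, pixel values act as "tokens"

    # raster order: left to right, then top to bottom
    seq = img.flatten()                        # shape (1024,)

    # next-pixel prediction is ordinary next-token prediction on this sequence:
    # predict pixel t from pixels 0..t-1
    inputs, targets = seq[:-1], seq[1:]
    print(inputs.shape, targets.shape)         # torch.Size([1023]) torch.Size([1023])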


Scaling Laws

  • The NLP community is reaching for the moon by pretraining ever-larger models on ever-larger datasets
  • Pretraining + Large Language Models can do some magic
    • In 2017, I was doing modeling
      • designing complicated RNNs
    • In 2019, I was fine-tuning BERT
      • Training for 3 epochs (about 10 minutes) worked well on many classification tasks
    • In 2022, I was writing prompts
      • No training at all!

2023??


So we just keep scaling up?

  • BLOOM language models of different sizes
  • Task: trivia questions from the Natural Questions dataset
  • closed-book QA evaluation
    • Use a few QA pairs as the prompt (in-context learning; see the sketch below)
  • larger models are better at recalling long-tail knowledge
  • would we need a model with ~10^20 parameters to reach human performance??
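
A minimal sketch of the closed-book, in-context setup, assuming the HuggingFace transformers library and the small bigscience/bloom-560m checkpoint (an illustrative choice of size; the QA pairs and test question are made up). The model sees only a few QA pairs as the prompt, with no retrieved documents.

    # Few-shot closed-book QA with a small BLOOM model (assumes torch + transformers)
    from transformers import AutoTokenizer, AutoModelForCausalLM

    model_name = "bigscience/bloom-560m"       # illustrative choice of model size
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    # in-context learning: a few QA pairs as the prompt, then the test question
    prompt = (
        "Q: Who wrote Hamlet?\nA: William Shakespeare\n"
        "Q: What is the capital of France?\nA: Paris\n"
        "Q: Who painted the Mona Lisa?\nA:"
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=5)

    # decode only the newly generated tokens (the model's answer)
    print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:]))

A 560M-parameter model will often miss such questions, which is exactly the slide's point: recall of long-tail facts improves as the model is scaled up.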


Compute-Optimal Large Language Models?

  • Given a fixed compute budget, more data or larger model?
  • DeepMind: more data (scale the number of training tokens together with model size)
    • Chinchilla model: 70B parameters trained on ~1.4T tokens outperforms much larger models (e.g., the 280B Gopher); see the rough rule of thumb below

[Figure: results on the MMLU Benchmark]
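
A back-of-the-envelope sketch of the Chinchilla finding, using the commonly cited approximation of roughly 20 training tokens per parameter (70B parameters ↔ ~1.4T tokens); the helper function below is just for illustration.

    # Rough "tokens per parameter" rule of thumb implied by the Chinchilla results
    def compute_optimal_tokens(n_params, tokens_per_param=20):
        """Approximate number of training tokens for a compute-optimal model."""
        return n_params * tokens_per_param

    for n_params in [1e9, 70e9, 175e9]:
        tokens = compute_optimal_tokens(n_params)
        print(f"{n_params / 1e9:.0f}B params -> ~{tokens / 1e12:.2f}T tokens")
    # 70B params -> ~1.40T tokens, roughly what Chinchilla was trained on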