BERT
Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
Core model
Tokenization: WordPiece, 30k-token vocabulary; special tokens [CLS] and [SEP]
Embeddings: sum of token, segment, and position embeddings
BERT (transformer encoder)
Last hidden state: one contextual vector per input token
Example input (sentence pair): [CLS] My dog is cute [SEP] He likes playing [SEP]
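A minimal sketch of this pipeline, assuming the Hugging Face transformers library, PyTorch, and the pretrained bert-base-uncased checkpoint (none of which are named in the slides):

```python
# Tokenize a sentence pair, run the encoder, and read the last hidden state
# (one contextual vector per token). Assumes Hugging Face `transformers`.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# The tokenizer inserts [CLS] and [SEP] and builds segment (token_type) ids.
inputs = tokenizer("My dog is cute", "He likes playing", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Shape: (batch, sequence_length, hidden_size) -> one vector per token.
print(outputs.last_hidden_state.shape)
```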
Pretraining BERT
MLM (Masked Language Modeling)
Diagram: input tokens → BERT → hidden states C ([CLS]), T1…T5, S ([SEP]); the hidden state at the masked position (here T4) feeds a linear layer and a softmax over the 30k-token vocabulary to predict the original token.
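A sketch of that MLM head, with BERT-base sizes (hidden 768, vocab ~30k); the tensors and the label are dummy stand-ins for illustration:

```python
# Illustrative MLM head: project the hidden state at a masked position to
# vocabulary logits, train with cross-entropy against the original token.
import torch
import torch.nn as nn

hidden_size, vocab_size = 768, 30522

mlm_head = nn.Linear(hidden_size, vocab_size)   # linear layer over the vocab

# Stand-in for T4, the hidden state of the masked token.
t4 = torch.randn(1, hidden_size)
logits = mlm_head(t4)
probs = torch.softmax(logits, dim=-1)           # softmax over the 30k vocab

original_token_id = torch.tensor([2023])        # dummy label for the example
loss = nn.functional.cross_entropy(logits, original_token_id)
```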
NSP (Next Sentence Prediction)
Diagram: sentence pair → BERT → hidden states C ([CLS]), T1…T5, S ([SEP]); the [CLS] hidden state C feeds a linear layer and a softmax over 2 classes (IsNext / NotNext).
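A sketch of the NSP head under the same assumptions (BERT-base hidden size, dummy tensors and label):

```python
# Illustrative NSP head: the [CLS] hidden state C is projected to two logits
# (IsNext / NotNext) and trained as binary classification.
import torch
import torch.nn as nn

hidden_size = 768
nsp_head = nn.Linear(hidden_size, 2)            # linear layer, 2 classes

c = torch.randn(1, hidden_size)                 # stand-in for the [CLS] state C
logits = nsp_head(c)
probs = torch.softmax(logits, dim=-1)           # softmax (dim 2)

is_next = torch.tensor([1])                     # dummy label: B follows A
loss = nn.functional.cross_entropy(logits, is_next)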
Fine-tuning
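A minimal fine-tuning sketch, assuming the Hugging Face transformers library and a two-class sentence-pair classification task (an illustrative setup, not the paper's exact recipe):

```python
# Fine-tuning sketch: a classification head on top of the pretrained encoder,
# trained end-to-end. Data here is a single dummy sentence pair and label.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

batch = tokenizer(["My dog is cute"], ["He likes playing"],
                  return_tensors="pt", padding=True)
labels = torch.tensor([1])                      # dummy label

model.train()
outputs = model(**batch, labels=labels)         # loss computed internally
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
```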
Ablation studies
Ablation studies - feature-based approach
Ablation studies - model size
Comparison of BERT vs OpenAI GPT
|                        | BERT                       | OpenAI GPT                       |
| Pretraining dataset    | BooksCorpus + Wikipedia    | BooksCorpus                      |
| Batch size (sequences) | 256                        | 64                               |
| Bidirectional          | yes (multi-head attention) | no (masked multi-head attention) |
| Parameters             | 110M (BERT-base)           | 117M                             |
| Training objective     | MLM + NSP                  | left-to-right language modeling  |
GLUE (General Language Understanding Evaluation) results
SQuAD (Stanford Question Answering Dataset)
SWAG (Situations With Adversarial Generations)
END
Transformer