1 of 23

BERT

Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova

2 of 23

  • BERT: Bidirectional Encoder Representations from Transformers

  • BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers

  • As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks

3 of 23

Core model

Pipeline (example input pair: "My dog is cute", "He likes playing"):

  • Tokenization
  • Embeddings
  • BERT (transformer encoder)
  • Last hidden state
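A minimal sketch of this pipeline, assuming the HuggingFace transformers library and the bert-base-uncased checkpoint (the checkpoint choice is an assumption, not taken from the slides):

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# Tokenization; embeddings are computed inside the model.
inputs = tokenizer("My dog is cute", "He likes playing", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Last hidden state: one vector per input token, shape (batch, seq_len, hidden=768).
print(outputs.last_hidden_state.shape)
```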

4 of 23

Input

  • a single sentence or a pair of sentences
  • WordPiece tokenization and embeddings
  • 30,000-token vocabulary

  • The first token is always the [CLS] token
  • The final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks.
  • The two sentences are differentiated in two ways at once: 1. a [SEP] token between them, 2. a segment embedding added to every token (see the sketch below)
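A minimal sketch of how an input pair is prepared, assuming the HuggingFace tokenizer for bert-base-uncased (the checkpoint choice is an assumption):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")  # WordPiece vocabulary of ~30k tokens

enc = tokenizer("My dog is cute", "He likes playing")

# The token sequence starts with [CLS]; the two sentences are separated/terminated by [SEP].
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))

# Segment (token type) ids: 0 for sentence A (incl. [CLS] and the first [SEP]), 1 for sentence B.
print(enc["token_type_ids"])
```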

5 of 23

6 of 23

Pretraining BERT

  • two unsupervised tasks

  • 1. Masked LM (MLM)
  • 2. Next Sentence Prediction (NSP)

7 of 23

MLM

  • Standard language models can only be trained left-to-right or right-to-left, not bidirectionally

  • The idea: simply mask some percentage of the input tokens at random, and then predict those masked tokens
  • 15% of the token positions are chosen at random for prediction; because the [MASK] token does not appear during fine-tuning, a chosen i-th token is replaced with the [MASK] token only 80% of the time, with a random token 10% of the time, and left unchanged 10% of the time (see the sketch below)
  • Cross-entropy loss; T_i (the final hidden state at position i) is used to predict the original masked token
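A sketch of the 80/10/10 masking rule, assuming token ids as plain Python ints; mask_id and vocab_size are illustrative values, not taken from the slides:

```python
import random

def mask_tokens(token_ids, mask_id=103, vocab_size=30522, mask_prob=0.15):
    """Return (input_ids, labels); labels are -100 for positions not selected for prediction."""
    input_ids, labels = list(token_ids), []
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:          # position chosen for prediction (15%)
            labels.append(tok)                   # the model always predicts the original token
            r = random.random()
            if r < 0.8:                          # 80%: replace with [MASK]
                input_ids[i] = mask_id
            elif r < 0.9:                        # 10%: replace with a random token
                input_ids[i] = random.randrange(vocab_size)
            # else 10%: keep the token unchanged
        else:
            labels.append(-100)                  # ignored by the loss
    return input_ids, labels
```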

8 of 23

[Diagram: legend: C = final hidden state at [CLS], S = final hidden state at [SEP], T4 = masked token position. BERT produces C, T1 ... T5, S; T4, the hidden state at the masked position, is fed through a linear layer and a softmax of dimension 30k (the vocabulary size).]
9 of 23

NSP

  • Goal: learn the relationship between two sentences
  • Given sentence pairs A, B: in 50% of the pairs B actually follows A in the corpus, in the other 50% B is a random sentence from the training corpus (see the sketch below)
  • C (i.e. the final hidden state of the [CLS] token) is used to predict whether B follows A or not

  • BERT is pretrained on both tasks simultaneously; the MLM and NSP losses are summed
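A sketch of NSP example construction, assuming corpus is a list of documents, each a list of at least two consecutive sentences (names are illustrative, not from the paper):

```python
import random

def make_nsp_example(corpus):
    doc = random.choice(corpus)                 # assumes each document has >= 2 sentences
    i = random.randrange(len(doc) - 1)
    sent_a = doc[i]
    if random.random() < 0.5:
        sent_b, is_next = doc[i + 1], 1         # 50%: the actual next sentence
    else:
        sent_b, is_next = random.choice(random.choice(corpus)), 0  # 50%: a random sentence
    return sent_a, sent_b, is_next
```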

10 of 23

[Diagram: BERT produces C, T1 ... T5, S; C, the final hidden state of the [CLS] token, is fed through a linear layer and a softmax of dimension 2 (binary classification: IsNext / NotNext).]
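A minimal sketch of the NSP head above together with the MLM head, and the summed pretraining loss from the previous slide; assuming PyTorch, with dummy encoder outputs and illustrative shapes:

```python
import torch
import torch.nn as nn

hidden_size, vocab_size = 768, 30522

mlm_head = nn.Linear(hidden_size, vocab_size)    # predicts masked tokens
nsp_head = nn.Linear(hidden_size, 2)             # IsNext / NotNext, from C
loss_fn = nn.CrossEntropyLoss(ignore_index=-100)

# Dummy encoder outputs: (batch, seq_len, hidden); C is the hidden state at position 0.
sequence_output = torch.randn(8, 128, hidden_size)
c = sequence_output[:, 0]

mlm_labels = torch.full((8, 128), -100)          # -100 = not selected for prediction
mlm_labels[:, 5] = 42                            # pretend one masked position per example
nsp_labels = torch.randint(0, 2, (8,))

mlm_loss = loss_fn(mlm_head(sequence_output).view(-1, vocab_size), mlm_labels.view(-1))
nsp_loss = loss_fn(nsp_head(c), nsp_labels)
total_loss = mlm_loss + nsp_loss                 # the two pretraining losses are summed
```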

11 of 23

Fine-tuning

  • The final hidden state of the [CLS] token is used for sentence-level classification tasks
  • The final hidden states of the input tokens are used for token-level tasks (e.g. question answering, where we look for the most probable answer span in a document given a query)
  • Typically, to fine-tune BERT on a dataset, only a single linear layer is appended (see the sketch after this list)

  • GLUE (General Language Understanding Evaluation): a collection of diverse NLU tasks, generally used to compare model architectures against each other
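A minimal sketch of fine-tuning for a sentence-pair classification task, assuming PyTorch and the HuggingFace BertModel; the task, label count, and hyperparameters are illustrative:

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
classifier = nn.Linear(bert.config.hidden_size, 2)   # the single appended linear layer

optimizer = torch.optim.AdamW(
    list(bert.parameters()) + list(classifier.parameters()), lr=2e-5
)

inputs = tokenizer("My dog is cute", "He likes playing", return_tensors="pt")
labels = torch.tensor([1])

outputs = bert(**inputs)
cls_state = outputs.last_hidden_state[:, 0]          # final hidden state of [CLS]
logits = classifier(cls_state)
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()
optimizer.step()
```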

12 of 23

13 of 23

  • Notation: number of layers (i.e. Transformer blocks) L, hidden size H, number of self-attention heads A
  • BERT-Base (L=12, H=768, A=12, total parameters = 110M) and BERT-Large (L=24, H=1024, A=16, total parameters = 340M)
  • BERT-Base was chosen to have roughly the same model size as OpenAI GPT, for comparison purposes (see the sketch below)
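A rough back-of-the-envelope check of the 110M figure for BERT-Base, assuming the standard Transformer encoder layout (WordPiece vocabulary of 30,522, 512 positions, FFN inner size 4H); bias and LayerNorm parameters are ignored for simplicity:

```python
# Rough parameter count for BERT-Base (L=12, H=768), ignoring biases and LayerNorm.
L, H = 12, 768
vocab, positions, segments = 30522, 512, 2

embeddings = (vocab + positions + segments) * H   # token + position + segment embeddings
attention_per_layer = 4 * H * H                   # Q, K, V and output projections
ffn_per_layer = 2 * H * (4 * H)                   # two feed-forward matrices, inner size 4H
total = embeddings + L * (attention_per_layer + ffn_per_layer)

print(f"{total / 1e6:.0f}M parameters")           # roughly 109M, i.e. ~110M
```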

14 of 23

Ablation studies

15 of 23

Ablation studies - feature-based approach

16 of 23

Ablation studies - model size

  • scaling to extreme model sizes also leads to large improvements on very small scale tasks, provided that the model has been sufficiently pre-trained.

17 of 23

Comparison of BERT vs OpenAI GPT

                        BERT                          OpenAI GPT
dataset (pretraining)   BooksCorpus + Wikipedia       BooksCorpus
batch size              256                           64
bidirectional           yes (multi-head attention)    no (masked multi-head attention)
size                    110M (base)                   117M
training objective      MLM + NSP                     left-to-right LM (max. likelihood)

18 of 23

GLUE results

19 of 23

SQuAD

20 of 23

SWAG (Situations With Adversarial Generations)

21 of 23

END

22 of 23

Transformer

23 of 23