1 of 23

BERT

Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova

2 of 23

  • BERT: Bidirectional Encoder Representations from Transformers

  • BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers

  • As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks

3 of 23

Core model

Pipeline (example input pair: "My dog is cute", "He likes playing"):

  • Tokenization
  • Embeddings
  • BERT (transformer encoder)
  • Last hidden state
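A minimal sketch of this pipeline, assuming the HuggingFace transformers library and the bert-base-uncased checkpoint (the checkpoint choice is an assumption, not taken from the slides):

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# Tokenization; embeddings are computed inside the model.
inputs = tokenizer("My dog is cute", "He likes playing", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Last hidden state: one vector per input token, shape (batch, seq_len, hidden=768).
print(outputs.last_hidden_state.shape)
```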

4 of 23

Input

  • a single sentence or a pair of sentences
  • WordPiece tokenization and embeddings
  • 30,000-token vocabulary

  • The first token is always the [CLS] token
  • The final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks.
  • The two sentences are differentiated in two ways at once: 1. a [SEP] token between them, 2. a segment embedding added to every token (see the sketch below)
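A minimal sketch of how an input pair is prepared, assuming the HuggingFace tokenizer for bert-base-uncased (the checkpoint choice is an assumption):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")  # WordPiece vocabulary of ~30k tokens

enc = tokenizer("My dog is cute", "He likes playing")

# The token sequence starts with [CLS]; the two sentences are separated/terminated by [SEP].
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))

# Segment (token type) ids: 0 for sentence A (incl. [CLS] and the first [SEP]), 1 for sentence B.
print(enc["token_type_ids"])
```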

5 of 23

6 of 23

Pretraining BERT

  • two unsupervised tasks

  • 1. Masked LM (MLM)
  • 2. Next Sentence Prediction (NSP)

7 of 23

MLM

  • Standard language models can only be trained left-to-right or right-to-left, not bidirectionally

  • The idea: simply mask some percentage of the input tokens at random, and then predict those masked tokens
  • 15% of the token positions are chosen at random for prediction; because the [MASK] token does not appear during fine-tuning, a chosen i-th token is replaced with the [MASK] token only 80% of the time, with a random token 10% of the time, and left unchanged 10% of the time (see the sketch below)
  • Cross-entropy loss; T_i (the final hidden state at position i) is used to predict the original masked token
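A sketch of the 80/10/10 masking rule, assuming token ids as plain Python ints; mask_id and vocab_size are illustrative values, not taken from the slides:

```python
import random

def mask_tokens(token_ids, mask_id=103, vocab_size=30522, mask_prob=0.15):
    """Return (input_ids, labels); labels are -100 for positions not selected for prediction."""
    input_ids, labels = list(token_ids), []
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:          # position chosen for prediction (15%)
            labels.append(tok)                   # the model always predicts the original token
            r = random.random()
            if r < 0.8:                          # 80%: replace with [MASK]
                input_ids[i] = mask_id
            elif r < 0.9:                        # 10%: replace with a random token
                input_ids[i] = random.randrange(vocab_size)
            # else 10%: keep the token unchanged
        else:
            labels.append(-100)                  # ignored by the loss
    return input_ids, labels
```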

8 of 23

[Diagram: legend: C = final hidden state at [CLS], S = final hidden state at [SEP], T4 = masked token position. BERT produces C, T1 ... T5, S; T4, the hidden state at the masked position, is fed through a linear layer and a softmax of dimension 30k (the vocabulary size).]
9 of 23

NSP

  • Goal: learn the relationship between two sentences
  • Given sentence pairs A, B: in 50% of the pairs B actually follows A in the corpus, in the other 50% B is a random sentence from the training corpus (see the sketch below)
  • C (i.e. the final hidden state of the [CLS] token) is used to predict whether B follows A or not

  • BERT is pretrained on both tasks simultaneously; the MLM and NSP losses are summed
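A sketch of NSP example construction, assuming corpus is a list of documents, each a list of at least two consecutive sentences (names are illustrative, not from the paper):

```python
import random

def make_nsp_example(corpus):
    doc = random.choice(corpus)                 # assumes each document has >= 2 sentences
    i = random.randrange(len(doc) - 1)
    sent_a = doc[i]
    if random.random() < 0.5:
        sent_b, is_next = doc[i + 1], 1         # 50%: the actual next sentence
    else:
        sent_b, is_next = random.choice(random.choice(corpus)), 0  # 50%: a random sentence
    return sent_a, sent_b, is_next
```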

10 of 23

[Diagram: BERT produces C, T1 ... T5, S; C, the final hidden state of the [CLS] token, is fed through a linear layer and a softmax of dimension 2 (binary classification: IsNext / NotNext).]
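A minimal sketch of the NSP head above together with the MLM head, and the summed pretraining loss from the previous slide; assuming PyTorch, with dummy encoder outputs and illustrative shapes:

```python
import torch
import torch.nn as nn

hidden_size, vocab_size = 768, 30522

mlm_head = nn.Linear(hidden_size, vocab_size)    # predicts masked tokens
nsp_head = nn.Linear(hidden_size, 2)             # IsNext / NotNext, from C
loss_fn = nn.CrossEntropyLoss(ignore_index=-100)

# Dummy encoder outputs: (batch, seq_len, hidden); C is the hidden state at position 0.
sequence_output = torch.randn(8, 128, hidden_size)
c = sequence_output[:, 0]

mlm_labels = torch.full((8, 128), -100)          # -100 = not selected for prediction
mlm_labels[:, 5] = 42                            # pretend one masked position per example
nsp_labels = torch.randint(0, 2, (8,))

mlm_loss = loss_fn(mlm_head(sequence_output).view(-1, vocab_size), mlm_labels.view(-1))
nsp_loss = loss_fn(nsp_head(c), nsp_labels)
total_loss = mlm_loss + nsp_loss                 # the two pretraining losses are summed
```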

11 of 23

Fine-tuning

  • The final hidden state of the [CLS] token is used for sentence-level classification tasks
  • The final hidden states of the input tokens are used for token-level tasks (e.g. question answering, where we look for the most probable answer span in a document given a query)
  • Typically, to fine-tune BERT on a dataset, only a single linear layer is appended (see the sketch after this list)

  • GLUE (General Language Understanding Evaluation): a collection of diverse NLU tasks, generally used to compare model architectures against each other
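A minimal sketch of fine-tuning for a sentence-pair classification task, assuming PyTorch and the HuggingFace BertModel; the task, label count, and hyperparameters are illustrative:

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
classifier = nn.Linear(bert.config.hidden_size, 2)   # the single appended linear layer

optimizer = torch.optim.AdamW(
    list(bert.parameters()) + list(classifier.parameters()), lr=2e-5
)

inputs = tokenizer("My dog is cute", "He likes playing", return_tensors="pt")
labels = torch.tensor([1])

outputs = bert(**inputs)
cls_state = outputs.last_hidden_state[:, 0]          # final hidden state of [CLS]
logits = classifier(cls_state)
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()
optimizer.step()
```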

12 of 23

13 of 23

  • Notation: number of layers (i.e. Transformer blocks) L, hidden size H, number of self-attention heads A
  • BERT-Base (L=12, H=768, A=12, total parameters = 110M) and BERT-Large (L=24, H=1024, A=16, total parameters = 340M)
  • BERT-Base was chosen to have roughly the same model size as OpenAI GPT, for comparison purposes (see the sketch below)
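A rough back-of-the-envelope check of the 110M figure for BERT-Base, assuming the standard Transformer encoder layout (WordPiece vocabulary of 30,522, 512 positions, FFN inner size 4H); bias and LayerNorm parameters are ignored for simplicity:

```python
# Rough parameter count for BERT-Base (L=12, H=768), ignoring biases and LayerNorm.
L, H = 12, 768
vocab, positions, segments = 30522, 512, 2

embeddings = (vocab + positions + segments) * H   # token + position + segment embeddings
attention_per_layer = 4 * H * H                   # Q, K, V and output projections
ffn_per_layer = 2 * H * (4 * H)                   # two feed-forward matrices, inner size 4H
total = embeddings + L * (attention_per_layer + ffn_per_layer)

print(f"{total / 1e6:.0f}M parameters")           # roughly 109M, i.e. ~110M
```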

14 of 23

Ablation studies

15 of 23

Ablation studies - feature-based approach

16 of 23

Ablation studies - model size

  • scaling to extreme model sizes also leads to large improvements on very small scale tasks, provided that the model has been sufficiently pre-trained.

17 of 23

Comparison of BERT vs OpenAI GPT

                        BERT                          OpenAI GPT
dataset (pretraining)   BooksCorpus + Wikipedia       BooksCorpus
batch size              256                           64
bidirectional           yes (multi-head attention)    no (masked multi-head attention)
size                    110M (base)                   117M
training objective      MLM + NSP                     left-to-right LM (max. likelihood)

18 of 23

GLUE results

19 of 23

SQuAD

20 of 23

SWAG (Situations With Adversarial Generations)

21 of 23

END

22 of 23

Transformer

23 of 23