1 of 39

BERT and GPT-2

2 of 39

Improvements of Transformer

  1. Google’s BERT
    • Bidirectional Encoder Representations from Transformers
    • BERT is an improvement of GPT
      • Key technical innovation is bidirectional training of Transformer
  2. OpenAI’s Generative Pre-Training (GPT)
    • Combines supervised and unsupervised learning to improve word vectors
      • Performs Generative Pre-Training
  3. State-of-the-art for the following tasks:
    • Textual entailment, semantic similarity, reading comprehension, commonsense reasoning, linguistic acceptability, multi-task benchmark

3 of 39

Bidirectional Encoder Representations from Transformers (BERT)

4 of 39

Google’s BERT

(Bidirectional Encoder Representations from Transformers)

  1. Uses Transformer attention mechanism to learn contextual relations between words
    • Transformer includes two mechanisms
      1. An Encoder that reads text input
      2. A Decoder that produces a prediction for the task
    • Since BERT’s goal is to generate a language model, only the encoder mechanism is necessary
  2. Uses Bidirectional training
    • Deeper sense of context/flow than single-direction
  3. Masked LM (MLM) allows bidirectional training

5 of 39

BERT

  • BERT = Encoder of Transformer

[Figure: BERT as the Transformer encoder; an input token sequence ("Let's improvise the skit …") goes in, and one output vector per token comes out. Learned from a large amount of text without annotation.]

6 of 39

Structure of BERT

The BERT Encoder block implements the base version of the BERT network

It is composed of 12 successive transformer layers, each having 12 attention heads

The total number of parameters is 110 million.

https://peltarion.com/knowledge-center/documentation/modeling-view/build-an-ai-model/blocks/bert-encoder

7 of 39

BERT Architecture

  • BERT architecture builds on top of Transformer
  • There are two variants:
    • BERT Base:
      • 12 layers (transformer blocks), 12 attention heads, and 110 million parameters
    • BERT Large:
      • 24 layers (transformer blocks), 16 attention heads, and 340 million parameters

8 of 39

[Figure: model input dimension is 512 (the maximum sequence length); input and output vectors have the same size.]

9 of 39

Text Pre-processing

Position Embeddings:

BERT learns and uses positional embeddings to express the position of words

in a sentence. These are added to overcome the limitation of Transformer which, unlike an RNN, is not able to capture “sequence” or “order” information

Segment Embeddings:

BERT can also take sentence pairs as input, for tasks such as question answering.

That’s why it learns a unique embedding for the first and the second sentence, to help the model distinguish between them. In the figure's example, all tokens marked EA belong to sentence A (and similarly EB to sentence B)

Token Embeddings:

These are the embeddings learned for the specific token from the WordPiece token vocabulary
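The three embeddings above are summed element-wise to form the model input. A minimal numpy sketch (tables are random stand-ins for the learned ones, and the vocabulary is shrunk from BERT's 30,522 entries; the token ids are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_len, hidden = 1000, 512, 768  # vocab shrunk for the sketch

# Learned embedding tables (random stand-ins here).
token_emb = rng.normal(size=(vocab_size, hidden))
segment_emb = rng.normal(size=(2, hidden))         # sentence A = 0, B = 1
position_emb = rng.normal(size=(max_len, hidden))  # learned, not sinusoidal

def bert_input_representation(token_ids, segment_ids):
    """Element-wise sum of token, segment, and position embeddings."""
    positions = np.arange(len(token_ids))
    return (token_emb[token_ids]
            + segment_emb[segment_ids]
            + position_emb[positions])

# "[CLS] wake up [SEP] you are late" with made-up token ids
token_ids = np.array([101, 11, 12, 102, 21, 22, 23])
segment_ids = np.array([0, 0, 0, 0, 1, 1, 1])
x = bert_input_representation(token_ids, segment_ids)
print(x.shape)  # (7, 768)
```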

10 of 39

BERT pretraining

  • ULM-FiT (2018): pre-training ideas, transfer learning in NLP
  • ELMo: bidirectional training (with LSTMs)
  • Transformer: uses context from the left, but still misses context from the right
  • GPT: uses the Transformer decoder half
  • BERT: switches from decoder to encoder, so that training can use both sides, and invents a corresponding training task: the masked language model

11 of 39

Pre-training BERT

  • BERT is pre-trained on two NLP tasks:
    1. Masked Language Modeling (for bidirectionality)
    2. Next Sentence Prediction

12 of 39

BERT Pretraining Task 1: masked words

  • Approach 1: Masked LM

[Figure: the input "Let's [MASK] the skit …" is fed to BERT; a linear multi-class classifier (output size = vocabulary size) takes the output vector at the [MASK] position and predicts the masked word.]

13 of 39

BERT Pretraining Task 1: masked words

Before feeding word sequences to BERT, 15% of the words in each sequence are replaced with a [MASK] token.

Of this 15%: 80% become [MASK], 10% are replaced with random words, and 10% keep the original word.

The model attempts to predict the original value of the masked words, based on the context provided by the non-masked words in the sequence.

Prediction of the output words requires

  • Adding a classifier layer on top of the encoder output
  • Multiplying the output vectors by the embedding matrix, transforming them into the vocabulary dimension
  • Calculating the probability of each word in the vocabulary with softmax
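The 15% / 80% / 10% / 10% corruption scheme can be sketched in plain Python (the token list and the tiny replacement vocabulary are made up for illustration):

```python
import random

def mask_tokens(tokens, vocab, mask_rate=0.15, seed=0):
    """Select ~15% of positions; of those, 80% become [MASK], 10% become
    a random word, and 10% keep the original word. Returns the corrupted
    sequence and a dict mapping each selected position to the original
    word the model must predict there."""
    rng = random.Random(seed)
    masked, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            targets[i] = tok
            r = rng.random()
            if r < 0.8:
                masked[i] = "[MASK]"
            elif r < 0.9:
                masked[i] = rng.choice(vocab)
            # else: leave the original word in place
    return masked, targets

tokens = "let us improvise the skit before the show starts".split()
masked, targets = mask_tokens(tokens, vocab=["cat", "dog", "sky"], seed=1)
```

The loss is computed only at the positions recorded in `targets`, not over the whole sequence.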

14 of 39

Word prediction using masking

15 of 39

BERT Pretraining Task 2: two sentences

  • A text dataset of 100,000 sentences yields 50,000 training examples (pairs of sentences).
    • For 50% of the pairs, the second sentence is the actual next sentence after the first
    • For the remaining 50%, the second sentence is a random sentence from the corpus
      • The label is ‘IsNext’ in the first case and ‘NotNext’ in the second
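The pair construction can be sketched as follows (the sentence list is a placeholder):

```python
import random

def make_nsp_examples(sentences, seed=0):
    """Pair up consecutive sentences; with probability 0.5 keep the true
    next sentence (IsNext), otherwise substitute a random sentence from
    the corpus (NotNext). 100,000 sentences yield 50,000 pairs."""
    rng = random.Random(seed)
    examples = []
    for i in range(0, len(sentences) - 1, 2):
        a = sentences[i]
        if rng.random() < 0.5:
            b, label = sentences[i + 1], "IsNext"
        else:
            # Note: a real pipeline would re-sample if the random pick
            # happened to be the actual next sentence.
            b, label = rng.choice(sentences), "NotNext"
        examples.append((a, b, label))
    return examples

sentences = [f"sentence {i}" for i in range(10)]
examples = make_nsp_examples(sentences)
```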

16 of 39

BERT Pretraining Task 2: Next Sentence Prediction

[Figure: the input "[CLS] wake up [SEP] you are late" is fed to BERT; a linear binary classifier on the [CLS] output answers "yes" (the second sentence follows the first).]

[CLS]: the position that outputs the classification result

[SEP]: the boundary between the two sentences

Approaches 1 and 2 are used at the same time.

17 of 39

BERT Pretraining Task 2: Next Sentence Prediction

[Figure: a negative example; the input "[CLS] wake up [SEP] sky is blue" is fed to BERT, and the linear binary classifier on the [CLS] output answers "no" (the second sentence does not follow the first).]

[CLS]: the position that outputs the classification result

[SEP]: the boundary between the two sentences

Approaches 1 and 2 are used at the same time.

18 of 39

Fine-tuning BERT for other specific tasks

SST (Stanford Sentiment Treebank): 215k phrases with fine-grained sentiment labels in the parse trees of 11k sentences

MNLI (natural language inference)

QQP (Quora Question Pairs, semantic equivalence)

QNLI (NL inference dataset)

STS-B (textual similarity)

MRPC (paraphrase, Microsoft)

RTE (textual entailment)

SWAG (commonsense inference)

SST-2 (sentiment)

CoLA (linguistic acceptability)

SQuAD (question answering)

19 of 39

How to use BERT – Case 1

[Figure: the sentence "w1 w2 w3" is fed to BERT with a [CLS] token prepended; a linear classifier on the [CLS] output predicts the class. The classifier is trained from scratch; BERT is fine-tuned.]

Input: single sentence, output: class

Example: sentiment analysis (our HW), document classification
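A toy numpy sketch of this setup (a randomly initialized linear classifier standing in for the trained one, and a random vector standing in for BERT's [CLS] output):

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, num_classes = 768, 2  # e.g. positive / negative sentiment

# Linear classifier on top of the [CLS] output; trained from scratch,
# while BERT itself is fine-tuned. Weights are random stand-ins here.
W = rng.normal(scale=0.02, size=(hidden, num_classes))
b = np.zeros(num_classes)

def classify(cls_vector):
    """Softmax over a linear projection of the [CLS] vector."""
    logits = cls_vector @ W + b
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

cls_vector = rng.normal(size=hidden)  # stand-in for BERT's [CLS] output
p = classify(cls_vector)              # class probabilities, summing to 1
```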

20 of 39

How to use BERT – Case 2

[Figure: the sentence "w1 w2 w3" is fed to BERT with a [CLS] token prepended; a separate linear classifier on each word's output vector predicts that word's class.]

Input: single sentence, output: class of each word

Example: slot filling

21 of 39

How to use BERT – Case 3

[Figure: "[CLS] sentence 1 [SEP] sentence 2" (tokens w1 w2 [SEP] w3 w4 w5) is fed to BERT; a linear classifier on the [CLS] output predicts the class.]

Input: two sentences, output: class

Example: Natural Language Inference

Given a “premise”, determine whether a “hypothesis” is true, false, or unknown.

22 of 39

How to use BERT – Case 4

  • Extraction-based Question Answering (QA) (e.g. SQuAD)

[Figure: a QA model takes a document and a query as input and outputs two integers s and e; the answer is the span of document tokens from position s to position e (the figure's example indices: 17, and 77 to 79).]

23 of 39

How to use BERT – Case 4

[Figure: "[CLS] question (q1 q2) [SEP] document (d1 d2 d3)" is fed to BERT. A vector learned from scratch is dot-producted with each document token's output vector, and a softmax over the scores (0.5, 0.3, 0.2 in the figure) selects the start position; here s = 2. With s = 2 and e = 3, the answer is “d2 d3”.]

24 of 39

How to use BERT – Case 4

[Figure: the same setup with a second vector, also learned from scratch, dot-producted with each document token's output vector; a softmax over the scores (0.2, 0.1, 0.7 in the figure) selects the end position; here e = 3. With s = 2 and e = 3, the answer is “d2 d3”.]
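The span selection can be sketched in numpy (random vectors stand in for the BERT outputs and for the two learned start/end vectors):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
hidden = 768

# Stand-ins: BERT output vectors for the document tokens d1, d2, d3.
doc_outputs = rng.normal(size=(3, hidden))

# Two extra vectors, learned from scratch during fine-tuning:
# one scores start positions, the other end positions.
start_vec = rng.normal(size=hidden)
end_vec = rng.normal(size=hidden)

start_probs = softmax(doc_outputs @ start_vec)  # softmax over dot products
end_probs = softmax(doc_outputs @ end_vec)

s = int(np.argmax(start_probs)) + 1  # 1-based token indices
e = int(np.argmax(end_probs)) + 1
# The predicted answer is the span d_s ... d_e (treated as empty if e < s).
```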

25 of 39

Enhanced Representation through Knowledge Integration (ERNIE)

  • Designed for Chinese

https://arxiv.org/abs/1904.09223

[Figure: comparison of BERT masking (individual characters) and ERNIE masking (whole entities/phrases). Source of image: https://zhuanlan.zhihu.com/p/59436589]

26 of 39

What does BERT learn?

https://arxiv.org/abs/1905.05950

https://openreview.net/pdf?id=SJzSgnRcKX

27 of 39

Multilingual BERT

https://arxiv.org/abs/1904.09077

Trained on 104 languages

[Figure: task-specific training data is in English (En sentences labeled class 1, class 2, class 3); task-specific testing data is in Chinese (Zh sentences with unknown labels). The fine-tuned model is applied directly to Chinese.]

28 of 39

GPT-2 generates human-like output

  • Trained to predict the next word (on 40GB of internet text)
  • While trained only to predict the next word, it surprisingly learned basic competence in tasks like machine translation and question answering
    • That’s without ever being told that it would be evaluated on those tasks

29 of 39

GPT-2 in action

[Figure: GPT-2 generating "... not injure a human being" one word at a time; each predicted word ("injure", "a", "human", "being") is fed back as input to predict the next.]

30 of 39

GPT-2 generated synthetic text

  • System Prompt (human-written)
    • In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.
  • Model Completion (machine-written, 10 tries)
    • The scientist named the population, after their distinctive horn, Ovid’s Unicorn. These four-horned, silver-white unicorns were previously unknown to science.
    • Now, after almost two centuries, the mystery of what sparked this odd phenomenon is finally solved.
    • Dr. Jorge Pérez, an evolutionary biologist from the University of La Paz, and several companions, were exploring the Andes Mountains when they found a small valley, with no other animals or humans. Pérez noticed that the valley had what appeared to be a natural fountain, surrounded by two peaks of rock and silver snow.
    • Pérez and the others then ventured further into the valley. “By the time we reached the top of one peak, the water looked blue, with some crystals on top,” said Pérez.
    • Pérez and his friends were astonished to see the unicorn herd. These creatures could be seen from the air without having to move too much to see them; they were so close they could touch their horns.

31 of 39

Architecture of GPT

  1. Text and position are transformed into vectors
  2. These pass through multi-head self-attention
  3. The results of steps 1 and 2 are added and layer-normalized
  4. The result passes through a fully-connected feed-forward network
  5. The results of steps 3 and 4 are added and layer-normalized
  6. Finally, 12 such blocks (each with multi-head self-attention) are stacked to compute the output vectors
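The six steps can be sketched in numpy. This toy version uses a single attention head, ReLU instead of GPT's GELU, and one set of weights shared across the 12 stacked blocks; these are simplifications for brevity, not GPT's actual configuration, which learns separate multi-head weights per layer:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # model width (768 in GPT; shrunk here)

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

Wq, Wk, Wv = (rng.normal(scale=0.02, size=(d, d)) for _ in range(3))
W1 = rng.normal(scale=0.02, size=(d, 4 * d))
W2 = rng.normal(scale=0.02, size=(4 * d, d))

def block(x):
    n = x.shape[0]
    # Step 2: (single-head) self-attention with a causal mask, so each
    # position attends only to itself and earlier positions.
    mask = np.triu(np.full((n, n), -1e9), k=1)
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    a = softmax(q @ k.T / np.sqrt(d) + mask) @ v
    h = layer_norm(x + a)                 # step 3: add and normalize
    f = np.maximum(h @ W1, 0) @ W2        # step 4: feed-forward (ReLU here)
    return layer_norm(h + f)              # step 5: add and normalize

x = rng.normal(size=(5, d))  # step 1: token + position vectors (stand-ins)
y = x
for _ in range(12):          # step 6: stack 12 blocks
    y = block(y)
print(y.shape)  # (5, 64)
```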

32 of 39

Byte Pair Encoding (BPE)

Word embedding is sometimes too high-level, while pure character embedding is too low-level. For example, if we have learned

old older oldest

we might also wish the computer to infer

smart smarter smartest

But at the whole-word level this inference is not so direct. The idea is therefore to break words up into pieces like er and est, and embed frequent fragments of words.

GPT adopts this BPE scheme.

33 of 39

Byte Pair Encoding (BPE)

GPT uses the BPE scheme. The subwords are computed by:

  1. Splitting each word into a sequence of characters (appending a </w> end-of-word symbol)
  2. Merging the most frequent adjacent pair of symbols
  3. Repeating step 2 until a pre-defined maximum number of subwords or iterations is reached

Example (5, 2, 6, 3 are the numbers of occurrences):

{‘l o w </w>’: 5, ‘l o w e r </w>’: 2, ‘n e w e s t </w>’: 6, ‘w i d e s t </w>’: 3 }

{‘l o w </w>’: 5, ‘l o w e r </w>’: 2, ‘n e w es t </w>’: 6, ‘w i d es t </w>’: 3 }

{‘l o w </w>’: 5, ‘l o w e r </w>’: 2, ‘n e w est </w>’: 6, ‘w i d est </w>’: 3 } (est freq. 9)

{‘lo w </w>’: 5, ‘lo w e r </w>’: 2, ‘n e w est</w>’: 6, ‘w i d est</w>’: 3 } (lo freq 7)

…..
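The merge loop above can be sketched directly on the example vocabulary (a simplified version: it merges via plain string replacement, which is fine for this toy corpus, while real BPE tracks symbol boundaries more carefully):

```python
from collections import Counter

def bpe_merges(vocab, num_merges):
    """vocab maps a space-separated symbol sequence to its corpus count.
    Repeatedly merge the most frequent adjacent symbol pair."""
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            syms = word.split()
            for a, b in zip(syms, syms[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        # Merge the chosen pair everywhere it occurs.
        vocab = {w.replace(f"{a} {b}", a + b): f for w, f in vocab.items()}
    return vocab

vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}
result = bpe_merges(vocab, 2)  # merges "e s" -> "es", then "es t" -> "est"
```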

34 of 39

Masked Self-Attention (to compute more efficiently)

35 of 39

Masked Self-Attention

36 of 39

Masked Self-Attention Calculation

Re-use previous computation results: at any step, only the q, k, and v related to the new output word need to be computed; the others are cached and re-used. The additional computation per step is linear rather than quadratic.
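This caching can be sketched in numpy (random weights, single attention head, toy dimensions):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d = 16  # model width (toy size)
Wq, Wk, Wv = (rng.normal(scale=0.1, size=(d, d)) for _ in range(3))

def step(x_new, k_cache, v_cache):
    """Process one new token: compute q, k, v only for the new word and
    attend over the cached keys/values. Because the cache holds only past
    positions, the causal mask is implicit, and each step costs time
    linear in the sequence length instead of quadratic."""
    q, k, v = x_new @ Wq, x_new @ Wk, x_new @ Wv
    k_cache.append(k)
    v_cache.append(v)
    K, V = np.stack(k_cache), np.stack(v_cache)
    return softmax(q @ K.T / np.sqrt(d)) @ V

xs = rng.normal(size=(4, d))  # token vectors arriving one at a time
k_cache, v_cache = [], []
outputs = [step(x, k_cache, v_cache) for x in xs]
```

Each incremental output matches what full masked self-attention over the whole prefix would produce.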

37 of 39

Input Formatting for GPT

38 of 39

GPT-2 Application: Translation

39 of 39

GPT-2 Application: Summarization