BERT and GPT-2
Improvements on the Transformer
Bidirectional Encoder Representations from Transformers (BERT)
Google’s BERT
(Bidirectional Encoder Representations from Transformers)
BERT is the Transformer's Encoder
[Figure: an input token sequence (e.g., "Let's improvise the skit …") is fed to BERT, which outputs one contextual embedding per token]
Learned from a large amount of text without annotation
Structure of BERT
The BERT Encoder block implements the base version of the BERT network.
It is composed of 12 successive Transformer encoder layers, each with 12 attention heads.
The total number of parameters is 110 million.
https://peltarion.com/knowledge-center/documentation/modeling-view/build-an-ai-model/blocks/bert-encoder
BERT Architecture
Maximum model input length: 512 tokens
Input and output vectors have the same size (768 for the base model)
Text Pre-processing
Position Embeddings:
BERT learns and uses positional embeddings to express the position of words
in a sentence. These are added to overcome a limitation of the Transformer, which, unlike an RNN, cannot otherwise capture "sequence" or "order" information
Segment Embeddings:
BERT can also take sentence pairs as input for tasks such as Question Answering.
It therefore learns a separate embedding for the first and the second sentence to help the model distinguish between them: every token of sentence A receives embedding EA, and every token of sentence B receives EB
Token Embeddings:
These are the embeddings learned for each token in the WordPiece token vocabulary
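The three embeddings above are simply summed element-wise to form BERT's input representation. A minimal sketch with numpy, using toy sizes and random tables (BERT-base actually uses hidden size 768, a ~30k WordPiece vocabulary, and max length 512; the token ids below are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes for illustration.
vocab_size, max_len, hidden = 100, 16, 8

# Three learned lookup tables.
token_emb = rng.normal(size=(vocab_size, hidden))
segment_emb = rng.normal(size=(2, hidden))          # sentence A = 0, sentence B = 1
position_emb = rng.normal(size=(max_len, hidden))   # learned, not sinusoidal

# Example input: [CLS] w1 w2 [SEP] w3 w4 (ids are arbitrary).
token_ids   = np.array([1, 17, 42, 2, 55, 63])
segment_ids = np.array([0,  0,  0, 0,  1,  1])      # EA for sentence A, EB for B
positions   = np.arange(len(token_ids))

# The input representation is the element-wise sum of the three embeddings.
x = token_emb[token_ids] + segment_emb[segment_ids] + position_emb[positions]
print(x.shape)  # (6, 8): one hidden-size vector per input token
```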
Pre-training BERT
BERT Pretraining Task 1: masked words
Masked LM
[Figure: in the input "Let's 退了 the skit …", the token "退了" is replaced with [MASK]; a linear multi-class classifier (output size = vocabulary size) predicts the masked word from BERT's output at that position]
BERT Pretraining Task 1: masked words
Before feeding word sequences to BERT, 15% of the words in each sequence are selected for masking.
Out of this 15%:
80% are replaced with [MASK],
10% with a random word,
10% are left as the original word.
The model attempts to predict the original value of the masked words, based on the context provided by the non-masked words in the sequence.
Predicting the output words requires a classification layer on top of the encoder output and a softmax over the vocabulary.
Word prediction using masking
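The 80/10/10 masking procedure can be sketched as follows (a toy illustration with stdlib `random`; the function name and the word-level tokens are our own simplification — BERT actually operates on WordPiece tokens):

```python
import random

def mask_tokens(tokens, vocab, mask_rate=0.15, seed=0):
    """BERT-style masking: select ~15% of positions; of those,
    80% -> [MASK], 10% -> a random word, 10% -> left unchanged.
    The original tokens at the selected positions are the targets."""
    rng = random.Random(seed)
    masked = list(tokens)
    targets = {}                      # position -> original token to predict
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_rate:
            continue
        targets[i] = tok
        r = rng.random()
        if r < 0.8:
            masked[i] = "[MASK]"
        elif r < 0.9:
            masked[i] = rng.choice(vocab)
        # else: keep the original token (but still predict it)
    return masked, targets

tokens = "the quick brown fox jumps over the lazy dog".split()
masked, targets = mask_tokens(tokens, vocab=tokens)
print(masked, targets)
```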
BERT Pretraining Task 2: two sentences
During pre-training, 50% of the time the second sentence actually follows the first (labeled 'IsNext'), and 50% of the time it is a random sentence (labeled 'NotNext' for the second case)
BERT Pretraining Task 2: Next Sentence Prediction
[Figure: input "[CLS] wake up [SEP] you are late"; a linear binary classifier on BERT's [CLS] output answers "yes" (the two sentences are consecutive)]
[CLS]: the position that outputs classification results
[SEP]: the boundary of two sentences
Approaches 1 and 2 are used at the same time.
[Figure: input "[CLS] wake up [SEP] sky is blue"; the classifier answers "No" (the two sentences are not consecutive)]
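Constructing the NSP training pairs can be sketched like this (a toy illustration; the function name and corpus are made up, and a real implementation would ensure the random sentence is not accidentally the true next one):

```python
import random

def make_nsp_examples(sentences, seed=0):
    """Build Next Sentence Prediction pairs: for each adjacent sentence
    pair, keep the true next sentence 50% of the time (IsNext) and
    substitute a random sentence from the corpus otherwise (NotNext)."""
    rng = random.Random(seed)
    examples = []
    for a, b in zip(sentences, sentences[1:]):
        if rng.random() < 0.5:
            examples.append((f"[CLS] {a} [SEP] {b}", "IsNext"))
        else:
            examples.append((f"[CLS] {a} [SEP] {rng.choice(sentences)}", "NotNext"))
    return examples

corpus = ["wake up", "you are late", "sky is blue", "it may rain"]
examples = make_nsp_examples(corpus)
for text, label in examples:
    print(label, "|", text)
```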
Fine-tuning BERT for other specific tasks
SST (Stanford Sentiment Treebank): 215k phrases with fine-grained sentiment labels in the parse trees of 11k sentences.
MNLI (multi-genre natural language inference)
QQP (Quora Question Pairs; semantic equivalence)
QNLI (question-answering natural language inference)
STS-B (textual similarity)
MRPC (Microsoft Research Paraphrase Corpus)
RTE (textual entailment)
SWAG (commonsense inference)
SST-2 (sentiment)
CoLA (linguistic acceptability)
SQuAD (question answering)
How to use BERT – Case 1
[Figure: input "[CLS] w1 w2 w3" fed to BERT; a linear classifier on the [CLS] output predicts the class]
Input: single sentence; output: class
Example: sentiment analysis (our HW), document classification
The linear classifier is trained from scratch, while BERT is fine-tuned.
How to use BERT – Case 2
[Figure: input "[CLS] w1 w2 w3"; each token's BERT output goes through its own linear classifier, giving one class per word]
Input: single sentence; output: class of each word
Example: slot filling
How to use BERT – Case 3
[Figure: input "[CLS] w1 w2 [SEP] w3 w4 w5" (Sentence 1, then Sentence 2); a linear classifier on the [CLS] output predicts the class]
Input: two sentences; output: class
Example: Natural Language Inference
Given a "premise", determine whether a "hypothesis" is true/false/unknown.
How to use BERT – Case 4
[Figure: an extraction-based QA model takes a Document and a Query and outputs two integers s and e; the answer is the document span from token s to token e (the slide's example values: s = e = 17, and s = 77, e = 79)]
How to use BERT – Case 4
[Figure: the question q1 q2 and document d1 d2 d3 are fed to BERT as "[CLS] q1 q2 [SEP] d1 d2 d3". A start vector, learned from scratch, is dot-producted with each document token's output; a softmax over the positions (figure values: 0.5, 0.3, 0.2) selects the position with the highest probability, here d2, as the start: s = 2]
The answer is "d2 d3": s = 2, e = 3.
How to use BERT – Case 4
[Figure: likewise, an end vector (also learned from scratch) is dot-producted with each document token's output; the softmax (figure values: 0.2, 0.1, 0.7) selects d3 as the end: e = 3]
The answer is "d2 d3": s = 2, e = 3.
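The start/end span extraction can be sketched in numpy. The document vectors below are random stand-ins for BERT's outputs; in practice the two extra vectors are learned during fine-tuning:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, doc_len = 8, 3

# Stand-in for BERT's outputs at the document tokens d1..d3.
doc_vectors = rng.normal(size=(doc_len, hidden))

# Two extra vectors, learned from scratch during fine-tuning.
start_vec = rng.normal(size=hidden)
end_vec = rng.normal(size=hidden)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Dot each document vector with the start/end vector; softmax over positions.
p_start = softmax(doc_vectors @ start_vec)
p_end = softmax(doc_vectors @ end_vec)
s, e = int(np.argmax(p_start)), int(np.argmax(p_end))
print(s, e)   # the predicted answer span is tokens d[s..e]
```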
Enhanced Representation through Knowledge Integration (ERNIE)
https://arxiv.org/abs/1904.09223
[Figure: BERT masks individual (Chinese) characters, whereas ERNIE masks whole words/entities, forcing the model to learn entity-level knowledge]
Source of image: https://zhuanlan.zhihu.com/p/59436589
What does BERT learn?
https://arxiv.org/abs/1905.05950
https://openreview.net/pdf?id=SJzSgnRcKX
Multilingual BERT
https://arxiv.org/abs/1904.09077
Trained on 104 languages
Task-specific training data is in English only (English sentences labeled Class 1, Class 2, Class 3, …), yet the model is tested on task-specific data in Chinese (Chinese sentences with unknown labels): zero-shot cross-lingual transfer.
GPT-2 generates human-like output
– It achieves this without ever being told that it would be evaluated on those tasks.
GPT-2 in action
[Figure: autoregressive generation of "… not injure a human being": each predicted word (injure, a, human, being) is fed back as input to predict the next one]
GPT-2 generated synthetic text
Architecture of GPT
Byte Pair Encoding (BPE)
Word embeddings are sometimes too high-level, while pure character embeddings are too low-level. For example, if we have learned
old older oldest
we might also wish the computer to infer
smart smarter smartest
At the whole-word level this inference is not so direct. The idea is therefore to break words up into pieces like er and est, and to embed frequent fragments of words.
GPT adopts this BPE scheme.
Byte Pair Encoding (BPE)
GPT uses the BPE scheme. The subwords are computed by repeatedly merging the most frequent pair of adjacent symbols:
Example (5, 2, 6, 3 are number of occurrences)
{‘l o w </w>’: 5, ‘l o w e r </w>’: 2, ‘n e w e s t </w>’: 6, ‘w i d e s t </w>’: 3 }
{‘l o w </w>’: 5, ‘l o w e r </w>’: 2, ‘n e w es t </w>’: 6, ‘w i d es t </w>’: 3 }
{‘l o w </w>’: 5, ‘l o w e r </w>’: 2, ‘n e w est </w>’: 6, ‘w i d est </w>’: 3 } (est freq. 9)
{‘lo w </w>’: 5, ‘lo w e r </w>’: 2, ‘n e w est</w>’: 6, ‘w i d est</w>’: 3 } (est</w> freq 9, then lo freq 7)
…..
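The merge loop above can be sketched directly (in the style of Sennrich et al.'s original BPE code; the plain `str.replace` is adequate for this example, though a real implementation uses a boundary-aware regex so one symbol cannot be merged across another symbol's edge):

```python
import collections

def get_pair_counts(vocab):
    """Count each adjacent symbol pair, weighted by word frequency."""
    counts = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with its concatenation."""
    old, new = " ".join(pair), "".join(pair)
    return {word.replace(old, new): freq for word, freq in vocab.items()}

vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}

merges = []
for _ in range(4):                      # four merges reach the slide's last line
    counts = get_pair_counts(vocab)
    best = max(counts, key=counts.get)  # most frequent adjacent pair
    merges.append(best)
    vocab = merge_pair(best, vocab)
    print(best, counts[best])

print(vocab)
```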
Masked Self-Attention (to compute more efficiently)
Masked Self-Attention
Masked Self-Attention Calculation
Re-use previous computation results: at any step, only the q, k, v related to the new output word need to be computed; the keys and values of earlier words are cached and re-used rather than recomputed. The additional computation per step is thus linear in the sequence length, instead of quadratic.
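This key/value caching can be sketched in numpy for a single attention head (toy sizes, random projections; multi-head attention, scaling details, and output projections are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = 8

# Projection matrices of one masked self-attention head.
Wq, Wk, Wv = (rng.normal(size=(hidden, hidden)) for _ in range(3))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Incremental decoding: for each new token, compute only its own q, k, v
# and attend over the cached keys/values of all previous tokens.
k_cache, v_cache = [], []
outputs = []
for x in rng.normal(size=(5, hidden)):        # 5 tokens arriving one at a time
    q, k, v = x @ Wq, x @ Wk, x @ Wv          # O(1) new projections per step...
    k_cache.append(k)
    v_cache.append(v)
    K, V = np.stack(k_cache), np.stack(v_cache)
    att = softmax(q @ K.T / np.sqrt(hidden))  # ...plus O(t) attention over the cache
    outputs.append(att @ V)

print(np.stack(outputs).shape)  # (5, 8)
```

Because the mask makes each position attend only to itself and earlier positions, the cached keys/values never change once computed, which is exactly why this incremental scheme is valid.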
Input Formatting for GPT
GPT-2 Application: Translation
GPT-2 Application: Summarization