BERT for Computational Social Science: Fine-Tuning and Classification
Maria Antoniak
David Mimno, Melanie Walsh
Slides Link: https://bit.ly/bert-for-css-slides
Notebook Link: https://bit.ly/bert-for-css-notebook
Our Goals Today
BERT and similar models have revolutionized natural language processing.
What is BERT?
What is BERT?
BERT stands for Bidirectional Encoder Representations from Transformers.
BERT is a “large language model” that is trained on a very large amount of text and represents words in context.
Transformers are a new class of NLP models.
Similar transformer models include GPT-2, GPT-3, RoBERTa, and DistilBERT.
Word vectors: one point per vocabulary item
BERT: one point per instance of a word in context!
(Figure: one point per instance of a word in context; different uses of “art,” “science,” “religion,” and “nature” land in different places. Example contexts:)
made rhyme an art
art thou not prone
when art was sacred
thou art to me a fly
imparting science, or celestial truth
theirs is a mixt religion
against nature can persuade
the nature of philosophie
what lessons science waits to teach
what he calls religion
Why are people excited about BERT?
Why are people wary or critical of BERT?
BERT powers Google Search and many other real-world applications, and it’s exciting to see such fast research progress!
However, precisely because of these real-world applications, there are reasons to be wary.
Keep these critiques in mind when working with these models!
On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜
Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, Shmargaret Shmitchell
How does BERT work?
Key Term Alert!
Perceptron
A basic classifier that takes an input vector and returns 0 or 1.
Layer
One step of a neural network, possibly many perceptrons in parallel.
Transformer
A class of NLP model that builds representations of words in parallel.
Attention
A way of automatically deciding which nearby words should influence a word's representation.
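To make the “Attention” definition above concrete, here is a toy NumPy sketch of scaled dot-product attention. This is a simplified stand-in, not BERT's actual multi-head implementation, and the weight matrices are random placeholders for learned parameters.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each output row is a weighted average of the rows of V; the weights
    come from how strongly each query matches each key."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # how much each word should attend to each other word
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax over nearby words
    return weights @ V  # mix the value vectors using those weights

# Toy example: 4 "words", each represented by an 8-dimensional vector.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))  # placeholders for learned weights
contextual = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
print(contextual.shape)  # (4, 8): one new, context-aware vector per word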
(Diagram: how an input moves through BERT, step by step.)
1. Our original input text: “The art of science ...”
2. Convert words to tokens: [CLS] The art of ...
3. Convert tokens to vectors (from an embedding learned during pre-training).
4. Add position vectors and segment vectors.
5. Encoder: multiply by weights (attention), add and normalize, layers of perceptrons (feed forward), add and normalize.
6. Lots more encoders (this is where our word vectors came from!)
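In HuggingFace terms, this whole stack is a tokenizer plus a model. As a rough sketch (assuming the transformers and torch packages and the bert-base-uncased checkpoint), here is how to push a sentence through the encoders and get one contextual vector per token:

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The art of science", return_tensors="pt")  # adds [CLS]/[SEP] and converts to IDs
with torch.no_grad():  # encoding only, no parameter updates
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, number_of_tokens, 768): one vector per token, in context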
(Diagram: the same pipeline as above, now with a classification head on top.)
1. Original input text → tokens → token vectors (plus position and segment vectors) → encoder (attention, add and normalize, feed forward, add and normalize) → lots more encoders (this is where our word vectors came from!).
2. Get ready for classification: convert to the number of classes (linear layer), then convert to a probability distribution (softmax).
3. Choose the highest-probability class as the output.
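A minimal sketch of that classification head in HuggingFace, assuming a two-class task and the bert-base-uncased checkpoint (both are placeholders). A freshly added head is untrained, so its predictions are meaningless until fine-tuning.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

inputs = tokenizer("The art of science", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # linear layer: one raw score per class
probs = torch.softmax(logits, dim=-1)  # softmax: turn scores into a probability distribution
prediction = probs.argmax(dim=-1).item()  # choose the highest-probability class
print(probs, prediction)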
Lifecycle of a BERT model
A BERT model consists of hundreds of millions of numbers (its parameters).
Where do they come from, and when do they change?
Stage | What happens | Example |
Pre-training | Gradually improve parameters by looping over large datasets | Wikipedia, self-published novels |
Encode new docs | Take new inputs, produce outputs, but make no change to model parameters | the poems example |
Fine-tuning | Gradually improve parameters for a new purpose by looping over smaller datasets | genre classification |
Production use | Take new inputs, produce outputs, make no change to parameters | |
Uses of BERT with and without fine-tuning
Without fine-tuning: map words and sentences to vectors with the BERT model as-is.
With fine-tuning: add task-specific heads on top of the BERT model that produce specific output patterns.
How do I actually use BERT?
🤗 HuggingFace: Python Library for Transformers
Selecting a Pre-Trained Model
There are many models available via HuggingFace, including models for languages beyond English.
It can be hard to find the best/right pre-trained model for your particular data and goal.
Browse through the collection on HuggingFace, look at examples, and check papers.
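Whichever checkpoint you pick, loading it looks the same. A sketch, where the model names are real entries on the HuggingFace Hub but only examples of what you might choose:

from transformers import AutoTokenizer, AutoModel

# Swap in whichever checkpoint fits your data and goal, for example:
#   "bert-base-uncased"            - the original English BERT
#   "distilbert-base-uncased"      - smaller and faster; good for laptops
#   "bert-base-multilingual-cased" - covers many languages beyond English
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)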
Can I work with BERT on my laptop?
Yes!*
*often with some modifications and caveats
GPUs, and Why They’re Important
Google Colab Notebooks
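A common first cell in a Colab notebook (after enabling a GPU under Runtime → Change runtime type) checks whether a GPU is available; the device_name variable here is the same one used later in the fine-tuning recipe:

import torch

# Use the GPU if one is available, otherwise fall back to the (much slower) CPU.
device_name = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device_name}")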
Previewing a few BERT & HuggingFace Basics
To use BERT, you need to format your text in a way that BERT can understand.
Each input sequence is capped at 512 tokens: shorter documents are padded up to that length, and longer documents are truncated.
BERT Special Tokens
BERT Token | Meaning |
[CLS] | The start token of every document. |
[SEP] | Placed between sentences and at the end of the document. |
[PAD] | Added to the end of the document as many times as necessary, up to 512 tokens, so that every document is the same length. |
##token | Marks a "word piece" that continues the preceding token (a word split into pieces, e.g., moor ##lands). |
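A minimal sketch of how these special tokens show up in practice, assuming the bert-base-uncased tokenizer (max_length is shortened to 16 here so the output fits on a slide; the notebook uses 512):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoding = tokenizer(
    "When first, descending from the moorlands",
    padding="max_length",  # add [PAD] up to max_length
    truncation=True,       # cut off anything beyond max_length
    max_length=16,
)
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
# ['[CLS]', 'when', 'first', ',', 'descending', 'from', 'the', 'moor', '##lands', '[SEP]', '[PAD]', ...]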
BERT Tokenization: Example
“Extempore Effusion upon the Death of James Hogg”
By William Wordsworth
When first, descending from the moorlands,
I saw the Stream of Yarrow glide
Along a bare and open valley,
The Ettrick Shepherd was my guide.
When last along its banks I wandered,
Through groves that had begun to shed
...
And Ettrick mourns with her their Poet dead.
BERT Tokenization: Example
[CLS] when first , descending from the moor ##lands , i saw the stream of ya ##rrow glide along a bare and open valley , the et ##trick shepherd was my guide . when last along its banks i wandered , through groves that had begun to shed their golden leaves upon the pathways , my steps the border - min ##strel led . the mighty min ##strel breathe ##s no longer , \' mid mo ##uld ##ering ruins low he lies ; and death upon the bra ##es of ya ##rrow , has closed the shepherd - poet \' s eyes : nor has the rolling year twice measured , from sign to sign , its ste ##df ##ast course , since every mortal power of cole ##ridge was frozen at its marvel ##lous source ; the rap ##t one , of the god ##like forehead , the heaven - eyed creature sleeps in earth : and lamb , the fr ##olic and the gentle , has vanished from his lonely hearth . like clouds that rake the mountain - summit ##s , or waves that own no curb ##ing hand , how fast has brother followed brother , from sunshine to the sun ##less land ! yet i , whose lids from infant sl ##umber were earlier raised , remain to hear a tim ##id voice , that asks in whispers , " who next will drop and disappear ? " our ha ##ught ##y life is crowned with darkness , like london with its own black wreath , on which with thee , o crab ##be ! forth - looking , i gazed from hampstead \' s bree ##zy heath . as if but yesterday departed , thou too art gone before ; but why , o \' er ripe fruit , season ##ably gathered , should frail survivors he ##ave a sigh ? mo ##urn rather for that holy spirit , sweet as the spring , as ocean deep ; for her who , er ##e her summer faded , has sunk into a breathless sleep . no more of old romantic sorrow ##s , for slaughtered youth or love - lo ##rn maid ! with sharpe ##r grief is ya ##rrow sm ##itte ##n , and et ##trick mo ##urn ##s with her their poet dead . [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
Words, positions, and word IDs
SPECIFIC INPUT
Position | 0 | 1 | 2 | 3 | 4 | 5 |
Word | [CLS] | the | art | of | science | [SEP] |
Word ID | 101 | 1996 | 2396 | 1997 | 2671 | 102 |
Vector | | | | | | |

VOCABULARY
String | Word ID |
[CLS] | 101 |
[SEP] | 102 |
the | 1996 |
of | 1997 |
art | 2396 |
science | 2671 |
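A short sketch that reproduces the tables above; the IDs are specific to the bert-base-uncased vocabulary:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoding = tokenizer("The art of science")
print(encoding["input_ids"])                                   # [101, 1996, 2396, 1997, 2671, 102]
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))  # ['[CLS]', 'the', 'art', 'of', 'science', '[SEP]']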
Pre-training vs fine-tuning
Pre-training
[Zhu et al., 2015]
Fine-tuning
This is the part where YOU use YOUR dataset!
You update BERT for your dataset and (optionally) your labeled classification task.
What happens during fine-tuning for classification?
Preparing your data for fine-tuning
The BERT Fine-Tuning “Recipe” with HuggingFace
1. Format your data in a way that BERT can understand.
from transformers import DistilBertTokenizerFast  # model_name, max_length, and train_texts are set up in the notebook
tokenizer = DistilBertTokenizerFast.from_pretrained(model_name)
train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=max_length)
2. Load a pre-trained BERT model.
from transformers import DistilBertForSequenceClassification  # id2tag and device_name are set up in the notebook
model = DistilBertForSequenceClassification.from_pretrained(model_name,
    num_labels=len(id2tag)).to(device_name)
3. Fine-tune the model on your data.
from transformers import Trainer

trainer = Trainer(
    model=model,                      # the pre-trained model loaded above
    args=training_args,               # TrainingArguments (see the hyperparameter slides)
    train_dataset=train_dataset,      # torch Dataset built from train_encodings and labels
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics   # function that returns metrics such as accuracy
)
trainer.train()
4. Evaluate the fine-tuned performance using held-out data.
trainer.evaluate()
Into the weeds...
How does BERT learn all those numbers?
Like many other machine learning models, BERT “learns” by using an algorithm called gradient descent.
Most of the training arguments we send to BERT during fine-tuning will refer to some part of gradient descent.
To understand those arguments, you need an intuition for how gradient descent works.
Gradient Descent Intuition
Gradient Descent
(Figure: a curve of cost, “how many mistakes the model makes,” plotted against the parameters, “importance of each feature for prediction.” Starting from a random point, learning steps move down the curve toward the goal: the minimum.)
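To make this concrete, here is a toy sketch of gradient descent on a one-parameter cost curve (nothing BERT-specific; the cost function is made up for illustration). The learning rate plays the same role it will play in fine-tuning:

import random

def cost(w):
    """Toy cost curve: how many 'mistakes' we make as a function of one parameter."""
    return (w - 3) ** 2

def gradient(w):
    """Slope of the cost at w (the derivative of (w - 3) ** 2)."""
    return 2 * (w - 3)

w = random.uniform(-10, 10)  # random start
learning_rate = 0.1          # size of each learning step

for step in range(50):
    w = w - learning_rate * gradient(w)  # step downhill along the slope

print(w, cost(w))  # w ends up near 3, where the cost is at its minimum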
Heads up! Expect to be confused!
Fine-tuning Hyperparameters (HuggingFace)
training_args = TrainingArguments(
num_train_epochs=3,
per_device_train_batch_size=16,
per_device_eval_batch_size=20,
learning_rate=5e-5,
warmup_steps=10,
weight_decay=0.01,
output_dir='./results',
logging_dir='./logs',
logging_steps=10,
evaluation_strategy='steps'
)
What each argument controls:
Argument | Meaning |
num_train_epochs | how many times to pass over the full training dataset |
per_device_train_batch_size | how many training examples to process at a time (per GPU/CPU) |
per_device_eval_batch_size | how many evaluation examples to process at a time (per GPU/CPU) |
learning_rate | the size of the learning steps (or how quickly to take them) |
warmup_steps | number of steps used for a linear warmup from 0 to learning_rate |
weight_decay | shrink all weights slightly at each step to prevent them from getting too big |
output_dir | where to save model and config files |
logging_dir | where to save logging files |
logging_steps | how often to show training progress metrics |
evaluation_strategy | when to run evaluation during training ('steps' = at regular step intervals) |
What happens after fine-tuning?
Essential Resources
Let’s code together!