1 of 45

BERT for Computational Social Science: Fine-Tuning and Classification

1

Maria Antoniak

David Mimno, Melanie Walsh

Slides Link: https://bit.ly/bert-for-css-slides

Notebook Link: https://bit.ly/bert-for-css-notebook

2 of 45

Our Goals Today

BERT and similar models have revolutionized natural language processing.

  • Will these tools be helpful for researchers in other fields?
  • If they are helpful, how can we support researchers in using and understanding these tools?
  • What can we learn about BERT-like models from social science research and researchers?

2

3 of 45

What is BERT?

3

4 of 45

What is BERT?

BERT stands for Bidirectional Encoder Representations from Transformers.

BERT is a “large language model”: it is trained on a very large amount of text and represents words in context.

Transformers are a new class of NLP models.

Similar transformer models include GPT-2, GPT-3, RoBERTa, and DistilBERT.

4

5 of 45

Word vectors: one point per vocabulary item

5

6 of 45

BERT: one point per instance of a word in context!

6

7 of 45

7

[Figure: each instance of a word in context gets its own point. Example contexts include “made rhyme an art”, “art thou not prone”, “when art was sacred”, “thou art to me a fly”, “imparting science, or celestial truth”, “what lessons science waits to teach”, “theirs is a mixt religion”, “what he calls religion”, “against nature can persuade”, and “the nature of philosophie”.]

8 of 45

Why are people excited about BERT?

  • One big pre-trained model that we all share, and then fine-tune on any smaller dataset.
  • Amazing performance on
    • Language Understanding
    • Sentiment Analysis
    • Natural Language Inference
    • Paraphrase Detection
    • Textual Entailment
    • … and more!

8

9 of 45

Why are people wary or critical of BERT?

BERT powers Google Search and many other real-world applications, and it’s exciting to see such fast research progress!

However, precisely because of these real-world applications, there are reasons to be wary.

    • They encode many biases that are difficult to measure or correct
    • They can be hard to interpret
    • They are very large and training them takes up a lot of resources
    • They often use poorly documented training data

Keep these critiques in mind when working with these models!

On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜

Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, Shmargaret Shmitchell

9

10 of 45

How does BERT work?

10

11 of 45

Key Term Alert!

Perceptron: A basic classifier that takes an input vector and returns 0 or 1.

Layer: One step of a neural network, possibly many perceptrons in parallel.

Transformer: A class of NLP model that builds representations of words in parallel.

Attention: A way of automatically deciding which nearby words should influence a word's representation.
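To make “perceptron” concrete, here is a minimal sketch (an illustration only, not BERT's actual code): multiply the input vector by some weights, add a bias, and return 0 or 1.

    # A minimal perceptron sketch (illustration only, not BERT's implementation).
    def perceptron(x, weights, bias):
        """Return 1 if the weighted sum of the inputs is above zero, else 0."""
        total = sum(xi * wi for xi, wi in zip(x, weights)) + bias
        return 1 if total > 0 else 0

    # Example: a tiny 3-dimensional input vector.
    print(perceptron([0.5, -1.0, 2.0], weights=[0.4, 0.3, 0.2], bias=-0.1))  # -> 1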

11

12 of 45

12

13 of 45

13

[Diagram: the BERT encoder pipeline. Our original input text ("The art of science ...") is converted to tokens ("[CLS] The art of ..."), the tokens are converted to vectors (from an embedding learned during pre-training), and position vectors and segment vectors are added. Inside the encoder, the vectors are multiplied by weights (attention), added and normalized, passed through layers of perceptrons (feed forward), and added and normalized again. Lots more encoders follow; this is where our word vectors came from!]

14 of 45

14

[Diagram: the same pipeline, now set up for classification. After the stack of encoders, a linear layer converts the output to the number of classes, a softmax converts that to a probability distribution, and the highest-probability class is chosen as the output.]

15 of 45

Lifecycle of a BERT model

A BERT model consists of roughly 400M numbers (its parameters).

Where do they come from, and when do they change?

15

Pre-training
Gradually improve parameters by looping over large datasets (e.g., Wikipedia and self-published novels).

Encode new docs
Take new inputs and produce outputs, but make no change to model parameters (the poems example above).

Fine-tuning
Gradually improve parameters for a new purpose by looping over smaller datasets (e.g., the genre classification example).

Production use
Take new inputs and produce outputs, making no change to parameters.

16 of 45

Uses of BERT with and without fine-tuning

Without fine-tuning: map words and sentences to vectors

    • Measuring word similarity (see the sketch below)

With fine-tuning: add task-specific heads that produce specific output patterns

    • Classifying texts by genre
    • Named entity recognition

16

[Diagram: a BERT model with task-specific heads attached on top.]
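For instance, a minimal sketch of the "no fine-tuning" use, mapping words to vectors and measuring similarity (this assumes the HuggingFace transformers and torch libraries; the model name and sentence are just examples, not necessarily what the notebook uses):

    import torch
    from transformers import AutoModel, AutoTokenizer

    # Load a pre-trained BERT with no fine-tuning.
    tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
    model = AutoModel.from_pretrained('bert-base-uncased')

    # Encode one sentence: BERT returns one vector per token, in context.
    inputs = tokenizer('The art of science', return_tensors='pt')
    with torch.no_grad():
        outputs = model(**inputs)
    token_vectors = outputs.last_hidden_state[0]  # shape: (number of tokens, 768)

    # Cosine similarity between the contextual vectors for "art" (position 2) and "science" (position 4).
    print(torch.cosine_similarity(token_vectors[2], token_vectors[4], dim=0).item())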

17 of 45

How do I actually use BERT?

17

18 of 45

🤗 HuggingFace: Python Library for Transformers

  • Lots of datasets and models including BERT and GPT-2
  • Lots of documentation, and a very active community—if you have a question, someone will help you (or has already asked the question in a public forum)!
  • Documentation assumes knowledge of deep learning, machine learning libraries like PyTorch, and software engineering.
  • Documentation assumes use-cases are traditional NLP “tasks” like sentiment prediction.

18

19 of 45

Selecting a Pre-Trained Model

There are many pre-trained models available via HuggingFace, including models for languages beyond English.

It can be hard to find the best/right pre-trained model for your particular data and goal.

Browse through the collection on HuggingFace, look at examples, and check papers.

19

20 of 45

Can I work with BERT on my laptop?

20

Yes!*

*often with some modifications and caveats

21 of 45

GPUs, and Why They’re Important

  • A GPU, or “graphics processing unit,” is a specialized piece of computer hardware designed for parallel processing.
  • Because GPUs allow for parallel processing, they are often used for intensive tasks like graphics rendering for gaming, video editing, and machine learning.
  • Working with the full-scale version of BERT typically requires access to a GPU.

21

22 of 45

Google Colab Notebooks

  • Google Colab notebooks are interactive documents for programming code, very similar to Jupyter notebooks but hosted on the web.

  • Colab offers free (and paid) access to GPUs!
  • To use GPUs with Colab, you need to specify .to('cuda') when you load the pre-trained BERT model.
  • CUDA is a parallel computing platform created by Nvidia.
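A minimal sketch of the usual device check (the fallback to CPU is an assumption for machines without a GPU; device_name matches the variable used in the fine-tuning recipe later):

    import torch

    # Use the GPU if Colab (or your machine) provides one; otherwise fall back to the CPU.
    device_name = 'cuda' if torch.cuda.is_available() else 'cpu'

    # Later, when loading the pre-trained model:
    # model = DistilBertForSequenceClassification.from_pretrained(model_name).to(device_name)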

22

23 of 45

Previewing a few BERT & HuggingFace Basics

23

24 of 45

To use BERT, you need to format your text in a way that BERT can understand

Each input sequence must contain a fixed number of tokens (at most 512):

  • Shorter sequences must be padded (by adding special [PAD] tokens)
  • Longer sequences must be truncated (drop later words or divide long documents)
  • Add the BERT special tokens ([CLS], [SEP], ...)
  • Divide words into word pieces
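A minimal sketch of how a HuggingFace tokenizer handles padding and truncation in one call (the model name is just an example):

    from transformers import BertTokenizerFast

    tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

    encoding = tokenizer(
        'The art of science',
        padding='max_length',  # pad shorter sequences with [PAD] tokens
        truncation=True,       # cut off longer sequences
        max_length=512,        # BERT's maximum sequence length
    )
    print(len(encoding['input_ids']))  # -> 512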

24

25 of 45

BERT Special Tokens

25

[CLS]: The start token of every document.

[SEP]: Placed between sentences (and at the end of the input).

[PAD]: Added to the end of the document as many times as necessary, up to 512 tokens, so that every document is the same length.

##token: A "word piece" that continues the previous token, i.e., a subword that does not start a word.
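A quick way to see the special tokens and word pieces is to run the tokenizer directly; a sketch (the phrase comes from the Wordsworth example on the next slide, and the model name is an assumption):

    from transformers import BertTokenizerFast

    tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

    # "moorlands" is split into a word start plus a ##-prefixed continuation piece.
    print(tokenizer.tokenize('descending from the moorlands'))
    # ['descending', 'from', 'the', 'moor', '##lands']

    # The full encoding adds [CLS] at the start and [SEP] at the end.
    ids = tokenizer('descending from the moorlands')['input_ids']
    print(tokenizer.convert_ids_to_tokens(ids))
    # ['[CLS]', 'descending', 'from', 'the', 'moor', '##lands', '[SEP]']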

26 of 45

BERT Tokenization: Example

“Extempore Effusion upon the Death of James Hogg”

By William Wordsworth

When first, descending from the moorlands,

I saw the Stream of Yarrow glide

Along a bare and open valley,

The Ettrick Shepherd was my guide.

And Ettrick mourns with her their Poet dead

When last along its banks I wandered,

Through groves that had begun to shed

26

27 of 45

BERT Tokenization: Example

[CLS] when first , descending from the moor ##lands , i saw the stream of ya ##rrow glide along a bare and open valley , the et ##trick shepherd was my guide . when last along its banks i wandered , through groves that had begun to shed their golden leaves upon the pathways , my steps the border - min ##strel led . the mighty min ##strel breathe ##s no longer , \' mid mo ##uld ##ering ruins low he lies ; and death upon the bra ##es of ya ##rrow , has closed the shepherd - poet \' s eyes : nor has the rolling year twice measured , from sign to sign , its ste ##df ##ast course , since every mortal power of cole ##ridge was frozen at its marvel ##lous source ; the rap ##t one , of the god ##like forehead , the heaven - eyed creature sleeps in earth : and lamb , the fr ##olic and the gentle , has vanished from his lonely hearth . like clouds that rake the mountain - summit ##s , or waves that own no curb ##ing hand , how fast has brother followed brother , from sunshine to the sun ##less land ! yet i , whose lids from infant sl ##umber were earlier raised , remain to hear a tim ##id voice , that asks in whispers , " who next will drop and disappear ? " our ha ##ught ##y life is crowned with darkness , like london with its own black wreath , on which with thee , o crab ##be ! forth - looking , i gazed from hampstead \' s bree ##zy heath . as if but yesterday departed , thou too art gone before ; but why , o \' er ripe fruit , season ##ably gathered , should frail survivors he ##ave a sigh ? mo ##urn rather for that holy spirit , sweet as the spring , as ocean deep ; for her who , er ##e her summer faded , has sunk into a breathless sleep . no more of old romantic sorrow ##s , for slaughtered youth or love - lo ##rn maid ! with sharpe ##r grief is ya ##rrow sm ##itte ##n , and et ##trick mo ##urn ##s with her their poet dead . [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]

27

28 of 45

Words, positions, and word IDs

28

SPECIFIC INPUT

Position:  0      1     2     3     4        5
Word:      [CLS]  the   art   of    science  [SEP]
Word ID:   101    1996  2396  1997  2671     102

VOCABULARY (each string maps to a word ID, and each word ID maps to a vector)

[CLS]: 101
[SEP]: 102
the: 1996
of: 1997
art: 2396
science: 2671
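A minimal sketch of looking these IDs up in code (the IDs shown match the table above and assume the bert-base-uncased vocabulary):

    from transformers import BertTokenizerFast

    tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

    # The specific input: [CLS] the art of science [SEP] becomes a list of vocabulary IDs.
    print(tokenizer('the art of science')['input_ids'])
    # [101, 1996, 2396, 1997, 2671, 102]

    # The vocabulary lookup also works token by token.
    print(tokenizer.convert_tokens_to_ids(['[CLS]', 'the', 'art', 'of', 'science', '[SEP]']))
    # [101, 1996, 2396, 1997, 2671, 102]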

29 of 45

Pre-training vs fine-tuning

29

30 of 45

Pre-training

  • Pre-train once, use for many different applications.
  • There are two pre-training tasks:
    • masked word prediction
    • next sentence prediction
  • Training data includes
    • BooksCorpus (800M words)
    • English Wikipedia (2,500M words)

[Zhu et al., 2015]

  • This is all done before YOU do anything!
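You can see the masked word prediction task in action with a fill-mask pipeline; a sketch (the example sentence is an illustration, not from the workshop data):

    from transformers import pipeline

    # Masked word prediction: BERT guesses the hidden word from its context.
    unmasker = pipeline('fill-mask', model='bert-base-uncased')
    for prediction in unmasker('The art of [MASK] is long.'):
        print(prediction['token_str'], round(prediction['score'], 3))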

30

31 of 45

Fine-tuning

This is the part where YOU use YOUR dataset!

You update BERT for your dataset and (optionally) your labeled classification task.

What happens during fine-tuning for classification?

  • Adds a new layer containing classification parameters.
  • All BERT parameters are updated to maximize the log probability of the labels.

31

32 of 45

Preparing your data for fine-tuning

  1. Convert labels to integers (rather than strings).
  2. Break data into training, test, evaluation sets.
  3. Use a BERT tokenizer to convert data to BERT format (truncated, padded, special tokens, word pieces, replace words with embedding IDs, etc.).
  4. Combine the labels and data into a Torch dataset object.
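A minimal sketch of these four steps, assuming a toy list of texts and string labels, scikit-learn's train_test_split, and a small torch Dataset class (all names and the tiny dataset are illustrative):

    import torch
    from sklearn.model_selection import train_test_split
    from transformers import DistilBertTokenizerFast

    texts = ['when first descending from the moorlands', 'the nature of philosophie',
             'what lessons science waits to teach', 'thou art to me a fly']
    labels = ['poetry', 'prose', 'poetry', 'poetry']

    # 1. Convert labels to integers.
    tag2id = {tag: i for i, tag in enumerate(sorted(set(labels)))}
    int_labels = [tag2id[tag] for tag in labels]

    # 2. Break data into training and test sets.
    train_texts, test_texts, train_labels, test_labels = train_test_split(
        texts, int_labels, test_size=0.25)

    # 3. Use a BERT tokenizer to convert the data to BERT format.
    tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
    train_encodings = tokenizer(train_texts, truncation=True, padding=True)

    # 4. Combine the labels and data into a torch Dataset object.
    class GenreDataset(torch.utils.data.Dataset):
        def __init__(self, encodings, labels):
            self.encodings, self.labels = encodings, labels
        def __getitem__(self, idx):
            item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
            item['labels'] = torch.tensor(self.labels[idx])
            return item
        def __len__(self):
            return len(self.labels)

    train_dataset = GenreDataset(train_encodings, train_labels)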

32

33 of 45

The BERT Fine-Tuning “Recipe” with HuggingFace

1. Format your data in a way that BERT can understand.

    tokenizer = DistilBertTokenizerFast.from_pretrained(model_name)
    train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=max_length)

2. Load a pre-trained BERT model.

    model = DistilBertForSequenceClassification.from_pretrained(model_name,
                                                                num_labels=len(id2tag)).to(device_name)

3. Fine-tune the model on your data.

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=test_dataset,
        compute_metrics=compute_metrics,
    )
    trainer.train()

4. Evaluate the fine-tuned performance using held-out data.

    trainer.evaluate()
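The recipe assumes a few names defined elsewhere in the notebook; a hedged sketch of what they might look like (the metric choices and example labels here are assumptions, not necessarily the notebook's exact code):

    from sklearn.metrics import accuracy_score, f1_score
    from transformers import (DistilBertTokenizerFast, DistilBertForSequenceClassification,
                              Trainer, TrainingArguments)

    model_name = 'distilbert-base-uncased'  # which pre-trained model to start from
    max_length = 512                        # BERT-style models accept at most 512 tokens
    device_name = 'cuda'                    # or 'cpu' if no GPU is available
    id2tag = {0: 'poetry', 1: 'prose'}      # integer label -> readable label

    def compute_metrics(eval_pred):
        """Turn the Trainer's raw predictions into accuracy and macro F1."""
        logits, labels = eval_pred
        predictions = logits.argmax(axis=-1)
        return {'accuracy': accuracy_score(labels, predictions),
                'f1': f1_score(labels, predictions, average='macro')}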

33

34 of 45

Into the weeds...

34

35 of 45

How does BERT learn all those numbers?

Like many other machine learning models, BERT “learns” by using an algorithm called gradient descent.

Most of the training arguments we send to BERT during fine-tuning will refer to some part of gradient descent.

To understand those numbers, you need to have an intuition for how gradient descent works.

35

36 of 45

Gradient Descent Intuition

  • We have a cost function (e.g. how many mistakes we make) that tells us how badly our current model works.
  • We want to find a set of parameters that minimizes the cost function.
  • We start with a random set of parameters and we update the parameters one learning step at a time in the direction that results in smaller cost.
  • We can vary the size of the steps (or the rate at which we take steps).
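A toy sketch of gradient descent on a one-parameter cost, cost(w) = (w - 3)^2, whose minimum is at w = 3 (purely illustrative; BERT does this over millions of parameters at once):

    # Toy gradient descent: minimize cost(w) = (w - 3) ** 2.
    def gradient(w):
        return 2 * (w - 3)  # derivative of the cost with respect to w

    w = 0.0                 # random-ish starting parameter
    learning_rate = 0.1     # the size of each learning step

    for step in range(50):
        w = w - learning_rate * gradient(w)  # step in the direction of smaller cost

    print(round(w, 3))  # close to the minimum at 3.0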

36

37 of 45

Gradient Descent

37

[Figure: gradient descent. The vertical axis is the cost ("how many mistakes the model makes") and the horizontal axis is the parameters ("importance of each feature for prediction"). Starting from a random set of parameters, learning steps move downhill toward the goal, the minimum of the cost.]

38 of 45

Heads up! Expect to be confused!

38

39 of 45

Fine-tuning Hyperparameters (HuggingFace)

training_args = TrainingArguments(
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=20,
    learning_rate=5e-5,
    warmup_steps=10,
    weight_decay=0.01,
    output_dir='./results',
    logging_dir='./logs',
    logging_steps=10,
    evaluation_strategy='steps'
)

39

40 of 45

Fine-tuning Hyperparameters (HuggingFace)

training_args = TrainingArguments(
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=20,
    learning_rate=5e-5,
    warmup_steps=10,
    weight_decay=0.01,
    output_dir='./results',
    logging_dir='./logs',
    logging_steps=10,
    evaluation_strategy='steps'
)

40

num_train_epochs: how many times to pass over the full dataset

per_device_train_batch_size: how many training examples to process at a time

per_device_eval_batch_size: how many evaluation examples to process at a time

41 of 45

Fine-tuning Hyperparameters (HuggingFace)

training_args = TrainingArguments(
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=20,
    learning_rate=5e-5,
    warmup_steps=10,
    weight_decay=0.01,
    output_dir='./results',
    logging_dir='./logs',
    logging_steps=10,
    evaluation_strategy='steps'
)

41

learning_rate: the size of the learning steps (or how quickly to take them)

warmup_steps: number of steps used for a linear warmup from 0 to learning_rate

weight_decay: a small penalty that shrinks the weights at each step to prevent them from getting too big

42 of 45

Fine-tuning Hyperparameters (HuggingFace)

training_args = TrainingArguments(
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=20,
    learning_rate=5e-5,
    warmup_steps=10,
    weight_decay=0.01,
    output_dir='./results',
    logging_dir='./logs',
    logging_steps=10,
    evaluation_strategy='steps'
)

42

output_dir: where to save model and config files

logging_dir: where to save logging files

logging_steps: how often to log training progress metrics

evaluation_strategy: when to run evaluation during training ('steps' runs it at regular step intervals)

43 of 45

What happens after fine-tuning?

  • We can use our fine-tuned model for classification. The model will return labels with the highest probability for each document.
  • Now we have a list of predicted labels and our original list of true labels. We can now compare these in all kinds of ways, just as we would with the output of any other classification model.
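For example, a hedged sketch of pulling the predicted labels out of the fine-tuned Trainer and comparing them with the true labels (the scikit-learn classification report is just one of many possible comparisons; trainer, test_dataset, and id2tag come from the recipe above):

    from sklearn.metrics import classification_report

    # Run the fine-tuned model on the held-out test set.
    output = trainer.predict(test_dataset)

    # The model returns a score per class; take the highest-probability class for each document.
    predicted_labels = output.predictions.argmax(axis=-1)
    true_labels = output.label_ids

    # Compare predicted and true labels as you would for any other classifier.
    print(classification_report(true_labels, predicted_labels,
                                target_names=[id2tag[i] for i in sorted(id2tag)]))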

43

44 of 45

Essential Resources

44

45 of 45

Let’s code together!

https://bit.ly/bert-for-css-notebook

45