1 of 45

BERT for Computational Social Science: Fine-Tuning and Classification

1

Maria Antoniak

David Mimno, Melanie Walsh

Slides Link: https://bit.ly/bert-for-css-slides

Notebook Link: https://bit.ly/bert-for-css-notebook

2 of 45

Our Goals Today

BERT and similar models have revolutionized natural language processing.

  • Will these tools be helpful for researchers in other fields?
  • If they are helpful, how can we support researchers in using and understanding these tools?
  • What can we learn about BERT-like models from social science research and researchers?

2

3 of 45

What is BERT?

3

4 of 45

What is BERT?

BERT stands for Bidirectional Encoder Representations from Transformers.

BERT is a “large language model”: it is trained on a very large amount of text and represents words in context.

Transformers are a new class of NLP models.

Similar transformer models include GPT-2, GPT-3, RoBERTa, and DistilBERT.

4

5 of 45

Word vectors: one point per vocabulary item

5

6 of 45

BERT: one point per instance of a word in context!

6

7 of 45

7

[Figure: each instance of a word in context gets its own point. Example contexts include “made rhyme an art”, “art thou not prone”, “when art was sacred”, “thou art to me a fly”, “imparting science, or celestial truth”, “what lessons science waits to teach”, “theirs is a mixt religion”, “what he calls religion”, “against nature can persuade”, and “the nature of philosophie”.]

8 of 45

Why are people excited about BERT?

  • One big pre-trained model that we all share, and then fine-tune on any smaller dataset.
  • Amazing performance on
    • Language Understanding
    • Sentiment Analysis
    • Natural Language Inference
    • Paraphrase Detection
    • Textual Entailment
    • … and more!

8

9 of 45

Why are people wary or critical of BERT?

BERT powers Google Search and many other real-world applications, and it’s exciting to see such fast research progress!

However, precisely because of these real-world applications, there are reasons to be wary.

    • They encode many biases that are difficult to measure or correct
    • They can be hard to interpret
    • They are very large and training them takes up a lot of resources
    • They often use poorly documented training data

Keep these critiques in mind when working with these models!

On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜

Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, Shmargaret Shmitchell

9

10 of 45

How does BERT work?

10

11 of 45

Key Term Alert!

Perceptron: A basic classifier that takes an input vector and returns 0 or 1.

Layer: One step of a neural network, possibly many perceptrons in parallel.

Transformer: A class of NLP model that builds representations of words in parallel.

Attention: A way of automatically deciding which nearby words should influence a word's representation.
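To make “perceptron” concrete, here is a minimal sketch (an illustration only, not BERT's actual code): multiply the input vector by some weights, add a bias, and return 0 or 1.

    # A minimal perceptron sketch (illustration only, not BERT's implementation).
    def perceptron(x, weights, bias):
        """Return 1 if the weighted sum of the inputs is above zero, else 0."""
        total = sum(xi * wi for xi, wi in zip(x, weights)) + bias
        return 1 if total > 0 else 0

    # Example: a tiny 3-dimensional input vector.
    print(perceptron([0.5, -1.0, 2.0], weights=[0.4, 0.3, 0.2], bias=-0.1))  # -> 1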

11

12 of 45

12

13 of 45

13

[Diagram: the BERT encoder pipeline. Our original input text ("The art of science ...") is converted to tokens ("[CLS] The art of ..."), the tokens are converted to vectors (from an embedding learned during pre-training), and position vectors and segment vectors are added. Inside the encoder, the vectors are multiplied by weights (attention), added and normalized, passed through layers of perceptrons (feed forward), and added and normalized again. Lots more encoders follow; this is where our word vectors came from!]

14 of 45

14

[Diagram: the same pipeline, now set up for classification. After the stack of encoders, a linear layer converts the output to the number of classes, a softmax converts that to a probability distribution, and the highest-probability class is chosen as the output.]

15 of 45

Lifecycle of a BERT model

A BERT model consists of roughly 400M numbers (its parameters).

Where do they come from, and when do they change?

15

Pre-training
Gradually improve parameters by looping over large datasets (e.g., Wikipedia and self-published novels).

Encode new docs
Take new inputs and produce outputs, but make no change to model parameters (the poems example above).

Fine-tuning
Gradually improve parameters for a new purpose by looping over smaller datasets (e.g., the genre classification example).

Production use
Take new inputs and produce outputs, making no change to parameters.

16 of 45

Uses of BERT with and without fine-tuning

Without fine-tuning: map words and sentences to vectors

    • Measuring word similarity (see the sketch below)

With fine-tuning: add task-specific heads that produce specific output patterns

    • Classifying texts by genre
    • Named entity recognition

16

[Diagram: a BERT model with task-specific heads attached on top.]
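For instance, a minimal sketch of the "no fine-tuning" use, mapping words to vectors and measuring similarity (this assumes the HuggingFace transformers and torch libraries; the model name and sentence are just examples, not necessarily what the notebook uses):

    import torch
    from transformers import AutoModel, AutoTokenizer

    # Load a pre-trained BERT with no fine-tuning.
    tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
    model = AutoModel.from_pretrained('bert-base-uncased')

    # Encode one sentence: BERT returns one vector per token, in context.
    inputs = tokenizer('The art of science', return_tensors='pt')
    with torch.no_grad():
        outputs = model(**inputs)
    token_vectors = outputs.last_hidden_state[0]  # shape: (number of tokens, 768)

    # Cosine similarity between the contextual vectors for "art" (position 2) and "science" (position 4).
    print(torch.cosine_similarity(token_vectors[2], token_vectors[4], dim=0).item())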

17 of 45

How do I actually use BERT?

17

18 of 45

🤗 HuggingFace: Python Library for Transformers

  • Lots of datasets and models including BERT and GPT-2
  • Lots of documentation, and a very active community—if you have a question, someone will help you (or has already asked the question in a public forum)!
  • Documentation assumes knowledge of deep learning, machine learning libraries like PyTorch, and software engineering.
  • Documentation assumes use-cases are traditional NLP “tasks” like sentiment prediction.

18

19 of 45

Selecting a Pre-Trained Model

There are many pre-trained models available via HuggingFace, including models for languages beyond English.

It can be hard to find the best/right pre-trained model for your particular data and goal.

Browse through the collection on HuggingFace, look at examples, and check papers.

19

20 of 45

Can I work with BERT on my laptop?

20

Yes!*

*often with some modifications and caveats

21 of 45

GPUs, and Why They’re Important

  • A GPU, or “graphics processing unit,” is a specialized piece of computer hardware designed for parallel processing.
  • Because GPUs allow for parallel processing, they are often used for intensive tasks like graphics rendering for gaming, video editing, and machine learning.
  • Working with the full-scale version of BERT typically requires access to a GPU.

21

22 of 45

Google Colab Notebooks

  • Google Colab notebooks are interactive documents for programming code, very similar to Jupyter notebooks but hosted on the web.

  • Colab offers free (and paid) access to GPUs!
  • To use GPUs with Colab, you need to specify .to('cuda') when you load the pre-trained BERT model.
  • CUDA is a parallel computing platform created by Nvidia.
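A minimal sketch of the usual device check (the fallback to CPU is an assumption for machines without a GPU; device_name matches the variable used in the fine-tuning recipe later):

    import torch

    # Use the GPU if Colab (or your machine) provides one; otherwise fall back to the CPU.
    device_name = 'cuda' if torch.cuda.is_available() else 'cpu'

    # Later, when loading the pre-trained model:
    # model = DistilBertForSequenceClassification.from_pretrained(model_name).to(device_name)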

22

23 of 45

Previewing a few BERT & HuggingFace Basics

23

24 of 45

To use BERT, you need to format your text in a way that BERT can understand

Each input sequence must contain a fixed number of tokens (at most 512):

  • Shorter sequences must be padded (by adding special [PAD] tokens)
  • Longer sequences must be truncated (drop later words or divide long documents)
  • Add the BERT special tokens ([CLS], [SEP], ...)
  • Divide words into word pieces
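A minimal sketch of how a HuggingFace tokenizer handles padding and truncation in one call (the model name is just an example):

    from transformers import BertTokenizerFast

    tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

    encoding = tokenizer(
        'The art of science',
        padding='max_length',  # pad shorter sequences with [PAD] tokens
        truncation=True,       # cut off longer sequences
        max_length=512,        # BERT's maximum sequence length
    )
    print(len(encoding['input_ids']))  # -> 512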

24

25 of 45

BERT Special Tokens

25

[CLS]: The start token of every document.

[SEP]: Placed between sentences (and at the end of the input).

[PAD]: Added to the end of the document as many times as necessary, up to 512 tokens, so that every document is the same length.

##token: A "word piece" that continues the previous token, i.e., a subword that does not start a word.
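A quick way to see the special tokens and word pieces is to run the tokenizer directly; a sketch (the phrase comes from the Wordsworth example on the next slide, and the model name is an assumption):

    from transformers import BertTokenizerFast

    tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

    # "moorlands" is split into a word start plus a ##-prefixed continuation piece.
    print(tokenizer.tokenize('descending from the moorlands'))
    # ['descending', 'from', 'the', 'moor', '##lands']

    # The full encoding adds [CLS] at the start and [SEP] at the end.
    ids = tokenizer('descending from the moorlands')['input_ids']
    print(tokenizer.convert_ids_to_tokens(ids))
    # ['[CLS]', 'descending', 'from', 'the', 'moor', '##lands', '[SEP]']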

26 of 45

BERT Tokenization: Example

“Extempore Effusion upon the Death of James Hogg”

By William Wordsworth

When first, descending from the moorlands,

I saw the Stream of Yarrow glide

Along a bare and open valley,

The Ettrick Shepherd was my guide.

And Ettrick mourns with her their Poet dead

When last along its banks I wandered,

Through groves that had begun to shed

26

27 of 45

BERT Tokenization: Example

[CLS] when first , descending from the moor ##lands , i saw the stream of ya ##rrow glide along a bare and open valley , the et ##trick shepherd was my guide . when last along its banks i wandered , through groves that had begun to shed their golden leaves upon the pathways , my steps the border - min ##strel led . the mighty min ##strel breathe ##s no longer , \' mid mo ##uld ##ering ruins low he lies ; and death upon the bra ##es of ya ##rrow , has closed the shepherd - poet \' s eyes : nor has the rolling year twice measured , from sign to sign , its ste ##df ##ast course , since every mortal power of cole ##ridge was frozen at its marvel ##lous source ; the rap ##t one , of the god ##like forehead , the heaven - eyed creature sleeps in earth : and lamb , the fr ##olic and the gentle , has vanished from his lonely hearth . like clouds that rake the mountain - summit ##s , or waves that own no curb ##ing hand , how fast has brother followed brother , from sunshine to the sun ##less land ! yet i , whose lids from infant sl ##umber were earlier raised , remain to hear a tim ##id voice , that asks in whispers , " who next will drop and disappear ? " our ha ##ught ##y life is crowned with darkness , like london with its own black wreath , on which with thee , o crab ##be ! forth - looking , i gazed from hampstead \' s bree ##zy heath . as if but yesterday departed , thou too art gone before ; but why , o \' er ripe fruit , season ##ably gathered , should frail survivors he ##ave a sigh ? mo ##urn rather for that holy spirit , sweet as the spring , as ocean deep ; for her who , er ##e her summer faded , has sunk into a breathless sleep . no more of old romantic sorrow ##s , for slaughtered youth or love - lo ##rn maid ! with sharpe ##r grief is ya ##rrow sm ##itte ##n , and et ##trick mo ##urn ##s with her their poet dead . [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]

27

28 of 45

Words, positions, and word IDs

28

SPECIFIC INPUT

Position:  0      1     2     3     4        5
Word:      [CLS]  the   art   of    science  [SEP]
Word ID:   101    1996  2396  1997  2671     102

VOCABULARY (each string maps to a word ID, and each word ID maps to a vector)

[CLS]: 101
[SEP]: 102
the: 1996
of: 1997
art: 2396
science: 2671
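A minimal sketch of looking these IDs up in code (the IDs shown match the table above and assume the bert-base-uncased vocabulary):

    from transformers import BertTokenizerFast

    tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

    # The specific input: [CLS] the art of science [SEP] becomes a list of vocabulary IDs.
    print(tokenizer('the art of science')['input_ids'])
    # [101, 1996, 2396, 1997, 2671, 102]

    # The vocabulary lookup also works token by token.
    print(tokenizer.convert_tokens_to_ids(['[CLS]', 'the', 'art', 'of', 'science', '[SEP]']))
    # [101, 1996, 2396, 1997, 2671, 102]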

29 of 45

Pre-training vs fine-tuning

29

30 of 45

Pre-training

  • Pre-train once, use for many different applications.
  • There are two pre-training tasks:
    • masked word prediction
    • next sentence prediction
  • Training data includes
    • BooksCorpus (800M words)
    • English Wikipedia (2,500M words)

[Zhu et al., 2015]

  • This is all done before YOU do anything!
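You can see the masked word prediction task in action with a fill-mask pipeline; a sketch (the example sentence is an illustration, not from the workshop data):

    from transformers import pipeline

    # Masked word prediction: BERT guesses the hidden word from its context.
    unmasker = pipeline('fill-mask', model='bert-base-uncased')
    for prediction in unmasker('The art of [MASK] is long.'):
        print(prediction['token_str'], round(prediction['score'], 3))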

30

31 of 45

Fine-tuning

This is the part where YOU use YOUR dataset!

You update BERT for your dataset and (optionally) your labeled classification task.

What happens during fine-tuning for classification?

  • Adds a new layer containing classification parameters.
  • All BERT parameters are updated to maximize the log probability of the labels.

31

32 of 45

Preparing your data for fine-tuning

  1. Convert labels to integers (rather than strings).
  2. Break data into training, test, evaluation sets.
  3. Use a BERT tokenizer to convert data to BERT format (truncated, padded, special tokens, word pieces, replace words with embedding IDs, etc.).
  4. Combine the labels and data into a Torch dataset object.
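A minimal sketch of these four steps, assuming a toy list of texts and string labels, scikit-learn's train_test_split, and a small torch Dataset class (all names and the tiny dataset are illustrative):

    import torch
    from sklearn.model_selection import train_test_split
    from transformers import DistilBertTokenizerFast

    texts = ['when first descending from the moorlands', 'the nature of philosophie',
             'what lessons science waits to teach', 'thou art to me a fly']
    labels = ['poetry', 'prose', 'poetry', 'poetry']

    # 1. Convert labels to integers.
    tag2id = {tag: i for i, tag in enumerate(sorted(set(labels)))}
    int_labels = [tag2id[tag] for tag in labels]

    # 2. Break data into training and test sets.
    train_texts, test_texts, train_labels, test_labels = train_test_split(
        texts, int_labels, test_size=0.25)

    # 3. Use a BERT tokenizer to convert the data to BERT format.
    tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
    train_encodings = tokenizer(train_texts, truncation=True, padding=True)

    # 4. Combine the labels and data into a torch Dataset object.
    class GenreDataset(torch.utils.data.Dataset):
        def __init__(self, encodings, labels):
            self.encodings, self.labels = encodings, labels
        def __getitem__(self, idx):
            item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
            item['labels'] = torch.tensor(self.labels[idx])
            return item
        def __len__(self):
            return len(self.labels)

    train_dataset = GenreDataset(train_encodings, train_labels)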

32

33 of 45

The BERT Fine-Tuning “Recipe” with HuggingFace

1. Format your data in a way that BERT can understand.

    tokenizer = DistilBertTokenizerFast.from_pretrained(model_name)
    train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=max_length)

2. Load a pre-trained BERT model.

    model = DistilBertForSequenceClassification.from_pretrained(model_name,
                                                                num_labels=len(id2tag)).to(device_name)

3. Fine-tune the model on your data.

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=test_dataset,
        compute_metrics=compute_metrics,
    )
    trainer.train()

4. Evaluate the fine-tuned performance using held-out data.

    trainer.evaluate()
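The recipe assumes a few names defined elsewhere in the notebook; a hedged sketch of what they might look like (the metric choices and example labels here are assumptions, not necessarily the notebook's exact code):

    from sklearn.metrics import accuracy_score, f1_score
    from transformers import (DistilBertTokenizerFast, DistilBertForSequenceClassification,
                              Trainer, TrainingArguments)

    model_name = 'distilbert-base-uncased'  # which pre-trained model to start from
    max_length = 512                        # BERT-style models accept at most 512 tokens
    device_name = 'cuda'                    # or 'cpu' if no GPU is available
    id2tag = {0: 'poetry', 1: 'prose'}      # integer label -> readable label

    def compute_metrics(eval_pred):
        """Turn the Trainer's raw predictions into accuracy and macro F1."""
        logits, labels = eval_pred
        predictions = logits.argmax(axis=-1)
        return {'accuracy': accuracy_score(labels, predictions),
                'f1': f1_score(labels, predictions, average='macro')}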

33

34 of 45

Into the weeds...

34

35 of 45

How does BERT learn all those numbers?

Like many other machine learning models, BERT “learns” by using an algorithm called gradient descent.

Most of the training arguments we send to BERT during fine-tuning will refer to some part of gradient descent.

To understand those numbers, you need to have an intuition for how gradient descent works.

35

36 of 45

Gradient Descent Intuition

  • We have a cost function (e.g. how many mistakes we make) that tells us how badly our current model works.
  • We want to find a set of parameters that minimizes the cost function.
  • We start with a random set of parameters and we update the parameters one learning step at a time in the direction that results in smaller cost.
  • We can vary the size of the steps (or the rate at which we take steps).
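A toy sketch of gradient descent on a one-parameter cost, cost(w) = (w - 3)^2, whose minimum is at w = 3 (purely illustrative; BERT does this over millions of parameters at once):

    # Toy gradient descent: minimize cost(w) = (w - 3) ** 2.
    def gradient(w):
        return 2 * (w - 3)  # derivative of the cost with respect to w

    w = 0.0                 # random-ish starting parameter
    learning_rate = 0.1     # the size of each learning step

    for step in range(50):
        w = w - learning_rate * gradient(w)  # step in the direction of smaller cost

    print(round(w, 3))  # close to the minimum at 3.0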

36

37 of 45

Gradient Descent

37

[Figure: gradient descent. The vertical axis is the cost ("how many mistakes the model makes") and the horizontal axis is the parameters ("importance of each feature for prediction"). Starting from a random set of parameters, learning steps move downhill toward the goal, the minimum of the cost.]

38 of 45

Heads up! Expect to be confused!

38

39 of 45

Fine-tuning Hyperparameters (HuggingFace)

training_args = TrainingArguments(
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=20,
    learning_rate=5e-5,
    warmup_steps=10,
    weight_decay=0.01,
    output_dir='./results',
    logging_dir='./logs',
    logging_steps=10,
    evaluation_strategy='steps'
)

39

40 of 45

Fine-tuning Hyperparameters (HuggingFace)

training_args = TrainingArguments(
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=20,
    learning_rate=5e-5,
    warmup_steps=10,
    weight_decay=0.01,
    output_dir='./results',
    logging_dir='./logs',
    logging_steps=10,
    evaluation_strategy='steps'
)

40

num_train_epochs: how many times to pass over the full dataset

per_device_train_batch_size: how many training examples to process at a time

per_device_eval_batch_size: how many evaluation examples to process at a time

41 of 45

Fine-tuning Hyperparameters (HuggingFace)

training_args = TrainingArguments(
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=20,
    learning_rate=5e-5,
    warmup_steps=10,
    weight_decay=0.01,
    output_dir='./results',
    logging_dir='./logs',
    logging_steps=10,
    evaluation_strategy='steps'
)

41

learning_rate: the size of the learning steps (or how quickly to take them)

warmup_steps: number of steps used for a linear warmup from 0 to learning_rate

weight_decay: a small penalty that shrinks the weights at each step to prevent them from getting too big

42 of 45

Fine-tuning Hyperparameters (HuggingFace)

training_args = TrainingArguments(
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=20,
    learning_rate=5e-5,
    warmup_steps=10,
    weight_decay=0.01,
    output_dir='./results',
    logging_dir='./logs',
    logging_steps=10,
    evaluation_strategy='steps'
)

42

output_dir: where to save model and config files

logging_dir: where to save logging files

logging_steps: how often to log training progress metrics

evaluation_strategy: when to run evaluation during training ('steps' runs it at regular step intervals)

43 of 45

What happens after fine-tuning?

  • We can use our fine-tuned model for classification. The model will return labels with the highest probability for each document.
  • Now we have a list of predicted labels and our original list of true labels. We can now compare these in all kinds of ways, just as we would with the output of any other classification model.
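For example, a hedged sketch of pulling the predicted labels out of the fine-tuned Trainer and comparing them with the true labels (the scikit-learn classification report is just one of many possible comparisons; trainer, test_dataset, and id2tag come from the recipe above):

    from sklearn.metrics import classification_report

    # Run the fine-tuned model on the held-out test set.
    output = trainer.predict(test_dataset)

    # The model returns a score per class; take the highest-probability class for each document.
    predicted_labels = output.predictions.argmax(axis=-1)
    true_labels = output.label_ids

    # Compare predicted and true labels as you would for any other classifier.
    print(classification_report(true_labels, predicted_labels,
                                target_names=[id2tag[i] for i in sorted(id2tag)]))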

43

44 of 45

Essential Resources

44

45 of 45

Let’s code together!

https://bit.ly/bert-for-css-notebook

45