1 of 68

Analysis of Language Models, aka BERTology

Human Language Technologies

Dipartimento di Informatica

Giuseppe Attardi

Università di Pisa

Slides from John Hewitt: https://nlp.stanford.edu/~johnhew/structural-probe.html

2 of 68

Questions about Language Models

What can’t be learned via language model pretraining?

What will replace the Transformer?

What can be learned via language model pretraining?

What does deep learning try to do?

What do neural models tell us about language?

How are our models affecting people, and transferring power?

3 of 68

Before self-supervised learning

  • The way to approach NLP was through understanding the human language system and trying to imitate it

  • Example: Parsing
    • I want my sentiment analysis system to classify this movie review correctly
      • “My uncultured roommate hated this movie, but I absolutely loved it”
    • How would we do this?
    • We might have some semantic representation of some key words like “hate” and “uncultured”, but how does everything relate?

Slide by Isabel Papadimitriou

4 of 68

How do humans structure this string of words?

  • Many linguists might tell you something like this:

Slide by Isabel Papadimitriou

5 of 68

Linguistic theory helped NLP reverse-engineer language

[Figure: classical NLP pipelines, mapping input through syntax, semantics, and discourse.]

Slide by Isabel Papadimitriou

6 of 68

Now, language models just seem to catch on to a lot of these things!

Slide by Isabel Papadimitriou

7 of 68

Linguistic Structure in NLP

  • Linguistic structure in humans
    • There is a system for producing language that can be described by discrete rules
  • Do NLP systems work like that?
  • They definitely used to!
  • Now, NLP works better than it ever has before – and we’re not constraining our systems to know any syntax
  • What about structure in modern language models?

Slide by Isabel Papadimitriou

8 of 68

What linguistic knowledge is present in LMs?

  • POS tagging through word embedding clusters
    • Nearest vectors to the Italian articulated preposition “del” (see the sketch below):

Il, nel, nella, al, dal, col, dell’

  • NER via Masked Language Model
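The embedding-cluster observation in the first bullet can be sketched as a simple nearest-neighbour lookup. This is a minimal sketch: the vectors below are random stand-ins so the snippet runs on its own; in practice you would load pretrained Italian word embeddings (e.g. word2vec or fastText), for which the neighbours of “del” come out as other articles and articulated prepositions, as listed above.

```python
import numpy as np

# Sketch: nearest neighbours of "del" in an embedding space.
# The vectors here are random placeholders for runnability; with real
# Italian embeddings the neighbours are il, nel, nella, al, dal, col, dell'.
rng = np.random.default_rng(0)
vocab = ["del", "il", "nel", "nella", "al", "dal", "col", "dell'", "pizza", "mangiare"]
emb = {w: rng.standard_normal(100) for w in vocab}

def nearest(word, k=5):
    v = emb[word]
    sims = {w: float(np.dot(v, u) / (np.linalg.norm(v) * np.linalg.norm(u)))
            for w, u in emb.items() if w != word}
    return sorted(sims, key=sims.get, reverse=True)[:k]

print(nearest("del"))
```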

9 of 68

Unsupervised NER

Using BERT Masked Language Model

[Rajasekharan 2020]

Example: “Nonenbury is a ___”. The MLM’s predictions for the masked position give a context-sensitive signature of the term, which the method combines with a context-independent signature of the term to assign an entity type.
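A hedged sketch of the context-sensitive signal using the Hugging Face fill-mask pipeline; this covers only the first ingredient of Rajasekharan’s method (the context-independent signature and the mapping from predicted words to entity types are omitted).

```python
from transformers import pipeline

# The MLM's predictions for the masked position act as a context-sensitive
# signature of the unknown term "Nonenbury": predictions like "village" or
# "town" suggest a LOCATION-type entity.
unmasker = pipeline("fill-mask", model="bert-base-cased")
for pred in unmasker("Nonenbury is a [MASK]."):
    print(f"{pred['token_str']:>12}  {pred['score']:.3f}")
```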

10 of 68

Entity Distribution

11 of 68

LM Effectiveness

LMs exhibit surprising abilities in several language tasks

But do they really understand language?

Consider the natural language inference task, as encoded in the Multi-NLI dataset.

Premise: “He turned and saw Jon sleeping in his half-tent”

Hypothesis: “He saw Jon was asleep”

Labels: Entailment / Neutral / Contradiction

Model A reaches 95% accuracy on the test set. Is it likely to get the right answer on this pair?

12 of 68

Checking LM understanding abilities

What if our model is using simple heuristics to get good accuracy?

A diagnostic test set is carefully constructed to test for a specific skill or capacity of your neural model.

For example, HANS (Heuristic Analysis for NLI Systems) tests syntactic heuristics in NLI.

13 of 68

HANS model analysis in Natural Language Inference

McCoy et al., 2019 took 4 strong MNLI models, with the following accuracies on the original (in-domain) test set.

Evaluating on HANS, where the syntactic heuristics work, accuracy is high!

But where the syntactic heuristics fail, accuracy is very, very low.

14 of 68

Language models as linguistic test subjects

  • How do we understand language behavior in humans?
  • One method: minimal pairs. What sounds “okay” to a speaker, but doesn’t with a small change?

The chef who made the pizzas is here. 🡨 “Acceptable”

The chef who made the pizzas are here 🡨 “Unacceptable”

Idea: verbs agree in number with their subjects

15 of 68

Testing Language Models’ Linguistic Knowledge

  • What’s the language model analogue of acceptability?

The chef who made the pizzas is here. 🡨 “Acceptable”

The chef who made the pizzas are here 🡨 “Unacceptable”

  • Assign higher probability to the acceptable sentence in the minimal pair (see the code sketch after this list)

P(The chef who made the pizzas is here.) > P(The chef who made the pizzas are here)

  • Just like in HANS, we can develop a test set with carefully chosen properties.
    • Specifically: can language models handle “attractors” in subject-verb agreement?
    • 0 Attractors: The chef is here.
    • 1 Attractor: The chef who made the pizzas is here.
    • 2 Attractors: The chef who made the pizzas and prepped the ingredients is here.
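A minimal sketch of the probability comparison, using GPT-2 as a stand-in autoregressive LM (the experiments on the following slides use LSTM LMs and BERT; any left-to-right LM can be scored the same way):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def sentence_logprob(sentence: str) -> float:
    ids = tok(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)          # loss = mean NLL per predicted token
    return -out.loss.item() * (ids.size(1) - 1)

good = "The chef who made the pizzas is here."
bad = "The chef who made the pizzas are here."
# The LM "accepts" the grammatical sentence if it assigns it higher probability.
print(sentence_logprob(good) > sentence_logprob(bad))
```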

16 of 68

Testing Language Models’ Linguistic Knowledge

Sample test examples for subject-verb agreement with attractors that a model got wrong

The ship that the player drives has a very high speed.

The ship that the player drives have a very high speed.

The lead is also rather long; 5 paragraphs is pretty lengthy …

The lead is also rather long; 5 paragraphs are pretty lengthy …

17 of 68

Testing Language Models’ Linguistic Knowledge

  • Kuncoro et al., 2018 train an LSTM language model on a small set of Wikipedia text.
  • They evaluate it only on sentences with specific numbers of agreement attractors.
  • Numbers in this table: error rates at predicting the correct number for the verb

Zero attractors: Easy

4 attractors: harder, but models still do pretty well!

The larger LSTMs learn subject-verb agreement better!

18 of 68

Linguistic Abilities

Current LLMs do exhibit formal language skills (such as lexical and grammatical knowledge), but they largely lack functional language skills (using language to reason and act in real-world situations).

K. Mahowald, et al. (2023) Dissociating language and thought in large language models: a cognitive perspective. https://arxiv.org/abs/2301.06627

19 of 68

Assessing Language Models’ Syntactic Abilities

Y. Goldberg. 2019. Assessing BERT’s Syntactic Abilities.

the game that the guard hates is bad .

the game that the guard hates are bad .

Feed into BERT:

[CLS] the game that the guard hates [MASK] bad .

and compare the scores predicted for is and are.
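A minimal sketch of this comparison with the Hugging Face transformers library; bert-base-uncased is used here as an example checkpoint (Goldberg’s experiments also cover BERT Large):

```python
import torch
from transformers import BertForMaskedLM, BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased").eval()

sent = "the game that the guard hates [MASK] bad ."
enc = tok(sent, return_tensors="pt")                      # adds [CLS]/[SEP] automatically
mask_pos = (enc.input_ids[0] == tok.mask_token_id).nonzero().item()
with torch.no_grad():
    logits = model(**enc).logits[0, mask_pos]

# The stimulus counts as "correct" if the singular verb gets the higher score.
for verb in ("is", "are"):
    print(verb, logits[tok.convert_tokens_to_ids(verb)].item())
```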

Attractors   BERT Base   BERT Large   # sents
1            0.97        0.97         24031
2            0.97        0.97         4414
3            0.96        0.96         946
4            0.97        0.96         254

20 of 68

Assessing Language Models’ Syntactic Abilities

 

Condition                               BERT Base   BERT Large   LSTM (M&L)   Humans (M&L)   # Pairs (# M&L Pairs)
SUBJECT-VERB AGREEMENT:
Simple                                  1.00        1.00         0.94         0.96           120 (140)
In a sentential complement              0.83        0.86         0.99         0.93           1440 (1680)
Short VP coordination                   0.89        0.86         0.90         0.82           720 (840)
Long VP coordination                    0.98        0.97         0.61         0.82           400 (400)
Across a prepositional phrase           0.85        0.85         0.57         0.85           19440 (22400)
Across a subject relative clause        0.84        0.85         0.56         0.88           9600 (11200)
Across an object relative clause        0.89        0.85         0.50         0.85           19680 (22400)
Across an object relative (no that)     0.86        0.81         0.52         0.82           19680 (22400)
In an object relative clause            0.95        0.99         0.84         0.78           15960 (22400)
In an object relative (no that)         0.79        0.82         0.71         0.79           15960 (22400)
REFLEXIVE ANAPHORA:
Simple                                  0.94        0.92         0.83         0.96           280 (280)
In a sentential complement              0.89        0.86         0.86         0.91           3360 (3360)
Across a relative clause                0.80        0.76         0.55         0.87           22400 (22400)

Marvin and Linzen (2018)

21 of 68

Input influence: does the model really use long-distance context?

  • We motivated LSTM language models through their theoretical ability to use long-distance context to make predictions. But how long really is the long short-term memory?
  • Khandelwal et al., 2018’s idea: shuffle or remove all contexts farther than 𝑘 words away, for various 𝑘, and see at which 𝑘 the model’s predictions start to get worse!
  • Loss is averaged across many examples.

Finding: history farther than 50 words away is treated as a bag of words (shuffling those distant words barely changes the loss).
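A minimal sketch of the ablation idea, using GPT-2 as a stand-in for the LSTM LM studied in the paper: score the same final token with the full history and with only the last k tokens of history, then compare the losses (the paper averages this over a large corpus and many values of k).

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def nll_of_last_token(ids: torch.Tensor) -> float:
    """Negative log-likelihood of the final token given the preceding context."""
    with torch.no_grad():
        logits = model(ids).logits
    logp = torch.log_softmax(logits[0, -2], dim=-1)   # distribution predicting the last token
    return -logp[ids[0, -1]].item()

text = ("The chef went to the store because the restaurant was out of food. "
        "After a long wait in line and a chat with the owner, the chef finally "
        "bought everything needed and drove back to the restaurant.")
ids = tok(text, return_tensors="pt").input_ids

k = 20
print("full context  :", nll_of_last_token(ids))
print(f"last {k} only :", nll_of_last_token(ids[:, -(k + 1):]))   # keep only k context tokens
```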

22 of 68

Prediction explanations: what in the input led to this output?

  • For a single example, what parts of the input led to the observed prediction?
  • Saliency maps: a score for each input word indicating its importance to the model’s prediction
  • In the above example, BERT is analyzed, and interpretable words seem to contribute to the model’s predictions (right).

23 of 68

Prediction explanations: simple saliency maps

  • How do we make a saliency map? Many ways to encode the intuition of “importance”
  • Simple gradient method:

For words x₁, …, xₙ and the model’s score s_c(x₁, …, xₙ) for a given class (output label) c, take the norm of the gradient of the score with respect to each word:

salience(xᵢ) = ‖∇_{xᵢ} s_c(x₁, …, xₙ)‖

Idea: high gradient norm means changing that word (locally) would affect the score a lot

[Figure: loss plotted over the word (embedding) space; steep regions correspond to high saliency, flat regions to low saliency.]
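A minimal sketch of the simple gradient method described above, assuming an off-the-shelf sentiment classifier checkpoint (distilbert-base-uncased-finetuned-sst-2-english is used purely as an example model):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "distilbert-base-uncased-finetuned-sst-2-english"   # example checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name).eval()

text = "My uncultured roommate hated this movie, but I absolutely loved it"
enc = tok(text, return_tensors="pt")

# Run the model on the input embeddings so we can take gradients w.r.t. them.
emb = model.get_input_embeddings()(enc.input_ids).detach().requires_grad_(True)
logits = model(inputs_embeds=emb, attention_mask=enc.attention_mask).logits
score = logits[0, logits.argmax(-1).item()]                # score of the predicted class
score.backward()

# Saliency of each (sub)word = norm of the gradient of the score w.r.t. its embedding.
saliency = emb.grad.norm(dim=-1)[0]
for token, s in zip(tok.convert_ids_to_tokens(enc.input_ids[0]), saliency.tolist()):
    print(f"{token:>12}  {s:.3f}")
```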

24 of 68

Prediction explanations: simple saliency maps

  • How do we make a saliency map? Many ways to encode the intuition of “importance”
  • Simple gradient method:

For words x₁, …, xₙ and the model’s score s_c(x₁, …, xₙ) for a given class (output label) c, take the norm of the gradient of the score with respect to each word:

salience(xᵢ) = ‖∇_{xᵢ} s_c(x₁, …, xₙ)‖

Not a perfect method for saliency; many more methods have been proposed. One issue: linear approximation may not hold well!

[Figure: a point can have low saliency according to the local gradient, but moving a little farther in word space makes the loss skyrocket.]

25 of 68

BlenderBot

Who is the Italian prime minister?

[BlenderBot]: It is Mussolini. He was the leader of the fascist party during WWII.

Is he still alive?

[BlenderBot]: Yes, he is still alive. He has been in politics since 1989. He also has a son.

Who is his son?

[BlenderBot]: I do not know his name, but he was born in Italy and moved to the United States when he was very young.

Who is Enrico Fermi?

[BlenderBot]: I am not sure who that is, but I do know that Ferrari is an automobile manufacturer.

Who was Alan Turing?

[BlenderBot]: He was an American computer scientist who invented the Turing Machine. He was a genius.

What did he do during WWII?

[BlenderBot]: He worked on the Manhattan Project, which was a project to develop nuclear weapons.

What is the population of Michigan?

[BlenderBot]: I do not know the exact number, but it is one of the most populous states in the US.

https://ai.facebook.com/blog/blender-bot-2-an-open-source-chatbot-that-builds-long-term-memory-and-searches-the-internet/

26 of 68

What is the population of California?

[BlenderBot]: The state of California has a population of over 8,537,673 people.

(Well: it is 39.5 million)

27 of 68

Explanation by input reduction

What is the smallest part of the input I could keep and still get the same answer? An example from SQuAD:

Passage: In 1899, John Jacob Astor IV invested $100,000 for Tesla to further develop and produce a new lighting system. Instead, Tesla used the money to fund his Colorado Springs experiments.

Original Question: What did Tesla spend Astor’s money on?

Reduced Question: did

In this example, the model gave the same answer, with confidence 0.78 for the original question and confidence 0.91 for the reduced question!

28 of 68

A method for explanation by input reduction

Idea: run an input saliency method, then iteratively remove the least important words (beam search over which k words to remove is an important addition). A code sketch follows below.

Passage: The Panthers used the San Jose State practice facility and stayed at the San Jose Marriott. The Broncos practiced at Stanford University and stayed at the Santa Clara Marriott.

Original Question: Where did the Broncos practice for the Super Bowl?

Steps of input reduction:

Where did the practice for the Super Bowl?
Where did practice for the Super Bowl?
Where did practice the Super Bowl?
Where did practice the Super?
Where did practice Super?
did practice Super?

Only at the last step did the model stop being confident in the answer.
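A hedged sketch of the reduction loop on a sentence classifier. For simplicity it measures a word’s importance by the drop in the class score when the word is removed (leave-one-out), rather than the gradient saliency and beam search used in the original work, and the SST-2 checkpoint is just an example model.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "distilbert-base-uncased-finetuned-sst-2-english"   # example classifier
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name).eval()

def logits_for(words):
    enc = tok(" ".join(words), return_tensors="pt")
    with torch.no_grad():
        return model(**enc).logits[0]

def input_reduction(sentence: str):
    words = sentence.split()
    label = logits_for(words).argmax().item()               # prediction to preserve
    while len(words) > 1:
        base = logits_for(words)[label].item()
        # importance of word i = drop in the class score when word i is removed
        scores = [base - logits_for(words[:i] + words[i + 1:])[label].item()
                  for i in range(len(words))]
        i = min(range(len(words)), key=scores.__getitem__)  # least important word
        reduced = words[:i] + words[i + 1:]
        if logits_for(reduced).argmax().item() != label:
            break                                           # stop: prediction would change
        words = reduced
    return words

print(input_reduction("My uncultured roommate hated this movie, but I absolutely loved it"))
```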

29 of 68

Analyzing models by breaking them

Idea: Can we break models by making seemingly innocuous changes to the input?

Passage: Peyton Manning became the first quarterback ever to lead two different teams to multiple Super Bowls. He is also the oldest quarterback ever to play in a Super Bowl at age 39. The past record was held by John Elway, who led the Broncos to victory in Super Bowl XXXIII at age 38…

Question: What was the name of the quarterback who was 38 in Super Bowl XXXIII?

The prediction looks good!

30 of 68

Analyzing models by breaking them

Idea: Can we break models by making seemingly innocuous changes to the input?

Passage: Peyton Manning became the first quarterback ever to lead two different teams to multiple Super Bowls. He is also the oldest quarterback ever to play in a Super Bowl at age 39. The past record was held by John Elway, who led the Broncos to victory in Super Bowl XXXIII at age 38… Quarterback Jeff Dean had jersey number 37 in Champ Bowl XXXIV.

Question: What was the name of the quarterback who was 38 in Super Bowl XXXIII?

The appended sentence does not change the correct answer, but the model’s prediction changed!

So it seems the model wasn’t performing question answering as we’d like?

31 of 68

Analyzing models by breaking them

Idea: Can we break models by making seemingly innocuous changes to the input?

This model’s predictions look good!

This typo is annoying, but a reasonable human might ignore it.

Changing “what” to “what’s” should never change the answer!

32 of 68

Analysis of “interpretable” architecture components

Idea: Some modeling components lend themselves to inspection.

For example, can we try to characterize each attention head of BERT?

Attention head 1 of layer 1.

This head performs this kind of behavior on most sentences.

[Why is “interpretable” in quotes? It’s hard to tell exactly how/whether the model is performing an interpretable function, especially deep in the network.]

33 of 68

Analysis of “interpretable” architecture components

Idea: Some modeling components lend themselves to inspection.

Some attention heads seem to perform simple operations.

34 of 68

Analysis of “interpretable” architecture components

Idea: Some modeling components lend themselves to inspection.

Some heads are correlated with linguistic properties!

Approximate interpretation + quantitative analysis

Model behavior

35 of 68

Analysis of “interpretable” architecture components

Idea: Some modeling components lend themselves to inspection.

We saw coreference before; one head often matches coreferent mentions!

Approximate interpretation + quantitative analysis

Model behavior

36 of 68

Probes

37 of 68

Probing: supervised analysis of neural networks

Question: What do their representations encode about language?

Premise: Pretrained Transformers provide surprisingly good general-purpose language representations.

38 of 68

Probing: supervised analysis of neural networks

Question:

What do pretrained representations encode about linguistic properties for which we have annotated data?

39 of 68

Probing: supervised analysis of neural networks

 

40 of 68

Probing: supervised analysis of neural networks

BERT (and other pretrained LMs) make some linguistic properties predictable to very high accuracy with a simple linear probe: syntactic roles, part-of-speech, named entity recognition.
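A minimal sketch of a linear probe: freeze BERT, extract one hidden layer’s word vectors, and fit a simple linear classifier on an annotated property. The tiny hand-labeled POS toy set below is an assumption made only to keep the snippet self-contained; real probing studies train and evaluate on full annotated treebanks.

```python
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True).eval()

# Toy annotated data standing in for a real treebank.
sents = [("the chef cooks the pizza".split(), ["DET", "NOUN", "VERB", "DET", "NOUN"]),
         ("a dog chased a cat".split(),       ["DET", "NOUN", "VERB", "DET", "NOUN"])]

X, y = [], []
for words, tags in sents:
    enc = tok(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        layer = bert(**enc).hidden_states[8][0]            # one middle layer, drop batch dim
    word_ids = enc.word_ids()                              # token -> word alignment
    for i, tag in enumerate(tags):
        first_piece = word_ids.index(i)                    # first subword piece of word i
        X.append(layer[first_piece].numpy())
        y.append(tag)

probe = LogisticRegression(max_iter=1000).fit(X, y)
print(probe.score(X, y))                                   # (training) accuracy of the probe
```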

41 of 68

Layerwise trends of probing accuracy

  • Across a wide range of linguistic properties, the middle layers of BERT yield the best probing accuracies.

[Figure: probing accuracy by layer, with the input words at the bottom and the MLM objective at the top; accuracy is consistently best a bit past the midpoint.]

42 of 68

Layerwise trends of probing accuracy

  • Increasingly abstract linguistic properties are more accessible later in the network.

[Figure: increasingly abstract linguistic properties become accessible at increasing depth in the network.]

43 of 68

Emergent simple structure in neural networks

  • Recall word2vec, and the intuitions we built around its vectors

[Figure: word2vec vectors for California/Sacramento, Pennsylvania/Harrisburg, and cat/kitty/guitar.]

Some relationships are encoded as linear offsets (e.g., state-capital pairs), and we interpret cosine similarity as semantic similarity.

  • It is fascinating that interpretable concepts approximately map onto simple functions of the vectors

44 of 68

Syntax Probes

The chef who ran to the store was out of food

45 of 68

Human languages, numerical machines

The meaning of a sentence is constructed by composing small chunks of words with each other, obtaining successively larger chunks with more complex meanings.

46 of 68

Word Embedding Representation

[Figure: words mapped to vectors in a vector space.]

47 of 68

Are these views of language reconcilable?

A method to find tree structures in these vector spaces shows the surprising extent to which ELMo and BERT encode human-like parse trees.

48 of 68

Contextual representations of language


49 of 68

Beyond “words in context”

In order to perform language modeling well, with enough data, some high-level language information seems to be needed.

Consider the following first sentence of a story, along with two possible continuations:

The chef who went to the store was out of food.

    • Because there was no food to be found, the chef went to the next store.
    • After stocking up on ingredients, the chef returned to the restaurant.

50 of 68

Note that the string “the store was out of food” is a substring of the first sentence. Thus, knowing that the subject-verb pair is (chef, was), not (store, was), may be helpful in choosing the right continuation.

51 of 68

What did my neural network learn along the way?

52 of 68

Observational evidence

An observational network study evaluates the model at the task it was optimized for, often hand-crafting inputs to determine whether a given desired behavior is observed.

Recall Linzen et al., 2016, which tests number agreement between subjects and verbs.

53 of 68

The structural probe

We think of every sentence as having a latent parse tree, which the neural network does not have access to. In the dependency parsing formalism, each word in the sentence has a corresponding node in the parse tree.

A dependency parse tree looks like this:

54 of 68

Trees as distances and norms

Our first intuition is that vector spaces and graphs both have natural distance metrics.

For a parse tree, we have the path metric d(wᵢ, wⱼ): the number of edges in the path between the two words in the tree.
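A minimal sketch of the path metric on a toy dependency parse. The head list below is an illustrative parse of “The dog chased the cat”, not taken from the probe’s data.

```python
from collections import deque

# Path metric d(w_i, w_j): number of edges between two words in the dependency
# tree. The tree is given as 1-indexed head positions with 0 marking the root
# (CoNLL-style).
def tree_distances(heads):
    n = len(heads)
    adj = [[] for _ in range(n)]
    for i, h in enumerate(heads):
        if h != 0:
            adj[i].append(h - 1)
            adj[h - 1].append(i)
    dist = [[0] * n for _ in range(n)]
    for s in range(n):                       # BFS from every word
        seen, q = {s}, deque([(s, 0)])
        while q:
            u, d = q.popleft()
            dist[s][u] = d
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    q.append((v, d + 1))
    return dist

heads = [2, 3, 0, 5, 3]                      # The->dog, dog->chased, chased=root, the->cat, cat->chased
d = tree_distances(heads)
print(d[1][4])                               # d(dog, cat) = 2 (dog -> chased -> cat)
```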

55 of 68

Syntax Distance Hypothesis

The syntax distance hypothesis: there exists a linear transformation B of the word representation space under which vector distance encodes parse trees. Equivalently, there exists an inner product on the word representation space such that distance under the inner product encodes parse trees. This (indefinite) inner product is specified by BᵀB.
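A minimal sketch of the hypothesized distance, with random tensors standing in for one sentence’s contextual representations (the sizes T, d, k below are arbitrary illustrative choices):

```python
import torch

def probe_sq_distances(H: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
    """Squared probe distances d_B(h_i, h_j)^2 = ||B (h_i - h_j)||^2 for all word pairs.

    H: (T, d) word representations; B: (k, d) linear probe.
    The implied inner product on the representation space is B^T B."""
    diffs = H.unsqueeze(1) - H.unsqueeze(0)   # (T, T, d) pairwise differences
    proj = diffs @ B.T                        # (T, T, k) transformed differences
    return (proj ** 2).sum(-1)                # (T, T) squared distances

T, d, k = 11, 1024, 128                       # e.g. a BERT-large layer, rank-128 probe
H = torch.randn(T, d)                         # stand-in for one sentence's vectors
B = 0.01 * torch.randn(k, d)
print(probe_sq_distances(H, B).shape)         # torch.Size([11, 11])
```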

56 of 68

2-D Visualization

The distances we pointed out earlier between chefstore and was, can be visualized in a vector space as follows, where B ∈ R2×3, mapping 3-dimensional word representations to a 2-dimensional space encoding syntax.

57 of 68

Minimum Spanning Tree

After the linear transformation, however, taking a minimum spanning tree on the distances recovers the tree, as shown in the image
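A minimal sketch of this recovery step using SciPy’s minimum spanning tree on a toy symmetric distance matrix; in the probe, the matrix would be the predicted squared distances for one sentence.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

rng = np.random.default_rng(0)
T = 6
D = rng.random((T, T))
D = (D + D.T) / 2                 # toy symmetric "predicted distances"
np.fill_diagonal(D, 0.0)

mst = minimum_spanning_tree(D)    # keeps the T-1 cheapest edges that connect all words
edges = sorted(zip(*mst.nonzero()))
print(edges)                      # undirected edges of the recovered tree
```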

58 of 68

Finding a parse tree-encoding distance metric


59 of 68

Finding B

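The slide’s details are not reproduced here; as a hedged sketch following Hewitt & Manning (2019), B can be found by gradient descent, minimizing the average absolute difference between gold tree distances and squared probe distances over all word pairs of each sentence. The tensors below are random stand-ins for a real (representations, gold tree distances) pair.

```python
import torch

def probe_sq_distances(H, B):
    diffs = H.unsqueeze(1) - H.unsqueeze(0)
    return ((diffs @ B.T) ** 2).sum(-1)

def probe_loss(H, B, tree_dist):
    """(1/T^2) * sum_{i,j} | d_T(w_i, w_j) - d_B(h_i, h_j)^2 |  for one sentence."""
    T = H.size(0)
    return (probe_sq_distances(H, B) - tree_dist).abs().sum() / (T * T)

d, k, T = 1024, 128, 11
B = torch.nn.Parameter(0.01 * torch.randn(k, d))
opt = torch.optim.Adam([B], lr=1e-3)

# One illustrative training step on random stand-ins for real data.
H = torch.randn(T, d)
tree_dist = torch.randint(0, 8, (T, T)).float()
opt.zero_grad()
loss = probe_loss(H, B, tree_dist)
loss.backward()
opt.step()
print(loss.item())
```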

60 of 68

Reconstructed trees and depth

Gold parse trees (black, above the sentences) are shown along with the minimum spanning trees of predicted distance metrics for a sentence (blue, red, purple, below the sentence).

61 of 68

Depth

Depths in the gold parse tree (grey, circle) are shown alongside predicted (squared) parse depths according to ELMo1 (red, triangle) and BERT-large, layer 16 (blue, square).

62 of 68

Parse Distance Matrix

63 of 68

Subject-verb number agreement

Distance matrices and minimum spanning trees predicted by a structural probe on BERT-large, layer 16, for 4 sentences:

64 of 68

Ungrammatical Sentence

What happens when we give BERT and the structural probe an ungrammatical sentence, where the verb form given is plural but still must refer back to a singular subject? We still see largely the same behavior, meaning the model may not just be matching the verb with whichever noun in the sentence has the same number:

65 of 68

LLM Evaluation

66 of 68

  • Y. Bang et al. 2023. A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity. https://arxiv.org/pdf/2302.04023
  • ChatGPT outperforms other LLMs in zero-shot settings on most tasks and even outperforms fine-tuned models on some tasks.
  • Factuality: ChatGPT detects COVID-19 misinformation with 92% accuracy.
  • Reasoning: ChatGPT is 63.41% accurate on average across 10 reasoning categories covering logical, non-textual, and commonsense reasoning.
  • Hallucinations: ChatGPT suffers from hallucination problems like other LLMs, and it generates more extrinsic hallucinations from its parametric memory.
  • Multi-turn “prompt engineering” enables the underlying LLM to improve its performance, e.g., +8% ROUGE-1 on summarization and +2% ChrF++ on machine translation.

67 of 68

Performance vs Fine-tuned SotA

68 of 68

Conclusions

  • LMs exhibit surprising abilities in many NL tasks, especially after fine-tuning
  • Studies have questioned their ability to understand deep linguistic structures
  • Experiments have shown the limits of simple LMs, while Transformer models appear capable of capturing higher-level linguistic knowledge
  • The geometry of English parse trees is approximately discoverable in the geometry of deep LMs
  • Understanding the capabilities of these models has become an active area of research