Analysis of Language Models, aka BERTology
Human Language Technologies
Dipartimento di Informatica
Giuseppe Attardi
Università di Pisa
Slides from John Hewitt: https://nlp.stanford.edu/~johnhew/structural-probe.html
Questions about Language Models
What can’t be learned via language model pretraining?
What will replace the Transformer?
What can be learned via language model pretraining?
What does deep learning try to do?
What do neural models tell us about language?
How are our models affecting people, and transferring power?
Before self-supervised learning
NLP treated language as the output of the human language system, and tried to imitate it
Slide by Isabel Papadimitriou
How do humans structure this string of words?
Slide by Isabel Papadimitriou
Linguistic theory helped NLP reverse-engineer language
Syntax
Input
Pipelines
Semantics
Discourse
…
Slide by Isabel Papadimitriou
Now, language models just seem to catch on to a lot of these things!
Slide by Isabel Papadimitriou
Linguistic Structure in NLP
Slide by Isabel Papadimitriou
What linguistic knowledge is present in LMs?
Italian articles and articulated prepositions: il, nel, nella, al, dal, col, dell’
Unsupervised NER
context-sensitive signature of a term
context-independent signature of a term
Entity Distribution
LM Effectiveness
LMs exhibit surprising abilities in several language tasks
But do they really understand language?
Consider the natural language inference task, as encoded in the Multi-NLI dataset.
Premise: “He turned and saw Jon sleeping in his half-tent”
Hypothesis: “He saw Jon was asleep”
[Likely to get the right answer, since the accuracy is 95%?]
Model A (accuracy: 95%) outputs one of: Entailment, Neutral, Contradiction
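Below is a minimal sketch of running this premise/hypothesis pair through an off-the-shelf MNLI-finetuned model with the Hugging Face transformers library; the roberta-large-mnli checkpoint is just one publicly available choice, and the label names are read from the model config rather than assumed.

```python
# Minimal sketch: score a premise/hypothesis pair with an MNLI-finetuned model.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "roberta-large-mnli"          # one publicly available MNLI model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name).eval()

premise = "He turned and saw Jon sleeping in his half-tent"
hypothesis = "He saw Jon was asleep"

# Encode the pair as (premise, hypothesis); the tokenizer inserts the
# separator tokens the model expects.
inputs = tokenizer(premise, hypothesis, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
probs = logits.softmax(dim=-1).squeeze(0)

for idx, p in enumerate(probs.tolist()):
    print(f"{model.config.id2label[idx]:>13}: {p:.3f}")
# A strong MNLI model should place most of the probability mass on ENTAILMENT here.
```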
Checking LM understanding abilities
What if our model is using simple heuristics to get good accuracy?
A diagnostic test set is carefully constructed to test for a specific skill or capacity of your neural model.
For example, HANS (Heuristic Analysis for NLI Systems) tests syntactic heuristics in NLI.
HANS model analysis in Natural Language Inference
McCoy et al., 2019 took 4 strong MNLI models, with the following accuracies on the original (in-domain) test set.
Evaluating on HANS where the syntactic heuristics work, accuracy is high!
But where the syntactic heuristics fail, accuracy is very low…
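A rough sketch of how such a diagnostic evaluation can be run, assuming the “hans” dataset on the Hugging Face Hub with premise, hypothesis, label and heuristic fields and the label convention 0 = entailment, 1 = non-entailment; the field names, the label convention, and the model checkpoint are all assumptions here.

```python
# Sketch: stratify an NLI model's accuracy by HANS heuristic.
from collections import defaultdict

import torch
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "roberta-large-mnli"          # any MNLI-finetuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name).eval()

hans = load_dataset("hans", split="validation").shuffle(seed=0)

hits, totals = defaultdict(int), defaultdict(int)
for ex in hans.select(range(1000)):        # small sample for speed
    inputs = tokenizer(ex["premise"], ex["hypothesis"], return_tensors="pt")
    with torch.no_grad():
        pred = model(**inputs).logits.argmax(-1).item()
    # Map the 3-way MNLI prediction to HANS's 2-way labels:
    # anything that is not "entailment" counts as non-entailment.
    pred_label = 0 if model.config.id2label[pred].lower() == "entailment" else 1
    key = ex["heuristic"]
    totals[key] += 1
    hits[key] += int(pred_label == ex["label"])

for key in totals:
    print(f"{key:25s} accuracy = {hits[key] / totals[key]:.2f}")
```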
Language models as linguistic test subjects
The chef who made the pizzas is here. 🡨 “Acceptable”
The chef who made the pizzas are here 🡨 “Unacceptable”
Idea: verbs agree in number with their subjects
[Linzen et al., 2016; figure from Manning et al., 2020 ]
Testing Language Models’ Linguistic Knowledge
The chef who made the pizzas is here. 🡨 “Acceptable”
The chef who made the pizzas are here 🡨 “Unacceptable”
P(The chef who made the pizzas is here.) > P(The chef who made the pizzas are here)
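A minimal sketch of this probability comparison with an autoregressive LM (GPT-2 as a stand-in; any causal LM would do):

```python
# Minimal sketch: compare total log-probability of the grammatical and
# ungrammatical sentence under an autoregressive LM.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def sentence_logprob(sentence: str) -> float:
    """Sum of log P(token_t | tokens_<t) over the whole sentence."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids, the model returns the mean per-token NLL.
        out = model(ids, labels=ids)
    n_predicted = ids.size(1) - 1          # the first token has no left context
    return -out.loss.item() * n_predicted

good = "The chef who made the pizzas is here."
bad = "The chef who made the pizzas are here."
print(sentence_logprob(good), sentence_logprob(bad))
# A model that has learned subject-verb agreement should assign a higher
# log-probability to the grammatical sentence.
```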
Testing Language Models’ Linguistic Knowledge
Sample test examples for subject-verb agreement with attractors that a model got wrong
The ship that the player drives has a very high speed.
The ship that the player drives have a very high speed.
The lead is also rather long; 5 paragraphs is pretty lengthy …
The lead is also rather long; 5 paragraphs are pretty lengthy …
Testing Language Models’ Linguistic Knowledge
Zero attractors: Easy
4 attractors: harder, but models still do pretty well!
The larger LSTMs learn subject-verb agreement better!
Linguistic Abilities
Current LLMs do exhibit formal language skills (such as lexical and grammatical knowledge), but they lack functional skills (using language for reasoning and in real-world situations).
K. Mahowald, et al. (2023) Dissociating language and thought in large language models: a cognitive perspective. https://arxiv.org/abs/2301.06627
Assessing Language Models’ Syntactic Abilities
Y. Goldberg. 2019. Assessing BERT’s Syntactic Abilities.
the game that the guard hates is bad .
the game that the guard hates are bad .
Feed into BERT:
[CLS] the game that the guard hates [MASK] bad .
and compare the scores predicted for is and are.
Attractors | BERT Base | BERT Large | # sents |
1 | 0.97 | 0.97 | 24031 |
2 | 0.97 | 0.97 | 4414 |
3 | 0.96 | 0.96 | 946 |
4 | 0.97 | 0.96 | 254 |
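A sketch of this masked-comparison setup with a Hugging Face masked LM; the checkpoint is illustrative and the printed scores will of course vary.

```python
# Sketch of the stimulus setup above: mask the verb position and compare the
# scores BERT assigns to "is" vs. "are".
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased").eval()

sentence = f"the game that the guard hates {tokenizer.mask_token} bad ."
inputs = tokenizer(sentence, return_tensors="pt")   # adds [CLS] ... [SEP]

mask_pos = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero().item()
with torch.no_grad():
    logits = model(**inputs).logits[0, mask_pos]

for word in ("is", "are"):
    word_id = tokenizer.convert_tokens_to_ids(word)
    print(word, logits[word_id].item())
# The agreement test is passed if the score for "is" exceeds the score for "are".
```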
Assessing Language Models’ Syntactic Abilities
| BERT Base | BERT Large | LSTM (M&L) | Humans (M&L) | # Pairs (# M&L Pairs) |
SUBJECT-VERB AGREEMENT:
Simple | 1.00 | 1.00 | 0.94 | 0.96 | 120 (140) |
In a sentential complement | 0.83 | 0.86 | 0.99 | 0.93 | 1440 (1680) |
Short VP coordination | 0.89 | 0.86 | 0.90 | 0.82 | 720 (840) |
Long VP coordination | 0.98 | 0.97 | 0.61 | 0.82 | 400 (400) |
Across a prepositional phrase | 0.85 | 0.85 | 0.57 | 0.85 | 19440 (22400) |
Across a subject relative clause | 0.84 | 0.85 | 0.56 | 0.88 | 9600 (11200) |
Across an object relative clause | 0.89 | 0.85 | 0.50 | 0.85 | 19680 (22400) |
Across an object relative (no that) | 0.86 | 0.81 | 0.52 | 0.82 | 19680 (22400) |
In an object relative clause | 0.95 | 0.99 | 0.84 | 0.78 | 15960 (22400) |
In an object relative (no that) | 0.79 | 0.82 | 0.71 | 0.79 | 15960 (22400) |
REFLEXIVE ANAPHORA:
Simple | 0.94 | 0.92 | 0.83 | 0.96 | 280 (280) |
In a sentential complement | 0.89 | 0.86 | 0.86 | 0.91 | 3360 (3360) |
Across a relative clause | 0.80 | 0.76 | 0.55 | 0.87 | 22400 (22400) |
Marvin and Linzen (2018)
Input influence: does the model really use long-distance context?
History farther than ~50 words away is treated as a bag of words.
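One way to test this claim is sketched below, under assumptions: GPT-2 as the LM, a hypothetical passage.txt file as the long input, and a crude split into far context, near context (last 50 words before the suffix), and a 30-word evaluated suffix.

```python
# Sketch of a context-perturbation test: shuffle the words of the distant
# context and check how much the loss on the final span changes.
import random

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def loss_on_suffix(context_words, suffix_words):
    # Assumes the context tokenization is a prefix of the full tokenization.
    ids = tokenizer(" ".join(context_words + suffix_words), return_tensors="pt").input_ids
    n_ctx = tokenizer(" ".join(context_words), return_tensors="pt").input_ids.size(1)
    with torch.no_grad():
        logits = model(ids).logits
    # Average NLL of the suffix tokens only, conditioned on everything before them.
    targets = ids[0, n_ctx:]
    preds = logits[0, n_ctx - 1:-1]
    return torch.nn.functional.cross_entropy(preds, targets).item()

words = open("passage.txt").read().split()     # any long passage (hypothetical file)
far, near, suffix = words[:-80], words[-80:-30], words[-30:]

shuffled_far = random.sample(far, len(far))    # destroy word order in the far context
print("intact far context  :", loss_on_suffix(far + near, suffix))
print("shuffled far context:", loss_on_suffix(shuffled_far + near, suffix))
# If the two losses are close, the model is effectively treating the
# far-away history as a bag of words.
```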
Prediction explanations: what in the input led to this output?
Prediction explanations: simple saliency maps
For words x_1, …, x_n and the model’s score for a given class (output label) s_c(x_1, …, x_n), take the norm of the gradient of the score with respect to each word:
salience(x_i) = ‖∇_{x_i} s_c(x_1, …, x_n)‖
Idea: high gradient norm means changing that word (locally) would affect the score a lot
[Figure: loss plotted over word space; steep regions correspond to high saliency, flat regions to low saliency]
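A minimal sketch of this gradient-norm saliency computation for a sequence classifier; the checkpoint name is illustrative, and the gradient is taken with respect to the input embeddings.

```python
# Minimal sketch of gradient-norm saliency: gradient of the predicted class
# score w.r.t. each input embedding, with its norm used as the word's saliency.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "distilbert-base-uncased-finetuned-sst-2-english"  # any classifier
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name).eval()

sentence = "The chef who made the pizzas is here."
inputs = tokenizer(sentence, return_tensors="pt")

# Embed the tokens ourselves so we can ask for gradients w.r.t. the embeddings.
embeds = model.get_input_embeddings()(inputs.input_ids)
embeds = embeds.detach().requires_grad_(True)

logits = model(inputs_embeds=embeds, attention_mask=inputs.attention_mask).logits
pred_class = logits.argmax(-1).item()
score = logits[0, pred_class]                 # score of the predicted class
score.backward()

saliency = embeds.grad.norm(dim=-1).squeeze(0)   # one gradient norm per token
tokens = tokenizer.convert_ids_to_tokens(inputs.input_ids[0].tolist())
for token, s in zip(tokens, saliency.tolist()):
    print(f"{token:12s} {s:.4f}")
```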
Prediction explanations: simple saliency maps
For words x_1, …, x_n and the model’s score for a given class (output label) s_c(x_1, …, x_n), take the norm of the gradient of the score with respect to each word:
salience(x_i) = ‖∇_{x_i} s_c(x_1, …, x_n)‖
Not a perfect method for saliency; many more methods have been proposed. One issue: linear approximation may not hold well!
[Figure: loss plotted over word space; a point may have low saliency according to the local gradient, yet moving a little farther makes the loss skyrocket]
BlenderBot
Who is the Italian prime minister?
[BlenderBot]: It is Mussolini. He was the leader of the fascist party during WWII.
Is he still alive?
[BlenderBot]: Yes, he is still alive. He has been in politics since 1989. He also has a son.
Who is his son?
[BlenderBot]: I do not know his name, but he was born in Italy and moved to the United States when he was very young.
Who is Enrico Fermi?
[BlenderBot]: I am not sure who that is, but I do know that Ferrari is an automobile manufacturer.
Who was Alan Turing?
[BlenderBot]: He was an American computer scientist who invented the Turing Machine. He was a genius.
What did he do during WWII?
[BlenderBot]: He worked on the Manhattan Project, which was a project to develop nuclear weapons.
What is the population of Michigan?
[BlenderBot]: I do not know the exact number, but it is one of the most populous states in the US.
https://ai.facebook.com/blog/blender-bot-2-an-open-source-chatbot-that-builds-long-term-memory-and-searches-the-internet/
What is the population of California?
[BlenderBot]: The state of California has a population of over 8,537,673 people.
(Well: it is 39.5 million)
Explanation by input reduction
What is the smallest part of the input I could keep and still get the same answer? An example from SQuAD:
Passage: In 1899, John Jacob Astor IV invested $100,000 for Tesla to further develop and produce a new lighting system. Instead, Tesla used the money to fund his Colorado Springs experiments.
Original Question: What did Tesla spend Astor’s money on? [prediction]
Reduced Question: did
In this example, the model had confidence 0.78 for the original question, and the same answer at confidence 0.91 for the reduced question!
A method for explanation by input reduction
Idea: run an input saliency method. Iteratively remove the least important words.
Passage: The Panthers used the San Jose State practice facility and stayed at the San Jose Marriott. The Broncos practiced at Stanford University and stayed at the Santa Clara Marriott.
Original Question: Where did the Broncos practice for the Super Bowl?
Where did the practice for the Super Bowl?
Where did practice for the Super Bowl?
Where did practice the Super Bowl?
Where did practice the Super?
Where did practice Super?
did practice Super?
[prediction]
[Note: beam search to find k least important words is an important addition]
Steps of input reduction
Only here did the model stop being confident in the answer
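A greedy sketch of the procedure (without the beam search mentioned above); predict and saliency are hypothetical helpers, e.g. a QA model’s argmax answer and the gradient-norm saliency sketched earlier, with saliency returning one score per whitespace-separated question word.

```python
# Greedy input reduction: repeatedly drop the lowest-saliency question word
# as long as the model keeps predicting the same answer.
def input_reduction(question_words, passage, predict, saliency):
    original_answer = predict(" ".join(question_words), passage)
    while len(question_words) > 1:
        scores = saliency(" ".join(question_words), passage)   # one score per word
        candidate = list(question_words)
        del candidate[min(range(len(scores)), key=scores.__getitem__)]
        if predict(" ".join(candidate), passage) != original_answer:
            break                       # stop once the prediction changes
        question_words = candidate
    return question_words

# e.g. input_reduction("What did Tesla spend Astor's money on?".split(),
#                      passage, predict, saliency)
# might reduce the question all the way down to ["did"].
```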
Analyzing models by breaking them
Idea: Can we break models by making seemingly innocuous changes to the input?
Passage: Peyton Manning became the first quarterback ever to lead two different teams to multiple Super Bowls. He is also the oldest quarterback ever to play in a Super Bowl at age 39. The past record was held by John Elway, who led the Broncos to victory in Super Bowl XXXIII at age 38…
Question: What was the name of the quarterback who was 38 in Super Bowl XXXIII?
[prediction]
Looks good!
Analyzing models by breaking them
Idea: Can we break models by making seemingly innocuous changes to the input?
Passage: Peyton Manning became the first quarterback ever to lead two different teams to multiple Super Bowls. He is also the oldest quarterback ever to play in a Super Bowl at age 39. The past record was held by John Elway, who led the Broncos to victory in Super Bowl XXXIII at age 38… Quarterback Jeff Dean had jersey number 37 in Champ Bowl XXXIV.
Question: What was the name of the quarterback who was 38 in Super Bowl XXXIII?
[prediction]
The appended sentence hasn’t changed the correct answer, but the model’s prediction changed!
So, seems like the model wasn’t performing question answering as we’d like?
Analyzing models by breaking them
Idea: Can we break models by making seemingly innocuous changes to the input?
This model’s predictions look good!
This typo is annoying, but a reasonable human might ignore it.
Changing what to what’s should never change the answer!
Analysis of “interpretable” architecture components
Idea: Some modeling components lend themselves to inspection.
For example, can we try to characterize each attention head of BERT?
Attention head 1 of layer 1.
This head performs this kind of behavior on most sentences.
[Why is “interpretable” in quotes? It’s hard to tell exactly how/whether the model is performing an interpretable function, especially deep in the network.]
Analysis of “interpretable” architecture components
Idea: Some modeling components lend themselves to inspection.
Some attention heads seem to perform simple operations.
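A sketch of how such an inspection can be done with a Hugging Face BERT model: ask for the attention weights and, for one arbitrarily chosen layer/head, print which token each token attends to most strongly.

```python
# Sketch: inspect one attention head of BERT.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True).eval()

sentence = "The chef who made the pizzas is here."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    attentions = model(**inputs).attentions   # tuple: one tensor per layer

layer, head = 7, 9                             # arbitrary head to inspect
attn = attentions[layer][0, head]              # [seq_len, seq_len] attention matrix
tokens = tokenizer.convert_ids_to_tokens(inputs.input_ids[0].tolist())

for i, token in enumerate(tokens):
    j = attn[i].argmax().item()                # token attended to most strongly
    print(f"{token:10s} -> {tokens[j]:10s} ({attn[i, j].item():.2f})")
```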
Analysis of “interpretable” architecture components
Idea: Some modeling components lend themselves to inspection.
Some heads are correlated with linguistic properties!
Approach: approximate interpretation + quantitative analysis of model behavior.
Analysis of “interpretable” architecture components
Idea: Some modeling components lend themselves to inspection.
We saw coreference before; one head often matches coreferent mentions!
Approach: approximate interpretation + quantitative analysis of model behavior.
Probes
Probing: supervised analysis of neural networks
Premise: pretrained Transformers provide surprisingly good general-purpose language representations.
Question: what do their representations encode about language?
Probing: supervised analysis of neural networks
Question: what do pretrained representations encode about linguistic properties for which we have annotated data?
Probing: supervised analysis of neural networks
BERT (and other pretrained LMs) make some linguistic properties predictable to very high accuracy with a simple linear probe.
Examples: syntactic roles, part-of-speech, named entity recognition.
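A sketch of a linear probe under toy assumptions: frozen bert-base-uncased features from one middle layer, a handful of hand-labelled POS tags standing in for a real treebank, and scikit-learn logistic regression as the probe.

```python
# Sketch of a linear probe: freeze BERT, take one layer's token vectors, and
# fit a logistic-regression classifier to predict a token-level label (POS).
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoTokenizer, BertModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True).eval()

# Toy annotated data (hypothetical); a real probe would use a treebank.
tokens = ["the", "chef", "made", "the", "pizzas"]
pos_tags = ["DET", "NOUN", "VERB", "DET", "NOUN"]

inputs = tokenizer(tokens, is_split_into_words=True, return_tensors="pt")
with torch.no_grad():
    hidden_states = model(**inputs).hidden_states   # tuple: embeddings + each layer

layer = 7                                           # probe one middle layer
word_ids = inputs.word_ids()                        # maps subwords back to words
feats, labels = [], []
for pos, word_id in enumerate(word_ids):
    if word_id is not None:                         # skip [CLS]/[SEP]
        feats.append(hidden_states[layer][0, pos].numpy())
        labels.append(pos_tags[word_id])

probe = LogisticRegression(max_iter=1000).fit(feats, labels)
print(probe.score(feats, labels))                   # training accuracy of the probe
```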
Layerwise trends of probing accuracy
Probing accuracy is consistently best a bit past the midpoint of the network’s depth.
[Figure: network layers, from the input words at the bottom up to the MLM objective at the top]
Layerwise trends of probing accuracy
More abstract linguistic properties are best predicted at increasing depth in the network.
Emergent simple structure in neural networks
[Figure: word vectors in 2-D, with state/capital pairs (California/Sacramento, Pennsylvania/Harrisburg) and related words (cat, kitty, guitar)]
Some relationships are encoded as linear offsets
We interpret cosine similarity as semantic similarity.
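A toy illustration of both ideas (linear offsets and cosine similarity) with made-up vectors; real embeddings would come from a trained model.

```python
# Toy illustration of linear offsets and cosine similarity in a word-vector space.
import numpy as np

vec = {                                   # hypothetical 3-d embeddings
    "California":   np.array([1.0, 0.2, 0.1]),
    "Sacramento":   np.array([1.0, 0.9, 0.1]),
    "Pennsylvania": np.array([0.2, 0.2, 1.0]),
    "Harrisburg":   np.array([0.2, 0.9, 1.0]),
    "cat":          np.array([0.1, 0.1, 0.3]),
    "kitty":        np.array([0.12, 0.1, 0.28]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# "capital-of" as a roughly constant offset:
offset = vec["Sacramento"] - vec["California"]
predicted_harrisburg = vec["Pennsylvania"] + offset
print(cosine(predicted_harrisburg, vec["Harrisburg"]))   # close to 1
print(cosine(vec["cat"], vec["kitty"]))                  # high: similar meanings
print(cosine(vec["cat"], vec["California"]))             # lower: unrelated words
```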
Syntax Probes
The chef who ran to the store was out of food
Human languages, numerical machines
The meaning of a sentence is constructed by composing small chunks of words together with each other, obtaining successively larger chunks with more complex meanings
Word Embedding Representation: words as vectors in a vector space.
Are these views of language reconcilable?
The structural probe is a method for finding tree structures in these vector spaces; it shows the surprising extent to which ELMo and BERT encode human-like parse trees.
Contextual representations of language
Beyond “words in context”
In order to perform language modeling well, with enough data, some high-level language information seems to be needed.
Consider the first sentence of a story: The chef who went to the store was out of food.
The string “the store was out of food” appears in the sentence, yet the verb “was” agrees with “chef”, not with the adjacent “store”. Knowing that the relevant pair is (chef, was), and not (store, was), may help the model predict the verb.
What did my neural network learn along the way?
Observational evidence
An observational network study evaluates the model at the task it was optimized for, often hand-crafting inputs to determine whether a given desired behavior is observed.
Recall Linzen et al., 2016, which tests number agreement between subjects and verbs.
The structural probe
We think of each sentence as having a latent parse tree, which the neural network does not have access to. In the dependency parsing formalism, each word in the sentence has a corresponding node in the parse tree.
A dependency parse tree looks like this:
Trees as distances and norms
Our first intuition is that vector spaces and graphs both have natural distance metrics.
For a parse tree, we have the path metric, d(w_i, w_j), which is the number of edges in the path between the two words in the tree.
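A small sketch of the path metric on an illustrative, approximate dependency tree for the example sentence, using networkx; the edge list is only meant to convey the idea, not a gold parse.

```python
# Sketch: the path metric d(w_i, w_j) = number of edges between two words.
import networkx as nx

# Approximate, illustrative dependency tree (as undirected edges between word
# indices) for "The chef who ran to the store was out of food":
# 0 The, 1 chef, 2 who, 3 ran, 4 to, 5 the, 6 store, 7 was, 8 out, 9 of, 10 food
edges = [(7, 1), (1, 0), (1, 3), (3, 2), (3, 6), (6, 4), (6, 5), (7, 8), (8, 9), (9, 10)]
tree = nx.Graph(edges)

d = dict(nx.all_pairs_shortest_path_length(tree))
print(d[1][7])   # "chef" and "was": 1 edge apart in the tree
print(d[6][7])   # "store" and "was": 3 edges apart, despite being adjacent in the string
```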
Syntax Distance Hypothesis
The syntax distance hypothesis: there exists a linear transformation B of the word representation space under which vector distance encodes parse trees. Equivalently, there exists an inner product on the word representation space such that distance under the inner product encodes parse trees. This (positive semi-definite) inner product is specified by BᵀB.
2-D Visualization
The distances we pointed out earlier between chef, store, and was can be visualized in a vector space as follows, where B ∈ ℝ^{2×3} maps 3-dimensional word representations to a 2-dimensional space encoding syntax.
Minimum Spanning Tree
After the linear transformation, however, taking a minimum spanning tree on the pairwise distances recovers the parse tree, as shown in the image.
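A sketch of this tree-recovery step with scipy’s minimum spanning tree, applied to a made-up matrix of predicted pairwise distances.

```python
# Sketch: recover a tree from pairwise probe distances by taking a minimum
# spanning tree over the complete graph of words.
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

# dist[i, j] = predicted squared distance between words i and j
# (in practice: ||B(h_i - h_j)||^2 from the structural probe). Values are made up.
dist = np.array([
    [0.0, 1.1, 2.2, 1.9],
    [1.1, 0.0, 1.0, 0.9],
    [2.2, 1.0, 0.0, 2.1],
    [1.9, 0.9, 2.1, 0.0],
])

mst = minimum_spanning_tree(dist).toarray()
edges = [(int(i), int(j)) for i, j in zip(*np.nonzero(mst))]
print(edges)   # the predicted (undirected) parse-tree edges
```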
Finding a parse tree-encoding distance metric
Finding B: B is learned by gradient descent so that the squared probe distance ‖B(h_i - h_j)‖² matches the tree path distance d(w_i, w_j) as closely as possible.
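A sketch of learning B along the lines of Hewitt & Manning (2019): match squared probe distance to tree path distance with an L1 loss. The representations and “gold” distances below are random placeholders; a real probe is trained on a treebank with frozen LM vectors.

```python
# Sketch of training the structural probe's matrix B.
import torch

hidden_dim, probe_rank, seq_len = 768, 64, 12
torch.manual_seed(0)
h = torch.randn(seq_len, hidden_dim)                 # frozen LM word vectors (placeholder)
gold = torch.randint(1, 6, (seq_len, seq_len)).float()
gold = torch.triu(gold, 1)
gold = gold + gold.T                                 # symmetric placeholder "tree" distances

B = torch.nn.Parameter(torch.randn(probe_rank, hidden_dim) * 0.01)
optimizer = torch.optim.Adam([B], lr=1e-3)

def probe_distances(B, h):
    """d_B(i, j)^2 = ||B(h_i - h_j)||^2 for every pair of words."""
    th = h @ B.T                                     # [seq_len, probe_rank]
    diff = th.unsqueeze(1) - th.unsqueeze(0)         # [seq_len, seq_len, probe_rank]
    return (diff ** 2).sum(-1)

for step in range(200):
    optimizer.zero_grad()
    # L1 loss between tree path distance and squared probe distance.
    loss = (probe_distances(B, h) - gold).abs().mean()
    loss.backward()
    optimizer.step()
print(loss.item())
```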
Reconstructed trees and depth
Gold parse trees (black, above the sentences), along with the minimum spanning trees of the predicted distance metrics for the same sentence (blue, red, purple, below the sentence):
Depth
Depths in the gold parse tree (grey, circle), as well as predicted (squared) parse depths according to ELMo1 (red, triangle) and BERT-large, layer 16 (blue, square).
Parse Distance Matrix
Subject-verb number agreement
Distance matrices and minimum spanning trees predicted by a structural probe on BERT-large, layer 16, for 4 sentences:
Ungrammatical Sentence
What happens when we give BERT and the structural probe an ungrammatical sentence, where the verb form is plural but must still refer back to a singular subject? We still see largely the same behavior, suggesting the model is not simply matching the verb to the nearest noun of the same number:
LLM Evaluation
Performance vs Fine-tuned SotA
Conclusions