Analysis of Language Models, aka BERTology
Human Language Technologies
Dipartimento di Informatica
Giuseppe Attardi
Università di Pisa
Slides from John Hewitt: https://nlp.stanford.edu/~johnhew/structural-probe.html
Questions about Language Models
What can’t be learned via language model pretraining?
What will replace the Transformer?
What can be learned via language model pretraining?
What does deep learning try to do?
What do neural models tell us about language?
How are our models affecting people, and transferring power?
Before self-supervised learning
NLP treated language as the output of the human language system, and tried to imitate it
Slide by Isabel Papadimitriou
How do humans structure this string of words?
Slide by Isabel Papadimitriou
Linguistic theory helped NLP reverse-engineer language
Syntax
Input
Pipelines
Semantics
Discourse
…
Slide by Isabel Papadimitriou
Now, language models just seem to catch on to a lot of these things!
Slide by Isabel Papadimitriou
Linguistic Structure in NLP
Slide by Isabel Papadimitriou
What linguistic knowledge is present in LMs?
Italian articles and articulated prepositions: il, nel, nella, al, dal, col, dell’
Unsupervised NER
context-sensitive signature of a term
context-independent signature of a term
Entity Distribution
LM Effectiveness
LMs exhibit surprising abilities in several language tasks
But do they really understand language?
Consider the natural language inference task, as encoded in the Multi-NLI dataset.
Premise: “He turned and saw Jon sleeping in his half-tent”
Hypothesis: “He saw Jon was asleep”
[Likely to get the right answer, since the accuracy is 95%?]
Model A (accuracy: 95%) outputs one of: Entailment, Neutral, Contradiction
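Below is a minimal sketch of running this premise/hypothesis pair through an off-the-shelf MNLI-finetuned model with the Hugging Face transformers library; the roberta-large-mnli checkpoint is just one publicly available choice, and the label names are read from the model config rather than assumed.

```python
# Minimal sketch: score a premise/hypothesis pair with an MNLI-finetuned model.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "roberta-large-mnli"          # one publicly available MNLI model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name).eval()

premise = "He turned and saw Jon sleeping in his half-tent"
hypothesis = "He saw Jon was asleep"

# Encode the pair as (premise, hypothesis); the tokenizer inserts the
# separator tokens the model expects.
inputs = tokenizer(premise, hypothesis, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
probs = logits.softmax(dim=-1).squeeze(0)

for idx, p in enumerate(probs.tolist()):
    print(f"{model.config.id2label[idx]:>13}: {p:.3f}")
# A strong MNLI model should place most of the probability mass on ENTAILMENT here.
```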
Checking LM understanding abilities
What if our model is using simple heuristics to get good accuracy?
A diagnostic test set is carefully constructed to test for a specific skill or capacity of your neural model.
For example, HANS (Heuristic Analysis for NLI Systems) tests syntactic heuristics in NLI.
HANS model analysis in Natural Language Inference
McCoy et al., 2019 took 4 strong MNLI models, with the following accuracies on the original (in-domain) test set.
Evaluating on HANS where the syntactic heuristics work, accuracy is high!
But where the syntactic heuristics fail, accuracy is very low…
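A rough sketch of how such a diagnostic evaluation can be run, assuming the “hans” dataset on the Hugging Face Hub with premise, hypothesis, label and heuristic fields and the label convention 0 = entailment, 1 = non-entailment; the field names, the label convention, and the model checkpoint are all assumptions here.

```python
# Sketch: stratify an NLI model's accuracy by HANS heuristic.
from collections import defaultdict

import torch
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "roberta-large-mnli"          # any MNLI-finetuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name).eval()

hans = load_dataset("hans", split="validation").shuffle(seed=0)

hits, totals = defaultdict(int), defaultdict(int)
for ex in hans.select(range(1000)):        # small sample for speed
    inputs = tokenizer(ex["premise"], ex["hypothesis"], return_tensors="pt")
    with torch.no_grad():
        pred = model(**inputs).logits.argmax(-1).item()
    # Map the 3-way MNLI prediction to HANS's 2-way labels:
    # anything that is not "entailment" counts as non-entailment.
    pred_label = 0 if model.config.id2label[pred].lower() == "entailment" else 1
    key = ex["heuristic"]
    totals[key] += 1
    hits[key] += int(pred_label == ex["label"])

for key in totals:
    print(f"{key:25s} accuracy = {hits[key] / totals[key]:.2f}")
```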
Language models as linguistic test subjects
The chef who made the pizzas is here. 🡨 “Acceptable”
The chef who made the pizzas are here 🡨 “Unacceptable”
Idea: verbs agree in number with their subjects
[Linzen et al., 2016; figure from Manning et al., 2020 ]
Testing Language Models’ Linguistic Knowledge
The chef who made the pizzas is here. 🡨 “Acceptable”
The chef who made the pizzas are here 🡨 “Unacceptable”
P(The chef who made the pizzas is here.) > P(The chef who made the pizzas are here)
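A minimal sketch of this probability comparison with an autoregressive LM (GPT-2 as a stand-in; any causal LM would do):

```python
# Minimal sketch: compare total log-probability of the grammatical and
# ungrammatical sentence under an autoregressive LM.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def sentence_logprob(sentence: str) -> float:
    """Sum of log P(token_t | tokens_<t) over the whole sentence."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids, the model returns the mean per-token NLL.
        out = model(ids, labels=ids)
    n_predicted = ids.size(1) - 1          # the first token has no left context
    return -out.loss.item() * n_predicted

good = "The chef who made the pizzas is here."
bad = "The chef who made the pizzas are here."
print(sentence_logprob(good), sentence_logprob(bad))
# A model that has learned subject-verb agreement should assign a higher
# log-probability to the grammatical sentence.
```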
Testing Language Models’ Linguistic Knowledge
Sample test examples for subject-verb agreement with attractors that a model got wrong
The ship that the player drives has a very high speed.
The ship that the player drives have a very high speed.
The lead is also rather long; 5 paragraphs is pretty lengthy …
The lead is also rather long; 5 paragraphs are pretty lengthy …
Testing Language Models’ Linguistic Knowledge
Zero attractors: Easy
4 attractors: harder, but models still do pretty well!
The larger LSTMs learn subject-verb agreement better!
Linguistic Abilities
Current LLMs do exhibit formal language skills (such as lexical and grammatical knowledge), but they lack functional skills (using language for reasoning and in real-world situations).
K. Mahowald, et al. (2023) Dissociating language and thought in large language models: a cognitive perspective. https://arxiv.org/abs/2301.06627
Assessing Language Models’ Syntactic Abilities
Y. Goldberg. 2019. Assessing BERT’s Syntactic Abilities.
the game that the guard hates is bad .
the game that the guard hates are bad .
Feed into BERT:
[CLS] the game that the guard hates [MASK] bad .
and compare the scores predicted for is and are.
Attractors | BERT Base | BERT Large | # sents |
1 | 0.97 | 0.97 | 24031 |
2 | 0.97 | 0.97 | 4414 |
3 | 0.96 | 0.96 | 946 |
4 | 0.97 | 0.96 | 254 |
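A sketch of this masked-comparison setup with a Hugging Face masked LM; the checkpoint is illustrative and the printed scores will of course vary.

```python
# Sketch of the stimulus setup above: mask the verb position and compare the
# scores BERT assigns to "is" vs. "are".
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased").eval()

sentence = f"the game that the guard hates {tokenizer.mask_token} bad ."
inputs = tokenizer(sentence, return_tensors="pt")   # adds [CLS] ... [SEP]

mask_pos = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero().item()
with torch.no_grad():
    logits = model(**inputs).logits[0, mask_pos]

for word in ("is", "are"):
    word_id = tokenizer.convert_tokens_to_ids(word)
    print(word, logits[word_id].item())
# The agreement test is passed if the score for "is" exceeds the score for "are".
```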
Assessing Language Models’ Syntactic Abilities
| BERT Base | BERT Large | LSTM (M&L) | Humans (M&L) | # Pairs (# M&L Pairs) |
SUBJECT-VERB AGREEMENT:
Simple | 1.00 | 1.00 | 0.94 | 0.96 | 120 (140) |
In a sentential complement | 0.83 | 0.86 | 0.99 | 0.93 | 1440 (1680) |
Short VP coordination | 0.89 | 0.86 | 0.90 | 0.82 | 720 (840) |
Long VP coordination | 0.98 | 0.97 | 0.61 | 0.82 | 400 (400) |
Across a prepositional phrase | 0.85 | 0.85 | 0.57 | 0.85 | 19440 (22400) |
Across a subject relative clause | 0.84 | 0.85 | 0.56 | 0.88 | 9600 (11200) |
Across an object relative clause | 0.89 | 0.85 | 0.50 | 0.85 | 19680 (22400) |
Across an object relative (no that) | 0.86 | 0.81 | 0.52 | 0.82 | 19680 (22400) |
In an object relative clause | 0.95 | 0.99 | 0.84 | 0.78 | 15960 (22400) |
In an object relative (no that) | 0.79 | 0.82 | 0.71 | 0.79 | 15960 (22400) |
REFLEXIVE ANAPHORA:
Simple | 0.94 | 0.92 | 0.83 | 0.96 | 280 (280) |
In a sentential complement | 0.89 | 0.86 | 0.86 | 0.91 | 3360 (3360) |
Across a relative clause | 0.80 | 0.76 | 0.55 | 0.87 | 22400 (22400) |
Marvin and Linzen (2018)
Input influence: does the model really use long-distance context?
History farther than ~50 words away is treated as a bag of words.
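One way to test this claim is sketched below, under assumptions: GPT-2 as the LM, a hypothetical passage.txt file as the long input, and a crude split into far context, near context (last 50 words before the suffix), and a 30-word evaluated suffix.

```python
# Sketch of a context-perturbation test: shuffle the words of the distant
# context and check how much the loss on the final span changes.
import random

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def loss_on_suffix(context_words, suffix_words):
    # Assumes the context tokenization is a prefix of the full tokenization.
    ids = tokenizer(" ".join(context_words + suffix_words), return_tensors="pt").input_ids
    n_ctx = tokenizer(" ".join(context_words), return_tensors="pt").input_ids.size(1)
    with torch.no_grad():
        logits = model(ids).logits
    # Average NLL of the suffix tokens only, conditioned on everything before them.
    targets = ids[0, n_ctx:]
    preds = logits[0, n_ctx - 1:-1]
    return torch.nn.functional.cross_entropy(preds, targets).item()

words = open("passage.txt").read().split()     # any long passage (hypothetical file)
far, near, suffix = words[:-80], words[-80:-30], words[-30:]

shuffled_far = random.sample(far, len(far))    # destroy word order in the far context
print("intact far context  :", loss_on_suffix(far + near, suffix))
print("shuffled far context:", loss_on_suffix(shuffled_far + near, suffix))
# If the two losses are close, the model is effectively treating the
# far-away history as a bag of words.
```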
Prediction explanations: what in the input led to this output?
Prediction explanations: simple saliency maps
For words x_1, …, x_n and the model’s score for a given class (output label) s_c(x_1, …, x_n), take the norm of the gradient of the score with respect to each word:
salience(x_i) = ‖∇_{x_i} s_c(x_1, …, x_n)‖
Idea: high gradient norm means changing that word (locally) would affect the score a lot
[Figure: loss plotted over word space; steep regions correspond to high saliency, flat regions to low saliency]
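A minimal sketch of this gradient-norm saliency computation for a sequence classifier; the checkpoint name is illustrative, and the gradient is taken with respect to the input embeddings.

```python
# Minimal sketch of gradient-norm saliency: gradient of the predicted class
# score w.r.t. each input embedding, with its norm used as the word's saliency.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "distilbert-base-uncased-finetuned-sst-2-english"  # any classifier
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name).eval()

sentence = "The chef who made the pizzas is here."
inputs = tokenizer(sentence, return_tensors="pt")

# Embed the tokens ourselves so we can ask for gradients w.r.t. the embeddings.
embeds = model.get_input_embeddings()(inputs.input_ids)
embeds = embeds.detach().requires_grad_(True)

logits = model(inputs_embeds=embeds, attention_mask=inputs.attention_mask).logits
pred_class = logits.argmax(-1).item()
score = logits[0, pred_class]                 # score of the predicted class
score.backward()

saliency = embeds.grad.norm(dim=-1).squeeze(0)   # one gradient norm per token
tokens = tokenizer.convert_ids_to_tokens(inputs.input_ids[0].tolist())
for token, s in zip(tokens, saliency.tolist()):
    print(f"{token:12s} {s:.4f}")
```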
Prediction explanations: simple saliency maps
For words x_1, …, x_n and the model’s score for a given class (output label) s_c(x_1, …, x_n), take the norm of the gradient of the score with respect to each word:
salience(x_i) = ‖∇_{x_i} s_c(x_1, …, x_n)‖
Not a perfect method for saliency; many more methods have been proposed. One issue: linear approximation may not hold well!
[Figure: loss plotted over word space; a point may have low saliency according to the local gradient, yet moving a little farther makes the loss skyrocket]
BlenderBot
Who is the Italian prime minister?
[BlenderBot]: It is Mussolini. He was the leader of the fascist party during WWII.
Is he still alive?
[BlenderBot]: Yes, he is still alive. He has been in politics since 1989. He also has a son.
Who is his son?
[BlenderBot]: I do not know his name, but he was born in Italy and moved to the United States when he was very young.
Who is Enrico Fermi?
[BlenderBot]: I am not sure who that is, but I do know that Ferrari is an automobile manufacturer.
Who was Alan Turing?
[BlenderBot]: He was an American computer scientist who invented the Turing Machine. He was a genius.
What did he do during WWII?
[BlenderBot]: He worked on the Manhattan Project, which was a project to develop nuclear weapons.
What is the population of Michigan?
[BlenderBot]: I do not know the exact number, but it is one of the most populous states in the US.
https://ai.facebook.com/blog/blender-bot-2-an-open-source-chatbot-that-builds-long-term-memory-and-searches-the-internet/
What is the population of California?
[BlenderBot]: The state of California has a population of over 8,537,673 people.
(Well: it is 39.5 million)
Explanation by input reduction
What is the smallest part of the input I could keep and still get the same answer? An example from SQuAD:
Passage: In 1899, John Jacob Astor IV invested $100,000 for Tesla to further develop and produce a new lighting system. Instead, Tesla used the money to fund his Colorado Springs experiments.
Original Question: What did Tesla spend Astor’s money on? [prediction]
Reduced Question: did
In this example, the model had confidence 0.78 for the original question, and the same answer at confidence 0.91 for the reduced question!
A method for explanation by input reduction
Idea: run an input saliency method. Iteratively remove the least important words.
Passage: The Panthers used the San Jose State practice facility and stayed at the San Jose Marriott. The Broncos practiced at Stanford University and stayed at the Santa Clara Marriott.
Original Question: Where did the Broncos practice for the Super Bowl?
Where did the practice for the Super Bowl?
Where did practice for the Super Bowl?
Where did practice the Super Bowl?
Where did practice the Super?
Where did practice Super?
did practice Super?
[prediction]
[Note: beam search to find k least important words is an important addition]
Steps of input reduction
Only here did the model stop being confident in the answer
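A greedy sketch of the procedure (without the beam search mentioned above); predict and saliency are hypothetical helpers, e.g. a QA model’s argmax answer and the gradient-norm saliency sketched earlier, with saliency returning one score per whitespace-separated question word.

```python
# Greedy input reduction: repeatedly drop the lowest-saliency question word
# as long as the model keeps predicting the same answer.
def input_reduction(question_words, passage, predict, saliency):
    original_answer = predict(" ".join(question_words), passage)
    while len(question_words) > 1:
        scores = saliency(" ".join(question_words), passage)   # one score per word
        candidate = list(question_words)
        del candidate[min(range(len(scores)), key=scores.__getitem__)]
        if predict(" ".join(candidate), passage) != original_answer:
            break                       # stop once the prediction changes
        question_words = candidate
    return question_words

# e.g. input_reduction("What did Tesla spend Astor's money on?".split(),
#                      passage, predict, saliency)
# might reduce the question all the way down to ["did"].
```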
Analyzing models by breaking them
Idea: Can we break models by making seemingly innocuous changes to the input?
Passage: Peyton Manning became the first quarterback ever to lead two different teams to multiple Super Bowls. He is also the oldest quarterback ever to play in a Super Bowl at age 39. The past record was held by John Elway, who led the Broncos to victory in Super Bowl XXXIII at age 38…
Question: What was the name of the quarterback who was 38 in Super Bowl XXXIII?
[prediction]
Looks good!
Analyzing models by breaking them
Idea: Can we break models by making seemingly innocuous changes to the input?
Passage: Peyton Manning became the first quarterback ever to lead two different teams to multiple Super Bowls. He is also the oldest quarterback ever to play in a Super Bowl at age 39. The past record was held by John Elway, who led the Broncos to victory in Super Bowl XXXIII at age 38… Quarterback Jeff Dean had jersey number 37 in Champ Bowl XXXIV.
Question: What was the name of the quarterback who was 38 in Super Bowl XXXIII?
[prediction]
The appended sentence hasn’t changed the correct answer, but the model’s prediction changed!
So, seems like the model wasn’t performing question answering as we’d like?
Analyzing models by breaking them
Idea: Can we break models by making seemingly innocuous changes to the input?
This model’s predictions look good!
This typo is annoying, but a reasonable human might ignore it.
Changing what to what’s should never change the answer!
Analysis of “interpretable” architecture components
Idea: Some modeling components lend themselves to inspection.
For example, can we try to characterize each attention head of BERT?
Attention head 1 of layer 1.
This head performs this kind of behavior on most sentences.
[Why is “interpretable” in quotes? It’s hard to tell exactly how/whether the model is performing an interpretable function, especially deep in the network.]
Analysis of “interpretable” architecture components
Idea: Some modeling components lend themselves to inspection.
Some attention heads seem to perform simple operations.
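A sketch of how such an inspection can be done with a Hugging Face BERT model: ask for the attention weights and, for one arbitrarily chosen layer/head, print which token each token attends to most strongly.

```python
# Sketch: inspect one attention head of BERT.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True).eval()

sentence = "The chef who made the pizzas is here."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    attentions = model(**inputs).attentions   # tuple: one tensor per layer

layer, head = 7, 9                             # arbitrary head to inspect
attn = attentions[layer][0, head]              # [seq_len, seq_len] attention matrix
tokens = tokenizer.convert_ids_to_tokens(inputs.input_ids[0].tolist())

for i, token in enumerate(tokens):
    j = attn[i].argmax().item()                # token attended to most strongly
    print(f"{token:10s} -> {tokens[j]:10s} ({attn[i, j].item():.2f})")
```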
Analysis of “interpretable” architecture components
Idea: Some modeling components lend themselves to inspection.
Some heads are correlated with linguistic properties!
Approach: approximate interpretation + quantitative analysis of model behavior.
Analysis of “interpretable” architecture components
Idea: Some modeling components lend themselves to inspection.
We saw coreference before; one head often matches coreferent mentions!
Approach: approximate interpretation + quantitative analysis of model behavior.
Probes
Probing: supervised analysis of neural networks
Premise: pretrained Transformers provide surprisingly good general-purpose language representations.
Question: what do their representations encode about language?
Probing: supervised analysis of neural networks
Question: what do pretrained representations encode about linguistic properties for which we have annotated data?
Probing: supervised analysis of neural networks
BERT (and other pretrained LMs) make some linguistic properties predictable to very high accuracy with a simple linear probe.
Examples: syntactic roles, part-of-speech, named entity recognition.
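A sketch of a linear probe under toy assumptions: frozen bert-base-uncased features from one middle layer, a handful of hand-labelled POS tags standing in for a real treebank, and scikit-learn logistic regression as the probe.

```python
# Sketch of a linear probe: freeze BERT, take one layer's token vectors, and
# fit a logistic-regression classifier to predict a token-level label (POS).
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoTokenizer, BertModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True).eval()

# Toy annotated data (hypothetical); a real probe would use a treebank.
tokens = ["the", "chef", "made", "the", "pizzas"]
pos_tags = ["DET", "NOUN", "VERB", "DET", "NOUN"]

inputs = tokenizer(tokens, is_split_into_words=True, return_tensors="pt")
with torch.no_grad():
    hidden_states = model(**inputs).hidden_states   # tuple: embeddings + each layer

layer = 7                                           # probe one middle layer
word_ids = inputs.word_ids()                        # maps subwords back to words
feats, labels = [], []
for pos, word_id in enumerate(word_ids):
    if word_id is not None:                         # skip [CLS]/[SEP]
        feats.append(hidden_states[layer][0, pos].numpy())
        labels.append(pos_tags[word_id])

probe = LogisticRegression(max_iter=1000).fit(feats, labels)
print(probe.score(feats, labels))                   # training accuracy of the probe
```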
Layerwise trends of probing accuracy
Probing accuracy is consistently best a bit past the midpoint of the network’s depth.
[Figure: network layers, from the input words at the bottom up to the MLM objective at the top]
Layerwise trends of probing accuracy
More abstract linguistic properties are best predicted at increasing depth in the network.
Emergent simple structure in neural networks
[Figure: word vectors in 2-D, with state/capital pairs (California/Sacramento, Pennsylvania/Harrisburg) and related words (cat, kitty, guitar)]
Some relationships are encoded as linear offsets
We interpret cosine similarity as semantic similarity.
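A toy illustration of both ideas (linear offsets and cosine similarity) with made-up vectors; real embeddings would come from a trained model.

```python
# Toy illustration of linear offsets and cosine similarity in a word-vector space.
import numpy as np

vec = {                                   # hypothetical 3-d embeddings
    "California":   np.array([1.0, 0.2, 0.1]),
    "Sacramento":   np.array([1.0, 0.9, 0.1]),
    "Pennsylvania": np.array([0.2, 0.2, 1.0]),
    "Harrisburg":   np.array([0.2, 0.9, 1.0]),
    "cat":          np.array([0.1, 0.1, 0.3]),
    "kitty":        np.array([0.12, 0.1, 0.28]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# "capital-of" as a roughly constant offset:
offset = vec["Sacramento"] - vec["California"]
predicted_harrisburg = vec["Pennsylvania"] + offset
print(cosine(predicted_harrisburg, vec["Harrisburg"]))   # close to 1
print(cosine(vec["cat"], vec["kitty"]))                  # high: similar meanings
print(cosine(vec["cat"], vec["California"]))             # lower: unrelated words
```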
Syntax Probes
The chef who ran to the store was out of food
Human languages, numerical machines
The meaning of a sentence is constructed by composing small chunks of words together with each other, obtaining successively larger chunks with more complex meanings
Word Embedding Representation: words as vectors in a vector space.
Are these views of language reconcilable?
The structural probe is a method for finding tree structures in these vector spaces; it shows the surprising extent to which ELMo and BERT encode human-like parse trees.
Contextual representations of language
Beyond “words in context”
In order to perform language modeling well, with enough data, some high-level language information seems to be needed.
Consider the first sentence of a story: The chef who went to the store was out of food.
The string “the store was out of food” appears in the sentence, yet the verb “was” agrees with “chef”, not with the adjacent “store”. Knowing that the relevant pair is (chef, was), and not (store, was), may help the model predict the verb.
What did my neural network learn along the way?
Observational evidence
An observational network study evaluates the model at the task it was optimized for, often hand-crafting inputs to determine whether a given desired behavior is observed.
Recall Linzen et al., 2016, which tests number agreement between subjects and verbs.
The structural probe
We think of each sentence as having a latent parse tree, which the neural network does not have access to. In the dependency parsing formalism, each word in the sentence has a corresponding node in the parse tree.
A dependency parse tree looks like this:
Trees as distances and norms
Our first intuition is that vector spaces and graphs both have natural distance metrics.
For a parse tree, we have the path metric, d(w_i, w_j), which is the number of edges in the path between the two words in the tree.
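A small sketch of the path metric on an illustrative, approximate dependency tree for the example sentence, using networkx; the edge list is only meant to convey the idea, not a gold parse.

```python
# Sketch: the path metric d(w_i, w_j) = number of edges between two words.
import networkx as nx

# Approximate, illustrative dependency tree (as undirected edges between word
# indices) for "The chef who ran to the store was out of food":
# 0 The, 1 chef, 2 who, 3 ran, 4 to, 5 the, 6 store, 7 was, 8 out, 9 of, 10 food
edges = [(7, 1), (1, 0), (1, 3), (3, 2), (3, 6), (6, 4), (6, 5), (7, 8), (8, 9), (9, 10)]
tree = nx.Graph(edges)

d = dict(nx.all_pairs_shortest_path_length(tree))
print(d[1][7])   # "chef" and "was": 1 edge apart in the tree
print(d[6][7])   # "store" and "was": 3 edges apart, despite being adjacent in the string
```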
Syntax Distance Hypothesis
The syntax distance hypothesis: there exists a linear transformation B of the word representation space under which vector distance encodes parse trees. Equivalently, there exists an inner product on the word representation space such that distance under the inner product encodes parse trees. This (positive semi-definite) inner product is specified by BᵀB.
2-D Visualization
The distances we pointed out earlier between chef, store, and was can be visualized in a vector space as follows, where B ∈ ℝ^{2×3} maps 3-dimensional word representations to a 2-dimensional space encoding syntax.
Minimum Spanning Tree
After the linear transformation, however, taking a minimum spanning tree on the pairwise distances recovers the parse tree, as shown in the image.
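A sketch of this tree-recovery step with scipy’s minimum spanning tree, applied to a made-up matrix of predicted pairwise distances.

```python
# Sketch: recover a tree from pairwise probe distances by taking a minimum
# spanning tree over the complete graph of words.
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

# dist[i, j] = predicted squared distance between words i and j
# (in practice: ||B(h_i - h_j)||^2 from the structural probe). Values are made up.
dist = np.array([
    [0.0, 1.1, 2.2, 1.9],
    [1.1, 0.0, 1.0, 0.9],
    [2.2, 1.0, 0.0, 2.1],
    [1.9, 0.9, 2.1, 0.0],
])

mst = minimum_spanning_tree(dist).toarray()
edges = [(int(i), int(j)) for i, j in zip(*np.nonzero(mst))]
print(edges)   # the predicted (undirected) parse-tree edges
```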
Finding a parse tree-encoding distance metric
Finding B: B is learned by gradient descent so that the squared probe distance ‖B(h_i - h_j)‖² matches the tree path distance d(w_i, w_j) as closely as possible.
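A sketch of learning B along the lines of Hewitt & Manning (2019): match squared probe distance to tree path distance with an L1 loss. The representations and “gold” distances below are random placeholders; a real probe is trained on a treebank with frozen LM vectors.

```python
# Sketch of training the structural probe's matrix B.
import torch

hidden_dim, probe_rank, seq_len = 768, 64, 12
torch.manual_seed(0)
h = torch.randn(seq_len, hidden_dim)                 # frozen LM word vectors (placeholder)
gold = torch.randint(1, 6, (seq_len, seq_len)).float()
gold = torch.triu(gold, 1)
gold = gold + gold.T                                 # symmetric placeholder "tree" distances

B = torch.nn.Parameter(torch.randn(probe_rank, hidden_dim) * 0.01)
optimizer = torch.optim.Adam([B], lr=1e-3)

def probe_distances(B, h):
    """d_B(i, j)^2 = ||B(h_i - h_j)||^2 for every pair of words."""
    th = h @ B.T                                     # [seq_len, probe_rank]
    diff = th.unsqueeze(1) - th.unsqueeze(0)         # [seq_len, seq_len, probe_rank]
    return (diff ** 2).sum(-1)

for step in range(200):
    optimizer.zero_grad()
    # L1 loss between tree path distance and squared probe distance.
    loss = (probe_distances(B, h) - gold).abs().mean()
    loss.backward()
    optimizer.step()
print(loss.item())
```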
Reconstructed trees and depth
Gold parse trees (black, above the sentences), along with the minimum spanning trees of the predicted distance metrics for the same sentence (blue, red, purple, below the sentence):
Depth
Depths in the gold parse tree (grey, circle), as well as predicted (squared) parse depths according to ELMo1 (red, triangle) and BERT-large, layer 16 (blue, square).
Parse Distance Matrix
Subject-verb number agreement
Distance matrices and minimum spanning trees predicted by a structural probe on BERT-large, layer 16, for 4 sentences:
Ungrammatical Sentence
What happens when we give BERT and the structural probe an ungrammatical sentence, where the verb form is plural but must still refer back to a singular subject? We still see largely the same behavior, suggesting the model is not simply matching the verb to the nearest noun of the same number:
LLM Evaluation
Performance vs Fine-tuned SotA
Conclusions