CS458 Natural Language Processing
Lecture 4: Intro to N-gram Language Models
Krishnendu Ghosh
Department of Computer Science & Engineering
Indian Institute of Information Technology Dharwad
Predicting words: which words are likely to come next? (e.g., "The water of Walden Pond is so beautifully ___")
*refrigerator
*that
blue
green
clear
(* marks implausible continuations)
Language Models
Why word prediction?
It's a helpful part of language tasks
Their are two midterms → There are two midterms (grammar correction)
Everything has improve → Everything has improved (spelling/grammar correction)
I will be back soonish vs. I will be bassoon dish (speech recognition: the LM prefers the first)
Why word prediction?
It's how large language models (LLMs) work!
LLMs are trained to predict words
LLMs generate text by predicting words
Language Modeling (LM) more formally
Goal: compute the probability of a sequence of words: P(W) = P(w1,w2,w3,w4,w5…wn)
Related task: probability of an upcoming word: P(w5|w1,w2,w3,w4), or in general P(wn|w1,w2…wn-1)
A model that computes either P(W) or P(wn|w1,w2…wn-1) is called a language model
How to estimate these probabilities: could we just count and divide?
P(blue | The water of Walden Pond is so beautifully) = Count(The water of Walden Pond is so beautifully blue) / Count(The water of Walden Pond is so beautifully)
How to compute P(W) or P(wn|w1, …wn-1)
P(The, water, of, Walden, Pond, is, so, beautifully, blue)
Reminder: The Chain Rule
P(B|A) = P(A,B)/P(A) Rewriting: P(A,B) = P(A) P(B|A)
P(A,B,C,D) = P(A) P(B|A) P(C|A,B) P(D|A,B,C)
P(x1,x2,x3,…,xn) = P(x1)P(x2|x1)P(x3|x1,x2)…P(xn|x1,…,xn-1)
The Chain Rule applied to compute joint probability of words in sentence
P(“The water of Walden Pond”) =
P(The) × P(water|The) × P(of|The water)
× P(Walden|The water of) × P(Pond|The water of Walden)
Markov Assumption
Andrei Markov (photo: Wikimedia Commons)
Approximate the history by just the last word:
P(blue | The water of Walden Pond is so beautifully) ≈ P(blue | beautifully)
Bigram Markov Assumption
More generally, we approximate each component in the product: P(wn | w1, w2, …, wn-1) ≈ P(wn | wn-1)
Simplest case: Unigram model, which ignores the context entirely: P(w1, w2, …, wn) ≈ P(w1) × P(w2) × … × P(wn)
To him swallowed confess hear both . Which . Of save on trail for are ay device and rote life have
Hill he late speaks ; or ! a more to leg less first you enter
Months the my and issue of year foreign new exchange’s September
were recession exchange new endorsed a acquire to six executives
Some automatically generated sentences from two different unigram models
Bigram model
Some automatically generated sentences from two different bigram models
Why dost stand forth thy canopy, forsooth; he is this palpable hit the King Henry. Live king. Follow.
What means, sir. I confess she? then all sorts, he is trim, captain.
Last December through the way to preserve the Hudson corporation N. B. E. C. Taylor would seem to complete the major central planners one gram point five percent of U. S. E. has already old M. X. corporation of living
on information such as more frequently fishing to keep her
Problems with N-gram models: they cannot capture long-distance dependencies.
"The soups that I made from that new cookbook I bought yesterday were amazingly delicious." (the verb "were" must agree with "soups", with ten words intervening, which is beyond any practical N-gram window)
The solution: Large language models
Why N-gram models?
A nice clear paradigm that lets us introduce many of the important issues for large language models
Estimating N-gram Probabilities
Estimating bigram probabilities: the Maximum Likelihood Estimate is P(wi | wi-1) = count(wi-1, wi) / count(wi-1)
An example
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
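To make the estimation concrete, here is a minimal Python sketch (my own illustration, not from the lecture) that computes the MLE bigram estimates for the toy corpus above, e.g. P(I|<s>) = 2/3 and P(Sam|am) = 1/2:

```python
# Minimal sketch: MLE bigram estimates, P(w | prev) = count(prev, w) / count(prev)
from collections import Counter

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def p_mle(word, prev):
    """MLE bigram probability P(word | prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(p_mle("I", "<s>"))     # 2/3
print(p_mle("Sam", "<s>"))   # 1/3
print(p_mle("am", "I"))      # 2/3
print(p_mle("Sam", "am"))    # 1/2
print(p_mle("</s>", "Sam"))  # 1/2
```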
More examples: Berkeley Restaurant Project sentences
Raw bigram counts
Raw bigram probabilities
Bigram estimates of sentence probabilities
P(<s> I want english food </s>) =
P(I|<s>)
× P(want|I)
× P(english|want)
× P(food|english)
× P(</s>|food)
= .000031
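The product above can be reproduced with a short sketch. The individual bigram probabilities below are assumed illustrative values for the Berkeley Restaurant Project table (the table itself is not reproduced here); only their product, about .000031, appears on the slide:

```python
# Sketch: sentence probability as a product of bigram probabilities.
# The individual probabilities are assumed illustrative values; only the
# final product (~.000031) is given on the slide above.
bigram_p = {
    ("<s>", "I"): 0.25,
    ("I", "want"): 0.33,
    ("want", "english"): 0.0011,
    ("english", "food"): 0.5,
    ("food", "</s>"): 0.68,
}

sentence = ["<s>", "I", "want", "english", "food", "</s>"]
prob = 1.0
for prev, word in zip(sentence, sentence[1:]):
    prob *= bigram_p[(prev, word)]

print(prob)  # ~3.1e-05, i.e. about .000031
```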
What kinds of knowledge do N-grams represent?
Dealing with scale in large n-grams
Multiplying many small probabilities risks numerical underflow, so we do everything in log space: store log probabilities and add them instead of multiplying.
If we need probabilities, we can do one exp at the end.
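A minimal sketch of the log-space trick (reusing the illustrative bigram probabilities from the example above):

```python
# Sketch: sum log probabilities instead of multiplying probabilities,
# to avoid numerical underflow; exponentiate once only if needed.
import math

bigram_probs = [0.25, 0.33, 0.0011, 0.5, 0.68]  # illustrative values

log_p = sum(math.log(p) for p in bigram_probs)  # add in log space
print(log_p)            # log probability of the sentence
print(math.exp(log_p))  # one exp at the end -> ~3.1e-05
```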
Larger n-grams
Newest model: infini-grams (∞-grams) (Liu et al., 2024)
N-gram LM Toolkits
Evaluation and Perplexity
How to evaluate N-gram models
To compare models A and B
Intrinsic (in-vitro) evaluation
Training sets and test sets
We train parameters of our model on a training set.
We test the model’s performance on data we haven’t seen.
Choosing training and test sets
Training on the test set
We can’t allow test sentences into the training set
This is called “Training on the test set”
Bad science!
Dev sets
Intuition of perplexity as evaluation metric: How good is our language model?
Intuition: A good LM prefers "real" sentences
Intuition of perplexity 2: Predicting upcoming words
The Shannon Game: How well can we predict the next word?
Unigrams are terrible at this game (Why? They ignore the context entirely, so they assign the same distribution no matter what words came before)
time 0.9
dream 0.03
midnight 0.02
…
and 1e-100
Claude Shannon (picture credit: Historiska bildsamlingen, https://creativecommons.org/licenses/by/2.0/)
A good LM is one that assigns a higher probability to the next word that actually occurs
Intuition of perplexity 3: The best language model is one that best predicts the entire unseen test set
Intuition of perplexity 4: Use perplexity instead of raw probability
Intuition of perplexity 5: the inverse
Perplexity is the inverse probability of the test set, normalized by the number of words: perplexity(W) = P(w1 w2 … wN)^(-1/N)
(The inverse comes from the original definition of perplexity from cross-entropy rate in information theory)
Probability range is [0,1], perplexity range is [1,∞]
Minimizing perplexity is the same as maximizing probability
Intuition of perplexity 6: N-grams
Chain rule: perplexity(W) = (∏i 1 / P(wi | w1 … wi-1))^(1/N)
Bigrams: perplexity(W) = (∏i 1 / P(wi | wi-1))^(1/N)
Intuition of perplexity 7: Weighted average branching factor
Perplexity is also the weighted average branching factor of a language.
Branching factor: number of possible next words that can follow any word
Example: Deterministic language L = {red,blue,green}
Branching factor = 3 (any word can be followed by red, blue, green)
Now assume LM A where each word follows any other word with equal probability ⅓
Given a test set T = "red red red red blue"
Perplexity_A(T) = P_A(red red red red blue)^(-1/5) = ((1/3)^5)^(-1/5) = (1/3)^(-1) = 3
Intuition of perplexity 7: Weighted average branching factor (continued)
Now assume LM B, where P(red) = 0.8 and P(blue) = P(green) = 0.1 (red was far more frequent in training):
Perplexity_B(T) = P_B(red red red red blue)^(-1/5)
= (0.8 × 0.8 × 0.8 × 0.8 × 0.1)^(-1/5)
= 0.04096^(-1/5)
= 0.527^(-1)
= 1.89
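Both worked examples can be checked with a short sketch (my own illustration; model B's probabilities are the ones used in the computation above):

```python
# Sketch: perplexity of T = "red red red red blue" under two unigram models.
import math

T = "red red red red blue".split()
model_A = {"red": 1/3, "blue": 1/3, "green": 1/3}
model_B = {"red": 0.8, "blue": 0.1, "green": 0.1}

def perplexity(model, words):
    """PP(W) = P(w1..wN)^(-1/N), computed in log space for stability."""
    log_p = sum(math.log(model[w]) for w in words)
    return math.exp(-log_p / len(words))

print(perplexity(model_A, T))  # 3.0
print(perplexity(model_B, T))  # ~1.89
```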
Holding test set constant: Lower perplexity = better language model
N-gram Order | Unigram | Bigram | Trigram |
Perplexity | 962 | 170 | 109 |
Sampling and Generalization
The Shannon (1948) Visualization Method: Sample words from an LM
REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME CAN DIFFERENT NATURAL HERE HE THE A IN CAME THE TO OF TO EXPERT GRAY COME TO FURNISHES THE LINE MESSAGE HAD BE THESE.
THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED.
Claude Shannon
How Shannon sampled those words in 1948
"Open a book at random and select a letter at random on the page. This letter is recorded. The book is then opened to another page and one reads until this letter is encountered. The succeeding letter is then recorded. Turning to another page this second letter is searched for and the succeeding letter recorded, etc."
Sampling a word from a distribution
Visualizing Bigrams the Shannon Way
<s> I
I want
want to
to eat
eat Chinese
Chinese food
food </s>
I want to eat Chinese food
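A minimal sketch of this sampling loop; the bigram distribution below is made up for illustration (it is not the real restaurant-corpus model):

```python
# Sketch: generate a sentence the Shannon way, sampling each next word from
# the bigram distribution conditioned on the previous word until </s>.
import random

bigram_dist = {  # made-up illustrative probabilities
    "<s>":     {"I": 0.6, "you": 0.4},
    "I":       {"want": 0.7, "like": 0.3},
    "you":     {"want": 0.5, "like": 0.5},
    "want":    {"to": 1.0},
    "like":    {"to": 1.0},
    "to":      {"eat": 1.0},
    "eat":     {"Chinese": 0.5, "British": 0.5},
    "Chinese": {"food": 1.0},
    "British": {"food": 1.0},
    "food":    {"</s>": 1.0},
}

def generate():
    word, sentence = "<s>", []
    while True:
        choices = bigram_dist[word]
        word = random.choices(list(choices), weights=list(choices.values()))[0]
        if word == "</s>":
            return " ".join(sentence)
        sentence.append(word)

print(generate())  # e.g. "I want to eat Chinese food"
```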
Note: there are other sampling methods
Used for neural language models
Many of them avoid generating words from the very unlikely tail of the distribution
We'll discuss these when we get to neural LM decoding.
Approximating Shakespeare
Shakespeare as corpus
N = 884,647 tokens, V = 29,066 word types
Shakespeare produced about 300,000 bigram types out of V² ≈ 844 million possible bigrams, so 99.96% of the possible bigrams were never seen.
The Wall Street Journal is not Shakespeare
Can you guess the author? These 3-gram sentences are sampled from an LM trained on who?
1) They also point to ninety nine point six billion dollars from two hundred four oh six three percent of the rates of interest stores as Mexico and gram Brazil on market conditions
2) This shall forbid it should be branded, if renown made it empty.
3) “You are uniformly charming!” cried he, with a smile of associating and now and then I bowed and they perceived a chaise and four to wish for.
Choosing training data
If task-specific, use a training corpus that has a similar genre to your task.
Make sure to cover different kinds of dialects and speakers/authors.
The perils of overfitting
N-grams only work well for word prediction if the test corpus looks like the training corpus
One kind of generalization: Zeros
Zeros: bigrams that appear in the test set but never appeared in the training set.
Training set:
… ate lunch
… ate dinner
… ate a
… ate the
Test set:
… ate lunch
… ate breakfast
P(“breakfast” | ate) = 0
Zero probability bigrams
Bigrams with zero probability mean we will assign probability 0 to the whole test set!
And hence we cannot compute perplexity (we can’t divide by 0)!
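A tiny sketch of why a single zero breaks perplexity (the nonzero values are arbitrary illustrative probabilities):

```python
# Sketch: one unseen bigram (probability 0) zeroes out the test-set
# probability, so perplexity = P(test)^(-1/N) is undefined/infinite.
import math

test_bigram_probs = [0.25, 0.33, 0.0, 0.5]  # one zero from an unseen bigram
N = len(test_bigram_probs)

p_test = math.prod(test_bigram_probs)  # 0.0
pp = float("inf") if p_test == 0 else p_test ** (-1 / N)
print(p_test, pp)  # 0.0 inf
```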
Smoothing, Interpolation, and Backoff
The intuition of smoothing (from Dan Klein)
When we have sparse statistics, e.g. counts for P(w | denied the):
3 allegations
2 reports
1 claims
1 request
= 7 total
Smoothing steals probability mass to generalize better, so P(w | denied the) becomes:
2.5 allegations
1.5 reports
0.5 claims
0.5 request
2 other
= 7 total
[Figure: probability mass over next words (allegations, reports, claims, attack, request, man, outcome, …) before and after smoothing]
Add-one (Laplace) estimation: pretend we saw each bigram one more time than we did, i.e. add 1 to all bigram counts: P_Add-1(wi | wi-1) = (count(wi-1, wi) + 1) / (count(wi-1) + V), where V is the vocabulary size
Maximum Likelihood Estimates: the unsmoothed estimate P_MLE(wi | wi-1) = count(wi-1, wi) / count(wi-1) is the one that maximizes the likelihood of the training set
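A minimal sketch contrasting the two estimates on the toy Sam corpus from earlier (my own illustration):

```python
# Sketch: MLE vs. add-one (Laplace) smoothed bigram estimates.
from collections import Counter

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    toks = sent.split()
    unigrams.update(toks)
    bigrams.update(zip(toks, toks[1:]))

V = len(unigrams)  # vocabulary size

def p_mle(word, prev):
    return bigrams[(prev, word)] / unigrams[prev]

def p_add1(word, prev):
    # Add 1 to every bigram count and V to the denominator,
    # so the distribution still sums to 1.
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

print(p_mle("am", "I"), p_add1("am", "I"))    # 0.667 vs. 3/(3+V)
print(p_mle("ham", "I"), p_add1("ham", "I"))  # 0.0 (unseen) vs. 1/(3+V)
```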
Berkeley Restaurant Corpus: Laplace smoothed bigram counts
Laplace-smoothed bigrams
Reconstituted counts
Compare with raw bigram counts
Add-1 estimation is a blunt instrument
Backoff and Interpolation
Backoff: use the trigram if there is good evidence for it, otherwise back off to the bigram, otherwise the unigram.
Interpolation: always mix the unigram, bigram, and trigram estimates.
Linear Interpolation: P̂(wn | wn-2, wn-1) = λ1 P(wn) + λ2 P(wn | wn-1) + λ3 P(wn | wn-2, wn-1), with the λs summing to 1
How to set the λs for interpolation? Use a held-out corpus: split the data into Training Data | Held-Out Data | Test Data, then choose the λs that maximize the probability of the held-out data.
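A minimal sketch of the interpolation formula; the component models p_uni, p_bi, p_tri are hypothetical already-estimated functions, and the λ values are illustrative (in practice they are tuned on the held-out data):

```python
# Sketch: linear interpolation of unigram, bigram, and trigram estimates.
def p_interp(w, w1, w2, p_uni, p_bi, p_tri, lambdas=(0.1, 0.3, 0.6)):
    """P_hat(w | w1 w2) = l1*P(w) + l2*P(w | w2) + l3*P(w | w1 w2)."""
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9  # the lambdas must sum to 1
    return l1 * p_uni(w) + l2 * p_bi(w, w2) + l3 * p_tri(w, w1, w2)
```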
Backoff
Suppose you want P(pancakes | delicious soufflé) but the trigram never occurred in training: back off and use P(pancakes | soufflé), or even P(pancakes).
Complication: we need to discount the higher-order n-grams so the probabilities don't sum to more than 1 (e.g., Katz backoff)
Stupid Backoff (Brants et al., 2007): no discounting; if the higher-order count is zero, just use the lower-order relative frequency weighted by a fixed constant (0.4). The resulting scores are not true probabilities, so we write S(w | context) rather than P.
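A minimal sketch of stupid backoff for a bigram model (the 0.4 constant follows Brants et al., 2007; everything else here is my own illustration):

```python
# Sketch: "stupid backoff" scores for a bigram model. If the bigram was seen,
# use its relative frequency; otherwise back off to the unigram relative
# frequency weighted by a constant alpha. Scores are not true probabilities,
# so we call the function S rather than P.
from collections import Counter

def make_stupid_backoff(bigram_counts: Counter, unigram_counts: Counter,
                        total_tokens: int, alpha: float = 0.4):
    def S(word, prev):
        if bigram_counts[(prev, word)] > 0:
            return bigram_counts[(prev, word)] / unigram_counts[prev]
        return alpha * unigram_counts[word] / total_tokens
    return S
```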
Thank You