1 of 72

CS458 Natural Language Processing

Lecture 4 Intro to N-gram Language Models

Krishnendu Ghosh

Department of Computer Science & Engineering

Indian Institute of Information Technology Dharwad

2 of 72

Predicting words

  • The water of Walden Pond is beautifully ...

*refrigerator

*that

blue

green

clear

3 of 72

Language Models

  • Systems that can predict upcoming words
    • Can assign a probability to each potential next word
    • Can assign a probability to a whole sentence

4 of 72

Why word prediction?

It's a helpful part of language tasks

  • Grammar or spell checking

Their are two midterms → There are two midterms

Everything has improve → Everything has improved

  • Speech recognition

I will be back soonish (not: I will be bassoon dish)

5 of 72

Why word prediction?

It's how large language models (LLMs) work!

LLMs are trained to predict words

  • Left-to-right (autoregressive) LMs learn to predict next word

LLMs generate text by predicting words

    • By predicting the next word over and over again

6 of 72

Language Modeling (LM) more formally

  • Goal: compute the probability of a sentence or sequence of words W:

P(W) = P(w1,w2,w3,w4,w5…wn)

  • Related task: probability of an upcoming word:

P(w5|w1,w2,w3,w4) or P(wn|w1,w2…wn-1)

  • An LM computes either of these:

P(W) or P(wn|w1,w2…wn-1)

7 of 72

How to estimate these probabilities

  • Could we just count and divide?

P(blue | The water of Walden Pond is so beautifully) = Count(The water of Walden Pond is so beautifully blue) / Count(The water of Walden Pond is so beautifully)

  • No! Too many possible sentences!
  • We’ll never see enough data for estimating these counts

8 of 72

How to compute P(W) or P(wn|w1, …wn-1)

  • How to compute the joint probability P(W):

P(The, water, of, Walden, Pond, is, so, beautifully, blue)

  • Intuition: let’s rely on the Chain Rule of Probability

9 of 72

Reminder: The Chain Rule

  • Recall the definition of conditional probabilities

P(B|A) = P(A,B)/P(A) Rewriting: P(A,B) = P(A) P(B|A)

  • More variables:

P(A,B,C,D) = P(A) P(B|A) P(C|A,B) P(D|A,B,C)

  • The Chain Rule in General

P(x1,x2,x3,…,xn) = P(x1)P(x2|x1)P(x3|x1,x2)…P(xn|x1,…,xn-1)

10 of 72

The Chain Rule applied to compute joint probability of words in sentence

P(“The water of Walden Pond”) =

P(The) × P(water|The) × P(of|The water)

× P(Walden|The water of) × P(Pond|The water of Walden)

11 of 72

Markov Assumption

  • Simplifying assumption (the Markov assumption): approximate the probability of the next word using only the most recent word(s) of context, e.g.

P(blue | The water of Walden Pond is so beautifully) ≈ P(blue | beautifully)

[Photo: Andrei Markov, Wikimedia Commons]

12 of 72

Bigram Markov Assumption

  • Instead of conditioning on the entire history P(wn | w1, w2, …, wn-1), condition only on the previous word: P(wn | wn-1)

More generally, we approximate each component in the product:

P(wn | w1, w2, …, wn-1) ≈ P(wn | wn-1)

13 of 72

Simplest case: Unigram model

To him swallowed confess hear both . Which . Of save on trail for are ay device and rote life have

Hill he late speaks ; or ! a more to leg less first you enter

Months the my and issue of year foreign new exchange’s September

were recession exchange new endorsed a acquire to six executives

Some automatically generated sentences from two different unigram models

14 of 72

Bigram model

Some automatically generated sentences from two different bigram models

Why dost stand forth thy canopy, forsooth; he is this palpable hit the King Henry. Live king. Follow.

What means, sir. I confess she? then all sorts, he is trim, captain.

Last December through the way to preserve the Hudson corporation N. B. E. C. Taylor would seem to complete the major central planners one gram point five percent of U. S. E. has already old M. X. corporation of living

on information such as more frequently fishing to keep her

15 of 72

Problems with N-gram models

  • N-grams can't handle long-distance dependencies:

The soups that I made from that new cookbook I bought yesterday were amazingly delicious.

  • N-grams don't do well at modeling new sequences with similar meanings

The solution: Large language models

  • can handle much longer contexts
    • because they use embedding spaces, they can model synonymy better and generate novel strings more effectively

16 of 72

Why N-gram models?

A nice clear paradigm that lets us introduce many of the important issues for large language models

  • training and test sets
  • the perplexity metric
  • sampling to generate sentences
  • ideas like interpolation and backoff

17 of 72

Estimating N-gram Probabilities

18 of 72

Estimating bigram probabilities

  • The Maximum Likelihood Estimate:

P(wn | wn-1) = C(wn-1 wn) / C(wn-1)

19 of 72

An example

<s> I am Sam </s>

<s> Sam I am </s>

<s> I do not like green eggs and ham </s>
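
A minimal sketch (assumed, not from the slides) of how the bigram MLE estimates can be computed directly from the three sentences above; the helper name bigram_mle is illustrative:

    # Count unigrams and bigrams in the toy corpus, then form MLE estimates.
    from collections import Counter

    corpus = [
        "<s> I am Sam </s>",
        "<s> Sam I am </s>",
        "<s> I do not like green eggs and ham </s>",
    ]

    unigram_counts = Counter()
    bigram_counts = Counter()
    for sent in corpus:
        tokens = sent.split()
        unigram_counts.update(tokens)
        bigram_counts.update(zip(tokens, tokens[1:]))

    def bigram_mle(w_prev, w):
        """P(w | w_prev) = C(w_prev, w) / C(w_prev)."""
        return bigram_counts[(w_prev, w)] / unigram_counts[w_prev]

    print(bigram_mle("<s>", "I"))     # 2/3
    print(bigram_mle("I", "am"))      # 2/3
    print(bigram_mle("Sam", "</s>"))  # 1/2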

20 of 72

More examples: Berkeley Restaurant Project sentences

  • can you tell me about any good cantonese restaurants close by
  • tell me about chez panisse
  • i’m looking for a good place to eat breakfast
  • when is caffe venezia open during the day

21 of 72

Raw bigram counts

  • Out of 9222 sentences

22 of 72

Raw bigram probabilities

  • Normalize by unigrams:

  • Result:

23 of 72

Bigram estimates of sentence probabilities

P(<s> I want english food </s>) =

P(I|<s>)

× P(want|I)

× P(english|want)

× P(food|english)

× P(</s>|food)

= .000031

24 of 72

What kinds of knowledge do N-grams represent?

  • P(english|want) = .0011
  • P(chinese|want) = .0065
  • P(to|want) = .66
  • P(eat | to) = .28
  • P(food | to) = 0
  • P(want | spend) = 0
  • P(i | <s>) = .25

25 of 72

Dealing with scale in large n-grams

  • LM probabilities are stored and computed in log format, i.e. log probabilities
  • This avoids underflow from multiplying many small numbers

If we need probabilities we can do one exp at the end: p1 × p2 × p3 × p4 = exp(log p1 + log p2 + log p3 + log p4)
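
As a quick illustration of why log space avoids underflow (a sketch with made-up numbers, not code from the lecture):

    import math

    probs = [1e-7] * 60                      # 60 small word probabilities

    product = 1.0
    for p in probs:
        product *= p
    print(product)                           # 0.0 -- underflows

    logprob = sum(math.log(p) for p in probs)
    print(logprob)                           # about -967.1, no underflow
    # Only exponentiate at the very end, and only if a raw probability is
    # really needed (for a very long text even that exp may underflow).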

26 of 72

Larger n-grams

  • 4-grams, 5-grams
  • Large datasets of large n-grams have been released
    • N-grams from the Corpus of Contemporary American English (COCA), 1 billion words (Davies 2020)
    • Google Web 5-grams (Franz and Brants 2006), 1 trillion words
    • Efficiency: quantize probabilities to 4–8 bits instead of an 8-byte float

Newest model: infini-grams (∞-grams) (Liu et al 2024)

    • No precomputing! Instead, store 5 trillion words of web text in suffix arrays. Can compute n-gram probabilities with any n!

27 of 72

N-gram LM Toolkits

  • SRILM
  • KenLM

28 of 72

Evaluation and Perplexity

29 of 72

How to evaluate N-gram models

  • "Extrinsic (in-vivo) Evaluation"

To compare models A and B

    • Put each model in a real task
      • Machine Translation, speech recognition, etc.
    • Run the task, get a score for A and for B
      • How many words translated correctly
      • How many words transcribed correctly
    • Compare accuracy for A and B

30 of 72

Intrinsic (in-vitro) evaluation

  • Extrinsic evaluation not always possible
    • Expensive, time-consuming
    • Doesn't always generalize to other applications
  • Intrinsic evaluation: perplexity
    • Directly measures language model performance at predicting words.
    • Doesn't necessarily correspond with real application performance
    • But gives us a single general metric for language models
    • Useful for large language models (LLMs) as well as n-grams

31 of 72

Training sets and test sets

We train parameters of our model on a training set.

We test the model’s performance on data we haven’t seen.

    • A test set is an unseen dataset; different from training set.
      • Intuition: we want to measure generalization to unseen data
    • An evaluation metric (like perplexity) tells us how well our model does on the test set.

32 of 72

Choosing training and test sets

  • If we're building an LM for a specific task
    • The test set should reflect the task language we want to use the model for
  • If we're building a general-purpose model
    • We'll need lots of different kinds of training data
    • We don't want the training set or the test set to be just from one domain or author or language.

33 of 72

Training on the test set

We can’t allow test sentences into the training set

    • Or else the LM will assign that sentence an artificially high probability when we see it in the test set
    • And hence assign the whole test set a falsely high probability.
    • Making the LM look better than it really is

This is called “Training on the test set”

Bad science!


34 of 72

Dev sets

  • If we test on the test set many times we might implicitly tune to its characteristics
    • Noticing which changes make the model better.
  • So we run on the test set only once, or a few times
  • That means we need a third dataset:
    • A development test set, or devset.
    • We test our LM on the devset until the very end
    • And then test our LM on the test set once

35 of 72

Intuition of perplexity as evaluation metric: How good is our language model?

Intuition: A good LM prefers "real" sentences

  • Assigns higher probability to “real” or “frequently observed” sentences
  • Assigns lower probability to “word salad” or “rarely observed” sentences

36 of 72

Intuition of perplexity 2: Predicting upcoming words

The Shannon Game: How well can we predict the next word?

  • Once upon a ____
  • That is a picture of a ____
  • For breakfast I ate my usual ____

Unigrams are terrible at this game. (Why?)

A sample next-word distribution (e.g., for "Once upon a ____"): time 0.9, dream 0.03, midnight 0.02, …, and 1e-100

[Photo: Claude Shannon; credit: Historiska bildsamlingen, https://creativecommons.org/licenses/by/2.0/]

A good LM is one that assigns a higher probability to the next word that actually occurs

37 of 72

Intuition of perplexity 3: The best language model is one that best predicts the entire unseen test set

  • We said: a good LM is one that assigns a higher probability to the next word that actually occurs.
  • Let's generalize to all the words!
    • The best LM assigns high probability to the entire test set.
  • When comparing two LMs, A and B
    • We compute PA(test set) and PB(test set)
    • The better LM will give a higher probability to (=be less surprised by) the test set than the other LM.

38 of 72

Intuition of perplexity 4: Use perplexity instead of raw probability

  • Probability depends on size of test set
    • Probability gets smaller the longer the text
    • Better: a metric that is per-word, normalized by length
  • Perplexity is the inverse probability of the test set, normalized by the number of words

39 of 72

Intuition of perplexity 5: the inverse

Perplexity is the inverse probability of the test set, normalized by the number of words:

perplexity(W) = P(w1 w2 … wN)^(-1/N)

(The inverse comes from the original definition of perplexity from cross-entropy rate in information theory)

Probability range is [0,1], perplexity range is [1,∞]

Minimizing perplexity is the same as maximizing probability

40 of 72

Intuition of perplexity 6: N-grams

Chain rule:

perplexity(W) = ( Π_{i=1..N} 1 / P(wi | w1 … wi-1) )^(1/N)

Bigrams:

perplexity(W) = ( Π_{i=1..N} 1 / P(wi | wi-1) )^(1/N)

41 of 72

Intuition of perplexity 7: Weighted average branching factor

Perplexity is also the weighted average branching factor of a language.

Branching factor: number of possible next words that can follow any word

Example: Deterministic language L = {red,blue,green}

Branching factor = 3 (any word can be followed by red, blue, green)

Now assume LM A where each word follows any other word with equal probability ⅓

Given a test set T = "red red red red blue"

Perplexity_A(T) = P_A(red red red red blue)^(-1/5) = ((1/3)^5)^(-1/5) = (1/3)^(-1) = 3

42 of 72

Intuition of perplexity 7: Weighted average branching factor

  • But now suppose red was very likely in training set, such that for LM B:
    • P(red) = .8, P(green) = .1, P(blue) = .1
  • We would expect the probability to be higher, and hence the perplexity to be smaller:

Perplexity_B(T) = P_B(red red red red blue)^(-1/5)

= (.8 × .8 × .8 × .8 × .1)^(-1/5)

= .04096^(-1/5)

= .527^(-1)

≈ 1.89
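
A small sketch (assumed, not taken from the slides) that reproduces both numbers, computing perplexity from per-word probabilities under this toy language's independence assumption:

    import math

    T = ["red", "red", "red", "red", "blue"]

    def perplexity(word_probs, words):
        """PP(W) = P(w1..wN)^(-1/N), computed via log probabilities."""
        logprob = sum(math.log(word_probs[w]) for w in words)
        return math.exp(-logprob / len(words))

    lm_A = {"red": 1/3, "green": 1/3, "blue": 1/3}
    lm_B = {"red": 0.8, "green": 0.1, "blue": 0.1}

    print(perplexity(lm_A, T))   # 3.0
    print(perplexity(lm_B, T))   # ~1.89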

43 of 72

Holding test set constant:�Lower perplexity = better language model

  • Training 38 million words, test 1.5 million words, WSJ

N-gram Order:   Unigram   Bigram   Trigram
Perplexity:         962      170       109

44 of 72

Sampling and Generalization

45 of 72

The Shannon (1948) Visualization Method: Sample words from an LM

  • Unigram:

REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME CAN DIFFERENT NATURAL HERE HE THE A IN CAME THE TO OF TO EXPERT GRAY COME TO FURNISHES THE LINE MESSAGE HAD BE THESE.

  • Bigram:

THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED.

Claude Shannon

46 of 72

How Shannon sampled those words in 1948

"Open a book at random and select a letter at random on the page. This letter is recorded. The book is then opened to another page and one reads until this letter is encountered. The succeeding letter is then recorded. Turning to another page this second letter is searched for and the succeeding letter recorded, etc."

47 of 72

Sampling a word from a distribution

48 of 72

Visualizing Bigrams the Shannon Way

  • Choose a random bigram (<s>, w) according to its probability p(w|<s>)
  • Now choose a random bigram (w, x) according to its probability p(x|w)
  • And so on until we choose </s>
  • Then string the words together

<s> I

I want

want to

to eat

eat Chinese

Chinese food

food </s>

I want to eat Chinese food
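
A minimal sketch of this generation loop; the tiny bigram table below is made up for illustration and is not the Berkeley Restaurant Project model:

    import random

    # Hypothetical bigram probabilities P(next | current).
    bigram_probs = {
        "<s>":     {"I": 0.7, "tell": 0.3},
        "I":       {"want": 1.0},
        "tell":    {"me": 1.0},
        "me":      {"</s>": 1.0},
        "want":    {"to": 0.8, "chinese": 0.2},
        "to":      {"eat": 1.0},
        "eat":     {"chinese": 1.0},
        "chinese": {"food": 1.0},
        "food":    {"</s>": 1.0},
    }

    def generate(max_len=20):
        word, sentence = "<s>", []
        while len(sentence) < max_len:
            nexts = bigram_probs[word]
            word = random.choices(list(nexts), weights=list(nexts.values()))[0]
            if word == "</s>":
                break
            sentence.append(word)
        return " ".join(sentence)

    print(generate())   # e.g. "I want to eat chinese food"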

49 of 72

Note: there are other sampling methods

Used for neural language models

Many of them avoid generating words from the very unlikely tail of the distribution

We'll discuss when we get to neural LM decoding:

    • Temperature sampling
    • Top-k sampling
    • Top-p sampling

50 of 72

Approximating Shakespeare

51 of 72

Shakespeare as corpus

N=884,647 tokens, V=29,066

Shakespeare produced 300,000 bigram types out of V² = 844 million possible bigrams.

    • So 99.96% of the possible bigrams were never seen (have zero entries in the table)
    • That sparsity is even worse for 4-grams, explaining why our sampling generated actual Shakespeare.

52 of 72

The Wall Street Journal is not Shakespeare

53 of 72

Can you guess the author? These 3-gram sentences are sampled from an LM trained on who?

1) They also point to ninety nine point six billion dollars from two hundred four oh six three percent of the rates of interest stores as Mexico and gram Brazil on market conditions

2) This shall forbid it should be branded, if renown made it empty.

3) “You are uniformly charming!” cried he, with a smile of associating and now and then I bowed and they perceived a chaise and four to wish for.


54 of 72

Choosing training data

If task-specific, use a training corpus that has a similar genre to your task.

    • If legal or medical, need lots of special-purpose documents

Make sure to cover different kinds of dialects and speaker/authors.

    • Example: African-American Vernacular English (AAVE)
      • One of many varieties that can be used by African Americans and others
      • Can include the auxiliary verb finna that marks immediate future tense: "My phone finna die"

55 of 72

The perils of overfitting

N-grams only work well for word prediction if the test corpus looks like the training corpus

    • But even when we try to pick a good training corpus, the test set will surprise us!
    • We need to train robust models that generalize!

One kind of generalization: Zeros

      • Things that don’t ever occur in the training set
        • But occur in the test set

56 of 72

Zeros

  • Training set:

… ate lunch

… ate dinner

… ate a

… ate the

P(“breakfast” | ate) = 0

  • Test set

… ate lunch

… ate breakfast

57 of 72

Zero probability bigrams

Bigrams with zero probability

    • Will hurt our performance for texts where those words appear!
    • And mean that we will assign 0 probability to the test set!

And hence we cannot compute perplexity (can’t divide by 0)!

58 of 72

Smoothing, Interpolation, and Backoff

59 of 72

The intuition of smoothing (from Dan Klein)

  • When we have sparse statistics:

  • Steal probability mass to generalize better

P(w | denied the), observed counts: allegations 3, reports 2, claims 1, request 1 (7 total)

P(w | denied the), after smoothing: allegations 2.5, reports 1.5, claims 0.5, request 0.5, other 2 (7 total)


60 of 72

Add-one estimation

  • Also called Laplace smoothing
  • Pretend we saw each word one more time than we did
  • Just add one to all the counts!

  • MLE estimate:

P_MLE(wi | wi-1) = c(wi-1, wi) / c(wi-1)

  • Add-1 estimate (V = vocabulary size; see the sketch below):

P_Add-1(wi | wi-1) = ( c(wi-1, wi) + 1 ) / ( c(wi-1) + V )
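
A minimal sketch of the add-1 estimate with an illustrative helper and toy counts (not the Berkeley Restaurant data):

    from collections import Counter

    def bigram_add1(w_prev, w, bigram_counts, unigram_counts):
        """P_add-1(w | w_prev) = (C(w_prev, w) + 1) / (C(w_prev) + V)."""
        V = len(unigram_counts)               # vocabulary size
        return (bigram_counts[(w_prev, w)] + 1) / (unigram_counts[w_prev] + V)

    unigrams = Counter({"<s>": 3, "I": 3, "am": 2, "Sam": 2, "</s>": 3})   # toy counts, V = 5
    bigrams  = Counter({("<s>", "I"): 2, ("I", "am"): 2, ("Sam", "</s>"): 1})

    print(bigram_add1("<s>", "I", bigrams, unigrams))    # (2 + 1) / (3 + 5) = 0.375
    print(bigram_add1("<s>", "am", bigrams, unigrams))   # (0 + 1) / (3 + 5) = 0.125; an unseen bigram is no longer zero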

61 of 72

Maximum Likelihood Estimates

  • The maximum likelihood estimate
    • of some parameter of a model M from a training set T
    • maximizes the likelihood of the training set T given the model M
  • Suppose the word “bagel” occurs 400 times in a corpus of a million words
  • What is the probability that a random word from some other text will be “bagel”?
  • MLE estimate is 400/1,000,000 = .0004
  • This may be a bad estimate for some other corpus
    • But it is the estimate that makes it most likely that “bagel” will occur 400 times in a million word corpus.

62 of 72

Berkeley Restaurant Corpus: Laplace smoothed bigram counts

63 of 72

Laplace-smoothed bigrams

64 of 72

Reconstituted counts

(Turning the smoothed probabilities back into counts shows how much probability mass add-1 moved: c*(wi-1 wi) = ( c(wi-1 wi) + 1 ) × c(wi-1) / ( c(wi-1) + V ).)

65 of 72

Compare with raw bigram counts

66 of 72

Add-1 estimation is a blunt instrument

  • So add-1 isn’t used for N-grams:
    • Generally we use interpolation or backoff instead
  • But add-1 is used to smooth other NLP models
    • For text classification
    • In domains where the number of zeros isn’t so huge.

67 of 72

Backoff and Interpolation

  • Sometimes it helps to use less context
    • Condition on less context for contexts you know less about
  • Backoff:
    • use trigram if you have good evidence,
    • otherwise bigram, otherwise unigram
  • Interpolation:
    • mix unigram, bigram, trigram

  • Interpolation works better

68 of 72

Linear Interpolation

  • Simple interpolation:

P̂(wn | wn-2 wn-1) = λ1 P(wn) + λ2 P(wn | wn-1) + λ3 P(wn | wn-2 wn-1),   with Σi λi = 1

  • Lambdas conditional on context: the λs can themselves depend on the preceding words, λi(wn-2 wn-1) (see the sketch below)
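
A minimal interpolation sketch with illustrative λ values and made-up component estimates (in practice the λs come from held-out tuning, as on the next slide):

    def interpolated_prob(w, w1, w2, p_uni, p_bi, p_tri, lambdas=(0.1, 0.3, 0.6)):
        """P_hat(w | w1 w2) = l1*P(w) + l2*P(w | w2) + l3*P(w | w1 w2)."""
        l1, l2, l3 = lambdas
        return (l1 * p_uni.get(w, 0.0)
                + l2 * p_bi.get((w2, w), 0.0)
                + l3 * p_tri.get((w1, w2, w), 0.0))

    # Made-up component estimates for P(food | want english):
    p_uni = {"food": 0.01}
    p_bi  = {("english", "food"): 0.3}
    p_tri = {}                                  # trigram "want english food" unseen
    print(interpolated_prob("food", "want", "english", p_uni, p_bi, p_tri))
    # 0.1*0.01 + 0.3*0.3 + 0.6*0.0 = 0.091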

69 of 72

How to set λs for interpolation?

  • Use a held-out corpus

  • Choose λs to maximize probability of held-out data:
    • Fix the N-gram probabilities (on the training data)
    • Then search for λs that give largest probability to held-out set

Training Data | Held-Out Data | Test Data

70 of 72

Backoff

Suppose you want:

P(pancakes| delicious soufflé)

  • If the trigram probability is 0, use the bigram
  • P(pancakes| soufflé)
  • If the bigram probability is 0, use the unigram
  • P(pancakes)

Complication: we need to discount the higher-order n-gram probabilities so that the distribution doesn't sum to more than 1 (e.g., Katz backoff)

71 of 72

Stupid Backoff

  • Backoff without discounting: the result is a score S, not a true probability

S(wi | wi-1) = count(wi-1 wi) / count(wi-1) if count(wi-1 wi) > 0, otherwise 0.4 × S(wi), where S(wi) = count(wi) / N

(see the sketch below)
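
A minimal bigram/unigram stupid-backoff sketch with toy counts; the 0.4 backoff weight follows Brants et al. (2007), everything else is illustrative:

    from collections import Counter

    def stupid_backoff(w, w_prev, bigram_counts, unigram_counts, total_words, alpha=0.4):
        """Score S(w | w_prev); falls back to a scaled unigram relative frequency."""
        if bigram_counts[(w_prev, w)] > 0:
            return bigram_counts[(w_prev, w)] / unigram_counts[w_prev]
        return alpha * unigram_counts[w] / total_words

    unigrams = Counter({"delicious": 3, "souffle": 2, "pancakes": 1})
    bigrams  = Counter({("delicious", "souffle"): 2})
    N = sum(unigrams.values())

    print(stupid_backoff("souffle", "delicious", bigrams, unigrams, N))   # 2/3, bigram seen
    print(stupid_backoff("pancakes", "souffle", bigrams, unigrams, N))    # 0.4 * 1/6 ≈ 0.067, backed off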


72 of 72

Thank You