1 of 72

CS458 Natural Language Processing

Lecture 4 Intro to N-gram Language Models

Krishnendu Ghosh

Department of Computer Science & Engineering

Indian Institute of Information Technology Dharwad

2 of 72

Predicting words

  • The water of Walden Pond is beautifully ...

*refrigerator

*that

blue

green

clear

3 of 72

Language Models

  • Systems that can predict upcoming words
    • Can assign a probability to each potential next word
    • Can assign a probability to a whole sentence

4 of 72

Why word prediction?

It's a helpful part of language tasks

  • Grammar or spell checking

Their are two midterms → There are two midterms

Everything has improve → Everything has improved

  • Speech recognition

I will be back soonish (not: I will be bassoon dish)

5 of 72

Why word prediction?

It's how large language models (LLMs) work!

LLMs are trained to predict words

  • Left-to-right (autoregressive) LMs learn to predict next word

LLMs generate text by predicting words

    • By predicting the next word over and over again

6 of 72

Language Modeling (LM) more formally

  • Goal: compute the probability of a sentence or sequence of words W:

P(W) = P(w1,w2,w3,w4,w5…wn)

  • Related task: probability of an upcoming word:

P(w5|w1,w2,w3,w4) or P(wn|w1,w2…wn-1)

  • An LM computes either of these:

P(W) or P(wn|w1,w2…wn-1)

7 of 72

How to estimate these probabilities

  • Could we just count and divide?

P(blue | The water of Walden Pond is so beautifully) = Count(The water of Walden Pond is so beautifully blue) / Count(The water of Walden Pond is so beautifully)

  • No! Too many possible sentences!
  • We’ll never see enough data for estimating these counts

8 of 72

How to compute P(W) or P(wn|w1, …wn-1)

  • How to compute the joint probability P(W):

P(The, water, of, Walden, Pond, is, so, beautifully, blue)

  • Intuition: let’s rely on the Chain Rule of Probability

9 of 72

Reminder: The Chain Rule

  • Recall the definition of conditional probabilities

P(B|A) = P(A,B)/P(A) Rewriting: P(A,B) = P(A) P(B|A)

  • More variables:

P(A,B,C,D) = P(A) P(B|A) P(C|A,B) P(D|A,B,C)

  • The Chain Rule in General

P(x1,x2,x3,…,xn) = P(x1)P(x2|x1)P(x3|x1,x2)…P(xn|x1,…,xn-1)

10 of 72

The Chain Rule applied to compute joint probability of words in sentence

P(“The water of Walden Pond”) =

P(The) × P(water|The) × P(of|The water)

× P(Walden|The water of) × P(Pond|The water of Walden)

11 of 72

Markov Assumption

  • Simplifying assumption (the Markov assumption): approximate the probability of the next word using only the most recent word(s) of context, e.g.

P(blue | The water of Walden Pond is so beautifully) ≈ P(blue | beautifully)

[Photo: Andrei Markov, Wikimedia Commons]

12 of 72

Bigram Markov Assumption

  • Instead of conditioning on the entire history P(wn | w1, w2, …, wn-1), condition only on the previous word: P(wn | wn-1)

More generally, we approximate each component in the product:

P(wn | w1, w2, …, wn-1) ≈ P(wn | wn-1)

13 of 72

Simplest case: Unigram model

To him swallowed confess hear both . Which . Of save on trail for are ay device and rote life have

Hill he late speaks ; or ! a more to leg less first you enter

Months the my and issue of year foreign new exchange’s September

were recession exchange new endorsed a acquire to six executives

Some automatically generated sentences from two different unigram models

14 of 72

Bigram model

Some automatically generated sentences from two different bigram models

Why dost stand forth thy canopy, forsooth; he is this palpable hit the King Henry. Live king. Follow.

What means, sir. I confess she? then all sorts, he is trim, captain.

Last December through the way to preserve the Hudson corporation N. B. E. C. Taylor would seem to complete the major central planners one gram point five percent of U. S. E. has already old M. X. corporation of living

on information such as more frequently fishing to keep her

15 of 72

Problems with N-gram models

  • N-grams can't handle long-distance dependencies:

The soups that I made from that new cookbook I bought yesterday were amazingly delicious.

  • N-grams don't do well at modeling new sequences with similar meanings

The solution: Large language models

  • can handle much longer contexts
    • because they use embedding spaces, they can model synonymy better and generate novel strings more effectively

16 of 72

Why N-gram models?

A nice clear paradigm that lets us introduce many of the important issues for large language models

  • training and test sets
  • the perplexity metric
  • sampling to generate sentences
  • ideas like interpolation and backoff

17 of 72

Estimating N-gram Probabilities

18 of 72

Estimating bigram probabilities

  • The Maximum Likelihood Estimate:

P(wn | wn-1) = C(wn-1 wn) / C(wn-1)

19 of 72

An example

<s> I am Sam </s>

<s> Sam I am </s>

<s> I do not like green eggs and ham </s>
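
A minimal sketch (assumed, not from the slides) of how the bigram MLE estimates can be computed directly from the three sentences above; the helper name bigram_mle is illustrative:

    # Count unigrams and bigrams in the toy corpus, then form MLE estimates.
    from collections import Counter

    corpus = [
        "<s> I am Sam </s>",
        "<s> Sam I am </s>",
        "<s> I do not like green eggs and ham </s>",
    ]

    unigram_counts = Counter()
    bigram_counts = Counter()
    for sent in corpus:
        tokens = sent.split()
        unigram_counts.update(tokens)
        bigram_counts.update(zip(tokens, tokens[1:]))

    def bigram_mle(w_prev, w):
        """P(w | w_prev) = C(w_prev, w) / C(w_prev)."""
        return bigram_counts[(w_prev, w)] / unigram_counts[w_prev]

    print(bigram_mle("<s>", "I"))     # 2/3
    print(bigram_mle("I", "am"))      # 2/3
    print(bigram_mle("Sam", "</s>"))  # 1/2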

20 of 72

More examples: Berkeley Restaurant Project sentences

  • can you tell me about any good cantonese restaurants close by
  • tell me about chez panisse
  • i’m looking for a good place to eat breakfast
  • when is caffe venezia open during the day

21 of 72

Raw bigram counts

  • Out of 9222 sentences

22 of 72

Raw bigram probabilities

  • Normalize by unigrams:

  • Result:

23 of 72

Bigram estimates of sentence probabilities

P(<s> I want english food </s>) =

P(I|<s>)

× P(want|I)

× P(english|want)

× P(food|english)

× P(</s>|food)

= .000031

24 of 72

What kinds of knowledge do N-grams represent?

  • P(english|want) = .0011
  • P(chinese|want) = .0065
  • P(to|want) = .66
  • P(eat | to) = .28
  • P(food | to) = 0
  • P(want | spend) = 0
  • P(i | <s>) = .25

25 of 72

Dealing with scale in large n-grams

  • LM probabilities are stored and computed in log format, i.e. log probabilities
  • This avoids underflow from multiplying many small numbers

If we need probabilities we can do one exp at the end: p1 × p2 × p3 × p4 = exp(log p1 + log p2 + log p3 + log p4)
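
As a quick illustration of why log space avoids underflow (a sketch with made-up numbers, not code from the lecture):

    import math

    probs = [1e-7] * 60                      # 60 small word probabilities

    product = 1.0
    for p in probs:
        product *= p
    print(product)                           # 0.0 -- underflows

    logprob = sum(math.log(p) for p in probs)
    print(logprob)                           # about -967.1, no underflow
    # Only exponentiate at the very end, and only if a raw probability is
    # really needed (for a very long text even that exp may underflow).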

26 of 72

Larger n-grams

  • 4-grams, 5-grams
  • Large datasets of large n-grams have been released
    • N-grams from the Corpus of Contemporary American English (COCA), 1 billion words (Davies 2020)
    • Google Web 5-grams (Franz and Brants 2006), 1 trillion words
    • Efficiency: quantize probabilities to 4–8 bits instead of an 8-byte float

Newest model: infini-grams (∞-grams) (Liu et al 2024)

    • No precomputing! Instead, store 5 trillion words of web text in suffix arrays. Can compute n-gram probabilities with any n!

27 of 72

N-gram LM Toolkits

  • SRILM
  • KenLM

28 of 72

Evaluation and Perplexity

29 of 72

How to evaluate N-gram models

  • "Extrinsic (in-vivo) Evaluation"

To compare models A and B

    • Put each model in a real task
      • Machine Translation, speech recognition, etc.
    • Run the task, get a score for A and for B
      • How many words translated correctly
      • How many words transcribed correctly
    • Compare accuracy for A and B

30 of 72

Intrinsic (in-vitro) evaluation

  • Extrinsic evaluation not always possible
    • Expensive, time-consuming
    • Doesn't always generalize to other applications
  • Intrinsic evaluation: perplexity
    • Directly measures language model performance at predicting words.
    • Doesn't necessarily correspond with real application performance
    • But gives us a single general metric for language models
    • Useful for large language models (LLMs) as well as n-grams

31 of 72

Training sets and test sets

We train parameters of our model on a training set.

We test the model’s performance on data we haven’t seen.

    • A test set is an unseen dataset; different from training set.
      • Intuition: we want to measure generalization to unseen data
    • An evaluation metric (like perplexity) tells us how well our model does on the test set.

32 of 72

Choosing training and test sets

  • If we're building an LM for a specific task
    • The test set should reflect the task language we want to use the model for
  • If we're building a general-purpose model
    • We'll need lots of different kinds of training data
    • We don't want the training set or the test set to be just from one domain or author or language.

33 of 72

Training on the test set

We can’t allow test sentences into the training set

    • Or else the LM will assign that sentence an artificially high probability when we see it in the test set
    • And hence assign the whole test set a falsely high probability.
    • Making the LM look better than it really is

This is called “Training on the test set”

Bad science!


34 of 72

Dev sets

  • If we test on the test set many times we might implicitly tune to its characteristics
    • Noticing which changes make the model better.
  • So we run on the test set only once, or a few times
  • That means we need a third dataset:
    • A development test set, or devset.
    • We test our LM on the devset until the very end
    • And then test our LM on the test set once

35 of 72

Intuition of perplexity as evaluation metric: How good is our language model?

Intuition: A good LM prefers "real" sentences

  • Assigns higher probability to “real” or “frequently observed” sentences
  • Assigns lower probability to “word salad” or “rarely observed” sentences

36 of 72

Intuition of perplexity 2: Predicting upcoming words

The Shannon Game: How well can we predict the next word?

  • Once upon a ____
  • That is a picture of a ____
  • For breakfast I ate my usual ____

Unigrams are terrible at this game. (Why?)

A sample next-word distribution (e.g., for "Once upon a ____"): time 0.9, dream 0.03, midnight 0.02, …, and 1e-100

[Photo: Claude Shannon; credit: Historiska bildsamlingen, https://creativecommons.org/licenses/by/2.0/]

A good LM is one that assigns a higher probability to the next word that actually occurs

37 of 72

Intuition of perplexity 3: The best language model is one that best predicts the entire unseen test set

  • We said: a good LM is one that assigns a higher probability to the next word that actually occurs.
  • Let's generalize to all the words!
    • The best LM assigns high probability to the entire test set.
  • When comparing two LMs, A and B
    • We compute PA(test set) and PB(test set)
    • The better LM will give a higher probability to (=be less surprised by) the test set than the other LM.

38 of 72

Intuition of perplexity 4: Use perplexity instead of raw probability

  • Probability depends on size of test set
    • Probability gets smaller the longer the text
    • Better: a metric that is per-word, normalized by length
  • Perplexity is the inverse probability of the test set, normalized by the number of words

39 of 72

Intuition of perplexity 5: the inverse

Perplexity is the inverse probability of the test set, normalized by the number of words:

perplexity(W) = P(w1 w2 … wN)^(-1/N)

(The inverse comes from the original definition of perplexity from cross-entropy rate in information theory)

Probability range is [0,1], perplexity range is [1,∞]

Minimizing perplexity is the same as maximizing probability

40 of 72

Intuition of perplexity 6: N-grams

Chain rule:

perplexity(W) = ( Π_{i=1..N} 1 / P(wi | w1 … wi-1) )^(1/N)

Bigrams:

perplexity(W) = ( Π_{i=1..N} 1 / P(wi | wi-1) )^(1/N)

41 of 72

Intuition of perplexity 7: Weighted average branching factor

Perplexity is also the weighted average branching factor of a language.

Branching factor: number of possible next words that can follow any word

Example: Deterministic language L = {red,blue,green}

Branching factor = 3 (any word can be followed by red, blue, green)

Now assume LM A where each word follows any other word with equal probability ⅓

Given a test set T = "red red red red blue"

Perplexity_A(T) = P_A(red red red red blue)^(-1/5) = ((1/3)^5)^(-1/5) = (1/3)^(-1) = 3

42 of 72

Intuition of perplexity 7: Weighted average branching factor

  • But now suppose red was very likely in training set, such that for LM B:
    • P(red) = .8, P(green) = .1, P(blue) = .1
  • We would expect the probability to be higher, and hence the perplexity to be smaller:

Perplexity_B(T) = P_B(red red red red blue)^(-1/5)

= (.8 × .8 × .8 × .8 × .1)^(-1/5)

= .04096^(-1/5)

= .527^(-1)

≈ 1.89
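
A small sketch (assumed, not taken from the slides) that reproduces both numbers, computing perplexity from per-word probabilities under this toy language's independence assumption:

    import math

    T = ["red", "red", "red", "red", "blue"]

    def perplexity(word_probs, words):
        """PP(W) = P(w1..wN)^(-1/N), computed via log probabilities."""
        logprob = sum(math.log(word_probs[w]) for w in words)
        return math.exp(-logprob / len(words))

    lm_A = {"red": 1/3, "green": 1/3, "blue": 1/3}
    lm_B = {"red": 0.8, "green": 0.1, "blue": 0.1}

    print(perplexity(lm_A, T))   # 3.0
    print(perplexity(lm_B, T))   # ~1.89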

43 of 72

Holding test set constant:�Lower perplexity = better language model

  • Training 38 million words, test 1.5 million words, WSJ

N-gram Order:   Unigram   Bigram   Trigram
Perplexity:         962      170       109

44 of 72

Sampling and Generalization

45 of 72

The Shannon (1948) Visualization Method: Sample words from an LM

  • Unigram:

REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME CAN DIFFERENT NATURAL HERE HE THE A IN CAME THE TO OF TO EXPERT GRAY COME TO FURNISHES THE LINE MESSAGE HAD BE THESE.

  • Bigram:

THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED.

Claude Shannon

46 of 72

How Shannon sampled those words in 1948

"Open a book at random and select a letter at random on the page. This letter is recorded. The book is then opened to another page and one reads until this letter is encountered. The succeeding letter is then recorded. Turning to another page this second letter is searched for and the succeeding letter recorded, etc."

47 of 72

Sampling a word from a distribution

48 of 72

Visualizing Bigrams the Shannon Way

  • Choose a random bigram (<s>, w) according to its probability p(w|<s>)
  • Now choose a random bigram (w, x) according to its probability p(x|w)
  • And so on until we choose </s>
  • Then string the words together

<s> I

I want

want to

to eat

eat Chinese

Chinese food

food </s>

I want to eat Chinese food
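
A minimal sketch of this generation loop; the tiny bigram table below is made up for illustration and is not the Berkeley Restaurant Project model:

    import random

    # Hypothetical bigram probabilities P(next | current).
    bigram_probs = {
        "<s>":     {"I": 0.7, "tell": 0.3},
        "I":       {"want": 1.0},
        "tell":    {"me": 1.0},
        "me":      {"</s>": 1.0},
        "want":    {"to": 0.8, "chinese": 0.2},
        "to":      {"eat": 1.0},
        "eat":     {"chinese": 1.0},
        "chinese": {"food": 1.0},
        "food":    {"</s>": 1.0},
    }

    def generate(max_len=20):
        word, sentence = "<s>", []
        while len(sentence) < max_len:
            nexts = bigram_probs[word]
            word = random.choices(list(nexts), weights=list(nexts.values()))[0]
            if word == "</s>":
                break
            sentence.append(word)
        return " ".join(sentence)

    print(generate())   # e.g. "I want to eat chinese food"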

49 of 72

Note: there are other sampling methods

Used for neural language models

Many of them avoid generating words from the very unlikely tail of the distribution

We'll discuss when we get to neural LM decoding:

    • Temperature sampling
    • Top-k sampling
    • Top-p sampling

50 of 72

Approximating Shakespeare

51 of 72

Shakespeare as corpus

N=884,647 tokens, V=29,066

Shakespeare produced 300,000 bigram types out of V² = 844 million possible bigrams.

    • So 99.96% of the possible bigrams were never seen (have zero entries in the table)
    • That sparsity is even worse for 4-grams, explaining why our sampling generated actual Shakespeare.

52 of 72

The Wall Street Journal is not Shakespeare

53 of 72

Can you guess the author? These 3-gram sentences are sampled from an LM trained on who?

1) They also point to ninety nine point six billion dollars from two hundred four oh six three percent of the rates of interest stores as Mexico and gram Brazil on market conditions

2) This shall forbid it should be branded, if renown made it empty.

3) “You are uniformly charming!” cried he, with a smile of associating and now and then I bowed and they perceived a chaise and four to wish for.


54 of 72

Choosing training data

If task-specific, use a training corpus that has a similar genre to your task.

    • If legal or medical, need lots of special-purpose documents

Make sure to cover different kinds of dialects and speaker/authors.

    • Example: African-American Vernacular English (AAVE)
      • One of many varieties that can be used by African Americans and others
      • Can include the auxiliary verb finna that marks immediate future tense: "My phone finna die"

55 of 72

The perils of overfitting

N-grams only work well for word prediction if the test corpus looks like the training corpus

    • But even when we try to pick a good training corpus, the test set will surprise us!
    • We need to train robust models that generalize!

One kind of generalization: Zeros

      • Things that don’t ever occur in the training set
        • But occur in the test set

56 of 72

Zeros

  • Training set:

… ate lunch

… ate dinner

… ate a

… ate the

P(“breakfast” | ate) = 0

  • Test set

… ate lunch

… ate breakfast

57 of 72

Zero probability bigrams

Bigrams with zero probability

    • Will hurt our performance for texts where those words appear!
    • And mean that we will assign 0 probability to the test set!

And hence we cannot compute perplexity (can’t divide by 0)!

58 of 72

Smoothing, Interpolation, and Backoff

59 of 72

The intuition of smoothing (from Dan Klein)

  • When we have sparse statistics:

  • Steal probability mass to generalize better

P(w | denied the), observed counts: allegations 3, reports 2, claims 1, request 1 (7 total)

P(w | denied the), after smoothing: allegations 2.5, reports 1.5, claims 0.5, request 0.5, other 2 (7 total)


60 of 72

Add-one estimation

  • Also called Laplace smoothing
  • Pretend we saw each word one more time than we did
  • Just add one to all the counts!

  • MLE estimate:

P_MLE(wi | wi-1) = c(wi-1, wi) / c(wi-1)

  • Add-1 estimate (V = vocabulary size; see the sketch below):

P_Add-1(wi | wi-1) = ( c(wi-1, wi) + 1 ) / ( c(wi-1) + V )
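
A minimal sketch of the add-1 estimate with an illustrative helper and toy counts (not the Berkeley Restaurant data):

    from collections import Counter

    def bigram_add1(w_prev, w, bigram_counts, unigram_counts):
        """P_add-1(w | w_prev) = (C(w_prev, w) + 1) / (C(w_prev) + V)."""
        V = len(unigram_counts)               # vocabulary size
        return (bigram_counts[(w_prev, w)] + 1) / (unigram_counts[w_prev] + V)

    unigrams = Counter({"<s>": 3, "I": 3, "am": 2, "Sam": 2, "</s>": 3})   # toy counts, V = 5
    bigrams  = Counter({("<s>", "I"): 2, ("I", "am"): 2, ("Sam", "</s>"): 1})

    print(bigram_add1("<s>", "I", bigrams, unigrams))    # (2 + 1) / (3 + 5) = 0.375
    print(bigram_add1("<s>", "am", bigrams, unigrams))   # (0 + 1) / (3 + 5) = 0.125; an unseen bigram is no longer zero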

61 of 72

Maximum Likelihood Estimates

  • The maximum likelihood estimate
    • of some parameter of a model M from a training set T
    • maximizes the likelihood of the training set T given the model M
  • Suppose the word “bagel” occurs 400 times in a corpus of a million words
  • What is the probability that a random word from some other text will be “bagel”?
  • MLE estimate is 400/1,000,000 = .0004
  • This may be a bad estimate for some other corpus
    • But it is the estimate that makes it most likely that “bagel” will occur 400 times in a million word corpus.

62 of 72

Berkeley Restaurant Corpus: Laplace smoothed bigram counts

63 of 72

Laplace-smoothed bigrams

64 of 72

Reconstituted counts

(Turning the smoothed probabilities back into counts shows how much probability mass add-1 moved: c*(wi-1 wi) = ( c(wi-1 wi) + 1 ) × c(wi-1) / ( c(wi-1) + V ).)

65 of 72

Compare with raw bigram counts

66 of 72

Add-1 estimation is a blunt instrument

  • So add-1 isn’t used for N-grams:
    • Generally we use interpolation or backoff instead
  • But add-1 is used to smooth other NLP models
    • For text classification
    • In domains where the number of zeros isn’t so huge.

67 of 72

Backoff and Interpolation

  • Sometimes it helps to use less context
    • Condition on less context for contexts you know less about
  • Backoff:
    • use trigram if you have good evidence,
    • otherwise bigram, otherwise unigram
  • Interpolation:
    • mix unigram, bigram, trigram

  • Interpolation works better

68 of 72

Linear Interpolation

  • Simple interpolation:

P̂(wn | wn-2 wn-1) = λ1 P(wn) + λ2 P(wn | wn-1) + λ3 P(wn | wn-2 wn-1),   with Σi λi = 1

  • Lambdas conditional on context: the λs can themselves depend on the preceding words, λi(wn-2 wn-1) (see the sketch below)
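
A minimal interpolation sketch with illustrative λ values and made-up component estimates (in practice the λs come from held-out tuning, as on the next slide):

    def interpolated_prob(w, w1, w2, p_uni, p_bi, p_tri, lambdas=(0.1, 0.3, 0.6)):
        """P_hat(w | w1 w2) = l1*P(w) + l2*P(w | w2) + l3*P(w | w1 w2)."""
        l1, l2, l3 = lambdas
        return (l1 * p_uni.get(w, 0.0)
                + l2 * p_bi.get((w2, w), 0.0)
                + l3 * p_tri.get((w1, w2, w), 0.0))

    # Made-up component estimates for P(food | want english):
    p_uni = {"food": 0.01}
    p_bi  = {("english", "food"): 0.3}
    p_tri = {}                                  # trigram "want english food" unseen
    print(interpolated_prob("food", "want", "english", p_uni, p_bi, p_tri))
    # 0.1*0.01 + 0.3*0.3 + 0.6*0.0 = 0.091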

69 of 72

How to set λs for interpolation?

  • Use a held-out corpus

  • Choose λs to maximize probability of held-out data:
    • Fix the N-gram probabilities (on the training data)
    • Then search for λs that give largest probability to held-out set

Training Data | Held-Out Data | Test Data

70 of 72

Backoff

Suppose you want:

P(pancakes| delicious soufflé)

  • If the trigram probability is 0, use the bigram
  • P(pancakes| soufflé)
  • If the bigram probability is 0, use the unigram
  • P(pancakes)

Complication: we need to discount the higher-order n-gram probabilities so that the distribution doesn't sum to more than 1 (e.g., Katz backoff)

71 of 72

Stupid Backoff

  • Backoff without discounting: the result is a score S, not a true probability

S(wi | wi-1) = count(wi-1 wi) / count(wi-1) if count(wi-1 wi) > 0, otherwise 0.4 × S(wi), where S(wi) = count(wi) / N

(see the sketch below)
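
A minimal bigram/unigram stupid-backoff sketch with toy counts; the 0.4 backoff weight follows Brants et al. (2007), everything else is illustrative:

    from collections import Counter

    def stupid_backoff(w, w_prev, bigram_counts, unigram_counts, total_words, alpha=0.4):
        """Score S(w | w_prev); falls back to a scaled unigram relative frequency."""
        if bigram_counts[(w_prev, w)] > 0:
            return bigram_counts[(w_prev, w)] / unigram_counts[w_prev]
        return alpha * unigram_counts[w] / total_words

    unigrams = Counter({"delicious": 3, "souffle": 2, "pancakes": 1})
    bigrams  = Counter({("delicious", "souffle"): 2})
    N = sum(unigrams.values())

    print(stupid_backoff("souffle", "delicious", bigrams, unigrams, N))   # 2/3, bigram seen
    print(stupid_backoff("pancakes", "souffle", bigrams, unigrams, N))    # 0.4 * 1/6 ≈ 0.067, backed off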


72 of 72

Thank You