1 of 26

Smoothing and Classifiers

COSC 426A: Natural Language Processing (Fall 2024)

Prof. Forrest Davis

2 of 26

Warm-up 🔥

3 of 26

  • Replication results using the pipeline due Sunday

Logistics

4 of 26

  • Understand the motivation behind more complex smoothing techniques
  • Describe how generating from an ngram model works
  • Understand the task of text classification
  • Apply Bayes’ rule to classification

Learning Objectives

5 of 26

Smoothing Techniques

  • Additive
    • Remove zero counts through extensive hallucinations (add fake counts to every ngram; see the sketch after this list)
  • Order interpolation
    • Mix higher-order (longer-context) estimates with lower-order (shorter-context) estimates (see the sketch after this list)
  • Katz Back-off
    • Only use reliable probabilities; back off to a shorter context otherwise
  • Absolute Discounting
    • Steal mass from higher-order observations for lower-order predictions
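
A minimal sketch of the first two ideas (additive smoothing and interpolation) on a toy corpus, assuming plain count dictionaries; the names p_add_k, p_interpolated, k, and lam are illustrative, not notation from the slides:

from collections import Counter

tokens = "<s> ants usually eat small insects </s>".split()
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
V = len(unigram_counts)           # vocabulary size
N = sum(unigram_counts.values())  # total number of tokens

def p_add_k(word, prev, k=1.0):
    # Additive smoothing: hallucinate k extra counts for every possible bigram
    return (bigram_counts[(prev, word)] + k) / (unigram_counts[prev] + k * V)

def p_interpolated(word, prev, lam=0.7):
    # Interpolation: weighted mix of the bigram (higher-order) and unigram (lower-order) estimates
    p_bi = bigram_counts[(prev, word)] / unigram_counts[prev]
    p_uni = unigram_counts[word] / N
    return lam * p_bi + (1 - lam) * p_uni

print(p_add_k("eat", "usually"), p_interpolated("eat", "usually"))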

6 of 26

Katz back-off

7 of 26

Katz back-off (Katz, 1987)

Idea:

If we have a reliable estimate of an ngram, great!

If not, back off to a shorter context
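
A simplified sketch of that decision for bigrams, assuming the same kind of count dictionaries as above. This is not the full Katz method: real Katz back-off uses Good-Turing discounted probabilities and a back-off weight alpha(context) so the result still sums to one; the sketch only shows the "use the longer context if we saw it, otherwise fall back" step.

def p_backoff(word, prev, bigram_counts, unigram_counts, total):
    count = bigram_counts.get((prev, word), 0)
    if count > 0:
        # reliable estimate: we actually observed this bigram
        return count / unigram_counts[prev]
    # unreliable: back off to the shorter (unigram) context
    return unigram_counts.get(word, 0) / total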

8 of 26

Katz back-off (Katz, 1987)

Pros

  • Incorporates confidence to yield reliable probabilities

Cons

  • Dramatic shifts in distribution with new data
  • Insensitive to evidence by omission (never having seen the longer ngram is itself informative)

P(eats | ants usually) ≟ P(eats | usually)

9 of 26

Absolute Discounting

10 of 26

Zooming out to the common core

11 of 26

Discounting - Stealing Probability

12 of 26

Absolute Discounting: Magic Number

Church and Gale (1991)

Bigram count in training data    Bigram count in held-out data
0                                0.000270
1                                0.448
2                                1.25
3                                2.24
4                                3.23
5                                4.21
6                                5.23
7                                6.21
8                                7.21
9                                8.26

For all but the smallest counts, each held-out count is roughly the training count minus 0.75: 0.75!

13 of 26

Absolute Discounting

  • Idea: Steal probability from the rich (those you have counts for) and give to the poor (those who have no counts)
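
A minimal sketch of interpolated absolute discounting for bigrams, using the d = 0.75 suggested by the table on the previous slide; the function name and arguments are illustrative:

def p_absolute_discount(word, prev, bigram_counts, unigram_counts, total, d=0.75):
    c_prev = unigram_counts[prev]
    # steal d from every observed bigram count (the rich)
    p_high = max(bigram_counts.get((prev, word), 0) - d, 0) / c_prev
    # hand the stolen mass to the lower-order unigram estimate (the poor)
    num_continuations = sum(1 for (p, _w) in bigram_counts if p == prev)
    lam = d * num_continuations / c_prev
    p_low = unigram_counts.get(word, 0) / total
    return p_high + lam * p_low

The two terms balance: the discount removes d times the number of distinct continuations of prev from the seen bigrams, and lam redistributes exactly that much mass through the unigram distribution, so the probabilities still sum to one.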

14 of 26

Generation

15 of 26

Generation

  • How do you think we would generate sentences given an ngram language model? (see the sketch after this list)
    • Starting from BOS (<s>), grab a next word proportional to its assigned probability
    • Form the next context, <s> WORD: what is likely after WORD?
    • Repeat until we generate EOS (</s>)
    • Profit
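
A minimal generation sketch for a bigram model trained on a toy corpus; train_bigrams and generate are illustrative names, and the sampling uses counts directly as weights, which is proportional to the model's probabilities:

import random
from collections import Counter, defaultdict

def train_bigrams(sentences):
    model = defaultdict(Counter)  # context word -> counts of possible next words
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            model[prev][word] += 1
    return model

def generate(model, max_len=20):
    context, output = "<s>", []
    for _ in range(max_len):
        words, counts = zip(*model[context].items())
        # grab the next word proportional to its assigned probability
        word = random.choices(words, weights=counts)[0]
        if word == "</s>":            # stop once we generate EOS
            break
        output.append(word)
        context = word                # the word we just generated becomes the new context
    return " ".join(output)

model = train_bigrams(["ants usually eat small insects", "ants eat leaves"])
print(generate(model))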

16 of 26

Naive Bayes

Classification

17 of 26

Applications?? 👷‍♀️

  • Yes

Text Classification

18 of 26

Text Classification

  • Take text and assign it a label from a set of discrete classes
    • Positive? Negative? Neutral?
    • Spam? Not Spam?
    • Loan? No loan?
    • Good resume? Bad resume?
    • Grammatical? Ungrammatical?
  • How do we do this?

19 of 26

Text Classification

  • Aim: maximize P(class | text) - "the class we pick is the one that is most likely given the text"

  • We need some way of estimating this. Thoughts?

20 of 26

Bayes’ rule

21 of 26

Bayes’ rule for text classification

P(class | text) = P(text | class) × P(class) / P(text)

Likelihood: P(text | class)
Prior: P(class)
Evidence: P(text)

22 of 26

Bayes’ rule for text classification

  • We can throw out the evidence term – why?
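
One way to see it (a sketch of the step, using the labels from the previous slide): the evidence P(text) is the same for every candidate class, so dropping it never changes which class wins the argmax.

\hat{c} = \arg\max_{c} \frac{P(\text{text} \mid c)\, P(c)}{P(\text{text})}
        = \arg\max_{c} P(\text{text} \mid c)\, P(c)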

23 of 26

Bayes’ rule for text classification

24 of 26

Compute the prior probability

How could you compute the probability of a class (i.e. P(c))?
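
One common way (a maximum-likelihood sketch, assuming the training data is a list of (text, label) pairs; estimate_priors is an illustrative name): count how often each class appears in training and divide by the total number of documents.

from collections import Counter

def estimate_priors(labeled_docs):
    # P(c) is estimated as: number of training documents labeled c / total documents
    label_counts = Counter(label for _text, label in labeled_docs)
    total = len(labeled_docs)
    return {label: count / total for label, count in label_counts.items()}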

25 of 26

Compute the likelihood

How could you compute the probability that a particular text belongs to some class (i.e. P(text | class))?

Naive Bayes

“Bag of words assumption”: each word is independent (i.e. position does not matter)
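
A minimal Naive Bayes sketch under this assumption: P(text | class) is approximated as the product over the text's words of P(word | class), estimated from counts with add-one smoothing; the names train_nb and predict and the toy data are illustrative.

import math
from collections import Counter, defaultdict

def train_nb(labeled_docs):
    word_counts = defaultdict(Counter)  # per-class word counts (for the likelihood)
    class_counts = Counter()            # per-class document counts (for the prior)
    vocab = set()
    for text, label in labeled_docs:
        class_counts[label] += 1
        for word in text.split():
            word_counts[label][word] += 1
            vocab.add(word)
    return word_counts, class_counts, vocab

def predict(text, word_counts, class_counts, vocab):
    total_docs = sum(class_counts.values())
    scores = {}
    for c in class_counts:
        score = math.log(class_counts[c] / total_docs)     # log prior
        denom = sum(word_counts[c].values()) + len(vocab)  # add-one denominator
        for word in text.split():
            # each word contributes independently of its position (bag of words)
            score += math.log((word_counts[c][word] + 1) / denom)
        scores[c] = score
    return max(scores, key=scores.get)

docs = [("great movie loved it", "pos"), ("terrible boring movie", "neg")]
print(predict("loved it", *train_nb(docs)))

Working in log space avoids numerical underflow when multiplying many small word probabilities.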

26 of 26

  • Read
  • Work on midterm results

Before next class