1 of 26

Smoothing and Classifiers

COSC 426A: Natural Language Processing (Fall 2024)

Prof. Forrest Davis

2 of 26

Warm-up 🔥

3 of 26

  • Replication results using the pipeline due Sunday

Logistics

4 of 26

  • Understand the motivation behind more complex smoothing techniques
  • Describe how generating from an ngram model works
  • Understand the task of text classification
  • Apply Bayes’ rule to classification

Learning Objectives

5 of 26

Smoothing Techniques

  • Additive
    • Remove zero counts through extensive hallucinations (add fake counts to every ngram; see the sketch after this list)
  • Order interpolation
    • Mix higher-order (longer-context) estimates with lower-order (shorter-context) estimates (see the sketch after this list)
  • Katz Back-off
    • Only use reliable probabilities; back off to a shorter context otherwise
  • Absolute Discounting
    • Steal mass from higher-order observations for lower-order predictions
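
A minimal sketch of the first two ideas (additive smoothing and interpolation) on a toy corpus, assuming plain count dictionaries; the names p_add_k, p_interpolated, k, and lam are illustrative, not notation from the slides:

from collections import Counter

tokens = "<s> ants usually eat small insects </s>".split()
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
V = len(unigram_counts)           # vocabulary size
N = sum(unigram_counts.values())  # total number of tokens

def p_add_k(word, prev, k=1.0):
    # Additive smoothing: hallucinate k extra counts for every possible bigram
    return (bigram_counts[(prev, word)] + k) / (unigram_counts[prev] + k * V)

def p_interpolated(word, prev, lam=0.7):
    # Interpolation: weighted mix of the bigram (higher-order) and unigram (lower-order) estimates
    p_bi = bigram_counts[(prev, word)] / unigram_counts[prev]
    p_uni = unigram_counts[word] / N
    return lam * p_bi + (1 - lam) * p_uni

print(p_add_k("eat", "usually"), p_interpolated("eat", "usually"))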

6 of 26

Katz back-off

7 of 26

Katz back-off (Katz, 1987)

Idea:

If we have a reliable estimate of an ngram, great!

If not, back off to a shorter context
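
A simplified sketch of that decision for bigrams, assuming the same kind of count dictionaries as above. This is not the full Katz method: real Katz back-off uses Good-Turing discounted probabilities and a back-off weight alpha(context) so the result still sums to one; the sketch only shows the "use the longer context if we saw it, otherwise fall back" step.

def p_backoff(word, prev, bigram_counts, unigram_counts, total):
    count = bigram_counts.get((prev, word), 0)
    if count > 0:
        # reliable estimate: we actually observed this bigram
        return count / unigram_counts[prev]
    # unreliable: back off to the shorter (unigram) context
    return unigram_counts.get(word, 0) / total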

8 of 26

Katz back-off (Katz, 1987)

Pros

  • Incorporates confidence to yield reliable probabilities

Cons

  • Dramatic shifts in distribution with new data
  • Insensitive to evidence by omission (never having seen the longer ngram is itself informative)

P(eats | ants usually) ≟ P(eats | usually)

9 of 26

Absolute Discounting

10 of 26

Zooming out to the common core

11 of 26

Discounting - Stealing Probability

12 of 26

Absolute Discounting: Magic Number

Church and Gale (1991)

Bigram count in training data    Bigram count in held-out data
0                                0.000270
1                                0.448
2                                1.25
3                                2.24
4                                3.23
5                                4.21
6                                5.23
7                                6.21
8                                7.21
9                                8.26

For all but the smallest counts, each held-out count is roughly the training count minus 0.75: 0.75!

13 of 26

Absolute Discounting

  • Idea: Steal probability from the rich (those you have counts for) and give to the poor (those who have no counts)
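
A minimal sketch of interpolated absolute discounting for bigrams, using the d = 0.75 suggested by the table on the previous slide; the function name and arguments are illustrative:

def p_absolute_discount(word, prev, bigram_counts, unigram_counts, total, d=0.75):
    c_prev = unigram_counts[prev]
    # steal d from every observed bigram count (the rich)
    p_high = max(bigram_counts.get((prev, word), 0) - d, 0) / c_prev
    # hand the stolen mass to the lower-order unigram estimate (the poor)
    num_continuations = sum(1 for (p, _w) in bigram_counts if p == prev)
    lam = d * num_continuations / c_prev
    p_low = unigram_counts.get(word, 0) / total
    return p_high + lam * p_low

The two terms balance: the discount removes d times the number of distinct continuations of prev from the seen bigrams, and lam redistributes exactly that much mass through the unigram distribution, so the probabilities still sum to one.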

14 of 26

Generation

15 of 26

Generation

  • How do you think we would generate sentences given an ngram language model? (see the sketch after this list)
    • Starting from BOS (<s>), grab a next word proportional to its assigned probability
    • Form the next context, <s> WORD: what is likely after WORD?
    • Repeat until we generate EOS (</s>)
    • Profit
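
A minimal generation sketch for a bigram model trained on a toy corpus; train_bigrams and generate are illustrative names, and the sampling uses counts directly as weights, which is proportional to the model's probabilities:

import random
from collections import Counter, defaultdict

def train_bigrams(sentences):
    model = defaultdict(Counter)  # context word -> counts of possible next words
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            model[prev][word] += 1
    return model

def generate(model, max_len=20):
    context, output = "<s>", []
    for _ in range(max_len):
        words, counts = zip(*model[context].items())
        # grab the next word proportional to its assigned probability
        word = random.choices(words, weights=counts)[0]
        if word == "</s>":            # stop once we generate EOS
            break
        output.append(word)
        context = word                # the word we just generated becomes the new context
    return " ".join(output)

model = train_bigrams(["ants usually eat small insects", "ants eat leaves"])
print(generate(model))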

16 of 26

Naive Bayes

Classification

17 of 26

Applications?? 👷‍♀️

  • Yes

Text Classification

18 of 26

Text Classification

  • Take text and assign it a label from a set of discrete classes
    • Positive? Negative? Neutral?
    • Spam? Not Spam?
    • Loan? No loan?
    • Good resume? Bad resume?
    • Grammatical? Ungrammatical?
  • How do we do this?

19 of 26

Text Classification

  • Aim: maximize P(class | text) - "the class we pick is the one that is most likely given the text"

  • We need some way of estimating this. Thoughts?

20 of 26

Bayes’ rule

21 of 26

Bayes’ rule for text classification

P(class | text) = P(text | class) × P(class) / P(text)

Likelihood: P(text | class)
Prior: P(class)
Evidence: P(text)

22 of 26

Bayes’ rule for text classification

  • We can throw out the evidence term – why?
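
One way to see it (a sketch of the step, using the labels from the previous slide): the evidence P(text) is the same for every candidate class, so dropping it never changes which class wins the argmax.

\hat{c} = \arg\max_{c} \frac{P(\text{text} \mid c)\, P(c)}{P(\text{text})}
        = \arg\max_{c} P(\text{text} \mid c)\, P(c)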

23 of 26

Bayes’ rule for text classification

24 of 26

Compute the prior probability

How could you compute the probability of a class (i.e. P(c))?
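
One common way (a maximum-likelihood sketch, assuming the training data is a list of (text, label) pairs; estimate_priors is an illustrative name): count how often each class appears in training and divide by the total number of documents.

from collections import Counter

def estimate_priors(labeled_docs):
    # P(c) is estimated as: number of training documents labeled c / total documents
    label_counts = Counter(label for _text, label in labeled_docs)
    total = len(labeled_docs)
    return {label: count / total for label, count in label_counts.items()}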

25 of 26

Compute the likelihood

How could you compute the probability that a particular text belongs to some class (i.e. P(text | class))?

Naive Bayes

“Bag of words assumption”: each word is independent (i.e. position does not matter)
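
A minimal Naive Bayes sketch under this assumption: P(text | class) is approximated as the product over the text's words of P(word | class), estimated from counts with add-one smoothing; the names train_nb and predict and the toy data are illustrative.

import math
from collections import Counter, defaultdict

def train_nb(labeled_docs):
    word_counts = defaultdict(Counter)  # per-class word counts (for the likelihood)
    class_counts = Counter()            # per-class document counts (for the prior)
    vocab = set()
    for text, label in labeled_docs:
        class_counts[label] += 1
        for word in text.split():
            word_counts[label][word] += 1
            vocab.add(word)
    return word_counts, class_counts, vocab

def predict(text, word_counts, class_counts, vocab):
    total_docs = sum(class_counts.values())
    scores = {}
    for c in class_counts:
        score = math.log(class_counts[c] / total_docs)     # log prior
        denom = sum(word_counts[c].values()) + len(vocab)  # add-one denominator
        for word in text.split():
            # each word contributes independently of its position (bag of words)
            score += math.log((word_counts[c][word] + 1) / denom)
        scores[c] = score
    return max(scores, key=scores.get)

docs = [("great movie loved it", "pos"), ("terrible boring movie", "neg")]
print(predict("loved it", *train_nb(docs)))

Working in log space avoids numerical underflow when multiplying many small word probabilities.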

26 of 26

  • Read
  • Work on midterm results

Before next class