1 of 32

CS458 Natural Language Processing

Lecture 5: Naive Bayes & Text Classification

Krishnendu Ghosh

Department of Computer Science & Engineering

Indian Institute of Information Technology Dharwad

2 of 32

Is this spam?

3 of 32

Who wrote which Federalist Papers?

1787-8: essays written anonymously by Alexander Hamilton, James Madison, and John Jay to convince New York to ratify the U.S. Constitution

Authorship of 12 of the essays was unclear: James Madison or Alexander Hamilton?

1963: solved by Mosteller and Wallace using Bayesian methods

4 of 32

Positive or negative movie review?

unbelievably disappointing

Full of zany characters and richly applied satire, and some great plot twists

this is the greatest screwball comedy ever filmed

It was pathetic. The worst part about it was the boxing scenes.


5 of 32

What is the subject of this article?

Antagonists and Inhibitors

Blood Supply

Chemistry

Drug Therapy

Embryology

Epidemiology


(Figure: a MEDLINE article to be assigned a category from the MeSH Subject Category Hierarchy)

6 of 32

Text Classification

Assigning subject categories, topics, or genres

Spam detection

Authorship identification (who wrote this?)

Language Identification (is this Portuguese?)

Sentiment analysis

7 of 32

Text Classification: definition

Input:

    • a document d
    • a fixed set of classes C = {c1, c2,…, cJ}

Output: a predicted class c ∈ C

8 of 32

Basic Classification Method: Hand-coded rules

Rules based on combinations of words or other features

    • spam: black-list-address OR (“dollars” AND “have been selected”)
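
As a rough illustration of such a rule, here is a minimal Python sketch; the blacklist addresses and trigger phrases are made-up examples, not from any real system:

    # Hypothetical blacklist and trigger phrases, for illustration only
    BLACKLIST = {"win-money@example.com", "cheap-meds@example.net"}

    def is_spam(sender, text):
        """Hand-coded rule: blacklisted sender OR ("dollars" AND "have been selected")."""
        if sender.lower() in BLACKLIST:
            return True
        t = text.lower()
        return "dollars" in t and "have been selected" in t

    print(is_spam("friend@example.org",
                  "You have been selected to receive 10,000 dollars!"))  # True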

Accuracy can be high

    • In very specific domains
    • If rules are carefully refined by experts

But:

    • building and maintaining rules is expensive
    • they are too literal and specific: "high-precision, low-recall"

9 of 32

Classification Method: Supervised Machine Learning

Input:

    • a document d
    • a fixed set of classes C = {c1, c2,…, cJ}
    • A training set of m hand-labeled documents (d1, c1), …, (dm, cm)

Output:

    • a learned classifier γ: d → c


10 of 32

Classification Methods: Supervised Machine Learning

Many kinds of classifiers!

    • Naïve Bayes (this lecture)
    • Logistic regression
    • Neural networks
    • k-nearest neighbors

We can also use pretrained large language models!

      • Fine-tuned as classifiers
      • Prompted to give a classification

11 of 32

The Naive Bayes Classifier

12 of 32

Naive Bayes Intuition

Simple ("naive") classification method based on Bayes rule

Relies on very simple representation of document

    • Bag of words

13 of 32

The Bag of Words Representation


14 of 32

The bag of words representation

γ( document ) = c

The document is reduced to an unordered bag of word counts:

    seen        2
    sweet       1
    whimsical   1
    recommend   1
    happy       1
    ...         ...

15 of 32

Bayes’ Rule Applied to Documents and Classes

  • For a document d and a class c
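
Written out, Bayes' rule in the notation of this slide:

\[
P(c \mid d) = \frac{P(d \mid c)\, P(c)}{P(d)}
\]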

16 of 32

Naive Bayes Classifier (I)

MAP is “maximum a posteriori” = most likely class

Bayes Rule

Dropping the denominator
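
The chain of steps these two labels refer to, in the usual notation:

\[
c_{MAP} = \arg\max_{c \in C} P(c \mid d)
        = \arg\max_{c \in C} \frac{P(d \mid c)\, P(c)}{P(d)}
        = \arg\max_{c \in C} P(d \mid c)\, P(c)
\]

P(d) is the same for every class, so it can be dropped from the argmax.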

17 of 32

Naive Bayes Classifier (II)

Document d represented as features x1, …, xn

"Likelihood"

"Prior"

18 of 32

Naïve Bayes Classifier (IV)

For the likelihood P(x1, x2, …, xn | cj): O(|X|^n · |C|) parameters

    • Could only be estimated if a very, very large number of training examples was available.

For the prior P(cj): How often does this class occur?

    • We can just count the relative frequencies in a corpus.
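
The relative-frequency estimate for the prior, in the usual form (N_cj = number of training documents of class cj, N_doc = total number of training documents):

\[
\hat{P}(c_j) = \frac{N_{c_j}}{N_{doc}}
\]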

19 of 32

Multinomial Naive Bayes Independence Assumptions

Bag of Words assumption: Assume position doesn’t matter

Conditional Independence: Assume the feature probabilities P(xi | c) are independent given the class c.
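
Under these two assumptions the likelihood factorizes:

\[
P(x_1, x_2, \ldots, x_n \mid c) = P(x_1 \mid c) \cdot P(x_2 \mid c) \cdot P(x_3 \mid c) \cdots P(x_n \mid c)
\]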

20 of 32

Multinomial Naive Bayes Classifier
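
The resulting decision rule, in its standard form:

\[
c_{NB} = \arg\max_{c_j \in C} P(c_j) \prod_{i} P(x_i \mid c_j)
\]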

21 of 32

Applying Multinomial Naive Bayes Classifiers to Text Classification

positions ← all word positions in test document
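
For text, the features are simply the words at each position of the test document, so the classifier computes:

\[
c_{NB} = \arg\max_{c_j \in C} P(c_j) \prod_{i \in positions} P(w_i \mid c_j)
\]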

22 of 32

Problems with multiplying lots of probs

There's a problem with this:

Multiplying lots of probabilities can result in floating-point underflow!

.0006 * .0007 * .0009 * .01 * .5 * .000008….

Idea: Use logs, because log(ab) = log(a) + log(b)

We'll sum logs of probabilities instead of multiplying probabilities!
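
A quick sketch of the problem and the fix in Python; the 1e-4 word probability and the document length are just illustrative values:

    import math

    p_word = 1e-4          # a typical small word probability (illustrative)
    n_words = 200          # a moderately long document

    prob = 1.0
    log_prob = 0.0
    for _ in range(n_words):
        prob *= p_word               # repeated multiplication underflows
        log_prob += math.log(p_word) # summing logs stays well-behaved

    print(prob)      # 0.0  (floating-point underflow)
    print(log_prob)  # about -1842.07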

23 of 32

We actually do everything in log space

Instead of this:

This:
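
The two formulas these labels refer to, in the usual notation (the product form, then the log-space form):

\[
c_{NB} = \arg\max_{c_j \in C} P(c_j) \prod_{i \in positions} P(w_i \mid c_j)
\]

\[
c_{NB} = \arg\max_{c_j \in C} \Big[ \log P(c_j) + \sum_{i \in positions} \log P(w_i \mid c_j) \Big]
\]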

Notes:

1) Taking log doesn't change the ranking of classes!

The class with highest probability also has highest log probability!

2) It's a linear model:

Just a max of a sum of weights: a linear function of the inputs

So naive Bayes is a linear classifier

24 of 32

Naive Bayes: Learning

25 of 32

Learning the Multinomial Naive Bayes Model

First attempt: maximum likelihood estimates

    • simply use the frequencies in the data

Sec.13.3

 

26 of 32

Parameter estimation

Create mega-document for topic j by concatenating all docs in this topic

    • Use frequency of w in mega-document

P̂(wi | cj) = fraction of times word wi appears among all words in documents of topic cj
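
In symbols, the usual maximum-likelihood estimate, where count(wi, cj) is the number of occurrences of wi in the mega-document for cj:

\[
\hat{P}(w_i \mid c_j) = \frac{count(w_i, c_j)}{\sum_{w \in V} count(w, c_j)}
\]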

27 of 32

Problem with Maximum Likelihood

  • What if we have seen no training documents containing the word fantastic that are classified as positive (thumbs-up)?

  • Zero probabilities cannot be conditioned away, no matter the other evidence!
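
Concretely, with the maximum-likelihood estimate above:

\[
\hat{P}(\textrm{fantastic} \mid \textrm{positive}) = \frac{count(\textrm{fantastic}, \textrm{positive})}{\sum_{w \in V} count(w, \textrm{positive})} = 0
\]

and the whole product P(positive) · ∏i P(wi | positive) becomes 0 for any document containing fantastic, regardless of the other words.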

Sec.13.3

28 of 32

Laplace (add-1) smoothing for Naïve Bayes
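
The add-1 (Laplace) smoothed estimate, in its usual form (V is the vocabulary):

\[
\hat{P}(w_i \mid c_j) = \frac{count(w_i, c_j) + 1}{\sum_{w \in V}\big(count(w, c_j) + 1\big)} = \frac{count(w_i, c_j) + 1}{\Big(\sum_{w \in V} count(w, c_j)\Big) + |V|}
\]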

29 of 32

Multinomial Naïve Bayes: Learning

  • From training corpus, extract Vocabulary

  • Calculate P(cj) terms
    • For each cj in C do
      • docsj ← all docs with class = cj
      • P(cj) ← |docsj| / (total # of documents)

  • Calculate P(wk | cj) terms (with add-1 smoothing)
    • Textj ← single doc containing all docsj
    • For each word wk in Vocabulary
      • nk ← # of occurrences of wk in Textj
      • P(wk | cj) ← (nk + 1) / (n + |Vocabulary|), where n = # of word tokens in Textj
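
A compact Python sketch of the whole procedure (training with add-1 smoothing, classification in log space); the tiny toy corpus and the whitespace tokenizer are illustrative assumptions, not part of the slides:

    import math
    from collections import Counter, defaultdict

    def train_nb(docs):
        """docs: list of (text, class) pairs. Returns log-priors, log-likelihoods, vocabulary."""
        vocab = set()
        class_docs = defaultdict(int)          # number of docs per class
        word_counts = defaultdict(Counter)     # per-class "mega-document" word counts
        for text, c in docs:
            class_docs[c] += 1
            for w in text.lower().split():     # naive whitespace tokenizer (assumption)
                vocab.add(w)
                word_counts[c][w] += 1
        n_docs = len(docs)
        log_prior = {c: math.log(class_docs[c] / n_docs) for c in class_docs}
        log_lik = {}
        for c in class_docs:
            total = sum(word_counts[c].values()) + len(vocab)   # add-1 denominator
            log_lik[c] = {w: math.log((word_counts[c][w] + 1) / total) for w in vocab}
        return log_prior, log_lik, vocab

    def classify_nb(text, log_prior, log_lik, vocab):
        """Pick the class with the highest sum of log prior and log likelihoods."""
        words = [w for w in text.lower().split() if w in vocab]  # unknown words are ignored
        scores = {c: log_prior[c] + sum(log_lik[c][w] for w in words) for c in log_prior}
        return max(scores, key=scores.get)

    # Toy example (made-up data)
    train = [("great fun and great plot", "pos"),
             ("boring and pathetic", "neg"),
             ("richly applied satire", "pos"),
             ("the worst film ever", "neg")]
    model = train_nb(train)
    print(classify_nb("a great satire", *model))   # expected: pos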

30 of 32

Unknown words

What about unknown words

    • that appear in our test data
    • but not in our training data or vocabulary?

We ignore them

    • Remove them from the test document!
    • Pretend they weren't there!
    • Don't include any probability for them at all!

Why don't we build an unknown word model?

    • It doesn't help: knowing which class has more unknown words is not generally helpful!

31 of 32

Stop words

Some systems ignore stop words

    • Stop words: very frequent words like the and a.
      • Sort the vocabulary by word frequency in training set
      • Call the top 10 or 50 words the stopword list.
      • Remove all stop words from both training and test sets
        • As if they were never there!

But removing stop words doesn't usually help

    • So in practice most NB algorithms use all words and don't use stopword lists

32 of 32

Thank You