CS458 Natural Language Processing
Lecture 5: Naive Bayes & Text Classification
Krishnendu Ghosh
Department of Computer Science & Engineering
Indian Institute of Information Technology Dharwad
Is this spam?
Who wrote which Federalist Papers?
1787-8: essays anonymously written by Alexander Hamilton, James Madison, and John Jay to convince New York to ratify the U.S. Constitution
Authorship of 12 of the essays unclear between:
James Madison
Alexander Hamilton
1963: solved by Mosteller and Wallace using Bayesian methods
Positive or negative movie review?
unbelievably disappointing
Full of zany characters and richly applied satire, and some great plot twists
this is the greatest screwball comedy ever filmed
It was pathetic. The worst part about it was the boxing scenes.
What is the subject of this article?
Antagonists and Inhibitors
Blood Supply
Chemistry
Drug Therapy
Embryology
Epidemiology
…
MeSH Subject Category Hierarchy
[Figure: a MEDLINE article being assigned a category from the MeSH subject hierarchy]
Text Classification
Assigning subject categories, topics, or genres
Spam detection
Authorship identification (who wrote this?)
Language Identification (is this Portuguese?)
Sentiment analysis
…
Text Classification: definition
Input: a document d and a fixed set of classes C = {c1, c2, …, cJ}
Output: a predicted class c ∈ C
Basic Classification Method: Hand-coded rules
Rules based on combinations of words or other features
Accuracy can be high
But: building and maintaining these rules is expensive
Classification Method: Supervised Machine Learning
Input: a document d, a fixed set of classes C = {c1, c2, …, cJ}, and a training set of m hand-labeled documents (d1, c1), …, (dm, cm)
Output: a learned classifier γ: d → c
Classification Methods: Supervised Machine Learning
Many kinds of classifiers: Naive Bayes, logistic regression, neural networks, k-nearest neighbors, …
We can also use pretrained large language models!
The Naive Bayes Classifier
Naive Bayes Intuition
Simple ("naive") classification method based on Bayes rule
Relies on very simple representation of document
The Bag of Words Representation
The bag of words representation
The classifier γ maps the document's bag of words (its word counts) to a class: γ(d) = c

word | count
seen | 2
sweet | 1
whimsical | 1
recommend | 1
happy | 1
... | ...
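A minimal sketch of this representation in Python, assuming simple whitespace tokenization (the example sentence is invented):

from collections import Counter

def bag_of_words(text):
    # Lowercase and split on whitespace; a real system would use a proper tokenizer.
    return Counter(text.lower().split())

print(bag_of_words("sweet and whimsical the most whimsical film I have seen"))
# Counter({'whimsical': 2, 'sweet': 1, 'and': 1, ...}): order is gone, only counts remain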
Bayes’ Rule Applied to Documents and Classes
For a document d and a class c:
P(c|d) = P(d|c) P(c) / P(d)
Naive Bayes Classifier (I)
MAP is “maximum a posteriori” = most likely class:
c_MAP = argmax_{c ∈ C} P(c|d)
Applying Bayes’ rule:
c_MAP = argmax_{c ∈ C} P(d|c) P(c) / P(d)
Dropping the denominator P(d), which is the same for every class:
c_MAP = argmax_{c ∈ C} P(d|c) P(c)
Naive Bayes Classifier (II)
Document d represented as features x1, …, xn:
c_MAP = argmax_{c ∈ C} P(x1, …, xn | c) P(c)
P(x1, …, xn | c) is the “likelihood”; P(c) is the “prior”
Naive Bayes Classifier (IV)
c_MAP = argmax_{c ∈ C} P(x1, …, xn | c) P(c)
P(c): how often does this class occur? We can just count the relative frequencies in a corpus.
P(x1, …, xn | c): O(|X|^n · |C|) parameters, which could only be estimated if a very, very large number of training examples was available.
Multinomial Naive Bayes Independence Assumptions
Bag of Words assumption: assume word position doesn’t matter
Conditional Independence: assume the feature probabilities P(xi | c) are independent given the class c:
P(x1, …, xn | c) = P(x1 | c) · P(x2 | c) · … · P(xn | c)
Multinomial Naive Bayes Classifier
c_NB = argmax_{c ∈ C} P(c) ∏_{x ∈ X} P(x | c)
Applying Multinomial Naive Bayes Classifiers to Text Classification
positions ← all word positions in test document
c_NB = argmax_{cj ∈ C} P(cj) ∏_{i ∈ positions} P(xi | cj)
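As a hedged sketch, this decision rule in Python (the prior and likelihood dictionaries are assumed to come from training, covered later; all names here are illustrative, not from the slides):

def classify_nb(doc_tokens, classes, prior, likelihood, vocab):
    # argmax over classes of P(c) * product over word positions of P(x_i | c)
    best_class, best_score = None, -1.0
    for c in classes:
        score = prior[c]                    # P(c)
        for w in doc_tokens:                # every word position in the test document
            if w in vocab:                  # unknown words are skipped (see later slide)
                score *= likelihood[c][w]   # P(w | c)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

Multiplying raw probabilities like this invites exactly the underflow problem discussed next.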
Problems with multiplying lots of probs
There's a problem with this:
Multiplying lots of probabilities can result in floating-point underflow!
.0006 * .0007 * .0009 * .01 * .5 * .000008….
Idea: Use logs, because log(ab) = log(a) + log(b)
We'll sum logs of probabilities instead of multiplying probabilities!
We actually do everything in log space
Instead of this:
c_NB = argmax_{cj ∈ C} P(cj) ∏_{i ∈ positions} P(xi | cj)
This:
c_NB = argmax_{cj ∈ C} [ log P(cj) + Σ_{i ∈ positions} log P(xi | cj) ]
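A quick demonstration of both the failure and the fix, reusing the probabilities from the slide and repeating them to mimic a long document:

import math

probs = [.0006, .0007, .0009, .01, .5, .000008] * 40   # many small factors
product = 1.0
for p in probs:
    product *= p
print(product)        # 0.0: the product has underflowed to zero

log_score = sum(math.log(p) for p in probs)
print(log_score)      # about -1549.2: finite, and the ranking of classes is preserved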
Notes:
1) Taking log doesn't change the ranking of classes!
The class with highest probability also has highest log probability!
2) It's a linear model:
Just a max of a sum of weights: a linear function of the inputs
So Naive Bayes is a linear classifier
Naive Bayes: Learning
Learning the Multinomial Naive Bayes Model
First attempt: maximum likelihood estimates, i.e., simply use the frequencies in the data:
P̂(cj) = (# of documents with class cj) / (total # of documents)
P̂(wi | cj) = count(wi, cj) / Σ_{w ∈ V} count(w, cj)
Parameter estimation
Create mega-document for topic j by concatenating all docs in this topic
P̂(wi | cj) = count(wi, cj) / Σ_{w ∈ V} count(w, cj)
= the fraction of times word wi appears among all words in documents of topic cj
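A minimal sketch of these maximum likelihood estimates (docs is assumed to be a list of token lists; the names are illustrative):

from collections import Counter, defaultdict

def mle_estimates(docs, labels):
    # P(c): relative frequency of each class among the training documents
    prior = {c: n / len(labels) for c, n in Counter(labels).items()}
    # Mega-document per class: pool the word counts of all its documents
    mega = defaultdict(Counter)
    for doc, c in zip(docs, labels):
        mega[c].update(doc)
    # P(w | c): fraction of the class's tokens that are w (zero for unseen words!)
    likelihood = {c: {w: k / sum(counts.values()) for w, k in counts.items()}
                  for c, counts in mega.items()}
    return prior, likelihood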
Problem with Maximum Likelihood
What if we have seen no training documents with the word “fantastic” classified in the topic positive? Then P̂(“fantastic” | positive) = 0, and since the class probabilities are multiplied together, this single zero wipes out all the other evidence: the class can never be chosen.
Laplace (add-1) smoothing for Naive Bayes
P̂(wi | c) = (count(wi, c) + 1) / Σ_{w ∈ V} (count(w, c) + 1)
= (count(wi, c) + 1) / ((Σ_{w ∈ V} count(w, c)) + |V|)
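The same estimate with add-1 smoothing, continuing the previous sketch (mega is the per-class Counter dictionary from above, vocab the training vocabulary):

def laplace_likelihood(mega, vocab):
    likelihood = {}
    for c, counts in mega.items():
        denom = sum(counts.values()) + len(vocab)   # (Σ counts) + |V|
        # Every vocabulary word gets some mass, even if unseen in class c
        likelihood[c] = {w: (counts[w] + 1) / denom for w in vocab}
    return likelihood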
Multinomial Naive Bayes: Learning
From the training corpus, extract the Vocabulary V
Calculate P(cj) terms:
docsj ← all docs with class = cj
P(cj) = |docsj| / |total # of documents|
Calculate P(wk | cj) terms:
Textj ← a single mega-document containing all of docsj
nk ← # of occurrences of wk in Textj
n ← total # of words in Textj
P(wk | cj) = (nk + 1) / (n + |V|)
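Putting the recipe together end to end, as a sketch on a tiny invented corpus (data and names are illustrative only), training in log space as recommended earlier:

import math
from collections import Counter, defaultdict

train = [(["fun", "couple", "love", "love"], "comedy"),
         (["fast", "furious", "shoot"], "action"),
         (["couple", "fly", "fast", "fun", "fun"], "comedy"),
         (["furious", "shoot", "shoot", "fun"], "action"),
         (["fly", "fast", "shoot", "love"], "action")]

vocab = {w for doc, _ in train for w in doc}
counts = defaultdict(Counter)        # per-class mega-document counts
for doc, c in train:
    counts[c].update(doc)

classes = list(counts)
log_prior = {c: math.log(sum(1 for _, y in train if y == c) / len(train))
             for c in classes}
log_like = {c: {w: math.log((counts[c][w] + 1) / (sum(counts[c].values()) + len(vocab)))
                for w in vocab} for c in classes}

test = ["fast", "couple", "shoot", "fly"]
scores = {c: log_prior[c] + sum(log_like[c][w] for w in test if w in vocab)
          for c in classes}
print(max(scores, key=scores.get))   # "action" wins on this toy data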
Unknown words
What about unknown words, which appear in our test data but not in our training data or vocabulary?
We ignore them: remove them from the test document and don’t include any probability for them at all.
Why don’t we build an unknown word model? It doesn’t help: knowing which class has more unknown words is not generally useful.
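In code, ignoring unknown words is just a filter before scoring (a trivial sketch; the classifier sketch earlier already applied the same guard inline):

def drop_unknown(test_tokens, vocab):
    # Words never seen in training contribute no probability term for any class:
    # they are removed from the test document as if they weren't there.
    return [w for w in test_tokens if w in vocab]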
Stop words
Some systems ignore stop words: very frequent words like “the” and “a”. Sort the vocabulary by frequency, call the top 10 or 50 words the stopword list, and remove them all from the training and test sets.
But removing stop words doesn’t usually help, so in practice most Naive Bayes systems use all words and don’t bother with stopword lists. A frequency-based list could be built as sketched below.
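For systems that do use one, a frequency-based stopword list can be built from the training counts like this (a sketch; the cutoff of 10 is one arbitrary choice from the usual 10-50 range):

from collections import Counter

def build_stopword_list(class_counts, k=10):
    # Pool word counts across all classes and take the k most frequent words
    total = Counter()
    for counts in class_counts.values():
        total.update(counts)
    return {w for w, _ in total.most_common(k)}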
Thank You