CSE 312 Spam Filter Intro
The Naive Bayes Classifier
Agenda
Machine Learning in the Real World
ML Pipeline
Data
ML Model
Intelligence
From Wikipedia: “Machine learning is the study of computer algorithms that improve automatically through experience.”
You are a machine!
Number | Shape | “Label” |
3 | | 12 |
5 | | 15 |
-2 | | -8 |
7 | | 21 |
-4 | | -16 |
Given examples with correct “labels”, make predictions!
Regression: Idea
$340,135 $801,353 ??????
Classification: Idea
Is this new shape supposed to be “green” or “red”?
“Green” class “Red” class
Spam Filter
Evaluating Performance
Training Set
Email | Label |
Buy Crypto! | Spam |
You good? | Ham |
Crypto help you. | Spam |
Good Crypto help. | Spam |
My Crypto wallet. | Ham |
Test Set
Email | Label |
You buy Crypto! | Spam |
You need Crypto sir. | Spam |
I hope you are financially healthy. | Ham |
... | ... |
... | ... |
We “train” our spam filter on the training set and evaluate its performance on a test set (data the spam filter does not see during training). This gives an unbiased estimate of how well it performs on new emails.
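To make the train/test idea concrete, here is a minimal Python sketch. The `accuracy` helper and the `keyword_baseline` classifier are hypothetical stand-ins for illustration (this is not the Naive Bayes filter), and only the three test emails that are visible in the slides are used.

```python
# Labelled data copied from the slides (elided test rows are left out).
train_set = [
    ("Buy Crypto!", "spam"),
    ("You good?", "ham"),
    ("Crypto help you.", "spam"),
    ("Good Crypto help.", "spam"),
    ("My Crypto wallet.", "ham"),
]
test_set = [
    ("You buy Crypto!", "spam"),
    ("You need Crypto sir.", "spam"),
    ("I hope you are financially healthy.", "ham"),
]

def keyword_baseline(email):
    """Hypothetical stand-in classifier: flag any email that mentions 'crypto'."""
    return "spam" if "crypto" in email.lower() else "ham"

def accuracy(predict, labelled_emails):
    """Fraction of emails whose predicted label matches the true label."""
    correct = sum(predict(email) == label for email, label in labelled_emails)
    return correct / len(labelled_emails)

train_accuracy = accuracy(keyword_baseline, train_set)
test_accuracy = accuracy(keyword_baseline, test_set)
print(train_accuracy, test_accuracy)
```

Only the test-set number estimates how the filter behaves on emails it has not seen. With three test emails, that estimate is very noisy.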
Spam Filter Task
Training Set
Email | Label |
Buy Crypto! | Spam |
You good? | Ham |
Crypto help you. | Spam |
Good Crypto help. | Spam |
My Crypto wallet. | Ham |
Predict whether this email is spam or ham: “You buy Crypto!”
Emails as word collections
Email | Set of Words in the Email |
SUBJECT: Top Secret Business Venture Dear Sir. First, I must solicit your confidence in this transaction, this is by virtue of its nature as being utterly confidential and top secret… | {top, secret, business, venture, dear, sir, first, I, must, solicit, your, confidence, in, this, transaction, is, by, virtue, of, its, nature, as, being, utterly, confidential, and} |
Hello hello hello there. | {hello, there} |
You buy Crypto! | {you, buy, crypto} |
For simplicity, we will
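A minimal sketch of turning an email into its set of words, matching the rows in the table: duplicates, punctuation, and capitalization are dropped. The function name `word_set` and the regular expression are assumptions made for illustration. Lowercasing every word (including “I”) is also an assumption.

```python
import re

def word_set(email):
    """Lowercase the email, strip punctuation, and keep each word once."""
    # [a-z']+ keeps runs of letters and apostrophes; digits are ignored here.
    return set(re.findall(r"[a-z']+", email.lower()))

print(word_set("Hello hello hello there."))
print(word_set("You buy Crypto!"))
```

Because the result is a set, word order and repetition are lost by design.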
Problem with this model
Consider, for example, the following two emails:
“!!!Lunch free for You!!!!!” (Spam)
“You free for lunch?” (Ham)
One shortcoming of our model is that it makes the same prediction for both, since they contain the same set of words!
Our approach
Naive Bayes Classifier - The Bayes part
Naive Bayes Classifier - What we Calculate
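A sketch of the Bayes step, assuming the only two labels are spam and ham and writing D for the email being classified (the symbol D follows the summary slide). Bayes' theorem, with the law of total probability in the denominator, gives:

```latex
P(\text{spam} \mid D)
  = \frac{P(D \mid \text{spam})\, P(\text{spam})}
         {P(D \mid \text{spam})\, P(\text{spam}) + P(D \mid \text{ham})\, P(\text{ham})}
```

P(ham | D) has the same denominator, with P(D | ham) P(ham) in the numerator.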
Naive Bayes Classifier - The naive part
It is somewhat unlikely that we have the email “You buy Crypto!” in our training data. (In this case we don’t!)
Conditional Independence
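A sketch of the assumption, with x_1, ..., x_n denoting the words in the email's word set (this notation is assumed here). Given the label, the words are treated as conditionally independent, so the likelihood of the whole email factors into per-word terms that can each be estimated from the training set:

```latex
P(x_1, x_2, \dots, x_n \mid \text{spam}) \approx \prod_{i=1}^{n} P(x_i \mid \text{spam}),
\qquad
P(x_1, x_2, \dots, x_n \mid \text{ham}) \approx \prod_{i=1}^{n} P(x_i \mid \text{ham})
```

For “You buy Crypto!” this means P({you, buy, crypto} | spam) ≈ P(you | spam) · P(buy | spam) · P(crypto | spam).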
Why is this Naive?
Words are not independent of each other!
Example
Email | Label |
Buy Crypto! | Spam |
You good? | Ham |
Crypto help you. | Spam |
Good Crypto help. | Spam |
My Crypto wallet. | Ham |
P(spam | “You buy Crypto!”) = 1 (marked as spam, since no ham email contained “buy”)
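The arithmetic behind this result can be checked with a short Python sketch. It assumes each email is reduced to its set of lowercase words and that P(word | label) is estimated as the fraction of emails with that label containing the word. The helper name `score` is hypothetical.

```python
from fractions import Fraction

# Training set from the table, each email reduced to its set of words.
training = [
    ({"buy", "crypto"}, "spam"),
    ({"you", "good"}, "ham"),
    ({"crypto", "help", "you"}, "spam"),
    ({"good", "crypto", "help"}, "spam"),
    ({"my", "crypto", "wallet"}, "ham"),
]

def score(words, label):
    """P(label) times the product of P(word | label), with no smoothing."""
    emails = [w for w, l in training if l == label]
    result = Fraction(len(emails), len(training))      # prior P(label)
    for word in words:
        containing = sum(word in e for e in emails)
        result *= Fraction(containing, len(emails))    # P(word | label)
    return result

email = {"you", "buy", "crypto"}
spam_score = score(email, "spam")  # 3/5 * 1/3 * 1/3 * 3/3
ham_score = score(email, "ham")    # 2/5 * 1/2 * 0/2 * 1/2
posterior_spam = spam_score / (spam_score + ham_score)
print(spam_score, ham_score, posterior_spam)  # prints: 1/15 0 1
```

The single zero factor P(“buy” | ham) = 0 wipes out the whole ham product, which is why the posterior for spam is exactly 1.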
What happens if we get a 0?
P(ham | “You buy Crypto!”) = 0 because P(“buy” | ham) = 0: no ham email in our training data contained the word “buy”.
But does that mean we will never encounter a ham email with the word “buy”?
Laplace smoothing
Pretend in spam emails (training set):
Same for ham emails:
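One common form of Laplace (add-one) smoothing, stated here as an assumption about the intended formula: pretend the training set contains one extra spam email that includes the word x and one extra spam email that does not, and do the same for ham. The estimates become:

```latex
P(x \mid \text{spam}) = \frac{(\text{number of spam emails containing } x) + 1}{(\text{number of spam emails}) + 2},
\qquad
P(x \mid \text{ham}) = \frac{(\text{number of ham emails containing } x) + 1}{(\text{number of ham emails}) + 2}
```

With this estimate no word probability is ever exactly 0 or exactly 1.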
Example
Email | Label |
Buy Crypto! | Spam |
You good? | Ham |
Crypto help you. | Spam |
Good Crypto help. | Spam |
My Crypto wallet. | Ham |
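A sketch of the same example with Laplace smoothing, assuming the add-one form (count + 1) / (number of emails with that label + 2) and leaving the priors P(spam) and P(ham) unsmoothed. The helper name `smoothed_score` is hypothetical.

```python
from fractions import Fraction

# Training set from the table, each email reduced to its set of words.
training = [
    ({"buy", "crypto"}, "spam"),
    ({"you", "good"}, "ham"),
    ({"crypto", "help", "you"}, "spam"),
    ({"good", "crypto", "help"}, "spam"),
    ({"my", "crypto", "wallet"}, "ham"),
]

def smoothed_score(words, label):
    """P(label) times the product of Laplace-smoothed P(word | label)."""
    emails = [w for w, l in training if l == label]
    result = Fraction(len(emails), len(training))            # prior P(label)
    for word in words:
        containing = sum(word in e for e in emails)
        result *= Fraction(containing + 1, len(emails) + 2)  # add-one smoothing
    return result

email = {"you", "buy", "crypto"}
spam_score = smoothed_score(email, "spam")  # 3/5 * 2/5 * 2/5 * 4/5
ham_score = smoothed_score(email, "ham")    # 2/5 * 2/4 * 1/4 * 2/4
posterior_spam = spam_score / (spam_score + ham_score)
print(spam_score, ham_score, posterior_spam)  # prints: 48/625 1/40 384/509
```

The email is still classified as spam, but the posterior is now about 0.754 instead of exactly 1, so a ham email containing “buy” is no longer ruled out.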
Underflow Prevention
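A small Python illustration of why underflow matters. The 200 per-word probabilities of 0.01 are made-up values chosen only to trigger the effect. Multiplying them collapses to zero in floating point, while summing their logarithms stays representable.

```python
import math

# Hypothetical per-word probabilities for a long email.
probabilities = [0.01] * 200

product = 1.0
for p in probabilities:
    product *= p  # shrinks toward 0 and eventually underflows

log_sum = sum(math.log(p) for p in probabilities)  # stays a normal float

print(product)  # 0.0
print(log_sum)  # about -921.03
```

Once both the spam product and the ham product underflow to 0.0 they can no longer be compared, whereas the two log sums can.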
Applying underflow prevention
We will output spam iff:
Denominators are equal and cancel when comparing
Taking the log of two sides:
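A sketch of the derivation, assuming D has word set {x_1, ..., x_n} and that every estimated probability is positive (for example, after Laplace smoothing). The logarithm is strictly increasing, so taking logs preserves the direction of the inequality:

```latex
\begin{aligned}
\text{output spam}
  &\iff P(\text{spam} \mid D) > P(\text{ham} \mid D) \\
  &\iff P(\text{spam}) \prod_{i=1}^{n} P(x_i \mid \text{spam})
        > P(\text{ham}) \prod_{i=1}^{n} P(x_i \mid \text{ham}) \\
  &\iff \log P(\text{spam}) + \sum_{i=1}^{n} \log P(x_i \mid \text{spam})
        > \log P(\text{ham}) + \sum_{i=1}^{n} \log P(x_i \mid \text{ham})
\end{aligned}
```

The second line uses the fact that both posteriors share the same denominator, together with the naive conditional independence assumption.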
Summary: Naive Bayes Algorithm steps
1. TRAINING
1.1. Compute the proportion of emails in the training set that are spam and the proportion that are ham.
1.2. Iterate over the training set and, for each unique word x, count the spam emails that contain x and the ham emails that contain x.
2. TESTING
Iterate over the test set. For each unlabelled email D:
Predict email D as spam if the log-probability computed for spam is greater than the one computed for ham.
Otherwise, predict email D as ham.
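The steps above can be sketched end to end in Python. This is one possible implementation under stated assumptions, not a definitive one: the names `word_set`, `train`, `log_score`, and `predict` are hypothetical, word probabilities use add-one Laplace smoothing with a denominator of (emails with that label + 2), unseen words are smoothed like any other word, and both labels are assumed to appear in the training set.

```python
import math
import re

def word_set(email):
    """Lowercase the email, strip punctuation, and keep each word once."""
    return set(re.findall(r"[a-z']+", email.lower()))

def train(labelled_emails):
    """Step 1: count emails per label and, per word, the emails containing it."""
    email_counts = {"spam": 0, "ham": 0}
    word_counts = {"spam": {}, "ham": {}}
    for email, label in labelled_emails:
        email_counts[label] += 1
        for word in word_set(email):
            word_counts[label][word] = word_counts[label].get(word, 0) + 1
    return email_counts, word_counts

def log_score(words, label, email_counts, word_counts):
    """log P(label) plus the sum of log P(word | label), Laplace-smoothed."""
    total = sum(email_counts.values())
    score = math.log(email_counts[label] / total)
    for word in words:
        count = word_counts[label].get(word, 0)
        score += math.log((count + 1) / (email_counts[label] + 2))
    return score

def predict(email, email_counts, word_counts):
    """Step 2: output spam iff the spam log score exceeds the ham log score."""
    words = word_set(email)
    spam = log_score(words, "spam", email_counts, word_counts)
    ham = log_score(words, "ham", email_counts, word_counts)
    return "spam" if spam > ham else "ham"

training_set = [
    ("Buy Crypto!", "spam"),
    ("You good?", "ham"),
    ("Crypto help you.", "spam"),
    ("Good Crypto help.", "spam"),
    ("My Crypto wallet.", "ham"),
]
email_counts, word_counts = train(training_set)
for email in ["You buy Crypto!", "I hope you are financially healthy."]:
    print(email, "->", predict(email, email_counts, word_counts))
```

Working in log space keeps the comparison stable for long emails, and smoothing guarantees that every logarithm is taken of a positive number.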
Questions?
Comments?
Concerns?