1 of 55

CSE 312 Spam Filter Intro

The Naive Bayes Classifier

2 of 55

Agenda

What is Machine Learning?
Featurizing Emails
Naive Bayes

3 of 55

Machine Learning in the Real World

4 of 55

ML Pipeline

Data

ML Model

Intelligence

From Wikipedia: “Machine learning is the study of computer algorithms that improve automatically through experience.”

5 of 55

You are a machine!

Number	Shape	“Label”
3		12
5		15
-2		-8
7		21
-4		???

Given examples with correct “labels”, make predictions!

6 of 55

You are a machine!

Number	Shape	“Label”
3		12
5		15
-2		-8
7		21
-4		-16

Given examples with correct “labels”, make predictions!

7 of 55

Regression: Idea

$ 340,135 $801,353 ??????

8 of 55

Classification: Idea

“Green” class “Red” class

9 of 55

Classification: Idea

Is this new shape supposed to be “green” or “red”?

“Green” class “Red” class

10 of 55

Spam Filter

In real life, you may have seen a lot of spam emails like this.

Building a good spam filter helps protect users from potential scams, unnecessary advertising, or malware links.

11 of 55

Evaluating Performance

Email	Label
Buy Crypto!	Spam
You good?	Ham
Crypto help you.	Spam
Good Crypto help.	Spam
My Crypto wallet.	Ham

Training Set

Email	Label
You buy Crypto!	Spam
You need Crypto sir.	Spam
I hope you are financially healthy.	Ham
...	...
...	...

Test Set

We “train” our spam filter on the training set, and evaluate performance using a test set (data that is unseen by the spam filter initially). This gives an unbiased estimate of performance.

12 of 55

Spam Filter Task

Email	Label
Buy crypto!	Spam
You good?	Ham
Crypto help you.	Spam
Good Crypto help.	Spam
My Crypto wallet.	Ham

Predict whether this

email is spam or ham:

You buy Crypto!

Training Set

13 of 55

Emails as word collections

Email	Set of Words in the Email
SUBJECT: Top Secret Business Venture Dear Sir. First, I must solicit your confidence in this transaction, this is by virtue of its nature as being utterly confidential and top secret…	{top, secret, business, venture, dear, sir, first, I, must, solicit, your, confidence, in, this, transaction, is, by, virtue, of, its, nature, as, being, utterly, confidencial, and}

For simplicity, we will

Ignore Duplicate Words
Ignore Punctuation
Ignore Casing

14 of 55

Emails as word collections

Email	Set of Words in the Email
SUBJECT: Top Secret Business Venture Dear Sir. First, I must solicit your confidence in this transaction, this is by virtue of its nature as being utterly confidential and top secret…	{top, secret, business, venture, dear, sir, first, I, must, solicit, your, confidence, in, this, transaction, is, by, virtue, of, its, nature, as, being, utterly, confidencial, and}
Hello hello hello there.	{hello, there}

For simplicity, we will

Ignore Duplicate Words
Ignore Punctuation
Ignore Casing

15 of 55

Emails as word collections

Email	Set of Words in the Email
SUBJECT: Top Secret Business Venture Dear Sir. First, I must solicit your confidence in this transaction, this is by virtue of its nature as being utterly confidential and top secret…	{top, secret, business, venture, dear, sir, first, I, must, solicit, your, confidence, in, this, transaction, is, by, virtue, of, its, nature, as, being, utterly, confidencial, and}
Hello hello hello there.	{hello, there}
You buy Crypto!	{you, buy, crypto}

For simplicity, we will

Ignore Duplicate Words
Ignore Punctuation
Ignore Casing

16 of 55

Problem with this model

Consider for example the following two emails:

“!!!Lunch free for You!!!!!”

“You free for lunch?”

One shortfalling of our model is that it will make the same prediction for these since they have the same set of words!

Spam

Ham

17 of 55

Our approach

18 of 55

Our approach

19 of 55

Naive Bayes Classifier - The bayes part

20 of 55

Naive Bayes Classifier - What we Calculate

21 of 55

Naive Bayes Classifier - What we Calculate

22 of 55

Naive Bayes Classifier - What we Calculate

23 of 55

Naive Bayes Classifier - The naive part

24 of 55

Naive Bayes Classifier - The naive part

It is somewhat unlikely that we have the email “You buy Crypto!” in our training data. (In this case we don’t!)

25 of 55

Naive Bayes Classifier - The naive part

26 of 55

Conditional Independence

In normal independence: we say two events A and B are independent of each other

In conditional independence, we say two events A and B are independent of each other given another event C

What does this mean?

If we know whether an event C occurs or not, our knowledge about A tells us nothing about B (and vice versa)

27 of 55

Naive Bayes Classifier - The naive part

28 of 55

Naive Bayes Classifier - The naive part

29 of 55

Why is this Naive?

Words are not independent of each other!

30 of 55

Example

Email	Label
Buy Crypto!	Spam
You good?	Ham
Crypto help you.	Spam
Good Crypto help.	Spam
My Crypto wallet.	Ham

31 of 55

Example

Email	Label
Buy Crypto!	Spam
You good?	Ham
Crypto help you.	Spam
Good Crypto help.	Spam
My Crypto wallet.	Ham

32 of 55

Example

Email	Label
Buy Crypto!	Spam
You good?	Ham
Crypto help you.	Spam
Good Crypto help.	Spam
My Crypto wallet.	Ham

33 of 55

Example

Email	Label
Buy Crypto!	Spam
You good?	Ham
Crypto help you.	Spam
Good Crypto help.	Spam
My Crypto wallet.	Ham

34 of 55

Example

Email	Label
Buy Crypto!	Spam
You good?	Ham
Crypto help you.	Spam
Good Crypto help.	Spam
My Crypto wallet.	Ham

= 1

35 of 55

Example

Email	Label
Buy Crypto!	Spam
You good?	Ham
Crypto help you.	Spam
Good Crypto help.	Spam
My Crypto wallet.	Ham

= 1 (Marked as spam since no ham email contained “buy”)

36 of 55

What happens if we got a 0?

P(ham | “You buy Crypto!”) = 0 since P(“buy”| ham) = 0, since no ham email in our training data contained the word ‘buy’.

But does that mean we will never encounter a ham email with word ‘buy’?

37 of 55

Laplace smoothing

Pretend in spam emails (training set):

We saw one extra spam email with word
We saw one extra spam email without word

38 of 55

Laplace smoothing

Pretend in spam emails (training set):

We saw one extra spam email with word
We saw one extra spam email without word

39 of 55

Laplace smoothing

Pretend in spam emails (training set):

We saw one extra spam email with word
We saw one extra spam email without word

Same for ham emails:

40 of 55

Laplace smoothing

Pretend in spam emails (training set):

We saw one extra spam email with word
We saw one extra spam email without word

Same for ham emails:

41 of 55

Example

Email	Label
Buy Crypto!	Spam
You good?	Ham
Crypto help you.	Spam
Good Crypto help.	Spam
My Crypto wallet.	Ham

42 of 55

Example

Email	Label
Buy Crypto!	Spam
You good?	Ham
Crypto help you.	Spam
Good Crypto help.	Spam
My Crypto wallet.	Ham

43 of 55

Example

Email	Label
Buy Crypto!	Spam
You good?	Ham
Crypto help you.	Spam
Good Crypto help.	Spam
My Crypto wallet.	Ham

44 of 55

Example

Email	Label
Buy Crypto!	Spam
You good?	Ham
Crypto help you.	Spam
Good Crypto help.	Spam
My Crypto wallet.	Ham

45 of 55

Example

Email	Label
Buy Crypto!	Spam
You good?	Ham
Crypto help you.	Spam
Good Crypto help.	Spam
My Crypto wallet.	Ham

46 of 55

Underflow Prevention

Multiplication of many probabilities, each of which will be between 0 and 1, can result in floating-point underflow. The product will be too small and will result in arithmetic underflow.

47 of 55

Underflow Prevention

Multiplication of many probabilities, each of which will be between 0 and 1, can result in floating-point underflow. The product will be too small and will result in arithmetic underflow.
Reminder: Log property:

48 of 55

Underflow Prevention

Multiplication of many probabilities, each of which will be between 0 and 1, can result in floating-point underflow. The product will be too small and will result in arithmetic underflow.
Reminder: Log property:

Summing logs of probabilities is better than multiplying probabilities

49 of 55

Applying underflow prevention

We will output spam iff:

50 of 55

Applying underflow prevention

We will output spam iff:

Denominators are equal and cancel when comparing

51 of 55

Applying underflow prevention

We will output spam iff:

52 of 55

Applying underflow prevention

We will output spam iff:

Taking the log of two sides:

53 of 55

Summary: Naive Bayes Algorithm steps

1.2. Iterate over the training set, for each unique word x, count:

How many spam emails in the training set contain x
How many ham emails in the training set contain x

1. TRAINING

1.1. Compute the proportion of emails in the training set that is spam or ham:

54 of 55

Summary: Naive Bayes Algorithm steps

Predict email D as spam

Otherwise, predict email D as ham

2. TESTING

Iterate over the test set, for each unlabelled email D:

Create a set S of n unique words appearing in D:
For each word in set S, calculate:

Note: If word doesn’t appear in the training set, we still calculate the above probabilities, with:

if

55 of 55

Questions?

Comments?

Concerns?