1 of 55

CSE 312 Spam Filter Intro

The Naive Bayes Classifier

2 of 55

Agenda

  • What is Machine Learning?
  • Featurizing Emails
  • Naive Bayes

3 of 55

Machine Learning in the Real World

4 of 55

ML Pipeline

Data

ML Model

Intelligence

From Wikipedia: “Machine learning is the study of computer algorithms that improve automatically through experience.”

5 of 55

You are a machine!

Number

Shape

“Label”

3

12

5

15

-2

-8

7

21

-4

???

Given examples with correct “labels”, make predictions!

6 of 55

You are a machine!

Number

Shape

“Label”

3

12

5

15

-2

-8

7

21

-4

-16

Given examples with correct “labels”, make predictions!

7 of 55

Regression: Idea

$ 340,135 $801,353 ??????

8 of 55

Classification: Idea

“Green” class “Red” class

9 of 55

Classification: Idea

Is this new shape supposed to be “green” or “red”?

“Green” class “Red” class

10 of 55

Spam Filter

  • In real life, you may have seen a lot of spam emails like this.
  • Building a good spam filter helps protect users from potential scams, unnecessary advertising, or malware links.

11 of 55

Evaluating Performance

Email

Label

Buy Crypto!

Spam

You good?

Ham

Crypto help you.

Spam

Good Crypto help.

Spam

My Crypto wallet.

Ham

Training Set

Email

Label

You buy Crypto!

Spam

You need Crypto sir.

Spam

I hope you are financially healthy.

Ham

...

...

...

...

Test Set

We “train” our spam filter on the training set, and evaluate performance using a test set (data that is unseen by the spam filter initially). This gives an unbiased estimate of performance.

12 of 55

Spam Filter Task

Email

Label

Buy crypto!

Spam

You good?

Ham

Crypto help you.

Spam

Good Crypto help.

Spam

My Crypto wallet.

Ham

Predict whether this

email is spam or ham:

You buy Crypto!

Training Set

13 of 55

Emails as word collections

Email

Set of Words in the Email

SUBJECT: Top Secret Business Venture

Dear Sir. First, I must solicit your confidence in this transaction, this is by virtue of its nature as being utterly confidential and top secret…

{top, secret, business, venture, dear, sir, first, I, must, solicit, your, confidence, in, this, transaction, is, by, virtue, of, its, nature, as, being, utterly, confidencial, and}

For simplicity, we will

  • Ignore Duplicate Words
  • Ignore Punctuation
  • Ignore Casing

14 of 55

Emails as word collections

Email

Set of Words in the Email

SUBJECT: Top Secret Business Venture

Dear Sir. First, I must solicit your confidence in this transaction, this is by virtue of its nature as being utterly confidential and top secret…

{top, secret, business, venture, dear, sir, first, I, must, solicit, your, confidence, in, this, transaction, is, by, virtue, of, its, nature, as, being, utterly, confidencial, and}

Hello hello hello there.

{hello, there}

For simplicity, we will

  • Ignore Duplicate Words
  • Ignore Punctuation
  • Ignore Casing

15 of 55

Emails as word collections

Email

Set of Words in the Email

SUBJECT: Top Secret Business Venture

Dear Sir. First, I must solicit your confidence in this transaction, this is by virtue of its nature as being utterly confidential and top secret…

{top, secret, business, venture, dear, sir, first, I, must, solicit, your, confidence, in, this, transaction, is, by, virtue, of, its, nature, as, being, utterly, confidencial, and}

Hello hello hello there.

{hello, there}

You buy Crypto!

{you, buy, crypto}

For simplicity, we will

  • Ignore Duplicate Words
  • Ignore Punctuation
  • Ignore Casing

16 of 55

Problem with this model

Consider for example the following two emails:

“!!!Lunch free for You!!!!!”

“You free for lunch?”

One shortfalling of our model is that it will make the same prediction for these since they have the same set of words!

Spam

Ham

17 of 55

Our approach

  •  

18 of 55

Our approach

  •  

19 of 55

Naive Bayes Classifier - The bayes part

  •  

20 of 55

Naive Bayes Classifier - What we Calculate

  •  

21 of 55

Naive Bayes Classifier - What we Calculate

  •  

22 of 55

Naive Bayes Classifier - What we Calculate

  •  

23 of 55

Naive Bayes Classifier - The naive part

  •  

24 of 55

Naive Bayes Classifier - The naive part

It is somewhat unlikely that we have the email “You buy Crypto!” in our training data. (In this case we don’t!)

25 of 55

Naive Bayes Classifier - The naive part

  •  

26 of 55

Conditional Independence

  • In normal independence: we say two events A and B are independent of each other

  • In conditional independence, we say two events A and B are independent of each other given another event C

  • What does this mean?
    • If we know whether an event C occurs or not, our knowledge about A tells us nothing about B (and vice versa)

27 of 55

Naive Bayes Classifier - The naive part

  •  

28 of 55

Naive Bayes Classifier - The naive part

  •  

29 of 55

Why is this Naive?

Words are not independent of each other!

30 of 55

Example

  •  

Email

Label

Buy Crypto!

Spam

You good?

Ham

Crypto help you.

Spam

Good Crypto help.

Spam

My Crypto wallet.

Ham

 

 

 

31 of 55

Example

  •  

Email

Label

Buy Crypto!

Spam

You good?

Ham

Crypto help you.

Spam

Good Crypto help.

Spam

My Crypto wallet.

Ham

 

 

 

32 of 55

Example

  •  

Email

Label

Buy Crypto!

Spam

You good?

Ham

Crypto help you.

Spam

Good Crypto help.

Spam

My Crypto wallet.

Ham

 

 

 

33 of 55

Example

  •  

Email

Label

Buy Crypto!

Spam

You good?

Ham

Crypto help you.

Spam

Good Crypto help.

Spam

My Crypto wallet.

Ham

 

 

 

34 of 55

Example

  •  

Email

Label

Buy Crypto!

Spam

You good?

Ham

Crypto help you.

Spam

Good Crypto help.

Spam

My Crypto wallet.

Ham

 

 

 

= 1

35 of 55

Example

  •  

Email

Label

Buy Crypto!

Spam

You good?

Ham

Crypto help you.

Spam

Good Crypto help.

Spam

My Crypto wallet.

Ham

 

 

 

= 1 (Marked as spam since no ham email contained “buy”)

36 of 55

What happens if we got a 0?

P(ham | “You buy Crypto!”) = 0 since P(“buy”| ham) = 0, since no ham email in our training data contained the word ‘buy’.

But does that mean we will never encounter a ham email with word ‘buy’?

37 of 55

Laplace smoothing

Pretend in spam emails (training set):

  • We saw one extra spam email with word
  • We saw one extra spam email without word

38 of 55

Laplace smoothing

Pretend in spam emails (training set):

  • We saw one extra spam email with word
  • We saw one extra spam email without word

39 of 55

Laplace smoothing

Pretend in spam emails (training set):

  • We saw one extra spam email with word
  • We saw one extra spam email without word

Same for ham emails:

40 of 55

Laplace smoothing

Pretend in spam emails (training set):

  • We saw one extra spam email with word
  • We saw one extra spam email without word

Same for ham emails:

41 of 55

Example

  •  

Email

Label

Buy Crypto!

Spam

You good?

Ham

Crypto help you.

Spam

Good Crypto help.

Spam

My Crypto wallet.

Ham

 

 

 

42 of 55

Example

  •  

Email

Label

Buy Crypto!

Spam

You good?

Ham

Crypto help you.

Spam

Good Crypto help.

Spam

My Crypto wallet.

Ham

 

 

 

43 of 55

Example

  •  

Email

Label

Buy Crypto!

Spam

You good?

Ham

Crypto help you.

Spam

Good Crypto help.

Spam

My Crypto wallet.

Ham

 

 

 

44 of 55

Example

  •  

Email

Label

Buy Crypto!

Spam

You good?

Ham

Crypto help you.

Spam

Good Crypto help.

Spam

My Crypto wallet.

Ham

 

 

 

45 of 55

Example

  •  

Email

Label

Buy Crypto!

Spam

You good?

Ham

Crypto help you.

Spam

Good Crypto help.

Spam

My Crypto wallet.

Ham

 

 

 

46 of 55

Underflow Prevention

  • Multiplication of many probabilities, each of which will be between 0 and 1, can result in floating-point underflow. The product will be too small and will result in arithmetic underflow.

47 of 55

Underflow Prevention

  • Multiplication of many probabilities, each of which will be between 0 and 1, can result in floating-point underflow. The product will be too small and will result in arithmetic underflow.
  • Reminder: Log property:

48 of 55

Underflow Prevention

  • Multiplication of many probabilities, each of which will be between 0 and 1, can result in floating-point underflow. The product will be too small and will result in arithmetic underflow.
  • Reminder: Log property:

  • Summing logs of probabilities is better than multiplying probabilities

49 of 55

Applying underflow prevention

We will output spam iff:

50 of 55

Applying underflow prevention

We will output spam iff:

Denominators are equal and cancel when comparing

51 of 55

Applying underflow prevention

We will output spam iff:

52 of 55

Applying underflow prevention

We will output spam iff:

Taking the log of two sides:

53 of 55

Summary: Naive Bayes Algorithm steps

1.2. Iterate over the training set, for each unique word x, count:

  • How many spam emails in the training set contain x
  • How many ham emails in the training set contain x

1. TRAINING

1.1. Compute the proportion of emails in the training set that is spam or ham:

54 of 55

Summary: Naive Bayes Algorithm steps

Predict email D as spam

Otherwise, predict email D as ham

2. TESTING

Iterate over the test set, for each unlabelled email D:

  • Create a set S of n unique words appearing in D:
  • For each word in set S, calculate:

    • Note: If word doesn’t appear in the training set, we still calculate the above probabilities, with:
  • if

55 of 55

Questions?

Comments?

Concerns?