1 of 30

Spam or Ham?

109502503 楊沛蓉

109502007 張原鳴

109502518 陳洛鈞

2 of 30

Motivation

Out of curiosity, we want to understand how Gmail classifies various emails as spam.

3 of 30

Problem

  • What types of email are classified as spam?
  • Which words appear most often in spam?

4 of 30

Goal

  • Distinguish spam from non-spam emails by their content, and find the best type of classifier by its final accuracy rate.
  • Find the 50 English words that appear most frequently in the emails each classifier predicts as spam.

5 of 30

Research Outline

  • Find a suitable dataset
  • Preprocess the dataset
  • Separate the dataset into training and testing data (7:3)
  • Feed the data into different models
  • Run text analysis on the predictions
  • Show the results

6 of 30

Dataset

Kaggle — Email Spam Dataset

(https://www.kaggle.com/datasets/nitishabharathi/email-spam-dataset?select=completeSpamAssassin.csv)

Contains 6,045 valid records

7 of 30

Preprocess

Delete the unusable records from the original dataset

8 of 30

Separate

The ratio of training and testing data

7 : 3

9 of 30

Model

  1. Naïve Bayes
  2. Logistic Regression
  3. Decision Tree
  4. Support Vector Machine (SVM)

10 of 30

Model – Naïve Bayes

11 of 30

Model – Logistic Regression

12 of 30

Model – Decision Tree

13 of 30

Model – Support Vector Machine (SVM)

14 of 30

Analysis & Result

  • Run text analysis on each classifier's results with CountVectorizer, counting how often every word appears across all emails predicted as spam
  • Review the final results with the classification report, ROC curve, and confusion matrix from scikit-learn

15 of 30

Method – 1

  • Delete all links and special symbols
  • Convert all English letters to lower case
  • Reduce all English words to their base form
    • ex: organizes, organizing, organized ⇒ organize
  • Filter out all stop words, which carry no real meaning
    • ex: a, the, on
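
The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the regular expressions, the `clean_email` helper, and the tiny `BASE_FORMS` lookup (a hypothetical stand-in for a real lemmatizer) are assumptions, and the stop-word list is the one bundled with scikit-learn.

```python
import re

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Hypothetical stand-in for a real lemmatizer: inflected form -> base form.
BASE_FORMS = {"organizes": "organize", "organizing": "organize", "organized": "organize"}

URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
SYMBOL_PATTERN = re.compile(r"[^a-z0-9\s]")


def clean_email(text: str) -> str:
    """Apply the four cleaning steps to one email body."""
    text = URL_PATTERN.sub(" ", text)       # delete links
    text = text.lower()                     # convert to lower case
    text = SYMBOL_PATTERN.sub(" ", text)    # delete special symbols
    tokens = [BASE_FORMS.get(tok, tok) for tok in text.split()]        # base forms
    tokens = [tok for tok in tokens if tok not in ENGLISH_STOP_WORDS]  # stop words
    return " ".join(tokens)


cleaned = clean_email("The team ORGANIZED a sale on http://example.com/win !!!")
print(cleaned)
```

In practice the lookup table would be replaced by a proper lemmatizer from an NLP library; the table only keeps the sketch self-contained.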

16 of 30

Method – 2

  • Separate the dataset into training and testing data
    • The ratio of training and testing data = 7 : 3
  • Train the different models on the training data and run them on the testing data
  • Collect the indices of the testing emails that each model predicts as spam
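
A possible sketch of these steps with scikit-learn is shown below. The toy `emails` frame, the bag-of-words features, the stratified split, and the specific model variants (`MultinomialNB`, a linear-kernel `SVC`) are assumptions made for illustration; the slides do not state which settings were used.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for the cleaned dataset; 1 = spam, 0 = ham.
emails = pd.DataFrame({
    "text": ["free money click"] * 10 + ["meeting agenda notes"] * 10,
    "label": [1] * 10 + [0] * 10,
})

# 7 : 3 split of training and testing data (stratification is an assumption).
train_df, test_df = train_test_split(
    emails, test_size=0.3, stratify=emails["label"], random_state=0
)

# Assumed feature representation: word counts learned from the training data.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_df["text"])
X_test = vectorizer.transform(test_df["text"])

models = {
    "Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(kernel="linear"),
}

spam_indices = {}
for name, model in models.items():
    model.fit(X_train, train_df["label"])
    predictions = model.predict(X_test)
    # Keep the original row labels of the testing emails predicted as spam.
    spam_indices[name] = test_df.index[predictions == 1]
```

`spam_indices[name]` can then be used to look up the text of the emails each model flagged as spam.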

17 of 30

Method – 3

  • Set up a counter limited to the 2,000 most frequent English words
  • Use this counter and the spam indices to count how often every word appears across all predicted spam
  • Vectorize the contents of the spam and convert the result into a Pandas data structure in Python
  • Delete the numbers at the front of each line of data

18 of 30

Method – 4

  • Print the results of running each model on the testing data
    • classification report
    • ROC curve
    • confusion matrix
    • top 50 keywords that appear most often in the predicted spam
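
As a rough illustration of the first three outputs, the scikit-learn metric functions can be called as below. The label, prediction, and score arrays are made-up toy values, not the project's results.

```python
import numpy as np
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    roc_auc_score,
    roc_curve,
)

# Toy stand-ins: true labels, hard predictions, and spam scores (1 = spam).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_score = np.array([0.1, 0.2, 0.3, 0.6, 0.9, 0.8, 0.7, 0.4])

# Precision, recall, and F1 per class, plus overall accuracy.
print(classification_report(y_true, y_pred, target_names=["ham", "spam"]))

# Rows are true classes, columns are predicted classes.
matrix = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = matrix.ravel()

# Points of the ROC curve and the area under it.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)
print(matrix, auc)
```

The `fpr` and `tpr` arrays can be passed to any plotting library to draw the ROC curve.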

19 of 30

Naïve Bayes

Classification report

ROC curve

20 of 30

Naïve Bayes

Confusion Matrix

Top 50 keywords occurring in spam

21 of 30

Logistic Regression

Classification report

ROC curve

22 of 30

Logistic Regression

Confusion Matrix

Top 50 keywords occurring in spam

23 of 30

Decision Tree

Classification report

ROC curve

24 of 30

Decision Tree

Confusion Matrix

Top 50 keywords occurring in spam

25 of 30

Support Vector Machine (SVM)

Classification report

ROC curve

26 of 30

Support Vector Machine (SVM)

Confusion Matrix

Top 50 keywords occurring in spam

27 of 30

Result

Accuracy rate: Logistic Regression > SVM > Decision Tree > Naïve Bayes

Area under the ROC curve: SVM > Logistic Regression > Decision Tree > Naïve Bayes

Precision rate: SVM > Logistic Regression > Decision Tree > Naïve Bayes

Recall rate: Logistic Regression > Naïve Bayes > Decision Tree > SVM

True positives (TP): Logistic Regression > Naïve Bayes > Decision Tree > SVM

True negatives (TN): SVM > Logistic Regression > Decision Tree > Naïve Bayes

Overall ranking across the above comparison criteria:

Logistic Regression (21) > SVM (14) > Decision Tree (12) > Naïve Bayes (10)
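
The slide does not say how the scores in parentheses were computed. One plausible scheme, assumed here purely for illustration, gives 4 points for first place down to 1 point for fourth place on each of the six criteria. That scheme reproduces the totals for Logistic Regression, Decision Tree, and Naïve Bayes, but it yields 17 rather than 14 for SVM, so the scoring actually used may differ. The order of the ranking is the same either way.

```python
# Orderings copied from the six criteria above (best first).
rankings = {
    "Accuracy": ["Logistic Regression", "SVM", "Decision Tree", "Naive Bayes"],
    "AUC": ["SVM", "Logistic Regression", "Decision Tree", "Naive Bayes"],
    "Precision": ["SVM", "Logistic Regression", "Decision Tree", "Naive Bayes"],
    "Recall": ["Logistic Regression", "Naive Bayes", "Decision Tree", "SVM"],
    "True positives": ["Logistic Regression", "Naive Bayes", "Decision Tree", "SVM"],
    "True negatives": ["SVM", "Logistic Regression", "Decision Tree", "Naive Bayes"],
}

# Assumed scoring: 4 points for 1st place, 3 for 2nd, 2 for 3rd, 1 for 4th.
scores = {}
for ordering in rankings.values():
    for position, model in enumerate(ordering):
        scores[model] = scores.get(model, 0) + (len(ordering) - position)

overall = sorted(scores, key=scores.get, reverse=True)
print(scores)
print(overall)
```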

28 of 30

The top 5 words in spam for each model

Naïve Bayes: free, email, business, money, get

Logistic Regression: free, email, business, money, click

Decision Tree: free, email, business, money, click

SVM: email, free, business, money, mail

The most common top words:

free, money, email, business, click

29 of 30

Extension

We also wanted to find out which vocabulary in spam emails clusters together. We applied the elbow method to find the optimal k for the k-means algorithm, then fitted k-means to the data to obtain the cluster centers. From the cluster centers, we extracted the 20 words with the highest scores, and then plotted a bar chart for each of the k clusters showing its 15 most representative words.

30 of 30

Conclusion

Taking all criteria into account, Logistic Regression gives the best overall results. However, Logistic Regression does not perform best on every evaluation criterion, so if only one particular metric matters, another classifier may perform better.

Through this final project, we learned which words spam emails usually contain. In the future, if we receive an email that was not classified as spam, we can judge for ourselves whether it is spam and avoid being scammed.