Spam or Ham?
109502503 楊沛蓉
109502007 張原鳴
109502518 陳洛鈞
Motivation
Out of curiosity, we want to understand how Gmail classifies various emails as spam.
Problem
Goal
Research Outline
Dataset
Kaggle — Email Spam Dataset
(https://www.kaggle.com/datasets/nitishabharathi/email-spam-dataset?select=completeSpamAssassin.csv)
Contains 6,045 valid records
Preprocess
Remove unusable entries from the original dataset
Separate
The ratio of training and testing data
7 : 3
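A minimal sketch of the 7 : 3 split described above, using a tiny synthetic DataFrame in place of the Kaggle CSV (the column names "Body" and "Label" are assumptions about the real file, not confirmed by the slides):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for completeSpamAssassin.csv (1 = spam, 0 = ham)
df = pd.DataFrame({
    "Body": ["win money now", "meeting at noon", "free offer click",
             "lunch tomorrow?", "cheap pills", "project update",
             "claim your prize", "see you later", "buy now", "report attached"],
    "Label": [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
})
df = df.dropna(subset=["Body"])  # drop rows with missing text

# 70% training, 30% testing, preserving the spam/ham ratio
X_train, X_test, y_train, y_test = train_test_split(
    df["Body"], df["Label"], test_size=0.3, random_state=42,
    stratify=df["Label"])

print(len(X_train), len(X_test))  # 7 3
```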
Model
Model – Naïve Bayes
Model – Logistic Regression
Model – Decision Tree
Model – Support Vector Machine (SVM)
Analysis & Result
Method – 1
Method – 2
Method – 3
Method – 4
Naive Bayes
Classification report
ROC curve
Naive Bayes
Confusion Matrix
Top 50 keywords occurring in spam
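The three evaluation views used for each model (classification report, ROC/AUC, confusion matrix) can be produced as sketched here; the labels and scores are toy values, and the ROC curve plot itself is omitted:

```python
import numpy as np
from sklearn.metrics import (classification_report, confusion_matrix,
                             roc_auc_score)

# Toy ground truth, hard predictions, and predicted spam probabilities
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 0, 1, 1])
y_score = np.array([0.9, 0.2, 0.8, 0.4, 0.1, 0.3, 0.7, 0.6])

print(classification_report(y_true, y_pred, target_names=["ham", "spam"]))
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
print("AUC:", roc_auc_score(y_true, y_score))  # area under the ROC curve
```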
Logistic Regression
Classification report
ROC curve
Logistic Regression
Confusion Matrix
Top 50 keywords occurring in spam
Decision Tree
Classification report
ROC curve
Decision Tree
Confusion Matrix
Top 50 keywords occurring in spam
Support Vector Machine (SVM)
Classification report
ROC curve
Support Vector Machine (SVM)
Confusion Matrix
Top 50 keywords occurring in spam
Result
Accuracy rate: Logistic Regression > SVM > Decision Tree > Naïve Bayes
Area under the ROC curve: SVM > Logistic Regression > Decision Tree > Naïve Bayes
Precision rate: SVM > Logistic Regression > Decision Tree > Naïve Bayes
Recall rate: Logistic Regression > Naïve Bayes > Decision Tree > SVM
True positives(TP): Logistic Regression > Naïve Bayes > Decision Tree > SVM
True negatives(TN): SVM > Logistic Regression > Decision Tree > Naïve Bayes
Ranking considering the above comparison criteria:
Logistic Regression (21) > SVM (14) > Decision Tree (12) > Naïve Bayes (10)
The top 5 words in spam for each model
Naïve Bayes: free, email, business, money, get
Logistic Regression: free, email, business, money, click
Decision Tree: free, email, business, money, click
SVM: email, free, business, money, mail
The top common words:
free, money, email, business, click
Extension
We want to find which kinds of words in spam emails cluster together. We therefore applied the elbow method to find the optimal k for the k-means algorithm, then fit the data with k-means to obtain the cluster centers. From each cluster center we can extract the top 20 highest-scoring words, and then plot a bar chart for the k clusters showing the 15 most representative words in each cluster.
Conclusion
Considering all criteria together, Logistic Regression gives the best overall results. However, since Logistic Regression does not perform best on every evaluation metric, other classifiers may be preferable when only a specific metric matters.
Through this final project, we learned which words typically appear in the content of spam. In the future, even if we receive an email that is not classified as spam, we can judge it ourselves and avoid being scammed.