MISUSE OF THE CRAIGSLIST PLATFORM: TEXT ANALYSIS
GOUTHAM KUMAR VEMASANI
HARTHIK MIRIYALA
SRUJANA KALYADAPU
PRATHYUSHA REDDY MIDUDHULA
HARISH DATTA CHITNENI
AGENDA
01 PROBLEM STATEMENT
02 DATASET DESCRIPTION
03 ANALYSIS DESIGN
04 TEXT REPRESENTATION
05 MODELING
06 RECOMMENDATIONS
07 LIMITATIONS & FUTURE SCOPE
DISCLAIMER
Why is this important?
Hub for Solicitation Messages | Against the Law
Reduced Quality of Content | Decreased Demand
Search Trend: “Craigslist” | Decreased Demand
Against the Company Policy
Flagging
Doesn’t Craigslist already have a filter mechanism? | Flagging
But is it working?
Escaping the flagging
Hookup | |
'Hook_Up' | 'H~ookup' |
'Hook~up' | 'Hook_Up' |
'Hook~up' | 'HOOkUpp' |
'Hooookup' | 'HOOkUpp' |
Girl | |
'girLl' | 'G~irl' |
'g~irl' | 'Gurl' |
'girl.c' | 'G~irl' |
'girLL' | 'g~irl' |
'girLL' | 'g~irl' |
Looking | |
'L00KiNg' | 'L~ooking' |
'L~ooking' | 'L~ooking' |
'L~ooking' | 'L_ooking' |
'L~ooking' | 'looking__' |
'L~ooking' | 'L~ooking' |
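The variants above suggest why exact keyword matching is easy to evade. The following is a minimal sketch, assuming a hypothetical blocklist filter (the `BLOCKLIST` set and `naive_flag` function are illustrative names; how Craigslist's own mechanism works is not described in these slides):

```python
import re

# Hypothetical blocklist; Craigslist's real filter is not described in the slides.
BLOCKLIST = {"hookup", "girl", "looking"}


def naive_flag(text: str) -> bool:
    """Flag a post if any whitespace-delimited token exactly matches the blocklist."""
    tokens = re.findall(r"\S+", text.lower())
    return any(token in BLOCKLIST for token in tokens)


# Spellings taken from the tables above.
variants = ["Hookup", "Hook_Up", "H~ookup", "Hooookup", "L00KiNg", "g~irl", "Gurl"]
missed = [v for v in variants if not naive_flag(v)]
print(missed)
```

Only the plain spelling is caught by this sketch; every obfuscated variant slips past the exact match.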
Manual Classification
Comment | Class |
Looking To Hook_Up...!! | 1 |
Let’s mEet tonight | 1 |
Anybody wanna HukuPPP | 1 |
Un~HAppy F need male | 1 |
L00KiNg s0me crAzy fuN | 1 |
Need A new friend to hang out with Rogers | 0 |
young lead guitarist at dt bar North Las Vegas | 0 |
Short haired woman at the laundromat this morning S. University | 0 |
Looking for someone to hang out with for weekend Missoula | 0 |
Scraped data using the BeautifulSoup package in Python.
SCRAPED
Dataset
Class | Number of Data Points | Percentage |
0 | 3458 | 74% |
1 | 1194 | 26% |
Total | 4652 | 100% |
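The percentages in the table can be reproduced from the class counts. The short sketch below also computes the accuracy of a trivial classifier that always predicts the majority class, which is a useful reference point for an imbalanced dataset like this one (the variable names are illustrative):

```python
# Class counts taken from the dataset table above.
counts = {0: 3458, 1: 1194}
total = sum(counts.values())

shares = {label: n / total for label, n in counts.items()}
for label, share in shares.items():
    print(f"class {label}: {counts[label]} posts, {share:.1%}")

# Accuracy of a trivial classifier that always predicts the majority class.
majority_baseline = max(shares.values())
print(f"majority-class baseline accuracy: {majority_baseline:.1%}")
```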
Analysis Design
Corpus
Preprocessing: Tokenization | Lemmatization | Stop words
Text Representation: TF-IDF | Word Embedding (GloVe) | Doc Embedding (BERT)
Models: Logistic Regression | Gradient Boosting | Decision Tree Classifier | Random Forest Classifier | KNN Classifier | SVM | Neural Network
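The tokenization, stop-word, and TF-IDF steps of the design can be sketched in plain Python. This is an illustration under stated assumptions, not the pipeline used in the project: the stop-word list is a tiny hypothetical one, lemmatization is omitted because it needs an external library, and the IDF uses a common smoothed form, ln((1 + N) / (1 + df)) + 1.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; a real list is much longer).
STOP_WORDS = {"a", "to", "the", "for", "with", "at", "some"}


def tokenize(text: str) -> list[str]:
    """Lowercase, keep alphabetic runs, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def tfidf(docs: list[str]) -> list[dict[str, float]]:
    """Return one {term: weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


# Example posts taken from the manual classification table.
posts = [
    "Looking for someone to hang out with for weekend",
    "Need A new friend to hang out with",
    "Looking To Hook_Up...!!",
]
vectors = tfidf(posts)
print(vectors[2])
```

In the third post, "hook" appears in only one document and so receives a higher weight than "looking", which appears in two. Note also that the obfuscated `Hook_Up` is split into two tokens here, which hints at why obfuscation hurts a bag-of-words representation.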
Model Performance: AUC
Model | TF-IDF | GloVe | BERT |
Logistic Regression | 0.72 | 0.75 | 0.89 |
Gradient Boosting | 0.67 | 0.76 | 0.88 |
Decision Tree Classifier | 0.73 | 0.67 | 0.72 |
Random Forest | 0.76 | 0.72 | 0.87 |
KNN Classifier | 0.67 | 0.68 | 0.84 |
Support Vector Machine | 0.78 | 0.71 | 0.89 |
Neural Network | 0.75 | 0.74 | 0.90 |
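AUC, the metric reported above, can be read as the probability that a randomly chosen class 1 post receives a higher score than a randomly chosen class 0 post. A minimal sketch of that definition follows; the labels and scores are made up for illustration and are not project results.

```python
def auc(labels: list[int], scores: list[float]) -> float:
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    Counts the fraction of (positive, negative) pairs in which the positive
    example has the higher score; ties count as half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up labels and scores, for illustration only.
y_true = [1, 1, 0, 0, 0]
y_score = [0.9, 0.4, 0.6, 0.3, 0.1]
print(auc(y_true, y_score))
```

The pairwise loop is quadratic in the number of posts, which is fine for a sketch; library implementations typically sort the scores instead.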
Model Performance: AUC
TF-IDF vs BERT
Example post | TF-IDF | BERT |
“young lady prepared to play” | True Positive | True Positive |
“To the nice lady who slipped me a 20 at the Goodwill in Crestview Crestview” | False Positive | True Negative |
AUC | 0.75 | 0.90 |
BERT captures context and semantic relationships better than TF-IDF, which results in better classification
Why does BERT perform well?
BERT: Bidirectional Encoder Representations from Transformers | Introduced by Google in 2018
Modified Analysis Design
Corpus
Preprocessing: Tokenization | Lemmatization | Stop words | Autocorrect (new step)
Text Representation: TF-IDF | Word Embedding (GloVe) | Doc Embedding (BERT)
Models: Logistic Regression | Gradient Boosting | Decision Tree Classifier | Random Forest Classifier | KNN Classifier | SVM | Neural Network
from autocorrect import Speller
Autocorrect
Auto-correction | |
'AAduLt' | 'adult' |
'b~edroom' | 'bedroom' |
'b0red' | 'bored' |
'D~irty' | 'dirty' |
'maale' | 'male' |
'f~emale' | 'female' |
'Y_oung' | 'young' |
'Hook_Up$' | 'hookup' |
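The corrections in the table can be approximated without external dependencies. The sketch below is a stand-in for illustration, not the algorithm of the `autocorrect` package: it strips separator symbols, undoes a few digit substitutions, and snaps the result to the closest word in a small hypothetical vocabulary using `difflib` from the standard library. The names `VOCAB`, `LEET`, and `correct` are assumptions.

```python
import difflib
import re

# Small illustrative vocabulary (an assumption); a real speller uses a full dictionary.
VOCAB = ["adult", "bedroom", "bored", "dirty", "male", "female", "young", "hookup"]

# Common digit-for-letter substitutions (an assumption based on the examples).
LEET = str.maketrans({"0": "o", "1": "l", "3": "e"})


def correct(token: str) -> str:
    """Undo digit substitutions, strip non-letters, then snap to the closest vocabulary word."""
    cleaned = re.sub(r"[^a-z]", "", token.lower().translate(LEET))
    matches = difflib.get_close_matches(cleaned, VOCAB, n=1, cutoff=0.8)
    # Leave the cleaned token unchanged when nothing in the vocabulary is close enough.
    return matches[0] if matches else cleaned


for raw in ["AAduLt", "b~edroom", "b0red", "maale", "Hook_Up$"]:
    print(raw, "->", correct(raw))
```

The `cutoff` of 0.8 is a judgment call: a lower value corrects more aggressively but risks rewriting legitimate words into vocabulary entries.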
Model Performance: AUC
Model | With Auto-correction | Without Auto-correction |
Neural Network (GloVe embedding) | 0.79 | 0.80 |
*Similar trend in other models as well
How can our model help Craigslist? | Recommendations
Avoid delays in flagging the posts by Craigslist for better user experience
Constantly update the training corpus, as the platform users find new ways to escape the flagging mechanism
Use more sophisticated models such as LSTM
Limitations & Future Scope
Auto-Correction Statistics:
Not all misspelled words (25%) undergo auto-correction.
Manual Classification Challenges:
Manual classification introduces the potential for human error.
Implementation of predefined classification models.
Scope Expansion:
Consider extending the scope to include Hate Speech and Derogatory Comments.
Enhances the system's ability to detect and address offensive language.
THANK YOU