1 of 26

MISUSE OF THE CRAIGSLIST PLATFORM: TEXT ANALYSIS

2 of 26

GOUTHAM KUMAR VEMASANI

HARTHIK MIRIYALA

SRUJANA KALYADAPU

PRATHYUSHA REDDY MIDUDHULA

HARISH DATTA CHITNENI

3 of 26

AGENDA

01 PROBLEM STATEMENT

02 DATASET DESCRIPTION

03 ANALYSIS DESIGN

04 TEXT REPRESENTATION

05 MODELING

06 RECOMMENDATIONS

07 LIMITATIONS & FUTURE SCOPE

4 of 26

DISCLAIMER

  • We regret any explicit content that may be presented in this slideshow
  • The nature of the topic necessitates the inclusion of such information
  • We apologize in advance

5 of 26

Why is this important?

6 of 26

Hub for Solicitation Messages | Against the Law

7 of 26

Reduced Quality of Content | Decreased Demand

8 of 26

Search Trend: “Craigslist” | Decreased Demand

9 of 26

Against the Company Policy

10 of 26

Flagging

11 of 26

Doesn’t Craigslist already have a filtering mechanism? | Flagging

But is it working?

12 of 26

Escaping the flagging

Hookup: 'Hook_Up', 'H~ookup', 'Hook~up', 'HOOkUpp', 'Hooookup'

Girl: 'girLl', 'G~irl', 'g~irl', 'Gurl', 'girl.c', 'girLL'

Looking: 'L00KiNg', 'L~ooking', 'L_ooking', 'looking__'
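To illustrate why such spellings slip through, here is a minimal sketch of an exact-match keyword filter. The blocklist and the function name are hypothetical; the deck does not describe how Craigslist's flagging actually works, so this only shows the general weakness of literal matching.

```python
import re

# Hypothetical blocklist; Craigslist's real filter is not described in the deck.
BLOCKLIST = {"hookup"}

def exact_match_flag(post: str) -> bool:
    """Flag a post only if one of its tokens equals a blocklisted word."""
    tokens = re.findall(r"[a-z]+", post.lower())
    return any(tok in BLOCKLIST for tok in tokens)

# Variants taken from the slide above.
variants = ["Hookup", "Hook_Up", "H~ookup", "Hook~up", "HOOkUpp", "Hooookup"]
for v in variants:
    print(f"{v!r}: flagged={exact_match_flag(v)}")
```

Only the plain spelling is flagged: inserted separators split the word into harmless-looking tokens, and repeated letters change the token itself.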

13 of 26

Manual Classification

Comment | Class
Looking To Hook_Up...!! | 1
Let’s mEet tonight | 1
Anybody wanna HukuPPP | 1
Un~HAppy F need male | 1
L00KiNg s0me crAzy fuN | 1
Need A new friend to hang out with Rogers | 0
young lead guitarist at dt bar North Las Vegas | 0
Short haired woman at the laundromat this morning S. University | 0
Looking for someone to hang out with for weekend Missoula | 0

Data was scraped using the BeautifulSoup package in Python.

SCRAPED

14 of 26

Dataset

Class | Number of Data Points | Percentage
0 | 3458 | 74%
1 | 1194 | 26%
Total | 4652 | 100%
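The percentages can be re-derived from the counts. The short sketch below does so and also computes the accuracy of a hypothetical classifier that always predicts class 0, which is a useful reference point for an imbalanced dataset like this one.

```python
# Class counts as reported on the dataset slide.
counts = {0: 3458, 1: 1194}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial baseline that always predicts the majority class.
majority_accuracy = max(shares.values())

for label, share in shares.items():
    print(f"class {label}: {counts[label]} posts ({share:.1%})")
print(f"total: {total}, majority-class baseline accuracy: {majority_accuracy:.1%}")
```

Because roughly three out of four posts are class 0, plain accuracy would look good even for a model that flags nothing. That may be one reason the deck reports AUC, though the slides do not state the rationale.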

15 of 26

Analysis Design

Preprocessing: Corpus → Tokenization → Lemmatization → Stop words

Text Representation: TF-IDF, Word Embedding, Doc Embedding (GloVe, BERT)

Models: Logistic Regression, Gradient Boosting, Decision Tree Classifier, Random Forest Classifier, KNN Classifier, SVM, Neural Network
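As a rough illustration of the preprocessing and TF-IDF steps, the sketch below uses only the Python standard library. It is not the team's code: the stop-word list is a tiny made-up sample, lemmatization is omitted, and the weighting is the textbook tf × log(N / df) form, whereas library implementations often add smoothing and normalization.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; the deck does not name the list used).
STOP_WORDS = {"a", "to", "the", "for", "with", "at", "of"}

def tokenize(text):
    """Lowercase, keep alphabetic runs, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

def tfidf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({
            term: (count / len(toks)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors

# Post titles adapted from the manual-classification examples in the deck.
posts = [
    "Need a new friend to hang out with",
    "Looking for someone to hang out with for weekend",
    "Young lead guitarist at dt bar",
]
vectors = tfidf(posts)
print(vectors[0])
```

Terms that occur in fewer posts (such as "friend") receive a higher weight than terms shared across posts (such as "hang"). The weights depend only on word counts, not on word order or context.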

16 of 26

Model Performance: AUC

AUC by model (rows) and text embedding (columns)

Model | TF-IDF | GloVe | BERT
Logistic Regression | 0.72 | 0.75 | 0.89
Gradient Boosting | 0.67 | 0.76 | 0.88
Decision Tree Classifier | 0.73 | 0.67 | 0.72
Random Forest | 0.76 | 0.72 | 0.87
KNN Classifier | 0.67 | 0.68 | 0.84
Support Vector Machine | 0.78 | 0.71 | 0.89
Neural Network | 0.75 | 0.74 | 0.90
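For readers unfamiliar with the metric, AUC can be read as the probability that a randomly chosen class-1 post receives a higher score than a randomly chosen class-0 post. The sketch below computes it directly from that definition. The labels and scores are made-up toy values, not results from the deck, and the pairwise loop is chosen for clarity rather than speed.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: two class-1 posts and three class-0 posts with model scores.
labels = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.2, 0.1]
print(f"AUC = {roc_auc(labels, scores):.3f}")
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which gives a scale for reading the table above.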

17 of 26

Model Performance: AUC

18 of 26

TF-IDF vs BERT

Post | TF-IDF | BERT
“young lady prepared to play” | True Positive | True Positive
“To the nice lady who slipped me a 20 at the Goodwill in Crestview Crestview” | False Positive | True Negative
AUC | 0.75 | 0.90

BERT captures context and semantic relationships better than TF-IDF, which results in better classification.

19 of 26

Why does BERT perform well?

  1. Contextual Understanding:
    • GloVe (Global Vectors for Word Representation): Static word embeddings without context awareness.
    • BERT: Contextual embeddings consider the entire sentence, capturing nuanced meanings and improving comprehension.
  2. Handling Polysemy:
    • GloVe: Struggles with words having multiple meanings (polysemy).
    • BERT: Discerns word sense based on surrounding context, addressing polysemy effectively.
  3. Dynamic Embeddings:
    • GloVe: Fixed embeddings irrespective of context.
    • BERT: Dynamically adjusts embeddings based on the context in which the word appears.

BERT: Bidirectional Encoder Representations from Transformers | Introduced by Google in 2018

20 of 26

Modified Analysis Design

Preprocessing: Corpus → Tokenization → Lemmatization → Stop words

Added step: Autocorrect

from autocorrect import Speller

Text Representation: TF-IDF, Word Embedding, Doc Embedding (GloVe, BERT)

Models: Logistic Regression, Gradient Boosting, Decision Tree Classifier, Random Forest Classifier, KNN Classifier, SVM, Neural Network

21 of 26

Autocorrect

Original | Auto-corrected
'AAduLt' | 'adult'
'b~edroom' | 'bedroom'
'b0red' | 'bored'
'D~irty' | 'dirty'
'maale' | 'male'
'f~emale' | 'female'
'Y_oung' | 'young'
'Hook_Up$' | 'hookup'

Model Performance: AUC

Model | With Auto-correction | Without Auto-correction
Neural Network (GloVe embedding) | 0.79 | 0.80

*Other models show a similar trend.
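The deck performs this step with the `autocorrect` package's `Speller`. The sketch below does not call that package. As a stand-in, it approximates the idea with the standard library: strip obfuscation characters, then pick the closest word from a small vocabulary using `difflib`. The vocabulary, the digit mapping, and the 0.8 similarity cutoff are all assumptions made for illustration.

```python
import difflib
import re

# Small illustrative vocabulary; a real spell checker relies on a full dictionary.
VOCAB = ["adult", "bedroom", "bored", "dirty", "male", "female", "young", "hookup"]

# Assumed digit look-alike mapping (only the one seen in the deck's examples).
DIGIT_LOOKALIKES = str.maketrans({"0": "o"})

def clean(token):
    """Lowercase, map digit look-alikes, and drop non-letter characters."""
    token = token.lower().translate(DIGIT_LOOKALIKES)
    return re.sub(r"[^a-z]", "", token)

def correct(token, vocab=VOCAB, cutoff=0.8):
    """Return the closest vocabulary word, or the cleaned token if none is close enough."""
    cleaned = clean(token)
    matches = difflib.get_close_matches(cleaned, vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else cleaned

# Input/output pairs taken from the auto-correction slide.
examples = {
    "AAduLt": "adult", "b~edroom": "bedroom", "b0red": "bored", "D~irty": "dirty",
    "maale": "male", "f~emale": "female", "Y_oung": "young", "Hook_Up$": "hookup",
}
for raw, expected in examples.items():
    print(f"{raw!r} -> {correct(raw)!r} (slide shows {expected!r})")
```

A fixed vocabulary like this only recovers words it already knows, so new evasive spellings of unseen words pass through unchanged.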

22 of 26

How can our model help Craigslist? | Recommendations

Reduce delays in flagging posts, for a better user experience

Continuously update the training corpus, as platform users find new ways to evade the flagging mechanism

Use more sophisticated models, such as LSTM

25 of 26

Limitations & Future Scope

Auto-Correction Statistics:
  • Not all misspelled words (25%) undergo auto-correction.

Manual Classification Challenges:
  • Manual classification introduces the potential for human error.
  • Implementation of predefined classification models.

Scope Expansion:
  • Consider extending the scope to include hate speech and derogatory comments.
  • This would enhance the system's ability to detect and address offensive language.

26 of 26

THANK YOU