1 of 29

ADHD, Aspergers, Depression, OCD, and PTSD Classification

Kefan Yu, Xinyu Li, Yixiang Cheng, Zezhen Liu

2 of 29

Background

Intro to Project

EDA & Objective

Cleaning Process

Model: Naive Bayes

Model: Doc2Vec

Model: BERT

Demo

Conclusion

3 of 29

Background

Many people look for answers and solutions in Reddit communities covering a wide range of topics.

In the mental health communities specifically, people often try to work out the name of a mental disorder, such as ADHD or OCD, from other Reddit users' posts.


4 of 29

Therefore, we decided to build a classification model based on these subreddits. When people are uncertain about which of these disorders their experience resembles, they no longer need to spend time reading through thousands of posts.

In this project, we are going to focus on ADHD, Aspergers, Depression, OCD, and PTSD.

5 of 29

Data Gathering

ADHD

OCD

Aspergers

Depression

PTSD

6 of 29

EDA

7 of 29

Wordcloud

8 of 29

One word cloud per subreddit: ADHD, Aspergers, Depression, OCD, PTSD
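A minimal sketch of how one such panel can be generated with the wordcloud package (the per-subreddit CSV file and the column name are assumptions, not the project's actual layout):

    import pandas as pd
    import matplotlib.pyplot as plt
    from wordcloud import WordCloud

    df = pd.read_csv("adhd.csv")                      # assumed per-subreddit CSV
    text = " ".join(df["body"].dropna().astype(str))  # pool all post bodies

    wc = WordCloud(width=800, height=400, background_color="white").generate(text)
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.title("ADHD")
    plt.show()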

9 of 29

Objective

We classify each post into one of five categories, ADHD, Aspergers, Depression, OCD, or PTSD, based on the user's description.

10 of 29

Cleaning Process

The dataset we use is reddit_mental_health_post from Hugging Face

  • It contains several CSV files, one per mental disorder, with roughly 30,000 data points each

  1. Reduce dimensions (keep only the columns we need)
  2. Merge title and body into one document
  3. Merge the five files into one
  4. Drop missing values (see the pandas sketch below)
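A minimal pandas sketch of these steps (the file names and column names are assumptions):

    import pandas as pd

    # Assumed layout: one CSV per subreddit, each with 'title' and 'body' columns.
    files = {"adhd": "adhd.csv", "aspergers": "aspergers.csv",
             "depression": "depression.csv", "ocd": "ocd.csv", "ptsd": "ptsd.csv"}

    frames = []
    for label, path in files.items():
        df = pd.read_csv(path, usecols=["title", "body"])   # reduce dimensions
        df = df.dropna(subset=["title", "body"])            # drop missing values
        df["document"] = df["title"] + " " + df["body"]     # merge title and body
        df["label"] = label
        frames.append(df[["document", "label"]])

    data = pd.concat(frames, ignore_index=True)             # merge five files into one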

11 of 29

Missing Value

12 of 29

Naive Bayes

13 of 29

Naive Bayes classifiers are a family of simple "probabilistic classifiers" based on applying Bayes' theorem with strong (naive) independence assumptions between the features.
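Written out, the classifier picks the class that maximizes the class prior times the product of per-word likelihoods, with the naive assumption that words are independent given the class:

    \hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(w_i \mid c)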

  1. Data pre-processing: removing punctuation, removing stopwords, tokenization, stemming, lemmatization
  2. Bag-of-words (BOW) features
  3. Train a Multinomial NB classification model (see the sketch below).
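A minimal scikit-learn sketch of this pipeline, reusing the data frame from the cleaning sketch (CountVectorizer covers tokenization, punctuation, and stopword removal; stemming and lemmatization would need a custom preprocessor, e.g. from NLTK):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline
    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(
        data["document"], data["label"], test_size=0.25, random_state=42)

    model = make_pipeline(
        CountVectorizer(lowercase=True, stop_words="english"),  # BOW features
        MultinomialNB())
    model.fit(X_train, y_train)
    print("accuracy:", model.score(X_test, y_test))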

14 of 29

Baseline Result

15 of 29

Doc2Vec

16 of 29

Doc2Vec + Classifier

  • Data pre-processing
  • Use Doc2Vec to obtain a dense vector for each document, trained to predict the words that document contains.
  • Use a logistic regression classifier to assign the text to the ADHD, Aspergers, Depression, OCD, or PTSD category (see the sketch below).
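A minimal gensim sketch of this pipeline (the hyperparameters are assumptions; data is the frame from the cleaning sketch):

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument
    from sklearn.linear_model import LogisticRegression

    # Tag each tokenized document with its index.
    tagged = [TaggedDocument(words=doc.lower().split(), tags=[i])
              for i, doc in enumerate(data["document"])]

    d2v = Doc2Vec(vector_size=100, window=5, min_count=2, epochs=20)
    d2v.build_vocab(tagged)
    d2v.train(tagged, total_examples=d2v.corpus_count, epochs=d2v.epochs)

    X = [d2v.dv[i] for i in range(len(tagged))]       # dense document vectors
    clf = LogisticRegression(max_iter=1000).fit(X, data["label"])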

17 of 29

Results

18 of 29

BERT Fine tuning

19 of 29

Data pre-processing for BERT

  • bert-base-cased gives 768 features for each token
  • Each token representation carries information from the entire sentence

Therefore, we don't want the text to be too long.

For example:

<cls>Benefits for autistic people in the UK&What, if any, benefits are we entitled to?<sep>PTSD<sep>
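A sketch of how such an input can be built and tokenized with Hugging Face transformers (the title/body split here mirrors the example above; the tokenizer inserts [CLS] and [SEP] itself, and truncation enforces the length limit):

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
    title = "Benefits for autistic people in the UK"
    body = "What, if any, benefits are we entitled to?"

    # Passing two segments yields [CLS] title [SEP] body [SEP].
    enc = tokenizer(title, body, max_length=128, truncation=True)
    print(tokenizer.convert_ids_to_tokens(enc["input_ids"])[:8])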

20 of 29

BERT tokenizer

  • The BERT tokenizer splits text into subword features
  • The pure-Python BERT tokenizer uses a double for loop

Thus, the BERT tokenizer is very slow (a workaround is sketched below).
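One common fix (an assumption on our part, not something the slides adopt) is the Rust-backed fast tokenizer, which avoids the Python loops when encoding in batches:

    from transformers import BertTokenizerFast

    fast_tok = BertTokenizerFast.from_pretrained("bert-base-cased")
    # Batch encoding runs in compiled Rust rather than a Python double for loop.
    batch = fast_tok(list(data["document"]), max_length=128,
                     truncation=True, padding="max_length")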

21 of 29

BERT Fine tuning

  • Data pre-processing
  • Use the BERT encoder to obtain a dense representation of each document.
  • Attach a linear classification head to assign the text to the ADHD, Aspergers, Depression, OCD, or PTSD category.

We copy the pretrained BERT encoder (plus its hidden layers), take the encoder output at the [CLS] position (bert.encoder output[:, 0, :], a 768-dimensional vector), and feed it into an nn.Linear(768, 5) classification head, as sketched below.
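A minimal PyTorch sketch of this architecture: BERT plus one linear layer over the [CLS] position.

    import torch.nn as nn
    from transformers import BertModel

    class BertClassifier(nn.Module):
        def __init__(self, num_labels=5):
            super().__init__()
            self.bert = BertModel.from_pretrained("bert-base-cased")
            self.head = nn.Linear(768, num_labels)  # 768 hidden units -> 5 classes

        def forward(self, input_ids, attention_mask):
            out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
            cls = out.last_hidden_state[:, 0, :]    # encoder output at [CLS]
            return self.head(cls)                   # class logits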

22 of 29

BERT Fine tuning—V1

model_name: bert-base-cased

--train_file: ftr.csv (50% of the data)

--validation_file: fva.csv (25%)

--test_file: fte.csv (25%)

--do_train

--do_predict

--max_seq_length: 128

--train_batch_size: 32

--learning_rate: 2e-5

--num_train_epochs: 3 (this is not a big dataset)
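These flags map onto the Hugging Face Trainer roughly as follows (a sketch assuming a Trainer-based script; max_seq_length is applied at tokenization time rather than here):

    from transformers import TrainingArguments

    args = TrainingArguments(
        output_dir="bert_v1",
        per_device_train_batch_size=32,  # --train_batch_size 32
        learning_rate=2e-5,              # --learning_rate 2e-5
        num_train_epochs=3,              # small dataset, so few epochs
    )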

23 of 29

BERT Fine tuning—V1

We got an accuracy of 75% and an F1 score of 73%.

24 of 29

BERT Fine tuning—V2

model_name: bert-base-cased

--train_file: ftr.csv (50% of the data)

--validation_file: fva.csv (25%)

--test_file: fte.csv (25%)

--do_train

--do_predict

--max_seq_length: 256

--train_batch_size: 32

--learning_rate: 2e-5

--num_train_epochs: 3 (with more budget we could go to 20+)

<<Money is all you need>>

25 of 29

BERT Fine tuning—V2

We still got an accuracy of 75%, with the F1 score improving slightly to 74%.

V1 vs. V2 results

26 of 29

27 of 29

Conclusion & Future work

  1. We achieved good accuracy in predicting OCD, PTSD, Depression, ADHD, and Aspergers.
  2. Use more computational resources for larger datasets in the future.
  3. Use professionally labelled datasets from academic sources to improve accuracy.
  4. Try other models from the BERT family.

28 of 29

Any Questions?

29 of 29

Thank You!