1 of 10

Analysis of Starbucks reviews

2 of 10

Exploratory data analysis

Key findings:

- imbalanced target variable

- significant number of missing values

- some useless features

🡪 Drop NaN values and useless features
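The cleanup step above can be sketched with pandas; the column names here are hypothetical stand-ins, not the actual dataset schema.

```python
import pandas as pd

# Toy frame mimicking the review data (column names are assumptions)
df = pd.DataFrame({
    "Review": ["Great coffee!", None, "Slow service."],
    "Rating": [5, 3, 2],
    "Image_Links": ["['no_image']"] * 3,   # constant, uninformative column
})

df = df.drop(columns=["Image_Links"])  # drop useless features
df = df.dropna(subset=["Review"])      # drop rows with missing review text

print(len(df))  # 2 rows remain
```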

3 of 10

Text preprocessing

Amber and LaDonna at the Starbucks on Southwest Parkway are always so warm and welcoming. There is always a smile in their voice when they greet you at the drive-thru

amber ladonna starbuck southwest parkway alway warm welcom alway smile voic greet drivethru

To make text more machine-interpretable:

- convert characters to lowercase

- remove punctuation

- remove line breaks '\n'

- remove stopwords

- reduce words to their stems
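The steps above can be sketched in plain Python. The deck most likely used NLTK (stopword list + PorterStemmer); that is an assumption, so the tiny stopword set and suffix-stripping "stemmer" below are simplified stand-ins.

```python
import re

# Simplified stand-ins for NLTK's stopword list and PorterStemmer (assumption)
STOPWORDS = {"the", "is", "are", "at", "a", "and", "there", "so", "you",
             "in", "their", "when", "they", "on"}

def naive_stem(word):
    # crude stand-in for a Porter-style stemmer: strip a few common suffixes
    for suffix in ("ing", "ers", "er", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    text = text.lower()                    # characters to lowercase
    text = text.replace("\n", " ")         # remove line breaks
    text = re.sub(r"[^\w\s]", "", text)    # remove punctuation
    tokens = [t for t in text.split() if t not in STOPWORDS]  # remove stopwords
    return " ".join(naive_stem(t) for t in tokens)            # stem

print(preprocess("There is always a smile\nin their voice!"))  # alway smile voice
```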

4 of 10

Vectorization – Word2Vec

🡪 a shallow neural-network model that learns word embeddings from a large corpus.

Parameters used:

- vector_size=25

- window=5

- min_count=3

- sg=0 (continuous bag-of-words (CBOW) approach)

Resulting embedding has 1100 features 🡪 apply PCA for dimensionality reduction

5 of 10

Vectorization – TF-IDF

Short for Term Frequency – Inverse Document Frequency.

Take two sentences:

d1: “my new model is better than your model”

d2: “my new model is good”

For word “model” we have:

d1: Tf = 2

d2: Tf= 1

To get the tf-idf scores compute:

d1: 2 * log(2/2) = 0

d2: 1 * log(2/2) = 0

🡪 both scores are 0: “model” appears in every document, so it carries no discriminative information.

The formula to compute it is:

tf-idf = tf × log(N / df)

Where:

- tf = term frequency

- N = # of documents

- df = document frequency
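The worked example above can be reproduced in a few lines of Python:

```python
import math

# tf-idf(t, d) = tf(t, d) * log(N / df(t))
docs = ["my new model is better than your model", "my new model is good"]
tokenized = [d.split() for d in docs]
N = len(tokenized)  # number of documents

def tfidf(term, doc_tokens):
    tf = doc_tokens.count(term)                 # term frequency in this document
    df = sum(term in d for d in tokenized)      # document frequency across corpus
    return tf * math.log(N / df)

print(tfidf("model", tokenized[0]))   # 2 * log(2/2) = 0.0
print(tfidf("better", tokenized[0]))  # 1 * log(2/1) ≈ 0.693
```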

6 of 10

Sentiment Analysis

Map labels to positive vs negative reviews 🡪 binary classification

RandomForestClassifier(

    class_weight='balanced',

    criterion='entropy',

    max_depth=8,

    max_features='sqrt',

    min_samples_leaf=5)

Best model

Cross-validation balanced accuracy score: 0.80
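A minimal sketch of evaluating the configuration above with scikit-learn; the synthetic data from `make_classification` is only a stand-in for the real Word2Vec/TF-IDF features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Imbalanced synthetic data as a stand-in for the vectorized reviews
X, y = make_classification(n_samples=500, n_features=20,
                           weights=[0.8, 0.2], random_state=0)

clf = RandomForestClassifier(
    class_weight="balanced",
    criterion="entropy",
    max_depth=8,
    max_features="sqrt",
    min_samples_leaf=5,
    random_state=0,
)

# Same metric as reported on the slide
scores = cross_val_score(clf, X, y, cv=5, scoring="balanced_accuracy")
print(scores.mean())
```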

Classification report (on test)

7 of 10

Rating prediction

Multiclass classification

Regression

One vs Rest Classifier

Output Code Classifier

Neural Networks
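Two of the multiclass strategies listed above can be sketched with scikit-learn; the base estimator (`LogisticRegression`) is an assumption, not necessarily what was used.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier, OutputCodeClassifier

# Synthetic 5-class problem as a stand-in for 1-5 star ratings
X, y = make_classification(n_samples=300, n_features=20, n_informative=10,
                           n_classes=5, random_state=0)

# One-vs-Rest: fits one binary classifier per class
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

# Output-code: encodes each class as a binary codeword
occ = OutputCodeClassifier(LogisticRegression(max_iter=1000),
                           code_size=2, random_state=0).fit(X, y)

print(len(ovr.estimators_))  # one binary classifier per class → 5
```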

8 of 10

Rating prediction – NN setup

Setup:

  1. Calculate class weights to take class imbalance into account
  2. Set up the model:

- two layers with ReLU activation

- two dropout layers

- softmax layer for prediction

  3. Train!
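The architecture above, sketched as a plain-NumPy forward pass (the deck presumably used Keras/TensorFlow; the layer sizes here are assumptions, and the class-weight step is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0, x)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def dropout(x, rate, training):
    if not training:
        return x  # dropout is disabled at inference time
    mask = rng.random(x.shape) >= rate
    return x * mask / (1 - rate)

# Assumed dimensions: 25-dim embedding in, 5 rating classes out
n_features, n_classes = 25, 5
W1, b1 = rng.normal(size=(n_features, 64)), np.zeros(64)
W2, b2 = rng.normal(size=(64, 32)), np.zeros(32)
W3, b3 = rng.normal(size=(32, n_classes)), np.zeros(n_classes)

x = rng.normal(size=(1, n_features))
h = dropout(relu(x @ W1 + b1), 0.3, training=False)  # layer 1 (ReLU) + dropout
h = dropout(relu(h @ W2 + b2), 0.3, training=False)  # layer 2 (ReLU) + dropout
probs = softmax(h @ W3 + b3)                         # softmax layer for prediction

print(probs.sum())  # ≈ 1.0 (valid probability distribution over 5 classes)
```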

9 of 10

Rating prediction – NN results


Classification report (on test)

10 of 10

Thank you for your attention :)