1 of 25

How Likely Would You Give a Five-Star Review on Yelp?

Getting Your Hands Dirty with scikit-learn

Xun Tang

xun@yelp.com

2 of 25

Yelp’s Mission

Connecting people with great local businesses.

3 of 25

Yelp Stats

As of Q1 2016

90M

32

70%

102M

4 of 25

5 of 25

Public Dataset

Given a user's past reviews on Yelp

When the user writes a review for a business she hasn't reviewed before

How likely will it be a five-star review?

6 of 25

Demo

7 of 25

8 of 25

Visualize the data: Plot distribution of review star ratings
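The plot on this slide did not survive extraction. A minimal sketch of how such a distribution could be drawn with pandas and matplotlib follows; the toy `reviews` table and its `stars` column are assumptions for illustration, not the actual dataset schema.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy reviews; real data would come from the Yelp public dataset.
reviews = pd.DataFrame({"stars": [5, 4, 5, 3, 1, 5, 4, 2, 5, 4]})

# Count reviews per star rating, ordered from 1 to 5 stars.
star_counts = reviews["stars"].value_counts().reindex(range(1, 6), fill_value=0)

# Bar chart of the counts; in a Jupyter notebook the figure renders inline.
fig, ax = plt.subplots()
ax.bar(star_counts.index, star_counts.values)
ax.set_xlabel("Star rating")
ax.set_ylabel("Number of reviews")
ax.set_title("Distribution of review star ratings")
```

Reindexing over 1 to 5 keeps a rating on the axis even when no review in the sample uses it.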

9 of 25

10 of 25

Plot review star ratings by year
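One possible way to break the ratings down by year is a cross-tabulation of year against star rating, sketched below. The `date` and `stars` column names and the toy rows are assumptions; the slide's own plot may have been built differently.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy reviews with an ISO date string and a star rating.
reviews = pd.DataFrame({
    "date": ["2013-05-01", "2013-08-12", "2014-01-03",
             "2014-07-19", "2015-02-11", "2015-11-30"],
    "stars": [5, 3, 4, 5, 5, 2],
})

# Extract the calendar year from the date string.
reviews["year"] = pd.to_datetime(reviews["date"]).dt.year

# Rows are years, columns are star ratings, cells are review counts.
stars_by_year = pd.crosstab(reviews["year"], reviews["stars"])

# Stacked bars: one bar per year, one segment per star rating.
ax = stars_by_year.plot(kind="bar", stacked=True)
ax.set_xlabel("Year")
ax.set_ylabel("Number of reviews")
```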

11 of 25

Featurize the data

Convert date string to date delta

  • e.g. business_age

Convert strings to categorical features

  • e.g. noise level: {quiet, loud, very loud}.

Drop unused features

  • e.g. business_name
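The three featurization steps above can be sketched with pandas as follows. The `first_review_date` column, the reference date, and the integer ordering of noise levels are assumptions made for this example; one-hot encoding (for instance with `pd.get_dummies`) would be an equally reasonable choice for the categorical step.

```python
import pandas as pd

# Hypothetical business table; column names are illustrative only.
biz = pd.DataFrame({
    "business_id": ["b1", "b2", "b3"],
    "business_name": ["Cafe A", "Bar B", "Diner C"],
    "first_review_date": ["2010-01-01", "2014-06-15", "2015-12-31"],
    "noise_level": ["quiet", "very loud", "loud"],
})

# 1. Date string -> date delta: days between an assumed reference date
#    and the (assumed) date of the first review.
reference_date = pd.Timestamp("2016-01-01")
biz["business_age"] = (
    reference_date - pd.to_datetime(biz["first_review_date"])
).dt.days

# 2. String -> categorical feature, here encoded as an ordered integer.
noise_order = {"quiet": 0, "loud": 1, "very loud": 2}
biz["noise_level"] = biz["noise_level"].map(noise_order)

# 3. Drop features the model will not use.
biz = biz.drop(columns=["business_name", "first_review_date"])
```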

12 of 25

Join tables to populate the features
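A minimal sketch of the join, assuming the review table carries `user_id` and `business_id` keys that match the user and business tables. All column names and values here are hypothetical.

```python
import pandas as pd

# Hypothetical review, user, and business tables.
review = pd.DataFrame({
    "review_id": ["r1", "r2", "r3"],
    "user_id": ["u1", "u2", "u1"],
    "business_id": ["b1", "b1", "b2"],
    "stars": [5, 3, 4],
})
user = pd.DataFrame({
    "user_id": ["u1", "u2"],
    "user_average_stars": [4.5, 3.0],
})
biz = pd.DataFrame({
    "business_id": ["b1", "b2"],
    "business_stars": [4.0, 3.5],
})

# Left joins keep one row per review and attach user and business features.
features = (
    review
    .merge(user, on="user_id", how="left")
    .merge(biz, on="business_id", how="left")
)
```

Left joins are used so that a review is never dropped when its user or business row is missing; such rows would simply carry missing values.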

13 of 25

Identify data X and target y

Data X

  • All features we gathered from biz, user, review tables

Target y

  • What we predict: Whether the review is Five-star or not
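As a sketch, the binary target can be derived from the star rating and then removed from the feature matrix so the model cannot see the answer. The `features` table below is a hypothetical stand-in for the joined data.

```python
import pandas as pd

# Hypothetical joined feature table with the review's star rating.
features = pd.DataFrame({
    "user_average_stars": [4.5, 3.0, 4.5, 2.0],
    "business_stars": [4.0, 4.0, 3.5, 3.0],
    "stars": [5, 4, 5, 1],
})

# Target y: 1 if the review is five-star, 0 otherwise.
y = (features["stars"] == 5).astype(int)

# Data X: every remaining column; the raw rating is dropped to avoid leakage.
X = features.drop(columns=["stars"])
```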

14 of 25

Split training set and testing set
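A minimal sketch of the split using scikit-learn's `train_test_split`. The synthetic data, the 80/20 ratio, and the stratification are assumptions; the slides do not state which settings were used.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix X and binary target y.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] > 0).astype(int)

# Hold out 20% of the rows for testing; stratify keeps the class balance
# similar in both sets, and random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
```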

15 of 25

Model the data: Logistic regression

Logistic regression (LR)

  • Estimates the probability of a binary response
  • Here we estimate the probability of a review being five-star
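For reference, the standard form of the logistic regression model is shown below. The symbols are notation chosen here, not taken from the slides: x is the feature vector of a review, w the learned weights, b the intercept, and y = 1 means the review is five-star.

```latex
P(y = 1 \mid x) = \sigma(w^\top x + b) = \frac{1}{1 + e^{-(w^\top x + b)}}
```

The output always lies between 0 and 1, so it can be read directly as the estimated probability of a five-star review.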

16 of 25

LR: Standardize features

Standardize features by removing the mean and scaling to unit variance
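This step can be sketched with scikit-learn's `StandardScaler`. The toy arrays are assumptions; the point shown is that the scaler is fitted on the training set only and then reused on the test set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy training and test features on very different scales.
X_train = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
X_test = np.array([[2.0, 300.0]])

# Learn per-column mean and standard deviation from the training set only.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Apply the same training-set statistics to the test set.
X_test_scaled = scaler.transform(X_test)
```

Fitting on the training data alone keeps information about the test set out of the model.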

17 of 25

LR: Build model & Cross validation

18 of 25

LR: Build model & Cross validation
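The code on these slides was not captured in the transcript. One way this step might look is sketched below, with standardization and logistic regression chained in a pipeline and scored by 5-fold cross validation. The synthetic data, the fold count, and the accuracy metric are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data standing in for the review features.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# The pipeline refits the scaler inside each fold, so validation rows
# never influence the scaling statistics.
model = make_pipeline(StandardScaler(), LogisticRegression())

# One accuracy score per fold.
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
mean_accuracy = scores.mean()

# Fit on all available training data once the setup looks reasonable.
model.fit(X, y)
```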

19 of 25

LR: Evaluation via Confusion Matrix

  • True Negative: 0.8333
  • False Positive: 0.1667
  • False Negative: 0.3643
  • True Positive: 0.6357
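The values on this slide sum to 1 within each actual class (0.8333 + 0.1667 and 0.3643 + 0.6357), which is consistent with a confusion matrix normalized by row. A minimal sketch under that assumption follows; the toy labels are made up and do not reproduce the slide's numbers.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels: 1 = five-star review, 0 = anything else.
y_true = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 1, 1, 1, 1, 0, 0])

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred)

# Divide each row by its total so every row sums to 1.
cm_normalized = cm / cm.sum(axis=1, keepdims=True)
tn_rate, fp_rate = cm_normalized[0]
fn_rate, tp_rate = cm_normalized[1]
```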

20 of 25

Make prediction with the model

21 of 25

Make prediction with the model
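The prediction code itself is missing from the transcript. A sketch of how a fitted model could answer the question "how likely is a five-star review?" is shown below, using `predict_proba`. The synthetic data and the pipeline setup are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for featurized reviews.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = make_pipeline(StandardScaler(), LogisticRegression()).fit(X, y)

# Pretend the first three rows are new, unseen (user, business) pairs.
new_reviews = X[:3]

# Column 1 of predict_proba holds the probability of the positive class,
# which in this setup would mean "five-star".
five_star_probability = model.predict_proba(new_reviews)[:, 1]

# Hard 0/1 predictions for the same rows.
predicted_label = model.predict(new_reviews)
```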

22 of 25

3 Things...

Jupyter Notebook

Scikit-learn

Yelp public dataset

23 of 25

Questions?

24 of 25

Backup Slides

25 of 25

  • 77K businesses
  • 55K checkin-sets
  • 566K business attributes
  • 200K photos
  • 2.2M reviews
  • 552K users
  • 3.5M-edge social graph
  • 591K tips

Your academic project, research or visualizations, submitted by June 30, 2016

=

$5,000 prize + $1,000 for publication + $500 for presenting*

*See full terms on website

Academic dataset from 10 cities in 4 countries!