How Likely Would You
Give A Five-Star Review on Yelp?
Getting Your Hands Dirty with scikit-learn
Xun Tang
Yelp’s Mission
Connecting people with great
local businesses.
Yelp Stats
As of Q1 2016
90M
32
70%
102M
Public Dataset
Given user's past reviews on Yelp
When the user writes a review for a business she hasn't reviewed before
How likely will it be a five-star review?
Demo
Visualize the data: Plot distribution of review star ratings
Plot review star ratings by year
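The two plots above can be approximated with a few lines of pandas. This is a minimal sketch on a tiny synthetic table, not the demo's actual notebook code; the column names `stars` and `date` are assumptions about how the review table is laid out.

```python
import pandas as pd

# Tiny synthetic stand-in for the review table (hypothetical data).
# The column names `stars` and `date` are assumptions.
reviews = pd.DataFrame({
    "stars": [5, 4, 5, 1, 3, 5, 2, 4],
    "date": ["2014-03-01", "2014-07-15", "2015-01-20", "2015-02-11",
             "2015-09-30", "2016-01-05", "2016-02-14", "2016-03-03"],
})

# Distribution of star ratings: number of reviews per star value.
star_counts = reviews["stars"].value_counts().sort_index()

# Star ratings by year: one row per year, one column per star value.
reviews["year"] = pd.to_datetime(reviews["date"]).dt.year
stars_by_year = reviews.groupby(["year", "stars"]).size().unstack(fill_value=0)

print(star_counts)
print(stars_by_year)
```

In a Jupyter notebook with matplotlib installed, `star_counts.plot(kind="bar")` and `stars_by_year.plot(kind="bar")` would turn these tables into bar charts.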
Featurize the data
Convert date string to date delta
Convert strings to categorical features
Drop unused features
Join tables to populate the features
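The four featurization steps above might look roughly like this in pandas. It is a sketch under stated assumptions: the three small tables are synthetic, and the column names (`yelping_since`, `city`, `text`, `name`), the derived names (`days_on_yelp`, `city_code`), and the reference date are illustrative choices rather than the demo's own.

```python
import pandas as pd

# Synthetic stand-ins for the review, user, and business tables (hypothetical data).
reviews = pd.DataFrame({
    "user_id": ["u1", "u2", "u1"],
    "business_id": ["b1", "b1", "b2"],
    "stars": [5, 3, 4],
    "text": ["Great!", "Okay.", "Good."],
})
users = pd.DataFrame({
    "user_id": ["u1", "u2"],
    "yelping_since": ["2010-01-01", "2015-06-15"],
    "name": ["A", "B"],
})
businesses = pd.DataFrame({
    "business_id": ["b1", "b2"],
    "city": ["Phoenix", "Las Vegas"],
})

# 1. Convert a date string to a date delta (days before an assumed reference date).
reference_date = pd.Timestamp("2016-01-01")
users["days_on_yelp"] = (
    reference_date - pd.to_datetime(users["yelping_since"])
).dt.days

# 2. Convert strings to categorical features (integer codes per category).
businesses["city_code"] = businesses["city"].astype("category").cat.codes

# 3. Drop columns that will not be used as features.
users = users.drop(columns=["yelping_since", "name"])
businesses = businesses.drop(columns=["city"])
reviews = reviews.drop(columns=["text"])

# 4. Join the tables so each review row carries its user and business features.
features = reviews.merge(users, on="user_id").merge(businesses, on="business_id")
print(features)
```

Integer category codes impose an arbitrary order on the categories; for a linear model, one-hot encoding (for example `pd.get_dummies`) is a common alternative.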
Identify data X and target y
Data X
Target y
Split training set and testing set
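A minimal sketch of separating data X from target y and holding out a test set, assuming a feature table with a `stars` column. The feature names, the random data, and the 80/20 split are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic feature table (hypothetical columns and values).
rng = np.random.default_rng(0)
features = pd.DataFrame({
    "user_average_stars": rng.uniform(1, 5, size=100),
    "business_stars": rng.uniform(1, 5, size=100),
    "stars": rng.integers(1, 6, size=100),
})

# Data X: every column except the review's own star rating.
X = features.drop(columns=["stars"])

# Target y: 1 if the review is five-star, 0 otherwise.
y = (features["stars"] == 5).astype(int)

# Hold out 20% of the rows for testing; the seed makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```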
Model the data: Logistic regression
Logistic regression (LR)
LR: Standardize features
Standardize features by removing the mean and scaling to unit variance
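Standardization as described above is available in scikit-learn as `StandardScaler`. The sketch below uses made-up numbers; the point it illustrates is that the mean and variance are learned from the training set only and then reused on the test set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales (hypothetical values).
X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test = np.array([[2.0, 400.0]])

scaler = StandardScaler()

# Learn per-column mean and standard deviation from the training data, then scale it.
X_train_scaled = scaler.fit_transform(X_train)

# Reuse the training statistics on the test data; do not refit here.
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0))  # approximately 0 for each column
print(X_train_scaled.std(axis=0))   # approximately 1 for each column
```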
LR: Build model & Cross validation
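One way to build the model and cross-validate it is sketched below on synthetic data. Wrapping the scaler and the classifier in a pipeline is a choice made here, not necessarily what the demo does: it lets `cross_val_score` refit the scaler inside every fold. The 5-fold setting and accuracy scoring are likewise assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary problem: the label depends mostly on the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Standardize, then fit logistic regression, as one estimator.
model = make_pipeline(StandardScaler(), LogisticRegression())

# 5-fold cross-validation: one accuracy score per held-out fold.
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(scores)
print(scores.mean())
```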
LR: Evaluation via Confusion Matrix
False Positive: 0.1667
True Negative: 0.8333
False Negative: 0.3643
True Positive: 0.6357
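The rates on this slide sum to 1 within each actual class (true negative + false positive, and false negative + true positive), which suggests a row-normalized confusion matrix. The sketch below shows how such a matrix can be computed with scikit-learn; the labels are toy values, not the demo's predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels (hypothetical): 1 = five-star review, 0 = anything else.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 0])

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred)

# Divide each row by its total so every row sums to 1.
cm_normalized = cm / cm.sum(axis=1, keepdims=True)
tn, fp, fn, tp = cm_normalized.ravel()

print(cm)
print(cm_normalized)
```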
Make prediction with the model
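Since the question is "how likely", the quantity of interest is a probability rather than a hard label. A minimal sketch, assuming a fitted scikit-learn logistic regression; the training data and the two new rows are synthetic.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic training data: the label is 1 when the first feature is positive.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 2))
y_train = (X_train[:, 0] > 0).astype(int)

model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

# Two new (hypothetical) user-business pairs to score.
X_new = np.array([[2.0, 0.0], [-2.0, 0.0]])

# predict_proba returns one column per class; column 1 is P(label == 1).
proba_five_star = model.predict_proba(X_new)[:, 1]

# predict applies a 0.5 threshold to that probability.
predicted_label = model.predict(X_new)

print(proba_five_star)
print(predicted_label)
```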
3 Things...
Jupyter Notebook
Scikit-learn
Yelp public dataset
Questions?
Repo: github.com/xun-tang/pyladies_jupyter_demo
linkedin.com/in/xuntang xun@yelp.com
Yelp Careers: yelp.com/careers/teams/engineering
Yelp Dataset Challenge: yelp.com/dataset_challenge/
Backup Slides
Your academic project, research, or visualizations, submitted by June 30, 2016 = $5,000 prize + $1,000 for publication + $500 for presenting*
*See full terms on website
Academic dataset from 10 cities in 4 countries!