1 of 30

Final Presentation

Challenge Team

2 of 30

Salt baes

3 of 30

Objectives

Elite Users

Yelp designates some users as Elite; people can nominate themselves or other users for Elite status
Yelp then reviews these nominations through its “Elite Council”, using criteria unknown to us

We would like to use the Yelp dataset to analyze Elite user trends
Applications

Provide Yelp a system that can automatically mark users as Elite and take away Elite status from users who have been slacking
Help someone who wants to become an elite user

4 of 30

User Groups

Elite

Elite this year, and will be elite next year

Non-Elite

Non-elite this year, and will be non-elite next year

Slackers

Elite this year, and will be non-elite next year

Potentials

Non-elite this year, and will be elite next year

Group	2009	2010	2011	2012	2013	2014	2015	2016
Elite	5392	8296	10523	14889	15042	16293	20827	22334
Non-Elite (Elite for at least 1 year)	38190	34532	29513	27471	25171	20963	16676	20564
Slackers	1118	2100	2619	2683	4508	3732	4182	6512
Potentials	5004	4776	7119	4661	4983	8716	8019	294
Non-Elite (never Elite)	979728	979728	979278	979278	979278	979278	979278	979278
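
The four group definitions above can be sketched as a small labeling function. This is a minimal illustration, assuming each user's Elite history is available as a set of years; the function name, user IDs, and data are hypothetical, and the sketch does not split Non-Elite users into the "at least 1 year" and "never" rows of the table.

```python
def classify_user(elite_years, year):
    """Label a user for `year` from the set of years in which they held Elite status."""
    elite_now = year in elite_years
    elite_next = (year + 1) in elite_years
    if elite_now and elite_next:
        return "Elite"
    if elite_now:
        return "Slacker"      # Elite this year, non-elite next year
    if elite_next:
        return "Potential"    # non-elite this year, Elite next year
    return "Non-Elite"

# Hypothetical users, for illustration only.
users = {
    "u1": {2012, 2013, 2014},
    "u2": {2013},
    "u3": {2014},
    "u4": set(),
}
labels = {uid: classify_user(years, 2013) for uid, years in users.items()}
print(labels)
```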

5 of 30

Elite, Non-elite detection

Metadata Analysis
Text Analysis
Time Series Analysis

6 of 30

Metadata Analysis

How is the number of tips related to a Yelp user’s Elite status?

We looked for distinguishing characteristics in the tip-giving patterns among different user groups.

Elites and Non-Elites have defined traits, but Potentials and Slackers are difficult to define based solely on tip behavior.

7 of 30

8 of 30

9 of 30

Statistical Analysis

Each cell gives the t-statistic followed by the p-value in parentheses.

Group pair	2009	2010	2011	2012	2013	2014	2015	2016
Elite-Potential	0.7873 (0.4343)	1.7259 (0.0846)	3.1529 (0.0016)	1.9074 (0.0565)	3.2430 (0.0011)	4.6098 (4.1e-06)	5.1559 (2.6e-07)	1.4068 (0.2952)
Elite-Slacker	1.1692 (0.2465)	1.5195 (0.1289)	0.8958 (0.3704)	1.6833 (0.0923)	3.8020 (0.0001)	2.9549 (0.0031)	1.9329 (0.0533)	3.6995 (0.0002)
Elite-Non-Elite	1.4716 (0.1455)	3.9304 (8.9e-05)	4.8191 (1.5e-06)	6.751 (1.6e-11)	7.1942 (7.3e-13)	7.0875 (1.5e-12)	4.3850 (1.2e-05)	5.2215 (1.9e-07)
Potential-Slacker	1.2447 (0.2311)	0.2287 (0.8191)	-1.626 (0.1040)	-0.1304 (0.8962)	0.4338 (0.6644)	-0.1909 (0.8485)	-1.6784 (0.0949)	0.4047 (0.6857)
Potential-Non-Elite	0.7015 (0.4883)	1.6456 (0.1003)	1.7564 (0.0792)	3.5339 (0.0004)	3.0934 (0.0020)	2.6759 (0.0075)	0.8409 (0.4005)	2.1613 (0.0310)
Slacker-Non-Elite	-0.616 (0.5443)	1.1635 (0.2451)	2.9218 (0.0035)	4.4498 (9.1e-06)	2.2711 (0.0232)	1.9021 (0.0573)	1.7178 (0.0861)	2.6972 (0.0070)

Significant at the 10% level

Significant at the 5% level

Significant at the 1% level
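
A pairwise comparison like those in the table can be sketched as follows. The slides do not say which t-test variant was used, so this assumes a two-sample Welch test (unequal variances) and approximates the p-value with a normal distribution, which is only reasonable for large samples. The tip counts are made up.

```python
from math import sqrt
from statistics import NormalDist, mean, variance

def welch_t(a, b):
    """Return (t, approximate two-sided p) for two samples with unequal variances."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    t = (mean(a) - mean(b)) / se
    # Normal approximation to the t distribution (an assumption of this sketch).
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, p

# Made-up tip counts per user, for illustration only.
elite_tips = [5, 7, 6, 9, 8, 7, 10, 6]
non_elite_tips = [1, 0, 2, 1, 3, 0, 1, 2]
t, p = welch_t(elite_tips, non_elite_tips)
print(f"t = {t:.3f}, p = {p:.2g}")
```

A positive t-statistic here means the first group has the higher mean.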

10 of 30

Text Analysis

Reviews reduced to 3 dimensions using t-SNE

Vectorized reviews into both bag-of-words and TF-IDF (term frequency × inverse document frequency) formats
Looked into clustering of different types of users based on their vectorized reviews
Unable to find distinct clusters for the different user classes in either representation
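
The TF-IDF weighting mentioned above can be sketched in a few lines. This is a minimal version using tf × log(N / df) with whitespace tokenization; library implementations typically apply smoothing and normalization, and the slides do not state which variant was used. The reviews are made up.

```python
from collections import Counter
from math import log

def tfidf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

# Made-up reviews, for illustration only.
reviews = [
    "great food great service",
    "slow service",
    "great patio",
]
vectors = tfidf(reviews)
print(vectors[0])
```

Terms that appear in fewer documents receive higher weights, so "food" outweighs "service" in the first review.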

11 of 30

Text Analysis

Non-Elite

Elite

Slacker

Potentials

12 of 30

Time Series Analysis

Studied change in each user’s behavior over time
Extracted both positive and negative anomalies
Clustered patterns of user behavior
Used this metric as a classification feature

13 of 30

Regression Models

SVC:

R^2 0.6695, MAE 0.0869, MSE 0.00809

Kernel Ridge:

R^2 0.6935, MAE 0.02598, MSE 0.00405

Decision Tree (depth = 3):

R^2 0.8711, MAE 0.02120, MSE 0.00143

Each model was tuned using grid search with 3-fold cross-validation
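
For reference, the three metrics reported above can be computed as in the sketch below. The function name and the sample values are hypothetical and unrelated to the team's data.

```python
def regression_metrics(y_true, y_pred):
    """Return R^2, mean absolute error, and mean squared error."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - (mse * n) / ss_tot  # 1 - SS_res / SS_tot
    return {"r2": r2, "mae": mae, "mse": mse}

# Made-up targets and predictions, for illustration only.
y_true = [0.0, 0.5, 1.0, 1.5]
y_pred = [0.1, 0.4, 1.1, 1.4]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```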

14 of 30

Anomaly Detection

Decision Tree model

Detect the negative anomalies

Kernel Ridge

Extract anomalies without over-fitting.

15 of 30

Kernel Density Estimation

Used the local minimum of the density curve as a “turning point”
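
The idea can be sketched as follows: estimate a density with Gaussian kernels, evaluate it on a grid, and report grid points that are lower than both neighbors. The bandwidth, grid, and sample values are assumptions chosen for illustration.

```python
from math import exp, pi, sqrt

def gaussian_kde(samples, bandwidth):
    """Return a function that evaluates a Gaussian kernel density estimate."""
    norm = 1.0 / (len(samples) * bandwidth * sqrt(2 * pi))

    def density(x):
        return norm * sum(exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density

def local_minima(xs, ys):
    """Return grid points whose density is strictly lower than both neighbors."""
    return [xs[i] for i in range(1, len(ys) - 1)
            if ys[i] < ys[i - 1] and ys[i] < ys[i + 1]]

# Made-up bimodal values, for illustration only.
samples = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
density = gaussian_kde(samples, bandwidth=0.5)
grid = [i * 0.1 for i in range(61)]  # 0.0 to 6.0 in steps of 0.1
turning_points = local_minima(grid, [density(x) for x in grid])
print(turning_points)
```

With two well-separated groups of values, the single local minimum falls in the gap between them.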

16 of 30

Dynamic Time Warping Clustering

Dynamically find the alignment that minimizes the overall Euclidean distance.

Split the data into 24-month (2-year) windows and calculated DTW distances to cluster user patterns
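
A minimal sketch of the DTW distance between two one-dimensional series, using the absolute difference as the point-wise cost. The monthly counts are made up, and the clustering step that would consume these distances is not shown.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

# Made-up monthly review counts; series_b is series_a shifted by one month.
series_a = [0, 1, 3, 5, 3, 1, 0, 0]
series_b = [0, 0, 1, 3, 5, 3, 1, 0]
series_c = [5, 5, 5, 5, 5, 5, 5, 5]
print(dtw_distance(series_a, series_b))
print(dtw_distance(series_a, series_c))
```

Because DTW can stretch the time axis, the shifted series aligns far better with the original than the flat series does, even though a point-by-point comparison would penalize the shift.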

17 of 30

Questions?

18 of 30

Subteam: Boss Llama

19 of 30

1st Topic: Survival Analysis

Predict if a business is open on a certain date
Features

Average star rating
Days between reviews
Last activity
Linear regression

Achieved ~94% accuracy

Unable to increase past 94%
Unable to validate

20 of 30

2nd Topic: Competition analysis

Key Question: Can we see competition in reviews?
Performance Metric: Time taken to reach threshold count
Goal:
Construct a model of competition

Model the macroscopic trend of review intensity
Apply natural language processing to categorize businesses
Identify city centers or “clusters”

Fit regression models to find the competition relationships

21 of 30

Features Considered

8 cities, 9857 Restaurants
Review count and intensity
Geographical information
Survival information

Is_open
First Review Date

Restaurant categories

Topics in the Reviews
Business Attributes

22 of 30

Fit a Poisson Process

Good model of activity or “arrivals”

Rate function has effect of “smoothing” the volatility

How to fit a rate function
Identify inflection points in plateauing process
Use regression methods to fit a function
Find expected number of days to reach threshold
Rate function has linear, quadratic, seasonal (weekly, bi-annual, annual) components
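
As a rough sketch of the last two points, the code below defines a rate function with trend and seasonal terms and accumulates it day by day until the expected review count reaches a threshold. All coefficients are hypothetical, only weekly and annual seasonality are included, and the first day on which the cumulative expected count crosses the threshold is used as a simple proxy for the expected time to reach it.

```python
from math import cos, pi

def rate(day, base=0.2, linear=0.001, quadratic=0.0,
         weekly_amp=0.05, annual_amp=0.05):
    """Hypothetical review rate (reviews per day) with trend and seasonal terms."""
    trend = base + linear * day + quadratic * day ** 2
    seasonal = (weekly_amp * cos(2 * pi * day / 7)
                + annual_amp * cos(2 * pi * day / 365))
    return max(trend + seasonal, 0.0)  # a Poisson rate cannot be negative

def days_to_threshold(threshold, max_days=3650, **params):
    """First day on which the cumulative expected review count reaches `threshold`."""
    expected = 0.0
    for day in range(max_days):
        expected += rate(day, **params)
        if expected >= threshold:
            return day + 1
    return None  # threshold not reached within max_days

print(days_to_threshold(50))
```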

23 of 30

24 of 30

Geographical “clusters”

25 of 30

Identifying Clusters

26 of 30

Topic Modeling

Trained LDA and NMF models on reviews
Analyzed top words in each topic
Generated probability distributions among all the topics
Compared topics generated from topic modeling with the categories generated on Yelp

27 of 30

Topic Top Words

Other Characteristics of Restaurants

Types of Restaurants

28 of 30

Identifying User Clusters

Steps:

Review Vectors: word2vec → Reviews
User Vectors
Dimension Reduction
Fit Gaussian Mixture Models → Classify users
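
The slides do not say how review vectors were combined into user vectors. One common choice, assumed here purely for illustration, is to average each user's review vectors element-wise. The embeddings and user IDs below are made up.

```python
def mean_vector(vectors):
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

# Hypothetical 3-dimensional review embeddings keyed by user ID.
review_vectors = {
    "u1": [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0]],
    "u2": [[0.5, 0.5, 0.5]],
}
user_vectors = {uid: mean_vector(vecs) for uid, vecs in review_vectors.items()}
print(user_vectors)
```

The resulting user vectors would then go through dimension reduction and mixture-model fitting, which are not shown.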

29 of 30

Result

Latitude and longitude have very high importance
Certain categories are much more influential in reaching the threshold
More interaction terms need to be added

30 of 30

The End

Thank you!