1 of 30

Final Presentation

Challenge Team

2 of 30

Salt baes

3 of 30

Objectives

  • Elite Users
    • Yelp has Elite users: people can nominate themselves or other users for Elite status
    • Yelp's “Elite Council” then reviews these nominations, using criteria unknown to us
  • We would like to use the Yelp dataset to analyze Elite user trends
  • Applications
    • Provide Yelp with a system that automatically grants Elite status and revokes it from users who have been slacking
    • Help users who want to become Elite

4 of 30

User Groups

  • Elite
    • Elite this year, and will be Elite next year
  • Non-Elite
    • Non-Elite this year, and will be Non-Elite next year
  • Slackers
    • Elite this year, and will be Non-Elite next year
  • Potentials
    • Non-Elite this year, and will be Elite next year

Group                             2009     2010     2011     2012     2013     2014     2015     2016
Elite                             5392     8296     10523    14889    15042    16293    20827    22334
Non-Elite (at least 1 yr Elite)   38190    34532    29513    27471    25171    20963    16676    20564
Slackers                          1118     2100     2619     2683     4508     3732     4182     6512
Potentials                        5004     4776     7119     4661     4983     8716     8019     294
Non-Elite (never Elite)           979728   979728   979278   979278   979278   979278   979278   979278
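The four group definitions on this slide can be written as a small labeling function. This is a minimal sketch: the function name and the assumption that each user's Elite years are available as a set of integers are ours, not taken from the project code.

```python
def classify_user(elite_years, year):
    """Label a user for `year` from the set of years they held Elite status."""
    elite_now = year in elite_years
    elite_next = (year + 1) in elite_years
    if elite_now and elite_next:
        return "Elite"
    if elite_now:
        return "Slacker"      # Elite this year, not next year
    if elite_next:
        return "Potential"    # not Elite this year, Elite next year
    return "Non-Elite"


# Hypothetical user who was Elite in 2012 and 2013 only
for y in (2011, 2012, 2013, 2014):
    print(y, classify_user({2012, 2013}, y))
```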

5 of 30

Elite, Non-elite detection

  • Metadata Analysis
  • Text Analysis
  • Time Series Analysis

6 of 30

Metadata Analysis

How is the number of tips associated with a Yelp user’s Elite status?

We looked for distinguishing characteristics in the tip-giving patterns among different user groups.

Elites and Non-Elites have defined traits, but Potentials and Slackers are difficult to define based solely on tip behavior.

7 of 30

8 of 30

9 of 30

Statistical Analysis

Pair                 Statistic    2009     2010     2011     2012     2013     2014     2015     2016
Elite-Potential      T-statistic  0.7873   1.7259   3.1529   1.9074   3.2430   4.6098   5.1559   1.4068
                     p-value      0.4343   0.0846   0.0016   0.0565   0.0011   4.1e-06  2.6e-07  0.2952
Elite-Slacker        T-statistic  1.1692   1.5195   0.8958   1.6833   3.8020   2.9549   1.9329   3.6995
                     p-value      0.2465   0.1289   0.3704   0.0923   0.0001   0.0031   0.0533   0.0002
Elite-Non-Elite      T-statistic  1.4716   3.9304   4.8191   6.751    7.1942   7.0875   4.3850   5.2215
                     p-value      0.1455   8.9e-05  1.5e-06  1.6e-11  7.3e-13  1.5e-12  1.2e-05  1.9e-07
Potential-Slacker    T-statistic  1.2447   0.2287   -1.626   -0.1304  0.4338   -0.1909  -1.6784  0.4047
                     p-value      0.2311   0.8191   0.1040   0.8962   0.6644   0.8485   0.0949   0.6857
Potential-Non-Elite  T-statistic  0.7015   1.6456   1.7564   3.5339   3.0934   2.6759   0.8409   2.1613
                     p-value      0.4883   0.1003   0.0792   0.0004   0.0020   0.0075   0.4005   0.0310
Slacker-Non-Elite    T-statistic  -0.616   1.1635   2.9218   4.4498   2.2711   1.9021   1.7178   2.6972
                     p-value      0.5443   0.2451   0.0035   9.1e-06  0.0232   0.0573   0.0861   0.0070

Cells on the original slide were highlighted by significance level: 10%, 5%, and 1%.
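A pairwise comparison of this kind can be sketched as a two-sample t-test on per-user tip counts. The slides do not say which variant was used, so this assumes Welch's unequal-variance form, and it approximates the p-value with a normal distribution, which is only reasonable for large groups. The sample data is made up for illustration.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def welch_t_test(a, b):
    """Return (t, approximate two-sided p) for two independent samples."""
    # Standard error of the difference in means, without pooling variances
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    t = (mean(a) - mean(b)) / se
    # Normal approximation to the t distribution (assumption: large samples)
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, p


# Hypothetical tip counts per user for two groups
elite_tips = [12, 15, 9, 20, 14, 11]
non_elite_tips = [2, 0, 3, 1, 4, 2]
t_stat, p_value = welch_t_test(elite_tips, non_elite_tips)
print(f"t = {t_stat:.4f}, p = {p_value:.2g}")
```

A positive t-statistic means the first group has the higher mean, which matches the sign convention of the pair names in the table (for example, Elite minus Potential).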

10 of 30

Text Analysis

Reviews reduced to 3 dimensions using t-SNE

  • Vectorized reviews in both bag-of-words and TF-IDF (term frequency × inverse document frequency) formats
  • Looked for clustering of the different user types based on their vectorized reviews
  • Unable to find distinct clusters for the different classes in either representation
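The TF-IDF weighting mentioned above can be sketched in plain Python. This uses the textbook form (relative term frequency times log of inverse document frequency) and whitespace tokenization; both are assumptions, and library implementations typically use smoothed variants.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Number of documents each term appears in
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical reviews
reviews = ["great tacos great salsa", "great service"]
for vec in tfidf(reviews):
    print(vec)
```

A term that appears in every review (here "great") gets weight zero, which is the point of the inverse document frequency factor.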

11 of 30

Text Analysis

[Figure panels: Non-Elite, Elite, Slacker, Potentials]

12 of 30

Time Series Analysis

  • Studied changes in each user’s behavior over time
  • Extracted both positive and negative anomalies
  • Clustered patterns of user behavior
  • Used the result as a classification feature

13 of 30

Regression Models

SVC:

R^2 0.6695, MAE 0.0869, MSE 0.00809

Kernel Ridge:

R^2 0.6935, MAE 0.02598, MSE 0.00405

Decision Tree (depth = 3):

R^2 0.8711, MAE 0.02120, MSE 0.00143

  • Each model was tuned with grid search and 3-fold cross-validation
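Grid search with 3-fold cross-validation can be sketched as follows. To keep the example self-contained it swaps in a one-feature ridge regression with a closed-form fit rather than the models listed above; the function names, the interleaved fold split, and the data are all illustrative assumptions.

```python
def fit_ridge_1d(xs, ys, alpha):
    """Closed-form slope for y ~ w * x with an L2 penalty alpha."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)


def cv_mse(xs, ys, alpha, k=3):
    """Mean squared error averaged over k interleaved folds."""
    fold_errors = []
    for fold in range(k):
        train = [i for i in range(len(xs)) if i % k != fold]
        test = [i for i in range(len(xs)) if i % k == fold]
        w = fit_ridge_1d([xs[i] for i in train], [ys[i] for i in train], alpha)
        fold_errors.append(sum((ys[i] - w * xs[i]) ** 2 for i in test) / len(test))
    return sum(fold_errors) / k


def grid_search(xs, ys, alphas, k=3):
    """Return the alpha with the lowest cross-validated MSE."""
    return min(alphas, key=lambda alpha: cv_mse(xs, ys, alpha, k))


# Hypothetical data following y = 2x exactly, so no penalty is best
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [2 * x for x in xs]
print(grid_search(xs, ys, alphas=[0.0, 1.0, 10.0]))
```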

14 of 30

Anomaly Detection

  • Decision Tree model
    • Detect the negative anomalies

  • Kernel Ridge
    • Extract anomalies without over-fitting.
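One common way to turn a fitted model into an anomaly detector is to flag points whose residual is far from the typical residual. The sketch below assumes that approach; the threshold of k standard deviations, the function name, and the constant "fitted" values standing in for a model's predictions are illustrative, not the project's actual rule.

```python
from statistics import mean, pstdev


def find_anomalies(observed, fitted, k=2.0):
    """Split indices into positive and negative anomalies by residual size."""
    residuals = [o - f for o, f in zip(observed, fitted)]
    centre, spread = mean(residuals), pstdev(residuals)
    positive = [i for i, r in enumerate(residuals) if r - centre > k * spread]
    negative = [i for i, r in enumerate(residuals) if centre - r > k * spread]
    return positive, negative


# Hypothetical monthly review counts against a flat fitted value of 10
observed = [10, 10, 10, 20, 10, 10, 10, 0, 10, 10]
fitted = [10] * len(observed)
print(find_anomalies(observed, fitted))
```

Positive anomalies are months with unusually high activity and negative anomalies are months with unusually low activity, relative to what the model predicts.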

15 of 30

Kernel Density Estimation

  • Used a local minimum of the density curve as a “turning point”
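The turning-point idea can be sketched with a Gaussian kernel density estimate evaluated on a grid, then scanned for interior points lower than both neighbours. The bandwidth, grid resolution, and sample values are assumptions chosen for illustration.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function that evaluates a Gaussian kernel density estimate."""
    norm = len(samples) * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples) / norm

    return density


def local_minima(density, lo, hi, steps=200):
    """Return grid points where the density is lower than both neighbours."""
    grid = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    values = [density(x) for x in grid]
    return [grid[i] for i in range(1, steps)
            if values[i] < values[i - 1] and values[i] < values[i + 1]]


# Hypothetical bimodal sample: one group near 1, another near 5
samples = [1.0, 1.2, 0.8, 1.1, 5.0, 5.2, 4.9, 5.1]
kde = gaussian_kde(samples, bandwidth=0.5)
print(local_minima(kde, 0.0, 6.0))
```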

16 of 30

Dynamic Time Warping Clustering

  • Dynamically find the alignment that minimizes overall Euclidean distance.

  • Split the data into 24-month (2-year) windows and calculated DTW distances to cluster user patterns
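A minimal sketch of the DTW distance using the standard dynamic-programming recurrence, plus a helper that cuts a monthly series into 24-month windows. For one-dimensional series the pointwise Euclidean distance reduces to an absolute difference. The helper name and the choice to drop a trailing partial window are assumptions.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences."""
    inf = float("inf")
    # cost[i][j] = best alignment cost of a[:i] against b[:j]
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[len(a)][len(b)]


def split_windows(series, size=24):
    """Cut a series into consecutive full windows, dropping any remainder."""
    return [series[i:i + size] for i in range(0, len(series) - size + 1, size)]


# The second series is the first with one value repeated, so warping absorbs it
print(dtw_distance([1, 2, 3], [1, 1, 2, 3]))
print(len(split_windows(list(range(50)))))
```

The resulting pairwise distances can then be passed to any clustering method that accepts a precomputed distance matrix.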

17 of 30

Questions?

18 of 30

Subteam: Boss Llama

19 of 30

1st Topic: Survival Analysis

  • Predict if a business is open on a certain date
  • Features
    • Average star rating
    • Days between reviews
    • Last activity
  • Model
    • Linear regression
  • Achieved ~94% accuracy
    • Unable to increase past 94%
    • Unable to validate

20 of 30

2nd Topic: Competition analysis

  • Key Question: Can we see competition in reviews?
  • Performance Metric: Time taken to reach threshold count
  • Goal: construct a model of competition
    1. Model the macroscopic trend of review intensity
    2. Apply natural language processing to categorize businesses
    3. Identify city centers or “clusters”
  • Fit regression models to find the competition relationships
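The performance metric, time taken to reach a threshold review count, can be sketched as below. It assumes the clock starts at a business's first review and that reviews are available as a list of dates; both are assumptions about the setup.

```python
from datetime import date


def days_to_threshold(review_dates, threshold):
    """Days from the first review until the threshold-th review, or None."""
    ordered = sorted(review_dates)
    if len(ordered) < threshold:
        return None  # the business never reached the threshold
    return (ordered[threshold - 1] - ordered[0]).days


# Hypothetical review dates for one restaurant
dates = [date(2015, 1, 1), date(2015, 2, 1), date(2015, 1, 11)]
print(days_to_threshold(dates, 3))
print(days_to_threshold(dates, 5))
```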

21 of 30

Features Considered

  • 8 cities, 9,857 restaurants
  • Review count and intensity
  • Geographical information
  • Survival information
    • Is_open
    • First Review Date
  • Restaurant categories
    • Topics in the Reviews
    • Business Attributes

22 of 30

Fit a Poisson Process

  1. Good model of activity or “arrivals”
    1. The rate function has the effect of “smoothing” the volatility
  2. How to fit a rate function
  3. Identify inflection points in the plateauing process
  4. Use regression methods to fit a function
  5. Find the expected number of days to reach the threshold
  6. The rate function has linear, quadratic, and seasonal (weekly, bi-annual, annual) components
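For a Poisson process with a time-varying rate, the expected number of arrivals by day T is the sum (or integral) of the rate up to T. A simple approximation to the days needed to reach a threshold is the first day on which that cumulative expectation reaches the threshold. The sketch below assumes this approximation and uses a made-up rate with a linear trend and a weekly cycle; the coefficients are not fitted values from the project.

```python
import math


def rate(day, base=0.5, trend=0.001, weekly_amp=0.2):
    """Hypothetical daily review rate: linear trend plus a weekly cycle."""
    return base + trend * day + weekly_amp * math.sin(2 * math.pi * day / 7)


def days_until_expected_count(threshold, rate_fn=rate, max_days=10_000):
    """First day on which the cumulative expected count reaches threshold."""
    expected = 0.0
    for day in range(max_days):
        expected += max(rate_fn(day), 0.0)  # a rate cannot be negative
        if expected >= threshold:
            return day + 1
    return None  # threshold not reached within max_days


print(days_until_expected_count(50))
```

Quadratic or annual terms can be added to `rate` in the same way as the weekly term.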

23 of 30

24 of 30

Geographical “clusters”

25 of 30

Identifying Clusters

26 of 30

Topic Modeling

  • Trained LDA and NMF models on reviews
  • Analyzed top words in each topic
  • Generated probability distributions among all the topics
  • Compared topics generated from topic modeling with the categories generated on Yelp

27 of 30

Topic Top Words

Other Characteristics of Restaurants

Types of Restaurants

28 of 30

Identifying User Clusters

Steps:

  1. Review Vectors - word2vec → Reviews
  2. User Vectors
  3. Dimension Reduction
  4. Fit Gaussian Mixture Models → Classify users

29 of 30

Result

  • Latitude and longitude have very high importance
  • Certain categories are much more influential in reaching the threshold
  • More interaction terms need to be added

30 of 30

The End

Thank you!