1 of 24

NYCe TAXI!

2 of 24

PROJECT GOALS

  • Maximize tip benefits for taxi drivers

  • Build a generic algorithm that can be extended to any city

3 of 24

1. DATASET

Source, Size, Relevant Fields

4 of 24

SOURCE

  • NYC Taxi and Limousine Commission

  • The data comes from several vendors who manage the meter/GPS systems in the cabs

Source: http://publish.illinois.edu/dbwork/open-data/

5 of 24

SIZE

  • Merged from 2 datasets, each spanning 4 years (2010 to 2013)
    • Trips - 116 GB (uncompressed)
    • Fares - 75 GB (uncompressed)

  • Subset Chosen
    • Year 2013
    • Randomly sampled 10% of each month
    • 200,000 records drawn from the sample above

Source: http://publish.illinois.edu/dbwork/open-data/
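
The subset selection above can be sketched as a two-stage random sample. This is a minimal illustration, not the project's actual script: the function name sample_by_month, the record layout (dicts with a pickup_datetime string such as "2013-01-15 08:30:00"), and the fixed seed are all assumptions.

```python
import random
from collections import defaultdict


def sample_by_month(records, fraction=0.10, final_size=200_000, seed=0):
    """Sample `fraction` of each month's records, then draw `final_size` from that pool."""
    rng = random.Random(seed)

    # Group records by "YYYY-MM", taken from the start of the timestamp string.
    by_month = defaultdict(list)
    for rec in records:
        by_month[rec["pickup_datetime"][:7]].append(rec)

    # Stage 1: random sample of each month.
    pool = []
    for month in sorted(by_month):
        rows = by_month[month]
        k = max(1, round(len(rows) * fraction))
        pool.extend(rng.sample(rows, k))

    # Stage 2: final subset, capped when the pool is smaller than requested.
    return rng.sample(pool, min(final_size, len(pool)))


# Toy input: 100 made-up records per month of 2013.
records = [
    {"pickup_datetime": f"2013-{m:02d}-01 00:00:00", "id": i}
    for m in range(1, 13)
    for i in range(100)
]
subset = sample_by_month(records, final_size=50)
```

With the full data the first stage would more likely run as a streaming pass over the files, since the merged data does not fit in 16 GB of RAM.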

6 of 24

RELEVANT FIELDS

Unique Fields

  • medallion
  • hack_license
  • vendor_id

Trip Fields

  • pickup_datetime
  • dropoff_datetime
  • trip_time
  • trip_distance
  • pickup_latitude
  • pickup_longitude
  • dropoff_latitude
  • dropoff_longitude
  • passenger_count

Fare Fields

  • fare_amount
  • tip_amount
  • tolls_amount
  • total_amount
  • mta_tax
  • payment_type
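
Since the trip and fare fields come from two separate datasets, they have to be joined per ride. A minimal sketch follows; the join key (the three identifier fields plus pickup_datetime) is an assumption, as is the helper name merge_trips_fares, and the sample values are made up.

```python
# Assumed join key: the identifier fields plus the pickup timestamp.
KEY = ("medallion", "hack_license", "vendor_id", "pickup_datetime")


def merge_trips_fares(trips, fares):
    """Inner-join trip and fare records that share the same key."""
    fare_index = {tuple(f[k] for k in KEY): f for f in fares}
    merged = []
    for trip in trips:
        fare = fare_index.get(tuple(trip[k] for k in KEY))
        if fare is not None:
            merged.append({**trip, **fare})
    return merged


# Made-up records for illustration only.
trips = [
    {"medallion": "M1", "hack_license": "H1", "vendor_id": "V1",
     "pickup_datetime": "2013-01-01 00:00:00", "trip_distance": 2.5},
    {"medallion": "M2", "hack_license": "H2", "vendor_id": "V1",
     "pickup_datetime": "2013-01-01 00:05:00", "trip_distance": 1.0},
]
fares = [
    {"medallion": "M1", "hack_license": "H1", "vendor_id": "V1",
     "pickup_datetime": "2013-01-01 00:00:00", "fare_amount": 10.0,
     "tip_amount": 2.0},
]
rides = merge_trips_fares(trips, fares)
```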

7 of 24

ADDITIONAL SOURCE

  • US Census Data
    • Cost of living index
    • Median household income
    • Population Density

  • Available per Zip code

8 of 24

2. TOOLS USED

Libraries, Infrastructure

9 of 24

LIBRARIES & INFRASTRUCTURE

  • Language & Libraries
    • Python, Bash scripting
    • scikit-learn, matplotlib

  • Infrastructure
    • IBM SoftLayer Cloud
    • 16 cores, 16 GB RAM, 100 GB Hard Disk

10 of 24

3. DATA EXPLORATION & FEATURES

Data Cleaning, Data Visualization, Feature Engineering

11 of 24

DATA EXPLORATION

  • Filtering NA and missing values

  • Identifying outliers and incorrect values
    • Plots to detect outliers
    • Removed incorrect values, e.g. a passenger count of 250

  • Visualizations to identify correlations with tip amount
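
The cleaning steps above might look roughly like the filter below. It is a sketch under assumptions: the list of required fields, the passenger cap of 6, and the rules on distance, fare and tip are illustrative choices, not thresholds reported in this deck.

```python
def clean(records, max_passengers=6):
    """Drop records with missing, non-numeric, or implausible values."""
    required = ("trip_distance", "fare_amount", "tip_amount", "passenger_count")
    kept = []
    for rec in records:
        values = [rec.get(name) for name in required]

        # Missing values.
        if any(v is None or v == "" for v in values):
            continue

        # Non-numeric values such as "NA".
        try:
            distance, fare, tip, passengers = (float(v) for v in values)
        except ValueError:
            continue

        # Implausible values (assumed thresholds).
        if not (1 <= passengers <= max_passengers):
            continue
        if distance <= 0 or fare <= 0 or tip < 0:
            continue

        kept.append(rec)
    return kept


# Made-up rows: one valid, one with 250 passengers, one missing, one "NA".
rows = [
    {"trip_distance": "2.1", "fare_amount": "9.5", "tip_amount": "1.9", "passenger_count": "1"},
    {"trip_distance": "3.0", "fare_amount": "12.0", "tip_amount": "2.0", "passenger_count": "250"},
    {"trip_distance": "1.2", "fare_amount": "6.0", "tip_amount": "", "passenger_count": "2"},
    {"trip_distance": "NA", "fare_amount": "6.0", "tip_amount": "1.0", "passenger_count": "2"},
]
cleaned = clean(rows)
```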

12 of 24

VISUALIZATIONS

Tip Amount Density Function

Tip Percentage Density Function
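
As a rough illustration of how the two density plots could be produced, the sketch below computes tip percentage and a normalized histogram with NumPy. The definition of tip percentage as 100 * tip / fare, the bin settings, and the sample values are assumptions; the deck does not state them.

```python
import numpy as np

# Made-up tip and fare amounts for illustration only.
tip_amount = np.array([0.0, 1.0, 2.0, 2.5, 3.0, 5.0])
fare_amount = np.array([8.0, 6.5, 10.0, 12.5, 12.0, 20.0])

# Assumed definition: tip as a percentage of the fare.
tip_percent = 100.0 * tip_amount / fare_amount

# density=True scales the bin heights so the histogram integrates to 1.
density, edges = np.histogram(tip_percent, bins=5, range=(0.0, 50.0), density=True)
area = float(np.sum(density * np.diff(edges)))
```

The density and edges arrays can then be passed to a plotting call such as matplotlib's bar or step functions.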

13 of 24

CORRELATIONS

Tip Amount vs Payment Type

Tip Amount vs Tolls Amount

Tip Amount vs Fare Amount
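
The relationships on this slide can also be checked numerically. The sketch below shows one possible way: a group mean for the categorical field and a Pearson coefficient for the numeric ones. The helper names, the payment codes "CRD" and "CSH", and the sample values are illustrative assumptions.

```python
from collections import defaultdict
from math import sqrt


def mean_tip_by(records, field):
    """Average tip_amount for each distinct value of `field`."""
    totals = defaultdict(lambda: [0.0, 0])
    for rec in records:
        acc = totals[rec[field]]
        acc[0] += rec["tip_amount"]
        acc[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Made-up rides for illustration only.
toy_rides = [
    {"payment_type": "CRD", "fare_amount": 10.0, "tip_amount": 2.0},
    {"payment_type": "CRD", "fare_amount": 20.0, "tip_amount": 4.0},
    {"payment_type": "CSH", "fare_amount": 10.0, "tip_amount": 0.0},
    {"payment_type": "CSH", "fare_amount": 30.0, "tip_amount": 0.0},
]
tip_by_payment = mean_tip_by(toy_rides, "payment_type")
fare_tip_corr = pearson(
    [r["fare_amount"] for r in toy_rides],
    [r["tip_amount"] for r in toy_rides],
)
```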

14 of 24

FEATURE ENGINEERING

  • Date-Time Based
    • Weekend/Weekday
    • Time of the day

  • Location Based
    • Latitude-Longitude to Zip code
    • Zip code to boroughs
    • Boroughs to demographic data
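
The date-time features above could be derived along these lines. This is a sketch only: the timestamp format and the hour boundaries for each part of the day are assumptions, and the location chain (coordinates to Zip code to borough to demographics) is left out because it depends on external lookup data.

```python
from datetime import datetime


def datetime_features(pickup_datetime):
    """Derive weekend and time-of-day features from a pickup timestamp."""
    ts = datetime.strptime(pickup_datetime, "%Y-%m-%d %H:%M:%S")

    # Monday is 0, so Saturday and Sunday are 5 and 6.
    is_weekend = ts.weekday() >= 5

    # Assumed hour boundaries; the deck does not say how the day was split.
    if 6 <= ts.hour < 12:
        time_of_day = "morning"
    elif 12 <= ts.hour < 17:
        time_of_day = "afternoon"
    elif 17 <= ts.hour < 22:
        time_of_day = "evening"
    else:
        time_of_day = "night"

    return {"is_weekend": is_weekend, "time_of_day": time_of_day}


saturday_night = datetime_features("2013-01-05 23:15:00")
monday_morning = datetime_features("2013-01-07 08:00:00")
```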

15 of 24

4. MODELS & METRICS

Models, Algorithms, Analysis

16 of 24

MODELS

Tip Class w/ Zero Tip Data

Tip - No Tip

Tip Class w/o Zero Tip Data

Tip Percent Class w/ Zero Tip Data

Tip Percent Class w/o Zero Tip Data

Tip Amount
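
Each model above corresponds to a different target built from the same ride. The sketch below shows one way to derive them; the bin edges for the tip and tip-percent classes are hypothetical placeholders, since the actual class boundaries are not given here.

```python
from bisect import bisect_right

# Hypothetical bin edges; the deck does not state the real class boundaries.
AMOUNT_EDGES = (1.0, 2.0, 5.0)       # dollars
PERCENT_EDGES = (10.0, 20.0, 30.0)   # percent of fare


def tip_targets(tip_amount, fare_amount):
    """Build the target value for each model from a single ride."""
    tip_percent = 100.0 * tip_amount / fare_amount
    return {
        "tipped": int(tip_amount > 0),                                   # Tip - No Tip
        "tip_class": bisect_right(AMOUNT_EDGES, tip_amount),             # Tip Class
        "tip_percent_class": bisect_right(PERCENT_EDGES, tip_percent),   # Tip Percent Class
        "tip_amount": tip_amount,                                        # regression target
    }


tipped_ride = tip_targets(3.0, 12.0)
untipped_ride = tip_targets(0.0, 8.0)
```

The "w/o Zero Tip Data" variants would use the same targets after dropping rides where tipped is 0.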

17 of 24

ALGORITHMS

  • Classification Models
    • SVM
    • Decision Tree
    • Random Forest
    • AdaBoost

  • Regression Models
    • Linear Regression
    • SVM Regression
    • Lasso Regression
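
For reference, the algorithms listed above map onto scikit-learn estimators as sketched below. The data is synthetic and the hyperparameters are library defaults or arbitrary choices, so this shows only the fitting pattern, not the project's configuration or results.

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import accuracy_score, mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC, SVR
from sklearn.tree import DecisionTreeClassifier

classifiers = {
    "SVM": SVC(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
}
regressors = {
    "Linear": LinearRegression(),
    "SVM": SVR(),
    "Lasso": Lasso(alpha=0.1),
}

# Synthetic stand-in for a classification target such as Tip - No Tip.
Xc, yc = make_classification(n_samples=400, n_features=8, random_state=0)
Xc_tr, Xc_te, yc_tr, yc_te = train_test_split(Xc, yc, test_size=0.25, random_state=0)
accuracy = {
    name: accuracy_score(yc_te, model.fit(Xc_tr, yc_tr).predict(Xc_te))
    for name, model in classifiers.items()
}

# Synthetic stand-in for the tip amount regression target.
Xr, yr = make_regression(n_samples=400, n_features=8, noise=10.0, random_state=0)
Xr_tr, Xr_te, yr_tr, yr_te = train_test_split(Xr, yr, test_size=0.25, random_state=0)
mae = {
    name: mean_absolute_error(yr_te, model.fit(Xr_tr, yr_tr).predict(Xr_te))
    for name, model in regressors.items()
}
```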

18 of 24

METRICS - CLASSIFICATION & REGRESSION

Model                     Baseline   SVM      Decision Tree   Random Forest   AdaBoost
Tip - No Tip              52.33      98.516   98.249          98.259          98.299
Tip Class w/ Zero Tip     47.66      81.253   81.565          81.593          81.208
Tip Class w/o Zero Tip    45.07      67.98    68.002          68.002          66.72
Tip % w/ Zero Tip         47.66      68.10    68.08           68.097          68.013
Tip % w/o Zero Tip        41.47      41.58    42.12           42.14           41.97

Model   Baseline (Mean Abs Error)   Linear   SVM    Lasso
Tip     1.38                        0.75     0.79   1.254
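
The baseline columns are not defined on the slide. One common choice, assumed here, is to always predict the most frequent class for classification and the mean tip for regression. The sketch below shows that computation with scikit-learn's dummy estimators on made-up labels.

```python
import numpy as np
from sklearn.dummy import DummyClassifier, DummyRegressor
from sklearn.metrics import accuracy_score, mean_absolute_error

# Made-up targets; features are ignored by the dummy estimators.
X = np.zeros((5, 1))
y_class = np.array([1, 1, 1, 0, 0])
y_tip = np.array([0.0, 1.0, 2.0, 3.0, 4.0])

# Classification baseline: always predict the majority class.
clf = DummyClassifier(strategy="most_frequent").fit(X, y_class)
baseline_accuracy = accuracy_score(y_class, clf.predict(X))

# Regression baseline: always predict the mean tip.
reg = DummyRegressor(strategy="mean").fit(X, y_tip)
baseline_mae = mean_absolute_error(y_tip, reg.predict(X))
```

Under this assumption the classification baseline equals the share of the largest class, which gives a floor that any trained model should beat.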

19 of 24

CONFUSION MATRIX

Tip - No Tip

Tip Class w/o Zero Tip

Tip Class w/ Zero Tip

Tip % w/ Zero Tip
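
A confusion matrix like the ones on this slide can be computed with scikit-learn as sketched below. The labels are made up; rows are true classes and columns are predicted classes.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for a Tip - No Tip style problem (0 = no tip, 1 = tip).
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

# Accuracy is the share of predictions on the diagonal.
accuracy_from_cm = cm.trace() / cm.sum()
```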

20 of 24

5. INFERENCES & ROADMAP

Insights Gained, Future Work

21 of 24

INFERENCES

  • Card payments have higher tips

  • Tip varies directly with fare

  • Routes with tolls yield lower tips

  • Tip varies directly with the cost of living index

22 of 24

FUTURE WORK

  • Larger Data

  • More feature engineering

  • Apply the model to another city

23 of 24

QUESTIONS?

Thanks!
