1 of 20

Smart Rider

Ride smarter, for less!

Feng Xue

Insight Data Engineer Fellow @LA 20C

2 of 20

3 of 20

4 of 20

5 of 20

New York City Taxi trip data

  • Range: 2009 - present
  • Size: ~240GB
  • Location: S3
  • One CSV per month

6 of 20

Pipeline

7 of 20

Highlight: Data transformation

  • latitude/longitude vs Taxi zone id
    • <2016: latitude/longitude
    • >=2016: taxi zone id
  • Approaches
    • #1: Convert in Spark job, query DB for each geo pair
      • >2 days (CSV with 15M rows)
    • #2: ETL first, then use stored procedure in PostGIS
      • 18 min (ETL) + 1.1 ms per record (geo2zone) ~= 5 hr
      • >10 times performance boost
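The ~5 hr estimate for approach #2 can be checked with a short calculation. This is only a sketch of the arithmetic from the figures on the slide. The constant and function names are illustrative, and ">2 days" is taken as a 48-hour lower bound, which is an assumption.

```python
# Figures quoted on the slide.
ROWS = 15_000_000        # rows in the example CSV
ETL_MINUTES = 18         # ETL step
MS_PER_RECORD = 1.1      # geo2zone lookup in PostGIS, per record
APPROACH_1_HOURS = 48    # assumption: ">2 days" treated as a 48 h lower bound


def approach_2_hours(rows=ROWS, etl_minutes=ETL_MINUTES, ms_per_record=MS_PER_RECORD):
    """Estimated wall time in hours: ETL plus one geo2zone lookup per row."""
    lookup_hours = rows * ms_per_record / 1000 / 3600
    return etl_minutes / 60 + lookup_hours


hours = approach_2_hours()
speedup = APPROACH_1_HOURS / hours
print(f"approach #2: {hours:.2f} hr, speedup vs 48 hr: {speedup:.1f}x")
```

With the 48-hour lower bound the ratio comes out just under 10. It passes 10 once approach #1 takes longer than about 49 hours, which is compatible with the ">2 days" figure.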

8 of 20

Feng Xue

Cognitive Neuroscience

9 of 20

Backup Slides


10 of 20

Highlights: Data completeness check

  • Purpose
    • Make sure all data is imported correctly
  • Criteria
    • Label as complete if a CSV has >1000 imported records
  • Method #1
    • Exec time: ~47hrs

[Diagram: 1.4B rows; branches labelled <1000 and >=1000]
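The labelling criterion can be sketched as a small function. This is a minimal illustration, not the pipeline's actual check. The file names and counts are hypothetical, and the strict ">1000" comparison follows the bullet text (the diagram shows ">=1000", so the boundary case is an open question).

```python
COMPLETE_THRESHOLD = 1000  # from the slide: complete if >1000 imported records


def label_csvs(imported_counts, threshold=COMPLETE_THRESHOLD):
    """Split CSV names into (complete, incomplete) by imported record count."""
    complete, incomplete = [], []
    for name, count in sorted(imported_counts.items()):
        # Strict comparison, following the ">1000" bullet.
        (complete if count > threshold else incomplete).append(name)
    return complete, incomplete


# Hypothetical per-CSV imported record counts.
counts = {"2015-01.csv": 12_700_000, "2015-02.csv": 640, "2015-03.csv": 1000}
complete, incomplete = label_csvs(counts)
print("complete:", complete)
print("incomplete:", incomplete)
```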

11 of 20

Highlights: Data completeness check

  • Method #2
    • 2hr for all data (~1.4B rows)
    • Labelled 200 of 235 CSVs as complete
    • Run method #1 on the 35 unlabelled CSVs
    • Total run time: 2 + 7 = 9 hrs
    • >5 times faster

[Diagram: 1.4B rows and repeated 10K-row blocks; run method #1 for unlabelled CSVs]

12 of 20

Future Direction

  • Data enrichment
    • More providers
    • More cities
    • Month -> weeks -> days ...
  • Database optimization
    • redis cache
  • Database server file system optimization
    • ext4 vs xfs


13 of 20

Data cleaning

  • Driving speed: <100 mph
  • Duration: >0.1 hr
  • Fare: >=$2.5 (NYC minimum taxi fare)
  • Rate: <$50/mile
  • Trip distance: >0.5 miles
  • Etc.
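The cleaning rules above can be expressed as a single predicate. This is a sketch under assumptions: the `Trip` fields are hypothetical names, speed is derived as distance over duration, and "Rate" is read as fare per mile. The slide does not say how the filter was implemented in the Spark job.

```python
from dataclasses import dataclass


@dataclass
class Trip:
    # Hypothetical field names; the real schema is not shown on the slide.
    distance_miles: float
    duration_hours: float
    fare_usd: float


def is_valid_trip(trip: Trip) -> bool:
    """Apply the thresholds listed on the slide to one trip record."""
    if trip.duration_hours <= 0.1:   # Duration: >0.1 hr
        return False
    if trip.distance_miles <= 0.5:   # Trip distance: >0.5 miles
        return False
    if trip.fare_usd < 2.5:          # Fare: >=$2.5
        return False
    speed_mph = trip.distance_miles / trip.duration_hours
    rate_per_mile = trip.fare_usd / trip.distance_miles  # assumption: fare per mile
    return speed_mph < 100 and rate_per_mile < 50


print(is_valid_trip(Trip(distance_miles=2.0, duration_hours=0.25, fare_usd=12.0)))
print(is_valid_trip(Trip(distance_miles=30.0, duration_hours=0.2, fare_usd=60.0)))
```

Checking duration and distance first keeps both divisions safe, since any record that reaches them has strictly positive values.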

14 of 20

Why Spark

  • Process type:
    • Batch
  • Tool selection:

Source: edureca.co

15 of 20

Why PostGIS

  • Spatial data

  • Fast and free

16 of 20

Why Airflow

  • NYC taxi trip data updates monthly
  • Enhanced task scheduling vs. cron jobs
    • Can handle complex task dependency
    • Centralized job log, error reporting, alerting, etc.

17 of 20

Data Modelling (Raw data)

18 of 20

Data Modelling

19 of 20

Cluster setup & performance monitoring

  • Spark Master
  • Spark Worker (x3)
  • PostGIS

20 of 20