1 of 20

Smart Rider

Ride smarter, for less!

Feng Xue

Insight Data Engineer Fellow @LA 20C

2 of 20

3 of 20

4 of 20

5 of 20

New York City Taxi trip data

Range: 2009 - present
Size: ~240GB
Location: S3
One csv for each month

6 of 20

Pipeline

7 of 20

Highlight: Data transformation

latitude/longitude vs Taxi zone id

<2016: latitude/longitude
>=2016: taxi zone id

Approaches

#1: Convert in Spark job, query DB for each geo pair

>2 days (CSV with 15M rows)

#2: ETL first, then use stored procedure in PostGIS

18 min (ETL) + 1.1 ms per record (geo2zone) ~= 5 hr
>10 times performance boost

One highlight I would like to mention is data transformation. My older data records trip geolocation as latitude and longitude, but my newer data has only a taxi zone id, so I needed a way to map geolocation to taxi zone. PostGIS is the right tool for this. I tried two approaches. Approach #1 is straightforward: I did the conversion inside the Spark ETL job, querying the database for every geo pair. It looked easy, but the performance was so poor that for a CSV with 15M rows the job seemed to run forever, and I killed it after two days. Approach #2 uses Spark to ETL the data without conversion and then runs a stored procedure in PostGIS to do the conversion. Performance was much better: it took only about 5 hours. Compared to #1, approach #2 is more than 10 times faster.
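To make the geolocation-to-zone mapping concrete, here is a minimal pure-Python sketch of the kind of containment test a spatial database performs when it assigns a point to a polygon. It is not the stored procedure from the talk: the rectangular zones, their ids, and the function names are hypothetical, and real taxi zones are irregular polygons stored in PostGIS.

```python
from typing import Dict, Optional, Sequence, Tuple

Point = Tuple[float, float]  # (longitude, latitude)


def point_in_polygon(pt: Point, polygon: Sequence[Point]) -> bool:
    """Ray casting: toggle 'inside' each time a rightward ray crosses an edge."""
    x, y = pt
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def geo_to_zone(pt: Point, zones: Dict[int, Sequence[Point]]) -> Optional[int]:
    """Return the id of the first zone containing pt, or None if no zone matches."""
    for zone_id, polygon in zones.items():
        if point_in_polygon(pt, polygon):
            return zone_id
    return None


# Two made-up rectangular zones, for illustration only.
ZONES = {
    1: [(-74.02, 40.70), (-73.98, 40.70), (-73.98, 40.74), (-74.02, 40.74)],
    2: [(-73.98, 40.70), (-73.94, 40.70), (-73.94, 40.74), (-73.98, 40.74)],
}
```

Calling `geo_to_zone((-74.00, 40.72), ZONES)` picks the first rectangle. Running a check like this once per row from an external job is what made approach #1 slow; approach #2 keeps the lookup inside the database.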

8 of 20

Feng Xue

Cognitive Neuroscience

9 of 20

Backup Slides

10 of 20

Highlights: Data completeness check

Purpose

Make sure all data is imported correctly

Criteria

Label as complete if a CSV has >1000 imported records
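The criterion can be sketched in a few lines of Python. This assumes each imported record carries the name of the CSV it came from; that column, and the function and file names below, are hypothetical and not taken from the talk.

```python
from collections import Counter
from typing import Dict, Iterable

MIN_RECORDS = 1000  # threshold from the criterion: complete if >1000 imported records


def label_completeness(
    source_files: Iterable[str],
    expected_files: Iterable[str],
    min_records: int = MIN_RECORDS,
) -> Dict[str, bool]:
    """Count imported records per source CSV and label each expected CSV."""
    counts = Counter(source_files)
    # A CSV with no imported records has a count of 0 and is labelled incomplete.
    return {name: counts[name] > min_records for name in expected_files}


# Tiny synthetic example: one record per list entry.
rows = ["a.csv"] * 1001 + ["b.csv"] * 1000
labels = label_completeness(rows, ["a.csv", "b.csv", "c.csv"])
```

Here only `a.csv` is labelled complete, since `b.csv` sits exactly at the threshold and `c.csv` has no imported records.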

Method #1

Exec time: ~47hrs

<1000

1.4B rows

>=1000

11 of 20

Highlights: Data completeness check

Method #2

2hr for all data (~1.4B rows)
Labelled 200 of 235 CSVs as complete
Run method #1 on the 35 unlabelled CSVs
Total run time: 2+7=9 hrs
>5 times faster
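The run-time figures can be checked with a short calculation. It assumes that the cost of method #1 scales linearly with the number of CSVs it processes, which is an assumption made here to reproduce the 7-hour figure, not something stated on the slide.

```python
METHOD1_HOURS_ALL = 47      # method #1 over all 235 CSVs
TOTAL_CSVS = 235
LABELLED_BY_METHOD2 = 200   # CSVs labelled by method #2
METHOD2_HOURS = 2           # method #2 over all data

remaining = TOTAL_CSVS - LABELLED_BY_METHOD2                  # 35 unlabelled CSVs
fallback_hours = METHOD1_HOURS_ALL * remaining / TOTAL_CSVS   # method #1 on the rest
total_hours = METHOD2_HOURS + fallback_hours
speedup = METHOD1_HOURS_ALL / total_hours

print(f"fallback: {fallback_hours:.1f} h, total: {total_hours:.1f} h, speedup: {speedup:.1f}x")
```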

[Diagram: 1.4B rows -> 10K rows, 10K rows, ... -> run method #1 for unlabelled CSVs]

12 of 20

Future Direction

Data enrichment

More providers
More cities
Month -> weeks -> days ...

Database optimization

redis cache

Database server file system optimization

ext4 vs xfs

...

13 of 20

Data cleaning

Driving speed: <100 mph
Duration: >0.1 hr
Fare: >=$2.50 (NYC minimum taxi fare)
Rate: <$50/mile
Trip distance: >0.5 mile
Etc.
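As a rough sketch, the rules above can be expressed as a single validity check. The `Trip` fields and function name are hypothetical, and the rules hidden behind "Etc." are not covered.

```python
from dataclasses import dataclass


@dataclass
class Trip:
    distance_miles: float
    duration_hours: float
    fare_usd: float


def is_valid_trip(trip: Trip) -> bool:
    """Apply the listed thresholds; derived values are computed only when safe."""
    # Distance and duration checks come first so the divisions below are safe.
    if trip.distance_miles <= 0.5 or trip.duration_hours <= 0.1:
        return False
    if trip.fare_usd < 2.5:  # below the NYC minimum taxi fare
        return False
    speed_mph = trip.distance_miles / trip.duration_hours
    rate_usd_per_mile = trip.fare_usd / trip.distance_miles
    return speed_mph < 100 and rate_usd_per_mile < 50


# A 3-mile, 15-minute, $12.50 trip passes every rule.
ok = is_valid_trip(Trip(distance_miles=3.0, duration_hours=0.25, fare_usd=12.5))
```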

14 of 20

Why Spark

Process type:

Batch

Tool selection:

Source: edureka.co

15 of 20

Why PostGIS

Spatial data

Fast and free

16 of 20

Why Airflow

NYC taxi trip data updates monthly
Enhanced task scheduling vs. cron jobs

Can handle complex task dependency
Centralized job log, error reporting, alerting etc.
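To illustrate what handling task dependencies means, here is a standard-library sketch of dependency ordering. It is not Airflow code, and the task names are hypothetical stand-ins for the pipeline stages; a scheduler like Airflow adds scheduling, retries, logging, and alerting on top of this kind of ordering.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical monthly pipeline: each task maps to the set of tasks it depends on.
DEPENDENCIES: Dict[str, Set[str]] = {
    "download_csv": set(),
    "spark_etl": {"download_csv"},
    "geo_to_zone": {"spark_etl"},
    "completeness_check": {"spark_etl"},
    "publish": {"geo_to_zone", "completeness_check"},
}


def execution_order(dependencies: Dict[str, Set[str]]) -> List[str]:
    """Return one valid order in which every task runs after its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


ORDER = execution_order(DEPENDENCIES)
```

A plain cron setup would need hand-tuned start times to approximate this ordering, which is hard to keep correct as the number of tasks grows.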

17 of 20

Data Modelling (Raw data)

18 of 20

Data Modelling

19 of 20

Cluster setup & performance monitoring

Spark Master

Spark Worker

PostGIS