Smart Rider
Ride smarter, for less!
Feng Xue
Insight Data Engineer Fellow @LA 20C
New York City Taxi trip data
Pipeline
Highlight: Data transformation
Feng Xue
Cognitive Neuroscience
Backup Slides
Highlights: Data completeness check
<1000
1.4B rows
>=1000
Highlights: Data completeness check
10K rows
1.4B rows
10K rows
…
Run method #1 for unlabelled csv
Future Direction
...
Data cleaning
Why Spark
Source: edureka.co
Why PostGIS
Why Airflow
Data Modelling (Raw data)
Data Modelling
Cluster setup & performance monitoring
Spark Master
Spark Worker
Spark Worker
Spark Worker
PostGIS