Taming the MTA’s Unruly Turnstile Data
Chris Whong
Data Pub Virtual Open Data Meetup
May 19th 2020
About me
Things to Know about the Raw Data
Query shows 5, but one has no data since January 2019
The entries and exits are like an odometer.
A single observation is useless on its own; you must know the reading and timestamp of the previous observation.
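For example, here is a minimal SQL sketch (hypothetical table and column names, anticipating the PostgreSQL pipeline below) of pairing each reading with the previous one for the same turnstile:

```sql
-- Hypothetical raw table: one cumulative counter reading per turnstile per timestamp.
SELECT
  unit,                                   -- remote unit id
  scp,                                    -- turnstile id within the unit
  observed_at,                            -- reading timestamp
  entries,                                -- cumulative counter ("odometer")
  LAG(entries)     OVER w AS prev_entries,
  LAG(observed_at) OVER w AS prev_observed_at,
  entries - LAG(entries) OVER w AS raw_net_entries
FROM turnstile_raw
WINDOW w AS (PARTITION BY unit, scp ORDER BY observed_at);
```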
Building a Pipeline with PostgreSQL
Big shoutout to Todd Schneider for code and inspiration
First, we make a giant table with all of the weekly data and a few transformations/calculations
(~14M records for 2019 and 2020)
Net entries/net exits will be set to null if they are calculated to be over 10,000 for any observation window (over a typical ~4-hour window, that's about 41.6 entries per minute)
This throws out obviously wrong large numbers, but there could still be other inaccuracies below the threshold
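A sketch of how that cutoff might look in SQL, building on the lagged differences above (table and column names are assumptions, not the actual pipeline code):

```sql
-- Null out implausibly large deltas so they don't skew the daily sums.
-- turnstile_deltas is an assumed name for the lagged differences sketched
-- earlier (with raw_net_exits computed the same way from exits).
CREATE TABLE turnstile_clean AS
SELECT
  unit,
  scp,
  observed_at,
  CASE WHEN raw_net_entries <= 10000 THEN raw_net_entries END AS net_entries,
  CASE WHEN raw_net_exits   <= 10000 THEN raw_net_exits   END AS net_exits
FROM turnstile_deltas;
```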
Step 2: Aggregate by date and sum entries and exits.
This means we lose some granularity, but we get a useful daily number for each individual turnstile
(~14M records aggregate down to ~2M records)
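Roughly, the daily roll-up is a GROUP BY over the cleaned per-window rows. This sketch assumes an observation_date column has already been added; which date that should be is exactly the question below:

```sql
-- Step 2 sketch: collapse the ~4-hour windows into one row per turnstile per day.
CREATE TABLE turnstile_daily AS
SELECT
  unit,
  scp,
  observation_date,
  SUM(net_entries) AS entries,
  SUM(net_exits)   AS exits
FROM turnstile_clean
GROUP BY unit, scp, observation_date;
```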
Which day should this observation period count towards?
(Diagram: observation windows plotted on timelines running from 9pm through 3am, i.e., spanning midnight)
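The diagram poses the question; one possible convention (an assumption here, not necessarily what the pipeline does) is to attribute each window to the date of the reading that closes it:

```sql
-- Hypothetical rule: a window that runs 10pm-2am counts toward the later day,
-- because observed_at is the timestamp of the reading that ends the window.
ALTER TABLE turnstile_clean ADD COLUMN observation_date date;
UPDATE turnstile_clean SET observation_date = observed_at::date;
```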
Step 3: Aggregate by station/complex
Here’s where it gets tricky: some stations are really several stations linked together!
We needed a lookup table that would identify station identifiers that should be counted together as a “station complex”
I added a column with the station/complex id to the remote-booth-station table provided with the data.
This allows us to relate through to the official station list.
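A sketch of that roll-up, assuming the daily table from Step 2 and a remote_booth_station lookup carrying the added complex id (the join key and all names are guesses):

```sql
-- Step 3 sketch: roll daily turnstile counts up to the station complex,
-- using the remote-booth-station lookup with its added complex_id column.
SELECT
  l.complex_id,
  d.observation_date,
  SUM(d.entries) AS entries,
  SUM(d.exits)   AS exits
FROM turnstile_daily d
JOIN (SELECT DISTINCT remote, complex_id
      FROM remote_booth_station) l          -- one complex per remote unit
  ON l.remote = d.unit                      -- hypothetical join key
GROUP BY l.complex_id, d.observation_date;
```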
Behold, clean daily entries and exits by station/complex!
Versioning and Publishing with Qri
https://twitter.com/cgseife/status/1262733619300380673
https://twitter.com/cgseife/status/1259927221063614464
@chris_whong
Thanks!
Any Questions?