1 of 34

Taming the MTA’s Unruly Turnstile Data

Chris Whong

Data Pub Virtual Open Data Meetup

May 19th 2020

2 of 34

About me

Urbanist, Mapmaker, Data Junkie

Outreach Engineer at Qri (https://qri.io)

@chris_whong

3 of 34

4 of 34

5 of 34

Things to Know about the Raw Data

  • The readings occur every 4 hours
  • The 4-hour observation windows are not the same for all turnstiles
  • Some of the turnstiles are counting down, not up. ¯\_(ツ)_/¯
  • Each turnstile that’s keeping track of counts is identified by its “subunit channel position” or scp, unique within a control area.
  • Each turnstile is in a “control area”, which is the bank of turnstiles associated with a subway exit or exits, unique within a remote unit.
  • Each control area is in a “remote unit”. Remote unit generally represents a single station, but some stations and station complexes have several associated remote unit ids.

6 of 34

7 of 34

8 of 34

9 of 34

Query shows 5, but one has no data since January 2019

10 of 34

The entries and exits are like an odometer.

A single observation is useless on its own, you must know the reading and timestamp of the previous observation.

11 of 34

12 of 34

Building a Pipeline with PostgreSQL

13 of 34

Big shoutout to Todd Schneider for code and inspiration

14 of 34

First, we make a giant table with all of the weekly data and a few transformations/calculations

(~14M records for 2019 and 2020)

15 of 34

Net entries/ Net exits will be set to null if they are calculated to be over 10000 for any observation window (41.6 entries per minute)

This throws out obviously wrong large numbers, but there could still be other inaccuracies below the threshold

16 of 34

Step 2: Aggregate by date and sum entries and exits.

This means we lose some granularity, but get a nice useful number for the day for each individual turnstile

(~14M records aggregates to ~2M records)

17 of 34

Which day should this observation period count towards?

9pm

12am

11pm

10pm

1am

2am

3am

18 of 34

9pm

12am

11pm

10pm

1am

2am

3am

19 of 34

Step 3: Aggregate by station/complex

Here’s where it gets tricky, some stations are many stations linked together!

20 of 34

We needed a lookup table that would identify station identifiers that should be counted together as a “station complex”

21 of 34

I added a column with the station/complex id to the remote-booth-station table provided with the data.

This allows us to relate through to the official station list.

22 of 34

Behold, clean daily entries and exits by station/complex!

23 of 34

24 of 34

Versioning and Publishing with Qri

25 of 34

26 of 34

27 of 34

28 of 34

29 of 34

30 of 34

31 of 34

32 of 34

https://twitter.com/cgseife/status/1262733619300380673

https://twitter.com/cgseife/status/1259927221063614464

33 of 34

https://twitter.com/cgseife/status/1259927221063614464

34 of 34

@chris_whong

Thanks!

Any Questions?