Abstract
MTA (Subway operator of NYC) provides passenger flow data based on the record of each turnstile. Wait, what is a turnstile? It’s something like this:
Fig 1. Turnstiles, source: http://www.turnstile.com/ |
Imagine that each station has tens of turnstiles and each line has tens of stations, isn’t it hard to harness these data for analysis? Don’t be panic, in this notebook, we will go through the following items:
- Read MTA turnstile data
- Find turnstile data by line name (or it can be easily tailored to station names)
- Calculate entries and exits for each station
- Visualize daily entry/exit dynamics for your chosen stations
Read MTA Turnstile Data
C/A | STATION | LINENAME | DATE | TIME | ENTRIES | EXITS | |
---|---|---|---|---|---|---|---|
0 | A002 | 59 ST | NQR456W | 09/28/2019 | 00:00:00 | 7215740 | 2444319 |
1 | A002 | 59 ST | NQR456W | 09/28/2019 | 04:00:00 | 7215766 | 2444322 |
2 | A002 | 59 ST | NQR456W | 09/28/2019 | 08:00:00 | 7215788 | 2444357 |
3 | A002 | 59 ST | NQR456W | 09/28/2019 | 12:00:00 | 7215862 | 2444436 |
4 | A002 | 59 ST | NQR456W | 09/28/2019 | 16:00:00 | 7216108 | 2444474 |
Each row in the records represents one turnstile in a station. The “ENTRIES” and “EXITS” column are count of total entries and exits since this turnstile is installed. They are provided on a 4-hour per record basis. Given this way, for example, if we want to calculate how many entries happen within a day, we can use today’s record at midnight to minus the record at midnight of the previous day. A comprehensive explaination could be found on MTA’s website.
Find Turnstile Data by Line Name
Alright, let’s start with a simple application: Find out how many entries happen in each A line station at the weeks of 10/05/2019, 10/12/2019, and 10/19/2019.
C/A | STATION | LINENAME | DATE | TIME | ENTRIES | EXITS | |
---|---|---|---|---|---|---|---|
70148 | N137 | 104 ST | A | 10/15/2019 | 20:00:00 | 1681115231 | 978345287 |
69605 | N137 | 104 ST | A | 10/04/2019 | 08:00:00 | 301455 | 191435 |
69604 | N137 | 104 ST | A | 10/04/2019 | 04:00:00 | 301164 | 191418 |
69603 | N137 | 104 ST | A | 10/04/2019 | 00:00:00 | 301155 | 191403 |
69602 | N137 | 104 ST | A | 10/03/2019 | 20:00:00 | 301130 | 191322 |
Since data is recorded each 4 hours, we want to collapse them into a one-day level, i.e. merge same day records to one record.
Entries | Exits | |
---|---|---|
(N137, R354, 00-06-01, 104 ST, A, 10/15/2019) | 289 | 11 |
(N137, R354, 00-00-00, 104 ST, A, 10/04/2019) | 689 | 269 |
(N137, R354, 00-00-00, 104 ST, A, 10/03/2019) | 685 | 379 |
(N137, R354, 00-00-00, 104 ST, A, 10/02/2019) | 720 | 352 |
(N137, R354, 00-00-00, 104 ST, A, 10/01/2019) | 603 | 352 |
... | ... | ... |
(N094, R029, 01-03-04, WORLD TRADE CTR, 23ACE, 10/07/2019) | 529 | 214 |
(N094, R029, 01-03-04, WORLD TRADE CTR, 23ACE, 10/08/2019) | 572 | 196 |
(N094, R029, 01-03-03, WORLD TRADE CTR, 23ACE, 10/09/2019) | 430 | 372 |
(N094, R029, 01-03-03, WORLD TRADE CTR, 23ACE, 10/10/2019) | 446 | 493 |
(N094, R029, 01-03-03, WORLD TRADE CTR, 23ACE, 10/11/2019) | 493 | 465 |
We just collasped the index column for statistical purpose, now let’s expand it again.
CA | unit | scp | station | linename | date | entries | exits | |
---|---|---|---|---|---|---|---|---|
0 | N137 | R354 | 00-06-01 | 104 ST | A | 10/15/2019 | 289 | 11 |
1 | N137 | R354 | 00-00-00 | 104 ST | A | 10/04/2019 | 689 | 269 |
2 | N137 | R354 | 00-00-00 | 104 ST | A | 10/03/2019 | 685 | 379 |
3 | N137 | R354 | 00-00-00 | 104 ST | A | 10/02/2019 | 720 | 352 |
4 | N137 | R354 | 00-00-00 | 104 ST | A | 10/01/2019 | 603 | 352 |
Then, let’s arrange the dataframe in another way:
station | 104 ST | 111 ST | 125 ST | 14 ST | 145 ST | 168 ST | ... | W 4 ST-WASH SQ | WORLD TRADE CTR |
---|---|---|---|---|---|---|---|---|---|
date | |||||||||
09/28/2019 | 0 | 1 | 17522 | 11118 | 14909 | 9626 | ... | 28372 | 450 |
09/29/2019 | 0 | 0 | 14850 | 8301 | 12179 | 6601 | ... | 23094 | 741 |
09/30/2019 | 1569 | 2227 | 22058 | 17018 | 18440 | 16861 | ... | 31745 | 16174 |
10/01/2019 | 1613 | 2438 | 23991 | 18127 | 19394 | 18269 | ... | 35842 | 17173 |
10/02/2019 | 2029 | 2929 | 27774 | 21693 | 24183 | 20073 | ... | 37851 | 18381 |
It is well know that the source data could be problematic because of faulty turnstiles. So, before doing any analysis, we want to check if the data is valid. One method is to look at the average weekly entries for each station.
From the above table, we can see that 59 ST COLUMBUS station has more than 16 million entries each week, but Penn Station (34 ST-PENN STA, one of the transportation centers of NYC) has only roughly 1 million. There must be some faulty turnstiles at 59 ST COLUMBUS station. Let’s remove 59 ST COLUMBUS to avoid misleading results. If interested, one can find that it was the turnstile with SCP number ‘01-05-01’ encountered problem from 0:00-8:00 AM on 09/28/2019. We are supposed to remove thoes faulty records and calculate for 59 ST COLUMBUS station again, but for simplicity, we just remove this station.
Let’s plot the results.
Fig 1. Ranking of entries - Stations on A line |
Not surprising, those well known places, like Penn Station, Port Authority, Fulton Street (under World Trade Center), and Time Square rank on the top. However, we also notice BEACH 36 ST station, which is the terminal station and locates at far south Brooklyn, ranks in the 2nd place. This seems a little weired. If you are interested, you can further check this station like what we did for 59 ST COLUMBUS.
Visualize Daily Entry/Exit Dynamics for Chosen Stations
Now, we already have the weekly ranking, but what if we want to know intra-week dynamics of entries/exits? Let’s say, for Penn Station, Port Authority, and Time Square.
Fig 2. Weekly dynamics of entries/exits at several busy NYC stations |
What happens at Penn Station on 10/05? Another turnstile problem? A big event (what event can boost the entries/exits by more than 800k)? It is subjected to be checked.
Closing Notes
Unlike other datasets, MTA turnstile data is provided in .txt format, so we need to use extra code snippets to handle it. Also remember that turnstile data can be misleading due to frequent malfunction of turnstiles. Make sure to check if the data makes sense before going any further.
Acknowledgement: the part on calculating entries per day was forked from https://github.com/ZachHeick/MTA_Data_Analysis.