
Handling MTA Turnstile Data

Abstract

The MTA (the operator of the NYC subway) provides passenger flow data based on the records of each turnstile. Wait, what is a turnstile? It’s something like this:

Fig 1. Turnstiles, source: http://www.turnstile.com/

Imagine that each station has tens of turnstiles and each line has tens of stations; isn’t it hard to harness all these data for analysis? Don’t panic: in this notebook, we will go through the following items:

  • Read MTA turnstile data
  • Find turnstile data by line name (easily tailored to station names instead)
  • Calculate entries and exits for each station
  • Visualize daily entry/exit dynamics for your chosen stations

Read MTA Turnstile Data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as tick # used later to format y-axis tick labels
import seaborn as sb
from collections import defaultdict
pd.options.display.float_format = '{:.2f}'.format
%matplotlib inline
# 1 sample week.
# MTA names each weekly file by its Saturday date, e.g. 10/05/2019 -> turnstile_191005.txt
week = 191005

# url of data source
url = 'http://web.mta.info/developers/data/nyct/turnstile/turnstile_{}.txt'

df_sample = pd.read_csv(url.format(week)) 
df_sample.head()
   C/A   STATION  LINENAME  DATE        TIME      ENTRIES  EXITS
0  A002  59 ST    NQR456W   09/28/2019  00:00:00  7215740  2444319
1  A002  59 ST    NQR456W   09/28/2019  04:00:00  7215766  2444322
2  A002  59 ST    NQR456W   09/28/2019  08:00:00  7215788  2444357
3  A002  59 ST    NQR456W   09/28/2019  12:00:00  7215862  2444436
4  A002  59 ST    NQR456W   09/28/2019  16:00:00  7216108  2444474

Each row in the records represents one turnstile in a station. The “ENTRIES” and “EXITS” columns are cumulative counts of entries and exits since the turnstile was installed, reported roughly every 4 hours. Given this format, if we want to know how many entries happen within a day, we can subtract the previous day’s midnight reading from today’s midnight reading. A comprehensive explanation can be found on the MTA’s website.
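To make the midnight-to-midnight idea concrete, here is a minimal sketch (not part of the original pipeline) that computes daily entries for the first turnstile in the sample file by differencing its midnight readings:

# A minimal sketch: daily entries for one turnstile, computed as the
# difference between consecutive midnight readings of the cumulative counter.
ca, unit, scp = df_sample.loc[0, ['C/A', 'UNIT', 'SCP']]
one = df_sample[(df_sample['C/A'] == ca) &
                (df_sample['UNIT'] == unit) &
                (df_sample['SCP'] == scp)]
midnights = one[one['TIME'] == '00:00:00']
print(midnights['ENTRIES'].diff()) # today's midnight reading minus yesterday's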

Find Turnstile Data by Line Name

Alright, let’s start with a simple application: find out how many entries occur at each station on the A line during the weeks of 10/05/2019, 10/12/2019, and 10/19/2019.

weeks = [191005, 191012, 191019]
lines = ['A'] # we write code capable of handling multiple lines
df_list = []

for week in weeks:
    df_temp = pd.read_csv(url.format(week))
    ind = pd.Series(False, index=df_temp.index) # start from an all-False mask
    print('week: ', week) # print out weeks to track progress
    for line in lines:
        ind = (ind | df_temp['LINENAME'].str.contains(line))
        print('  line: ', line) # also print out lines to track progress
    df_list.append(df_temp.loc[ind])

mta_df = pd.concat(df_list)
mta_df.sort_values(by = ['STATION'], inplace = True) # sort so records of the same station are adjacent
mta_df.head()
       C/A   STATION  LINENAME  DATE        TIME      ENTRIES     EXITS
70148  N137  104 ST   A         10/15/2019  20:00:00  1681115231  978345287
69605  N137  104 ST   A         10/04/2019  08:00:00  301455      191435
69604  N137  104 ST   A         10/04/2019  04:00:00  301164      191418
69603  N137  104 ST   A         10/04/2019  00:00:00  301155      191403
69602  N137  104 ST   A         10/03/2019  20:00:00  301130      191322

Since the data is recorded every 4 hours, we want to collapse it to a one-day level, i.e. merge same-day records into a single record.

# This section calculates the number of entries and exits
# for each turnstile at a same-day level
temp_d = defaultdict(list) # for Entries
temp_d1 = defaultdict(list) # for Exits
turnstile_d = {}
turnstile_d1 = {}

# Collapse data to a same-day level
for row in mta_df.itertuples():
    C_A, unit, scp, station, linename, date = \
        row[1], row[2], row[3], row[4], ''.join(sorted(row[5])), row[7]
    entries = row[10]
    exits = row[11]
    # Create a unique key for each turnstile at each day
    k = (C_A, unit, scp, station, linename, date)
    # Record entries and exits of each record at the same day
    temp_d[k].append(entries)
    temp_d1[k].append(exits)

for key, value in temp_d.items():
    entry_count = abs(max(value) - min(value)) # (max - min) within a day
    turnstile_d[key] = [entry_count]
for key, value in temp_d1.items():
    exit_count = abs(max(value) - min(value)) # (max - min) within a day
    turnstile_d1[key] = [exit_count]

# form dataframes from dictionaries
dict_df = pd.DataFrame.from_dict(turnstile_d, orient='index')
dict_df.rename(columns = {0:'Entries'}, inplace=True)
dict_df1 = pd.DataFrame.from_dict(turnstile_d1, orient='index')
dict_df1.rename(columns = {0:'Exits'}, inplace=True)

dict_df['Exits'] = dict_df1['Exits']

# take a look at what we obtained
dict_df
                                                            Entries  Exits
(N137, R354, 00-06-01, 104 ST, A, 10/15/2019)               289      11
(N137, R354, 00-00-00, 104 ST, A, 10/04/2019)               689      269
(N137, R354, 00-00-00, 104 ST, A, 10/03/2019)               685      379
(N137, R354, 00-00-00, 104 ST, A, 10/02/2019)               720      352
(N137, R354, 00-00-00, 104 ST, A, 10/01/2019)               603      352
...                                                         ...      ...
(N094, R029, 01-03-04, WORLD TRADE CTR, 23ACE, 10/07/2019)  529      214
(N094, R029, 01-03-04, WORLD TRADE CTR, 23ACE, 10/08/2019)  572      196
(N094, R029, 01-03-03, WORLD TRADE CTR, 23ACE, 10/09/2019)  430      372
(N094, R029, 01-03-03, WORLD TRADE CTR, 23ACE, 10/10/2019)  446      493
(N094, R029, 01-03-03, WORLD TRADE CTR, 23ACE, 10/11/2019)  493      465
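As an aside, the same day-level collapse can be written more compactly with pandas groupby. Here is a sketch under the same assumptions; it reproduces the max-minus-min logic above, though not the character-sorting of LINENAME:

# Equivalent day-level collapse via groupby (a compact alternative sketch)
mta_df.columns = mta_df.columns.str.strip() # guard against stray whitespace in headers
span = lambda s: s.max() - s.min() # daily span of a cumulative counter
daily = (mta_df
         .groupby(['C/A', 'UNIT', 'SCP', 'STATION', 'LINENAME', 'DATE'])
         .agg(Entries=('ENTRIES', span), Exits=('EXITS', span)))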

We collapsed the turnstile identifiers into a tuple index for the calculation; now let’s expand them back into columns.

# convert back to the original flat format
turnstile_df = pd.DataFrame(dict_df.index.tolist(),
                            columns=['CA', 'unit', 'scp', 'station', 'linename', 'date'])
turnstile_df['entries'] = dict_df['Entries'].values
turnstile_df['exits'] = dict_df['Exits'].values
turnstile_df.head()
   CA    unit  scp       station  linename  date        entries  exits
0  N137  R354  00-06-01  104 ST   A         10/15/2019  289      11
1  N137  R354  00-00-00  104 ST   A         10/04/2019  689      269
2  N137  R354  00-00-00  104 ST   A         10/03/2019  685      379
3  N137  R354  00-00-00  104 ST   A         10/02/2019  720      352
4  N137  R354  00-00-00  104 ST   A         10/01/2019  603      352

Then, let’s rearrange the dataframe with a pivot table:

# sum up entries at each station, use 'station' as columns and 'date' as index
tstile = pd.pivot_table(turnstile_df, values='entries',\
 index=['date'], columns=['station'], aggfunc=np.sum) 
tstile.head()
station     104 ST  111 ST  125 ST  14 ST  145 ST  168 ST  ...  W 4 ST-WASH SQ  WORLD TRADE CTR
date
09/28/2019  0       1175    22111   18149  0       99626   ...  28372           450
09/29/2019  0       0       14850   8301   12179   6601    ...  23094           741
09/30/2019  1569    2227    22058   17018  18440   16861   ...  31745           16174
10/01/2019  1613    2438    23991   18127  19394   18269   ...  35842           17173
10/02/2019  2029    2929    27774   21693  24183   20073   ...  37851           18381

It is well known that the source data can be problematic because of faulty turnstiles. So, before doing any analysis, we want to check whether the data is valid. One method is to look at the average weekly entries for each station.

checker = tstile.sum()/len(weeks)
checker.sort_values(ascending = False)

From this ranking, we can see that the 59 ST COLUMBUS station has more than 16 million entries each week, while Penn Station (34 ST-PENN STA, one of the transportation hubs of NYC) has only roughly 1 million. There must be some faulty turnstiles at 59 ST COLUMBUS. Let’s remove 59 ST COLUMBUS to avoid misleading results. If interested, one can find that the turnstile with SCP number ‘01-05-01’ malfunctioned from 0:00 to 8:00 AM on 09/28/2019. Ideally we would remove those faulty records and recalculate for 59 ST COLUMBUS, but for simplicity, we just drop the station.
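For the curious, a check along those lines might look like the following sketch (not from the original post): sort the raw 4-hour readings for the station on that day and look for implausible jumps in the cumulative counter of a single turnstile.

# Sketch: spot the faulty turnstile by looking for huge jumps between
# consecutive readings of the same SCP on 09/28/2019.
suspect = mta_df[(mta_df['STATION'] == '59 ST COLUMBUS') &
                 (mta_df['DATE'] == '09/28/2019')].copy()
suspect.sort_values(['C/A', 'SCP', 'TIME'], inplace=True)
suspect['jump'] = suspect.groupby(['C/A', 'SCP'])['ENTRIES'].diff()
print(suspect.nlargest(3, 'jump')[['C/A', 'SCP', 'TIME', 'jump']])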

Let’s plot the results.

tstile.drop(columns=['59 ST COLUMBUS'], inplace = True)
tstile_weekly = tstile.sum()/len(weeks)
tstile_weekly.sort_values(ascending=False, inplace=True)
tstile_weekly.plot(kind='bar', figsize=(15, 4)) # bar chart of average weekly entries per station
Fig 2. Ranking of entries - Stations on A line

Not surprisingly, the well-known places, like Penn Station, Port Authority, Fulton Street (by the World Trade Center), and Times Square, rank at the top. However, we also notice that BEACH 36 ST, a station near the end of the A line in the Rockaways, ranks in 2nd place. This seems a little weird. If you are interested, you can check this station further, just as we did for 59 ST COLUMBUS.

Visualize Daily Entry/Exit Dynamics for Chosen Stations

Now we have the weekly ranking, but what if we want to know the intra-week dynamics of entries/exits? Let’s say, for Penn Station, Port Authority, and Times Square.

stations = ['34 ST-PENN STA', '42 ST-PORT AUTH', 'TIMES SQ-42 ST']

# Convert date strings to datetime objects for plotting
turnstile_df['date'] = pd.to_datetime(turnstile_df['date'])
tstile_dynamic = turnstile_df.groupby(['station', 'date'])[['entries', 'exits']].sum()
tstile_dynamic.reset_index(inplace=True)
tstile_dynamic.set_index('date', inplace = True)
tstile_dynamic.sort_index(ascending=True, inplace = True)

legend = ['Entries', 'Exits']

# Define a formatter for more readable y-axis tick labels
def y_fmt(tick_val, pos):
    if tick_val > 1000000:
        val = int(tick_val) / 1000000
        return '{:.0f} M'.format(val)
    elif tick_val > 1000:
        val = int(tick_val) / 1000
        return '{:.0f} k'.format(val)
    else:
        return tick_val

for station in stations:
    f, ax = plt.subplots(figsize = (15,3))
    tstile_dynamic[tstile_dynamic['station'] == station].plot(ax = ax, y = 'entries')
    tstile_dynamic[tstile_dynamic['station'] == station].plot(ax = ax, y = 'exits')
    ax.legend(legend)
    ax.set_xlabel('Date',size = 12)
    ax.set_ylabel('Entries/Exits',size = 15)
    ax.set_title('Station: ' + station)
    ax.yaxis.set_major_formatter(tick.FuncFormatter(y_fmt))
Fig 3. Weekly dynamics of entries/exits at several busy NYC stations

What happened at Penn Station on 10/05? Another turnstile problem? A big event (what event could boost entries/exits by more than 800k)? It remains to be checked.
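A possible first check (again a sketch, not from the original post): compare each Penn Station turnstile’s daily entries on 10/05 with its own typical day, to see whether a single turnstile, rather than a real crowd, accounts for the spike.

# Sketch: ratio of each turnstile's 10/05 entries to its median daily entries.
penn = turnstile_df[turnstile_df['station'] == '34 ST-PENN STA']
per_turnstile = penn.pivot_table(index=['CA', 'unit', 'scp'],
                                 columns='date', values='entries')
ratio = per_turnstile[pd.Timestamp('2019-10-05')] / per_turnstile.median(axis=1)
print(ratio.sort_values(ascending=False).head()) # suspiciously large ratios first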

Closing Notes

Unlike many other datasets, MTA turnstile data is provided as plain .txt files, so we need some extra code to handle it. Also remember that turnstile data can be misleading due to frequent turnstile malfunctions. Make sure the data makes sense before going any further.

Acknowledgement: the part on calculating entries per day was adapted from https://github.com/ZachHeick/MTA_Data_Analysis.

This post is licensed under CC BY 4.0 by the author.
