Posts Handling Citibike Data
Post
Cancel

Handling Citibike Data

Abstract

In one of my course projects, we need to analyze the seasonal pattern of Citibike riderships. I find taking some notes on how to handle Citibike data will be helpful for future reference.

This blog will demonstrate the following items:

  • Read Citibike trip history data.
  • Aggregate trip history data by month.
  • Display trip history data based on user type, gender, and age group.

Citibike provides a comprehensive list of trip history data in their homepage. It includes the following columns:

  • Trip Duration (seconds)
  • Start Time and Date
  • Stop Time and Date
  • Start Station Name
  • End Station Name
  • Station ID
  • Station Lat/Long
  • Bike ID
  • User Type (Customer = 24-hour pass or 3-day pass user; Subscriber = Annual Member)
  • Gender (Zero=unknown; 1=male; 2=female)
  • Year of Birth

Read Data

Let’s read a sample of the data to see how it looks like. Suppose we are interested in the data between 2016 - 2019. We have the data files downloaded and stored in the path ‘/data’.

import pandas as pd
import numpy as np
import matplotlib.pylab as plt
from datetime import datetime

y_ind = ['2016','2017','2018','2019'] # list of years interested
m_ind = list() 
for i in range(1,13):
    m_ind.append(str(i)) # list of months

# take a look at a data sample
data = pd.read_csv("data/201710-citibike-tripdata.csv")
data.head(5)
tripdurationstarttimestoptimestart station idstart station namestart station latitude...usertypebirth yeargender
04572017-10-01 00:00:002017-10-01 00:07:384799 Ave & W 45 St40.760193...Subscriber1985.01
164622017-10-01 00:00:202017-10-01 01:48:03279Peck Slip & Front St40.707873...CustomerNaN0
27612017-10-01 00:00:272017-10-01 00:13:095041 Ave & E 16 St40.732219...Subscriber1992.01
311932017-10-01 00:00:292017-10-01 00:20:223236W 42 St & Dyer Ave40.758985...Customer1992.02
427722017-10-01 00:00:322017-10-01 00:46:442006Central Park S & 6 Ave40.765909...CustomerNaN0

Pre-processing

Let’s say we are interested in a month-by-month pattern. For each month, the data file could be at the size of 100M. Therefore, we want to preprocess (i.e. extract and aggregate) the data before doing anything else. Say we are interested in the information about user types, gender, and age.

# assume we want to analyze the behavior of different user types, genders, and age groups
# initialize empty dataframes for each type of analysis
data_user = pd.DataFrame()
data_gender = pd.DataFrame()
data_age = pd.DataFrame()

Then, we will loop over all the data files and extract related information (Citibike changes the format or the column titles along the way, so we need to pay special attention to these change. You can read the accomodations in the following codes).

# Loop over all years and months.
# I suggest to start with 1 datafile, e.g. y_ind = [2019], m_ind = [1], for the test run.
# The entire 4 years' data takes me 20min to run with a 16G RAM and a good CPU.
for y in y_ind:
    for m in m_ind:
        # If month is '1-9', turn it to '01-09' for consistency.
        if len(m) == 1: 
            m_temp = "0" + m
        else:
            m_temp = m
        
        # Try to load data, when finished, print
        try:
            data1 = pd.read_csv("data/" + y + m_temp + "-citibike-tripdata.csv")
        except:
            print('Finish reading all the data')
        
        # Print some notes to know how many files we have loaded
        print("---------")
        print(y + m_temp + "-citibike-tripdata.csv")
		
        # Preprocessing step 1: uniform column names
        # data between 2016-10 and 2017-02 has different column names from other time. 
        # We want to change them to uniformed names
        try:
            data1['starttime'] # the column name is either 'startime' or 'Start Time'
        except:
            data1.rename(columns={u'Start Time' : 'starttime',\
                     u'Bike ID' : 'bikeid', \
                     u'User Type' : 'usertype',\
                     u'Birth Year' : 'birth year',\
                     u'Gender' : 'gender',\
                      u'Trip Duration' : 'tripduration'}, inplace=True)
            print("Note: Column name is corrected")
            
        # Preprocessing step 2: remove NaN values   
        ind = (pd.notna(data1['birth year'])) &\ 
		(pd.notna(data1['gender'])) &\ 
		(pd.notna(data1['usertype']))
		
        data_temp = data1.loc[ind].copy()
        s = pd.Series(range(0,len(data_temp.index))) # reorganize index
        data_temp.set_index([s],inplace = True)
        
        # Preprocessing step 3: Convert time str to datetime object
        # Citibike data file has the following 4 types of date str, e.g.: 
		# 2019-01-31, 1/3/2019, 1/31/2019, 10/31/2019
		# We need to test each one until we get a correct result
        try:
            time = datetime.strptime(data_temp['starttime'][0][0:10], '%Y-%m-%d')
        except:
            try:
                time = datetime.strptime(data_temp['starttime'][0][0:8], '%m/%d/%Y')
            except:
                try:
                    time = datetime.strptime(data_temp['starttime'][0][0:9], '%m/%d/%Y')
                except:
                    time = datetime.strptime(data_temp['starttime'][0][0:10], '%m/%d/%Y')
        
        # Now we have finished data cleaning, let's start to aggregate the data to each type
        
        # 1. Generate Usertype data: 
		# Columns: all, subscriber (sub), and customer (cus)
        data_temp_user = pd.DataFrame()
        data_temp_user['time'] = pd.Series(time)
        # set index to be 2 columns and count along 'usertype'
        count1 = data_temp.set_index(['usertype','tripduration']).count(level = 'usertype').bikeid
        data_temp_user['count_all'] = pd.Series(sum(count1))
        data_temp_user['count_sub'] = pd.Series(count1.Subscriber)
        # for some months, only subscribers ride Citibike
        try:
            data_temp_user['count_cus'] = pd.Series(count1.Customer)
        except:
            data_temp_user['count_cus'] = pd.Series(0)
            print('Note: There is no casual customer for ' + y + '-' + m_temp)
        # add monthly data to a predefined dataframe
        data_user = data_user.append(data_temp_user)
        
        # 2. generate Gender data
		# Columns: all, male (m), female (f), and undisclosed (u)
        data_temp_gender = pd.DataFrame()
        data_temp_gender['time'] = pd.Series(time)
        count2 = data_temp.set_index(['gender','tripduration']).count(level = 'gender').bikeid
        data_temp_gender['count_all'] = pd.Series(sum(count2))
        data_temp_gender['count_m'] = pd.Series(count2[1]) # male
        data_temp_gender['count_f'] = pd.Series(count2[2]) # female
        data_temp_gender['count_u'] = pd.Series(count2[0]) # unknown
        data_gender = data_gender.append(data_temp_gender)
        
        # 3. generate Age data
		# Columns: according to the age group specified below
        data_temp_age = pd.DataFrame()
        data_temp_age['time'] = pd.Series(time)
        data_temp['age'] = int(y) - data_temp['birth year'] # calculate age
        # split age groups. Only age over 16 is allowed to use Citibike
        age_group = [16,18,22,30,40,50,60,70,max(data_temp.age)]
        # use 'groupby' and 'cut' method to count over age groups
        count3 = data_temp.groupby(pd.cut(data_temp.age, age_group)).count().bikeid        
        data_temp_age['count_all'] = pd.Series(sum(count3))
        # assign result to each age group
        for i in range(0,len(age_group) - 1):
            data_temp_age['count_g' + str(i + 1)] = pd.Series(count3[i])            
        data_age = data_age.append(data_temp_age) 
		

Now let’s take a look of what we have obtained:

s1 = pd.Series(range(0,len(data_user.index))) 
data_user.set_index([s1],inplace = True) # assign index
data_user.head()
timecount_allcount_subcount_cus
02016-01-014849334849330
12016-02-015310485310480
22016-03-018266788266780
32016-04-018826798826790
42016-05-01103595910359590
s2 = pd.Series(range(0,len(data_gender.index))) 
data_gender.set_index([s2],inplace = True)
data_gender.head()
timecount_allcount_mcount_fcount_u
02016-01-014849333793121044571164
12016-02-015310484172151125871246
22016-03-018266786342141905511913
32016-04-018826796741272065102042
42016-05-0110359597836872498312441
s3 = pd.Series(range(0,len(data_age.index))) 
data_age.set_index([s3],inplace = True)
data_age.head()
timecount_allcount_g1count_g2count_g3...count_g7count_g8
02016-01-0148493217076548109880...227292973
12016-02-01531025187512614125913...246573308
22016-03-01826631278018494208283...371965162
32016-04-01882548282220496235703...371635271
42016-05-011035719373021318280539...414445754

Great! After a bunch of work, we successfully extract monthly aggregated data for each scenario. We want to save them to our local disk, then next time we want to use them, we don’t need to go over this time-consuming procedure again:

data_user.to_csv('data_usertype.csv')
data_gender.to_csv('data_gender.csv')
data_age.to_csv('data_age.csv')

Visualization

Now, its time to have some good plots.

First of all, let’s see how user type looks like.

plt.figure(figsize=(12,5))
plt.plot(data_gender['time'], data_gender['count_all'],'>-')
plt.plot(data_gender['time'], data_gender['count_m'],'o-')
plt.plot(data_gender['time'], data_gender['count_f'],'*-')
plt.plot(data_gender['time'], data_gender['count_u'],'^-')
plt.ylabel('Ridership')
plt.xlabel('Time')
plt.legend(['all','male','female','unknown'])
Fig 1. How ridership splited among subscribers and customers

We notice a clear pattern of seasonal dynamics. The ridership peaks in Sep or Oct, and reaches its valley in winter.

A another fact is that subscribers have a dominant share of ridership in Citibike, but we can also see an increasing trend of customer ridership during recent years.

Then, what about genders?

plt.figure(figsize=(12,5))
plt.plot(data_user['time'], data_user['count_all'],'>-')
plt.plot(data_user['time'], data_user['count_sub'],'o-')
plt.plot(data_user['time'], data_user['count_cus'],'*-')
plt.ylabel('Ridership')
plt.xlabel('Time')
plt.legend(['all','subscriber','customer'])
Fig 2. How ridership splited among genders

Suprisingly, though NYC has a larger share of females, the ridership is performed more by males. Huh, funny. Is it because males prefer biking than females?

Also, started from 2018, more users reported their gender to be unknown. This could be explained by thinking that people have been getting more awared of pravicy protection in using apps, or simply, Citibike provided a 3rd selection of genders since 2018.

Finally, how does age influences ridership?

plt.figure(figsize=(12,5))
legend = list()
for i in range(0,len(age_group) - 1):    
    plt.plot(data_age['time'], data_age['count_g' + str(i + 1)],'o-')
    legend.append(str(age_group[i]) + '~' + str(age_group[i + 1]))
legend[-1] = '70+'
plt.legend(legend)
plt.ylabel('Ridership')
plt.xlabel('Time')
Fig 3. How ridership splited among age groups

Clearly, working age people use Citibike the most.

Closing Notes

Citibike provides a good dataset for research purposes. This blog is intended to show how to process a large amount of CItibike data files and extract useful information.

This post is licensed under CC BY 4.0 by the author.

Comments powered by Disqus.