Posted Dec 10, 2019 by Zuzhao Ye
Updated Jun 7, 2020
Abstract
For one of my course projects, we needed to analyze the seasonal pattern of Citibike ridership. Taking some notes on how to handle Citibike data should be helpful for future reference.
This blog will demonstrate the following items:
Read Citibike trip history data.
Aggregate trip history data by month.
Display trip history data based on user type, gender, and age group.
Citibike publishes comprehensive trip history data on its website. The data includes the following columns:
Trip Duration (seconds)
Start Time and Date
Stop Time and Date
Start Station Name
End Station Name
Station ID
Station Lat/Long
Bike ID
User Type (Customer = 24-hour pass or 3-day pass user; Subscriber = Annual Member)
Gender (Zero=unknown; 1=male; 2=female)
Year of Birth
Read Data
Let’s read a sample of the data to see what it looks like. Suppose we are interested in the data from 2016 to 2019, and that the data files have been downloaded and stored under ‘/data’.
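As a minimal sketch of this step (not necessarily the code originally used here), a monthly file can be read with pandas. To keep the example self-contained, it parses three of the sample rows from an in-memory string and includes only some of the displayed columns; the variable names are illustrative assumptions.

```python
import io

import pandas as pd

# Stand-in for one monthly file. In practice you would pass the path of a
# CSV stored under /data to pd.read_csv instead of this in-memory string.
sample_csv = io.StringIO(
    "tripduration,starttime,stoptime,start station id,usertype,birth year,gender\n"
    "457,2017-10-01 00:00:00,2017-10-01 00:07:38,479,Subscriber,1985.0,1\n"
    "6462,2017-10-01 00:00:20,2017-10-01 01:48:03,279,Customer,,0\n"
    "761,2017-10-01 00:00:27,2017-10-01 00:13:09,504,Subscriber,1992.0,1\n"
)

# Parse the two timestamp columns as datetimes instead of plain strings.
df = pd.read_csv(sample_csv, parse_dates=["starttime", "stoptime"])

print(df.head())
print(df.dtypes)
```

Note that a missing birth year is read as NaN, which is why the column shows up as a float.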
   tripduration  starttime            stoptime             start station id  start station name      start station latitude  ...  usertype    birth year  gender
0  457           2017-10-01 00:00:00  2017-10-01 00:07:38  479               9 Ave & W 45 St         40.760193               ...  Subscriber  1985.0      1
1  6462          2017-10-01 00:00:20  2017-10-01 01:48:03  279               Peck Slip & Front St    40.707873               ...  Customer    NaN         0
2  761           2017-10-01 00:00:27  2017-10-01 00:13:09  504               1 Ave & E 16 St         40.732219               ...  Subscriber  1992.0      1
3  1193          2017-10-01 00:00:29  2017-10-01 00:20:22  3236              W 42 St & Dyer Ave      40.758985               ...  Customer    1992.0      2
4  2772          2017-10-01 00:00:32  2017-10-01 00:46:44  2006              Central Park S & 6 Ave  40.765909               ...  Customer    NaN         0
Pre-processing
Let’s say we are interested in a month-by-month pattern. Each monthly data file can be around 100 MB, so we want to preprocess (i.e., extract and aggregate) the data before doing anything else. Say we are interested in user type, gender, and age.
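The idea of collapsing one month of trips into a handful of counts can be sketched as follows. This is an illustration under assumptions, not the exact procedure used for the results in this post: the function name `summarize_month`, the `age_bin_*` keys, and the age bin edges are all hypothetical, since the boundaries of the age groups are not stated here.

```python
import pandas as pd


def summarize_month(trips: pd.DataFrame, year: int) -> dict:
    """Collapse one month of trips into counts by user type, gender, and age."""
    age = year - trips["birth year"]
    # Hypothetical age bins (left-closed); the post's own groups may differ.
    bins = [0, 18, 25, 35, 45, 55, 65, 120]
    age_group = pd.cut(age, bins=bins, right=False)

    summary = {
        "count_all": len(trips),
        "count_sub": int((trips["usertype"] == "Subscriber").sum()),
        "count_cus": int((trips["usertype"] == "Customer").sum()),
        "count_m": int((trips["gender"] == 1).sum()),
        "count_f": int((trips["gender"] == 2).sum()),
        "count_u": int((trips["gender"] == 0).sum()),
    }
    # Trips with a missing birth year fall outside every bin and are skipped,
    # so the age counts can sum to slightly less than count_all.
    counts = age_group.value_counts().sort_index()
    for i, n in enumerate(counts, start=1):
        summary[f"age_bin_{i}"] = int(n)
    return summary


# The five sample rows shown earlier, reduced to the three columns we need.
sample = pd.DataFrame({
    "usertype": ["Subscriber", "Customer", "Subscriber", "Customer", "Customer"],
    "birth year": [1985.0, None, 1992.0, 1992.0, None],
    "gender": [1, 0, 1, 2, 0],
})
result = summarize_month(sample, year=2017)
print(result)
```

Each monthly file then shrinks to a single small dictionary, which is cheap to keep in memory for all months at once.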
Then, we loop over all the data files and extract the relevant information. Citibike changed the file format and column titles along the way, so we need to pay special attention to these changes. You can see how they are handled in the code that follows.
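One way to absorb header changes is to normalize column names before touching the data. The sketch below assumes that the variants differ only in capitalization, spaces, and underscores (for example `usertype` versus `User Type`); that assumption, and the helper names `canonical` and `count_by_month`, are illustrative rather than taken from the original code.

```python
import pandas as pd


def canonical(name: str) -> str:
    """Lower-case a header and drop spaces/underscores: 'User Type' -> 'usertype'."""
    return name.lower().replace(" ", "").replace("_", "")


def count_by_month(paths):
    """Read each monthly file once and keep only one row of counts per file."""
    rows = []
    for path in sorted(paths):
        trips = pd.read_csv(path)
        trips.columns = [canonical(c) for c in trips.columns]
        start = pd.to_datetime(trips["starttime"])
        rows.append({
            # First day of the month of the earliest trip in the file.
            "time": start.min().to_period("M").to_timestamp(),
            "count_all": len(trips),
            "count_sub": int((trips["usertype"] == "Subscriber").sum()),
            "count_cus": int((trips["usertype"] == "Customer").sum()),
        })
    return pd.DataFrame(rows)


# Usage, assuming the monthly CSV files sit directly under /data:
#   import glob
#   monthly = count_by_month(glob.glob("/data/*.csv"))
```

Parsing the timestamps file by file also lets `pd.to_datetime` infer the date format separately for each file, which helps if the format changed between releases.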
Now let’s take a look at what we have obtained:
   time        count_all  count_sub  count_cus
0  2016-01-01  484933     484933     0
1  2016-02-01  531048     531048     0
2  2016-03-01  826678     826678     0
3  2016-04-01  882679     882679     0
4  2016-05-01  1035959    1035959    0
   time        count_all  count_m  count_f  count_u
0  2016-01-01  484933     379312   104457   1164
1  2016-02-01  531048     417215   112587   1246
2  2016-03-01  826678     634214   190551   1913
3  2016-04-01  882679     674127   206510   2042
4  2016-05-01  1035959    783687   249831   2441
   time        count_all  count_g1  count_g2  count_g3  ...  count_g7  count_g8
0  2016-01-01  484932     1707      6548      109880    ...  22729     2973
1  2016-02-01  531025     1875      12614     125913    ...  24657     3308
2  2016-03-01  826631     2780      18494     208283    ...  37196     5162
3  2016-04-01  882548     2822      20496     235703    ...  37163     5271
4  2016-05-01  1035719    3730      21318     280539    ...  41444     5754
Great! After a fair amount of work, we have extracted monthly aggregated data for each scenario. We save the results to local disk so that the next time we need them, we don’t have to repeat this time-consuming procedure:
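A minimal sketch of saving and reloading one of the aggregated tables with pandas. The output file name and location are hypothetical, and the small DataFrame simply repeats the first two rows of the user-type table so that the example runs on its own.

```python
import os
import tempfile

import pandas as pd

# First two rows of the user-type table, used here as a stand-in.
by_usertype = pd.DataFrame({
    "time": pd.to_datetime(["2016-01-01", "2016-02-01"]),
    "count_all": [484933, 531048],
    "count_sub": [484933, 531048],
    "count_cus": [0, 0],
})

# Hypothetical output location; replace with your own folder.
out_path = os.path.join(tempfile.gettempdir(), "citibike_by_usertype.csv")

# index=False avoids writing the 0, 1, 2, ... row labels as an extra column.
by_usertype.to_csv(out_path, index=False)

# Later: reload the small summary without touching the raw trip files.
reloaded = pd.read_csv(out_path, parse_dates=["time"])
print(reloaded)
```

CSV keeps the result readable in any tool; the only thing to remember on reload is `parse_dates`, since CSV does not store column types.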
Visualization
Now it’s time for some plots.
First, let’s see how ridership splits by user type.
Fig 1. How ridership is split between subscribers and customers
We notice a clear seasonal pattern: ridership peaks in September or October and bottoms out in winter.
Another observation is that subscribers account for the dominant share of Citibike ridership, although customer ridership has been trending upward in recent years.
Then, what about gender?
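A stacked area chart is one reasonable way to draw such a split; the sketch below is an assumption about the plot style, not a reproduction of the original figure. It uses matplotlib with the five months of gender counts listed earlier, and the output file name is hypothetical. The same pattern applies to the user-type and age-group columns.

```python
import os
import tempfile

import matplotlib

matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

by_gender = pd.DataFrame({
    "time": pd.to_datetime(
        ["2016-01-01", "2016-02-01", "2016-03-01", "2016-04-01", "2016-05-01"]
    ),
    "count_m": [379312, 417215, 634214, 674127, 783687],
    "count_f": [104457, 112587, 190551, 206510, 249831],
    "count_u": [1164, 1246, 1913, 2042, 2441],
})

fig, ax = plt.subplots(figsize=(8, 4))
# One layer per gender code, stacked so the top edge is the monthly total.
ax.stackplot(
    by_gender["time"],
    by_gender["count_m"],
    by_gender["count_f"],
    by_gender["count_u"],
    labels=["male", "female", "unknown"],
)
ax.set_xlabel("month")
ax.set_ylabel("trips per month")
ax.legend(loc="upper left")
fig.autofmt_xdate()  # tilt the date labels so they do not overlap

# Hypothetical output file.
png_path = os.path.join(tempfile.gettempdir(), "ridership_by_gender.png")
fig.savefig(png_path, dpi=150)
```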
Fig 2. How ridership is split among genders
Surprisingly, though NYC has a larger share of females, most rides are taken by males. Huh, funny. Is it because males like biking more than females do?
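To put a number on the imbalance, the male share can be computed directly from the aggregated gender counts. This small sketch uses the five months listed earlier and leaves trips with unknown gender out of the denominator, which is a choice made for this example only; the column name `male_share` is likewise illustrative.

```python
import pandas as pd

by_gender = pd.DataFrame({
    "time": pd.to_datetime(
        ["2016-01-01", "2016-02-01", "2016-03-01", "2016-04-01", "2016-05-01"]
    ),
    "count_m": [379312, 417215, 634214, 674127, 783687],
    "count_f": [104457, 112587, 190551, 206510, 249831],
})

# Share of trips taken by males among trips with a reported gender.
known = by_gender["count_m"] + by_gender["count_f"]
by_gender["male_share"] = by_gender["count_m"] / known

print(by_gender[["time", "male_share"]])
```

For these five months the share works out to roughly three quarters of the trips with a reported gender. Keep in mind that these are counts of trips, not of distinct riders.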
Also, starting in 2018, more users reported their gender as unknown. This could be because people have become more aware of privacy protection when using apps, or simply because Citibike has offered a third gender option since 2018.
Finally, how does age influence ridership?
Fig 3. How ridership is split among age groups
Clearly, working-age people use Citibike the most.
Closing Notes
Citibike provides a good dataset for research purposes. This post is intended to show how to process a large number of Citibike data files and extract useful information.