Analyzing MTA Turnstile Data

Every day, millions of New York City residents and visitors flood the subway system as their primary means of transportation throughout the city. The subway is run by the Metropolitan Transportation Authority (MTA), and every station is a high-foot-traffic area, which makes it optimal for interfacing with a large number of people. I set out to analyze the raw turnstile data for the MTA subway system, hoping to find the stations with the highest foot traffic and the times at which it occurred. These stations would be optimal spots for placing representatives of campaigns that need to reach a large number of people.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime
import pickle

%matplotlib inline

The first thing we want to do is grab the data from the MTA website. The data comes as CSV files linked from the MTA developers page. I initially grabbed only three weeks of data so loading and cleaning would run faster; the final model ran on a year's worth of data.

# Source: http://web.mta.info/developers/turnstile.html
def get_data(week_nums):
    url = "http://web.mta.info/developers/data/nyct/turnstile/turnstile_{}.txt"
    dfs = []
    for week_num in week_nums:
        file_url = url.format(week_num)
        dfs.append(pd.read_csv(file_url))
    return pd.concat(dfs)
        
week_nums = [160903, 160910, 160917]
turnstiles_df = get_data(week_nums)
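
As an aside, the full-year week list doesn't need to be typed by hand. The MTA posts one file per week, named with a Saturday date in YYMMDD format, so the identifiers can be generated; a minimal sketch (the helper name and start date are my own choices):

# Generate a year's worth of week identifiers; each file is dated on a Saturday.
def saturday_week_nums(start_date, weeks):
    return [(start_date + datetime.timedelta(weeks=i)).strftime("%y%m%d")
            for i in range(weeks)]

year_week_nums = saturday_week_nums(datetime.date(2016, 9, 3), 52)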

Now that the data is loaded in let's take a peek at it:

turnstiles_df.head()
C/A UNIT SCP STATION LINENAME DIVISION DATE TIME DESC ENTRIES EXITS
0 A002 R051 02-00-00 59 ST NQR456 BMT 08/27/2016 00:00:00 REGULAR 5799442 1966041
1 A002 R051 02-00-00 59 ST NQR456 BMT 08/27/2016 04:00:00 REGULAR 5799463 1966044
2 A002 R051 02-00-00 59 ST NQR456 BMT 08/27/2016 08:00:00 REGULAR 5799492 1966079
3 A002 R051 02-00-00 59 ST NQR456 BMT 08/27/2016 12:00:00 REGULAR 5799610 1966155
4 A002 R051 02-00-00 59 ST NQR456 BMT 08/27/2016 16:00:00 REGULAR 5799833 1966214

We see that there are cumulative entry and exit counts for each device, recorded every four hours of every day. If we can use this data to find the number of entries for each station, we can find which stations are the most popular. Let's start with this goal in mind.

There is an explanation of the column headers on the MTA website:

C/A - Control Area/Booth name. This is the internal identification of a booth at a given station.
UNIT - Remote unit ID of the station.
SCP - Subunit/Channel/Position. Represents a specific address for a given device.
STATION - Name assigned to the subway station by operations planning. This name is used in all references to the station, including debit/credit purchase receipts and customers' bank activity statements.
LINENAME - Train lines stopping at this location. Can contain up to 20 single-character identifiers. When more than one train line appears, it is usually at a major station where passengers can transfer between lines.
DIVISION - Represents the division the station originally belonged to: BMT, IRT, or IND (Brooklyn–Manhattan Transit Corporation, Interborough Rapid Transit Company, and Independent Subway System, respectively).
DATE - Represents the date of the audit data.
TIME - Represents the time of the audit event, typically every four hours.
DESC - Represents the audit event type: REGULAR for a scheduled audit, RECOVR AUD for a missed audit that was recovered.
ENTRIES - Cumulative entry register value for the device.
EXITS - Cumulative exit register value for the device.

Now let's check if there are any issues with formatting for the columns.

turnstiles_df.columns
Index(['C/A', 'UNIT', 'SCP', 'STATION', 'LINENAME', 'DIVISION', 'DATE', 'TIME',
       'DESC', 'ENTRIES',
       'EXITS                                                               '],
      dtype='object')

For good measure, let's strip out the whitespace that may be surrounding the column names.

turnstiles_df.columns = [column.strip() for column in turnstiles_df.columns]
turnstiles_df.columns
Index(['C/A', 'UNIT', 'SCP', 'STATION', 'LINENAME', 'DIVISION', 'DATE', 'TIME',
       'DESC', 'ENTRIES', 'EXITS'],
      dtype='object')
turnstiles_df.head()
C/A UNIT SCP STATION LINENAME DIVISION DATE TIME DESC ENTRIES EXITS
0 A002 R051 02-00-00 59 ST NQR456 BMT 08/27/2016 00:00:00 REGULAR 5799442 1966041
1 A002 R051 02-00-00 59 ST NQR456 BMT 08/27/2016 04:00:00 REGULAR 5799463 1966044
2 A002 R051 02-00-00 59 ST NQR456 BMT 08/27/2016 08:00:00 REGULAR 5799492 1966079
3 A002 R051 02-00-00 59 ST NQR456 BMT 08/27/2016 12:00:00 REGULAR 5799610 1966155
4 A002 R051 02-00-00 59 ST NQR456 BMT 08/27/2016 16:00:00 REGULAR 5799833 1966214

Now let's check that every day is represented, and how many readings each day has. We see below that some days have larger counts than others.

turnstiles_df.DATE.value_counts().sort_index()
08/27/2016    27290
08/28/2016    27300
08/29/2016    27828
08/30/2016    27775
08/31/2016    27352
09/01/2016    28116
09/02/2016    27478
09/03/2016    27353
09/04/2016    27304
09/05/2016    27325
09/06/2016    28826
09/07/2016    27977
09/08/2016    28137
09/09/2016    27956
09/10/2016    27325
09/11/2016    27334
09/12/2016    28074
09/13/2016    27343
09/14/2016    27359
09/15/2016    27283
09/16/2016    28160
Name: DATE, dtype: int64

We will want to turn the dates and times into datetimes so we can work with them.

# Combine the date and time fields into a single datetime column
turnstiles_df["DATE_TIME"] = pd.to_datetime(turnstiles_df.DATE + " " + turnstiles_df.TIME, format="%m/%d/%Y %H:%M:%S")

We see below that the new DATE_TIME column holds the dates and times in chronological order, with readings every four hours.

turnstiles_df.head()
C/A UNIT SCP STATION LINENAME DIVISION DATE TIME DESC ENTRIES EXITS DATE_TIME
0 A002 R051 02-00-00 59 ST NQR456 BMT 08/27/2016 00:00:00 REGULAR 5799442 1966041 2016-08-27 00:00:00
1 A002 R051 02-00-00 59 ST NQR456 BMT 08/27/2016 04:00:00 REGULAR 5799463 1966044 2016-08-27 04:00:00
2 A002 R051 02-00-00 59 ST NQR456 BMT 08/27/2016 08:00:00 REGULAR 5799492 1966079 2016-08-27 08:00:00
3 A002 R051 02-00-00 59 ST NQR456 BMT 08/27/2016 12:00:00 REGULAR 5799610 1966155 2016-08-27 12:00:00
4 A002 R051 02-00-00 59 ST NQR456 BMT 08/27/2016 16:00:00 REGULAR 5799833 1966214 2016-08-27 16:00:00

The SCP represents a single turnstile; however, an SCP is only a subunit of its station, unit, and C/A. Therefore, all four of these need to be grouped together, along with the datetime, to get the entries and exits for each turnstile. If these four identifiers are grouped together and there are duplicates, something must have gone wrong with the turnstile, because it produced multiple readings for the same date and time.

(turnstiles_df
 .groupby(["C/A", "UNIT", "SCP", "STATION", "DATE_TIME"])  
 .ENTRIES.count()
 .reset_index()  # or use as_index = False; otherwise makes groupby columns new index 
 .sort_values("ENTRIES", ascending=False)).head(5)
C/A UNIT SCP STATION DATE_TIME ENTRIES
513722 R504 R276 00-00-01 VERNON-JACKSON 2016-09-16 08:00:00 2
0 A002 R051 02-00-00 59 ST 2016-08-27 00:00:00 1
387264 R155 R116 01-00-03 50 ST 2016-09-15 20:00:00 1
387259 R155 R116 01-00-03 50 ST 2016-09-15 00:00:00 1
387260 R155 R116 01-00-03 50 ST 2016-09-15 04:00:00 1

On 9/16, we seem to have two entries for the same time. Let's take a look.

mask = ((turnstiles_df["C/A"] == "R504") & 
(turnstiles_df["UNIT"] == "R276") & 
(turnstiles_df["SCP"] == "00-00-01") & 
(turnstiles_df["STATION"] == "VERNON-JACKSON") &
(turnstiles_df["DATE_TIME"].dt.date == datetime.datetime(2016, 9, 16).date()))

turnstiles_df[mask].head()
C/A UNIT SCP STATION LINENAME DIVISION DATE TIME DESC ENTRIES EXITS DATE_TIME
170562 R504 R276 00-00-01 VERNON-JACKSON 7 IRT 09/16/2016 00:00:00 REGULAR 8137913 1333972 2016-09-16 00:00:00
170563 R504 R276 00-00-01 VERNON-JACKSON 7 IRT 09/16/2016 04:00:00 REGULAR 8137913 1333975 2016-09-16 04:00:00
170564 R504 R276 00-00-01 VERNON-JACKSON 7 IRT 09/16/2016 08:00:00 REGULAR 8138271 1334066 2016-09-16 08:00:00
170565 R504 R276 00-00-01 VERNON-JACKSON 7 IRT 09/16/2016 08:00:00 RECOVR AUD 8138271 16735026 2016-09-16 08:00:00
170566 R504 R276 00-00-01 VERNON-JACKSON 7 IRT 09/16/2016 12:00:00 REGULAR 8139437 1334196 2016-09-16 12:00:00

Looks to be an incorrect RECOVR AUD entry. Maybe we should just select the REGULAR one.

turnstiles_df.DESC.value_counts()
REGULAR       579109
RECOVR AUD      1786
Name: DESC, dtype: int64

Since we are only interested in entries, and both rows report the same ENTRIES value, we should be OK.
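
Before dropping anything, a quick sanity check (my own addition, not strictly necessary) confirms that duplicated readings agree on ENTRIES:

# Count duplicate groups whose ENTRIES values disagree.
# If this prints 0, dropping either duplicate row loses no entry data.
key_cols = ["C/A", "UNIT", "SCP", "STATION", "DATE_TIME"]
dupes = turnstiles_df[turnstiles_df.duplicated(subset=key_cols, keep=False)]
print((dupes.groupby(key_cols).ENTRIES.nunique() > 1).sum())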

# Get rid of the duplicate entry
turnstiles_df.sort_values(["C/A", "UNIT", "SCP", "STATION", "DATE_TIME"], inplace=True, \
                          ascending=False)
turnstiles_df.drop_duplicates(subset=["C/A", "UNIT", "SCP", "STATION", "DATE_TIME"], inplace=True)

Let's do the same check to ensure there are no more duplicate entries.

(turnstiles_df
 .groupby(["C/A", "UNIT", "SCP", "STATION", "DATE_TIME"])
 .ENTRIES.count()
 .reset_index()
 .sort_values("ENTRIES", ascending=False)).head(5)
C/A UNIT SCP STATION DATE_TIME ENTRIES
0 A002 R051 02-00-00 59 ST 2016-08-27 00:00:00 1
387266 R155 R116 01-00-03 50 ST 2016-09-16 04:00:00 1
387260 R155 R116 01-00-03 50 ST 2016-09-15 04:00:00 1
387261 R155 R116 01-00-03 50 ST 2016-09-15 08:00:00 1
387262 R155 R116 01-00-03 50 ST 2016-09-15 12:00:00 1

We no longer have duplicate entries. For now we will only be concerned with entries to each station.

# Drop Exits and Desc columns. To prevent errors in multiple run of cell, 
# errors on drop is ignored (e.g. if some columns were dropped already)
turnstiles_df = turnstiles_df.drop(["EXITS", "DESC"], axis=1, errors="ignore")

Now we will group the cleaned data by the four identifiers for a single turnstile, along with the date, to get one reading per turnstile per day. Because the frame is still sorted with DATE_TIME descending, .first() picks up the last cumulative reading of each day.

turnstiles_daily = turnstiles_df.groupby(["C/A", "UNIT", "SCP", "STATION", "DATE"])\
.ENTRIES.first().reset_index()
turnstiles_daily.head()
C/A UNIT SCP STATION DATE ENTRIES
0 A002 R051 02-00-00 59 ST 08/27/2016 5800121
1 A002 R051 02-00-00 59 ST 08/28/2016 5800798
2 A002 R051 02-00-00 59 ST 08/29/2016 5802336
3 A002 R051 02-00-00 59 ST 08/30/2016 5803875
4 A002 R051 02-00-00 59 ST 08/31/2016 5805383

The entries are a cumulative sum. The previous day's reading needs to be subtracted from the current day's reading to get the total daily entries.

turnstiles_daily[["PREV_DATE", "PREV_ENTRIES"]] = (turnstiles_daily
                                                       .groupby(["C/A", "UNIT", "SCP", "STATION"])[["DATE", "ENTRIES"]]
                                                       .transform(lambda grp: grp.shift(1)))
# transform() takes a function as parameter
# shift moves the index by the number of periods given (positive or negative)
turnstiles_daily.head()
C/A UNIT SCP STATION DATE ENTRIES PREV_DATE PREV_ENTRIES
0 A002 R051 02-00-00 59 ST 08/27/2016 5800121 NaN NaN
1 A002 R051 02-00-00 59 ST 08/28/2016 5800798 08/27/2016 5800121.0
2 A002 R051 02-00-00 59 ST 08/29/2016 5802336 08/28/2016 5800798.0
3 A002 R051 02-00-00 59 ST 08/30/2016 5803875 08/29/2016 5802336.0
4 A002 R051 02-00-00 59 ST 08/31/2016 5805383 08/30/2016 5803875.0
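
As an aside, pandas can compute the same per-group difference in one step with diff(); a minimal equivalent sketch (the DAILY_DIFF name is just for illustration):

# diff() subtracts the previous row within each group,
# matching ENTRIES - PREV_ENTRIES computed above.
turnstiles_daily["DAILY_DIFF"] = (turnstiles_daily
                                  .groupby(["C/A", "UNIT", "SCP", "STATION"])
                                  .ENTRIES.diff())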

The first date in each group has no previous date, so we will drop those rows.

# Drop the rows for first date
turnstiles_daily.dropna(subset=["PREV_DATE"], axis=0, inplace=True)
# axis = 0 means index (=1 means column)

Let's check whether the number of entries for a given day is ever lower than the entries for the previous day.

turnstiles_daily[turnstiles_daily["ENTRIES"] < turnstiles_daily["PREV_ENTRIES"]].head()
C/A UNIT SCP STATION DATE ENTRIES PREV_DATE PREV_ENTRIES
715 A011 R080 01-00-00 57 ST-7 AV 08/28/2016 887688133 08/27/2016 887689608.0
716 A011 R080 01-00-00 57 ST-7 AV 08/29/2016 887685714 08/28/2016 887688133.0
717 A011 R080 01-00-00 57 ST-7 AV 08/30/2016 887683207 08/29/2016 887685714.0
718 A011 R080 01-00-00 57 ST-7 AV 08/31/2016 887680260 08/30/2016 887683207.0
719 A011 R080 01-00-00 57 ST-7 AV 09/01/2016 887677354 08/31/2016 887680260.0

We see that some turnstile counters are counting in reverse, decreasing over time. Let's look at a few of these.

mask = ((turnstiles_df["C/A"] == "A011") & 
(turnstiles_df["UNIT"] == "R080") & 
(turnstiles_df["SCP"] == "01-00-00") & 
(turnstiles_df["STATION"] == "57 ST-7 AV") &
(turnstiles_df["DATE_TIME"].dt.date == datetime.datetime(2016, 8, 27).date())) 

# datetime is both name of module and name of constructor of datetime object
turnstiles_df[mask].head()
C/A UNIT SCP STATION LINENAME DIVISION DATE TIME ENTRIES DATE_TIME
1446 A011 R080 01-00-00 57 ST-7 AV NQR BMT 08/27/2016 20:00:00 887689608 2016-08-27 20:00:00
1445 A011 R080 01-00-00 57 ST-7 AV NQR BMT 08/27/2016 16:00:00 887690085 2016-08-27 16:00:00
1444 A011 R080 01-00-00 57 ST-7 AV NQR BMT 08/27/2016 12:00:00 887690472 2016-08-27 12:00:00
1443 A011 R080 01-00-00 57 ST-7 AV NQR BMT 08/27/2016 08:00:00 887690738 2016-08-27 08:00:00
1442 A011 R080 01-00-00 57 ST-7 AV NQR BMT 08/27/2016 04:00:00 887690800 2016-08-27 04:00:00

Let's see how many stations have this problem and how often.

(turnstiles_daily[turnstiles_daily["ENTRIES"] < turnstiles_daily["PREV_ENTRIES"]]
    .groupby(["C/A", "UNIT", "SCP", "STATION"])
    .size()) # size() behaves same as if we'd done .DATE.count() 
C/A    UNIT  SCP       STATION        
A011   R080  01-00-00  57 ST-7 AV         20
             01-00-04  57 ST-7 AV         17
             01-00-05  57 ST-7 AV         20
A016   R081  03-06-01  49 ST               1
A025   R023  01-03-02  34 ST-HERALD SQ    20
A049   R088  02-05-00  CORTLANDT ST       15
A066   R118  00-00-00  CANAL ST           20
C019   R232  00-00-02  45 ST              20
H003   R163  01-00-02  6 AV               20
H023   R236  00-06-00  DEKALB AV          20
J034   R007  00-00-02  104 ST             20
JFK01  R535  00-00-01  HOWARD BCH JFK      1
             00-00-02  HOWARD BCH JFK      1
             00-00-03  HOWARD BCH JFK      2
JFK02  R535  01-00-01  HOWARD BCH JFK      1
             01-00-02  HOWARD BCH JFK      1
             01-00-03  HOWARD BCH JFK      1
             01-00-04  HOWARD BCH JFK      1
             01-00-05  HOWARD BCH JFK      1
             01-00-06  HOWARD BCH JFK      1
JFK03  R536  00-00-01  JFK JAMAICA CT1     2
             00-00-02  JFK JAMAICA CT1     2
             00-00-03  JFK JAMAICA CT1     1
             00-00-04  JFK JAMAICA CT1     1
             00-00-05  JFK JAMAICA CT1     1
             00-03-00  JFK JAMAICA CT1     1
             00-03-01  JFK JAMAICA CT1     1
             00-03-02  JFK JAMAICA CT1     1
             00-03-03  JFK JAMAICA CT1     1
             00-03-04  JFK JAMAICA CT1     1
                                          ..
PTH07  R550  00-01-02  CITY / BUS          1
PTH16  R550  01-00-03  LACKAWANNA          1
             01-02-03  LACKAWANNA          1
PTH19  R549  02-00-01  NEWARK C            2
R126   R189  01-00-02  CHRISTOPHER ST      6
R127   R105  00-00-00  14 ST              20
R148   R033  01-00-01  TIMES SQ-42 ST     20
R158   R084  00-06-03  59 ST COLUMBUS      1
R175   R169  01-00-04  137 ST CITY COL     1
R210   R044  00-03-04  BROOKLYN BRIDGE     6
R227   R131  00-00-00  23 ST              20
R238A  R046  02-00-03  GRD CNTRL-42 ST     1
R242   R049  01-00-02  51 ST               1
R256   R182  00-00-02  116 ST              1
R258   R132  00-00-03  125 ST             20
R304   R206  00-00-00  125 ST             20
R305   R206  01-00-00  125 ST              1
             01-00-02  125 ST             20
R310   R053  01-00-02  3 AV-149 ST        20
R317   R408  01-05-01  SIMPSON ST          1
R322   R386  00-00-02  174 ST             20
R333   R366  00-00-01  225 ST              1
R414   R162  00-00-01  ELDER AV            1
             00-03-00  ELDER AV            1
R526   R096  00-05-03  82 ST-JACKSON H     1
R550   R072  00-03-0A  34 ST-HUDSON YD     2
R622   R123  00-00-00  FRANKLIN AV        20
R629   R065  00-03-02  ROCKAWAY AV         1
R632   R067  00-00-02  PENNSYLVANIA AV     1
R646   R110  01-00-01  FLATBUSH AV-B.C    20
Length: 83, dtype: int64

Let's clean these up. If a counter decreased from the previous day, take the absolute value of the difference. To guard against counter resets and other outliers, zero out any daily count above 1,000,000 people.

def get_daily_counts(row, max_counter):
    counter = row["ENTRIES"] - row["PREV_ENTRIES"]
    if counter < 0:
        counter = -counter
    if counter > max_counter:
        print(row["ENTRIES"], row["PREV_ENTRIES"])
        return 0
    return counter

# If counter is > 1 million, then the counter might have been reset.  
# Just set it to zero as different counters have different cycle limits
turnstiles_daily["DAILY_ENTRIES"] = turnstiles_daily.apply(get_daily_counts, axis=1, max_counter=1000000) 

# apply takes a function as parameter and applies it along the given axis (1=apply by row) 
# apply takes care of passing each row to the function
194 1894869683.0
262183 67111299.0
63 1054429.0
2862 2044694.0
1912607049 7509132.0
285 1494569141.0
2542 14129479.0
67108864 0.0
335600216 1374824.0
36 3252790.0
1117 3665497.0
262269 10823938.0
1452 1460387.0
1154413610 8957775.0
100665773 35655801.0
1443 1014193.0
186916568 604712476.0
874 6130902.0
598 11078528.0
1160 74672293.0
0 117440532.0
100663296 1299.0
117440514 100663296.0
2123 9170483.0
428 2336204.0
1024 12498870.0

There we have it! We have the daily entries.

turnstiles_daily.head()
C/A UNIT SCP STATION DATE ENTRIES PREV_DATE PREV_ENTRIES DAILY_ENTRIES
1 A002 R051 02-00-00 59 ST 08/28/2016 5800798 08/27/2016 5800121.0 677.0
2 A002 R051 02-00-00 59 ST 08/29/2016 5802336 08/28/2016 5800798.0 1538.0
3 A002 R051 02-00-00 59 ST 08/30/2016 5803875 08/29/2016 5802336.0 1539.0
4 A002 R051 02-00-00 59 ST 08/31/2016 5805383 08/30/2016 5803875.0 1508.0
5 A002 R051 02-00-00 59 ST 09/01/2016 5806990 08/31/2016 5805383.0 1607.0

The ultimate goal of this project is to find the busiest stations and their peak times. In order to do that, we will want to look at exits as well; entries plus exits represent the total foot traffic through a subway station. The data for all of 2017 was cleaned in the manner shown above, the daily exits were calculated the same way as the daily entries, and the resulting dataframe was pickled. We will analyze it below.
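
The exits cleaning itself is not shown here, but the final combination is simple; a minimal sketch, assuming a daily_exits frame built exactly like turnstiles_daily but from the EXITS column (the key and column names here are assumptions, not the exact notebook that produced the pickle):

# Merge daily entries and exits on the turnstile keys and date,
# then sum them into a total-traffic column.
def add_total_traffic(daily_entries, daily_exits):
    keys = ["C/A", "UNIT", "SCP", "STATION", "DATE"]
    merged = daily_entries.merge(daily_exits, on=keys)
    merged["TOTAL_TRAFFIC"] = merged["DAILY_ENTRIES"] + merged["DAILY_EXITS"]
    return merged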

Let's load in the pickle file.

picklefile = "/Users/kevin/Metis/Projects/1/Project_1/sorted_traffic_dataframe2.pickle"
with open(picklefile,'rb') as read_file:
    new_df = pickle.load(read_file)
    
new_df.head()
C/A_x Unit_x SCP_x Station_x Date_x Entries Prev_Date Prev_Entries Daily_Entries C/A_y Unit_y SCP_y Station_y Date_y Exits Prev_Date_Exit Prev_Exits Daily_Exits Total_Traffic
idxkey
678338 N500 R020 00-00-02 47-50 STS ROCK 12/31/2016 4468625 12/22/2017 5410194.0 941569.0 N500 R020 00-00-02 47-50 STS ROCK 12/31/2016 4569237 12/22/2017 5563842.0 994605.0 1936174.0
12598 A011 R080 01-00-05 57 ST-7 AV 12/31/2016 145783 12/22/2017 1089781.0 943998.0 A011 R080 01-00-05 57 ST-7 AV 12/31/2016 129254 12/22/2017 1098714.0 969460.0 1913458.0
541720 N224 R157 00-00-02 NORWOOD 205 ST 10/12/2017 191 10/11/2017 927201.0 927010.0 N224 R157 00-00-02 NORWOOD 205 ST 10/12/2017 384 10/11/2017 878198.0 877814.0 1804824.0
365672 N062A R010 00-06-00 42 ST-PORT AUTH 12/31/2016 3380304 12/22/2017 4239131.0 858827.0 N062A R010 00-06-00 42 ST-PORT AUTH 12/31/2016 3839490 12/22/2017 4780257.0 940767.0 1799594.0
50068 A043 R462 00-00-02 CANAL ST 12/31/2016 11589689 12/22/2017 12523977.0 934288.0 A043 R462 00-00-02 CANAL ST 12/31/2016 11020982 12/22/2017 11872484.0 851502.0 1785790.0

This dataset has the daily entries and exits, cleaned with the method described above. The total traffic was calculated as the sum of the entries and exits. We can use this dataset to find the most highly trafficked subway stations. Below we see that Penn Station takes that mark, followed by Grand Central.

new_df.groupby(['Station_x'])['Total_Traffic'].sum().sort_values(ascending = False)
Station_x
34 ST-PENN STA     157471286.0
GRD CNTRL-42 ST    117943132.0
23 ST              113720407.0
34 ST-HERALD SQ    110897711.0
TIMES SQ-42 ST      99527207.0
FULTON ST           98723918.0
42 ST-PORT AUTH     90234747.0
14 ST-UNION SQ      88225215.0
86 ST               79366669.0
CANAL ST            78786103.0
125 ST              78162595.0
59 ST               73153782.0
59 ST COLUMBUS      67191670.0
14 ST               64763286.0
47-50 STS ROCK      64344034.0
CHAMBERS ST         64220137.0
96 ST               58186778.0
FLUSHING-MAIN       56416580.0
28 ST               55584270.0
50 ST               50971007.0
ATL AV-BARCLAY      50016550.0
WALL ST             48328075.0
72 ST               48067940.0
42 ST-BRYANT PK     46994826.0
PATH NEW WTC        45306495.0
LEXINGTON AV/53     44990061.0
7 AV                44690688.0
JKSN HT-ROOSVLT     44679602.0
W 4 ST-WASH SQ      41536440.0
JAY ST-METROTEC     39208948.0
                      ...     
BRONX PARK EAST      2270873.0
BOTANIC GARDEN       2270152.0
RIT-ROOSEVELT        2231208.0
BAY 50 ST            2202479.0
75 AV                2143095.0
CENTRAL AV           2083234.0
AVENUE N             2014391.0
55 ST                1864742.0
SENECA AVE           1813891.0
215 ST               1790435.0
AVENUE P             1693259.0
PENNSYLVANIA AV      1658665.0
ATLANTIC AV          1607611.0
CYPRESS HILLS        1415806.0
BEACH 90 ST          1413628.0
AQUEDUCT RACETR      1410731.0
NEPTUNE AV           1317538.0
NASSAU ST            1279833.0
121 ST               1267656.0
BEACH 36 ST          1255856.0
E 143/ST MARY'S      1203541.0
AVENUE I             1058871.0
ROCKAWAY PARK B      1034571.0
BEACH 44 ST           927467.0
BEACH 98 ST           717647.0
JUNIUS ST             620667.0
TOMPKINSVILLE         563156.0
BEACH 105 ST          410905.0
ORCHARD BEACH          23643.0
PATH WTC                   0.0
Name: Total_Traffic, Length: 380, dtype: float64

Graphs of Foot Traffic for the Top 5 Stations over a 6-Month Period

Below we are going to break out each of the top 5 stations and plot its daily traffic over a 6-month period.

Create a Function to Graph Foot Traffic

list_of_stations = ['34 ST-PENN STA', 'GRD CNTRL-42 ST', '23 ST', '34 ST-HERALD SQ', 'TIMES SQ-42 ST']
def foot_traffic_graph(station_list):
    '''Graphs the daily foot traffic for each station in a list of stations.
    args: a list of station names to be graphed.
    
    returns: nothing; draws one graph per station covering the first 6 months of the year.
    '''
    for station in station_list:
        mask = new_df['Station_x'] == station # Select only the current station
        grouped_station = new_df[mask].groupby(['Station_x', 'Date_x'])['Total_Traffic'].sum() # Sum the traffic per day
        filtered_station = grouped_station[grouped_station < grouped_station.quantile(.96)] # Remove the top 4% as outliers
        six_months_station = filtered_station.values[:len(filtered_station)//2] # Grab only the first 6 months
        nine_nine = pd.Series(six_months_station).quantile(.99) # 99th percentile line
        topper = np.full(len(six_months_station), nine_nine)
        tenth = pd.Series(six_months_station).quantile(.1) # 10th percentile line
        botper = np.full(len(six_months_station), tenth)

        plt.figure(figsize=[15,10])
        plt.plot(six_months_station, 'b')
        ticks_x = np.linspace(0, len(six_months_station), 7)
        months = ['Jan','Feb','Mar','Apr', 'May', 'June', 'July']
        plt.xlabel('Months', fontsize = '20')
        plt.ylabel('Number of People', fontsize = '20')
        plt.ylim(50000, 350000)
        plt.title('Foot Traffic Through ' + station, fontsize = 20);
        plt.xticks(ticks_x, months);
        plt.plot(topper)
        plt.plot(botper)
        #plt.savefig('FootTrafficThoughStations.png')

foot_traffic_graph(list_of_stations)

Next we will make a bar chart of the average foot traffic per day of the week.

This data was loaded via pickle from a cleaned dataset of the top 5 stations for 2017.

weeklypicklefile = '/Users/kevin/Downloads/df2017_Top5.pickle'
with open(weeklypicklefile,'rb') as read_file:
    df2017_Top5 = pickle.load(read_file)
    
df2017_Top5.head(5)
C/A UNIT SCP STATION LINE NAME DIVISION DATE TIME DESCRIPTION ENTRIES ... PREV_ENTRIES ENTRIES_DIF PREV_EXITS EXITS_DIF TOTAL_TRAF WEEK_TYPE HOURS AVG_HOURS HOUR_DAY DAY_OF_WEEK
5913322 R145 R032 00-00-04 TIMES SQ-42 ST 1237ACENQRSW IRT 10/16/2017 16:00:00 REGULAR 1568612 ... 1568546.0 66.0 1070716.0 45.0 111.0 WeekDay 16 16.0 16 Monday
5913336 R145 R032 00-00-04 TIMES SQ-42 ST 1237ACENQRSW IRT 10/14/2017 08:00:00 REGULAR 1568193 ... 1568191.0 2.0 1070264.0 13.0 15.0 Weekend 8 12.0 8 Saturday
5913335 R145 R032 00-00-04 TIMES SQ-42 ST 1237ACENQRSW IRT 10/14/2017 12:00:00 REGULAR 1568209 ... 1568193.0 16.0 1070277.0 46.0 62.0 Weekend 12 12.0 12 Saturday
5913334 R145 R032 00-00-04 TIMES SQ-42 ST 1237ACENQRSW IRT 10/14/2017 16:00:00 REGULAR 1568245 ... 1568209.0 36.0 1070323.0 40.0 76.0 Weekend 16 13.0 16 Saturday
5913333 R145 R032 00-00-04 TIMES SQ-42 ST 1237ACENQRSW IRT 10/14/2017 20:00:00 REGULAR 1568290 ... 1568245.0 45.0 1070363.0 39.0 84.0 Weekend 20 14.4 20 Saturday

5 rows × 23 columns


df1 = df2017_Top5.groupby('DATE')['TOTAL_TRAF'].sum().reset_index()
df1['datetime'] = pd.to_datetime(df1['DATE'], infer_datetime_format=True)
df1['weekday'] = df1['datetime'].dt.dayofweek
df1.groupby('weekday')['TOTAL_TRAF'].mean().reset_index().sort_values('weekday')
weekday TOTAL_TRAF
0 0 1.165029e+06
1 1 1.241894e+06
2 2 1.298062e+06
3 3 1.298698e+06
4 4 1.285390e+06
5 5 1.712970e+06
6 6 5.372261e+05

days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'] # dayofweek: 0 = Monday
plt.figure(figsize=[14,8])
plt.xlabel('Days of Week', fontsize = 20)
plt.ylabel('Average Foot Traffic per Day', fontsize = 20)
plt.title('Traffic by Day of Week', fontsize = 40)
plt.bar(days, df1.groupby('weekday')['TOTAL_TRAF'].mean());
plt.savefig('weekly.png')

Conclusions

We can see a cyclic weekly trend over the 6-month period for the top 5 stations, and most stations show higher traffic in the summer months. Saturday and the weekdays are the highest-trafficked days at the top stations, while Sunday is by far the quietest.
