Every day, millions of New York City residents and visitors flood the subway system as their primary means of transportation throughout the city. Every station in the subway system, which is operated by the MTA, is a high-foot-traffic area, making it optimal for interfacing with a large number of people. I set out to analyze the raw turnstile data for the MTA subway system, hoping to find the stations with the highest foot traffic and the times at which it occurred. These stations would be optimal for placing representatives for campaigns that need to reach a large number of people.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime
import pickle
%matplotlib inline
The first thing we want to do is grab the data from the MTA website. The data comes as CSV files linked from the MTA developer page. I initially grabbed only three weeks of data to be able to load and clean it faster. The final model ran on a year's worth of data.
# Source: http://web.mta.info/developers/dfturnstile.html
def get_data(week_nums):
    url = "http://web.mta.info/developers/data/nyct/turnstile/turnstile_{}.txt"
    dfs = []
    for week_num in week_nums:
        file_url = url.format(week_num)
        dfs.append(pd.read_csv(file_url))
    return pd.concat(dfs)
week_nums = [160903, 160910, 160917]
turnstiles_df = get_data(week_nums)
Now that the data is loaded in let's take a peek at it:
turnstiles_df.head()
We see that there are entries and exits for each unit, recorded every day at different hours. If we can use this data to find the number of entries for each station, we can find which stations are the most popular. Let's start with this goal in mind.
Here is an explanation of the column headers, from the MTA website:
C/A - Control Area name/Booth name. This is the internal identification of a booth at a given station.
Unit - Remote unit ID of station.
SCP - Subunit/Channel/position represents a specific address for a given device.
STATION - Name assigned to the subway station by operations planning. This name is used in all references to stations, as well as in debit/credit purchase receipts, and customer’s bank activity statements.
LINENAME - Train lines stopping at this location. Can contain up to 20 single-character identifiers. When more than one train line appears, it is usually intersecting train lines at a major station where passengers can transfer between any of the lines.
DIVISION - Represents the division the station originally belonged to: BMT, IRT, or IND. Each section of the system is assigned a unique line name, usually paired with its original operating company or division (Brooklyn–Manhattan Transit Corporation (BMT), Interborough Rapid Transit Company (IRT), and Independent Subway System (IND)).
DATE - Represents the date of the audit data.
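As a quick sanity check on these descriptions (a minimal sketch; it assumes the columns are named exactly as documented above), we can peek at the DIVISION values and confirm they line up with the original operating companies:
turnstiles_df['DIVISION'].value_counts()
# We expect to see BMT, IRT, and IND; some MTA turnstile files also include
# codes for the Staten Island Railway and PATH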
Now let's check if there are any issues with formatting for the columns.
turnstiles_df.columns
For good measure, let's strip out any white space that may be surrounding the column names.
turnstiles_df.columns = [column.strip() for column in turnstiles_df.columns]
turnstiles_df.columns
turnstiles_df.head()
Now let's check that every day is represented, and how many times. We see below that some days have larger counts than others.
turnstiles_df.DATE.value_counts().sort_index()
We will want to turn the dates and times into datetimes so we can work with them.
# Take the date and time fields into a single datetime column
turnstiles_df["DATE_TIME"] = pd.to_datetime(turnstiles_df.DATE + " " + turnstiles_df.TIME, format="%m/%d/%Y %H:%M:%S")
We see below that the new column we created called DATE_TIME has the dates and times in chronological order with recordings every four hours.
turnstiles_df.head()
The SCP represents a single turnstile; however, the SCP is a subunit of the station, unit, and C/A. Therefore, all four of these need to be grouped together, along with the date and time, in order to get the entries and exits for each turnstile. If these four items are grouped together and there are duplicates, something must have gone wrong with the turnstile, because it gave multiple outputs for the same date and time.
(turnstiles_df
 .groupby(["C/A", "UNIT", "SCP", "STATION", "DATE_TIME"])
 .ENTRIES.count()
 .reset_index()  # or use as_index=False; otherwise the groupby columns become the new index
 .sort_values("ENTRIES", ascending=False)).head(5)
On 9/16 we seem to have two entries for the same time. Let's take a look.
mask = ((turnstiles_df["C/A"] == "R504") &
        (turnstiles_df["UNIT"] == "R276") &
        (turnstiles_df["SCP"] == "00-00-01") &
        (turnstiles_df["STATION"] == "VERNON-JACKSON") &
        (turnstiles_df["DATE_TIME"].dt.date == datetime.datetime(2016, 9, 16).date()))
turnstiles_df[mask].head()
This looks to be an incorrect AUD entry. Maybe we should just select the REGULAR one.
turnstiles_df.DESC.value_counts()
Since we are only interested in Entries, we might be OK.
# Get rid of the duplicate entry
turnstiles_df.sort_values(["C/A", "UNIT", "SCP", "STATION", "DATE_TIME"],
                          inplace=True, ascending=False)
turnstiles_df.drop_duplicates(subset=["C/A", "UNIT", "SCP", "STATION", "DATE_TIME"], inplace=True)
Let's do the same check to ensure there are no more duplicate entries.
(turnstiles_df
 .groupby(["C/A", "UNIT", "SCP", "STATION", "DATE_TIME"])
 .ENTRIES.count()
 .reset_index()
 .sort_values("ENTRIES", ascending=False)).head(5)
We no longer have duplicate entries. For now we will only be concerned with entries to each station.
# Drop Exits and Desc columns. To prevent errors in multiple run of cell,
# errors on drop is ignored (e.g. if some columns were dropped already)
turnstiles_df = turnstiles_df.drop(["EXITS", "DESC"], axis=1, errors="ignore")
Now we will group the cleaned data by the four identifiers for a single turnstile, along with the date, to get one cumulative reading per turnstile per day. This achieves our goal of getting the data for each day.
turnstiles_daily = (turnstiles_df.groupby(["C/A", "UNIT", "SCP", "STATION", "DATE"])
                    .ENTRIES.first().reset_index())
turnstiles_daily.head()
The entries are a cumulative sum. The previous day's count needs to be subtracted from the current day's count in order to get the total daily entries.
turnstiles_daily[["PREV_DATE", "PREV_ENTRIES"]] = (turnstiles_daily
.groupby(["C/A", "UNIT", "SCP", "STATION"])["DATE", "ENTRIES"]
.transform(lambda grp: grp.shift(1)))
# transform() takes a function as parameter
# shift moves the index by the number of periods given (positive or negative)
turnstiles_daily.head()
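To make the shift behavior concrete, here is a toy example (the data is made up for illustration, not from the MTA files) showing how shift(1) pulls the previous row's value within each group:
toy = pd.DataFrame({"turnstile": ["A", "A", "A", "B", "B"],
                    "entries": [100, 130, 170, 50, 80]})
# Within each turnstile group, shift(1) moves values down one row, so each row
# sees the previous reading (NaN for the first row of a group)
toy["prev_entries"] = toy.groupby("turnstile")["entries"].transform(lambda g: g.shift(1))
toy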
The first date does not have a previous date, so we will want to drop those rows.
# Drop the rows for first date
turnstiles_daily.dropna(subset=["PREV_DATE"], axis=0, inplace=True)
# axis = 0 means index (=1 means column)
Let's check whether the number of entries for a day is ever lower than the entries for the previous day; since the counter is cumulative, that should not happen.
turnstiles_daily[turnstiles_daily["ENTRIES"] < turnstiles_daily["PREV_ENTRIES"]].head()
We see that some of the counters are counting in reverse. Let's look at a few of these.
mask = ((turnstiles_df["C/A"] == "A011") &
        (turnstiles_df["UNIT"] == "R080") &
        (turnstiles_df["SCP"] == "01-00-00") &
        (turnstiles_df["STATION"] == "57 ST-7 AV") &
        (turnstiles_df["DATE_TIME"].dt.date == datetime.datetime(2016, 8, 27).date()))
# datetime is both the name of the module and the name of the datetime constructor
turnstiles_df[mask].head()
Let's see how many stations have this problem and how often.
(turnstiles_daily[turnstiles_daily["ENTRIES"] < turnstiles_daily["PREV_ENTRIES"]]
 .groupby(["C/A", "UNIT", "SCP", "STATION"])
 .size())  # size() behaves the same as if we'd done .DATE.count()
Let's clean these up. If a day's count is smaller than the previous day's, the counter was running in reverse, so we take the absolute value of the difference. To guard against outliers such as counter resets, any daily count above 1,000,000 is set to zero.
def get_daily_counts(row, max_counter):
    counter = row["ENTRIES"] - row["PREV_ENTRIES"]
    if counter < 0:
        counter = -counter
    if counter > max_counter:
        print(row["ENTRIES"], row["PREV_ENTRIES"])
        return 0
    return counter
# If counter is > 1 million, then the counter might have been reset.
# Just set it to zero as different counters have different cycle limits
turnstiles_daily["DAILY_ENTRIES"] = turnstiles_daily.apply(get_daily_counts, axis=1, max_counter=1000000)
# apply takes a function as parameter and applies it along the given axis (1=apply by row)
# apply takes care of passing each row to the function
There we have it! We have the daily entries.
turnstiles_daily.head()
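With turnstile-level daily entries in hand, here is a minimal sketch of how they can be rolled up to station-level daily totals (using the same column names as above):
# Sum the daily entries of every turnstile belonging to the same station
station_daily = (turnstiles_daily
                 .groupby(["STATION", "DATE"])["DAILY_ENTRIES"]
                 .sum()
                 .reset_index())
station_daily.sort_values("DAILY_ENTRIES", ascending=False).head()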
The ultimate goal of this project is to find the busiest stations and their peak times. In order to do that, we will want to look at exits as well: entries and exits together represent the total foot traffic for a subway station. The data for all of 2017 was cleaned in a manner similar to the above, and the daily exits were calculated in the same way as the daily entries. The resulting DataFrame was pickled, and we will analyze it below.
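For reference, saving a cleaned DataFrame to a pickle file looks roughly like this (the cleaned_2017_df variable name is a placeholder for illustration, not from the original analysis):
# Hypothetical example: serialize the cleaned 2017 DataFrame to disk
with open("sorted_traffic_dataframe2.pickle", "wb") as write_file:
    pickle.dump(cleaned_2017_df, write_file)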
Let's load in the pickle file.
picklefile = "/Users/kevin/Metis/Projects/1/Project_1/sorted_traffic_dataframe2.pickle"
with open(picklefile, 'rb') as read_file:
    new_df = pickle.load(read_file)
new_df.head()
This dataset has the daily entries and exits, which were cleaned similarly to the method above. The total traffic was calculated as the sum of the entries and exits. We can use this dataset to find the most highly trafficked subway station. Below we see that Penn Station takes that mark, followed by Grand Central Station.
new_df.groupby(['Station_x'])['Total_Traffic'].sum().sort_values(ascending = False)
Below we are going to break out the top 5 stations and plot their daily traffic over a 6-month period.
list_of_stations = ['34 ST-PENN STA', 'GRD CNTRL-42 ST', '23 ST', '34 ST-HERALD SQ', 'TIMES SQ-42 ST']
def foot_traffic_graph(station_list):
    '''This function graphs the foot traffic through the stations passed in via a list.
    args: a list of the stations to be graphed.
    returns: a graph of daily foot traffic over the first 6 months of the year for each station.
    '''
    for station in station_list:
        mask = new_df['Station_x'] == station  # Select only the input station
        grouped_station = new_df[mask].groupby(['Station_x', 'Date_x'])['Total_Traffic'].sum()  # Sum the traffic per day
        filtered_station = grouped_station[grouped_station < grouped_station.quantile(.96)]  # Remove outliers
        six_months_station = filtered_station.values[:len(filtered_station)//2]  # Grab only the first 6 months
        nine_nine = pd.Series(six_months_station).quantile(.99)  # Create 99th percentile line
        topper = np.full(len(six_months_station), nine_nine)
        one = pd.Series(six_months_station).quantile(.1)  # Create 10th percentile line
        botper = np.full(len(six_months_station), one)
        plt.figure(figsize=[15, 10])
        plt.plot(six_months_station, 'b')
        ticks_x = np.linspace(0, len(six_months_station), 7)
        months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'June', 'July']
        plt.xlabel('Months', fontsize=20)
        plt.ylabel('Number of People', fontsize=20)
        plt.ylim(50000, 350000)
        plt.title('Foot Traffic Through ' + station, fontsize=20)
        plt.xticks(ticks_x, months)
        plt.plot(topper)
        plt.plot(botper)
        #plt.savefig('FootTrafficThroughStations.png')
foot_traffic_graph(list_of_stations)
Here we are going to make a bar chart of the foot traffic per day of the week. This data was loaded via pickle from a cleaned dataset.
weeklypicklefile = '/Users/kevin/Downloads/df2017_Top5.pickle'
with open(weeklypicklefile, 'rb') as read_file:
    df2017_Top5 = pickle.load(read_file)
df2017_Top5.head(5)
df1 = df2017_Top5.groupby('DATE')['TOTAL_TRAF'].sum().reset_index()
df1['datetime'] = pd.to_datetime(df1['DATE'], infer_datetime_format=True)
df1['weekday'] = df1['datetime'].dt.dayofweek  # 0 = Monday, 6 = Sunday
df1.groupby('weekday')['TOTAL_TRAF'].mean().reset_index().sort_values('weekday')
days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']  # labels matching dayofweek order
plt.figure(figsize=[14, 8])
plt.xlabel('Days of Week', fontsize=20)
plt.ylabel('Foot Traffic per Day', fontsize=20)
plt.title('Traffic by Day of Week', fontsize=40)
plt.bar(days, df1.groupby('weekday')['TOTAL_TRAF'].mean());
plt.savefig('weekly.png')
We can see a cyclical trend over the 6-month period for the top 5 stations, with most stations showing higher traffic in the summer months. The weekdays and Saturday are the most highly trafficked days for the top subway stations.