Every day, millions of New York City residents and visitors flood the subway system as their primary means of transportation throughout the city. Every station in the subway system, which is operated by the MTA, is a high-foot-traffic area, making it optimal for interfacing with a large number of people. I set out to analyze the raw turnstile data for the MTA subway system, hoping to find the stations with the highest foot traffic and the times at which it occurred. These stations would be optimal for placing representatives for campaigns that need to reach a large number of people.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime
import pickle
%matplotlib inline
The first thing we want to do is grab the data from the MTA website. The data comes as CSV files linked from the MTA developer page. I initially grabbed only three weeks of data to be able to load and clean it faster. The final model ran on a year's worth of data.
# Source: http://web.mta.info/developers/dfturnstile.html
def get_data(week_nums):
    url = "http://web.mta.info/developers/data/nyct/turnstile/turnstile_{}.txt"
    dfs = []
    for week_num in week_nums:
        file_url = url.format(week_num)
        dfs.append(pd.read_csv(file_url))
    return pd.concat(dfs)
week_nums = [160903, 160910, 160917]
turnstiles_df = get_data(week_nums)
Now that the data is loaded in let's take a peek at it:
turnstiles_df.head()
We see that there are entries and exits for each unit, recorded every day at different hours. If we can use this data to find the number of entries for each station, we can find which stations are the most popular. Let's start with this goal in mind.
Here is an explanation of the column headers, from the MTA website:
C/A - Control Area name/Booth name. This is the internal identification of a booth at a given station.
Unit - Remote unit ID of station.
SCP - Subunit/Channel/position represents a specific address for a given device.
STATION - Name assigned to the subway station by operations planning. This name is used in all references to stations, as well as in debit/credit purchase receipts, and customer’s bank activity statements.
LINENAME - Train lines stopping at this location. Can contain up to 20 single-character identifiers. When more than one train line appears, it is usually intersecting train lines at a major station where passengers can transfer between any of the lines.
DIVISION - Represents the division the station originally belonged to: BMT, IRT, or IND. Each section of the system is assigned a unique line name, usually paired with its original operating company or division (Brooklyn–Manhattan Transit Corporation (BMT), Interborough Rapid Transit Company (IRT), and Independent Subway System (IND)).
DATE - Represents the date of the audit data.
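As a quick sanity check on these descriptions (a minimal sketch; it assumes the columns are named exactly as documented above), we can peek at the DIVISION values and confirm they line up with the original operating companies:
turnstiles_df['DIVISION'].value_counts()
# We expect to see BMT, IRT, and IND; some MTA turnstile files also include
# codes for the Staten Island Railway and PATH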
Now let's check if there are any issues with formatting for the columns.
turnstiles_df.columns
For good measure, let's strip out any white space that may be surrounding the column names.
turnstiles_df.columns = [column.strip() for column in turnstiles_df.columns]
turnstiles_df.columns
turnstiles_df.head()
Now let's check that every day is represented, and how many times. We see below that some days have larger counts than others.
turnstiles_df.DATE.value_counts().sort_index()
We will want to turn the dates and times into datetimes so we can work with them.
# Take the date and time fields into a single datetime column
turnstiles_df["DATE_TIME"] = pd.to_datetime(turnstiles_df.DATE + " " + turnstiles_df.TIME, format="%m/%d/%Y %H:%M:%S")
We see below that the new column we created called DATE_TIME has the dates and times in chronological order with recordings every four hours.
turnstiles_df.head()
The SCP represents a single turnstile; however, the SCP is a subunit of the station, unit, and C/A. Therefore, all four of these need to be grouped together, along with the date and time, in order to get the entries and exits for each turnstile. If these four items are grouped together and there are duplicates, something must have gone wrong with the turnstile, because it gave multiple outputs for the same date and time.
(turnstiles_df
 .groupby(["C/A", "UNIT", "SCP", "STATION", "DATE_TIME"])
 .ENTRIES.count()
 .reset_index()  # or use as_index=False; otherwise the groupby columns become the new index
 .sort_values("ENTRIES", ascending=False)).head(5)
On 9/16 we seem to have two entries for the same time. Let's take a look.
mask = ((turnstiles_df["C/A"] == "R504") &
        (turnstiles_df["UNIT"] == "R276") &
        (turnstiles_df["SCP"] == "00-00-01") &
        (turnstiles_df["STATION"] == "VERNON-JACKSON") &
        (turnstiles_df["DATE_TIME"].dt.date == datetime.datetime(2016, 9, 16).date()))
turnstiles_df[mask].head()
This looks to be an incorrect AUD entry. Maybe we should just select the REGULAR one.
turnstiles_df.DESC.value_counts()
Since we are only interested in Entries, we might be OK.
# Get rid of the duplicate entry
turnstiles_df.sort_values(["C/A", "UNIT", "SCP", "STATION", "DATE_TIME"],
                          inplace=True, ascending=False)
turnstiles_df.drop_duplicates(subset=["C/A", "UNIT", "SCP", "STATION", "DATE_TIME"], inplace=True)
Let's do the same check to ensure there are no more duplicate entries.
(turnstiles_df
 .groupby(["C/A", "UNIT", "SCP", "STATION", "DATE_TIME"])
 .ENTRIES.count()
 .reset_index()
 .sort_values("ENTRIES", ascending=False)).head(5)
We no longer have duplicate entries. For now we will only be concerned with entries to each station.
# Drop Exits and Desc columns. To prevent errors in multiple run of cell,
# errors on drop is ignored (e.g. if some columns were dropped already)
turnstiles_df = turnstiles_df.drop(["EXITS", "DESC"], axis=1, errors="ignore")
Now we will group the cleaned data by the four identifiers for a single turnstile, along with the date, to get one cumulative reading per turnstile per day. This achieves our goal of getting the data for each day.
turnstiles_daily = (turnstiles_df.groupby(["C/A", "UNIT", "SCP", "STATION", "DATE"])
                    .ENTRIES.first().reset_index())
turnstiles_daily.head()
The entries are a cumulative sum. The previous day's count needs to be subtracted from the current day's count in order to get the total daily entries.
turnstiles_daily[["PREV_DATE", "PREV_ENTRIES"]] = (turnstiles_daily
.groupby(["C/A", "UNIT", "SCP", "STATION"])["DATE", "ENTRIES"]
.transform(lambda grp: grp.shift(1)))
# transform() takes a function as parameter
# shift moves the index by the number of periods given (positive or negative)
turnstiles_daily.head()
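To make the shift behavior concrete, here is a toy example (the data is made up for illustration, not from the MTA files) showing how shift(1) pulls the previous row's value within each group:
toy = pd.DataFrame({"turnstile": ["A", "A", "A", "B", "B"],
                    "entries": [100, 130, 170, 50, 80]})
# Within each turnstile group, shift(1) moves values down one row, so each row
# sees the previous reading (NaN for the first row of a group)
toy["prev_entries"] = toy.groupby("turnstile")["entries"].transform(lambda g: g.shift(1))
toy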
The first date does not have a previous date, so we will want to drop those rows.
# Drop the rows for first date
turnstiles_daily.dropna(subset=["PREV_DATE"], axis=0, inplace=True)
# axis = 0 means index (=1 means column)
Let's check whether the number of entries for a day is ever lower than the entries for the previous day; since the counter is cumulative, that should not happen.
turnstiles_daily[turnstiles_daily["ENTRIES"] < turnstiles_daily["PREV_ENTRIES"]].head()
We see that some of the counters are counting in reverse. Let's look at a few of these.
mask = ((turnstiles_df["C/A"] == "A011") &
        (turnstiles_df["UNIT"] == "R080") &
        (turnstiles_df["SCP"] == "01-00-00") &
        (turnstiles_df["STATION"] == "57 ST-7 AV") &
        (turnstiles_df["DATE_TIME"].dt.date == datetime.datetime(2016, 8, 27).date()))
# datetime is both the name of the module and the name of the datetime constructor
turnstiles_df[mask].head()
Let's see how many stations have this problem and how often.
(turnstiles_daily[turnstiles_daily["ENTRIES"] < turnstiles_daily["PREV_ENTRIES"]]
 .groupby(["C/A", "UNIT", "SCP", "STATION"])
 .size())  # size() behaves the same as if we'd done .DATE.count()
Let's clean these up. If a day's count is smaller than the previous day's, the counter was running in reverse, so we take the absolute value of the difference. To guard against outliers such as counter resets, any daily count above 1,000,000 is set to zero.
def get_daily_counts(row, max_counter):
    counter = row["ENTRIES"] - row["PREV_ENTRIES"]
    if counter < 0:
        counter = -counter
    if counter > max_counter:
        print(row["ENTRIES"], row["PREV_ENTRIES"])
        return 0
    return counter
# If counter is > 1 million, then the counter might have been reset.
# Just set it to zero as different counters have different cycle limits
turnstiles_daily["DAILY_ENTRIES"] = turnstiles_daily.apply(get_daily_counts, axis=1, max_counter=1000000)
# apply takes a function as parameter and applies it along the given axis (1=apply by row)
# apply takes care of passing each row to the function
There we have it! We have the daily entries.
turnstiles_daily.head()
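With turnstile-level daily entries in hand, here is a minimal sketch of how they can be rolled up to station-level daily totals (using the same column names as above):
# Sum the daily entries of every turnstile belonging to the same station
station_daily = (turnstiles_daily
                 .groupby(["STATION", "DATE"])["DAILY_ENTRIES"]
                 .sum()
                 .reset_index())
station_daily.sort_values("DAILY_ENTRIES", ascending=False).head()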
The ultimate goal of this project is to find the busiest stations and their peak times. In order to do that, we will want to look at exits as well: entries and exits together represent the total foot traffic for a subway station. The data for all of 2017 was cleaned in a manner similar to the above, and the daily exits were calculated in the same way as the daily entries. The resulting DataFrame was pickled, and we will analyze it below.
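For reference, saving a cleaned DataFrame to a pickle file looks roughly like this (the cleaned_2017_df variable name is a placeholder for illustration, not from the original analysis):
# Hypothetical example: serialize the cleaned 2017 DataFrame to disk
with open("sorted_traffic_dataframe2.pickle", "wb") as write_file:
    pickle.dump(cleaned_2017_df, write_file)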
Let's load in the pickle file.
picklefile = "/Users/kevin/Metis/Projects/1/Project_1/sorted_traffic_dataframe2.pickle"
with open(picklefile, 'rb') as read_file:
    new_df = pickle.load(read_file)
new_df.head()
This dataset has the daily entries and exits, which were cleaned similarly to the method above. The total traffic was calculated as the sum of the entries and exits. We can use this dataset to find the most highly trafficked subway station. Below we see that Penn Station takes that mark, followed by Grand Central Station.
new_df.groupby(['Station_x'])['Total_Traffic'].sum().sort_values(ascending = False)
Below we are going to break out the top 5 stations and plot their daily traffic over a 6-month period.
list_of_stations = ['34 ST-PENN STA', 'GRD CNTRL-42 ST', '23 ST', '34 ST-HERALD SQ', 'TIMES SQ-42 ST']
def foot_traffic_graph(station_list):
    '''This function graphs the foot traffic through the stations passed in via a list.
    args: a list of the stations to be graphed.
    returns: a graph of daily foot traffic over the first 6 months of the year for each station.
    '''
    for station in station_list:
        mask = new_df['Station_x'] == station  # Select only the input station
        grouped_station = new_df[mask].groupby(['Station_x', 'Date_x'])['Total_Traffic'].sum()  # Sum the traffic per day
        filtered_station = grouped_station[grouped_station < grouped_station.quantile(.96)]  # Remove outliers
        six_months_station = filtered_station.values[:len(filtered_station)//2]  # Grab only the first 6 months
        nine_nine = pd.Series(six_months_station).quantile(.99)  # Create 99th percentile line
        topper = np.full(len(six_months_station), nine_nine)
        one = pd.Series(six_months_station).quantile(.1)  # Create 10th percentile line
        botper = np.full(len(six_months_station), one)
        plt.figure(figsize=[15, 10])
        plt.plot(six_months_station, 'b')
        ticks_x = np.linspace(0, len(six_months_station), 7)
        months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'June', 'July']
        plt.xlabel('Months', fontsize=20)
        plt.ylabel('Number of People', fontsize=20)
        plt.ylim(50000, 350000)
        plt.title('Foot Traffic Through ' + station, fontsize=20)
        plt.xticks(ticks_x, months)
        plt.plot(topper)
        plt.plot(botper)
        #plt.savefig('FootTrafficThroughStations.png')
foot_traffic_graph(list_of_stations)
Here we are going to make a bar chart of the foot traffic per day of the week. This data was loaded via pickle from a cleaned dataset.
weeklypicklefile = '/Users/kevin/Downloads/df2017_Top5.pickle'
with open(weeklypicklefile, 'rb') as read_file:
    df2017_Top5 = pickle.load(read_file)
df2017_Top5.head(5)
df1 = df2017_Top5.groupby('DATE')['TOTAL_TRAF'].sum().reset_index()
df1['datetime'] = pd.to_datetime(df1['DATE'], infer_datetime_format=True)
df1['weekday'] = df1['datetime'].dt.dayofweek  # 0 = Monday, 6 = Sunday
df1.groupby('weekday')['TOTAL_TRAF'].mean().reset_index().sort_values('weekday')
days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']  # labels matching dayofweek order
plt.figure(figsize=[14, 8])
plt.xlabel('Days of Week', fontsize=20)
plt.ylabel('Foot Traffic per Day', fontsize=20)
plt.title('Traffic by Day of Week', fontsize=40)
plt.bar(days, df1.groupby('weekday')['TOTAL_TRAF'].mean());
plt.savefig('weekly.png')
We can see a cyclical trend over the 6-month period for the top 5 stations, with most stations showing higher traffic in the summer months. The weekdays and Saturday are the most highly trafficked days for the top subway stations.