Crime in Chicago

The objective of this project is to predict whether a crime committed in the city of Chicago ended in an arrest. The Chicago Data Portal has every reported crime dating back to 2001 in its database, with location and crime information for each incident. This dataset will be combined with NOAA weather data, and a model will be built to predict arrests.

import pandas as pd
import numpy as np
import pickle
import feather
import seaborn as sns
import matplotlib.pyplot as plt
import joblib

from sqlalchemy import create_engine
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_auc_score, confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from imblearn.over_sampling import RandomOverSampler
from joblib import dump, load

%matplotlib inline

Warnings are suppressed for a cleaner-looking notebook.

import warnings
warnings.filterwarnings('ignore')

For this project I loaded the Chicago crime data onto a remote server and pulled it down as needed. Using PostgreSQL, I created the schema for the data on the remote server.

SQL Query for remote server:

CREATE TABLE IF NOT EXISTS ChicagoCrime (
ID integer,
CaseNumber varchar(20),
Date varchar(50),
Block varchar(50),
IUCR varchar(10),
PrimaryType varchar(50),
Description varchar(100),
LocationDescription varchar(150),
Arrest varchar(10),
Domestic varchar(10),
Beat integer,
District real,
Ward real,
CommunityArea real,
FBICode varchar(10),
XCoordinate varchar(20),
YCoordinate varchar(20),
Year integer,
UpdatedOn varchar(50),
Latitude varchar(15),
Longitude varchar(15),
Location varchar(50)
);
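
For reference, this is a minimal sketch of one way the raw Data Portal export could be loaded into the table above with pandas and SQLAlchemy; the CSV file name and chunk size are assumptions for illustration, not the exact method used for this project.

# Hypothetical loading sketch: stream the Data Portal CSV export into PostgreSQL in chunks.
# The file name below is an assumption; adjust it to the actual export location.
engine = create_engine('postgresql://ubuntu@ec2-100-24-40-180.compute-1.amazonaws.com/chicago')
for chunk in pd.read_csv('Crimes_-_2001_to_present.csv', chunksize=100000):
    # Match the column names to the CREATE TABLE statement (e.g. 'Case Number' -> casenumber).
    chunk.columns = [c.replace(' ', '').lower() for c in chunk.columns]
    chunk.to_sql('chicagocrime', engine, if_exists='append', index=False)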

Pull Data From Server

Data was pulled in from the remote server to be analyzed.

cnx = create_engine('postgresql://ubuntu@ec2-100-24-40-180.compute-1.amazonaws.com/chicago')
df = pd.read_sql_query('''SELECT * FROM chicagocrime''', cnx)

The dates will need to be converted to datetime format so pandas recognizes them as dates. I also checked the arrest rate for all crimes committed in Chicago.

df['datetime'] = pd.to_datetime(df['date'], infer_datetime_format=True)
mask = df['arrest'] == 'true'
print('Percent of Crimes ending in Arrest: ' + str(len(df[mask])/len(df)))
Percent of Crimes ending in Arrest: 0.27685069987994154

Only 27.7% of crimes in Chicago end in arrest. Yikes!

In order to work with the data locally, I serialized it with feather.

#df.to_feather('chicago_crime.feather')
df = feather.read_dataframe('chicago_crime.feather')

Read in Weather Data

Weather data from NOAA was pulled in as it may have some predictive ability for our problem.

weather_data_path = '/Users/kevin/Downloads/1598904.csv'
df_weather = pd.read_csv(weather_data_path)

Weather Column Descriptions:

WT03 - Thunder
WT04 - Ice pellets, sleet, snow pellets, or small hail
PRCP - Precipitation
WT05 - Hail (may include small hail)
WV03 - Thunder
WT06 - Glaze or rime
WT07 - Dust, volcanic ash, blowing dust, blowing sand, or blowing obstruction
WT08 - Smoke or haze
SNWD - Snow depth
WT09 - Blowing or drifting snow
WDF2 - Direction of fastest 2-minute wind
WDF5 - Direction of fastest 5-second wind
PGTM - Peak gust time
WT11 - High or damaging winds
TMAX - Maximum temperature
WT13 - Mist
WSF2 - Fastest 2-minute wind speed
FMTM - Time of fastest mile or fastest 1-minute wind
WSF5 - Fastest 5-second wind speed
SNOW - Snowfall
WT14 - Drizzle
WT15 - Freezing drizzle
WT16 - Rain (may include freezing rain, drizzle, and freezing drizzle)
WT17 - Freezing rain
WT18 - Snow, snow pellets, snow grains, or ice crystals
WT19 - Unknown source of precipitation
AWND - Average wind speed
WT21 - Ground fog
WT22 - Ice fog or freezing fog
WV20 - Rain or snow shower
WT01 - Fog, ice fog, or freezing fog (may include heavy fog)
WESD - Water equivalent of snow on the ground
WT02 - Heavy fog or heavy freezing fog (not always distinguished from fog)
TAVG - Average temperature
TMIN - Minimum temperature
TSUN - Total sunshine for the period

The weather data will need to be merged with the crime data by date, so I converted the weather dates to datetimes. I also made the column names lowercase for consistency.

df_weather.columns = map(str.lower, df_weather.columns)
df_weather['datetime'] = pd.to_datetime(df_weather['date'], infer_datetime_format=True)

Merge Weather Data and Crime Data

Now that both the weather and crime data are loaded, I merged the two dataframes into one.

df_weather = df_weather.sort_values('datetime')
df = df.sort_values('datetime')
cw_df = pd.merge_asof(df, df_weather, on = 'datetime', direction = 'backward', tolerance = pd.Timedelta('1 day')) 
df = cw_df.reset_index()
#df.to_feather('chicago_crime_and_weather.feather')

Clean Dataset

After analyzing the combined weather and crime dataset, I decided which columns to drop. Each column was evaluated individually, which is why there are several repeated drop calls.

df = df.drop('index', axis = 1)
df = df.drop('casenumber', axis = 1)
df = df.drop('id', axis = 1)
df = df.drop('block', axis = 1)
df = df.drop('station', axis = 1)
df = df.drop('fmtm', axis = 1)
df = df.drop('pgtm', axis = 1)
df = df.drop('snwd', axis = 1)
df = df.drop('xcoordinate', axis = 1)
df = df.drop('ycoordinate', axis = 1)
df = df.drop('datetime', axis = 1)
df = df.drop('tavg', axis = 1)
df = df.drop('date_y', axis = 1)
df = df.drop('iucr', axis = 1)
df = df.drop('name', axis = 1)
df = df.drop('year', axis = 1)
df = df.drop('updatedon', axis = 1)
df = df.drop('location', axis = 1)
df = df.drop('fbicode', axis = 1)
df = df.drop('description', axis = 1)
df = df.drop('date_x', axis = 1)
df = df.drop(['wdf2', 'wdf5', 'wesd', 'wsf2', 'wsf5', 'wt01',
       'wt02', 'wt03', 'wt04', 'wt05', 'wt06', 'wt07', 'wt08', 'wt09', 'wt11',
       'wt13', 'wt14', 'wt15', 'wt16', 'wt17', 'wt18', 'wt19', 'wt21', 'wt22',
       'wv03', 'wv20', 'tsun'], axis = 1)

Now that the dataframe contains only the data I want, I will convert each column to the proper type so it can be fed into a sklearn classifier.

# Columns such as description, fbicode, station, and the x/y coordinates were already
# dropped above, so only the remaining columns are converted here.
df['primarytype'] = df['primarytype'].astype('category')
df['locationdescription'] = df['locationdescription'].astype('category')
df['arrest'].replace('true', 1, inplace = True)
df['arrest'].replace('false', 0, inplace = True)
df['domestic'].replace('true', 1, inplace = True)
df['domestic'].replace('false', 0, inplace = True)
df['latitude'] = df['latitude'].astype('float64', errors = 'ignore')
df['longitude'] = df['longitude'].astype('float64', errors = 'ignore')
df = df.dropna(subset=['district'])
df = df.dropna(subset=['latitude'])
df['locationdescription'] = df['locationdescription'].fillna(value='OTHER')
df['communityarea'] = df.sort_values(by=['beat', 'district', 'ward'])['communityarea'].fillna(method='ffill')
df['ward'] = df.sort_values(by=['beat', 'district', 'communityarea'])['ward'].fillna(method='ffill')
df = df.reset_index()

With the dataset cleaned, I wanted to make sure there are no nulls remaining that would cause issues in a model.

for header in df.columns:
    
    nulls_count = df[header].isnull().sum()
    
    print(f'There are {nulls_count} nulls in {header}')

The dataset is now cleaned and ready to be examined more closely. The work done so far is saved in a feather file for quick data loading in the future.

#df.to_feather('chicago_crime_cleaned.feather')
df = feather.read_dataframe('chicago_crime_cleaned.feather')

EDA

#df.to_feather('chicago_crime_final.feather')
df = feather.read_dataframe('chicago_crime_final.feather')

I first wanted to check for linear relationships in the data, so I ran a Pearson correlation to see if there were any relationships between the features and the label. I also wanted to check for collinearity between the features, which could lead to model overfitting.

df.corr()
index arrest domestic beat district ward communityarea latitude longitude awnd prcp snow tmax tmin
index 1.000000 -0.055044 0.043370 -0.035996 -0.004956 0.013127 0.004968 -0.005265 0.001056 0.036246 0.016684 0.020514 0.017340 0.038502
arrest -0.055044 1.000000 -0.069274 -0.015993 -0.016780 -0.015836 -0.008292 0.002096 -0.031477 0.001616 -0.009167 0.002330 -0.023662 -0.025416
domestic 0.043370 -0.069274 1.000000 -0.041821 -0.038657 -0.050101 0.072056 -0.075669 0.004518 0.002332 0.002825 0.002082 0.004467 0.003772
beat -0.035996 -0.015993 -0.041821 1.000000 0.939092 0.635785 -0.506381 0.612650 -0.473687 -0.003126 -0.000468 0.000737 -0.002075 -0.002319
district -0.004956 -0.016780 -0.038657 0.939092 1.000000 0.688740 -0.499337 0.620597 -0.528367 -0.001122 -0.000027 0.000919 -0.001339 -0.001220
ward 0.013127 -0.015836 -0.050101 0.635785 0.688740 1.000000 -0.532559 0.626385 -0.432463 0.000059 -0.000011 0.001221 -0.000049 0.000588
communityarea 0.004968 -0.008292 0.072056 -0.506381 -0.499337 -0.532559 1.000000 -0.747118 0.240317 0.000821 0.001185 -0.000435 0.001802 0.001377
latitude -0.005265 0.002096 -0.075669 0.612650 0.620597 0.626385 -0.747118 1.000000 -0.410834 -0.000028 -0.000483 0.001649 -0.003313 -0.002630
longitude 0.001056 -0.031477 0.004518 -0.473687 -0.528367 -0.432463 0.240317 -0.410834 1.000000 -0.002323 0.000345 -0.002221 0.007822 0.007829
awnd 0.036246 0.001616 0.002332 -0.003126 -0.001122 0.000059 0.000821 -0.000028 -0.002323 1.000000 0.080271 0.099045 -0.250733 -0.215913
prcp 0.016684 -0.009167 0.002825 -0.000468 -0.000027 -0.000011 0.001185 -0.000483 0.000345 0.080271 1.000000 0.104215 0.092846 0.142637
snow 0.020514 0.002330 0.002082 0.000737 0.000919 0.001221 -0.000435 0.001649 -0.002221 0.099045 0.104215 1.000000 -0.227082 -0.203525
tmax 0.017340 -0.023662 0.004467 -0.002075 -0.001339 -0.000049 0.001802 -0.003313 0.007822 -0.250733 0.092846 -0.227082 1.000000 0.941816
tmin 0.038502 -0.025416 0.003772 -0.002319 -0.001220 0.000588 0.001377 -0.002630 0.007829 -0.215913 0.142637 -0.203525 0.941816 1.000000
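
Since seaborn is already imported, a correlation heatmap is a quick way to visualize the collinearity in this matrix. This is a minimal sketch of how it could be plotted; the figure size and color map are arbitrary choices rather than part of the original analysis.

# Visualize the correlation matrix as a heatmap to spot collinear features
# (e.g. beat/district and tmax/tmin stand out as highly correlated pairs).
plt.figure(figsize=(12, 10))
sns.heatmap(df.corr(), annot=True, fmt='.2f', cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Pearson correlation of features')
plt.show()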

The correlation matrix shows that no single feature has a strong linear relationship with arrest, while some features (such as beat and district, or tmax and tmin) are highly correlated with each other. All categorical variables in the dataframe will need to be converted to dummy variables before they are fed into the model. After the dummies are added to the dataframe, the original categorical columns can be dropped.

df = pd.concat([df, pd.get_dummies(df['primarytype'])], axis = 1)
df = df.drop('primarytype', axis = 1)
df = pd.concat([df, pd.get_dummies(df['locationdescription'])], axis = 1)
df = df.drop('locationdescription', axis = 1)

Create an Evaluation Function

An evaluation function was created to analyze all of the models consistently and determine which performed best. The AUC score is the primary metric used to compare models. The validation and test accuracy scores show how accurate a model is on out-of-sample data, and comparing them to the training accuracy indicates whether the model is overfitting.

def evaluate_model(clf):
    """Report AUC, accuracy, and the test confusion matrix for a fitted classifier.

    Relies on the X_train/X_val/X_test and y_train/y_val/y_test splits defined globally.
    """
    train_preds = clf.predict(X_train)
    train_auc = roc_auc_score(y_train, train_preds)
    val_preds = clf.predict(X_val)
    val_auc = roc_auc_score(y_val, val_preds)
    test_preds = clf.predict(X_test)
    test_auc = roc_auc_score(y_test, test_preds)
    train_score = clf.score(X_train, y_train)
    val_score = clf.score(X_val, y_val)
    test_score = clf.score(X_test, y_test)
    confusion_mat = confusion_matrix(y_test, test_preds)
    
    print(f"AUC for training set: {train_auc} \nAUC for validation set: {val_auc} \nAUC for test set: {test_auc} \nScore for training set: {train_score}\nScore for validation set: {val_score} \nScore for test set: {test_score} \nConfusion Matrix: \n{confusion_mat}")
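
The splits referenced inside evaluate_model (X_train, X_val, X_test and the matching labels) are created in the modeling notebook. As an illustration only, they could be produced and the function exercised roughly like this; the 60/20/20 split and the logistic regression baseline are assumptions for demonstration, not the project's final setup.

# Illustrative sketch: build train/validation/test splits and evaluate a baseline model.
X = df.drop('arrest', axis=1)
y = df['arrest']
# First hold out 40% of the data, then split that hold-out set evenly into validation and test.
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.4, random_state=42, stratify=y)
X_val, X_test, y_val, y_test = train_test_split(X_hold, y_hold, test_size=0.5, random_state=42, stratify=y_hold)

baseline = LogisticRegression(max_iter=1000)
baseline.fit(X_train, y_train)
evaluate_model(baseline)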

Next:

Modeling Chicago Crime