Introduction to Geographical Time Series Prediction with Crime Data in R, SQL, and Tableau

When reviewing geographical data, it can be difficult to prepare the data for an analysis. This article helps by covering: importing data into a SQL Server database; cleansing and grouping data into a map grid; adding time data points to the grid data and filling in the gaps where no crimes occurred; importing the data into R; and running an XGBoost model to determine where crimes will occur on a specific day.



By Jason Wittenauer, Lead Data Scientist at Huron Consulting Group


 

 

In this tutorial you will learn how to prepare geographical data for time series predictions.

 
When reviewing geographical data, it can be difficult to prepare the data for an analysis. There are algorithms built specifically for geographical data, but they can be limited in what they do. If we spend some time splitting the data into a grid and then adding a time point for each grid square, it opens up more possibilities for modeling algorithms and additional features. However, it does create an issue with the size of the data set.

A five-year crime data set can easily consist of 250,000 records. Once that is extrapolated into a time series grid covering an entire city, it can reach 75 million data points. When dealing with data of this size, it is helpful to use a database to cleanse the data before sending it to a modeling script. The steps we will follow are listed below:

  • Importing data into a SQL Server database.
  • Cleansing and grouping data into a map grid.
  • Adding time data points to the set of grid data and filling in the gaps where no crimes occurred.
  • Importing the data into R.
  • Running an XGBoost model to determine where crimes will occur on a specific day.

At the end, we will discuss the next steps for making the predictions more usable to end users in BI tools like Tableau.

 

Prerequisites

 
Before beginning this tutorial, you will need:

  • SQL Server Express installed
  • SQL Management Studio or similar IDE to interface with SQL Server
  • R installed
  • R Studio, Jupyter notebook, or other IDE to interface with R
  • A general working knowledge of SQL and R

 

Data Flow Overview

 
We will be keeping the database and data flow in a simple structure for now. The database itself will be very flat, and the final data handoff into the Tableau dashboard will be executed via a text file. In a more production-ready version, we would save the predictions back into the database and pull from there into the reporting dashboard. For this example, we are not trying to architect everything perfectly, but just to understand the basics of doing geographical time series predictions.

Our data flow is below:

 

Set Up the Database

 
Our prediction model will use crime data for the Baltimore area from 2012 to 2017. It is located in the "Data" folder within this repo, in the file "Baltimore Incident Data.zip". Before importing this data, you will need to follow one of the options below to set up the database:

Option 1

  • Restore a SQL database using the backup file located in the "Database Objects\Clean Backup\" folder (an example restore script is sketched after these options).

Option 2

  • Use the scripts located in the "Database Objects" folder under "Tables" and "Procedures" to manually create all of the objects.
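
If you go with Option 1, the restore can be run through the SQL Management Studio UI or scripted. A minimal sketch is below; the backup file name and paths are placeholders, so adjust them to match where you cloned the repo.

-- Example restore (file name and paths below are placeholders -- adjust to your environment)
RESTORE DATABASE [crime]
FROM DISK = N'C:\Projects\Crime Prediction\Database Objects\Clean Backup\Crime.bak'
WITH REPLACE;
-- If the data/log paths stored in the backup do not exist locally, add MOVE clauses, e.g.:
--   MOVE N'crime' TO N'C:\SQLData\crime.mdf', MOVE N'crime_log' TO N'C:\SQLData\crime_log.ldf'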

 

Import the Data

 
Once the database has been successfully created, you can now import the data. This will require you to do the following:

  • Unzip the "Baltimore Incident Data.zip" file found in the "Data" folder.
  • Run the "Insert_StagingCrime" procedure and make sure it is pointed to the correct import file and to the correct format file (found in the "Data" folder named "FormatFile.fmt").
EXEC Insert_StagingCrime


This procedure truncates Staging_Crime and inserts data from the file directly into it using a BULK INSERT. The staging table's columns are all VARCHAR(MAX), which we will convert into better data types in the next phase of the import process. A code snippet from the procedure is below.

BULK INSERT Staging_Crime 
FROM 'C:\Projects\Crime Prediction\Data\Baltimore Incident Data.csv'
WITH (FIRSTROW = 2, FORMATFILE = 'C:\Projects\Crime Prediction\Data\FormatFile.fmt')


 

Review the Data

 
Now that the data has been imported into the staging table, you can view it by running the below code in SQL Management Studio:

SELECT TOP 10 *
FROM [dbo].[Staging_Crime]


This gives you the following results:

CrimeDate CrimeTime CrimeCode Address Description InsideOutside Weapon Post District Neighborhood Location Premise TotalIncidents
02/28/2017 23:50:00 6D 2400 KEYWORTH AVE LARCENY FROM AUTO O NULL 533 NORTHERN Greenspring (39.3348700000, -76.6590200000) STREET 1
02/28/2017 23:36:00 4D 200 DIENER PL AGG. ASSAULT I HANDS 843 SOUTHWESTERN Irvington (39.2830300000, -76.6878200000) APT/CONDO 1
02/28/2017 23:02:00 4E 1800 N MOUNT ST COMMON ASSAULT I HANDS 742 WESTERN Sandtown-Winchester (39.3092400000, -76.6449800000) ROW/TOWNHO 1
02/28/2017 23:00:00 6D 200 S CLINTON ST LARCENY FROM AUTO O NULL 231 SOUTHEASTERN Highlandtown (39.2894600000, -76.5701900000) STREET 1
02/28/2017 22:00:00 6E 1300 TOWSON ST LARCENY O NULL 943 SOUTHERN Locust Point (39.2707100000, -76.5911800000) STREET 1
02/28/2017 21:40:00 6J 1000 WILMOT CT LARCENY O NULL 312 EASTERN Oldtown (39.2993600000, -76.6034100000) STREET 1
02/28/2017 21:40:00 6J 2400 PENNSYLVANIA AVE LARCENY O NULL 733 WESTERN Penn North (39.3094200000, -76.6417700000) STREET 1
02/28/2017 21:30:00 5D 1500 STACK ST BURGLARY I NULL 943 SOUTHERN Riverside (39.2721500000, -76.6033600000) OTHER/RESI 1
02/28/2017 21:30:00 6D 2100 KOKO LN LARCENY FROM AUTO O NULL 731 WESTERN Panway/Braddish Avenue (39.3117800000, -76.6633200000) STREET 1
02/28/2017 21:10:00 3CF 800 W LEXINGTON ST ROBBERY - COMMERCIAL O FIREARM 712 WESTERN Poppleton (39.2910500000, -76.6310600000) STREET 1

 

Notice how this data has the point in time the crime occurred, the latitude/longitude, and even crime categories. Those categories can be very useful for more advanced modeling techniques and analytics.

 

Cleanse the Data and Create Grids of the Map

 
Next we can move our data into a "Crime" table that has the correct data types associated with it. During this step we will also split the location field into separate latitude and longitude fields. All of the logic to complete these steps can be run with the execution statement below:

EXEC Insert_Crime
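
The repo's Insert_Crime procedure contains the actual parsing logic, but conceptually the location split looks something like the sketch below. It assumes the Location column always follows the "(latitude, longitude)" format shown in the staging output above; the output column names are illustrative.

-- Sketch: split the "(latitude, longitude)" string from staging into numeric columns
SELECT
    TRY_CAST(SUBSTRING(Location, 2, CHARINDEX(',', Location) - 2) AS DECIMAL(13, 10)) AS Latitude,
    TRY_CAST(SUBSTRING(Location, CHARINDEX(',', Location) + 2,
             LEN(Location) - CHARINDEX(',', Location) - 2) AS DECIMAL(13, 10)) AS Longitude
FROM Staging_Crime
WHERE Location LIKE '(%,%)';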


Now we will have data populating the "Crime" table. This allows us to complete the next step: creating a grid over the city and assigning each crime to one of the grid squares. You will notice that we create two grids (small and large). This lets us build features both at the crime location and a little bit further away from it, essentially giving us hotspots of crime throughout the city.

Run the code below to create the grids and assign a GridSmallId and GridLargeId to the "Crime" table:

EXEC Update_CrimeCoordinates


This procedure is doing three separate tasks:

  1. Creating a small grid of squares on the map within the GridSmall table (Procedure: Insert_GridSmall).
  2. Creating a large grid of squares on the map within the GridLarge table (Procedure: Insert_GridLarge).
  3. Assigning all the crime records to a small and large square on the map.

The two procedures that create the grid squares take in variables that determine the corners of the map and how many squares we want on our map. This defaults to a small grid that is 200 by 200 and a large grid that is 100 by 100, whose squares are twice as big in each direction.
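
The exact implementations live in the repo's procedures, but a simplified sketch of the square-generation logic is below. The corner coordinates, variable names, and the system-table trick used to generate row numbers are illustrative assumptions, not the repo's actual defaults.

-- Sketch: carve the city's bounding box into an N x N grid of squares
-- (corner coordinates and @SquaresPerSide are example values, not the repo's defaults)
DECLARE @BotLeftLatitude   DECIMAL(13, 10) = 39.2000000000,
        @TopRightLatitude  DECIMAL(13, 10) = 39.3740000000,
        @BotLeftLongitude  DECIMAL(13, 10) = -76.7120000000,
        @TopRightLongitude DECIMAL(13, 10) = -76.5290000000,
        @SquaresPerSide    INT = 200;

DECLARE @LatStep  DECIMAL(13, 10) = (@TopRightLatitude - @BotLeftLatitude) / @SquaresPerSide,
        @LongStep DECIMAL(13, 10) = (@TopRightLongitude - @BotLeftLongitude) / @SquaresPerSide;

WITH Numbers AS (
    SELECT TOP (@SquaresPerSide)
           ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1 AS n
    FROM sys.all_objects
)
SELECT ROW_NUMBER() OVER (ORDER BY lat.n, lng.n)    AS GridSmallId,
       @BotLeftLatitude  + lat.n * @LatStep         AS BotLeftLatitude,
       @BotLeftLatitude  + (lat.n + 1) * @LatStep   AS TopRightLatitude,
       @BotLeftLongitude + lng.n * @LongStep        AS BotLeftLongitude,
       @BotLeftLongitude + (lng.n + 1) * @LongStep  AS TopRightLongitude
FROM Numbers lat
CROSS JOIN Numbers lng;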

You can view some of the grid data by running the following:

SELECT TOP 10 c.CrimeId, gs.*
FROM [dbo].[Crime] c
JOIN GridSmall gs
    ON gs.GridSmallId = c.GridSmallId


CrimeId GridSmallId BotLeftLatitude TopRightLatitude BotLeftLongitude TopRightLongitude
1 31012 39.3341282000001 39.3349965000001 -76.6594308000001 -76.6585152000001
2 19121 39.2828985 39.2837668 -76.68873 -76.6878144
3 25198 39.3089475000001 39.3098158000001 -76.6456968000001 -76.6447812000001
4 20657 39.2889766 39.2898449 -76.5706176000001 -76.5697020000001
5 16212 39.269874 39.2707423 -76.5916764000001 -76.5907608000001
6 22832 39.2985279000001 39.2993962000001 -76.6035792000001 -76.6026636000001
7 25202 39.3089475000001 39.3098158000001 -76.6420344000001 -76.6411188000001
8 16601 39.2716106 39.2724789 -76.6035792000001 -76.6026636000001
9 25781 39.3115524000001 39.3124207000001 -76.6640088 -76.6630932
10 20992 39.2907132 39.2915815 -76.6319628000001 -76.6310472000001

 

Notice how only two points are calculated for each square: the bottom left and top right corners. We don't have to calculate all four points of the square, and even though there is technically some curvature to a real map when drawing longitude and latitude lines, at a small enough scale we can assume the squares have mostly straight edges.
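
Those two corners are all that is needed to assign a crime to a square with a simple range comparison. The sketch below shows the general shape of that update; it assumes the Crime table exposes Latitude and Longitude columns as described earlier, so treat it as illustrative rather than the repo's exact code.

-- Sketch: map each crime to the small grid square containing its coordinates
UPDATE c
SET    c.GridSmallId = gs.GridSmallId
FROM   dbo.Crime c
JOIN   dbo.GridSmall gs
  ON   c.Latitude  >= gs.BotLeftLatitude  AND c.Latitude  < gs.TopRightLatitude
 AND   c.Longitude >= gs.BotLeftLongitude AND c.Longitude < gs.TopRightLongitude;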

 

Create Crime Grid and Lag Features

 
The last step we need to complete is creating the entire grid of the map for all time periods we want to evaluate. In our case, that is one grid point per day to determine if a crime is going to occur. To complete this step, the procedure below needs to be executed:

EXEC Insert_CrimeGrid


During this step, the crimes are grouped by map square so we can determine when and how many crimes occurred on each date in our data set. We also need to generate filler dates for each square when no crime occurred, so that the data set is complete.
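
Conceptually, that amounts to cross joining every grid square with every date in the data set and flagging the combinations where a crime occurred. A rough sketch is below; it derives the date list from the Crime table itself rather than a proper calendar table, so treat it as illustrative of the idea rather than the procedure's actual code.

-- Sketch: one row per (small grid square, date), flagged when at least one crime occurred
SELECT gs.GridSmallId,
       d.CrimeDate,
       CASE WHEN c.GridSmallId IS NULL THEN 0 ELSE 1 END AS IncidentOccurred
FROM dbo.GridSmall gs
CROSS JOIN (SELECT DISTINCT CAST(CrimeDate AS DATE) AS CrimeDate FROM dbo.Crime) d
LEFT JOIN  (SELECT DISTINCT GridSmallId, CAST(CrimeDate AS DATE) AS CrimeDate FROM dbo.Crime) c
       ON  c.GridSmallId = gs.GridSmallId
      AND  c.CrimeDate   = d.CrimeDate;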

The procedure takes quite a while to run. On my laptop, it ran for about an hour, and the resulting table ("CrimeGrid") contains about 75 million records. The nice part is that the output is now saved in a table, so we will not have to rebuild it in our R script, where large data operations might not run as efficiently as they do inside a database.

Also during this step we will be creating "lag features". These are columns that tell us how many times a crime occurred on the grid square within the last day, two days, week, month, etc. This essentially helps us do a "hotspot" analysis of the data, which can be used to look at other grid squares nearby to see if the crime is localized to our single square or clustered across all nearby squares, similar to aftershocks after an earthquake. These features may or may not be necessary depending on the type of modeling you do.
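
For example, a feature like PriorIncident7Days can be computed with a windowed sum over each square's prior rows. The sketch below assumes the CrimeGrid table contains one row per square per day with the IncidentOccurred flag, as described above.

-- Sketch: count incidents in the prior 7 days for each small grid square
-- (assumes CrimeGrid has exactly one row per square per day, as built above)
SELECT GridSmallId,
       CrimeDate,
       IncidentOccurred,
       SUM(IncidentOccurred) OVER (
           PARTITION BY GridSmallId
           ORDER BY CrimeDate
           ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING
       ) AS PriorIncident7Days
FROM dbo.CrimeGrid;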

 

Prediction Setup

 
With all of the data cleansing and feature engineering being done on the database side, the code for doing predictions is quite simple. In our example we will be using XGBoost in R to analyze five years of training data to predict future crimes.

To start, we load our libraries.

# Load required libraries
library(RODBC)
library(xgboost)
library(ROCR)
library(caret)


Loading required package: gplots

Attaching package: 'gplots'

The following object is masked from 'package:stats':

    lowess

Loading required package: lattice
Loading required package: ggplot2
Registered S3 methods overwritten by 'ggplot2':
  method         from 
  [.quosures     rlang
  c.quosures     rlang
  print.quosures rlang


Then import the data into R directly from the database. You could also export the data to text files and read in CSV data if that is your preferred method. The queries to pull the data are just written as SQL statements with some date limiters. These could be rewritten to pull directly from reporting stored procedures with date parameters for a more production-ready version of the code.

# Set seed
set.seed(1001)

# Read in data
dbhandle <- odbcDriverConnect('driver={SQL Server};server=DESKTOP-VLN71V7\\SQLEXPRESS;database=crime;trusted_connection=true')

train <- sqlQuery(dbhandle, 'select IncidentOccurred as target, GridSmallId, GridLargeId, DayOfWeek, MonthOfYear, DayOfYear, Year, PriorIncident1Day, PriorIncident2Days, PriorIncident3Days, PriorIncident7Days, PriorIncident14Days, PriorIncident30Days, PriorIncident1Day_Large, PriorIncident2Days_Large, PriorIncident3Days_Large, PriorIncident7Days_Large, PriorIncident14Days_Large, PriorIncident30Days_Large from crimegrid where crimedate <= \'2/20/2017\' and crimedate >= \'6/1/2012\'')
test <- sqlQuery(dbhandle, 'select IncidentOccurred as target, GridSmallId, GridLargeId, DayOfWeek, MonthOfYear, DayOfYear, Year, PriorIncident1Day, PriorIncident2Days, PriorIncident3Days, PriorIncident7Days, PriorIncident14Days, PriorIncident30Days, PriorIncident1Day_Large, PriorIncident2Days_Large, PriorIncident3Days_Large, PriorIncident7Days_Large, PriorIncident14Days_Large, PriorIncident30Days_Large from crimegrid where crimedate >= \'2/21/2017\' and crimedate <= \'2/27/2017\'')

# Convert integers to numeric for DMatrix
train[] <- lapply(train, as.numeric)
test[] <- lapply(test, as.numeric)

head(train)


target GridSmallId GridLargeId DayOfWeek MonthOfYear DayOfYear Year PriorIncident1Day PriorIncident2Days PriorIncident3Days PriorIncident7Days PriorIncident14Days PriorIncident30Days PriorIncident1Day_Large PriorIncident2Days_Large PriorIncident3Days_Large PriorIncident7Days_Large PriorIncident14Days_Large PriorIncident30Days_Large
0 3780 990 5 9 262 2013 0 0 0 0 0 0 0 0 0 0 0 0
0 3781 991 5 9 262 2013 0 0 0 0 0 0 0 0 0 0 0 0
0 3782 991 5 9 262 2013 0 0 0 0 0 0 0 0 0 0 0 0
0 3783 992 5 9 262 2013 0 0 0 0 0 0 0 0 0 0 0 0
0 3784 992 5 9 262 2013 0 0 0 0 0 0 0 0 0 0 0 0
0 3785 993 5 9 262 2013 0 0 0 0 0 0 0 0 0 0 0 0

 

Notice that in the test data set we are pulling 7 days' worth of data to test at once, but the features still come in as if we knew what happened the prior day. This is just so we can test 7 days at once and see how multiple days' worth of predictions look in our reporting at the very end.

Another item to note is that the features labeled "_Large" are for the larger grid vs. the smaller grid.

 

Create Feature Set and Model Parameters

 
We define the features as all of the columns that come after the first "target" column in the SQL query. These feature columns become the data in the train and test matrices, while the target column is passed in as the label.

# Get feature names (all but first column which is the target)
feature.names <- names(train)[2:ncol(train)]

print(feature.names)

# Make train and test matrices
dtrain <- xgb.DMatrix(data.matrix(train[,feature.names]), label=train$target)
dtest <- xgb.DMatrix(data.matrix(test[,feature.names]), label=test$target)


 [1] "GridSmallId"               "GridLargeId"              
 [3] "DayOfWeek"                 "MonthOfYear"              
 [5] "DayOfYear"                 "Year"                     
 [7] "PriorIncident1Day"         "PriorIncident2Days"       
 [9] "PriorIncident3Days"        "PriorIncident7Days"       
[11] "PriorIncident14Days"       "PriorIncident30Days"      
[13] "PriorIncident1Day_Large"   "PriorIncident2Days_Large" 
[15] "PriorIncident3Days_Large"  "PriorIncident7Days_Large" 
[17] "PriorIncident14Days_Large" "PriorIncident30Days_Large"


Next, set up the parameters for model training. We are keeping them pretty basic for now. Make sure the evaluation metric is AUC so we can try to maximize our true positive rate. This will help prevent police officers from patrolling areas unnecessarily. There could be additional work on covering all predicted areas with clever patrolling, but we are not going to get into that in this tutorial.

# Training parameters
watchlist <- list(eval = dtest, train = dtrain)

param <- list(  objective           = "binary:logistic", 
                booster             = "gbtree",
                eta                 = 0.01,
                max_depth           = 10,
                eval_metric         = "auc"
)


 

Run the Model

 
Now that everything is set up, we can run the model and see how well it predicts. Keep in mind that this is using a pretty basic set of features just to demonstrate one way to deal with geographic time series data. While the data sets can get quite large, they are easy to understand and can run fairly quickly when using cloud-based services like AWS.

This particular training data set had 70 million rows and took about 30 minutes to complete 67 rounds of evaluation on my laptop.

# Run model
clf <- xgb.train(   params                  = param, 
                    data                    = dtrain, 
                    nrounds                 = 100, 
                    verbose                 = 2, 
                    early_stopping_rounds   = 10,
                    watchlist               = watchlist,
                    maximize               = TRUE)


[14:30:32] WARNING: amalgamation/../src/learner.cc:686: Tree method is automatically selected to be 'approx' for faster speed. To use old behavior (exact greedy algorithm on single machine), set tree_method to 'exact'.
[1]	eval-auc:0.858208	train-auc:0.858555 
Multiple eval metrics are present. Will use train_auc for early stopping.
Will train until train_auc hasn't improved in 10 rounds.

[2]	eval-auc:0.858208	train-auc:0.858555 
[3]	eval-auc:0.858208	train-auc:0.858555 
[4]	eval-auc:0.858208	train-auc:0.858556 
[5]	eval-auc:0.858208	train-auc:0.858556 
[6]	eval-auc:0.858208	train-auc:0.858556 
[7]	eval-auc:0.858311	train-auc:0.858997 
[8]	eval-auc:0.858311	train-auc:0.858997 
[9]	eval-auc:0.858311	train-auc:0.858997 
[10]	eval-auc:0.858315	train-auc:0.859000 
[11]	eval-auc:0.858436	train-auc:0.859110 
[12]	eval-auc:0.858436	train-auc:0.859110 
[13]	eval-auc:0.858512	train-auc:0.859157 
[14]	eval-auc:0.858493	train-auc:0.859157 
[15]	eval-auc:0.858496	train-auc:0.859160 
[16]	eval-auc:0.858498	train-auc:0.859160 
[17]	eval-auc:0.858498	train-auc:0.859160 
[18]	eval-auc:0.858342	train-auc:0.859851 
[19]	eval-auc:0.858177	train-auc:0.859907 
[20]	eval-auc:0.858228	train-auc:0.859971 
[21]	eval-auc:0.858231	train-auc:0.859971 
[22]	eval-auc:0.858206	train-auc:0.860695 
[23]	eval-auc:0.858207	train-auc:0.860695 
[24]	eval-auc:0.858731	train-auc:0.860894 
[25]	eval-auc:0.858702	train-auc:0.860844 
[26]	eval-auc:0.858607	train-auc:0.860844 
[27]	eval-auc:0.858574	train-auc:0.860842 
[28]	eval-auc:0.858602	train-auc:0.860892 
[29]	eval-auc:0.858576	train-auc:0.860843 
[30]	eval-auc:0.858574	train-auc:0.860841 
[31]	eval-auc:0.858607	train-auc:0.860893 
[32]	eval-auc:0.858578	train-auc:0.860843 
[33]	eval-auc:0.858611	train-auc:0.860894 
[34]	eval-auc:0.858612	train-auc:0.860895 
[35]	eval-auc:0.858614	train-auc:0.860898 
[36]	eval-auc:0.858615	train-auc:0.860899 
[37]	eval-auc:0.858616	train-auc:0.860897 
[38]	eval-auc:0.858573	train-auc:0.860870 
[39]	eval-auc:0.858546	train-auc:0.860822 
[40]	eval-auc:0.858575	train-auc:0.860872 
[41]	eval-auc:0.858622	train-auc:0.860898 
[42]	eval-auc:0.858578	train-auc:0.860875 
[43]	eval-auc:0.858583	train-auc:0.860870 
[44]	eval-auc:0.859223	train-auc:0.861768 
[45]	eval-auc:0.859220	train-auc:0.861760 
[46]	eval-auc:0.859221	train-auc:0.861760 
[47]	eval-auc:0.859099	train-auc:0.861719 
[48]	eval-auc:0.859112	train-auc:0.861735 
[49]	eval-auc:0.859112	train-auc:0.861735 
[50]	eval-auc:0.859094	train-auc:0.861734 
[51]	eval-auc:0.859125	train-auc:0.861785 
[52]	eval-auc:0.859021	train-auc:0.861771 
[53]	eval-auc:0.859028	train-auc:0.861784 
[54]	eval-auc:0.859029	train-auc:0.861781 
[55]	eval-auc:0.859028	train-auc:0.861784 
[56]	eval-auc:0.859035	train-auc:0.861788 
[57]	eval-auc:0.859037	train-auc:0.861789 
[58]	eval-auc:0.859035	train-auc:0.861775 
[59]	eval-auc:0.859035	train-auc:0.861774 
[60]	eval-auc:0.859010	train-auc:0.861738 
[61]	eval-auc:0.859011	train-auc:0.861739 
[62]	eval-auc:0.859039	train-auc:0.861778 
[63]	eval-auc:0.859016	train-auc:0.861739 
[64]	eval-auc:0.859017	train-auc:0.861741 
[65]	eval-auc:0.859018	train-auc:0.861746 
[66]	eval-auc:0.859019	train-auc:0.861747 
[67]	eval-auc:0.859024	train-auc:0.861755 
Stopping. Best iteration:
[57]	eval-auc:0.859037	train-auc:0.861789


The model found the patterns quickly and didn't improve much with each additional round. This could be enhanced with better model parameters and more features.

 

Review Importance Matrix

 
Time to analyze our features and see which ones rate highly in the model by looking at the importance matrix. This can help us determine whether new features are really worth adding to a large data set (a larger data set means extra processing time per feature added).

# Compute feature importance matrix
importance_matrix <- xgb.importance(feature.names, model = clf)

# Graph important features
xgb.plot.importance(importance_matrix[1:10,])


It looks like the model is using a combination of small and large grid features to determine whether an incident will occur on the current day. It is interesting that the longer-term features seem to be more important, indicating a history of criminal activity in the area and/or surrounding areas.

 

Predict on Test and Check ROC Curve

 
The last step is to check our predictions against the test data set and see how well they did. We always hope that the ROC curve will spike up very high and very quickly, but that is not always the case. In our example, we have a decent score with only basic features included in the model. This definitely shows that the predictions are feasible and that more time should be invested in making them better.

# Predict on test data
preds <- predict(clf, dtest)

# Graph AUC curve
xgb.pred <- prediction(preds, test$target)
xgb.perf <- performance(xgb.pred, "tpr", "fpr")

plot(xgb.perf,
     avg="threshold",
     colorize=TRUE,
     lwd=1,
     main="ROC Curve w/ Thresholds",
     print.cutoffs.at=seq(0, 1, by=0.05),
     text.adj=c(-0.5, 0.5),
     text.cex=0.5)
grid(col="lightgray")
axis(1, at=seq(0, 1, by=0.1))
axis(2, at=seq(0, 1, by=0.1))
abline(v=c(0.1, 0.3, 0.5, 0.7, 0.9), col="lightgray", lty="dotted")
abline(h=c(0.1, 0.3, 0.5, 0.7, 0.9), col="lightgray", lty="dotted")
lines(x=c(0, 1), y=c(0, 1), col="black", lty="dotted")


 

Review the Confusion Matrix

 
We know there is a decent AUC score, but let's look at the actual output of what we predicted vs. what actually happened. The easiest way to do this is to review the confusion matrix. We want the top left and bottom right boxes (good predictions) to be big and the others (bad predictions) to be small.

# Set our cutoff threshold
preds.resp <- ifelse(preds >= 0.5, 1, 0)

# Create the confusion matrix
confusionMatrix(as.factor(preds.resp), as.factor(test$target), positive = "1")


Confusion Matrix and Statistics

          Reference
Prediction      0      1
         0 280581    454
         1     62    367
                                         
               Accuracy : 0.9982         
                 95% CI : (0.998, 0.9983)
    No Information Rate : 0.9971         
    P-Value [Acc > NIR] : < 2.2e-16      
                                         
                  Kappa : 0.5864         
                                         
 Mcnemar's Test P-Value : < 2.2e-16      
                                         
            Sensitivity : 0.447016       
            Specificity : 0.999779       
         Pos Pred Value : 0.855478       
         Neg Pred Value : 0.998385       
             Prevalence : 0.002917       
         Detection Rate : 0.001304       
   Detection Prevalence : 0.001524       
      Balanced Accuracy : 0.723397       
                                         
       'Positive' Class : 1              


Over the course of 7 days, it looks like there were 821 incidents, and our model correctly predicted 367 of them. The other key point is that we incorrectly predicted 62 incidents, essentially sending police officers to areas where we thought a crime would occur but none did. At face value, this doesn't seem too bad, but we would need to see how it affects response times to criminal activity compared to current response times. The idea is that police officers would be close enough to a crime that they can either prevent it with their presence or respond to it very quickly to limit the harm done.

 

Prepare Data for Reporting

 
Next we can add in the longitude and latitude coordinates for the test data set and export it to a CSV file. This will allow us to take a look at the actual predictions on a map in a BI tool like Tableau.

# Read in the grid coordinates
gridsmall <- sqlQuery(dbhandle, 'select * from gridsmall')

# Merge the predictions with the test data
results <- cbind(test, preds)

# Merge the grid coordinates with the test data
results <- merge(results, gridsmall, by="GridSmallId")

head(results)

# Save to file
write.csv(results,"Data\\CrimePredictions.csv", row.names = TRUE)


GridSmallId target GridLargeId.x DayOfWeek MonthOfYear DayOfYear Year PriorIncident1Day PriorIncident2Days PriorIncident3Days ... PriorIncident3Days_Large PriorIncident7Days_Large PriorIncident14Days_Large PriorIncident30Days_Large preds BotLeftLatitude TopRightLatitude BotLeftLongitude TopRightLongitude GridLargeId.y
1 0 1 3 2 52 2017 0 0 0 ... 0 0 0 0 0.2822689 39.20041 39.20128 -76.71162 -76.7107 1
1 0 1 5 2 54 2017 0 0 0 ... 0 0 0 0 0.2822689 39.20041 39.20128 -76.71162 -76.7107 1
1 0 1 7 2 56 2017 0 0 0 ... 0 0 0 0 0.2822689 39.20041 39.20128 -76.71162 -76.7107 1
1 0 1 4 2 53 2017 0 0 0 ... 0 0 0 0 0.2822689 39.20041 39.20128 -76.71162 -76.7107 1
1 0 1 2 2 58 2017 0 0 0 ... 0 0 0 0 0.2822689 39.20041 39.20128 -76.71162 -76.7107 1
1 0 1 1 2 57 2017 0 0 0 ... 0 0 0 0 0.2822689 39.20041 39.20128 -76.71162 -76.7107 1

 

 

Visualizing the Predictions

 
It is time to see what these predictions actually look like over the course of 7 days. The Tableau dashboard itself can be found here:

Looking at the predictions that were incorrect (red dots), they appear to be pretty close to predictions that were correct. This would still put police officers in the general vicinity of a crime occurring, so the incorrect predictions may not be affecting the overall results very much at all. Example screenshots are shown below with the correct predictions (blue) and incorrect predictions (red).

Incorrect Predictions

Incorrect and Correct Predictions

Overall it does not seem too bad, but we will need more features and/or more data to capture the predictions we missed. There is also probably a lot more we can do to focus on the specific types of crimes that are occurring and build targeted prediction models to handle each type.

 

What Next?

 
This tutorial got us started with geographical time series predictions using crime data. We can see that the predictions are definitely working, but there is more work to be done on creating features. We might want to add other features that check a larger area for prior crime occurrences. Another useful step would be to change our predictions to run hourly and map squad car patrol routes by time of day. Even if the predictions are not perfect, as long as we are putting police officers in the general vicinity of a crime and doing better than current patrol methodologies, they can respond much faster or even prevent the crime from occurring in the first place with just their presence.

Other ideas to think about:

  • Remove crime data in the vicinity of police stations, fire stations, hospitals, etc., since crime counts near those locations could be biased by where people submit reports.
  • Add in demographic features related to census information.
  • Map crimes to neighborhoods instead of a square grid to predict against.

Hope you enjoyed the tutorial!

 
Bio: Jason Wittenauer is a data scientist focused on improving hospital earnings with a background in R, Python, Microsoft SQL Server, Tableau, TIBCO Spotfire, and .NET web programming. His areas of focus in healthcare are business operations, revenue improvement, and expense reduction.

Original. Reposted with permission.
