Classifying San Francisco Crime Incidents

Given the location and the date and time an incident occurred, could I classify the type of crime being reported?

For this project, I wanted to see if I could apply classification algorithms to the San Francisco crime data available from the SF Open Data website.

The data set I downloaded contained almost 200,000 police incidents recorded between January 2016 and April 2017, with data points for the category of crime, a short description, latitude and longitude coordinates, the police district in which the crime occurred, and the date and time it was reported.

Crime categories

There were 29 different categories of crime, the most frequent being larceny/theft (53,000 incidents), and the least frequent being gambling (24 incidents).

Twenty-nine categories was too many, so I chose to focus on the crimes I thought I might actually be able to predict (for instance, I didn’t think there would be much use in trying to predict where white collar crime or ‘suspicious behaviour’ would happen). In the end I chose 7 different types of crime, which amounted to 108,000 incidents (just over half of the original data set):

The problem here, though, was that I had a hugely imbalanced set of classes – larceny/theft incidents made up roughly half of the data set! To address this, I looked into the descriptions and was able to break the larceny category down into a number of subcategories of theft: theft of property, theft from auto, shoplifting/pickpocketing, and a catch-all ‘other theft’ for everything else. I also chose to drop the burglary and ‘other’ categories, leaving me with:

This still left theft from auto as a much larger category than the rest – almost twice the size of assault – but it was at least an improvement on the previous distribution, where the largest class was more than 3 times the size of the next.

Data collection

To get the most out of my data, I figured my best bet was to extract as much information as I could from the latitude and longitude coordinates in the data set – they had a lot of potential explanatory power, but in their raw format they were pretty much useless (a model can’t find meaningful patterns in raw coordinate values on their own).

My focus, then, would be to seek out other coordinates from external data sets that I could use to ‘make sense’ of the crime coordinate data.

Zip code boundary data

The first data set I acquired was boundary data for all the zip codes in the city – it was easy to find as it is also available from the SF Open Data website. This data was in the form of a GeoJSON file and contained a series of coordinates (also known as Polygons) that mapped out the boundaries for each of the zip codes.

With this data, I performed a spatial join against my crime incident coordinates data (also known as Points) to identify which zip code each incident fell into. I have another blog post explaining this procedure here.
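In practice this kind of join is a one-liner with geopandas (`geopandas.sjoin`), but the underlying test is just point-in-polygon. A minimal pure-Python sketch of the idea, using ray casting and a made-up toy boundary (the zip code and coordinates below are illustrative, not the real SF data):

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: does the point fall inside the polygon?
    polygon is a list of (lon, lat) vertices."""
    inside = False
    j = len(polygon) - 1
    for i in range(len(polygon)):
        xi, yi = polygon[i]
        xj, yj = polygon[j]
        # Count edge crossings of a ray extending left from the point
        if (yi > lat) != (yj > lat) and \
           lon < (xj - xi) * (lat - yi) / (yj - yi) + xi:
            inside = not inside
        j = i
    return inside

def spatial_join(incidents, zip_polygons):
    """Assign each incident (lon, lat) the first zip code whose
    boundary polygon contains it; None if no polygon matches."""
    labels = []
    for lon, lat in incidents:
        match = None
        for zip_code, poly in zip_polygons.items():
            if point_in_polygon(lon, lat, poly):
                match = zip_code
                break
        labels.append(match)
    return labels
```

A real GeoJSON file holds the polygons as nested coordinate arrays, but the containment logic is the same.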

Zillow house price data

Now that I knew which zip code each crime incident occurred in, I could connect two other data sets. The first was house price data that I collected through the Zillow API (if you’re reading this in the UK, Zillow is a US equivalent of Zoopla or Rightmove).

Their ‘z-index’ data is only available at the zip code level, but it gave me an idea of the wealth of the neighbourhood in which each crime incident had occurred. The intuition was that different types of crime might occur depending on how ‘well heeled’ the neighbourhood was.

Census income data

The second data set I connected against the zip code data was median income data for each zip code. I was able to collect this information from the 2010 census. The data also contained the total size of each zip code, so for a more accurate representation, I used an income per square mile measure.

The intuition behind this data was essentially the same as the house price data – I thought that measures of personal wealth might make a difference.

BART and Caltrain station locations

Now my attention turned to finding other coordinate Points for various landmarks / places of interest around San Francisco that might explain what types of crimes were being committed.

Train stations are busy places and I thought they might be ‘hot beds’ of crime, so my first thought was to collect station locations for BART (Bay Area Rapid Transit – the underground train network) and Caltrain (overground trains).

To get this data I used the BART API to collect the station coordinates, and manually collected the Caltrain data from Wikipedia as there are only a handful of stations.

Police station locations

Next, I collected the locations of all the police stations in the city; there aren’t many so I manually copied and pasted the data from the SF police website. However, they only had the street addresses, so I had to run the addresses through the Google Maps Geocoding API to retrieve the latitude and longitude coordinates.
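A sketch of that geocoding step, assuming the standard JSON shape the Geocoding API returns (`status` plus a `results` list with a `geometry.location` object); the address and API key here are placeholders and the actual HTTP call is omitted:

```python
import urllib.parse

GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"

def build_geocode_url(address, api_key):
    """Build the request URL for the Google Maps Geocoding API."""
    params = urllib.parse.urlencode({"address": address, "key": api_key})
    return f"{GEOCODE_URL}?{params}"

def parse_geocode_response(payload):
    """Pull (lat, lng) out of a Geocoding API JSON response;
    return None if the lookup failed or found nothing."""
    if payload.get("status") != "OK" or not payload.get("results"):
        return None
    loc = payload["results"][0]["geometry"]["location"]
    return loc["lat"], loc["lng"]
```

In the real pipeline you would fetch `build_geocode_url(...)` for each station address and feed the decoded JSON to `parse_geocode_response`.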

Medical marijuana dispensary locations

Also available from the SF Open Data website are the locations of all the medical marijuana dispensaries (it’s legal to buy for medical reasons) dotted around the city. There aren’t that many (roughly 30), but I thought they might help explain crimes occurring around the city.

Health care facility locations

I also collected the coordinates for all the health care facilities in the city. These range from hospitals through to community clinics for drug treatment and mental health.

Homeless shelter locations

I’m not sure if there’s just a lack of a support network or if the warm climate has anything to do with it, but one thing that surprised me when I first arrived in San Francisco was the sheer number of homeless people in the city.

The last set of geospatial data that I collected was from the Shelter website for all the locations of homeless shelters in the city.

Altitude data

Finally, I ran all the crime incident coordinates through an elevation database to collect the altitude of each incident location. I wasn’t sure what this might lead to, but it was relatively straightforward to collect and I’d let the machine learning take care of the feature importance.

Feature engineering on the location data

Zip codes

As I have already mentioned, I used the GeoJSON zip code boundary data to identify which zip code each incident occurred in. From there, I used the zip codes to connect the average house price and median income data sets.

Distance to Union Square

For each crime incident in my data set I calculated the distance from a fixed city landmark – essentially an ‘anchor point’ that would let me create spatial relationships between the individual incidents – and Union Square seemed like a good choice.
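A minimal sketch of that calculation using the haversine formula (the Union Square coordinates below are approximate):

```python
from math import radians, sin, cos, asin, sqrt

# Union Square, San Francisco (approximate lat/lon)
UNION_SQUARE = (37.7880, -122.4075)

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in miles."""
    r = 3956  # mean Earth radius in miles
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * r * asin(sqrt(a))

def distance_to_union_square(lat, lon):
    """The anchor-point feature: miles from an incident to Union Square."""
    return haversine_miles(lat, lon, *UNION_SQUARE)
```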

Closest distances to the various points of interest

For each crime incident I calculated the distance to the closest police station, train station, medical marijuana dispensary, health care facility and homeless shelter. I figured I might identify clusters or ‘hot spots’ of crimes in this way.

Total number of nearby points of interest

Lastly, as a slightly different take on the above, I also calculated the total number of dispensaries, train stations, health care facilities and homeless shelters within a certain distance. The idea was that if there were, say, 3 marijuana dispensaries all within a short distance of each other, this might be a factor that increases the likelihood of different types of crimes occurring within that part of the city.
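Both of these neighbourhood features reduce to the same pairwise distance calculation. A sketch with the haversine formula, where `places` is any list of (lat, lon) points of interest:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in miles."""
    r = 3956  # mean Earth radius in miles
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * r * asin(sqrt(a))

def nearest_distance(incident, places):
    """Feature 1: distance from an incident to its closest point of interest."""
    return min(haversine_miles(*incident, *p) for p in places)

def count_within(incident, places, radius_miles):
    """Feature 2: number of points of interest within radius_miles."""
    return sum(haversine_miles(*incident, *p) <= radius_miles for p in places)
```

Computed once per incident for each category of place (police stations, train stations, dispensaries, and so on), these give the ‘closest distance’ and ‘nearby count’ columns described above.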

Examining feature importance

Before I started modelling properly, as part of an initial exploration of the features in my data set I ran a Random Forest Classifier on the data and extracted the feature importances.
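A sketch of that step with scikit-learn, run here on synthetic stand-in data (the feature names are illustrative, and the first feature is deliberately made informative):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in: 500 incidents, 5 numeric features, 3 classes
X = rng.normal(size=(500, 5))
y = (X[:, 0] > 0).astype(int) + (X[:, 1] > 1).astype(int)

feature_names = ["dist_union_square", "dist_train_station",
                 "dist_police_station", "altitude", "income_density"]

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Rank features by mean decrease in impurity
ranked = sorted(zip(feature_names, rf.feature_importances_),
                key=lambda t: t[1], reverse=True)
```

On the real data, the same `feature_importances_` attribute produced the ranking below.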

The top 30 features confirmed that the time I spent on data acquisition and feature engineering had been a worthwhile endeavour, with many of the features I had created appearing high up the list – particularly the top 7, which appeared significantly more important than the rest:

  1. Distance to Union Square
  2. Distance to nearest train station
  3. Distance to nearest health care facility
  4. Distance to nearest police station
  5. Distance to nearest medical marijuana dispensary
  6. Distance to nearest homeless shelter
  7. Altitude of crime incident


With my data acquisition, feature engineering and data pre-processing (which I haven’t talked about, but involved scaling the numerical data and creating dummy variables from the categorical features) completed, I set up a pipeline to fit 7 different classifiers and perform a grid search over a variety of hyperparameters for each of these models.

The 7 models that I trained on the data were:

  1. AdaBoost Classifier
  2. Bagging Classifier
  3. Extra Trees Classifier
  4. Gradient Boosting Classifier
  5. K-Nearest Neighbors Classifier
  6. Logistic Regression
  7. Random Forest Classifier

Each model had its own set of hyperparameter specifications to fit, and after trying all combinations in the grid search, I had trained more than 1,000 different variations of these models. As the data was only around 100,000 rows, it didn’t take too long (a few hours). If the data had been much larger, I probably would have used randomized search instead.

Also in the pipeline (before commencing with the machine learning) I performed automatic feature selection, where – as above – the features were selected using a Random Forest classifier.
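A condensed sketch of such a pipeline with scikit-learn – scaling, Random-Forest-based feature selection, one of the seven classifiers, and a small hyperparameter grid – run on synthetic stand-in data (the real grid was far larger):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 6))          # stand-in feature matrix
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # stand-in labels

pipe = Pipeline([
    ("scale", StandardScaler()),
    # Automatic feature selection driven by Random Forest importances
    ("select", SelectFromModel(
        RandomForestClassifier(n_estimators=50, random_state=0))),
    ("clf", GradientBoostingClassifier(random_state=0)),
])

grid = GridSearchCV(pipe, param_grid={
    "clf__n_estimators": [50, 100],
    "clf__max_depth": [2, 3],
}, cv=3)
grid.fit(X, y)
```

Swapping the `"clf"` step lets the same pipeline structure drive the grid search for each of the 7 models.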

For each of the 7 different models, the grid search identified the best-performing hyperparameter combination:


AdaBoost

Best accuracy score: 0.412

Bagging

Best accuracy score: 0.438

Extra Trees

Best accuracy score: 0.381

Gradient Boosted Trees

Best accuracy score: 0.457


K-Nearest Neighbors

Best accuracy score: 0.431

Logistic Regression

Best accuracy score: 0.384

Random Forest

Best accuracy score: 0.410

There wasn’t much in it between the optimal Gradient Boosted Trees, Bagging and KNN models. These were the classification reports for all three:

(Classification reports for the Gradient Boosted Trees, Bagging and KNN models.)
All 3 models performed relatively similarly.

Multi-class classification problems are not easy, but each of these models was able to classify between 8 different types of crime with an average precision and recall of around 40%. When you consider that guessing at random would yield only 12.5% accuracy, that’s actually not too bad.

I’m sure that with further feature engineering, and resampling the data to address the class imbalance, these metrics would improve further.

