Unlocking the Power of Geospatial Data with GeoPandas

I recently worked with quite a lot of geospatial data for a project, and in the process I stumbled upon the lifesaver that is GeoPandas. As the name implies, it builds on the core functionality of the regular Pandas library in Python, but lets you work with geospatial data in a number of ways.

The goal of GeoPandas is to make working with geospatial data in python easier. It combines the capabilities of pandas and shapely, providing geospatial operations in pandas and a high-level interface to multiple geometries to shapely. GeoPandas enables you to easily do operations in python that would otherwise require a spatial database such as PostGIS. (GeoPandas.org)

In the project I was working on, I was attempting to classify San Francisco crime incidents using data available from the excellent San Francisco Open Data website. The data included the exact latitude and longitude of each crime incident, and I wanted to use these coordinates for lots of location-based feature engineering. The coordinates have plenty of potential, but in their raw format they aren’t much use on their own; I needed to extract that explanatory power and turn the raw data into usable information.

GeoPandas made this task very straightforward, and I wanted to highlight how easy it is to use.

The data

Importing the relevant libraries:

In [1]:
import pandas as pd
import geopandas
from shapely.geometry import Point
import matplotlib.pyplot as plt
%matplotlib inline

Below is an example of my data set. I’ve cleaned it up a bit from the raw data I downloaded from the website (and I’m only working with a sample of the entire data set) but as you can see, each crime incident has a longitude and latitude value.

In [2]:
police_data = pd.read_csv('sf_police_incidents.csv')
police_data.sample(5)
Out[2]:
incident_num category longitude latitude
1663 160520394.0 VANDALISM -122.400040 37.799206
975 160269768.0 WARRANTS -122.419658 37.764221
1920 170266431.0 VANDALISM -122.407538 37.765783
2888 170208847.0 OTHER OFFENSES -122.482592 37.780188
4979 170097822.0 OTHER OFFENSES -122.409969 37.731046

At the moment I’m just working with a standard Pandas DataFrame.

In [3]:
police_data['geometry'] = police_data.apply(lambda row: Point(row['longitude'], row['latitude']), axis=1)
police_data.sample(5)
Out[3]:
incident_num category longitude latitude geometry
2064 160694961.0 WARRANTS -122.412433 37.774332 POINT (-122.4124334 37.77433216)
2438 160566144.0 LARCENY/THEFT -122.448278 37.804272 POINT (-122.4482784 37.80427189)
1442 160450462.0 LARCENY/THEFT -122.414810 37.786519 POINT (-122.41481 37.78651941)
589 160318149.0 NON-CRIMINAL -122.411615 37.783161 POINT (-122.4116151 37.78316141)
397 160203114.0 OTHER OFFENSES -122.414850 37.761288 POINT (-122.4148498 37.76128793)
In [4]:
type(police_data)
Out[4]:
pandas.core.frame.DataFrame

Using the Shapely library, I created Point objects from the longitude and latitude values and stored them in a new column named geometry. I’ll use this Point data with GeoPandas.

From the GeoPandas documentation:

The most important property of a GeoDataFrame is that it always has one GeoSeries column that holds a special status. This GeoSeries is referred to as the GeoDataFrame’s “geometry”. When a spatial method is applied to a GeoDataFrame (or a spatial attribute like area is called), these commands will always act on the “geometry” column.

For simplicity, I set my column name to geometry.

To learn more about Shapely, its documentation is well worth a look.
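As a quick illustration (a minimal sketch, separate from the notebook above), a Shapely Point exposes its coordinates and some simple planar operations:

from shapely.geometry import Point

p = Point(-122.4, 37.8)
q = Point(-122.41, 37.79)
p.x, p.y                  # access the raw coordinates: (-122.4, 37.8)
p.distance(q)             # planar distance, in the same units as the coordinates (degrees here)
p.within(q.buffer(0.05))  # True if p falls inside a 0.05-degree buffer around q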

Create a GeoDataFrame

In [5]:
geo_police_data = geopandas.GeoDataFrame(police_data, geometry='geometry')
geo_police_data.crs = {'init': 'epsg:4326'}
In [6]:
geo_police_data.sample(5)
Out[6]:
incident_num category longitude latitude geometry
2179 160307669.0 DRUNKENNESS -122.418434 37.751420 POINT (-122.418434 37.75141993)
1207 170106100.0 WARRANTS -122.447842 37.788134 POINT (-122.447842 37.78813407)
373 179000652.0 LARCENY/THEFT -122.483800 37.782989 POINT (-122.4837996 37.78298908)
2315 160127934.0 ASSAULT -122.418749 37.755437 POINT (-122.4187487 37.75543719)
2005 160770436.0 OTHER OFFENSES -122.428255 37.767709 POINT (-122.4282547 37.76770909)
In [7]:
type(geo_police_data)
Out[7]:
geopandas.geodataframe.GeoDataFrame

Setting a Projection

In the second line of code in [5] above, I specified the Coordinate Reference System.

Again, from the GeoPandas documentation:

CRS are important because the geometric shapes in a GeoSeries or GeoDataFrame object are simply a collection of coordinates in an arbitrary space. A CRS tells Python how those coordinates relate to places on the Earth.

EPSG:4326

EPSG stands for the European Petroleum Survey Group, and EPSG:4326 (also known as WGS84 – World Geodetic System 1984) is the commonly used latitude / longitude coordinate system.
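Because EPSG:4326 coordinates are expressed in degrees, any calculation that needs real-world distances or areas is better done in a projected CRS. As a rough sketch (assuming GeoPandas’ to_crs method and EPSG:26910, NAD83 / UTM zone 10N, which covers San Francisco and works in metres):

# Reproject the points from lat/long (EPSG:4326) into a metric CRS
projected = geo_police_data.to_crs(epsg=26910)
projected.geometry.head()   # the POINT coordinates are now expressed in metres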

Visualising the data

Let’s take a look at the coordinates data:

In [8]:
geo_police_data.plot(figsize=(13,10), marker="o", mfc="red", markersize=8, markeredgecolor="black", alpha=0.3)
plt.show()

It may be difficult to tell because there’s no discernible map underneath the Points, but I now have map coordinates.

GeoJSON data of San Francisco

So what can I do with this data? This is where the feature engineering begins.

I also downloaded a GeoJSON file of the neighborhood boundaries in San Francisco from the same website. These are saved as Polygons rather than Points, and draw out the boundary lines for each of the neighborhoods.

In [9]:
sf = geopandas.read_file('sf.geojson')
sf.crs = {'init': 'epsg:4326'}
sf = sf.rename(columns={'nhood': 'neighborhood_name'}).set_geometry('geometry')
sf.sample(5)
Out[9]:
geometry neighborhood_name
16 (POLYGON ((-122.4266236816762 37.8088925471013… Marina
35 (POLYGON ((-122.4191679998507 37.7752779999068… Tenderloin
2 (POLYGON ((-122.4265550005568 37.7694849998470… Castro/Upper Market
25 (POLYGON ((-122.4723749999696 37.7825040000270… Outer Richmond
5 (POLYGON ((-122.3875252162534 37.7827973443872… Financial District/South Beach

Like before, I set the coordinate reference system and set the geometry column so that GeoPandas knows which column contains the geospatial data.
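As a quick sanity check (just a small sketch), you can confirm what GeoPandas has picked up from the file:

sf.crs                        # should report epsg:4326
sf.geometry.geom_type.head()  # Polygon / MultiPolygon for each neighborhood boundary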

Let’s take a look at the first Polygon:

In [10]:
sf[0:1].plot(figsize=(8,8))
plt.show()

And now all of the Polygons together:

In [11]:
sf.plot(figsize=(13,10), color='gray')
plt.show()

OK great, now I have the boundary coordinates for all of the neighborhoods in San Francisco.

Overlaying Points on top of Polygons

In [12]:
fig, ax = plt.subplots(1, figsize=(13,10))
sf_map = sf.plot(ax=ax, color='gray')
geo_police_data.plot(ax=sf_map, marker="o", mfc="red", markersize=8, markeredgecolor="black", alpha=0.3)
ax.set_title("San Francisco incidents of crime")
plt.show()

Now that I have the neighborhood boundaries plotted underneath the locations of the crime incidents, it’s easier to make sense of the data and see where most of the crimes tend to occur. The latitude and longitude values are starting to become genuinely useful.

Conducting a spatial join

Visually, I can see which neighborhood each incident occurred in, but if I want to include the neighborhood as a feature in my data set, I need to save these neighborhood labels as a new column in my DataFrame. This is where spatial joins come into play.

In [13]:
combined = geopandas.tools.sjoin(geo_police_data, sf, how='left')
combined.sample(5)
Out[13]:
incident_num category longitude latitude geometry index_right neighborhood_name
839 166152096.0 LARCENY/THEFT -122.409925 37.787764 POINT (-122.4099253 37.7877641) 35 Tenderloin
1991 160562001.0 VANDALISM -122.402210 37.728104 POINT (-122.4022101 37.72810422) 27 Portola
4122 160632858.0 ASSAULT -122.417757 37.769814 POINT (-122.4177567 37.76981424) 18 Mission
601 166165033.0 LARCENY/THEFT -122.376850 37.740194 POINT (-122.3768501 37.74019357) 0 Bayview Hunters Point
2493 161022052.0 OTHER OFFENSES -122.418810 37.781235 POINT (-122.4188096 37.78123526) 35 Tenderloin

And it’s as simple as that. Much like a join in the standard Pandas library, or a SQL left join, GeoPandas conducted a spatial join, identifying the neighborhood Polygon that each individual crime incident Point falls into.

My combined GeoDataFrame now has the incident data and the name of the neighborhood in which each incident occurred, which is a lot more useful than the raw latitude and longitude values on their own.
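Under the hood, the spatial join is essentially a point-in-polygon test for every incident (GeoPandas typically speeds this up with a spatial index). A rough equivalent for a single incident might look like this sketch:

# Which neighborhood polygon contains the first incident's Point?
point = geo_police_data.geometry.iloc[0]
sf.loc[sf.geometry.contains(point), 'neighborhood_name']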

Now I can do things like this…

Aggregating by Neighborhood and Creating a Choropleth Map

You can even generate choropleth maps using GeoPandas (where the colour of each shape is based on the value of an associated variable). By setting the number of crime incidents as that variable, I can see which neighborhoods had the highest number of crime incidents (darker colour = more crime):

In [14]:
neighborhood_counts = combined.groupby('neighborhood_name')['incident_num'].count().reset_index()
neighborhood_counts.columns = ['neighborhood_name','num_incidents']
neighborhood_counts = neighborhood_counts.sort_values(by='num_incidents', ascending=False)
In [15]:
neighborhood_counts.head()
Out[15]:
neighborhood_name num_incidents
32 South of Market 566
18 Mission 550
34 Tenderloin 510
5 Financial District/South Beach 362
0 Bayview Hunters Point 299
In [16]:
choropleth_data = neighborhood_counts.merge(sf)
choropleth_data = geopandas.GeoDataFrame(choropleth_data, geometry='geometry')
choropleth_data.crs = {'init': 'epsg:4326'}
In [17]:
choropleth_data.plot(column='num_incidents', cmap='OrRd', figsize=(13,10))
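One optional tweak (a small sketch, and dependent on your GeoPandas and matplotlib versions): passing legend=True adds a colour bar, so the scale of num_incidents is visible alongside the map.

choropleth_data.plot(column='num_incidents', cmap='OrRd', legend=True, figsize=(13,10))
plt.show()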
