Part-1 Introduction to Spatially enabled DataFrame

Introduction

A DataFrame is a fundamental Pandas data structure that represents a rectangular table of data and contains an ordered collection of columns. You can think of it as a spreadsheet or a SQL table where each column has a column name for reference and each row can be accessed by using row numbers.

The Spatially enabled DataFrame (SeDF) adds "spatial abilities" into the popular Pandas DataFrame by inserting a custom namespace called spatial. This namespace (also known as accessor) allows us to use Pandas operations on both the non-spatial and spatial columns. With SeDF, you can now easily manipulate geometric and other attribute data.

The SeDF is based on data structures inherently suited to data analysis, with natural operations for the filtering and inspection of subsets of values which are fundamental to statistical and geographic manipulations.

Note: Spatial Data Engineering using SeDF builds on top of core Data Engineering concepts in Python. If you are new to Pandas, NumPy and related libraries, we recommend you start with the Introduction to Data Engineering guide series and then come here.

What does "adding spatial abilities" mean?

Well, it means adding capabilities that allow us to:

take spatial data as input
visualize the spatial data
perform various geospatial operations on it
export, publish or save spatial data

To add "spatial abilities", a SeDF must be created from the data, and to create a SeDF, the data must be spatial. In other words, the dataset must have location information (such as an address or latitude and longitude coordinates) or geometry information (such as a point, line, or polygon). There are various ways to create a SeDF, and we will go into those details in part-2 of the guide series.

In the background, SeDF uses the spatial namespace to add a SHAPE column to the data. The SHAPE column is of a special data type called geometry and it holds the geometry for each record in the DataFrame. When a spatial method such as plot() is applied to a SeDF (or a spatial property such as geometry_type is accessed), the operation always acts on the geometry column SHAPE.

The image below shows a SeDF created from a Pandas DataFrame. A new SHAPE column, highlighted in red, gets added to the SeDF.

Custom Namespaces

The GeoAccessor and the GeoSeriesAccessor classes, from the arcgis.features module, add two custom namespaces to a given Pandas DataFrame or Series. The GeoAccessor class adds the spatial namespace to the DataFrame and the GeoSeriesAccessor class adds the geom namespace to the Series.

By adding custom namespaces, we extend the capabilities of Pandas to allow for spatial operations using various geometry objects. The different geometry objects supported by these namespaces are:

Point
Polyline
Polygon

You can learn more about these geometry objects in our Working with Geometries guide series.

The `spatial` namespace

The spatial namespace allows us to perform spatial operations on a given Pandas DataFrame. The namespace provides:

Dataset level operations
Dataset information
Input/Output operations

The spatial namespace can be accessed using the .spatial accessor pattern. E.g.:

a. The centroid of a dataframe can be retrieved using the centroid property.

>>> df.spatial.centroid

b. The plot() method can be used to draw the data on a web map.

>>> df.spatial.plot()

The `geom` namespace

The geom namespace enables spatial operations on a given Pandas Series. The namespace is accessible using the .geom accessor pattern. E.g.:

a. The area property can be used to retrieve the area of each geometry in the Series.

>>> df.SHAPE.geom.area

b. The buffer() method can be used to construct a Polygon at a specified distance from each Geometry object.

>>> df.SHAPE.geom.buffer(distance=2)

Note that the geom accessor operates on a Series of data type "geometry". The SHAPE column of a SeDF is of geometry data type.

Importing namespaces

GeoSeriesAccessor and GeoAccessor classes are similar to pandas.Series and pandas.DataFrame objects. However, you do not work with them directly. Instead, you import them right after you import Pandas as shown in the snippet below. Importing these classes registers the spatial functionality with Pandas and allows you to start performing spatial operations on your DataFrames.

You may import the classes as follows:

import pandas as pd
from arcgis.features import GeoAccessor, GeoSeriesAccessor
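
To make the idea of a custom namespace more concrete, here is a minimal sketch of how accessors can be registered with Pandas in general, using the public decorators in pandas.api.extensions. The demo and demo_geom namespaces, the DemoAccessor and DemoSeriesAccessor classes, and the x and y column names are all hypothetical names invented for this illustration. This is not how GeoAccessor or GeoSeriesAccessor are implemented; it only shows the same general pattern of attaching a namespace to a DataFrame or a Series.

```python
import pandas as pd


@pd.api.extensions.register_dataframe_accessor("demo")
class DemoAccessor:
    """Hypothetical DataFrame namespace, reachable as df.demo."""

    def __init__(self, pandas_obj):
        self._df = pandas_obj

    @property
    def centroid(self):
        # Mean of the x and y columns (the column names are assumptions).
        return (float(self._df["x"].mean()), float(self._df["y"].mean()))


@pd.api.extensions.register_series_accessor("demo_geom")
class DemoSeriesAccessor:
    """Hypothetical Series namespace, reachable as series.demo_geom."""

    def __init__(self, pandas_obj):
        self._series = pandas_obj

    def shifted(self, distance):
        # Return a new Series; the original Series is left unchanged.
        return self._series + distance


df = pd.DataFrame({"x": [0.0, 2.0, 4.0], "y": [1.0, 1.0, 4.0]})
demo_centroid = df.demo.centroid
demo_shifted = df["x"].demo_geom.shifted(distance=2).tolist()
print(demo_centroid)
print(demo_shifted)
```

Note that df is still an ordinary pandas.DataFrame after the accessor is registered. The namespace only wraps the object it is attached to, which is why regular Pandas operations keep working alongside the new ones.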

Geometry Engines

The ArcGIS API for Python uses either shapely or arcpy as back-ends (engines) for processing geometries. The API is identical no matter which engine you use. However, at any point in time, only one engine will be used.

ArcPy provides a useful and productive way to perform geographic data analysis, data conversion, data management, and map automation with Python. With arcpy as the geometry engine, you can read/write different file types, perform various geometric operations and do a lot more without needing multiple other third-party packages that perform such operations.

By default, the ArcGIS API for Python looks for arcpy as the geometry engine. In the absence of arcpy, it looks for shapely. The ArcGIS API for Python integrates the Shapely, Fiona, and PyShp packages so that spatial data from other sources can be accessed through the API. This makes it easier to use the ArcGIS API for Python and work with geospatial data regardless of the platform used. However, we recommend using arcpy for better accuracy and support for a wider gamut of data sources. Here is a one-line overview of each of these packages:

Shapely is used for the manipulation and analysis of geometric objects.
Fiona can read and write real-world data using multi-layered GIS formats, including Esri File Geodatabase. It is often used in combination with Shapely so that Fiona is used for creating the input and output, while Shapely does the data wrangling part.
PyShp is used for reading and writing ESRI shapefiles.

Note: In the absence of arcpy, the ArcGIS API for Python looks for a shapely geometry engine. To allow for a seamless experience, both Shapely and Fiona packages must be present in your current conda environment. If these packages are not installed, you may install them using conda as follows:

conda install shapely
conda install fiona

It is possible that neither arcpy nor shapely is present in your current environment. In that case, the number of spatial operations you can perform using SeDF will be extremely limited. The cell below shows how to check which of these packages are available in your environment.

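from importlib.util import find_spec

# find_spec returns None (instead of raising) when a package is not installed
has_arcpy = find_spec('arcpy') is not None
has_shapely = find_spec('shapely') is not None

# Check the "both" case first so that it is not shadowed by the single checks
if has_arcpy and has_shapely:
    print("Has both arcpy and shapely")
elif has_arcpy:
    print("Has arcpy")
elif has_shapely:
    print("Has shapely")
else:
    print("Does not have either arcpy or shapely")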

Has arcpy

So far, we have gone through some of the basics of the Spatially enabled DataFrame. Now, it's time to see the spatial and geom namespaces in action.

Quick Example

Let's look at a quick example showcasing the spatial and geom namespaces at work. We will start with a common use case of importing the data from a csv file.

In this example, we will:

read the data with location information from a csv file into a Pandas DataFrame
create a SeDF from the Pandas DataFrame
check some properties of the SeDF
apply spatial operations on the geometry column using the geom accessor
plot the SeDF on a map

Data: We will use the Covid-19 data for Nursing Homes in the U.S. to illustrate this example. The data has 124 records and 10 columns.

Note: the dataset used in this example is a subset of Covid-19 Nursing Home data and has been curated for illustration purposes. The complete dataset is available at the Centers for Medicare & Medicaid Services (CMS) website.

# Import Libraries

import pandas as pd
from arcgis.features import GeoAccessor, GeoSeriesAccessor
from arcgis.gis import GIS
from IPython.display import display

# Create an anonymous GIS Connection
gis = GIS()

Get Data

# Read the data
df = pd.read_csv('../data/sample_cms_data.csv')

# Return the first 5 records
df.head()

	Provider Name	Provider City	Provider State	Residents Total Admissions COVID-19	Residents Total COVID-19 Cases	Residents Total COVID-19 Deaths	Number of All Beds	Total Number of Occupied Beds	LONGITUDE	LATITUDE
0	GROSSE POINTE MANOR	NILES	IL	5	56	12	99	61	-87.792973	42.012012
1	MILLER'S MERRY MANOR	DUNKIRK	IN	0	0	0	46	43	-85.197651	40.392722
2	PARKWAY MANOR	MARION	IL	0	0	0	131	84	-88.982944	37.750143
3	AVANTARA LONG GROVE	LONG GROVE	IL	6	141	0	195	131	-87.986442	42.160843
4	HARMONY NURSING & REHAB CENTER	CHICAGO	IL	19	75	16	180	116	-87.726353	41.975505

# Check Shape
df.shape

(124, 10)

The dataset contains 124 records and 10 columns. Each record represents a nursing home in the states of Indiana and Illinois. Each column contains information about the nursing home such as:

Name of the nursing home, its city and state
Details of resident Covid cases, deaths and number of beds
Location of nursing home as Latitude and Longitude

Create a SeDF

A Spatially enabled DataFrame can be created from any Pandas DataFrame with location information (Latitude and Longitude) using the from_xy() method of the spatial namespace.

# Read into a SeDF
sedf = pd.DataFrame.spatial.from_xy(df=df, x_column='LONGITUDE', y_column='LATITUDE', sr=4326)

# Check head
sedf.head()

	Provider Name	Provider City	Provider State	Residents Total Admissions COVID-19	Residents Total COVID-19 Cases	Residents Total COVID-19 Deaths	Number of All Beds	Total Number of Occupied Beds	LONGITUDE	LATITUDE	SHAPE
0	GROSSE POINTE MANOR	NILES	IL	5	56	12	99	61	-87.792973	42.012012	{"spatialReference": {"wkid": 4326}, "x": -87....
1	MILLER'S MERRY MANOR	DUNKIRK	IN	0	0	0	46	43	-85.197651	40.392722	{"spatialReference": {"wkid": 4326}, "x": -85....
2	PARKWAY MANOR	MARION	IL	0	0	0	131	84	-88.982944	37.750143	{"spatialReference": {"wkid": 4326}, "x": -88....
3	AVANTARA LONG GROVE	LONG GROVE	IL	6	141	0	195	131	-87.986442	42.160843	{"spatialReference": {"wkid": 4326}, "x": -87....
4	HARMONY NURSING & REHAB CENTER	CHICAGO	IL	19	75	16	180	116	-87.726353	41.975505	{"spatialReference": {"wkid": 4326}, "x": -87....

We can see that a new SHAPE column has been added while creating a SeDF.
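
The SHAPE values in the table above are truncated, but they start with a spatialReference key followed by an x key. As a rough illustration of what a complete point record might look like, the sketch below builds point dictionaries in the Esri JSON point layout for the first two rows. The to_point_dict helper is hypothetical, and the y key is an assumption based on that layout, since it is not visible in the truncated output. This is not what from_xy() does internally; it only shows that longitude maps to x and latitude maps to y.

```python
import json

# First two rows of the sample data shown earlier (name, longitude, latitude).
rows = [
    ("GROSSE POINTE MANOR", -87.792973, 42.012012),
    ("MILLER'S MERRY MANOR", -85.197651, 40.392722),
]


def to_point_dict(x, y, wkid=4326):
    """Hypothetical helper: build an Esri-JSON-style point dictionary."""
    return {"spatialReference": {"wkid": wkid}, "x": x, "y": y}


# Longitude is the x coordinate and latitude is the y coordinate.
points = [to_point_dict(lon, lat) for _, lon, lat in rows]
for point in points:
    print(json.dumps(point))
```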

Let's look at the detailed information of the DataFrame.

# Check info
sedf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 124 entries, 0 to 123
Data columns (total 11 columns):
 #   Column                               Non-Null Count  Dtype   
---  ------                               --------------  -----   
 0   Provider Name                        124 non-null    object  
 1   Provider City                        124 non-null    object  
 2   Provider State                       124 non-null    object  
 3   Residents Total Admissions COVID-19  124 non-null    int64   
 4   Residents Total COVID-19 Cases       124 non-null    int64   
 5   Residents Total COVID-19 Deaths      124 non-null    int64   
 6   Number of All Beds                   124 non-null    int64   
 7   Total Number of Occupied Beds        124 non-null    int64   
 8   LONGITUDE                            124 non-null    float64 
 9   LATITUDE                             124 non-null    float64 
 10  SHAPE                                124 non-null    geometry
dtypes: float64(2), geometry(1), int64(5), object(3)
memory usage: 10.8+ KB

Here, we see that the SHAPE column is of geometry data type.

Check Properties of a SeDF

We just created a SeDF. Let's use the spatial namespace to check some properties of the SeDF.

# Check geometry type
sedf.spatial.geometry_type

['point']

# Visualize geometry
sedf.SHAPE[0]

The geometry_type tells us that our dataset is point data.

# Get true centroid
sedf.spatial.true_centroid

(-87.16989602419355, 40.383302290322575)

The true_centroid property retrieves the true centroid of the DataFrame.

# Get full extent
sedf.spatial.full_extent

(-90.67644, 37.002806, -84.861849, 42.380225)

The full_extent property retrieves the extent of the data in our DataFrame.
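
The extent tuple above appears to be ordered as (xmin, ymin, xmax, ymax): the first and third values are longitudes, and the second and fourth are latitudes. The sketch below computes the same kind of tuple in plain Python for the five records shown by df.head(). The bounding_box helper is a hypothetical name used only for this illustration, and because it covers only five rows, its result is a smaller box that should fall inside the full extent reported for all 124 records.

```python
# (longitude, latitude) of the five records shown by df.head()
coords = [
    (-87.792973, 42.012012),
    (-85.197651, 40.392722),
    (-88.982944, 37.750143),
    (-87.986442, 42.160843),
    (-87.726353, 41.975505),
]


def bounding_box(points):
    """Hypothetical helper: return (xmin, ymin, xmax, ymax) for (x, y) pairs."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))


extent = bounding_box(coords)
print(extent)

# Full extent reported by sedf.spatial.full_extent for the whole dataset
full_extent = (-90.67644, 37.002806, -84.861849, 42.380225)
inside_full_extent = (
    extent[0] >= full_extent[0]
    and extent[1] >= full_extent[1]
    and extent[2] <= full_extent[2]
    and extent[3] <= full_extent[3]
)
print(inside_full_extent)
```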

Apply spatial operations using `.geom`

Let's use the geom namespace to apply spatial operations on the geometry column of the SeDF.

Add buffers

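We will use the buffer() method to create a 2-unit buffer around each nursing home and add the buffers as a new column to the data. The distance is expressed in the units of the spatial reference; for WKID 4326 these are decimal degrees.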

# Create buffer
sedf['buffer_2'] = sedf.SHAPE.geom.buffer(distance=2)

# Check head
sedf['buffer_2'].head()

0    {"curveRings": [[[-87.792973, 44.012012], {"a"...
1    {"curveRings": [[[-85.197651, 42.392722], {"a"...
2    {"curveRings": [[[-88.982944, 39.750143], {"a"...
3    {"curveRings": [[[-87.986442, 44.160843], {"a"...
4    {"curveRings": [[[-87.726353, 43.975505], {"a"...
Name: buffer_2, dtype: geometry

# Visualize a buffer geometry
sedf['buffer_2'][0]

We can see that the buffers created are of geometry data type.

# Get area
sedf.buffer_2.geom.area

0      12.566371
1      12.566371
2      12.566371
3      12.566371
4      12.566371
         ...    
119    12.566371
120    12.566371
121    12.566371
122    12.566371
123    12.566371
Name: area, Length: 124, dtype: object

The area property retrieves the area of each buffer in the square units of the DataFrame's spatial reference.
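
Every buffer has the same area because every buffer is a circle with the same radius. As a quick check, the value 12.566371 is what the planar formula for the area of a circle gives for a radius of 2:

```python
import math

radius = 2  # same value that was passed as distance to buffer()
circle_area = math.pi * radius ** 2
print(round(circle_area, 6))
```

Since the spatial reference is WKID 4326, the radius is in decimal degrees, so this number is in square degrees and should not be read as an area on the ground.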

Now that we have created a new buffer_2 column, our data should have two columns of geometry data type i.e. SHAPE and buffer_2. Let's check.

# Check info
sedf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 124 entries, 0 to 123
Data columns (total 12 columns):
 #   Column                               Non-Null Count  Dtype   
---  ------                               --------------  -----   
 0   Provider Name                        124 non-null    object  
 1   Provider City                        124 non-null    object  
 2   Provider State                       124 non-null    object  
 3   Residents Total Admissions COVID-19  124 non-null    int64   
 4   Residents Total COVID-19 Cases       124 non-null    int64   
 5   Residents Total COVID-19 Deaths      124 non-null    int64   
 6   Number of All Beds                   124 non-null    int64   
 7   Total Number of Occupied Beds        124 non-null    int64   
 8   LONGITUDE                            124 non-null    float64 
 9   LATITUDE                             124 non-null    float64 
 10  SHAPE                                124 non-null    geometry
 11  buffer_2                             124 non-null    geometry
dtypes: float64(2), geometry(2), int64(5), object(3)
memory usage: 11.8+ KB

Calculate distance

Let's calculate the distance from one nursing home to another. We will use the distance_to() method to calculate distance to a given geometry.

# Calculate distance to the first nursing home
sedf.SHAPE.geom.distance_to(sedf.SHAPE[0])

0           0.0
1      3.059052
2      4.424879
3      0.244092
4      0.075967
         ...   
119    0.462069
120    0.116743
121    2.951358
122    0.144996
123    4.055468
Name: distance_to, Length: 124, dtype: object
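
The distances above are in the units of the spatial reference, which are decimal degrees for WKID 4326. The sketch below is a rough check in plain Python, assuming a simple straight-line (planar) distance between the coordinates of the first two records; it is not a description of how distance_to() is implemented. The result agrees with the 3.059052 shown for record 1.

```python
import math

# (longitude, latitude) of the first two records shown by df.head()
first = (-87.792973, 42.012012)
second = (-85.197651, 40.392722)

# Straight-line distance in coordinate units (decimal degrees here)
planar = math.hypot(second[0] - first[0], second[1] - first[1])
print(round(planar, 6))
```

A distance in degrees is not a distance on the ground. To measure in meters or miles, the data would first need to be in a projected coordinate system, or a geodesic method would be needed.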

We just performed some spatial operations on a pandas Series (SHAPE) using the geom namespace. Now, let's perform some basic Pandas operations on SeDF.

Perform Pandas Operations on a SeDF

Let's perform some basic Pandas operations on a SeDF. One of the benefits of the accessor pattern in SeDF is that the SeDF object is of type DataFrame. Thus, you can continue to perform regular Pandas DataFrame operations. We will:

Check the count of records for each state in our data
Remove records that have 0 cases or 0 deaths
Create a scatter plot of cases and deaths

# Check record count for each state
sedf['Provider State'].value_counts()

IN    67
IL    57
Name: Provider State, dtype: int64

# Remove records with no cases or no deaths
new_df = sedf.query('`Residents Total COVID-19 Cases` != 0 & \
                    `Residents Total COVID-19 Deaths` != 0').copy()
new_df.head()

	Provider Name	Provider City	Provider State	Residents Total Admissions COVID-19	Residents Total COVID-19 Cases	Residents Total COVID-19 Deaths	Number of All Beds	Total Number of Occupied Beds	LONGITUDE	LATITUDE	SHAPE	buffer_2
0	GROSSE POINTE MANOR	NILES	IL	5	56	12	99	61	-87.792973	42.012012	{"spatialReference": {"wkid": 4326}, "x": -87....	{"curveRings": [[[-87.792973, 44.012012], {"a"...
4	HARMONY NURSING & REHAB CENTER	CHICAGO	IL	19	75	16	180	116	-87.726353	41.975505	{"spatialReference": {"wkid": 4326}, "x": -87....	{"curveRings": [[[-87.726353, 43.975505], {"a"...
6	HARCOURT TERRACE NURSING AND REHABILITATION	INDIANAPOLIS	IN	2	1	1	110	66	-86.193469	39.904128	{"spatialReference": {"wkid": 4326}, "x": -86....	{"curveRings": [[[-86.193469, 41.904128], {"a"...
7	GREENCROFT HEALTHCARE	GOSHEN	IN	3	65	13	153	155	-85.817798	41.561063	{"spatialReference": {"wkid": 4326}, "x": -85....	{"curveRings": [[[-85.817798, 43.561063], {"a"...
8	WATERS OF MARTINSVILLE, THE	MARTINSVILLE	IN	2	33	8	103	44	-86.432593	39.407438	{"spatialReference": {"wkid": 4326}, "x": -86....	{"curveRings": [[[-86.432593, 41.407438], {"a"...

# Check shape
new_df.shape

(37, 12)
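
The query above keeps a record only when both the case count and the death count are non-zero. The sketch below applies the same filter to the case and death values of the five records shown by df.head(), once with query() and once with an equivalent boolean mask. The head_rows DataFrame is built by hand for this illustration.

```python
import pandas as pd

# Case and death counts of the five records shown by df.head()
head_rows = pd.DataFrame({
    "Residents Total COVID-19 Cases": [56, 0, 0, 141, 75],
    "Residents Total COVID-19 Deaths": [12, 0, 0, 0, 16],
})

# Backticks let query() refer to column names containing spaces or hyphens.
by_query = head_rows.query(
    "`Residents Total COVID-19 Cases` != 0 and "
    "`Residents Total COVID-19 Deaths` != 0"
)

# The same filter written as a boolean mask.
mask = (head_rows["Residents Total COVID-19 Cases"] != 0) & (
    head_rows["Residents Total COVID-19 Deaths"] != 0
)
by_mask = head_rows[mask]

print(by_query.index.tolist())
print(by_query.equals(by_mask))
```

Only records 0 and 4 remain, which matches the first two rows of new_df.head(). Record 3 is dropped because it has cases but no deaths.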

# Plot cases and deaths
new_df.plot('Residents Total COVID-19 Cases',
            'Residents Total COVID-19 Deaths',
            kind='scatter',
            title="Cases vs Deaths");

We just saw how easy it was to perform some Pandas data selection and manipulation operations on a SeDF... piece of cake! Now, let's plot the complete data on a map.

Note - If you would like to learn more about Pandas and data engineering with Pandas, check out our Data Engineering primer guide part-3.

Plot on a Map

We will use the plot() method of the spatial namespace to plot the SeDF on a map.

# Create Map
m1 = gis.map('IL, USA')
m1

Points displayed on the map show the location of each nursing home in our data with at least 1 case and 1 death. Clicking on a point displays attribute information for that nursing home.

# Plot SeDF on a map
new_df.spatial.plot(m1)

True

With Spatially enabled DataFrame, you can now perform a variety of geospatial operations such as creating buffers, calculating the distance to another geometry or plotting your data on a map, and rendering it using various renderers. While you are at it, you can continue to perform various operations on the DataFrame using Pandas or other open-source libraries such as Seaborn, Scikit-learn, etc. Isn't that exciting!

Conclusion

A DataFrame is a fundamental Pandas data structure and a building block for performing various scientific computations in Python. In this part of the guide series, we introduced the concept of Spatially enabled DataFrame (SeDF) and how it adds "spatial" abilities to a Pandas DataFrame or Series. We also discussed the custom namespaces and geometry engines that operate behind the scenes and allow us to perform spatial operations. You have also seen an end-to-end example of using SeDF to perform various spatial operations along with Pandas operations.

In the next part of this guide series, you will learn about creating a SeDF using GIS data in various formats.

Note: Given the importance and popularity of Spatially enabled DataFrame, we are revisiting our documentation for this topic. Our goal is to enhance the existing documentation to showcase the various capabilities of Spatially enabled DataFrame in detail with even more examples this time.

Creating quality documentation is time-consuming and exhaustive but we are committed to providing you with the best experience possible. With that in mind, we will be rolling out the revamped guides on this topic as different parts of a guide series (like the Data Engineering or Geometry guide series). This is "part-1" of the guide series for Spatially enabled DataFrame. You will continue to see the existing documentation as we revamp it to add new parts. Stay tuned for more on this topic.