This quick tutorial demonstrates some of the basic capabilities of ArcGIS GeoAnalytics Engine, including how to access and manipulate data through DataFrames, an overview of SQL functions and analysis tools, and how to visualize and save your results. To run the code samples in this tutorial you should have GeoAnalytics Engine installed and authorized in a running Spark session.
Creating DataFrames
GeoAnalytics Engine extends PySpark, the Python interface for Spark, and uses Spark DataFrames along with custom geometry data types to represent spatial data. A Spark DataFrame is like a Pandas DataFrame or a table in a relational database but is optimized for distributed queries.
GeoAnalytics Engine comes with several DataFrame extensions for reading from spatial data sources like shapefiles and feature services, in addition to any data source that PySpark supports. When reading from a shapefile or feature service, a geometry column will be created automatically. For other data sources, a geometry column can be created from text or binary columns using GeoAnalytics Engine functions.
The following example shows how to create a Spark DataFrame from a feature service of USA county boundaries, and then show the column names and types.
import geoanalytics
df = spark.read.format("feature-service").load("https://services.arcgis.com/P3ePLMYs2RVChkJx/ArcGIS/rest/services/USA_Census_Counties/FeatureServer/0")
df.printSchema()
root
|-- OBJECTID: long (nullable = false)
|-- NAME: string (nullable = true)
|-- STATE_NAME: string (nullable = true)
|-- STATE_ABBR: string (nullable = true)
|-- STATE_FIPS: string (nullable = true)
|-- COUNTY_FIPS: string (nullable = true)
|-- FIPS: string (nullable = true)
|-- POPULATION: integer (nullable = true)
|-- POP_SQMI: double (nullable = true)
|-- SQMI: double (nullable = true)
|-- Shape__Area: double (nullable = true)
|-- Shape__Length: double (nullable = true)
|-- shape: polygon (nullable = true)
Learn more about using Spark DataFrames with GeoAnalytics Engine
Using functions and tools
GeoAnalytics Engine includes two core modules for manipulating DataFrames:
geoanalytics.sql.functions
contains spatial functions that operate on
columns to do things like create or export geometries, identify spatial
relationships, generate bins, and more. These functions can be called through
python functions or by using SQL, similar to Spark SQL
functions.
The following example shows how to use a SQL function through Python.
import geoanalytics.sql.functions as ST
# Calculate the centroid of each county polygon
county_centroids = df.select("Name", ST.centroid("shape"))
# Display the first 5 rows of the result
county_centroids.show(5)
+--------------+--------------------+
| Name| ST_Centroid(shape)|
+--------------+--------------------+
|Autauga County|{"x":-86.64275399...|
|Baldwin County|{"x":-87.72434612...|
|Barbour County|{"x":-85.39320138...|
| Bibb County|{"x":-87.12644474...|
| Blount County|{"x":-86.56737589...|
+--------------+--------------------+
only showing top 5 rows
geoanalytics.tools
contains spatial and spatiotemporal analysis tools that
execute multi-step workflows on entire DataFrames using geometry, time,
and other values. These tools can only be called with their associated
Python classes.
from geoanalytics.tools import FindSimilarLocations
# Use Find Similar Locations to find counties with population count and density like Alexander County
fsl = FindSimilarLocations() \
.setAnalysisFields("POP_SQMI","POPULATION") \
.setMostOrLeastSimilar("MostSimilar") \
.setNumberOfResults(5) \
.setAppendFields("NAME", "STATE_NAME") \
.run(df.where("NAME = 'Alexander County'"), df.where("NAME != 'Alexander County'"))
# Show the result
fsl.select("simrank", "NAME", "STATE_NAME").filter("NAME is not NULL").sort("simrank").show()
+-------+----------------+--------------+
|simrank| NAME| STATE_NAME|
+-------+----------------+--------------+
| 1|St. James Parish| Louisiana|
| 2| Greene County|North Carolina|
| 3| Dakota County| Nebraska|
| 4| McDuffie County| Georgia|
| 5| Union County| Tennessee|
+-------+----------------+--------------+
Learn more about tools and SQL functions.
Exploring results
When scripting in a notebook-like environment, GeoAnalytics Engine supports simple visualization of spatial data with an included plotting API based on matplotlib.
# Plot counties in Georgia and symbolize on population
df.where("STATE_NAME = 'Georgia'").st.plot(cmap_values="POPULATION", cmap="RdPu", figsize=(6,6), basemap="light")
Any DataFrame can be persisted by writing it to a collection of shapefiles, vector tiles, or any data sink supported by PySpark.
# Write a DataFrame returned from tool as a collection of shapefiles to an S3 bucket
fsl.write.format("shapefile").save("s3a://my-bucket/fsl_result")
What next?
To get started, learn about using and loading data into DataFrames, running analysis and SQL functions, and visualizing results through the available guides and tutorials: