Get started

This quick tutorial demonstrates some of the basic capabilities of ArcGIS GeoAnalytics Engine, including how to access and manipulate data through DataFrames, an overview of SQL functions and analysis tools, and how to visualize and save your results. To run the code samples in this tutorial you should have GeoAnalytics Engine installed and authorized in a running Spark session.

Creating DataFrames

GeoAnalytics Engine includes a Python API and a Scala API that extend Spark with spatial capabilities. GeoAnalytics Engine uses Spark DataFrames along with custom geometry data types to represent spatial data. A Spark DataFrame is like a Pandas DataFrame or a table in a relational database but is optimized for distributed queries.

GeoAnalytics Engine comes with several DataFrame extensions for reading from spatial data sources like shapefiles and feature services, in addition to any data source that Spark supports. When reading from a shapefile or feature service, a geometry column will be created automatically. For other data sources, a geometry column can be created from text or binary columns using GeoAnalytics Engine functions.

The following example shows how to create a Spark DataFrame from a feature service of USA county boundaries, and then show the column names and types.

Python

Scala

Use dark colors for code blocksCopy

import geoanalytics
df = spark.read.format("feature-service").load("https://services.arcgis.com/P3ePLMYs2RVChkJx/ArcGIS/rest/services/USA_Census_Counties/FeatureServer/0")
df.printSchema()

Result
Use dark colors for code blocksCopy
root
 |-- OBJECTID: long (nullable = false)
 |-- NAME: string (nullable = true)
 |-- STATE_NAME: string (nullable = true)
 |-- STATE_ABBR: string (nullable = true)
 |-- STATE_FIPS: string (nullable = true)
 |-- COUNTY_FIPS: string (nullable = true)
 |-- FIPS: string (nullable = true)
 |-- POPULATION: integer (nullable = true)
 |-- POP_SQMI: double (nullable = true)
 |-- SQMI: double (nullable = true)
 |-- Shape__Area: double (nullable = true)
 |-- Shape__Length: double (nullable = true)
 |-- shape: polygon (nullable = true)

Learn more about using Spark DataFrames with GeoAnalytics Engine

Using functions and tools

GeoAnalytics Engine includes three core modules for manipulating DataFrames:

geoanalytics.sql.functions contains spatial functions that operate on columns to do things like create or export geometries, identify spatial relationships, generate bins, and more. These functions can be called through python functions or by using SQL, similar to Spark SQL functions.

The following example shows how to use a SQL function through Python or Scala.

Python

Scala

Use dark colors for code blocksCopy

import geoanalytics.sql.functions as ST
# Calculate the centroid of each county polygon
county_centroids = df.select("Name", ST.centroid("shape"))
# Display the first 5 rows of the result
county_centroids.show(5)

Result
Use dark colors for code blocksCopy
+--------------+--------------------+
|          Name|  ST_Centroid(shape)|
+--------------+--------------------+
|Autauga County|{"x":-86.64275399...|
|Baldwin County|{"x":-87.72434612...|
|Barbour County|{"x":-85.39320138...|
|   Bibb County|{"x":-87.12644474...|
| Blount County|{"x":-86.56737589...|
+--------------+--------------------+
only showing top 5 rows

geoanalytics.tracks.functions contains spatial functions for managing and analyzing track data. Tracks are linestrings that represent the change in an entity's location over time. These functions can be called through Python or Scala functions or by using SQL, similar to Spark SQL functions.

The following example shows how to use a track function through Python or Scala.

Python

Scala

Use dark colors for code blocksCopy

from geoanalytics.tracks import functions as TRK
from geoanalytics.sql import functions as ST
from pyspark.sql import functions as F

data = [
    ("LINESTRING M (-117.27 34.05 1633455010, -117.22 33.91 1633456062, -116.96 33.64 1633457132)",),
    ("LINESTRING M (-116.89 33.96 1633575895, -116.71 34.01 1633576982, -116.66 34.08 1633577061)",),
    ("LINESTRING M (-116.24 33.88 1633575234, -116.33 34.02 1633576336)",)
]

# Create tracks from WKT
trk_df = spark.createDataFrame(data, ["wkt",]) \
         .withColumn("track", ST.line_from_text("wkt", srid=4326))

# Calculate the length of each track and display it
trk_df.select(F.round(TRK.length("track", "miles"), 3).alias("length")).show()

Result
Use dark colors for code blocksCopy
+------+
|length|
+------+
|33.947|
|16.507|
|10.947|
+------+

geoanalytics.tools contains spatial and spatiotemporal analysis tools that execute multi-step workflows on entire DataFrames using geometry, time, and other values. These tools can only be called with their associated Python classes.

Python
Use dark colors for code blocksCopy

from geoanalytics.tools import FindSimilarLocations
# Use Find Similar Locations to find counties with population count and density like Alexander County
fsl = FindSimilarLocations() \
	.setAnalysisFields("POP_SQMI","POPULATION") \
	.setMostOrLeastSimilar("MostSimilar") \
	.setNumberOfResults(5) \
    .setAppendFields("NAME", "STATE_NAME") \
	.run(df.where("NAME = 'Alexander County'"), df.where("NAME != 'Alexander County'"))

# Show the result
fsl.select("simrank", "NAME", "STATE_NAME").filter("NAME is not NULL").sort("simrank").show()

Result
Use dark colors for code blocksCopy
+-------+----------------+--------------+
|simrank|            NAME|    STATE_NAME|
+-------+----------------+--------------+
|      1|St. James Parish|     Louisiana|
|      2|   Greene County|North Carolina|
|      3|   Dakota County|      Nebraska|
|      4| McDuffie County|       Georgia|
|      5|    Union County|     Tennessee|
+-------+----------------+--------------+

Learn more about tools and SQL functions.

Exploring results

When scripting in a notebook-like environment, GeoAnalytics Engine supports simple visualization of spatial data with an included plotting API based on matplotlib.

Python
Use dark colors for code blocksCopy

# Plot counties in Georgia and symbolize on population
df.where("STATE_NAME = 'Georgia'").st.plot(cmap_values="POPULATION", cmap="RdPu", figsize=(6,6), basemap="light")

Any DataFrame can be persisted by writing it to a collection of shapefiles, vector tiles, or any data sink supported by Spark.

Python
Use dark colors for code blocksCopy

# Write a DataFrame returned from tool as a collection of shapefiles to an S3 bucket
fsl.write.format("shapefile").save("s3a://my-bucket/fsl_result")

What next?

To get started, learn about using and loading data into DataFrames, running analysis and SQL functions, and visualizing results through the available guides and tutorials: