Spatial joins

GeoAnalytics Engine includes a spatial join that combines rows from two DataFrames based on a spatial relationship. This join generates spatial indices for your data automatically at runtime to improve performance. You can perform a spatial join with either Spark SQL or the Spatiotemporal Join tool. Both options utilize the same underlying spatial join implementation included with GeoAnalytics Engine. Differences between the two options and examples of when you might use one versus the other are detailed below.

Spatial joins in Spark SQL

To perform a spatial join with Spark SQL, you must use a GeoAnalytics Engine SQL function in the join condition expression. For example, if you want to join two DataFrames based on where one DataFrame's geometries are within the other, you could use ST_Within in the join condition expression. The following example shows joining a DataFrame of wind turbine locations to a DataFrame of burn area polygons using the Spark SQL join. Because a GeoAnalytics Engine SQL function is part of the join condition, spatial indices are generated automatically and a spatial join is performed.

SQL
Use dark colors for code blocksCopy

burn_areas_url = "https://services.arcgis.com/P3ePLMYs2RVChkJx/arcgis/rest/services/MTBS_Polygons_v1/FeatureServer/0"
turbines_url = "https://services.arcgis.com/P3ePLMYs2RVChkJx/arcgis/rest/services/US_Wind_Turbine_Database/FeatureServer/0"

spark.read.format("feature-service").load(burn_areas_url) \
     .withColumn("polygon", ST.transform("shape", 102007)) \
     .createOrReplaceTempView("burn_areas")
spark.read.format("feature-service").load(turbines_url) \
     .withColumn("point", ST.transform("shape", 102007)) \
     .createOrReplaceTempView("turbines")

spark.sql(
    """
    SELECT COUNT(*) AS `Count of turbines in burn areas`
    FROM turbines JOIN burn_areas ON
    ST_Within(turbines.point, burn_areas.polygon)
    """
).show()

Result
Use dark colors for code blocksCopy
+-------------------------------+
|Count of turbines in burn areas|
+-------------------------------+
|                           4812|
+-------------------------------+

You can also perform the same spatial join with Python only using pyspark.sql.DataFrame.join, as shown in the following example.

Python
Use dark colors for code blocksCopy

burn_areas_url = "https://services.arcgis.com/P3ePLMYs2RVChkJx/arcgis/rest/services/MTBS_Polygons_v1/FeatureServer/0"
turbines_url = "https://services.arcgis.com/P3ePLMYs2RVChkJx/arcgis/rest/services/US_Wind_Turbine_Database/FeatureServer/0"

burn_areas = spark.read.format("feature-service").load(burn_areas_url) \
                  .select(ST.transform("shape", 102007).alias("polygon"))
turbines = spark.read.format("feature-service").load(turbines_url) \
                .select(ST.transform("shape", 102007).alias("point"))

result = turbines.join(burn_areas, ST.within(turbines.point, burn_areas.polygon))

print("Count of turbines in burn areas:")
result.count()

Result
Use dark colors for code blocksCopy
Count of turbines in burn areas:
4812

You can verify that a spatial join is being used by Spark SQL by calling explain on the join result and looking for SpatialJoin in the physical plan string, as shown below.

Python
Use dark colors for code blocksCopy

result.explain()

Result
Use dark colors for code blocksCopy
== Physical Plan ==
SpatialJoin point#928: point, polygon#868: polygon, SpatialXYRelationship(Within), Default, false
:- *(1) Project [ST_Transform(shape#898, in=WGS_1984_Web_Mercator_Auxiliary_Sphere:3857, out=Hawaii_Albers_Equal_Area_Conic:102007, datum_transform="WGS_1984_(ITRF00)_To_NAD_1983") AS point#928]
:  +- BatchScan Wind Turbines[shape#898] FeatureServiceLayerScan[f=json, query="resultType=standard&returnZ=true&where=1=1&returnM=true&outFields=", paging=oid-range(field=FID,size=32000,count=3)] RuntimeFilters: []
+- *(2) Project [ST_Transform(shape#857, in=GCS_North_American_1983:4269, out=Hawaii_Albers_Equal_Area_Conic:102007) AS polygon#868]
   +- BatchScan MTBS_Polygons[shape#857] FeatureServiceLayerScan[f=json, query="resultType=standard&returnZ=true&where=1=1&returnM=true&outFields=", paging=oid-range(field=OBJECTID,size=4000,count=8)] RuntimeFilters: []

Spatiotemporal Join tool

The Spatiotemporal Join tool offers similar functionality to spatially joining with Spark SQL but also supports joining on temporal relationships and attribute relationships. See the tool documentation for more information. The table below summarizes some key differences between performing spatial joins with PySpark or Spark SQL vs. the Spatiotemporal Join tool.

	Spark SQL join	Spatiotemporal Join tool
Usage	Can be called with SQL or Python syntax.	Available in Python only as a tool.
Spatial relationships	Use one or more of 9 SQL functions to define a spatial relationship. Using the following functions in your join condition expression will trigger a spatial join: ST_Contains ST_Crosses ST_DWithin ST_Equals ST_Intersects ST_Overlaps ST_Relate ST_Touches ST_Within	Choose from one of 9 spatial relationships. These relationships are evaluated using the primary geometry column of each DataFrame. Available spatial relationships include: Contains Crosses Equals Intersects NearGeodesic NearPlanar Overlaps Touches Within
Temporal relationships	Use unix timestamps and Spark SQL to define temporal relationships.	Choose from one of 14 temporal relationships. These relationships are evaluated using the primary time column(s) of each DataFrame. Available temporal relationships include: Contains During Equals Finishes Finished by Intersects Meets Met by Near Near after Near before Overlaps Overlapped by Starts Started by
Attribute relationships	Use Spark SQL to define an attribute relationship in your join expression.	Use ArcGIS Arcade to define an attribute relationship.

Performance considerations & best practices

When using ST_DWithin or the Near spatial relationships, transform your data to a projected spatial reference that preserves distances between geometries for the most accurate results. For more information see Choosing a projected coordinate system.
When using ST_DWithin or the Near spatial relationships, a larger near distance will be slower. To improve performance of near joins across large distances, increase the ratio of memory to cores in your Spark cluster.
Spatial joins are memory-intensive operations. It is recommended that you have at least 8 GB of memory available per core in your Spark cluster to allow for optimal performance.
DataFrames with geometries that are highly clustered will be slower than if the geometries are dispersed evenly over the DataFrame's spatial extent. If possible, add a temporal relationship and/or attribute relationship to your join condition when joining highly clustered data. You can also increase the ratio of memory to cores in your Spark cluster to improve performance with clustered or dense data.
Geodesic distance calculations will be slower than planar distance calculations. For more information see Planar vs. geodesic distance calculations.
Limiting the extent of your data can improve performance by filtering out invalid coordinates and locations irrelevant to your analysis. You can filter by extent prior to joining with ST_EnvIntersects.

What's next?

Learn more about spatial relationships and spatial functions:

Use the analysis tool Nearest neighbors to join neighboring geometries.
Use the analysis tool Group By Proximity to group nearby geometries.
Try the aggregate functions tutorial.