GeoAnalytics Engine includes a spatial join that combines rows from two DataFrames based on a spatial relationship. This join generates spatial indices for your data automatically at runtime to improve performance. You can perform a spatial join with either Spark SQL or the SpatiotemporalJoin tool. Both options utilize the same underlying spatial join implementation included with GeoAnalytics Engine. Differences between the two options and examples of when you might use one versus the other are detailed below.
Spatial joins in Spark SQL
To perform a spatial join with Spark SQL, you must use a GeoAnalytics Engine SQL function in the join condition expression. For example, if you want to join two DataFrames based on where one DataFrame's geometries are within the other, you could use ST_Within in the join condition expression. The following example shows joining a DataFrame of wind turbine locations to a DataFrame of burn area polygons using the Spark SQL join. Because a GeoAnalytics Engine SQL function is part of the join condition, spatial indices are generated automatically and a spatial join is performed.
burn_areas_url = "https://services.arcgis.com/P3ePLMYs2RVChkJx/arcgis/rest/services/MTBS_Polygons_v1/FeatureServer/0"
turbines_url = "https://services.arcgis.com/P3ePLMYs2RVChkJx/arcgis/rest/services/US_Wind_Turbine_Database/FeatureServer/0"
spark.read.format("feature-service").load(burn_areas_url) \
.withColumn("polygon", ST.transform("shape", 102007)) \
.createOrReplaceTempView("burn_areas")
spark.read.format("feature-service").load(turbines_url) \
.withColumn("point", ST.transform("shape", 102007)) \
.createOrReplaceTempView("turbines")
spark.sql(
"""
SELECT COUNT(*) AS `Count of turbines in burn areas`
FROM turbines JOIN burn_areas ON
ST_Within(turbines.point, burn_areas.polygon)
"""
).show()
+-------------------------------+
|Count of turbines in burn areas|
+-------------------------------+
| 9570|
+-------------------------------+
You can also perform the same spatial join with Python only using pyspark.sql.DataFrame.join, as shown in the following example.
burn_areas_url = "https://services.arcgis.com/P3ePLMYs2RVChkJx/arcgis/rest/services/MTBS_Polygons_v1/FeatureServer/0"
turbines_url = "https://services.arcgis.com/P3ePLMYs2RVChkJx/arcgis/rest/services/US_Wind_Turbine_Database/FeatureServer/0"
burn_areas = spark.read.format("feature-service").load(burn_areas_url) \
.select(ST.transform("shape", 102007).alias("polygon"))
turbines = spark.read.format("feature-service").load(turbines_url) \
.select(ST.transform("shape", 102007).alias("point"))
result = turbines.join(burn_areas, ST.within(turbines.point, burn_areas.polygon))
print("Count of turbines in burn areas:")
result.count()
Count of turbines in burn areas:
9570
You can verify that a spatial join is being used by Spark SQL by calling
explain
on the join result and looking for Spatial
in the physical plan string, as shown below.
result.explain()
== Physical Plan ==
SpatialJoin point#1635: point, polygon#1575: polygon, SpatialXYRelationship(Within), Default, false
:- *(1) Project [ST_Transform(shape#1605, in=WGS_1984_Web_Mercator_Auxiliary_Sphere:3857, out=Hawaii_Albers_Equal_Area_Conic:102007, datum_transform="WGS_1984_(ITRF00)_To_NAD_1983") AS point#1635]
: +- BatchScan[shape#1605] FeatureServiceLayerScan[f=json, query="returnM=true&outFields=&where=1=1&returnZ=true", paging=oid-range(field=FID,size=1000,count=73)] RuntimeFilters: []
+- *(2) Project [ST_Transform(shape#1564, in=GCS_North_American_1983:4269, out=Hawaii_Albers_Equal_Area_Conic:102007) AS polygon#1575]
+- BatchScan[shape#1564] FeatureServiceLayerScan[f=json, query="returnM=true&outFields=&where=1=1&returnZ=true", paging=oid-range(field=OBJECTID,size=2000,count=30)] RuntimeFilters: []
SpatiotemporalJoin tool
The SpatiotemporalJoin tool offers similar functionality to spatially joining with Spark SQL but also supports joining on temporal relationships and attribute relationships. See the tool documentation for more information. The table below summarizes some key differences between performing spatial joins with PySpark or Spark SQL vs. the SpatiotemporalJoin tool.
Spark SQL join | SpatiotemporalJoin tool | |
---|---|---|
Usage | Can be called with SQL or Python syntax. | Available in Python only as a tool. |
Spatial relationships | Use one or more of 9 SQL functions to define a spatial relationship. Using the following functions in your join condition expression will trigger a spatial join:
| Choose from one of 9 spatial relationships. These relationships are evaluated using the primary geometry column of each DataFrame. Available spatial relationships include:
|
Temporal relationships | Use unix timestamps and Spark SQL to define temporal relationships. | Choose from one of 14 temporal relationships. These relationships are evaluated using the primary time column(s) of each DataFrame. Available temporal relationships include:
|
Attribute relationships | Use Spark SQL to define an attribute relationship in your join expression. | Use ArcGIS Arcade to define an attribute relationship. |
Performance considerations & best practices
- When using ST_DWithin or the Near spatial relationships, transform your data to a projected spatial reference that preserves distances between geometries for the most accurate results. For more information see Choosing a projected coordinate system.
- When using ST_DWithin or the Near spatial relationships, a larger near distance will be slower. To improve performance of near joins across large distances, increase the ratio of memory to cores in your Spark cluster.
- Spatial joins are memory-intensive operations. It is recommended that you have at least 8 GB of memory available per core in your Spark cluster to allow for optimal performance.
- DataFrames with geometries that are highly clustered will be slower than if the geometries are dispersed evenly over the DataFrame's spatial extent. If possible, add a temporal relationship and/or attribute relationship to your join condition when joining highly clustered data. You can also increase the ratio of memory to cores in your Spark cluster to improve performance with clustered or dense data.
- Geodesic distance calculations will be slower than planar distance calculations. For more information see Planar vs. geodesic distance calculations.
- Limiting the extent of your data can improve performance by filtering out invalid coordinates and locations irrelevant to your analysis. You can filter by extent prior to joining with ST_EnvIntersects
What's next?
Learn more about spatial relationships and spatial functions:
- Use the analysis tool Nearest neighbors to join neighboring geometries.
- Use the analysis tool Group By Proximity to group nearby geometries.
- Try the aggregate functions tutorial.