Find point clusters

Find Point Clusters finds clusters of points in surrounding noise based on their spatial or spatiotemporal distribution.

Usage notes

The input DataFrame for Find Point Clusters must have a point geometry column. Find Point Clusters extracts clusters from the input DataFrame and identifies any surrounding noise.
Find Point Clusters requires that the input DataFrame's geometry has a projected coordinate system or the tool will fail. You can transform your data to a projected coordinate system by using ST_Transform.

Learn more about coordinate systems and transformations
There are two options available for the cluster method (setClusterMethod()):
- DBSCAN—Finds clusters of points that are in close proximity based on a specified search distance.
- HDBSCAN—Finds clusters of points using varying distances, allowing for clusters with varying densities based on cluster probability (or stability).
If DBSCAN is chosen, clusters can be found in either two-dimensional space only or both in space and time. If you specify a distance for setSearchDistance() and the input DataFrame has a timestamp, DBSCAN will discover spatiotemporal clusters of points that are in close proximity based on a specified search distance and search duration.
The value set with setMinPointsCluster() is used differently, depending on the clustering method:
- DBSCAN - Specifies the number of points that must be found within a search distance of a point for that point to start forming a cluster. The results may include clusters with fewer points than this value. Specify the search distance using setSearchDistance(). When using time to find clusters, setSearchDuration() is required. When searching for cluster members, the value specified for setMinPointsCluster() must be satisfied within both the setSearchDistance() and setSearchDuration() values to form a cluster. Note that this distance and duration are not related to the diameter or time extent of the point clusters discovered.
- HDBSCAN - Specifies the number of points neighboring each point (including the point itself) that will be considered when estimating density. This number is also the minimum cluster size allowed when extracting clusters.

Results

This tool produces a point DataFrame with the following fields:

Field	Description
`CLUSTER_ID`	Identifies the cluster the point belongs to. Values of `-1` are categorized as noise.
`COLOR_ID`	A field to help visualize clusters. Multiple clusters will be assigned each color. Colors will be assigned and repeated so that each cluster is visually distinct from its neighboring clusters.

In addition to the above fields, additional fields are added depending on the cluster method:

If the DBSCAN clustering method is used with time to discover spatiotemporal clusters, results will also include the following fields:

Field Description
cluster_start The start time of the time extent of the cluster a row belongs to.
cluster_end The end time of the time extent of the cluster a row belongs to.

Field	Description
`cluster_start`	The start time of the time extent of the cluster a row belongs to.
`cluster_end`	The end time of the time extent of the cluster a row belongs to.

You can use these fields to ensure that all cluster members are drawn together when visualizing spatiotemporal clusters with a time slider in applications like ArcGIS Pro. These fields are used for visualization only. For noise results, start_time and end_time will be equal to FEAT_TIME.

If the HDBCAN clustering method is used, results will also contain the following fields:

Field	Description
`PROB`	The probability that a point belongs in its assigned cluster.
`OUTLIER`	The likelihood that a row is an outlier within its own cluster. A larger value indicates that the row is more likely to be an outlier.
`EXEMPLAR`	The points that are most representative of each cluster. These rows are indicated by a value of 1.
`STABILITY`	The persistence of each cluster across a range of scales. A larger value indicates that a cluster persists over a wider range of distance scales.

Performance notes

Improve the performance of Find Point Clusters by doing one or more of the following::

Only analyze the records in your area of interest. You can pick the records of interest by using one of the following SQL functions:
- ST_Intersection—Clip to an area of interest represented by a polygon. This will modify your input records.
- ST_BboxIntersects—Select records that intersect an envelope.
- ST_EnvIntersects—Select records having an evelope that intersects the envelope of another geometry.
- ST_Intersects—Select records that intersect another dataset or area of intersect represented by a polygon.
Be selective in the search distance and duration. A narrower search distance or radius may perform better on the same data.

Similar capabilities

The following tools perform similar capabilities:

Syntax

For more details, go to the GeoAnalytics Engine API reference for find point clusters.

Setter	Description	Required
`run(dataframe)`	Runs the Find Point Clusters tool using the provided DataFrame.	Yes
`setClusterMethod(cluster_method)`	Sets The algorithm used for cluster analysis. Supported options are `'DBSCAN'` and `'HDBSCAN'`.	Yes
`setMinPointsCluster(min_points_cluster)`	This setter is used differently depending on the clustering method chosen. See the API documentation for more details.	Yes
`setSearchDistance(search_distance, search_distance_unit)`	Sets the search distance within which the number of points specified by `setMinPointsCluster()` must be found (in addition to being within the search duration, if applicable) to form a cluster using the DBSCAN algorithm.	Yes, for DBSCAN only.
`setSearchDuration(search_duration, search_duration_unit)`	Sets the search duration within which the number of points specified by `setMinPointsCluster()` must be found (in addition to being within the search distance) to form a cluster using the DBSCAN algorithm.	No

Examples

Run Find Point Clusters

Python
Use dark colors for code blocksCopy

# Log in
import geoanalytics
geoanalytics.auth(username="myusername", password="mypassword")

# Imports
from geoanalytics.tools import FindPointClusters
from geoanalytics.sql import functions as ST
from pyspark.sql import functions as F

# Path to the Earthquakes data
data_path = r"https://sampleserver6.arcgisonline.com/arcgis/rest/services/" \
             "Earthquakes_Since1970/FeatureServer/0"

# Create an earthquakes DataFrame and transform the geometry to World Cylindrical Equal Area (54034)
df = spark.read.format("feature-service").load(data_path) \
                        .withColumn("shape", ST.transform("shape", 54034))

# Use Find Point Clusters to find clusters of earthquake occurrences across the globe
result = FindPointClusters() \
            .setClusterMethod(cluster_method="HDBSCAN") \
            .setMinPointsCluster(min_points_cluster=7) \
            .run(dataframe=df)

# Show a selection of columns for the first 5 output clusters sorted by cluster id
result.filter(result["CLUSTER_ID"] == 115) \
      .select("name", F.date_format("date_", "yyyy-MM-dd").alias("date_"), "CLUSTER_ID", "PROB", "STABILITY") \
      .sort("name", "date_", ascending=False).show(5, truncate=False)

Result
Use dark colors for code blocksCopy
+-----------------------------+----------+----------+-------------------+-------------------+
|name                         |date_     |CLUSTER_ID|PROB               |STABILITY          |
+-----------------------------+----------+----------+-------------------+-------------------+
|TAJIKISTAN: SHURAB, NEFTEABAD|1980-07-10|115       |0.180143468626724  |0.25173588399009866|
|TAJIKISTAN: SHARORA, GISSAR  |1989-01-21|115       |0.14929508441065453|0.25173588399009866|
|TAJIKISTAN: ROSHTKALA, KHOROG|1988-09-24|115       |0.17619107744283952|0.25173588399009866|
|TAJIKISTAN: ROGHUN           |2002-02-02|115       |0.18251933904618436|0.25173588399009866|
|TAJIKISTAN: ROGHUN           |2002-01-08|115       |0.18251933904618436|0.25173588399009866|
+-----------------------------+----------+----------+-------------------+-------------------+
only showing top 5 rows

Plot results

Python
Use dark colors for code blocksCopy

# Create a continents DataFrame and transform the geometry to World Cylindrical Equal Area (54034)
continents_path = "https://services.arcgis.com/P3ePLMYs2RVChkJx/ArcGIS/rest/" \
                  "services/World_Continents/FeatureServer/0"
continents_df = spark.read.format("feature-service").load(continents_path) \
                                .withColumn("shape", ST.transform("shape", 54034))

# Plot the clustered results with the world continents data
continents_plot = continents_df.st.plot(facecolor="none",
                                        edgecolors="black",
                                        alpha=0.3,
                                        figsize=(14,12))
result_noise_plot = result.where("COLOR_ID == -1").st.plot(geometry="shape",
                                                           color="lightgrey",
                                                           ax=continents_plot,
                                                           basemap="light")
result_clusters_plot = result.where("COLOR_ID != -1").st.plot(geometry="shape",
                                                     cmap_values="COLOR_ID",
                                                     is_categorical=True,
                                                     cmap="Paired",
                                                     ax=continents_plot,
                                                     legend=True,
                                                     legend_kwds={"title": "Cluster ID",
                                                                  "loc": "lower right",
                                                                  "bbox_to_anchor": (1.09, 0)})
result_clusters_plot.set_title("Point clusters and noise results (grey) for world earthquake occurrences")
result_clusters_plot.set_xlabel("X (Meters)")
result_clusters_plot.set_ylabel("Y (Meters)");

Plotting example for a Find Point Clusters result. Global earthquake clusters are shown.

Version table

Release	Notes
1.0.0	Python tool introduced