Find Point Clusters finds clusters of points in surrounding noise based on their spatial or spatiotemporal distribution.
Usage notes
-
The input DataFrame for Find Point Clusters must have a point geometry column. Find Point Clusters extracts clusters from the input DataFrame and identifies any surrounding noise.
-
Find Point Clusters requires that the input DataFrame's geometry has a projected coordinate system or the tool will fail. You can transform your data to a projected coordinate system by using ST_Transform.
-
There are two options available for the cluster method (
set
):Cluster Method() - DBSCAN—Finds clusters of points that are in close proximity based on a specified search distance.
- HDBSCAN—Finds clusters of points using varying distances, allowing for clusters with varying densities based on cluster probability (or stability).
-
If DBSCAN is chosen, clusters can be found in either two-dimensional space only or both in space and time. If you specify a distance for
set
and the input DataFrame has a timestamp, DBSCAN will discover spatiotemporal clusters of points that are in close proximity based on a specified search distance and search duration.Search Distance() -
The value set with
set
is used differently, depending on the clustering method:Min Points Cluster() -
DBSCAN - Specifies the number of points that must be found within a search distance of a point for that point to start forming a cluster. The results may include clusters with fewer points than this value. Specify the search distance using
set
. When using time to find clusters,Search Distance() set
is required. When searching for cluster members, the value specified forSearch Duration() set
must be satisfied within both theMin Points Cluster() set
andSearch Distance() set
values to form a cluster. Note that this distance and duration are not related to the diameter or time extent of the point clusters discovered.Search Duration() -
HDBSCAN - Specifies the number of points neighboring each point (including the point itself) that will be considered when estimating density. This number is also the minimum cluster size allowed when extracting clusters.
-
Results
This tool produces a point DataFrame with the following fields:
Field | Description |
---|---|
CLUSTER | Identifies the cluster the point belongs to. Values of -1 are categorized as noise. |
COLOR | A field to help visualize clusters. Multiple clusters will be assigned each color. Colors will be assigned and repeated so that each cluster is visually distinct from its neighboring clusters. |
In addition to the above fields, additional fields are added depending on the cluster method:
-
If the DBSCAN clustering method is used with time to discover spatiotemporal clusters, results will also include the following fields:
Field Description cluster
_start The start time of the time extent of the cluster a row belongs to. cluster
_end The end time of the time extent of the cluster a row belongs to.
You can use these fields to ensure that all cluster members are drawn together when
visualizing spatiotemporal clusters with a time slider in applications like ArcGIS Pro. These fields are
used for visualization only. For noise results, start
and
end
will be equal to FEAT
.
- If the HDBCAN clustering method is used, results will also contain the following fields:
Field | Description |
---|---|
PROB | The probability that a point belongs in its assigned cluster. |
OUTLIER | The likelihood that a row is an outlier within its own cluster. A larger value indicates that the row is more likely to be an outlier. |
EXEMPLAR | The points that are most representative of each cluster. These rows are indicated by a value of 1. |
STABILITY | The persistence of each cluster across a range of scales. A larger value indicates that a cluster persists over a wider range of distance scales. |
Performance notes
Improve the performance of Find Point Clusters by doing one or more of the following::
-
Only analyze the records in your area of interest. You can pick the records of interest by using one of the following SQL functions:
- ST_Intersection—Clip to an area of interest represented by a polygon. This will modify your input records.
- ST_BboxIntersects—Select records that intersect an envelope.
- ST_EnvIntersects—Select records having an evelope that intersects the envelope of another geometry.
- ST_Intersects—Select records that intersect another dataset or area of intersect represented by a polygon.
- Be selective in the search distance and duration. A narrower search distance or radius may perform better on the same data.
Similar capabilities
The following tools perform similar capabilities:
Syntax
For more details, go to the GeoAnalytics Engine API reference for find point clusters.
Setter | Description | Required |
---|---|---|
run(dataframe) | Runs the Find Point Clusters tool using the provided DataFrame. | Yes |
set | Sets The algorithm used for cluster analysis. Supported options are ' and ' . | Yes |
set | This setter is used differently depending on the clustering method chosen. See the API documentation for more details. | Yes |
set | Sets the search distance within which the number of points specified by set must be found (in addition to being within the search duration, if applicable) to form a cluster using the DBSCAN algorithm. | Yes, for DBSCAN only. |
set | Sets the search duration within which the number of points specified by set must be found (in addition to being within the search distance) to form a cluster using the DBSCAN algorithm. | No |
Examples
Run Find Point Clusters
# Log in
import geoanalytics
geoanalytics.auth(username="myusername", password="mypassword")
# Imports
from geoanalytics.tools import FindPointClusters
from geoanalytics.sql import functions as ST
from pyspark.sql import functions as F
# Path to the Earthquakes data
data_path = r"https://sampleserver6.arcgisonline.com/arcgis/rest/services/" \
"Earthquakes_Since1970/FeatureServer/0"
# Create an earthquakes DataFrame and transform the geometry to World Cylindrical Equal Area (54034)
df = spark.read.format("feature-service").load(data_path) \
.withColumn("shape", ST.transform("shape", 54034))
# Use Find Point Clusters to find clusters of earthquake occurrences across the globe
result = FindPointClusters() \
.setClusterMethod(cluster_method="HDBSCAN") \
.setMinPointsCluster(min_points_cluster=7) \
.run(dataframe=df)
# Show a selection of columns for the first 5 output clusters sorted by cluster id
result.filter(result["CLUSTER_ID"] == 115) \
.select("name", F.date_format("date_", "yyyy-MM-dd").alias("date_"), "CLUSTER_ID", "PROB", "STABILITY") \
.sort("name", "date_", ascending=False).show(5, truncate=False)
+-----------------------------+----------+----------+-------------------+-------------------+
|name |date_ |CLUSTER_ID|PROB |STABILITY |
+-----------------------------+----------+----------+-------------------+-------------------+
|TAJIKISTAN: SHURAB, NEFTEABAD|1980-07-10|115 |0.180143468626724 |0.25173588399009866|
|TAJIKISTAN: SHARORA, GISSAR |1989-01-21|115 |0.14929508441065453|0.25173588399009866|
|TAJIKISTAN: ROSHTKALA, KHOROG|1988-09-24|115 |0.17619107744283952|0.25173588399009866|
|TAJIKISTAN: ROGHUN |2002-02-02|115 |0.18251933904618436|0.25173588399009866|
|TAJIKISTAN: ROGHUN |2002-01-08|115 |0.18251933904618436|0.25173588399009866|
+-----------------------------+----------+----------+-------------------+-------------------+
only showing top 5 rows
Plot results
# Create a continents DataFrame and transform the geometry to World Cylindrical Equal Area (54034)
continents_path = "https://services.arcgis.com/P3ePLMYs2RVChkJx/ArcGIS/rest/" \
"services/World_Continents/FeatureServer/0"
continents_df = spark.read.format("feature-service").load(continents_path) \
.withColumn("shape", ST.transform("shape", 54034))
# Plot the clustered results with the world continents data
continents_plot = continents_df.st.plot(facecolor="none",
edgecolors="black",
alpha=0.3,
figsize=(14,12))
result_noise_plot = result.where("COLOR_ID == -1").st.plot(geometry="shape",
color="lightgrey",
ax=continents_plot,
basemap="light")
result_clusters_plot = result.where("COLOR_ID != -1").st.plot(geometry="shape",
cmap_values="COLOR_ID",
is_categorical=True,
cmap="Paired",
ax=continents_plot,
legend=True,
legend_kwds={"title": "Cluster ID",
"loc": "lower right",
"bbox_to_anchor": (1.09, 0)})
result_clusters_plot.set_title("Point clusters and noise results (grey) for world earthquake occurrences")
result_clusters_plot.set_xlabel("X (Meters)")
result_clusters_plot.set_ylabel("Y (Meters)");
Version table
Release | Notes |
---|---|
1.0.0 | Python tool introduced |