Getting Started
How do I get started with GeoAnalytics Engine?
To learn more about bringing ArcGIS GeoAnalytics Engine into your Spark environment, see Install and set up and Licensing.
Once you have GeoAnalytics Engine installed, see Get started for a quick tutorial that introduces some basic features and capabilities of the geoanalytics library.
There are a variety of other resources for getting familiar with the module. For example:
- Try one of the tutorials
- Explore one of the sample notebooks included in the ArcGIS GeoAnalytics Engine distribution
- Watch a demo
- Read a blog post
How do I use GeoAnalytics Engine documentation?
Documentation is divided into two main components:
- API Reference—A concise reference manual containing details about all functions, classes, return types, and arguments in GeoAnalytics Engine.
- Guide—Descriptions, code samples, and usage notes for each function and tool, as well as installation instructions, core concepts, frequently asked questions, and tutorials.
What are some helpful resources for learning about Spark?
The Spark SQL programming guide provides a high-level overview of Spark DataFrames and Spark SQL functions and includes extensive examples in Scala, Java, and Python. See the Machine Learning Library (MLlib) guide to learn more about Spark’s capabilities in the areas of classification, regression, clustering, and more.
To learn more about PySpark (the Python API for Spark) specifically, see the PySpark Quickstart and API reference. Spark also comes with a collection of PySpark examples that you can use to become more familiar with the API.
What are some helpful resources for learning more about spatial analysis and GIS?
See Esri’s guide called What is GIS? to find more information and resources. The ArcGIS Book is a great free resource for learning about all things GIS, especially the basics of spatial analysis. For more inspiration see the sample notebooks and blog posts for GeoAnalytics Engine.
Subscriptions
What is the subscription term for a Connected prepaid plan?
A Connected prepaid plan expires 1 year from purchase. Any unused core-hours included with your subscription expire at the end of the 1-year subscription period.
What is the subscription term for an Additional core-hours prepaid plan?
An Additional core-hours prepaid plan expires 1 year from purchase. Purchased additional core-hours can only be accessed with a valid ArcGIS GeoAnalytics Engine subscription. If the additional core-hours expire later than the Connected prepaid plan subscription term, any unused core-hours become inaccessible once the subscription expires. Renewing the GeoAnalytics Engine Connected prepaid plan annual subscription before the additional core-hours expire makes them accessible again until their expiration (1 year from the additional core-hours purchase).
Core-hours are debited from a subscription starting with those closest to their expiration date.
Install
Where can I install GeoAnalytics Engine?
See the install guide for a complete list of officially supported Spark environments. These configurations have been tested and certified to work as documented in the install guide. Using GeoAnalytics Engine with other Spark runtimes or deployment environments may cause some functions or tools to not work correctly.
What are the requirements for installing GeoAnalytics Engine?
See Dependencies for a complete description of the install requirements for each version of Spark and GeoAnalytics Engine.
Do I need a license to use GeoAnalytics Engine?
Yes, you must authorize the geoanalytics module before running any function or tool. See Licensing for more information.
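For example, a minimal sketch of authorizing with a username and password; the credentials are placeholders, and the full set of authorization options is described in Licensing:
import geoanalytics

# Placeholder credentials; see Licensing for the authorization options
# available to your subscription.
geoanalytics.auth(username="my_user", password="my_password")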
How many cores and how much RAM should I use? How many nodes should I have in my cluster?
The size and scale of the Spark cluster you should use depends on the amount of data you’re working with, the type of analysis or queries being run, and the performance required by your use case.
Deploying a Spark cluster in the cloud is a great option if you don’t know what size you need. Managed Spark services have the advantage of allowing you to scale up or down resources quickly without purchasing hardware or making any long-term commitments. This means that you can estimate how large of a Spark cluster you may need and scale it out if needed based on the performance you observe.
Just as important as the number of cores and amount of RAM is the ratio of RAM to cores. It is recommended that you have at least 10 GB of RAM per core so that each Spark executor has sufficient memory for computations.
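For example, a hedged sketch of executor settings that follow this guidance; the values are illustrative only and should be tuned to your workload and cluster:
from pyspark.sql import SparkSession

# Illustrative values only: 4 cores per executor with 40g of executor
# memory keeps roughly 10 GB of RAM per core.
spark = (
    SparkSession.builder
    .config("spark.executor.cores", "4")
    .config("spark.executor.memory", "40g")
    .getOrCreate()
)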
Data sources
What data sources or formats are supported by GeoAnalytics Engine?
All functions and tools in GeoAnalytics Engine operate on Spark DataFrames or DataFrame columns. Therefore, the API supports any data source or format that can be loaded into a DataFrame. Spark includes built-in support for reading from Parquet, ORC, JSON, CSV, text, binary, and Avro files, as well as Hive tables and JDBC connections to other databases. GeoAnalytics Engine also includes native support for reading from file geodatabases, reading and writing shapefiles, GeoJSON, GeoParquet, and feature services, and writing to vector tiles. See Data sources for a summary of the spatial data sources and sinks supported by GeoAnalytics Engine.
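For example, a hedged sketch of loading two of the sources listed above, assuming an active SparkSession named spark; the "shapefile" format string and the paths are assumptions for illustration, so check Data sources for the exact format names:
# Built-in Spark reader for Parquet.
parquet_df = spark.read.parquet("s3a://my-bucket/data/tracks.parquet")

# Assumed format string and path for a GeoAnalytics Engine spatial reader;
# see Data sources for the exact format names.
shapefile_df = spark.read.format("shapefile").load("/data/counties/")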
How do I connect to my data stored in the cloud?
The way you connect to a cloud store is different for every cloud store and cloud provider. Some cloud stores have connectors that are included with Apache Hadoop and thus Apache Spark. For example, Hadoop comes with an Amazon S3 connector called s3a that can be used from any Spark cluster that is connected to the internet. Other cloud providers may manage their own connector or may not have direct Spark integration and may require you to mount the cloud store as a local drive on your Spark cluster.
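For example, a minimal sketch of passing credentials to the s3a connector through Spark configuration; the fs.s3a.* property names are standard Hadoop S3A settings, and the credentials, bucket, and path are placeholders:
from pyspark.sql import SparkSession

# Placeholder credentials, bucket, and path. The fs.s3a.* properties are
# standard Hadoop S3A connector settings passed through Spark with the
# spark.hadoop. prefix.
spark = (
    SparkSession.builder
    .config("spark.hadoop.fs.s3a.access.key", "MY_ACCESS_KEY")
    .config("spark.hadoop.fs.s3a.secret.key", "MY_SECRET_KEY")
    .getOrCreate()
)

df = spark.read.parquet("s3a://my-bucket/path/to/data.parquet")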
Does GeoAnalytics Engine work with imagery or raster data?
No, GeoAnalytics Engine functions and tools operate on vector geometry data only. This includes points, lines, polygons, multipoints, and generic vector geometries.
Working with geometry and time in DataFrames
How do I create a DataFrame?
The most common way to create a DataFrame is by loading data from a supported data source with spark.read.load(). For example:
df = spark.read.load("examples/src/main/resources/users.parquet")
You can also create a DataFrame from a list of values or a Pandas DataFrame using spark.createDataFrame(). See Using DataFrames for more information.
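For example, a minimal sketch assuming an active SparkSession named spark:
import pandas as pd

# From a list of rows with column names.
df_from_list = spark.createDataFrame(
    [(1, "A", -118.24, 34.05), (2, "B", -122.42, 37.77)],
    schema=["id", "name", "x", "y"],
)

# From a Pandas DataFrame.
pdf = pd.DataFrame({"id": [1, 2], "value": [10.0, 20.0]})
df_from_pandas = spark.createDataFrame(pdf)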
What are the differences between a PySpark DataFrame and a Pandas DataFrame?
PySpark DataFrames and Pandas DataFrames offer similar ways of representing columnar data in Python, but only PySpark DataFrames can be used with GeoAnalytics Engine.
PySpark DataFrames (often referred to as DataFrames or Spark DataFrames in this documentation) are distributed across a Spark cluster and any operations on them are executed in parallel on all nodes of the cluster. Pandas DataFrames are stored in memory on a single node and operations on them are executed on a single thread. This means that the performance of Pandas DataFrames cannot be scaled out to handle larger datasets and is limited by the memory available on a single machine.
Other differences include that PySpark DataFrames are immutable while Pandas DataFrames are mutable. Also, PySpark uses lazy execution, which means that tasks are not executed until specific actions are taken. In contrast, Pandas uses eager execution, which means that tasks are executed as soon as they are called.
How do I convert between a Pandas DataFrame and a PySpark DataFrame?
Several options are available. Koalas is a pandas API for Apache Spark that provides a scalable way to convert between PySpark DataFrames and a pandas-like DataFrame. You must first convert any geometry column into a string or binary column before converting to a Koalas DataFrame.
GeoAnalytics Engine also includes a function for converting a PySpark DataFrame to a spatially enabled DataFrame supported by the ArcGIS API for Python. This option preserves any geometry columns in your PySpark DataFrame, but the result cannot be distributed across a Spark cluster and thus is not as scalable as using Koalas.
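If the data fits in driver memory, the built-in PySpark conversions can also be used directly. In this hedged sketch, ST_AsText is an assumed SQL function name for serializing the geometry column to text first:
from pyspark.sql import functions as F

# PySpark -> Pandas: serialize the geometry column to WKT (ST_AsText is an
# assumed SQL function name), then collect to the driver.
pdf = df.withColumn("wkt", F.expr("ST_AsText(geometry)")).drop("geometry").toPandas()

# Pandas -> PySpark
sdf = spark.createDataFrame(pdf)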
How do I check and/or set the spatial reference of a geometry?
You can check the spatial reference of any geometry column using get_spatial_reference. If you know the spatial reference of the geometry data, you can set it using ST_SRID or ST_SRText.
To learn more about spatial references and how to set them see Coordinate systems and transformations.
What is the difference between ST_SRID and ST_Transform?
ST_SRID gets or sets the spatial reference ID of a geometry column but does not change any of the data in the column. ST_Transform transforms the geometry data within a column from an existing spatial reference to a new spatial reference and also sets the result column’s spatial reference ID.
To learn more about spatial references and how to transform between them see Coordinate systems and transformations.
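For example, a hedged sketch using SQL expressions; the exact signatures of ST_SRID and ST_Transform shown here are assumptions, so confirm them in the API reference:
from pyspark.sql import functions as F

# Tag the existing coordinates as WGS84 (no coordinate values change):
df = df.withColumn("geometry", F.expr("ST_SRID(geometry, 4326)"))

# Reproject the coordinates to Web Mercator (coordinate values change):
df = df.withColumn("geometry", F.expr("ST_Transform(geometry, 3857)"))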
Why are the values in my geometry column null?
This usually happens when using the wrong function to create the geometry column or when using an invalid or unsupported format. Double check that you are using the SQL function corresponding to the same geometry type as your input data. If you are unsure of the geometry type of your input data, use one of the generic geometry import functions instead.
Also verify that you’re using the SQL function corresponding to the format of your geometry data (EsriJSON, GeoJSON, WKT, WKB, or Shape), and that the representation is valid.
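For example, a hedged sketch of importing WKT strings with a generic import function; ST_GeomFromText and its (wkt, srid) signature are assumptions for illustration:
from pyspark.sql import functions as F

# Assumed generic import function and signature for illustration.
wkt_df = spark.createDataFrame([("POINT (-118.24 34.05)",)], ["wkt"])
geom_df = wkt_df.withColumn("geometry", F.expr("ST_GeomFromText(wkt, 4326)"))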
How do I create a time column?
GeoAnalytics Engine uses the TimestampType included with PySpark to represent instants in time. Use the to_timestamp function to create a timestamp column from a numeric or string column using Spark's datetime patterns for formatting and parsing.
Intervals in time are represented by two timestamp columns containing the start and end instants of each interval.
If you have more than one timestamp column, use the st.set_time_fields function to specify the time columns. To check that your time columns are set correctly, use the st.get_time_fields function.
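For example, a minimal sketch using PySpark's to_timestamp with a datetime pattern; the column names are placeholders:
from pyspark.sql import functions as F

# Parse a string column into a TimestampType column using a Spark
# datetime pattern.
df = df.withColumn("obs_time", F.to_timestamp("obs_time_str", "yyyy-MM-dd HH:mm:ss"))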
How do I specify which geometry columns or time columns to use in a tool?
If there is only one geometry column in a DataFrame it will be used automatically. If there are multiple geometry columns in a DataFrame, you must call st.set_geometry_field on the DataFrame to specify the primary geometry column.
Similarly, if there is only one timestamp column in a DataFrame it will be used automatically as instant time when time is required by a tool. If there are multiple timestamp columns, or you want to represent intervals of time, you must call st.set_time_fields.
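For example, a hedged sketch; the accessor method names and signatures shown here are assumptions, so confirm them in the API reference:
# Assumed accessor method names and signatures.
df = df.st.set_geometry_field("geometry")
df = df.st.set_time_fields("start_time", "end_time")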
Running tools and functions
How do I check the progress of a function or tool after calling it?
The Spark Web UI is the best way to watch the progress of your jobs. The web UI is started on port 4040 by default when you start Spark. All managed Spark services offer their own UIs for tracking the progress of Spark jobs. See the documentation for each service below:
- Amazon EMR – Access the Spark web UIs
- Azure Synapse Analytics - Use Synapse Studio to monitor your Apache Spark applications
- Databricks - View cluster information in the Apache Spark UI
- Google Cloud Dataproc - Cluster web interfaces
Why does nothing happen when I call a function?
PySpark uses lazy evaluation, which means that functions are not executed until certain actions are called. In other words, calling a SQL function will not run that function on your Spark cluster until you call an action on the function's return value. Examples of actions include write.save(), plot(), count(), collect(), and take().
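For example, a minimal sketch using only built-in PySpark functions:
from pyspark.sql import functions as F

# Defining the transformation runs nothing on the cluster...
lengths = df.withColumn("name_length", F.length("name"))

# ...execution is triggered only by an action such as count() or collect().
print(lengths.count())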
What does it mean when a function fails with py4j.Py4JException: Method {} does not exist?
This exception is raised when the arguments to a SQL function are not all of the same documented type or when there are unexpected arguments. For SQL functions that accept x, y, z, and m values in particular, all coordinates must be of the same valid type or the exception above will be thrown. For example, a call like ST_Point(10.0, 20.0) is valid because the x and y coordinates are both floats, but ST_Point(10, 20.0) is not because one coordinate is an integer and the other is a float. Check that the types of your function arguments match the expected types documented in the API reference.
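For example, a hedged sketch; the import path and the point constructor shown here are assumptions for illustration:
# Assumed import path and point constructor, shown for illustration.
from geoanalytics.sql import functions as ST

# Both coordinates are floats, so a matching method is found:
good = df.withColumn("geometry", ST.point(10.0, 20.0))

# Mixing an int and a float can fail overload resolution and raise the
# exception above:
# bad = df.withColumn("geometry", ST.point(10, 20.0))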
Why are all functions failing with TypeError: 'JavaPackage' object is not callable or pyspark.sql.utils.AnalysisException: Undefined function...?
This message indicates that the geoanalytics module has been installed in Python but the accompanying jar has not been properly configured with Spark. To learn more about configuring additional jars with Spark, see the documentation for the spark.jars runtime environment properties and advanced dependency management.
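For example, a hedged sketch of configuring spark.jars when building a SparkSession; the jar path is a placeholder:
from pyspark.sql import SparkSession

# The jar path is a placeholder; point spark.jars at the geoanalytics jar
# that matches your installed Python package.
spark = (
    SparkSession.builder
    .config("spark.jars", "/path/to/geoanalytics.jar")
    .getOrCreate()
)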
Plotting
Why should I transform my geometries to the same spatial reference for plotting?
When the geometries in two or more DataFrames are in different spatial references, they won't plot in the expected locations relative to each other. Transforming one to the spatial reference of the other ensures that they use the same coordinate system and units and thus plot together as expected. To learn more see Coordinate systems and transformations.
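For example, a hedged sketch; roads and cities are example DataFrames, and the ST_Transform signature and st.plot() accessor are assumptions for illustration:
from pyspark.sql import functions as F

# Reproject roads (WGS84) into the spatial reference of cities (Web
# Mercator) so the two layers line up when plotted together.
roads_3857 = roads.withColumn("geometry", F.expr("ST_Transform(geometry, 3857)"))

ax = cities.st.plot()
roads_3857.st.plot(ax=ax)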