Apache Spark supports a local deployment mode that lets you run PySpark code using your personal computer's resources as a single-node cluster. This mode is useful for testing your workflow before using resources on a larger Spark cluster. For example, you might choose to write code on your personal computer using a subset of your data before deploying a full-scale Spark cluster in the cloud. This would lower your overall compute time in the cloud and reduce costs.
The following steps explain how to install Apache Spark and GeoAnalytics Engine on Windows or Linux using Spark in local standalone mode. Once complete, you will be able to run PySpark and GeoAnalytics Engine code in a Python notebook, the PySpark shell, or with a Python script.
Prerequisites:
- Java and Python installed on your machine.
Note that some versions of Java or Python are deprecated in some versions of Spark. See Dependencies for details.
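If you're unsure which versions you have, you can check them with plain Python before proceeding. This is only a convenience sketch; it assumes java is already on your Path:
import subprocess
import sys

# Print the version of the interpreter you plan to point PySpark at
print("Python:", sys.version.split()[0])

# Most JDKs print version info to stderr rather than stdout;
# raises FileNotFoundError if java is not on your Path
result = subprocess.run(["java", "-version"], capture_output=True, text=True)
print("Java:", (result.stderr or result.stdout).splitlines()[0])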
Install Apache Hadoop
GeoAnalytics Engine requires Hadoop binaries to be installed when reading from or writing to shapefiles. Hadoop is also required when reading from or writing to any distributed file system that Spark supports (such as S3), and for file formats like Parquet.
To install Hadoop on Linux, download the binaries directly from Apache and unpack the distribution as described in the Hadoop documentation.
To install Hadoop on Windows, download the Windows binaries from a third party or build them yourself. At a minimum you must have winutils.exe and hadoop.dll staged on your machine at <install location>\bin.
For both Linux and Windows, set the HADOOP_HOME environment variable to the Hadoop install location and add %HADOOP_HOME%\bin to your Path variable. For example:
set HADOOP_HOME=C:\Hadoop
set PATH=%PATH%;%HADOOP_HOME%\bin
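Before moving on, you can confirm the variable is visible to Python and, on Windows, that the required binaries are staged. A minimal sketch with no GeoAnalytics-specific assumptions:
import os
from pathlib import Path

hadoop_home = os.environ.get("HADOOP_HOME")
print("HADOOP_HOME =", hadoop_home)

# On Windows, Spark needs winutils.exe and hadoop.dll under %HADOOP_HOME%\bin
if hadoop_home and os.name == "nt":
    for name in ("winutils.exe", "hadoop.dll"):
        print(name, "found:", (Path(hadoop_home) / "bin" / name).exists())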
Install Apache Spark and PySpark
- Download Apache Spark. Any supported version of Spark will work, but the release should support the versions of Java and Python you have installed.
- Set the required environment variables:
- Set the SPARK_HOME environment variable to the Spark install directory and add %SPARK_HOME%\bin to your Path variable. For example:
set SPARK_HOME=C:\Spark\spark-3.2.0-bin-hadoop2.7
set PATH=%PATH%;%SPARK_HOME%\bin
- Set the PYSPARK_PYTHON environment variable to the path of the Python executable you're using, for example:
set PYSPARK_PYTHON=C:\Python37\python.exe
- If you want to use GeoAnalytics Engine in a notebook, set the PYSPARK_DRIVER_PYTHON environment variable to the path of a Python notebook executable, for example:
set PYSPARK_DRIVER_PYTHON=C:\Python37\Scripts\jupyter-notebook.exe
If you want to use GeoAnalytics Engine via the PySpark shell or by running Python scripts, skip this step.
- Ensure that JAVA_HOME is set and that %JAVA_HOME%\bin is in your Path environment variable. If not, set it using:
set JAVA_HOME=C:\Java
set PATH=%PATH%;%SPARK_HOME%\bin;%JAVA_HOME%\bin
- Install PySpark with pip, conda, or by manually installing the package. For more information, see PySpark Installation. Below is an example using pip:
pip install pyspark
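Before adding GeoAnalytics Engine, you can verify the Spark and environment setup with a short smoke test. The sketch below assumes only the pip-installed pyspark package and the variables set above:
import os
from pyspark.sql import SparkSession

# Confirm the environment variables set in the previous steps are visible
for var in ("JAVA_HOME", "SPARK_HOME", "HADOOP_HOME"):
    print(var, "=", os.environ.get(var, "<not set>"))

# Start a throwaway local session and run a trivial job
spark = SparkSession.builder.master("local[*]").appName("smoke-test").getOrCreate()
spark.range(5).show()
spark.stop()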
Start a PySpark session with GeoAnalytics Engine
- Copy the jar and zip install files to your computer.
- Open a command prompt and run the command below. Change the paths to the jar and zip file before running. You can also change the amount of memory available to Spark by updating the value for spark.driver.memory. If you set PYSPARK_DRIVER_PYTHON to a Python notebook, the notebook application will open and the geoanalytics module will be available to import in any notebook you create. If you are using the PySpark shell or running a script, you can import geoanalytics as soon as PySpark starts.
pyspark --jars C:\engine\geoanalytics.jar ^
  --py-files C:\engine\geoanalytics.zip ^
  --conf spark.plugins=com.esri.geoanalytics.Plugin ^
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer ^
  --conf spark.kryo.registrator=com.esri.geoanalytics.KryoRegistrator ^
  --conf spark.driver.memory=5g
If you need to perform a transformation that requires supplementary projection data, add the projection data jars to the --jars argument. Similarly, if you need to use geocoding or network analysis tools, add the file path of geoanalytics-natives.jar to the --jars argument. For example:
pyspark --jars C:\engine\geoanalytics.jar,C:\engine\esri-projection-geographic-north-america.jar,C:\engine\esri-projection-geographic-south-america.jar,C:\engine\geoanalytics-natives.jar ^
  ...
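If you prefer to start the session from a standalone Python script rather than the pyspark launcher, the same options can be passed through SparkSession.builder. This is a sketch under assumptions, not the documented launch method: the paths are the placeholder paths from the example above, and because handling of spark.submit.pyFiles varies across Spark versions, the sketch also adds the zip to sys.path explicitly so that import geoanalytics resolves in the driver:
import sys
from pyspark.sql import SparkSession

GA_JAR = r"C:\engine\geoanalytics.jar"  # placeholder path from the example above
GA_ZIP = r"C:\engine\geoanalytics.zip"  # placeholder path from the example above

spark = (
    SparkSession.builder
    .master("local[*]")
    .config("spark.jars", GA_JAR)
    .config("spark.submit.pyFiles", GA_ZIP)
    .config("spark.plugins", "com.esri.geoanalytics.Plugin")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.kryo.registrator", "com.esri.geoanalytics.KryoRegistrator")
    .config("spark.driver.memory", "5g")
    .getOrCreate()
)

# Make the geoanalytics Python module importable in this process;
# the pyspark launcher does this for you via --py-files
sys.path.insert(0, GA_ZIP)
import geoanalytics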
Authorize GeoAnalytics Engine
- If using a notebook, create a new notebook or open an existing one. Otherwise, continue to the next step.
- Import the geoanalytics library and authorize it using your username and password or a license file. See Authorization for more information. For example:
import geoanalytics
geoanalytics.auth(username="User1", password="p@ssw0rd")
- Try out the API by importing the SQL functions as an easy-to-use alias like ST and listing the first 20 functions in a notebook cell:
from geoanalytics.sql import functions as ST
spark.sql("show user functions like 'ST_*'").show()
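As a quick sanity check, you can also call one of the listed functions directly in SQL. This one-liner assumes ST_Point appears in the function list printed above:
# Construct a point geometry from x,y coordinates and display it
spark.sql("SELECT ST_Point(-74.0, 40.7) AS point").show()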
What's Next?
You can now use any SQL function, track function, or analysis tool in the geoanalytics module.
See Data sources and Using DataFrames to learn more about how to access your data from your notebook. Also see Visualize results to get started with viewing your data on a map. For examples of what else is possible with GeoAnalytics Engine, check out the sample notebooks, tutorials, and blog posts.