GeoAnalytics Engine supports accessing data from and saving data to different locations, either locally or in the cloud, including the local file system, Hadoop Distributed File System (HDFS), Amazon S3, Azure Blob Storage, Azure Data Lake Storage Gen2, and Google Cloud Storage. Some of the more common options are summarized below.
Local file system
Local file system refers to storing data on a local storage device, such as a hard drive or a solid-state drive (SSD). It is directly attached to a computer or other computing device.
If you have installed GeoAnalytics Engine locally, you can access or save different data sources in a local file system in a PySpark session. Below is an example of accessing and saving a CSV file on the Windows C: drive.
# Read a local CSV file
df = spark.read.format("csv").load("C:/<file_path>/<file_name>")
# Write to a local CSV file
df.write.format("csv").mode("overwrite").save("C:/<file_path>/<file_name>")
Hadoop Distributed File System
Hadoop Distributed File System (HDFS) is a distributed file system that is a core component of the Apache Hadoop ecosystem, designed to store large volumes of data. HDFS can be deployed either in a local or cloud environment. For example, it can be installed with Hadoop on your Amazon EMR cluster, or Google Cloud Dataproc.
To store data on HDFS, install Hadoop, create a cluster, and upload your data to the cluster using tools such as the Hadoop command-line interface (CLI). Once you have prepared your data, you can access it with GeoAnalytics Engine using the following steps:
-
Access data on HDFS in a PySpark session.
You can access different data sources stored on HDFS. For example, you can access a Parquet file as shown below by replacing the placeholders hdfs_server, hdfs_port, hdfs_path, and file_name with the appropriate information for your HDFS environment and Parquet file.
# Read a Parquet file in HDFS
df = spark.read.format("parquet").load("hdfs://<hdfs_server>:<hdfs_port>/<hdfs_path>/<file_name>")
-
Write data to HDFS in a PySpark session.
Similarly, you can write data to HDFS as shown below:
# Write to a Parquet file in HDFS
df.write.format("parquet").mode("overwrite").save("hdfs://<hdfs_server>:<hdfs_port>/<hdfs_path>/<file_name>")
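Note that Spark saves a DataFrame as a directory of part files rather than a single file, and load() accepts a directory path. Below is a minimal sketch, reusing the placeholders above with a hypothetical output name, that writes a result and reads the whole directory back.
# Spark writes the DataFrame as a directory of part files
out_path = "hdfs://<hdfs_server>:<hdfs_port>/<hdfs_path>/<output_name>"
df.write.format("parquet").mode("overwrite").save(out_path)
# Loading the directory path reads all part files into one DataFrame
df2 = spark.read.format("parquet").load(out_path)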
Amazon S3
Amazon S3 is a cloud-based object storage service provided by Amazon Web Services (AWS). It provides a simple web interface that allows you to store and retrieve any amount of data, from anywhere on the web.
Once you have created an Amazon S3 bucket and uploaded data to the bucket, you can access and save your data in a PySpark session.
It is common to store data in Amazon S3 and process it using Amazon EMR. Depending on the permissions of the IAM role associated with your EMR cluster, you may access data in the S3 bucket directly, or access the bucket with the s3a filesystem connector that comes with Apache Spark. After you install GeoAnalytics Engine on Amazon EMR, you can access or save data stored in Amazon S3 in an EMR PySpark notebook following the examples below.
-
Access data on Amazon S3 in an EMR PySpark notebook.
This example loads data stored in an S3 bucket directly from PySpark without configuring Spark:
# Load a table in a file geodatabase on Amazon S3 into a PySpark DataFrame
path = "s3://<bucket_name>/<file_path>/<file_name>/"
df = spark.read.format("filegdb").options(gdbPath=path, gdbName="<table_name>").load()
Alternatively, you can use the s3a filesystem connector and set up AWS credentials before accessing the bucket. The following code snippet shows an example of accessing data on Amazon S3 when AWS credentials are needed.
from pyspark.sql import SparkSession

# Set up the AWS credentials
access_key = "your_aws_access_key_id"
secret_key = "your_aws_secret_access_key"

# Configure Spark with the credentials
spark = SparkSession.builder.appName("awsAnalysis").getOrCreate()
spark._sc._jsc.hadoopConfiguration().set('fs.s3a.awsAccessKeyId', access_key)
spark._sc._jsc.hadoopConfiguration().set('fs.s3a.awsSecretAccessKey', secret_key)
spark._sc._jsc.hadoopConfiguration().set('fs.s3a.impl', "org.apache.hadoop.fs.s3a.S3AFileSystem")

# Load the shapefile on Amazon S3 into a PySpark DataFrame
df = spark.read.format("shapefile").load("s3a://<bucket_name>/<file_path>/<file_name>")
-
Write data to Amazon S3 in an EMR PySpark notebook.
To save data to an Amazon S3 bucket, write the data to a supported data source format, for example, a shapefile:
# Write to a single shapefile
df.coalesce(1).write.format("shapefile").mode("overwrite").save("s3a://<bucket_name>/<file_path>/<file_name>")
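Writes to other supported data sources follow the same pattern. For example, below is a minimal sketch, assuming the GeoParquet data source shown in the Google Cloud Storage examples later in this topic and a placeholder output name, that saves the DataFrame as GeoParquet in the same bucket.
# Write the DataFrame as GeoParquet to Amazon S3
df.write.format("geoparquet").mode("overwrite").save("s3a://<bucket_name>/<file_path>/<output_name>")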
Azure Blob Storage and Azure Data Lake Storage Gen2
Azure Blob Storage is a cloud-based object storage service provided by Microsoft Azure. It is designed to store and manage unstructured data that doesn't adhere to a particular data model or definition. Azure Data Lake Storage Gen2 is another cloud-based object storage service provided by Microsoft Azure built on top of Azure Blob Storage. It is designed to handle both structured and unstructured data, including files of various formats such as CSV, Parquet, and ORC. Azure Data Lake Storage Gen2 is primarily designed to work with Hadoop and organize data in a hierarchical directory structure. Refer to the Azure Blob Storage documentation on how to manage data on Azure data storage.
Azure Blob Storage and Azure Data Lake Storage Gen2 integrate with other Azure services such as Azure Databricks and Azure Synapse Analytics. After you install GeoAnalytics Engine on Azure Synapse Analytics or Azure Databricks, you can access or save data stored on Azure storage following the examples below.
-
Access data on Azure Storage in a Databricks notebook.
In Databricks, you can connect to Azure Blob Storage or Azure Data Lake Storage Gen2 using Azure credentials, including an Azure service principal, shared access signatures (SAS), and account keys. Set the Spark properties that configure the Azure credentials either on a cluster or in a notebook. For example, the following code snippet sets the account key in a notebook (a service principal sketch follows these steps):
# Set up credentials
spark.conf.set("fs.azure.account.key.<storage-account>.dfs.core.windows.net",
               dbutils.secrets.get(scope="<scope>", key="<storage-account-access-key>"))
Replace storage-account, scope, and storage-account-access-key with the Azure Storage account name, the Databricks secret scope name, and the Azure Storage account access key, respectively. Once the credentials are configured, you can access the Azure data storage using the abfss driver.
# Read GeoJSON data in a Databricks notebook
df = spark.read.format("geojson").load("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<path-to-data>")
-
Write data to Azure Storage in a Databricks notebook.
After configuring the Azure credentials, you can write data to Azure Blob Storage or Azure Data Lake Storage Gen2 as shown below:
# Write GeoJSON data
df.write.format("geojson").mode("overwrite").save("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<path-to-data>")
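The account key shown earlier is only one of the supported credential types. Below is a hedged sketch of configuring an Azure service principal (OAuth) instead; the property names follow the standard hadoop-azure ABFS driver, and the application ID, tenant ID, and secret scope and key names are placeholders you would supply.
# Configure a service principal (OAuth) for ABFS access
spark.conf.set("fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<storage-account>.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<storage-account>.dfs.core.windows.net", "<application-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret.<storage-account>.dfs.core.windows.net",
               dbutils.secrets.get(scope="<scope>", key="<service-credential-key>"))
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<storage-account>.dfs.core.windows.net",
               "https://login.microsoftonline.com/<tenant-id>/oauth2/token")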
In Azure Synapse Analytics, refer to Introduction to Microsoft Spark Utilities for more information on how to configure access to Azure Blob Storage and Azure Data Lake Storage Gen2 using the wasbs driver and the abfss driver, respectively.
Google Cloud Storage
Google Cloud Storage is a cloud-based object storage service offered by Google Cloud Platform. Like Amazon S3, it stores data in containers called buckets, and you can access, share, and manage objects stored in a bucket. Once you have created a Google Cloud Storage bucket and uploaded data to it, you can access and save your data on Google Cloud Storage in a PySpark session.
Google Cloud Storage integrates with other Google Cloud Platform services. A typical workflow is to store data in Google Cloud Storage and process it using Google Cloud Dataproc. After you install GeoAnalytics Engine on Google Cloud Dataproc, use the following steps to access and save data on Google Cloud Storage:
-
Access data stored in Cloud Storage in a PySpark session.
You can access data stored in Google Cloud Storage using the Google Cloud Storage connector. The connector is installed automatically under the /usr/local/share/google/dataproc/lib/ directory when you use Google Cloud Dataproc. Data stored in Cloud Storage can be accessed using the gs:// prefix. For example:
# Read GeoParquet data in Google Cloud Storage
df = spark.read.format("geoparquet").load("gs://<bucket_name>/<file_name>")
For other Spark environments, you can install the connector manually by following the Google Cloud Storage Connector for Spark and Hadoop guide; a configuration sketch follows these steps.
-
Write data to Google Cloud Storage.
Similarly, you can write data in Google Cloud Storage using the Google Cloud Storage connector:
# Save GeoParquet data in Google Cloud Storage
df.write.format("geoparquet").save("gs://<bucket_name>/<file_name>")
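For Spark environments outside Dataproc, the connector must also be registered with Hadoop and given credentials after its JAR is on the classpath. Below is a hedged sketch using property names documented for the Google Cloud Storage connector (they can vary by connector version); the service account key file path is a placeholder.
# Register the GCS connector and supply a service account key file
conf = spark._sc._jsc.hadoopConfiguration()
conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
conf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
conf.set("google.cloud.auth.service.account.enable", "true")
conf.set("google.cloud.auth.service.account.json.keyfile", "/path/to/<service_account_key>.json")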