Data storage

GeoAnalytics Engine supports reading data from and saving data to a variety of locations, both local and in the cloud, including the local file system, Hadoop Distributed File System (HDFS), Amazon S3, Azure Blob Storage, Azure Data Lake Storage Gen2, and Google Cloud Storage. Some of the more common options are summarized below.

Local file system

The local file system refers to storage on a local device, such as a hard drive or solid-state drive (SSD), that is directly attached to a computer or other computing device.

If you have installed GeoAnalytics Engine locally, you can access or save different data sources on the local file system from a PySpark session. Below is an example of reading and writing a CSV file on the Windows C: drive.

Python
# Read a local CSV file
df = spark.read.format("csv").load("C:/<file_path>/<file_name>")

# Write to a local CSV file
df.write.format("csv").mode("overwrite").save("C:/<file_path>/<file_name>")

Hadoop Distributed File System

Hadoop Distributed File System (HDFS) is a distributed file system that is a core component of the Apache Hadoop ecosystem, designed to store large volumes of data. HDFS can be deployed either on-premises or in the cloud. For example, it is installed with Hadoop on Amazon EMR and Google Cloud Dataproc clusters.

To store data on HDFS, install Hadoop, create a cluster, and upload your data to the cluster using tools such as the Hadoop command-line interface (CLI). Once your data is in place, you can access it with GeoAnalytics Engine using the following steps:

  1. Access data on HDFS in a PySpark session.

    You can access different data sources stored on HDFS. For example, you can access a Parquet file as shown below by replacing the placeholders hdfs_server, hdfs_port, hdfs_path, and file_name with the appropriate information for your HDFS environment and Parquet file.

    Python
    # Read a Parquet file in HDFS
    df = spark.read.format("parquet").load("hdfs://<hdfs_server>:<hdfs_port>/<hdfs_path>/<file_name>")
    
  2. Write data to HDFS in a PySpark session.

    Similarly, you can write data to HDFS as shown below:

    Python
    # Write to a Parquet file in HDFS
    df.write.format("parquet").mode("overwrite").save("hdfs://<hdfs_server>:<hdfs_port>/<hdfs_path>/<file_name>")
    

Amazon S3

Amazon S3 is a cloud-based object storage service provided by Amazon Web Services (AWS). It provides a simple web interface that allows you to store and retrieve any amount of data, from anywhere on the web.

Once you have created an Amazon S3 bucket and uploaded data to the bucket, you can access and save your data in a PySpark session.

It is common to store data in Amazon S3 and process it with Amazon EMR. Depending on the permissions of the IAM role associated with your EMR cluster, you can either access data in the S3 bucket directly or access the bucket through the s3a filesystem connector that comes with Apache Spark. After you install GeoAnalytics Engine on Amazon EMR, you can access or save data stored in Amazon S3 from an EMR PySpark notebook by following the examples below.

  1. Access data on Amazon S3 in an EMR PySpark notebook.

    This example loads data stored in an S3 bucket directly from PySpark without additional Spark configuration:

    Python
    # Load a table in a file geodatabase on Amazon S3 into a PySpark DataFrame
    path = "s3://<bucket_name>/<file_path>/<file_name>/"
    df = spark.read.format("filegdb").options(gdbPath=path, gdbName="<table_name>").load()
    

    Alternatively, you can use the s3a filesystem connector and set up AWS credentials before accessing the bucket. The following code snippet shows an example of accessing data on Amazon S3 when AWS credentials are required. A variant using temporary session credentials is sketched after these steps.

    Python
    from pyspark.sql import SparkSession

    # Set up the AWS credentials
    access_key = "your_aws_access_key_id"
    secret_key = "your_aws_secret_access_key"

    # Configure the Spark session to use the s3a filesystem with the credentials
    spark = SparkSession.builder.appName("awsAnalysis").getOrCreate()
    spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.access.key", access_key)
    spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.secret.key", secret_key)
    spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

    # Load a shapefile on Amazon S3 into a PySpark DataFrame
    df = spark.read.format("shapefile").load("s3a://<bucket_name>/<file_path>/<file_name>")
    
  2. Write data to Amazon S3 in an EMR PySpark notebook.

    To save data to an Amazon S3 bucket, write the data to a supported data source format, for example, a shapefile:

    Python
    # Coalesce to a single partition so that the output is written as a single shapefile
    df.coalesce(1).write.format("shapefile").mode("overwrite").save("s3a://<bucket_name>/<file_path>/<file_name>")
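
If you authenticate with temporary credentials (for example, credentials issued by AWS STS), the s3a connector also accepts a session token. Below is a minimal sketch, assuming a Hadoop version that supports the fs.s3a.session.token property and the TemporaryAWSCredentialsProvider; the credential values are placeholders:

Python
# Configure s3a with temporary session credentials (access key, secret key, and session token)
conf = spark.sparkContext._jsc.hadoopConfiguration()
conf.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
conf.set("fs.s3a.access.key", "your_aws_access_key_id")
conf.set("fs.s3a.secret.key", "your_aws_secret_access_key")
conf.set("fs.s3a.session.token", "your_aws_session_token")

# Load a shapefile on Amazon S3 into a PySpark DataFrame
df = spark.read.format("shapefile").load("s3a://<bucket_name>/<file_path>/<file_name>")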
    

Azure Blob Storage and Azure Data Lake Storage Gen2

Azure Blob Storage is a cloud-based object storage service provided by Microsoft Azure. It is designed to store and manage unstructured data that doesn't adhere to a particular data model or definition. Azure Data Lake Storage Gen2 is another Microsoft Azure object storage service, built on top of Azure Blob Storage. It handles both structured and unstructured data, including files in formats such as CSV, Parquet, and ORC, is designed primarily for Hadoop workloads, and organizes data in a hierarchical directory structure. Refer to the Azure Blob Storage documentation for how to manage data in Azure storage.

Azure Blob Storage and Azure Data Lake Storage Gen2 integrate with other Azure services such as Azure Databricks and Azure Synapse Analytics. After you install GeoAnalytics Engine on Azure Synapse Analytics or Azure Databricks, you can access or save data stored in Azure storage by following the examples below.

  1. Access data on Azure Storage in a Databricks notebook.

    In Databricks, you can connect to Azure Blob Storage or Azure Data Lake Storage Gen2 using Azure credentials, including an Azure service principal, shared access signatures (SAS), and account keys; a sketch using a SAS token follows these steps. Set the Spark properties that configure the credentials either on a cluster or in a notebook. For example, the following code snippet sets the account key in a notebook:

    Python
    # Set up credentials
    spark.conf.set("fs.azure.account.key.<storage-account>.dfs.core.windows.net",\
                   dbutils.secrets.get(scope="<scope>", key="<storage-account-access-key>"))
    

    Replace storage-account, scope, and storage-account-access-key with the Azure Storage account name, the Databricks secret scope name, and the key of the secret that holds the Azure storage account access key, respectively.

    Once you have configured the credentials, you can access Azure storage using the abfss driver.

    Python
    # Read GeoJSON data in a Databricks notebook
    df = spark.read.format("geojson").load("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<path-to-data>")
    
  2. Write data to Azure Storage in a Databricks notebook.

    After configuring the Azure credentials, you can write data to Azure Blob Storage or Azure Data Lake Storage Gen2 as shown below:

    Python
    # Write the DataFrame to GeoJSON data in Azure storage
    df.write.format("geojson").mode("overwrite").save("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<path-to-data>")
    

In Azure Synapse Analytics, refer to Introduction to Microsoft Spark Utilities for more information on how to configure access to Azure Blob Storage and Azure Data Lake Storage Gen2 using the wasbs and abfss drivers, respectively.

Google Cloud Storage

Google Cloud Storage is a cloud-based object storage service offered by Google Cloud Platform. It is similar to Amazon S3 in that it stores data in containers called buckets, and you can access, share, and manage objects stored in a bucket. Once you have created a Google Cloud Storage bucket and uploaded data to it, you can access and save your data on Google Cloud Storage in a PySpark session.

Google Cloud Storage integrates with other Google Cloud Platform services. A typical workflow is to store data in Google Cloud Storage and process it with Google Cloud Dataproc. After you install GeoAnalytics Engine on Google Cloud Dataproc, use the following steps to access and save data on Google Cloud Storage:

  1. Access data stored in Cloud Storage in a PySpark session.

    You can access data stored in Google Cloud Storage using the Cloud Storage connector. The connector is installed automatically under the /usr/local/share/google/dataproc/lib/ directory when you use Google Cloud Dataproc. Data stored in Cloud Storage is accessed with the gs:// prefix. For example:

    Python
    # Read GeoParquet data in Google Cloud Storage
    df = spark.read.format("geoparquet").load("gs://<bucket_name>/<file_name>")
    

    For other Spark environments, you can install the connector manually by following the Google Cloud Storage Connector for Spark and Hadoop guide. A configuration sketch for that case follows these steps.

  2. Write data to Google Cloud Storage.

    Similarly, you can write data to Google Cloud Storage using the Cloud Storage connector:

    Python
    # Write GeoParquet data to Google Cloud Storage
    df.write.format("geoparquet").mode("overwrite").save("gs://<bucket_name>/<file_name>")
