Read and write to databases

GeoAnalytics Engine can read and write data to and from databases that are supported by Apache Spark. This tutorial demonstrates how to read and write data to and from DataFrames for the following databases:

Microsoft SQL Server
Oracle
PostgreSQL

Prerequisites

The following are required for this tutorial:

A running Spark session configured with ArcGIS GeoAnalytics Engine.
A Jupyter or JupyterLab notebook connected to your Spark session.
A JDBC driver used to make a connection between GeoAnalytics Engine and the database.

Steps

JDBC drivers

A JDBC driver is required to read and write data to and from a database with GeoAnalytics Engine. Each database type (e.g. PostgreSQL, Microsoft SQL Server, etc.) requires a specific JDBC driver.

In this tutorial, the following JDBC drivers are used:

Database	JDBC Driver
Microsoft SQL Server	Spark MS SQL Connector
Oracle	Ojdbc11
PostgreSQL	PostgreSQL JDBC Driver

The database specific JDBC drivers can be included when starting up PySpark/GeoAnalytics Engine by using the packages flag and specifying a comma-delimited list of Maven coordinates.

Examples:

--packages groupId:artifactId:version
--packages groupId:artifactId:version,groupId:artifactId:version

Import and authorize GeoAnalytics Engine

In a Jupyter notebook, import geoanalytics and authorize the module using a username and password or a license file.

Python
Use dark colors for code blocksCopy
import geoanalytics
geoanalytics.auth(username="user1", password="p@ssword")

Setup modules and data

Import the GeoAnalytics Engine Spatial Type functions and the Point geometry type. The Python uuid module is also imported to create a unique database table name.

Python
Use dark colors for code blocksCopy
import uuid
from geoanalytics.sql import functions as ST
from geoanalytics.sql import Point

Create a DataFrame with columns for a unique ID and Point type geometry.

Python
Use dark colors for code blocksCopy
data = [(1, Point(10, 10)),
       (2, Point(20, 20)),
       (3, Point(30, 30))]

df = spark.createDataFrame(data, ["unique_id", "geometry"])
df.printSchema()
df.show()

Result
Use dark colors for code blocksCopy
root
 |-- unique_id: long (nullable = true)
 |-- geometry: point (nullable = true)

+---------+---------------+
|unique_id|       geometry|
+---------+---------------+
|        1|{"x":10,"y":10}|
|        2|{"x":20,"y":20}|
|        3|{"x":30,"y":30}|
+---------+---------------+

Create a new well-known binary column from the Point geometry column and then drop the Point geometry column.

Python
Use dark colors for code blocksCopy
df = df.withColumn("geom_wkb", ST.as_binary("geometry")) \
       .drop("geometry")

df.printSchema()
df.show()

Result
Use dark colors for code blocksCopy
root
 |-- unique_id: long (nullable = true)
 |-- geom_wkb: binary (nullable = true)

+---------+--------------------+
|unique_id|            geom_wkb|
+---------+--------------------+
|        1|[01 01 00 00 00 0...|
|        2|[01 01 00 00 00 0...|
|        3|[01 01 00 00 00 0...|
+---------+--------------------+

Microsoft SQL Server

SQL Server Write

Write the DataFrame created in the Setup modules and data section to a new table within the Microsoft SQL Server database.

Python
Use dark colors for code blocksCopy
# Create a unique table name using the UUID module
table_name = f"df_write_{uuid.uuid4().hex[:6]}"

# Write the DataFrame to the SQL Server database
df.write \
  .format("jdbc") \
  .option("url", f"jdbc:sqlserver://{server_name}\\{instance_name};databaseName={database_name}") \
  .option("dbtable", table_name) \
  .option("user", username) \
  .option("password", password) \
  .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
  .save()

Use the Microsoft SQL Server STPointFromWKB geometry method to create a SQL Server point geometry type column from the well-known binary column. The SQL Query below will add an empty SQL Server geometry spatial data type column to the database table created in step 1 and then populate it with the results from STPointFromWKB.

SQL
Use dark colors for code blocksCopy
-- Add a geometry column to the database table
ALTER TABLE table_name
ADD geom Geometry;

-- Populate the geometry column with the results from STPointFromWKB and set the SRID
UPDATE table_name
SET geom=geometry::STPointFromWKB(geom_wkb, SRID);

-- View the database table updates
SELECT * FROM table_name;

Result
Use dark colors for code blocksCopy
 +---------+--------------------------------------------+----------------------------------------------+
 |unique_id|geom_wkb                                    |geom                                          |
 +---------+--------------------------------------------+----------------------------------------------+
 |3        |0x01010000000000000000003E400000000000003E40|0xE6100000010C0000000000003E400000000000003E40|
 |1        |0x010100000000000000000024400000000000002440|0xE6100000010C00000000000024400000000000002440|
 |2        |0x010100000000000000000034400000000000003440|0xE6100000010C00000000000034400000000000003440|
 +---------+--------------------------------------------+----------------------------------------------+

SQL Server Read

When reading from a database table, you can read all records or a subset. Both options are demonstrated below.

The Microsoft SQL Server STAsBinary method is used with the Spark query data source option to convert the SQL Server geometry spatial data type column to it's well-known binary format.

Create a DataFrame by reading all records in the database table that was created in the SQL Server write example above.

Python
Use dark colors for code blocksCopy
df = spark.read \
          .format("jdbc") \
          .option("url", f"jdbc:sqlserver://{server_name}\\{instance_name};databaseName={database_name}") \
          .option("query", f"select unique_id, geom.STAsBinary() as wkb from {table_name}") \
          .option("user", username) \
          .option("password", password) \
          .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
          .load()

df.printSchema()
df.show()

Result
Use dark colors for code blocksCopy
 root
  |-- unique_id: long (nullable = true)
  |-- wkb: binary (nullable = true)

 +---------+--------------------+
 |unique_id|                 wkb|
 +---------+--------------------+
 |        3|[01 01 00 00 00 0...|
 |        1|[01 01 00 00 00 0...|
 |        2|[01 01 00 00 00 0...|
 +---------+--------------------+

Create a DataFrame using a where SQL statement in the Spark query data source option to read in only a subset of the database table that was created in the SQL Server write example above.

Python
Use dark colors for code blocksCopy
df = spark.read \
          .format("jdbc") \
          .option("url", f"jdbc:sqlserver://{server_name}\\{instance_name};databaseName={database_name}") \
          .option("query", f"select unique_id, geom.STAsBinary() as wkb from {table_name} where unique_id = 1") \
          .option("user", username) \
          .option("password", password) \
          .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
          .load()

df.printSchema()
df.show()

Result
Use dark colors for code blocksCopy
 root
  |-- unique_id: long (nullable = true)
  |-- wkb: binary (nullable = true)

 +---------+--------------------+
 |unique_id|                 wkb|
 +---------+--------------------+
 |        1|[01 01 00 00 00 0...|
 +---------+--------------------+

Create a new Point geometry column from the well-known binary column in the DataFrame using ST_PointFromBinary.

Python
Use dark colors for code blocksCopy
df = df.withColumn("geometry", ST.point_from_binary("wkb"))
df.printSchema()
df.show()

Result
Use dark colors for code blocksCopy
 root
  |-- unique_id: long (nullable = true)
  |-- wkb: binary (nullable = true)
  |-- geometry: point (nullable = true)

 +---------+--------------------+---------------+
 |unique_id|                 wkb|       geometry|
 +---------+--------------------+---------------+
 |        3|[01 01 00 00 00 0...|{"x":30,"y":30}|
 |        1|[01 01 00 00 00 0...|{"x":10,"y":10}|
 |        2|[01 01 00 00 00 0...|{"x":20,"y":20}|
 +---------+--------------------+---------------+

Oracle

Oracle write

Write the DataFrame created in the Setup modules and data section to a new table within the Oracle database.

Python
Use dark colors for code blocksCopy
# Create a unique table name using the UUID module
table_name = f"df_write_{uuid.uuid4().hex[:6]}"

# Write the DataFrame to the Oracle database
df.write \
  .format("jdbc") \
  .option("url", f"jdbc:oracle:thin:@//{host}:{port}/{service_name}") \
  .option("dbtable", table_name) \
  .option("user", username) \
  .option("password", password) \
  .option("driver", "oracle.jdbc.driver.OracleDriver") \
  .save()

Use the Oracle Spatial Data Object FROM_WKBGEOMETRY function to create an SDO_Geometry type column from the well-known binary column. The SQL Query below will add an empty Oracle SDO Geometry column to the database table created in step 1 and then populate it with the results from FROM_WKBGEOMETRY.

SQL
Use dark colors for code blocksCopy
-- Add a geometry column to the database table
ALTER TABLE table_name
ADD geom SDO_GEOMETRY;

-- Populate the geometry column with the results from SDO_UTIL.FROM_WKBGEOMETRY
UPDATE table_name
SET geom=SDO_UTIL.FROM_WKBGEOMETRY("geom_wkb");

-- Set the SRID for the geometry column in the database
UPDATE table_name tbl
SET tbl.GEOM.SDO_SRID = SRID;

-- Commit the changes to the database table
commit;

-- View the database table updates
SELECT * FROM table_name;

Result
Use dark colors for code blocksCopy
+---------+--------------------------------------------+----------------------------------+
|unique_id|geom_wkb                                    |GEOM                              |
+---------+--------------------------------------------+----------------------------------+
|2        |0x010100000000000000000034400000000000003440|{2001,4326,{20,20,null},null,null}|
|1        |0x010100000000000000000024400000000000002440|{2001,4326,{10,10,null},null,null}|
|3        |0x01010000000000000000003E400000000000003E40|{2001,4326,{30,30,null},null,null}|
+---------+--------------------------------------------+----------------------------------+

Oracle Read

When reading from a database table, you can read all records or a subset. Both options are demonstrated below.

The Oracle Spatial Data Object TO_WKBGEOMETRY function to convert the Oracle SDO geometry column to it's well-known binary format.

Create a DataFrame by reading all records in the database table that was created in the Oracle write example above.

Python
Use dark colors for code blocksCopy
df = spark.read \
          .format("jdbc") \
          .option("url", f"jdbc:oracle:thin:@//{host}:{port}/{service_name}") \
          .option("query", f'select "unique_id", SDO_UTIL.TO_WKBGEOMETRY(GEOM) as wkb from {table_name}') \
          .option("user", username) \
          .option("password", password) \
          .option("driver", "oracle.jdbc.driver.OracleDriver") \
          .load()

df.printSchema()
df.show()

Result
Use dark colors for code blocksCopy
 root
  |-- unique_id: decimal(19,0) (nullable = true)
  |-- WKB: binary (nullable = true)

 +---------+--------------------+
 |unique_id|                 WKB|
 +---------+--------------------+
 |        2|[00 00 00 00 01 4...|
 |        3|[00 00 00 00 01 4...|
 |        1|[00 00 00 00 01 4...|
 +---------+--------------------+

Create a DataFrame using a where SQL statement in the Spark query data source option to read in only a subset of the database table that was created in the Oracle write example above.

Python
Use dark colors for code blocksCopy
df = spark.read \
        .format("jdbc") \
        .option("url", f"jdbc:oracle:thin:@//{host}:{port}/{service_name}") \
        .option("query", f'select "unique_id", SDO_UTIL.TO_WKBGEOMETRY(GEOM) as wkb from {table_name} where rownum<3') \
        .option("user", username) \
        .option("password", password) \
        .option("driver", "oracle.jdbc.driver.OracleDriver") \
        .load()

df.printSchema()
df.show()

Result
Use dark colors for code blocksCopy
 root
  |-- unique_id: decimal(19,0) (nullable = true)
  |-- WKB: binary (nullable = true)

 +---------+--------------------+
 |unique_id|                 WKB|
 +---------+--------------------+
 |        2|[00 00 00 00 01 4...|
 |        1|[00 00 00 00 01 4...|
 +---------+--------------------+

Create a new Point geometry column from the well-known binary column in the DataFrame using ST_PointFromBinary.

Python
Use dark colors for code blocksCopy
df = df.withColumn("geometry", ST.point_from_binary("wkb"))
df.printSchema()
df.show()

Result
Use dark colors for code blocksCopy
 root
  |-- unique_id: decimal(19,0) (nullable = true)
  |-- WKB: binary (nullable = true)
  |-- geometry: point (nullable = true)

 +---------+--------------------+---------------+
 |unique_id|                 WKB|       geometry|
 +---------+--------------------+---------------+
 |        2|[00 00 00 00 01 4...|{"x":20,"y":20}|
 |        3|[00 00 00 00 01 4...|{"x":30,"y":30}|
 |        1|[00 00 00 00 01 4...|{"x":10,"y":10}|
 +---------+--------------------+---------------+

PostgreSQL

PostgreSQL Write

Write the DataFrame created in the Setup modules and data section to a new table within the PostgreSQL database.

Python
Use dark colors for code blocksCopy
# Create a unique table name using the UUID module
table_name = f"df_write_{uuid.uuid4().hex[:6]}"

# Write the DataFrame to the PostgreSQL database
df.write \
  .format("jdbc") \
  .option("url", f"jdbc:postgresql://{host}:{port}/{database}") \
  .option("dbtable", table_name) \
  .option("user", username) \
  .option("password", password) \
  .option("driver", "org.postgresql.Driver") \
  .save()

Use the PostgreSQL ST_PointFromWKB function to create a POINT geometry type column from the well-known binary column. The SQL Query below will create a new database table that is a copy of the table created in step 1 but with an additional column for the PostgreSQL geometry type that was created using ST_PointFromWKB.

SQL
Use dark colors for code blocksCopy
-- Create a new database table with a geometry column with the results from ST_PointFromWKB
DROP TABLE IF EXISTS new_table_name;
SELECT *, ST_PointFromWKB(geom_wkb, SRID) AS geom INTO new_table_name FROM table_name;

-- View the database table updates
SELECT * FROM new_table_name;

Result
Use dark colors for code blocksCopy
 +---------+--------------------------------------------+--------------------------------------------------------------------------------+
 |unique_id|geom_wkb                                    |geom                                                                            |
 +---------+--------------------------------------------+--------------------------------------------------------------------------------+
 |2        |0x010100000000000000000034400000000000003440|280000000100000001000400E61000000C0000000100000080A09B9FB91880A09B9FB91800000000|
 |3        |0x01010000000000000000003E400000000000003E40|280000000100000001000400E61000000C0000000100000080B0FADF831980B0FADF831900000000|
 |1        |0x010100000000000000000024400000000000002440|280000000100000001000400E61000000C000000010000008090BCDEEE178090BCDEEE1700000000|
 +---------+--------------------------------------------+--------------------------------------------------------------------------------+

PostgreSQL Read

When reading from a database table, you can read all records or a subset. Both options are demonstrated below.

The PostgreSQL STAsBinary function is used with the Spark query data source option to convert the PostgreSQL geometry column to it's well-known binary format.

Create a DataFrame by reading all records in the database table that was created in the PostgreSQL write example above.

Python
Use dark colors for code blocksCopy
df = spark.read \
        .format("jdbc") \
        .option("url", f"jdbc:postgresql://{host}:{port}/{database}") \
        .option("query", f"select unique_id, ST_AsBinary(geom) as wkb from {table_name}") \
        .option("user", username) \
        .option("password", password) \
        .option("driver", "org.postgresql.Driver") \
        .load()

df.printSchema()
df.show()

Result
Use dark colors for code blocksCopy
 root
  |-- unique_id: long (nullable = true)
  |-- wkb: binary (nullable = true)

 +---------+--------------------+
 |unique_id|                 wkb|
 +---------+--------------------+
 |        3|[01 01 00 00 00 1...|
 |        1|[01 01 00 00 00 2...|
 |        2|[01 01 00 00 00 1...|
 +---------+--------------------+

Create a DataFrame using a where SQL statement in the Spark query data source option to read in only a subset of the database table that was created in the PostgreSQL write example above.

Python
Use dark colors for code blocksCopy
df = spark.read \
        .format("jdbc") \
        .option("url", f"jdbc:postgresql://{host}:{port}/{database}") \
        .option("query", f"select unique_id, ST_AsBinary(geom) as wkb from {table_name}  where unique_id > 1") \
        .option("user", username) \
        .option("password", password) \
        .option("driver", "org.postgresql.Driver") \
        .load()

df.printSchema()
df.show()

Result
Use dark colors for code blocksCopy
 root
  |-- unique_id: long (nullable = true)
  |-- wkb: binary (nullable = true)

 +---------+--------------------+
 |unique_id|                 wkb|
 +---------+--------------------+
 |        3|[01 01 00 00 00 1...|
 |        2|[01 01 00 00 00 1...|
 +---------+--------------------+

Create a new Point geometry column from the well-known binary column in the DataFrame using ST_PointFromBinary.

Python
Use dark colors for code blocksCopy
df = df.withColumn("geometry", ST.point_from_binary("wkb"))
df.printSchema()
df.show()

Result
Use dark colors for code blocksCopy
 root
  |-- unique_id: long (nullable = true)
  |-- wkb: binary (nullable = true)
  |-- geometry: point (nullable = true)

 +---------+--------------------+--------------------+
 |unique_id|                 wkb|            geometry|
 +---------+--------------------+--------------------+
 |        3|[01 01 00 00 00 1...|{"x":30.000000000...|
 |        1|[01 01 00 00 00 2...|{"x":10.000000000...|
 |        2|[01 01 00 00 00 1...|{"x":20.000000000...|
 +---------+--------------------+--------------------+

What's next?

Learn about how to read in other data types or analyze your data through SQL functions and analysis tools: