GeoAnalytics Engine can read and write data to and from databases that are supported by Apache Spark. This tutorial demonstrates how to read and write data to and from DataFrames for the following databases:
- Microsoft SQL Server
- Oracle
- PostgreSQL
Prerequisites
The following are required for this tutorial:
- A running Spark session configured with ArcGIS GeoAnalytics Engine.
- A Jupyter or JupyterLab notebook connected to your Spark session.
- A JDBC driver used to make a connection between GeoAnalytics Engine and the database.
Steps
JDBC drivers
A JDBC driver is required to read and write data to and from a database with GeoAnalytics Engine. Each database type (e.g. PostgreSQL, Microsoft SQL Server, etc.) requires a specific JDBC driver.
In this tutorial, the following JDBC drivers are used:
Database | JDBC Driver |
---|---|
Microsoft SQL Server | Spark MS SQL Connector |
Oracle | Ojdbc11 |
PostgreSQL | PostgreSQL JDBC Driver |
The database specific JDBC drivers can be included when starting up PySpark/GeoAnalytics Engine by using the packages flag and specifying a comma-delimited list of Maven coordinates.
Examples:
--packages group
Id :artifact Id :version --packages group
Id :artifact Id :version,group Id :artifact Id :version
Import and authorize GeoAnalytics Engine
-
In a Jupyter notebook, import
geoanalytics
and authorize the module using a username and password or a license file.PythonUse dark colors for code blocks Copy import geoanalytics geoanalytics.auth(username="user1", password="p@ssword")
Setup modules and data
-
Import the GeoAnalytics Engine Spatial Type functions and the
Point
geometry type. The Pythonuuid
module is also imported to create a unique database table name.PythonUse dark colors for code blocks Copy import uuid from geoanalytics.sql import functions as ST from geoanalytics.sql import Point
-
Create a DataFrame with columns for a unique ID and
Point
type geometry.PythonUse dark colors for code blocks Copy data = [(1, Point(10, 10)), (2, Point(20, 20)), (3, Point(30, 30))] df = spark.createDataFrame(data, ["unique_id", "geometry"]) df.printSchema() df.show()
ResultUse dark colors for code blocks Copy root |-- unique_id: long (nullable = true) |-- geometry: point (nullable = true) +---------+---------------+ |unique_id| geometry| +---------+---------------+ | 1|{"x":10,"y":10}| | 2|{"x":20,"y":20}| | 3|{"x":30,"y":30}| +---------+---------------+
-
Create a new well-known binary column from the
Point
geometry column and then drop thePoint
geometry column.PythonUse dark colors for code blocks Copy df = df.withColumn("geom_wkb", ST.as_binary("geometry")) \ .drop("geometry") df.printSchema() df.show()
ResultUse dark colors for code blocks Copy root |-- unique_id: long (nullable = true) |-- geom_wkb: binary (nullable = true) +---------+--------------------+ |unique_id| geom_wkb| +---------+--------------------+ | 1|[01 01 00 00 00 0...| | 2|[01 01 00 00 00 0...| | 3|[01 01 00 00 00 0...| +---------+--------------------+
Microsoft SQL Server
SQL Server Write
-
Write the DataFrame created in the Setup modules and data section to a new table within the Microsoft SQL Server database.
PythonUse dark colors for code blocks Copy # Create a unique table name using the UUID module table_name = f"df_write_{uuid.uuid4().hex[:6]}" # Write the DataFrame to the SQL Server database df.write \ .format("jdbc") \ .option("url", f"jdbc:sqlserver://{server_name}\\{instance_name};databaseName={database_name}") \ .option("dbtable", table_name) \ .option("user", username) \ .option("password", password) \ .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \ .save()
-
Use the Microsoft SQL Server STPointFromWKB geometry method to create a SQL Server point geometry type column from the well-known binary column. The SQL Query below will add an empty SQL Server geometry spatial data type column to the database table created in step 1 and then populate it with the results from
ST
.Point From WKB SQLUse dark colors for code blocks Copy -- Add a geometry column to the database table ALTER TABLE table_name ADD geom Geometry; -- Populate the geometry column with the results from STPointFromWKB and set the SRID UPDATE table_name SET geom=geometry::STPointFromWKB(geom_wkb, SRID); -- View the database table updates SELECT * FROM table_name;
ResultUse dark colors for code blocks Copy +---------+--------------------------------------------+----------------------------------------------+ |unique_id|geom_wkb |geom | +---------+--------------------------------------------+----------------------------------------------+ |3 |0x01010000000000000000003E400000000000003E40|0xE6100000010C0000000000003E400000000000003E40| |1 |0x010100000000000000000024400000000000002440|0xE6100000010C00000000000024400000000000002440| |2 |0x010100000000000000000034400000000000003440|0xE6100000010C00000000000034400000000000003440| +---------+--------------------------------------------+----------------------------------------------+
SQL Server Read
-
When reading from a database table, you can read all records or a subset. Both options are demonstrated below.
The Microsoft SQL Server STAsBinary method is used with the Spark
query
data source option to convert the SQL Server geometry spatial data type column to it's well-known binary format.Create a DataFrame by reading all records in the database table that was created in the SQL Server write example above.
PythonUse dark colors for code blocks Copy df = spark.read \ .format("jdbc") \ .option("url", f"jdbc:sqlserver://{server_name}\\{instance_name};databaseName={database_name}") \ .option("query", f"select unique_id, geom.STAsBinary() as wkb from {table_name}") \ .option("user", username) \ .option("password", password) \ .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \ .load() df.printSchema() df.show()
ResultUse dark colors for code blocks Copy root |-- unique_id: long (nullable = true) |-- wkb: binary (nullable = true) +---------+--------------------+ |unique_id| wkb| +---------+--------------------+ | 3|[01 01 00 00 00 0...| | 1|[01 01 00 00 00 0...| | 2|[01 01 00 00 00 0...| +---------+--------------------+
Create a DataFrame using a
where
SQL statement in the Sparkquery
data source option to read in only a subset of the database table that was created in the SQL Server write example above.PythonUse dark colors for code blocks Copy df = spark.read \ .format("jdbc") \ .option("url", f"jdbc:sqlserver://{server_name}\\{instance_name};databaseName={database_name}") \ .option("query", f"select unique_id, geom.STAsBinary() as wkb from {table_name} where unique_id = 1") \ .option("user", username) \ .option("password", password) \ .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \ .load() df.printSchema() df.show()
ResultUse dark colors for code blocks Copy root |-- unique_id: long (nullable = true) |-- wkb: binary (nullable = true) +---------+--------------------+ |unique_id| wkb| +---------+--------------------+ | 1|[01 01 00 00 00 0...| +---------+--------------------+
-
Create a new
Point
geometry column from the well-known binary column in the DataFrame using ST_PointFromBinary.PythonUse dark colors for code blocks Copy df = df.withColumn("geometry", ST.point_from_binary("wkb")) df.printSchema() df.show()
ResultUse dark colors for code blocks Copy root |-- unique_id: long (nullable = true) |-- wkb: binary (nullable = true) |-- geometry: point (nullable = true) +---------+--------------------+---------------+ |unique_id| wkb| geometry| +---------+--------------------+---------------+ | 3|[01 01 00 00 00 0...|{"x":30,"y":30}| | 1|[01 01 00 00 00 0...|{"x":10,"y":10}| | 2|[01 01 00 00 00 0...|{"x":20,"y":20}| +---------+--------------------+---------------+
Oracle
Oracle write
-
Write the DataFrame created in the Setup modules and data section to a new table within the Oracle database.
PythonUse dark colors for code blocks Copy # Create a unique table name using the UUID module table_name = f"df_write_{uuid.uuid4().hex[:6]}" # Write the DataFrame to the Oracle database df.write \ .format("jdbc") \ .option("url", f"jdbc:oracle:thin:@//{host}:{port}/{service_name}") \ .option("dbtable", table_name) \ .option("user", username) \ .option("password", password) \ .option("driver", "oracle.jdbc.driver.OracleDriver") \ .save()
-
Use the Oracle Spatial Data Object FROM_WKBGEOMETRY function to create an SDO_Geometry type column from the well-known binary column. The SQL Query below will add an empty Oracle SDO Geometry column to the database table created in step 1 and then populate it with the results from
FROM
._WKBGEOMETRY SQLUse dark colors for code blocks Copy -- Add a geometry column to the database table ALTER TABLE table_name ADD geom SDO_GEOMETRY; -- Populate the geometry column with the results from SDO_UTIL.FROM_WKBGEOMETRY UPDATE table_name SET geom=SDO_UTIL.FROM_WKBGEOMETRY("geom_wkb"); -- Set the SRID for the geometry column in the database UPDATE table_name tbl SET tbl.GEOM.SDO_SRID = SRID; -- Commit the changes to the database table commit; -- View the database table updates SELECT * FROM table_name;
ResultUse dark colors for code blocks Copy +---------+--------------------------------------------+----------------------------------+ |unique_id|geom_wkb |GEOM | +---------+--------------------------------------------+----------------------------------+ |2 |0x010100000000000000000034400000000000003440|{2001,4326,{20,20,null},null,null}| |1 |0x010100000000000000000024400000000000002440|{2001,4326,{10,10,null},null,null}| |3 |0x01010000000000000000003E400000000000003E40|{2001,4326,{30,30,null},null,null}| +---------+--------------------------------------------+----------------------------------+
Oracle Read
-
When reading from a database table, you can read all records or a subset. Both options are demonstrated below.
The Oracle Spatial Data Object TO_WKBGEOMETRY function to convert the Oracle SDO geometry column to it's well-known binary format.
Create a DataFrame by reading all records in the database table that was created in the Oracle write example above.
PythonUse dark colors for code blocks Copy df = spark.read \ .format("jdbc") \ .option("url", f"jdbc:oracle:thin:@//{host}:{port}/{service_name}") \ .option("query", f'select "unique_id", SDO_UTIL.TO_WKBGEOMETRY(GEOM) as wkb from {table_name}') \ .option("user", username) \ .option("password", password) \ .option("driver", "oracle.jdbc.driver.OracleDriver") \ .load() df.printSchema() df.show()
ResultUse dark colors for code blocks Copy root |-- unique_id: decimal(19,0) (nullable = true) |-- WKB: binary (nullable = true) +---------+--------------------+ |unique_id| WKB| +---------+--------------------+ | 2|[00 00 00 00 01 4...| | 3|[00 00 00 00 01 4...| | 1|[00 00 00 00 01 4...| +---------+--------------------+
Create a DataFrame using a
where
SQL statement in the Sparkquery
data source option to read in only a subset of the database table that was created in the Oracle write example above.PythonUse dark colors for code blocks Copy df = spark.read \ .format("jdbc") \ .option("url", f"jdbc:oracle:thin:@//{host}:{port}/{service_name}") \ .option("query", f'select "unique_id", SDO_UTIL.TO_WKBGEOMETRY(GEOM) as wkb from {table_name} where rownum<3') \ .option("user", username) \ .option("password", password) \ .option("driver", "oracle.jdbc.driver.OracleDriver") \ .load() df.printSchema() df.show()
ResultUse dark colors for code blocks Copy root |-- unique_id: decimal(19,0) (nullable = true) |-- WKB: binary (nullable = true) +---------+--------------------+ |unique_id| WKB| +---------+--------------------+ | 2|[00 00 00 00 01 4...| | 1|[00 00 00 00 01 4...| +---------+--------------------+
-
Create a new
Point
geometry column from the well-known binary column in the DataFrame using ST_PointFromBinary.PythonUse dark colors for code blocks Copy df = df.withColumn("geometry", ST.point_from_binary("wkb")) df.printSchema() df.show()
ResultUse dark colors for code blocks Copy root |-- unique_id: decimal(19,0) (nullable = true) |-- WKB: binary (nullable = true) |-- geometry: point (nullable = true) +---------+--------------------+---------------+ |unique_id| WKB| geometry| +---------+--------------------+---------------+ | 2|[00 00 00 00 01 4...|{"x":20,"y":20}| | 3|[00 00 00 00 01 4...|{"x":30,"y":30}| | 1|[00 00 00 00 01 4...|{"x":10,"y":10}| +---------+--------------------+---------------+
PostgreSQL
PostgreSQL Write
-
Write the DataFrame created in the Setup modules and data section to a new table within the PostgreSQL database.
PythonUse dark colors for code blocks Copy # Create a unique table name using the UUID module table_name = f"df_write_{uuid.uuid4().hex[:6]}" # Write the DataFrame to the PostgreSQL database df.write \ .format("jdbc") \ .option("url", f"jdbc:postgresql://{host}:{port}/{database}") \ .option("dbtable", table_name) \ .option("user", username) \ .option("password", password) \ .option("driver", "org.postgresql.Driver") \ .save()
-
Use the PostgreSQL ST_PointFromWKB function to create a POINT geometry type column from the well-known binary column. The SQL Query below will create a new database table that is a copy of the table created in step 1 but with an additional column for the PostgreSQL geometry type that was created using
ST
._Point From WKB SQLUse dark colors for code blocks Copy -- Create a new database table with a geometry column with the results from ST_PointFromWKB DROP TABLE IF EXISTS new_table_name; SELECT *, ST_PointFromWKB(geom_wkb, SRID) AS geom INTO new_table_name FROM table_name; -- View the database table updates SELECT * FROM new_table_name;
ResultUse dark colors for code blocks Copy +---------+--------------------------------------------+--------------------------------------------------------------------------------+ |unique_id|geom_wkb |geom | +---------+--------------------------------------------+--------------------------------------------------------------------------------+ |2 |0x010100000000000000000034400000000000003440|280000000100000001000400E61000000C0000000100000080A09B9FB91880A09B9FB91800000000| |3 |0x01010000000000000000003E400000000000003E40|280000000100000001000400E61000000C0000000100000080B0FADF831980B0FADF831900000000| |1 |0x010100000000000000000024400000000000002440|280000000100000001000400E61000000C000000010000008090BCDEEE178090BCDEEE1700000000| +---------+--------------------------------------------+--------------------------------------------------------------------------------+
PostgreSQL Read
-
When reading from a database table, you can read all records or a subset. Both options are demonstrated below.
The PostgreSQL STAsBinary function is used with the Spark
query
data source option to convert the PostgreSQL geometry column to it's well-known binary format.Create a DataFrame by reading all records in the database table that was created in the PostgreSQL write example above.
PythonUse dark colors for code blocks Copy df = spark.read \ .format("jdbc") \ .option("url", f"jdbc:postgresql://{host}:{port}/{database}") \ .option("query", f"select unique_id, ST_AsBinary(geom) as wkb from {table_name}") \ .option("user", username) \ .option("password", password) \ .option("driver", "org.postgresql.Driver") \ .load() df.printSchema() df.show()
ResultUse dark colors for code blocks Copy root |-- unique_id: long (nullable = true) |-- wkb: binary (nullable = true) +---------+--------------------+ |unique_id| wkb| +---------+--------------------+ | 3|[01 01 00 00 00 1...| | 1|[01 01 00 00 00 2...| | 2|[01 01 00 00 00 1...| +---------+--------------------+
Create a DataFrame using a
where
SQL statement in the Sparkquery
data source option to read in only a subset of the database table that was created in the PostgreSQL write example above.PythonUse dark colors for code blocks Copy df = spark.read \ .format("jdbc") \ .option("url", f"jdbc:postgresql://{host}:{port}/{database}") \ .option("query", f"select unique_id, ST_AsBinary(geom) as wkb from {table_name} where unique_id > 1") \ .option("user", username) \ .option("password", password) \ .option("driver", "org.postgresql.Driver") \ .load() df.printSchema() df.show()
ResultUse dark colors for code blocks Copy root |-- unique_id: long (nullable = true) |-- wkb: binary (nullable = true) +---------+--------------------+ |unique_id| wkb| +---------+--------------------+ | 3|[01 01 00 00 00 1...| | 2|[01 01 00 00 00 1...| +---------+--------------------+
-
Create a new
Point
geometry column from the well-known binary column in the DataFrame using ST_PointFromBinary.PythonUse dark colors for code blocks Copy df = df.withColumn("geometry", ST.point_from_binary("wkb")) df.printSchema() df.show()
ResultUse dark colors for code blocks Copy root |-- unique_id: long (nullable = true) |-- wkb: binary (nullable = true) |-- geometry: point (nullable = true) +---------+--------------------+--------------------+ |unique_id| wkb| geometry| +---------+--------------------+--------------------+ | 3|[01 01 00 00 00 1...|{"x":30.000000000...| | 1|[01 01 00 00 00 2...|{"x":10.000000000...| | 2|[01 01 00 00 00 1...|{"x":20.000000000...| +---------+--------------------+--------------------+
What's next?
Learn about how to read in other data types or analyze your data through SQL functions and analysis tools: