Apache Spark natively supports reading and writing data directly to and from several different types of databases. This is accomplished using Java Database Connectivity, commonly referred to as JDBC. The Spark JDBC data source reads and writes data directly to and from databases using Spark DataFrames. By extension, this means that GeoAnalytics Engine can read and write data to any of the databases supported by Apache Spark. A few of the supported database types are:
- Microsoft SQL Server
- Oracle
- PostgreSQL
A JDBC driver is required to read and write data to and from a database with GeoAnalytics Engine. Each database type (e.g. PostgreSQL, Microsoft SQL Server, etc.) requires a specific JDBC driver. For example the JDBC driver for Microsoft SQL Server is different than the JDBC driver for PostgreSQL. The Maven repository contains the JDBC drivers required to connect to each of the databases supported by Apache Spark.
The table below outlines a few of the common JDBC data source options that are used when reading and writing to and from a database.
Data Source Option | Example | Description |
---|---|---|
url | .option("url", "jdbc | The JDBC connection URL for the database. |
dbtable | .option("dbtable", "database | The name of the database table to read from or write to. Either query or dbtable must be specified. |
query | .option("query", " | A SQL query used to read in a subset of the database table. Either query or dbtable must be specified. |
user | .option("user", "my | The username that will be used to connect to the database. |
password | .option("password", "my | The user password that will be used to connect to the database. |
driver | .option("driver", "com.microsoft.sqlserver.jdbc. | The class name of the JDBC driver that will be used to connect to the database. |
Database read
The Python code sample below creates a new DataFrame by reading in all of the database table records.
df = spark.read \
.format("jdbc") \
.option("url", "jdbc:postgresql://localhost:5432/sample") \
.option("dbtable", "locations") \
.option("user", "user1") \
.option("password", "test123") \
.option("driver", "org.postgresql.Driver") \
.load()
Database write
The Python code sample below writes the data from a DataFrame to a database table.
df.write \
.format("jdbc") \
.option("url", "jdbc:sqlserver://localhost\demoInstance;databaseName=sampleTwo") \
.option("dbtable", "places") \
.option("user", "username2") \
.option("password", "p@ssw0rd") \
.option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
.save()
The read and write to databases tutorial includes detailed examples of how to connect to, read from, and write to Microsoft SQL Server, Oracle, and PostgreSQL.
Usage notes
- Geometry type columns do not work with JDBC.
-
When writing, geometry columns need to be converted to a well-known binary or a well-known text column and then the geometry type column will need to be dropped from the DataFrame before writing it to a database table with JDBC. The following GeoAnalytics Spatial Type functions are used to convert the geometry type column to a type supported by JDBC:
- Well-known binary:
- Well-known text:
-
When reading in a geometry column from a database table, the geometry column needs to be in the well-known binary or the well-known text format. The following GeoAnalytics Spatial Type functions are used to convert a well-known binary or a well-known text column to a geometry type column:
- Well-known binary:
- Well-known text:
-