Learn how to read from, manage, and write to shapefiles. A shapefile data source behaves like other file formats in Spark (Parquet, ORC, and so on): you can read data from it and write data to it.
In this tutorial you will read from shapefiles, write results to new shapefiles, and partition data logically.
Read shapefiles
Prepare your input shapefile
- Download the sample shapefile from ArcGIS Online.
- Store it in a local folder on your machine, for example c:\data\shapefile_demo.
Set up the workspace
- Import the required modules.

  ```python
  # Import the required modules
  import os, tempfile
  import geoanalytics
  from geoanalytics.sql import functions as ST

  geoanalytics.auth(username="user1", password="p@ssword")
  ```
- Set the output directory to write your formatted data to.

  ```python
  # Set the workspace
  output_dir = os.path.normpath(r"C:/data/shapefile_demo")
  ```
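The steps that follow assume this folder exists and that `spark` is an available SparkSession (for example, in a notebook configured for GeoAnalytics Engine). If the folder might not exist yet, you can create it first; this small convenience step is not part of the original workflow:

```python
# Create the workspace folder if it doesn't already exist
os.makedirs(output_dir, exist_ok=True)
```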
Read from your shapefile and display columns of interest
- Read the shapefile into a DataFrame. Note that you specify the folder containing the shapefile, not the full path to the .shp file. A folder can contain multiple shapefiles with the same schema and be read as a single DataFrame.

  ```python
  # Read the shapefile into a DataFrame
  shapefileDF = spark.read.format("shapefile").load(r"c:\data\shapefile_demo\Mineplants")
  ```
- Visualize a subset of the columns, including the geometry column, by showing a sample of five rows from the input.

  ```python
  # Sample your DataFrame
  shapefileDF.select("commodity", "COMPANY_NA", "geometry").show(5, truncate=False)
  ```

  Result:

  ```
  +---------+-------------------------------+------------------------+
  |commodity|COMPANY_NA                     |geometry                |
  +---------+-------------------------------+------------------------+
  |Aluminum |Alcoa Inc                      |{"x":-87.336,"y":37.915}|
  |Aluminum |Century Aluminum Co            |{"x":-86.786,"y":37.942}|
  |Aluminum |Alcan Inc                      |{"x":-87.5,"y":37.65}   |
  |Aluminum |Ormet Corp                     |{"x":-90.923,"y":30.138}|
  |Aluminum |Kaiser Aluminum & Chemical Corp|{"x":-90.755,"y":30.049}|
  +---------+-------------------------------+------------------------+
  only showing top 5 rows
  ```
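Before selecting a subset of columns, it can help to list every column and its type. `printSchema()` is a standard Spark DataFrame method; the exact columns shown depend on the downloaded sample data:

```python
# Inspect the full schema, including the geometry column
shapefileDF.printSchema()
```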
Write shapefiles
Write a DataFrame to a shapefile
Define a dataset, create a DataFrame from it, and write it to a shapefile.
- Define your own dataset.

  ```python
  # Define a point dataset
  myPoints = [(0, -4655711.2806, 222503.076),
              (1, -4570473.292, 322503.076),
              (2, -4830838.089, 146545.398),
              (3, -4570771.608, 116617.112),
              (4, -4682228.671, 173377.654)]
  fields = ["id", "latitude", "longitude"]
  ```
- Create a DataFrame from your dataset definition.

  ```python
  # Create a DataFrame
  df = spark.createDataFrame(myPoints, fields)

  # Enable geometry
  df = df.withColumn("geometry", ST.srid(ST.point("longitude", "latitude"), 6329)) \
         .st.set_geometry_field("geometry")
  ```
- Write your DataFrame to a shapefile.

  ```python
  # Write to a single shapefile - update the path to a location accessible to you
  df.coalesce(1).write.format("shapefile").mode("overwrite").save(r"C:\data\output_shapefile")
  ```
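To confirm the write succeeded, you can read the new shapefile back and count its rows. This is a minimal check that assumes the output path used above:

```python
# Read the shapefile back and verify the row count
checkDF = spark.read.format("shapefile").load(r"C:\data\output_shapefile")
print(checkDF.count())  # expect 5 rows
```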
Merge shapefiles with different schemas
Use schema merging when a collection of datasets contains varying schemas. For example, if data were collected over time and each month's dataset introduced a new column, schema merging resolves those differences into a single combined schema.
- If you haven't already downloaded the sample shapefile, follow the steps above to prepare your input shapefile.

  ```python
  # Read the shapefile into a DataFrame
  shapefileDF = spark.read.format("shapefile").load(r"c:\data\shapefile_demo\Mineplants")
  ```
- Set the output location for the shapefiles. These are the shapefiles whose schemas will be merged to form a single DataFrame.

  ```python
  # Set the output path to store your shapefiles
  output_shapefiles = os.path.join(output_dir, "merged_shapefile")
  ```
- Create three subset shapefiles. Specify a value of 1 for .coalesce() so that each query result is written to a single shapefile: coalescing reduces the number of partitions, and because one shapefile is written per partition by default, fewer partitions mean fewer output shapefiles. Each shapefile will have three columns in common (geometry, id, and commodity) and one column with a unique name.
  - Rows with id values between 1 and 5 will have a column named site_name.

    ```python
    # Create the first subset shapefile
    shapefileDF.where("id <= 5").select("id", "commodity", "site_name", "geometry") \
        .coalesce(1).write.format("shapefile").mode("overwrite").save(output_shapefiles)
    ```
  - Rows with id values between 6 and 10 will have a column named company_na.

    ```python
    # Create the second subset shapefile
    shapefileDF.where("id between 6 and 10").select("id", "commodity", "company_na", "geometry") \
        .coalesce(1).write.format("shapefile").mode("append").save(output_shapefiles)
    ```
  - Rows with id values between 11 and 15 will have a column named state_loca.

    ```python
    # Create the third subset shapefile
    shapefileDF.where("id between 11 and 15").select("id", "commodity", "state_loca", "geometry") \
        .coalesce(1).write.format("shapefile").mode("append").save(output_shapefiles)
    ```
- Use schema merging to create a DataFrame with a single, combined schema.

  ```python
  # Merge schemas for the three subset shapefiles
  spark.read.format("shapefile").option("mergeSchemas", "true").load(output_shapefiles) \
      .orderBy("id").show()
  ```
  Result:

  ```
  +---+---------+--------------------+--------------------+--------------------+--------------+
  | id|commodity|          company_na|            geometry|           site_name|    state_loca|
  +---+---------+--------------------+--------------------+--------------------+--------------+
  |  1| Aluminum|                null|{"x":-87.336,"y":...|Evansville (Warri...|          null|
  |  2| Aluminum|                null|{"x":-86.786,"y":...|  Hawesville Smelter|          null|
  |  3| Aluminum|                null|{"x":-87.5,"y":37...|      Sebree Smelter|          null|
  |  4| Aluminum|                null|{"x":-90.923,"y":...|   Burnside Refinery|          null|
  |  5| Aluminum|                null|{"x":-90.755,"y":...|   Gramercy Refinery|          null|
  |  6| Aluminum|           Alcoa Inc|{"x":-77.469,"y":...|                null|          null|
  |  7| Aluminum|Noranda Aluminum Inc|{"x":-89.564,"y":...|                null|          null|
  |  8| Aluminum|Columbia Falls Al...|{"x":-114.139,"y"...|                null|          null|
  |  9| Aluminum|           Alcoa Inc|{"x":-74.75,"y":4...|                null|          null|
  | 10| Aluminum|           Alcoa Inc|{"x":-74.881,"y":...|                null|          null|
  | 11| Aluminum|                null|{"x":-80.873,"y":...|                null|          Ohio|
  | 12| Aluminum|                null|{"x":-80.05,"y":3...|                null|South Carolina|
  | 13| Aluminum|                null|{"x":-83.968,"y":...|                null|     Tennessee|
  | 14| Aluminum|                null|{"x":-96.554,"y":...|                null|         Texas|
  | 15| Aluminum|                null|{"x":-97.076,"y":...|                null|          Texas|
  +---+---------+--------------------+--------------------+--------------------+--------------+
  ```
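Because each subset contributes a different name column, the merged DataFrame is sparse. As an optional follow-on (not part of the tutorial's workflow), you could collapse the three name columns into one with Spark's coalesce function, which returns the first non-null value per row:

```python
from pyspark.sql.functions import coalesce

# Collapse the three name columns into a single "name" column
mergedDF = spark.read.format("shapefile").option("mergeSchemas", "true").load(output_shapefiles)
mergedDF.withColumn("name", coalesce("site_name", "company_na", "state_loca")) \
    .select("id", "commodity", "name", "geometry").orderBy("id").show(5)
```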
Partition your shapefile into logical groups
Datasets can be partitioned by values within one or more columns. Each unique value in a column becomes a directory named <column_name>=<value>. In this example, you will logically separate the data based on column values for spatial regions.

Without partitioning and coalescing when writing data, by default you will end up with one shapefile per partition of the DataFrame. Partitioning your data logically enables you to read, write, and store it in meaningful storage structures.
- Specify the location to output your newly partitioned data.

  ```python
  # Set the output path to store your partitioned datasets
  partitioned_output = os.path.join(output_dir, "partitioned")
  ```
- Partition your data based on the values of the state_loca and commodity columns.

  ```python
  # Partition your data by state and resource type
  shapefileDF.write.format("shapefile").partitionBy("state_loca", "commodity") \
      .mode("overwrite").save(partitioned_output)
  ```
- The result will be a new folder for each state, each containing a folder per commodity found in that state. To preview the results of the partition, list the first thirty newly created directories.

  ```python
  # Print out the first 30 partitions to visualize results
  for index, (path, names, filenames) in enumerate(os.walk(partitioned_output)):
      print(os.path.relpath(path, output_dir))
      if index == 30:
          break
  ```

  Result:

  ```
  partitioned
  partitioned\state_loca=Alabama
  partitioned\state_loca=Alabama\commodity=Bentonite
  partitioned\state_loca=Alabama\commodity=Cement
  partitioned\state_loca=Alabama\commodity=Common%20Clay%20and%20Shale
  partitioned\state_loca=Alabama\commodity=Crushed%20Stone
  partitioned\state_loca=Alabama\commodity=Dimension%20Stone
  partitioned\state_loca=Alabama\commodity=Gypsum
  partitioned\state_loca=Alabama\commodity=Iron%20Oxide%20Pigments
  partitioned\state_loca=Alabama\commodity=Kaolin
  partitioned\state_loca=Alabama\commodity=Lime
  partitioned\state_loca=Alabama\commodity=Perlite
  partitioned\state_loca=Alabama\commodity=Salt
  partitioned\state_loca=Alabama\commodity=Sand%20and%20Gravel
  partitioned\state_loca=Alabama\commodity=Silicon
  partitioned\state_loca=Alabama\commodity=Sulfur
  partitioned\state_loca=Alaska
  partitioned\state_loca=Alaska\commodity=Crushed%20Stone
  partitioned\state_loca=Alaska\commodity=Germanium
  partitioned\state_loca=Alaska\commodity=Gold
  partitioned\state_loca=Alaska\commodity=Lead
  partitioned\state_loca=Alaska\commodity=Sand%20and%20Gravel
  partitioned\state_loca=Alaska\commodity=Silver
  partitioned\state_loca=Alaska\commodity=Zinc
  partitioned\state_loca=Arizona
  partitioned\state_loca=Arizona\commodity=Bentonite
  partitioned\state_loca=Arizona\commodity=Cement
  partitioned\state_loca=Arizona\commodity=Common%20Clay%20and%20Shale
  partitioned\state_loca=Arizona\commodity=Copper
  partitioned\state_loca=Arizona\commodity=Crushed%20Stone
  partitioned\state_loca=Arizona\commodity=Gemstones
  ```
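Because the shapefile data source behaves like other Spark file formats, loading the partitioned root should recover state_loca and commodity as columns through Spark's standard partition discovery. This is a sketch under that assumption, not a confirmed behavior of the shapefile reader:

```python
# Read the partitioned dataset back; the partition columns are
# reconstructed from the directory names (assumes partition discovery
# works here as it does for formats like Parquet)
partitionedDF = spark.read.format("shapefile").load(partitioned_output)
partitionedDF.where("state_loca = 'Alaska'").select("commodity", "geometry").show(5)
```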
What's next?
Learn how to read in other data types or analyze your data with SQL functions and analysis tools: