Geocode converts addresses into geographic coordinates. This process requires a Spark DataFrame containing the addresses you want to geocode and a locator. The tool matches the addresses against reference data in a locator and returns points representing the address locations along with other output columns.
Usage notes
- You must set at least one address field using setAddressFields(). When multiple fields are passed into the setter, the tool concatenates them into a single string field and uses that as the address field in the tool execution. The order of the fields passed to setAddressFields() matters because different results may be returned depending on the order. For the best results, order the address fields the way a standardized address is constructed in that region. For example, addresses in the United States should follow the format {house number} {street} {city} {state} {zip}.
- The fields from the input DataFrame will always be included in the output. The result fields in the output DataFrame are determined by the predefined_set parameter in the setOutFields() setter. The following options are supported:
  - LocationOnly: Only returns a field called geocode_location, which contains points representing address locations.
  - Minimal: Returns the geocode_location, Status, Score, Match_addr, and Addr_type fields. This is the default option.
  - MinimalAndUserFields: Returns the fields defined in Minimal and the custom output fields available in the locator. User-defined fields can be configured during the process of creating a locator in ArcGIS Pro. For more information about locators, read the geocoding core concept.
  - All: Returns all available output fields, including any custom output fields defined in your locator.
- When an input DataFrame contains a field that has the same name as one of the output fields in the geocoded result, the output field is automatically renamed with a suffix of "1". For example, if a field named Address already exists in the input DataFrame, the result DataFrame will have a field named Address from the input and a field named Address1 representing the output field.
- The count of the input records will be equal to the count of the output records. One result will be returned for each input record:
  - When matched, the Status field will be M and the rest of the output fields will be populated with the matching candidate's information. The Ties field will return null.
  - When unmatched, the Status field will be U and the rest of the output fields will be null.
  - When tied, the Status field will be T, meaning there is more than one location with the same best match score. One of the tied records will be returned and will populate the output fields. All tied records will be stored in a new field called Ties.
- The maximum number of tied records returned for each address is determined by the MaxCandidates value stored in the locator. For example, if MaxCandidates is 50, then the maximum number of tied geocoded results that can be returned for a single input address is 50. If all or most of the input addresses are returned with the maximum number of tied outputs, try the following to narrow down the search criteria:
  - Specify more input address fields in setAddressFields(). For example, specifying Address and City fields will likely yield better results than specifying only an Address field.
  - Use input address fields with better data quality. Try to avoid using address fields with a high percentage of null values.
  - Specify a country code with setCountryCode().
  - Specify a minimum matching score with setMinScore().
- Use setMinScore() if you want to limit the records that will be matched in the output based on how well they match the input location. Any records with a score of less than the minimum score will still be included in the output, but their Status will be U.
- Geocoded results can be limited to selected countries using setCountryCode(). If no matching candidate can be found in the selected country, Status will be U. When setCountryCode() is not used or is set with a country that is not available in the locator, the tool geocodes the address using all supported countries in the locator by default. Country codes should use the ISO 3166-1 alpha-3 standard.
- The output points will have the same spatial reference as the address locator.
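To illustrate why the order of address fields matters, the sketch below mimics the concatenation step in plain Python. It is not the tool's implementation: the helper name build_address_string, the sample record, and the single-space separator are assumptions made for illustration only.

```python
def build_address_string(record, field_order):
    """Join the selected fields into one address string, skipping empty values."""
    parts = []
    for name in field_order:
        value = record.get(name)
        if value is None:
            continue
        text = str(value).strip()
        if text:
            parts.append(text)
    # Assumption: fields are separated by a single space.
    return " ".join(parts)


# Hypothetical input record; the column names are illustrative only.
record = {"STREET": "380 New York St", "CITY": "Redlands", "STATE": "CA", "ZIP": "92373"}

# Order that follows the United States convention: street, city, state, zip.
us_order = build_address_string(record, ["STREET", "CITY", "STATE", "ZIP"])

# A different order produces a different string, which may geocode differently.
shuffled = build_address_string(record, ["ZIP", "STATE", "STREET", "CITY"])

print(us_order)
print(shuffled)
```

The two printed strings contain the same tokens in a different order, which is the situation the usage note warns about.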
Limitations
- Geocoding with GeoAnalytics Engine requires a locator file. Using a locator service, such as the ArcGIS World Geocoding Service, is not supported.
- The tool does not currently support automatic or guided rematching of tied records.
- In Spark 3.4.x and above, the schema of the geocode_location field in the Ties nested column can change after being written to GeoParquet or Parquet.
Results
The result of Geocode is a copy of the input DataFrame with new fields added depending on the setOutFields() setter.
The list below explains which fields are returned based on the value of the predefined_set parameter in the setOutFields() setter.
There are four options:
- LocationOnly: Only geocode_location is returned.
- Minimal: geocode_location, Status, Score, Match_addr, and Addr_type are returned. This is the default option.
- MinimalAndUserFields: geocode_location, Status, Score, Match_addr, Addr_type, and any custom output fields available in the locator are returned.
- All: All fields are returned, including any custom fields defined in your locator.
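Because the result keeps every input column, an output field can collide with an input column name, in which case the usage notes say the output field gets a suffix of "1". The sketch below models that rule in plain Python so you can predict the result column names ahead of time. The helper name resolve_output_names is hypothetical, and the exact, case-sensitive name comparison is an assumption; the tool may compare names differently.

```python
def resolve_output_names(input_columns, output_fields):
    """Return output field names, appending "1" to any that clash with an input column."""
    # Assumption: names clash only when they are exactly equal (case-sensitive).
    taken = set(input_columns)
    resolved = []
    for name in output_fields:
        resolved.append(name + "1" if name in taken else name)
    return resolved


# Hypothetical input columns and requested output fields.
input_columns = ["NAME", "Address", "CITY"]
output_fields = ["geocode_location", "Status", "Score", "Address"]

result_names = resolve_output_names(input_columns, output_fields)
print(result_names)
```

Here only Address clashes with an input column, so it is the only output field that is renamed.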
The result fields are detailed in the table below.
Field | Description |
---|---|
Loc | The name of the locator used to return a match result. This field is available only if the locator used for matching the table is a composite locator. |
Status | A code indicating whether or not the address was matched. M means that the address was matched while U means that the address was not matched. T indicates that the address has more than one candidate with the same best match score but at different locations. |
Score | The match score of the candidate to which the address was matched. The score is a number between 0 and 100, in which 100 indicates the candidate is a perfect match. |
Match_addr | The address where the matched location actually resides based on the information of the matched candidate. |
Long | A longer version of Match_addr containing more administrative information. |
Short | A shortened version of Match_addr. |
Addr_type | The geocoded address type, which indicates the level at which the address matched. Supported match levels vary between countries. The table at the bottom of this section describes some possible values. |
Type | The feature type for results returned by a search. The Type field only includes a value for candidates with an address type of POI or Locality. For example, the feature type of Starbucks might be Coffee Shop. |
Place | The formal name of a geocode match candidate (e.g., Paris or Starbucks). |
Place | The full street address of a place, including street, city, and region (e.g., 275 Columbus Ave., New York, New York). |
Phone | The primary phone number of a place. |
URL | The URL of the primary website for a place. |
Rank | A number that indicates the importance of a result relative to other results with the same name. The smaller numbers represent higher-ranked features. Rank values are based on population or feature type. For example, there are cities in France and Texas named Paris. Paris, France, has a greater population than Paris, Texas, so it will have a higher rank. |
Add | The name of a building (e.g., Empire State Building). |
Add | The alphanumeric value that represents the portion of an address typically known as a house number or building number. This value is returned for Point and Street matches only. |
AddNumFrom | A value representing the beginning number of a street address range. It is relative to the direction of feature digitization and is not always the smallest number in the range. This value is provided for Street match results. |
AddNumTo | A value representing the ending number of a street address range. It is relative to the direction of feature digitization and is not always the largest number in the range. This value is provided for Street match results. |
AddRange | The full address number range for the street segment that an address lies on, in the format AddNumFrom-AddNumTo. For example, the AddRange value for the street address 123 Main St. may be 101-199. |
Side | The side of the street where an address resides relative to the direction of feature digitization. This value is not relative to the direction of travel along the street. L indicates that an address is matched to the left side while R means the address is matched to the right side of the street. No value indicates that the address is not matched or the locator could not determine the side of the street. |
St | An address element defining the direction of a street, which occurs before the primary street name (e.g., North in North Main Street). |
St | An address element defining the leading type of a street (e.g., Avenida in Avenida Central or Rue in Rue Lapin). |
St | An address element defining the primary name of a street (e.g., Main in North Main Street). |
St | An address element defining the trailing type of a street (e.g., Street in Main Street). |
St | An address element defining the direction of a street, which occurs after the primary street name (e.g., North in Main Street North). |
St | An address element defining the leading direction of the first street in an intersection. |
St | An address element defining the leading type of the first street in an intersection. |
St | An address element defining the primary name of the first street in an intersection. |
St | An address element defining the trailing direction of the first street in an intersection. |
St | An address element defining the leading direction of the second street in an intersection. |
St | An address element defining the leading type of the second street in an intersection. |
St | An address element defining the primary name of the second street in an intersection. |
St | An address element defining the trailing direction of the second street in an intersection. |
Ties | A nested field containing tied records. |
Bldg | The name or number of a building subunit (e.g., A in Building A). |
Bldg | The classification of a building subunit. Examples include building, hangar, and tower. |
Level | The classification of a floor subunit. Examples include floor, level, and department. |
Level | The name or number of a floor subunit (e.g., 3 in Level 3). |
Unit | The classification of a unit subunit. Examples include unit, apartment, and suite. |
Unit | The name or number of a unit subunit (e.g., 2B in Apartment 2B). |
Sub | The full subunit value for a candidate with an address type of Subaddress. |
St | The street address of a place without a zone, such as city or state (e.g., 275 Columbus Ave). |
Address | The full address of a place (e.g., 2000 MCMILLAN AVE, COMPTON, CA 90220). |
Block | The name of the block-level administrative division for a candidate. A block is the smallest administrative area for a country. It can be described as a subdivision of sector or neighborhood or a named city block. It is not commonly used. |
Sector | The name of the sector-level administrative division for a candidate. A sector is a subdivision of neighborhood, district, or a collection of blocks. It is not commonly used. |
Nbrhd | The name of the neighborhood-level administrative division for a candidate. A neighborhood is a subsection of a city or district. For example, Little Italy is the name of a neighborhood in the city of San Diego, California. |
Neighborhood | The name of the neighborhood-level administrative division for a candidate. It is an alias for the field Nbrhd. |
District | The name of the district-level administrative division for a candidate, such as a subdivision of a city. For example, Wilhelmsburg is a district in the city of Hamburg in Germany. |
City | The name of the city-level administrative division for a candidate. City is a subdivision of a subregion or region. For example, Atlanta is a city within Fulton County in the state of Georgia. |
Metro | The name of the metropolitan area-level administrative division for a candidate. This is usually an urban area consisting of a large city and the smaller cities surrounding it. This can potentially intersect multiple subregions or regions. An example is the Kolkata Metropolitan Area in India. |
Subregion | The name of the subregion-level administrative division for a candidate. Subregion is a subdivision of a region. For example, San Diego County is a subregion of the state of California. |
Region | The name of the region-level administrative division for a candidate. This can be a subdivision of a country or territory. It is typically the largest administrative area for a country (such as state or province) if the Territory administrative division is not used. |
Region | Abbreviated region name. For example, the abbreviated name for California is CA. |
Territory | The name of the territory-level administrative division for a candidate. This is a subdivision of a country and is not commonly used. An example is the Sudeste macroregion of Brazil, which encompasses the states of Espírito Santo, Minas Gerais, Rio de Janeiro, and São Paulo. |
Postal | An alphanumeric address element defining the primary postal code (e.g., V7M 2B4 or 92374). |
Postal | An alphanumeric address element defining the postal code extension (e.g., 8110 in 92373-8110). |
Country | A three-character code for a country that follows the ISO 3166-1 alpha-3 standard. |
Cntry | The full country name for an address candidate. The name may be in the same language as the input address, or in the language specified by the lang parameter. If the full country name is not available in the specified language, the primary language of the country is used (e.g., 日本 for Japan). |
Lang | A three-character language code representing the language of the address. The code should follow the ISO 639-3 standard. |
X | The primary x-coordinate of the matched address in the spatial reference of the locator. |
Y | The primary y-coordinate of the matched address in the spatial reference of the locator. |
Display | The display x-coordinate of an address returned in the spatial reference of the locator. |
Display | The display y-coordinate of an address returned in the spatial reference of the locator. |
Xmin | The minimum x-coordinate of a geocode result. |
Xmax | The maximum x-coordinate of a geocode result. |
Ymin | The minimum y-coordinate of a geocode result. |
Ymax | The maximum y-coordinate of a geocode result. |
Ex | A collection of strings from the input that could not be matched to any part of an address and were used to score or penalize the result. |
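The Status, Score, and Ties fields described above lend themselves to a quick quality summary of a geocoding run. The sketch below works on plain Python dictionaries rather than a Spark DataFrame so that it stays self-contained. The sample rows are invented, and representing Ties as a list of dictionaries is an assumption about the nested field's shape, not a statement of its actual schema.

```python
from collections import Counter


def summarize_status(rows):
    """Count matched (M), tied (T), and unmatched (U) rows and compute the match rate."""
    counts = Counter(row["Status"] for row in rows)
    total = len(rows)
    return {
        "matched": counts["M"],
        "tied": counts["T"],
        "unmatched": counts["U"],
        # Share of rows with a single best match; 0.0 when there are no rows.
        "match_rate": counts["M"] / total if total else 0.0,
    }


# Invented sample results; real rows would come from the tool output.
rows = [
    {"Status": "M", "Score": 97.4, "Ties": None},
    {"Status": "T", "Score": 88.0, "Ties": [{"Score": 88.0}, {"Score": 88.0}]},
    {"Status": "U", "Score": None, "Ties": None},
    {"Status": "M", "Score": 100.0, "Ties": None},
]

summary = summarize_status(rows)

# Tied rows are the ones that need manual review, since rematching is not automatic.
tied_for_review = [row for row in rows if row["Status"] == "T"]

print(summary)
print(len(tied_for_review))
```

A low match rate or a large share of tied rows is a signal to revisit the address fields, country code, or minimum score.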
The table below outlines the possible values for Addr_type:
Value | Description |
---|---|
Subaddress | A street address based on points that represent house and building subaddress locations. Typically, this is the most spatially accurate match level. The subaddress elements of unit type and unit identifier help to distinguish one subaddress within or between structures from another when several occur within the same location. Reference data contains address points or polygons with associated house numbers, street names, and subaddress elements, along with administrative divisions and optional postal code. An example is 3836 Emerald Ave., Suite C, La Verne, CA 91750. |
Point | A street address based on points that represent house and building locations. Reference data contains address points with associated house numbers and street names, along with administrative divisions and optional postal code. The X, Y, and geometry output values for a Point match represent the street entry location for the address; this is the location used for routing operations. The display x- and y-coordinate values represent the rooftop or actual location of the address. An example is 380 New York St., Redlands, CA 92373. |
Parcel | A plot of land that is considered real property and may include one or more homes or other structures. A parcel typically has an address and parcel identification number assigned to it, such as 17 011100120063. |
Street | A street address that differs from Point because the house number is interpolated from a range of numbers. Reference data contains street centerlines with house number ranges, along with administrative divisions and optional postal code information. An example is 647 Haight St., San Francisco, CA 94117. |
Street | A street address consisting of a street intersection along with city and optional state and postal code information. An example is Redlands Blvd. & New York St., Redlands, CA 92373. |
Street | An estimated street address match that is returned when parameter matchOutOfRange=true and the input house number exceeds the house number range for the matched street segment. |
POI | Points of interest. Reference data consists of administrative division, place-names, businesses, landmarks, and geographic features. An example is Starbucks. |
Distance | A street address that represents the linear distance along a street, typically in kilometers or miles, from a designated origin location. An example is Carr 682 KM 4, Barceloneta, 00617. |
Street | The estimated midpoint of a range of house numbers along a street segment that correspond to a city block. An example is 100 Block of Grant Ave, Millville, New Jersey. The location returned for a Street match is more precise than that of a Street match, but less precise than a Street match. This is currently only functional for the United States. |
Street | Similar to a street address but without the house number. Reference data contains street centerlines with associated street names (no numbered address ranges), along with administrative divisions and optional postal code. An example is W Olive Ave., Redlands, CA 92373. |
Postal | A postal code with an additional extension (e.g., 90210-3841). Reference data is postal code points with extensions. |
Postal | Postal code (e.g., 90210). Reference data is postal code points. |
Postal | A combination of postal code and city name. Reference data is typically a union of postal boundaries and administrative (locality) boundaries. An example is 7132 Frauenkirchen. |
Locality | A place-name representing a populated place. The Type output field provides more detailed information about the type of populated place. Possible Type values for Locality matches include Block, Sector, Neighborhood, District, City, MetroArea, County, State or Province, Territory, Country, and Zone. |
Feature | A geocoding result returned by a locator created with the Create Feature Locator tool in ArcGIS Pro. |
Lat | An x,y coordinate pair. The Lat address type is returned when an x,y coordinate pair such as 117.155579,32.703761 is the input. |
X | A match based on the assumption that the first coordinate of the input is longitude and the second is latitude. |
Y | A match based on the assumption that the first coordinate of the input is latitude and the second is longitude. |
MGRS | A Military Grid Reference System (MGRS) location, such as 46VFM5319397841. |
USNG | A United States National Grid (USNG) location, such as 15TXN29753883. |
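Since the address type indicates the level at which an address matched, it can be used to separate precise matches from coarse ones. The sketch below counts results per address type and flags coarse matches for review, again using plain dictionaries instead of a DataFrame. The sample rows and the choice of which types count as coarse (COARSE_TYPES) are assumptions for illustration; pick the set that suits your data.

```python
from collections import Counter

# Assumption: these match levels are too coarse for the use case at hand.
COARSE_TYPES = {"Locality", "Postal"}

# Invented sample results; unmatched rows have no address type.
rows = [
    {"NAME": "A", "Addr_type": "Subaddress"},
    {"NAME": "B", "Addr_type": "Locality"},
    {"NAME": "C", "Addr_type": "POI"},
    {"NAME": "D", "Addr_type": "Locality"},
    {"NAME": "E", "Addr_type": None},
]

# Count how many results matched at each level, ignoring unmatched rows.
type_counts = Counter(row["Addr_type"] for row in rows if row["Addr_type"] is not None)

# Collect the records that only matched at a coarse level.
needs_review = [row["NAME"] for row in rows if row["Addr_type"] in COARSE_TYPES]

print(dict(type_counts))
print(needs_review)
```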
How Geocode works
See the geocoding core concept topic for more info on the geocoding process.
Performance notes
To improve performance, limit the number of output fields returned in the tool output. For example, returning only the Minimal set of output fields should take less time to complete than returning All output fields, especially when there are tied records in the output.
Syntax
For more details, go to the GeoAnalytics Engine API reference for geocode.
Setter | Description | Required |
---|---|---|
run(dataframe) | Runs the Geocode tool using the provided DataFrame. | Yes |
setAddressFields() | Sets one or more address fields from the input DataFrame. | Yes |
setLocator() | Sets the address locator that will be used to geocode the addresses. | Yes |
setCountryCode() | Sets the country or countries to search for the geocoded addresses in. | No |
setMinScore() | Sets the minimum score of the records that will be matched in the output. The value should be greater than 0 and less than 100. Records with a score less than the minimum score will still be included in the output with a Status of U. | No |
setOutFields() | Sets the fields that will be included in the output DataFrame. The predefined_set parameter accepts four options: 'LocationOnly', 'Minimal' (default), 'MinimalAndUserFields', and 'All'. | No |
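The constraints stated in the table can be checked before a long-running job is submitted. The following sketch is a hypothetical pre-flight check, not part of the GeoAnalytics Engine API: the function name check_geocode_settings is invented, and the country code test only verifies the three-uppercase-letter shape of an ISO 3166-1 alpha-3 code, not whether the code is assigned or present in the locator.

```python
import re


def check_geocode_settings(min_score, country_codes):
    """Return a list of problems found in the proposed settings (empty if none)."""
    problems = []
    # The minimum score should be greater than 0 and less than 100.
    if not (0 < min_score < 100):
        problems.append(f"min score {min_score} is outside the range (0, 100)")
    for code in country_codes:
        # Shape check only: three uppercase ASCII letters.
        if not re.fullmatch(r"[A-Z]{3}", code):
            problems.append(f"country code {code!r} is not three uppercase letters")
    return problems


ok = check_geocode_settings(80, ["USA"])
bad = check_geocode_settings(100, ["US", "CAN"])

print(ok)
print(bad)
```

An empty list means the settings pass these basic checks; anything else is worth fixing before calling run().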
Examples
Run Geocode
# Log in
import geoanalytics
geoanalytics.auth(username="myusername", password="mypassword")
# Imports
from geoanalytics.tools import Geocode
from geoanalytics.sql import functions as ST
# URL to the public schools data
data_url = r"https://services1.arcgis.com/Ua5sjt3LWTPigjyD/arcgis/rest/services/" \
"Public_School_Location_201819/FeatureServer/0"
# Create a public schools DataFrame
df = spark.read.format("feature-service").load(data_url) \
.withColumn("shape", ST.transform("shape", 6423))\
.where("STATE='CA'")\
.select("NAME","STREET","CITY","STATE","ZIP","shape")
# Access the locator
# This needs to be accessible to the machine that is running the Geocoding tool.
# If running on a cluster, it needs to be accessible to all nodes in the cluster.
north_america_locator = r"/data/NA_locator.loc"
# Use Geocode to convert the public school addresses into actual locations
result = Geocode() \
.setLocator(north_america_locator) \
.setAddressFields("NAME", "STREET", "CITY", "STATE", "ZIP") \
.setMinScore(80)\
.setOutFields("all") \
.setCountryCode("USA")\
.run(df)
# Show a selection of columns for the first 5 outputs
result.select("NAME", "STREET", "Score", "Status", "ZIP", "geocode_location").show(5)
+--------------------+--------------------+-----------------+------+-----+--------------------+
| NAME| STREET| Score|Status| ZIP| geocode_location|
+--------------------+--------------------+-----------------+------+-----+--------------------+
| Vasquez High|33630 Red Rover M...|97.38095238095238| M|93510|{"x":-118.2164715...|
|Meadowlark Elemen...|3015 W. Sacrament...|96.17885365609683| M|93510|{"x":-118.1857999...|
| High Desert|3620 Antelope Woo...| 97.5609756097561| M|93510|{"x":-118.1957035...|
|California School...| 500 Walnut Ave.|94.90566037735849| M|94536|{"x":-121.962924,...|
|California School...| 39350 Gallaudet Dr.|95.23809523809523| M|94538|{"x":-121.962924,...|
+--------------------+--------------------+-----------------+------+-----+--------------------+
only showing top 5 rows
Plot results
# Plot the geocoded results
result_plot = result.st.plot(cmap_values="Score",
legend=True,
cmap="Wistia",
figsize=(16,8),
basemap="light")
result_plot.set_title("Geocoded locations for schools in California")
result_plot.set_xlabel("X (Meters)")
result_plot.set_ylabel("Y (Meters)");
Version table
Release | Notes |
---|---|
1.3.0 | Tool introduced |