Geocode

Geocode converts addresses into geographic coordinates. This process requires a Spark DataFrame containing the addresses you want to geocode and a locator. The tool matches the addresses against reference data in a locator and returns points representing the address locations along with other output columns.

Figure: Geocode workflow

Usage notes

  • You must set at least one address field using setAddressFields(). When multiple fields are passed into the setter, the tool concatenates them into a single string field and uses that as the address field in the tool execution.

    The order of the fields passed to setAddressFields() matters because different orders can return different results. For the best results, order the address fields the way a standardized address is written in that region. For example, addresses in the United States should follow the format {house number} {street} {city} {state} {zip}.

  • The fields from the input DataFrame will always be included in the output.

    The result fields in the output DataFrame are determined by the predefined_set parameter in the setOutFields() setter.

    The following options are supported:

    • LocationOnly—Only returns a field called geocode_location which contains points representing address locations.
    • Minimal—Returns the geocode_location, Status, Score, Match_addr, and Addr_type fields. This is the default option.
    • MinimalAndUserFields—Returns the fields defined in Minimal and the custom output fields available in the locator. User defined fields can be configured during the process of creating a locator in ArcGIS Pro. For more information about locators, read the geocoding core concept.
    • All—Returns all available output fields including any custom output fields defined in your locator.
  • When an input DataFrame contains a field that has the same name as one of the output fields in the geocoded result, the output field will be automatically renamed with a suffix of "1". For example, if a field named Address already exists in the input DataFrame, the result DataFrame will have a field named Address from the input, and Address1 representing the output field.

  • The count of the input records will be equal to the count of the output records. One result will be returned for each input record:

    • When matched, the Status field will be M and the rest of the output fields will populate with the matching candidate's info. The Ties field will return null.
    • When unmatched, the Status field will be U and the rest of the output fields will be null.
    • When tied, the Status field will be T, meaning there is more than one location with the same best match score. One of the tied records will be returned and populate the output fields. All tied records will be stored in a new field called Ties.
  • The maximum number of tied records returned for each address is determined by the MaxCandidates value stored in the locator. For example, if MaxCandidates is 50, then the maximum number of tied geocoded results that can be returned for a single input address is 50. If all or most of the input addresses are returned with the maximum number of tied outputs, try the following to narrow down the search criteria:

    • Specify more input address fields in .setAddressFields(). For example, specifying Address and City fields will likely yield a better result than specifying only an Address field.
    • Use input address fields with better data quality. Avoid address fields with a high percentage of null values.
    • Specify a country code with .setCountryCode().
    • Specify a minimum matching score with .setMinScore().
  • Use setMinScore() if you want to limit the records that will be matched in the output based on how well they match the input location. Any records with a score of less than the minimum score will still be included in the output but their Status will be U.

  • Geocoded results can be limited to selected countries using setCountryCode(). If no matching candidate can be found in the selected country, Status will be U. When setCountryCode() is not used or is set with a country that is not available in the locator, the tool geocodes the address using all supported countries in the locator by default. Country codes should use the ISO 3166-1 alpha-3 standard.

  • The output points will have the same spatial reference as the address locator.
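The status rules in the notes above can be sketched in plain Python. This is only an illustration of the documented behavior, not the tool's implementation: the function name classify_candidates, the candidate dictionaries, and their "score" and "location" keys are hypothetical names introduced for this example.

```python
def classify_candidates(candidates, min_score=0, max_candidates=50):
    """Return (status, best, ties) following the documented M/U/T rules.

    candidates: list of dicts with the hypothetical keys "score" (number)
    and "location" (a hashable value such as an (x, y) tuple).
    """
    # Candidates scoring below the minimum score are treated as unmatched.
    eligible = [c for c in candidates if c["score"] >= min_score]
    if not eligible:
        return "U", None, None

    best_score = max(c["score"] for c in eligible)
    top = [c for c in eligible if c["score"] == best_score]

    # A tie means more than one location shares the best match score.
    if len({c["location"] for c in top}) > 1:
        ties = top[:max_candidates]  # capped like MaxCandidates in the locator
        return "T", ties[0], ties

    # A single best location is a match; Ties stays null.
    return "M", top[0], None
```

For example, two candidates that both score 95 at different locations would be classified as T with both stored as ties, while raising min_score above 95 would turn the same input into U.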

Limitations

  • Geocoding with GeoAnalytics Engine requires a locator file. Using a locator service, such as the ArcGIS World Geocoding Service, is not supported.

  • The tool does not currently support automatic or guided rematching of tied records.

  • In Spark 3.4.x and above, the schema of the geocode_location in the Ties nested column can change after being written to Geoparquet or Parquet.

Results

The result of Geocode is a copy of the input DataFrame with new fields added depending on the setOutFields() setter. The list below explains which fields are returned for each value of the predefined_set parameter in the setOutFields() setter. There are four options:

  • LocationOnly: geocode_location is returned.
  • Minimal: geocode_location, Status, Score, Match_addr, and Addr_type are returned. This is the default option.
  • MinimalAndUserFields: geocode_location, Status, Score, Match_addr, Addr_type, and any custom output fields available in the locator are returned.
  • All: All fields are returned, including any custom fields defined in your locator.
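As a rough sketch of how the options above combine with the input columns, the following plain Python helper lists the column names you might expect in the result. The helper name expected_output_columns and its user_fields argument are hypothetical. The column ordering and the exact-name comparison used to detect collisions are assumptions; only the field sets and the "1" suffix come from this page. The All option is left out because its field list depends on the locator.

```python
MINIMAL_FIELDS = ["geocode_location", "Status", "Score", "Match_addr", "Addr_type"]


def expected_output_columns(input_columns, predefined_set="Minimal", user_fields=()):
    """Sketch of the result column names for a given predefined_set."""
    if predefined_set == "LocationOnly":
        result_fields = ["geocode_location"]
    elif predefined_set == "Minimal":
        result_fields = list(MINIMAL_FIELDS)
    elif predefined_set == "MinimalAndUserFields":
        # user_fields stands in for the custom output fields of the locator.
        result_fields = list(MINIMAL_FIELDS) + list(user_fields)
    else:
        raise ValueError(f"Field list for {predefined_set!r} is not covered by this sketch")

    columns = list(input_columns)  # input fields are always kept
    for field in result_fields:
        # An output field that clashes with an existing column gets a "1" suffix.
        columns.append(field + "1" if field in columns else field)
    return columns
```

With an input DataFrame that already has a Score column, the Minimal set would therefore produce both Score (from the input) and Score1 (from the geocoded result).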

The result fields are detailed in the table below.

| Field | Description |
|---|---|
| Loc_name | The name of the locator used to return a match result. This field is available only if the locator used for matching the table is a composite locator. |
| Status | A code indicating whether or not the address was matched. M means that the address was matched while U means that the address was not matched. T indicates that the address has more than one candidate with the same best match score but at different locations. |
| Score | The match score of the candidate to which the address was matched. The score is a number between 0 and 100, where 100 indicates the candidate is a perfect match. |
| Match_addr | The address where the matched location actually resides based on the information of the matched candidate. |
| LongLabel | A longer version of Match_addr containing more administrative information. |
| ShortLabel | A shortened version of Match_addr. |
| Addr_type | The geocoded address type, which indicates the level at which the address matched. Supported match levels vary between countries. The table at the bottom of this section describes some possible values. |
| Type | The feature type for results returned by a search. The Type field only includes a value for candidates with an address type of POI or Locality. For example, the feature type of Starbucks might be Coffee Shop. |
| PlaceName | The formal name of a geocode match candidate (e.g., Paris or Starbucks). |
| Place_addr | The full street address of a place, including street, city, and region (e.g., 275 Columbus Ave., New York, New York). |
| Phone | The primary phone number of a place. |
| URL | The URL of the primary website for a place. |
| Rank | A number that indicates the importance of a result relative to other results with the same name. The smaller numbers represent higher-ranked features. Rank values are based on population or feature type. For example, there are cities in France and Texas named Paris. Paris, France, has a greater population than Paris, Texas, so it will have a higher rank. |
| AddBldg | The name of a building (e.g., Empire State Building). |
| AddNum | The alphanumeric value that represents the portion of an address typically known as a house number or building number. This value is returned for PointAddress and StreetAddress matches only. |
| AddNumFrom | A value representing the beginning number of a street address range. It is relative to direction of feature digitization and is not always the smallest number in the range. This value is provided for StreetAddress match results. |
| AddNumTo | A value representing the ending number of a street address range. It is relative to direction of feature digitization and is not always the largest number in the range. This value is provided for StreetAddress match results. |
| AddRange | The full address number range for the street segment that an address lies on, in the format AddNumFrom-AddNumTo. For example, the AddRange value for the street address 123 Main St. may be 101-199. |
| Side | The side of the street where an address resides relative to the direction of feature digitization. This value is not relative to the direction of travel along the street. L indicates that an address is matched to the left side while R means the address is matched to the right side of the street. No value indicates that the address is not matched or the locator could not determine the side of the street. |
| StPreDir | An address element defining the direction of a street, which occurs before the primary street name (e.g., North in North Main Street). |
| StPreType | An address element defining the leading type of a street (e.g., Avenida in Avenida Central or Rue in Rue Lapin). |
| StName | An address element defining the primary name of a street (e.g., Main in North Main Street). |
| StType | An address element defining the trailing type of a street (e.g., Street in Main Street). |
| StDir | An address element defining the direction of a street, which occurs after the primary street name (e.g., North in Main Street North). |
| StPreDir1 | An address element defining the leading direction of the first street in an intersection. |
| StPreType1 | An address element defining the leading type of the first street in an intersection. |
| StName1 | An address element defining the primary name of the first street in an intersection. |
| StDir1 | An address element defining the trailing direction of the first street in an intersection. |
| StPreDir2 | An address element defining the leading direction of the second street in an intersection. |
| StPreType2 | An address element defining the leading type of the second street in an intersection. |
| StName2 | An address element defining the primary name of the second street in an intersection. |
| StDir2 | An address element defining the trailing direction of the second street in an intersection. |
| Ties | A nested field containing tied records. |
| BldgName | The name or number of a building subunit (e.g., A in Building A). |
| BldgType | The classification of a building subunit. Examples include building, hangar, and tower. |
| LevelType | The classification of a floor subunit. Examples include floor, level, and department. |
| LevelName | The name or number of a floor subunit (e.g., 3 in Level 3). |
| UnitType | The classification of a unit subunit. Examples include unit, apartment, and suite. |
| UnitName | The name or number of a unit subunit (e.g., 2B in Apartment 2B). |
| SubAddr | The full subunit value for a candidate with an address type of Subaddress. |
| StAddr | The street address of a place without a zone, such as city or state (e.g., 275 Columbus Ave). |
| Address | The full address of a place (e.g., 2000 MCMILLAN AVE, COMPTON, CA 90220). |
| Block | The name of the block-level administrative division for a candidate. A block is the smallest administrative area for a country. It can be described as a subdivision of sector or neighborhood or a named city block. It is not commonly used. |
| Sector | The name of the sector-level administrative division for a candidate. A sector is a subdivision of neighborhood, district, or a collection of blocks. It is not commonly used. |
| Nbrhd | The name of the neighborhood-level administrative division for a candidate. A neighborhood is a subsection of a city or district. For example, Little Italy is the name of a neighborhood in the city of San Diego, California. |
| Neighborhood | The name of the neighborhood-level administrative division for a candidate. It is an alias for the field Nbrhd. |
| District | The name of the district-level administrative division for a candidate, such as a subdivision of a city. For example, Wilhelmsburg is a district in the city of Hamburg in Germany. |
| City | The name of the city-level administrative division for a candidate. City is a subdivision of a subregion or region. For example, Atlanta is a city within Fulton County in the state of Georgia. |
| MetroArea | The name of the metropolitan area-level administrative division for a candidate. This is usually an urban area consisting of a large city and the smaller cities surrounding it. This can potentially intersect multiple subregions or regions. An example is the Kolkata Metropolitan Area in India. |
| Subregion | The name of the subregion-level administrative division for a candidate. Subregion is a subdivision of a region. For example, San Diego County is a subregion of the state of California. |
| Region | The name of the region-level administrative division for a candidate. This can be a subdivision of a country or territory. It is typically the largest administrative area for a country (such as state or province) if the Territory administrative division is not used. |
| RegionAbbr | Abbreviated region name. For example, the abbreviated name for California is CA. |
| Territory | The name of the territory-level administrative division for a candidate. This is a subdivision of a country and is not commonly used. An example is the Sudeste macroregion of Brazil, which encompasses the states of Espírito Santo, Minas Gerais, Rio de Janeiro, and São Paulo. |
| Postal | An alphanumeric address element defining the primary postal code (e.g., V7M 2B4 or 92374). |
| PostalExt | An alphanumeric address element defining the postal code extension (e.g., 8110 in 92373-8110). |
| Country | A three-character code for a country that follows the ISO 3166-1 alpha-3 standard. |
| CntryName | The full country name for an address candidate. The name may be in the same language as the input address, or in the language specified by the langCode parameter. If the full country name is not available in the specified language, the primary language of the country is used (e.g., 日本 for Japan). |
| LangCode | A three-character language code representing the language of the address. The code should follow the ISO 639-3 standard. |
| X | The primary x-coordinate of the matched address in the spatial reference of the locator. |
| Y | The primary y-coordinate of the matched address in the spatial reference of the locator. |
| DisplayX | The display x-coordinate of an address returned in the spatial reference of the locator. |
| DisplayY | The display y-coordinate of an address returned in the spatial reference of the locator. |
| Xmin | The minimum x-coordinate of a geocode result. |
| Xmax | The maximum x-coordinate of a geocode result. |
| Ymin | The minimum y-coordinate of a geocode result. |
| Ymax | The maximum y-coordinate of a geocode result. |
| ExInfo | A collection of strings from the input that could not be matched to any part of an address and were used to score or penalize the result. |

The table below outlines the possible values for Addr_type:

| Value | Description |
|---|---|
| Subaddress | A street address based on points that represent house and building subaddress locations. Typically, this is the most spatially accurate match level. The subaddress elements of unit type and unit identifier help to distinguish one subaddress within or between structures from another when several occur within the same location. Reference data contains address points or polygons with associated house numbers, street names, and subaddress elements, along with administrative divisions and optional postal code. An example is 3836 Emerald Ave., Suite C, La Verne, CA 91750. |
| PointAddress | A street address based on points that represent house and building locations. Reference data contains address points with associated house numbers and street names, along with administrative divisions and optional postal code. The X and Y and geometry output values for a PointAddress match represent the street entry location for the address; this is the location used for routing operations. The DisplayX and DisplayY values represent the rooftop or actual location of the address. An example is 380 New York St., Redlands, CA 92373. |
| Parcel | A plot of land that is considered real property and may include one or more homes or other structures. A parcel typically has an address and parcel identification number assigned to it, such as 17 011100120063. |
| StreetAddress | A street address that differs from PointAddress because the house number is interpolated from a range of numbers. Reference data contains street centerlines with house number ranges, along with administrative divisions and optional postal code information. An example is 647 Haight St., San Francisco, CA 94117. |
| StreetInt | A street address consisting of a street intersection along with city and optional state and postal code information. An example is Redlands Blvd. & New York St., Redlands, CA 92373. |
| StreetAddressExt | An estimated street address match that is returned when the parameter matchOutOfRange=true and the input house number exceeds the house number range for the matched street segment. |
| POI | Points of interest. Reference data consists of administrative division, place-names, businesses, landmarks, and geographic features. An example is Starbucks. |
| DistanceMarker | A street address that represents the linear distance along a street, typically in kilometers or miles, from a designated origin location. An example is Carr 682 KM 4, Barceloneta, 00617. |
| StreetMidBlock | The estimated midpoint of a range of house numbers along a street segment that correspond to a city block. An example is 100 Block of Grant Ave, Millville, New Jersey. The location returned for a StreetMidBlock match is more precise than that of a StreetName match, but less precise than a StreetAddress match. This is currently only functional for the United States. |
| StreetName | Similar to a street address but without the house number. Reference data contains street centerlines with associated street names (no numbered address ranges), along with administrative divisions and optional postal code. An example is W Olive Ave., Redlands, CA 92373. |
| PostalExt | A postal code with an additional extension (e.g., 90210-3841). Reference data is postal code points with extensions. |
| Postal | Postal code (e.g., 90210). Reference data is postal code points. |
| PostalLoc | A combination of postal code and city name. Reference data is typically a union of postal boundaries and administrative (locality) boundaries. An example is 7132 Frauenkirchen. |
| Locality | A place-name representing a populated place. The Type output field provides more detailed information about the type of populated place. Possible Type values for Locality matches include Block, Sector, Neighborhood, District, City, MetroArea, County, State or Province, Territory, Country, and Zone. |
| Feature | A geocoding result returned by a locator created with the Create Feature Locator tool in ArcGIS Pro. |
| LatLong | An x,y coordinate pair. The LatLong address type is returned when an x,y coordinate pair such as 117.155579,32.703761 is the input. |
| XY-XY | A match based on the assumption that the first coordinate of the input is longitude and the second is latitude. |
| YX-YX | A match based on the assumption that the first coordinate of the input is latitude and the second is longitude. |
| MGRS | A Military Grid Reference System (MGRS) location, such as 46VFM5319397841. |
| USNG | A United States National Grid (USNG) location, such as 15TXN29753883. |

How Geocode works

See the geocoding core concept topic for more info on the geocoding process.

Performance notes

To improve performance, limit the number of output fields returned in the tool output. For example, returning only the minimal set of output fields should take less time to complete than returning all output fields, especially when there are tied records in the output.

Syntax

For more details, go to the GeoAnalytics Engine API reference for geocode.

| Setter | Description | Required |
|---|---|---|
| run(dataframe) | Runs the Geocode tool using the provided DataFrame. | Yes |
| setAddressFields(*address_fields) | Sets one or more address fields from the input DataFrame. | Yes |
| setLocator(path) | Sets the address locator that will be used to geocode the addresses. | Yes |
| setCountryCode(country_code) | Sets the country or countries to search for the geocoded addresses in. | No |
| setMinScore(min_score) | Sets the minimum score of the records that will be matched in the output. The value should be greater than 0 and less than 100. Records with a score less than the minimum score will still be included in the output with a Status of U. | No |
| setOutFields(predefined_set) | Sets the fields that will be included in the output DataFrame. The predefined_set parameter can accept four options: 'LocationOnly', 'Minimal' (default), 'MinimalAndUserFields', and 'All'. | No |
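To illustrate why the order of the arguments to setAddressFields() matters, the sketch below joins several fields of one record into a single address string, in the order given. It is a plain Python approximation, not the tool's code: the helper build_address_string is hypothetical, and both the space separator and the skipping of missing values are assumptions, since this page only states that the fields are concatenated into a single string.

```python
def build_address_string(row, *address_fields, separator=" "):
    """Join the named fields of a record, in the order given."""
    if not address_fields:
        raise ValueError("At least one address field is required")
    # Skip fields that are missing or empty (an assumption of this sketch).
    parts = [
        str(row[name]).strip()
        for name in address_fields
        if row.get(name) not in (None, "")
    ]
    return separator.join(parts)


# Record based on the PointAddress example address from this page.
row = {"STREET": "380 New York St.", "CITY": "Redlands", "STATE": "CA", "ZIP": "92373"}

us_order = build_address_string(row, "STREET", "CITY", "STATE", "ZIP")
reversed_order = build_address_string(row, "ZIP", "STATE", "CITY", "STREET")
```

Here us_order follows the {house number} {street} {city} {state} {zip} pattern recommended for the United States, while reversed_order produces a string a locator is less likely to interpret well.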

Examples

Run Geocode

Python
# Log in
import geoanalytics
geoanalytics.auth(username="myusername", password="mypassword")

# Imports
from geoanalytics.tools import Geocode
from geoanalytics.sql import functions as ST

# URL to the public schools data
data_url = r"https://services1.arcgis.com/Ua5sjt3LWTPigjyD/arcgis/rest/services/" \
    "Public_School_Location_201819/FeatureServer/0"

# Create a public schools DataFrame
df = spark.read.format("feature-service").load(data_url) \
                    .withColumn("shape", ST.transform("shape", 6423))\
                    .where("STATE='CA'")\
                    .select("NAME","STREET","CITY","STATE","ZIP","shape")

# Access the locator
# This needs to be accessible to the machine that is running the Geocoding tool.
# If running on a cluster, it needs to be accessible to all nodes in the cluster.
north_america_locator = r"/data/NA_locator.loc"

# Use Geocode to convert the public school addresses into actual locations
result = Geocode() \
            .setLocator(north_america_locator) \
            .setAddressFields("NAME", "STREET", "CITY", "STATE", "ZIP") \
            .setMinScore(80)\
            .setOutFields("All") \
            .setCountryCode("USA")\
            .run(df)

# Show a selection of columns for the first 5 outputs
result.select("NAME", "STREET", "Score", "Status", "ZIP", "geocode_location").show(5)
Result
+--------------------+--------------------+-----------------+------+-----+--------------------+
|                NAME|              STREET|            Score|Status|  ZIP|    geocode_location|
+--------------------+--------------------+-----------------+------+-----+--------------------+
|        Vasquez High|33630 Red Rover M...|97.38095238095238|     M|93510|{"x":-118.2164715...|
|Meadowlark Elemen...|3015 W. Sacrament...|96.17885365609683|     M|93510|{"x":-118.1857999...|
|         High Desert|3620 Antelope Woo...| 97.5609756097561|     M|93510|{"x":-118.1957035...|
|California School...|     500 Walnut Ave.|94.90566037735849|     M|94536|{"x":-121.962924,...|
|California School...| 39350 Gallaudet Dr.|95.23809523809523|     M|94538|{"x":-121.962924,...|
+--------------------+--------------------+-----------------+------+-----+--------------------+
only showing top 5 rows
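A quick way to judge a geocoding run is to count records per Status and look at the lowest matched score. The sketch below does this in plain Python on the Score and Status values of the five rows shown above; the helper name summarize_statuses is hypothetical and is not part of GeoAnalytics Engine.

```python
from collections import Counter

# Score and Status values copied from the five result rows shown above.
rows = [
    {"Score": 97.38095238095238, "Status": "M"},
    {"Score": 96.17885365609683, "Status": "M"},
    {"Score": 97.5609756097561, "Status": "M"},
    {"Score": 94.90566037735849, "Status": "M"},
    {"Score": 95.23809523809523, "Status": "M"},
]


def summarize_statuses(records):
    """Count records per Status and report the match rate and lowest matched score."""
    counts = Counter(r["Status"] for r in records)
    matched_scores = [r["Score"] for r in records if r["Status"] == "M"]
    return {
        "counts": dict(counts),
        "match_rate": counts["M"] / len(records) if records else 0.0,
        "min_matched_score": min(matched_scores) if matched_scores else None,
    }


summary = summarize_statuses(rows)
```

On the full Spark DataFrame, a comparable per-status count can be obtained with result.groupBy("Status").count().show().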

Plot results

Python
# Plot the geocoded results
result_plot = result.st.plot(cmap_values="Score",
                             legend=True,
                             cmap="Wistia",
                             figsize=(16,8),
                             basemap="light")
result_plot.set_title("Geocoded locations for schools in California")
result_plot.set_xlabel("X (Meters)")
result_plot.set_ylabel("Y (Meters)");
Figure: Plotting example for a Geocode result.

Version table

| Release | Notes |
|---|---|
| 1.3.0 | Python tool introduced |
| 1.5.0 | Added support for loading the locator using SparkContext.addFile. |
