As in Part 1, we are looking at the example of a large retailer evaluating potential sites for a new location. This retailer is interested in using key criteria they know are important based on previous experience to evaluate a few candidates. These criteria include competition, traffic, economic feasibility and market potential for the areas surrounding the potential sites. Utilizing the GeoEnrichment module, the real estate site selection team can include demographic variables such as lifestyle, income, spending and education to understand potential customers in the study areas surrounding the candidate sites.
We will follow a similar route here, but in this example all we have to start with are addresses.
from arcgis.geoenrichment import Country, enrich
from arcgis.gis import GIS
gis = GIS(profile="your_online_profile")
country = Country("usa", gis=gis)
country
<Country - United States (GIS @ https://geosaurus.maps.arcgis.com version:10.3)>
import pandas as pd
candidate_df = pd.read_csv("../data/health.csv").loc[
    :, ["Name", "Address", "City", "State", "Zip Code"]
]
candidate_df
 | Name | Address | City | State | Zip Code |
---|---|---|---|---|---|
0 | Facility 1 | 2468 SOUTH ST ANDREWS PLACE | LOS ANGELES | CA | 90018 |
1 | Facility 2 | 2300 W. WASHINGTON BLVD. | LOS ANGELES | CA | 90018 |
2 | Facility 3 | 4060 E. WHITTIER BLVD. | LOS ANGELES | CA | 90023 |
3 | Facility 4 | 6070 W. PICO BOULEVARD | LOS ANGELES | CA | 90035 |
4 | Facility 5 | 1480 S. LA CIENEGA BL | LOS ANGELES | CA | 90035 |
Next, we are going to concatenate the address into one column and rename the Name column to loc_id to match more closely with the example from the GeoEnrichment Part 1 notebook.
# create full address string to make geocoding easier
candidate_df["full_address"] = candidate_df.apply(
    lambda r: ", ".join((r["Address"], r["City"], r["State"])) + f' {r["Zip Code"]}',
    axis=1,
)
# filter columns
candidate_df = candidate_df.loc[:, ["Name", "full_address"]].rename(
    columns={"Name": "loc_id"}
)
candidate_df
 | loc_id | full_address |
---|---|---|
0 | Facility 1 | 2468 SOUTH ST ANDREWS PLACE, LOS ANGELES, CA 9... |
1 | Facility 2 | 2300 W. WASHINGTON BLVD., LOS ANGELES, CA 90018 |
2 | Facility 3 | 4060 E. WHITTIER BLVD., LOS ANGELES, CA 90023 |
3 | Facility 4 | 6070 W. PICO BOULEVARD, LOS ANGELES, CA 90035 |
4 | Facility 5 | 1480 S. LA CIENEGA BL, LOS ANGELES, CA 90035 |
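One practical wrinkle when building address strings: if the ZIP column is parsed as an integer, leading zeros are lost (a code such as 02134 becomes 2134). The sketch below uses a hypothetical helper, format_address, that is not part of the original notebook. It assumes five-digit US ZIP codes and pads them before joining the parts.

```python
def format_address(address, city, state, zip_code):
    """Join address parts into a single geocoding string.

    Hypothetical helper; assumes five-digit US ZIP codes.
    """
    # strip stray whitespace from the text parts and drop empty ones
    parts = [str(p).strip() for p in (address, city, state) if str(p).strip()]
    # pad ZIP codes that lost leading zeros when parsed as integers
    zip_text = str(zip_code).strip().zfill(5)
    return f"{', '.join(parts)} {zip_text}"


print(format_address("2300 W. WASHINGTON BLVD.", "LOS ANGELES", "CA", 90018))
```

A function like this could be applied row by row with DataFrame.apply. Reading the column as text in the first place, for example with the dtype argument of pd.read_csv, is another way to avoid the problem.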
Enrich Variables
We are going to use the same variables for enrichment as in Part 1.
analysis_variables = [
    "TOTPOP_CY",  # Population: Total Population (Esri)
    "DIVINDX_CY",  # Diversity Index (Esri)
    "AVGHHSZ_CY",  # Average Household Size (Esri)
    "MEDAGE_CY",  # Age: Median Age (Esri)
    "MEDHINC_CY",  # Income: Median Household Income (Esri)
    "BACHDEG_CY",  # Education: Bachelor's Degree (Esri)
]
analysis_variables
['TOTPOP_CY', 'DIVINDX_CY', 'AVGHHSZ_CY', 'MEDAGE_CY', 'MEDHINC_CY', 'BACHDEG_CY']
Define Study Areas
The enrich capability in Business Analyst requires polygon areas for apportioning demographic data to the input geographies. In this case, the inputs are addresses defining store locations. Geocoding can give us the location of each store, but the enrich method still requires areas to apportion demographic data.
First, we can use geocoding to get the geographic location of all the stores. Since we are using a Pandas DataFrame, we can take advantage of data manipulation and schema pruning in two ways: first, by concatenating the components of the address into a single column for geocoding, as done above; second, by reducing the geocoding response to just the columns we need for subsequent analysis steps.
from arcgis.features import GeoAccessor
from arcgis.geocoding import get_geocoders
# ensure using intended geocoder
agol_geocoder = get_geocoders(gis)[0]
# geocode the addresses and prune the returned columns
geocode_df = GeoAccessor.from_df(
    candidate_df, address_column="full_address", geocoder=agol_geocoder
).loc[:, ["loc_id", "full_address", "SHAPE"]]
# after pruning the schema, re-enable spatial awareness
geocode_df.spatial.set_geometry("SHAPE")
assert geocode_df.spatial.validate()
geocode_df
 | loc_id | full_address | SHAPE |
---|---|---|---|
0 | Facility 1 | 2468 SOUTH ST ANDREWS PLACE, LOS ANGELES, CA 9... | {"x": -118.31127251419741, "y": 34.03313999252... |
1 | Facility 2 | 2300 W. WASHINGTON BLVD., LOS ANGELES, CA 90018 | {"x": -118.31183535899584, "y": 34.03988893331... |
2 | Facility 3 | 4060 E. WHITTIER BLVD., LOS ANGELES, CA 90023 | {"x": -118.1843180294075, "y": 34.023902464669... |
3 | Facility 4 | 6070 W. PICO BOULEVARD, LOS ANGELES, CA 90035 | {"x": -118.37276542483494, "y": 34.05264979417... |
4 | Facility 5 | 1480 S. LA CIENEGA BL, LOS ANGELES, CA 90035 | {"x": -118.37613251915946, "y": 34.05099298527... |
As in the example from the first Notebook, study areas can be polygons defined manually beforehand and provided as input. They can also be standard geographic areas defined with the unique identifiers for the areas, such as postal (ZIP) codes. Finally, as is the case with our example, study areas can be provided as lines or points. Since lines and points do not define an area, in these cases, polygons are created on the server to use for apportioning data to each location.
By default, the polygon created around each line or point is a five-kilometer straight-line buffer. This can be controlled using the proximity parameters of the enrich method: proximity_type, proximity_value and proximity_metric. For line geometries, only the straight-line method can be used, but for point geometries, any transportation network method available in the GIS can be used to define the area surrounding the points, thus delineating the study areas to be used.
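As a rough illustration of how the three proximity parameters fit together, the sketch below builds keyword arguments for a straight-line buffer and for a network-based area. The helper proximity_kwargs is hypothetical, and the value "straight_line" for proximity_type is an assumption about how the straight-line method is named, not something stated in this notebook.

```python
def proximity_kwargs(geometry_type, proximity_type, value, metric):
    """Build the proximity keyword arguments for an enrich call.

    Hypothetical helper: it only enforces the rule described above, that
    line geometries support straight-line areas only.
    """
    # "straight_line" is an assumed name for the straight-line method
    if geometry_type == "line" and proximity_type != "straight_line":
        raise ValueError("line study areas only support straight-line areas")
    if value <= 0:
        raise ValueError("proximity value must be positive")
    return {
        "proximity_type": proximity_type,
        "proximity_value": value,
        "proximity_metric": metric,
    }


buffer_args = proximity_kwargs("point", "straight_line", 5, "kilometers")
drive_args = proximity_kwargs("point", "driving_time", 8, "minutes")
print(buffer_args)
print(drive_args)
```

A dictionary built this way could be unpacked into the enrich call with ** so that the three parameters always travel together.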
Discover Available Travel Modes
In this example, we know customers will travel about eight minutes to visit the store locations. The enrich method is capable of creating eight-minute drive time areas around the stores for us, but we need to know how to provide the correct inputs. We can discover available travel modes using the travel_modes property.
country.travel_modes
 | name | alias | description | type | impedance | impedance_category | time_attribute_name | distance_attribute_name | travel_mode_id | travel_mode_dict |
---|---|---|---|---|---|---|---|---|---|---|
0 | driving_time | Driving Time | Models the movement of cars and other similar ... | AUTOMOBILE | TravelTime | temporal | TravelTime | Kilometers | FEgifRtFndKNcJMJ | {"attributeParameterValues": [{"attributeName"... |
1 | driving_distance | Driving Distance | Models the movement of cars and other similar ... | AUTOMOBILE | Kilometers | distance | TravelTime | Kilometers | iKjmHuBSIqdEfOVr | {"attributeParameterValues": [{"attributeName"... |
2 | trucking_time | Trucking Time | Models basic truck travel by preferring design... | TRUCK | TruckTravelTime | temporal | TruckTravelTime | Kilometers | ZzzRtYcPLjXFBKwr | {"attributeParameterValues": [{"attributeName"... |
3 | trucking_distance | Trucking Distance | Models basic truck travel by preferring design... | TRUCK | Kilometers | distance | TruckTravelTime | Kilometers | UBaNfFWeKcrRVYIo | {"attributeParameterValues": [{"attributeName"... |
4 | walking_time | Walking Time | Follows paths and roads that allow pedestrian ... | WALK | WalkTime | temporal | WalkTime | Kilometers | caFAgoThrvUpkFBW | {"attributeParameterValues": [{"attributeName"... |
5 | walking_distance | Walking Distance | Follows paths and roads that allow pedestrian ... | WALK | Kilometers | distance | WalkTime | Kilometers | yFuMFwIYblqKEefX | {"attributeParameterValues": [{"attributeName"... |
6 | rural_driving_time | Rural Driving Time | Models the movement of cars and other similar ... | AUTOMOBILE | TravelTime | temporal | TravelTime | Kilometers | NmNhNDUwZmE1YTlj | {"attributeParameterValues": [{"attributeName"... |
7 | rural_driving_distance | Rural Driving Distance | Models the movement of cars and other similar ... | AUTOMOBILE | Kilometers | distance | TravelTime | Kilometers | Yzk3NjI1NTU5NjVj | {"attributeParameterValues": [{"attributeName"... |
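When the list of travel modes is long, it can help to narrow it to the modes measured in time before choosing one. The sketch below works on a hand-copied subset of the table above, held as a list of dictionaries (the shape that DataFrame.to_dict("records") produces). The variable names are illustrative only.

```python
# subset of the travel modes table above, copied by hand for illustration
travel_modes = [
    {"name": "driving_time", "impedance_category": "temporal"},
    {"name": "driving_distance", "impedance_category": "distance"},
    {"name": "walking_time", "impedance_category": "temporal"},
    {"name": "walking_distance", "impedance_category": "distance"},
]

# keep only the modes whose impedance is measured in time
time_modes = [
    mode["name"]
    for mode in travel_modes
    if mode["impedance_category"] == "temporal"
]
print(time_modes)
```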
Any value from the name column can be used directly as the proximity_type input to the enrich method to define the study area. For this example, to define eight-minute drive times, we can populate the proximity parameters accordingly.
enrich_df = country.enrich(
    geocode_df,
    enrich_variables=analysis_variables,
    proximity_type="driving_time",
    proximity_value=8,
    proximity_metric="minutes",
)
enrich_df
 | loc_id | full_address | source_country | area_type | buffer_units | buffer_units_alias | buffer_radii | aggregation_method | population_to_polygon_size_rating | apportionment_confidence | has_data | medage_cy | totpop_cy | avghhsz_cy | bachdeg_cy | medhinc_cy | divindx_cy | SHAPE |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Facility 1 | 2468 SOUTH ST ANDREWS PLACE, LOS ANGELES, CA 9... | USA | NetworkServiceArea | Minutes | Drive Time Minutes | 8.0 | BlockApportionment:US.BlockGroups;PointsLayer:... | 2.191 | 2.576 | 1 | 32.8 | 276718.0 | 2.81 | 33723.0 | 48083.0 | 87.8 | {"rings": [[[-118.31409427984764, 34.064380416... |
1 | Facility 2 | 2300 W. WASHINGTON BLVD., LOS ANGELES, CA 90018 | USA | NetworkServiceArea | Minutes | Drive Time Minutes | 8.0 | BlockApportionment:US.BlockGroups;PointsLayer:... | 2.191 | 2.576 | 1 | 33.7 | 305454.0 | 2.68 | 43130.0 | 50213.0 | 88.1 | {"rings": [[[-118.31409427984764, 34.072465217... |
2 | Facility 3 | 4060 E. WHITTIER BLVD., LOS ANGELES, CA 90023 | USA | NetworkServiceArea | Minutes | Drive Time Minutes | 8.0 | BlockApportionment:US.BlockGroups;PointsLayer:... | 2.191 | 2.576 | 1 | 30.5 | 170309.0 | 3.62 | 9400.0 | 52719.0 | 65.7 | {"rings": [[[-118.16227969122916, 34.070668594... |
3 | Facility 4 | 6070 W. PICO BOULEVARD, LOS ANGELES, CA 90035 | USA | NetworkServiceArea | Minutes | Drive Time Minutes | 8.0 | BlockApportionment:US.BlockGroups;PointsLayer:... | 2.191 | 2.576 | 1 | 38.5 | 201739.0 | 2.21 | 54857.0 | 96918.0 | 80.4 | {"rings": [[[-118.36597175035031, 34.088185662... |
4 | Facility 5 | 1480 S. LA CIENEGA BL, LOS ANGELES, CA 90035 | USA | NetworkServiceArea | Minutes | Drive Time Minutes | 8.0 | BlockApportionment:US.BlockGroups;PointsLayer:... | 2.191 | 2.576 | 1 | 38.6 | 198064.0 | 2.18 | 54132.0 | 97074.0 | 79.6 | {"rings": [[[-118.37652690642967, 34.088185662... |
The response includes metadata related to how the enrichment was performed. However, if we are only interested in the demographic columns added, we can filter using the available enrich variable names.
# get just the enrich columns
enrich_cols = [
    c for c in enrich_df if c in country.enrich_variables.name.str.lower().values
]
# combine the enrich columns with a few others we want to keep
keep_cols = ["loc_id"] + enrich_cols + ["SHAPE"]
# filter the enrich data frame to just these columns
enrich_df = enrich_df.loc[:, keep_cols].set_index("loc_id")
# re-enable spatial awareness
enrich_df.spatial.set_geometry("SHAPE")
enrich_df
loc_id | medage_cy | totpop_cy | avghhsz_cy | bachdeg_cy | medhinc_cy | divindx_cy | SHAPE |
---|---|---|---|---|---|---|---|
Facility 1 | 32.8 | 276718.0 | 2.81 | 33723.0 | 48083.0 | 87.8 | {"rings": [[[-118.31409427984764, 34.064380416... |
Facility 2 | 33.7 | 305454.0 | 2.68 | 43130.0 | 50213.0 | 88.1 | {"rings": [[[-118.31409427984764, 34.072465217... |
Facility 3 | 30.5 | 170309.0 | 3.62 | 9400.0 | 52719.0 | 65.7 | {"rings": [[[-118.16227969122916, 34.070668594... |
Facility 4 | 38.5 | 201739.0 | 2.21 | 54857.0 | 96918.0 | 80.4 | {"rings": [[[-118.36597175035031, 34.088185662... |
Facility 5 | 38.6 | 198064.0 | 2.18 | 54132.0 | 97074.0 | 79.6 | {"rings": [[[-118.37652690642967, 34.088185662... |
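The filter above works because the enriched columns come back in lower case while the variable names were requested in upper case. A minimal sketch of the same idea, matching against only the requested variable list instead of every variable available for the country, might look like this. The helper name select_enrich_columns is hypothetical.

```python
def select_enrich_columns(columns, requested):
    """Return the columns matching the requested variable names, ignoring case."""
    wanted = {name.lower() for name in requested}
    # preserve the order in which the columns appear in the response
    return [c for c in columns if c.lower() in wanted]


response_columns = ["loc_id", "area_type", "has_data", "medage_cy", "totpop_cy", "SHAPE"]
requested = ["TOTPOP_CY", "MEDAGE_CY"]
print(select_enrich_columns(response_columns, requested))
```

Matching against the requested list avoids keeping a metadata column that happens to share a name with some unrelated variable.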
Evaluate Results
An extremely effective starting point for analysis is visualizing the results. Here, we are using matplotlib to visualize the differences between the locations based on the enriched data.
import warnings
import matplotlib.pyplot as plt
# silence warnings; a deprecation warning is raised inside matplotlib
warnings.filterwarnings("ignore")
fig, axs = plt.subplots(2, 3)
fig.set_figheight(10.0)
fig.set_figwidth(18.0)
fig.subplots_adjust(hspace=0.4)
plt.sca(axs[0, 0])
_ = enrich_df.medage_cy.plot(title="Median Age", kind="bar")
plt.sca(axs[0, 1])
_ = enrich_df.totpop_cy.plot(title="Total Population", kind="bar")
plt.sca(axs[0, 2])
_ = enrich_df.avghhsz_cy.plot(title="Average Household Size", kind="bar")
plt.sca(axs[1, 0])
_ = enrich_df.bachdeg_cy.plot(title="Bachelor's Degree", kind="bar")
plt.sca(axs[1, 1])
_ = enrich_df.medhinc_cy.plot(title="Median Household Income", kind="bar")
plt.sca(axs[1, 2])
_ = enrich_df.divindx_cy.plot(title="Diversity Index", kind="bar")
As in Part 1:
Facility 1 and facility 2 have higher populations and are diverse, with lower incomes. Facility 3 is far younger, with larger households, less education and lower income. Facility 4 and facility 5 are older, more educated and have higher incomes.
If we are interested in opening a discount department store, facility 2 is the most attractive location, with facility 1 a close second. The diversity and lower income suggest that customers there will favor lower prices.
If we are interested in opening a quick-service restaurant, facility 3 may be the best option to meet the needs of a young, busy and price-conscious population.
Obviously, depending on the key characteristics of the business looking for a new location, the key demographic indicators will be different. Using geoenrichment, paired with the ArcGIS API for Python, enables extremely quick access to demographic variables for informed decision making.
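To make the discount store reasoning above a little more concrete, the sketch below ranks the sites with a simple min-max score. The scoring rule is hypothetical and not part of the original analysis: it assumes equal weights, treats larger population and diversity as better, and treats higher income as worse. The values are copied from the enriched results table.

```python
# enriched values copied from the results table above
sites = {
    "Facility 1": {"totpop_cy": 276718.0, "divindx_cy": 87.8, "medhinc_cy": 48083.0},
    "Facility 2": {"totpop_cy": 305454.0, "divindx_cy": 88.1, "medhinc_cy": 50213.0},
    "Facility 3": {"totpop_cy": 170309.0, "divindx_cy": 65.7, "medhinc_cy": 52719.0},
    "Facility 4": {"totpop_cy": 201739.0, "divindx_cy": 80.4, "medhinc_cy": 96918.0},
    "Facility 5": {"totpop_cy": 198064.0, "divindx_cy": 79.6, "medhinc_cy": 97074.0},
}

# hypothetical preferences for a discount store (assumed, equal weights)
higher_is_better = {"totpop_cy": True, "divindx_cy": True, "medhinc_cy": False}


def min_max_scores(sites, higher_is_better):
    """Sum min-max normalized values (0 to 1 per variable) for each site."""
    totals = {name: 0.0 for name in sites}
    for var, prefer_high in higher_is_better.items():
        values = [row[var] for row in sites.values()]
        low, high = min(values), max(values)
        # guard against a variable that is identical across all sites
        span = (high - low) or 1.0
        for name, row in sites.items():
            scaled = (row[var] - low) / span
            totals[name] += scaled if prefer_high else 1.0 - scaled
    return totals


scores = min_max_scores(sites, higher_is_better)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Under these illustrative assumptions, facility 2 ranks first and facility 1 second, which agrees with the reading of the charts. Different weights or variables would be needed for a different kind of business.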