The ArcGIS API for Python allows GIS analysts and data scientists to query, visualize, analyze, and transform their spatial data using the powerful GeoAnalytics Tools available in their organization. Learn more about the analysis capabilities of the API at the documentation site.
The big data analysis tools can be accessed via the arcgis.geoanalytics module.
Tools Overview
The GeoAnalytics tools are presented through a set of submodules within the arcgis.geoanalytics module. To view the list of tools available, refer to the page titled Working with big data. On this page, we will learn how to execute big data tools.
Get started
The arcgis.geoanalytics module provides types and functions for distributed analysis of large datasets. These GeoAnalytics tools work with big data registered in the GIS's datastores as well as with feature layers.
Use arcgis.geoanalytics.is_supported(gis) to check if GeoAnalytics is supported in your GIS.
Feature Input
You can run the GeoAnalytics Tools on the following:
- arcgis.features.FeatureLayer (hosted, hosted feature layer views, and from feature services)
- arcgis.features.FeatureCollection
- Big data file shares registered with ArcGIS GeoAnalytics Server
Feature Output
The output from running GeoAnalytics Tools can be one of two options:
- A hosted feature layer with data stored in ArcGIS Data Store registered with the portal's hosting server.
- A dataset stored to a big data file share (a folder, cloud store, HDFS location) that you have registered with your GeoAnalytics Server.
Refer to this page for detailed information about feature layers and features.
Next, we will specify which big data file share the GeoAnalytics results will save to. If set to None, the arcgis.env.output_datastore will reset to the default. Allowed string values are: spatiotemporal or relational.
import arcgis
arcgis.geoanalytics.define_output_datastore(datastore='relational')
True
Environment settings
The arcgis.env module provides a shared environment used by the different modules. It stores globals, such as the currently active GIS, the default geocoder, and more. It also stores environment settings that are common among all tools, such as the output spatial reference, cell size, etc.
Set spatial reference
The GeoAnalytics Tools use a process spatial reference during execution. Analyses with square or hexagon bins require a projected coordinate system. We'll use the World Cylindrical Equal Area projection (WKID 54034) below (as it is the default used when running tools in ArcGIS Online). All results are stored in the spatiotemporal datastore of the Enterprise in the WGS 84 Spatial Reference.
See the GeoAnalytics Documentation for a full explanation of analysis environment settings.
arcgis.env.process_spatial_reference = 54034
Verbosity of messages
The ArcGIS Platform, including the ArcGIS API for Python, manages and transforms geographic data with a large suite of tools and functions collectively known as geoprocessing. The GeoAnalytics Tools in the ArcGIS API for Python are a subset of geoprocessing tools that operate in the context of a geoprocessing environment. You can set various aspects of this environment to control how tools are executed and what messages you receive during and after the execution. See the Logging and error handling section in the API for Python Geoprocessing Guide's Advanced concepts for ways to control messaging, including the arcgis.env.verbose setting.
arcgis.env.verbose=True
Context Parameter
ArcGIS GeoAnalytics Server tasks that have the outSR property in their Context parameter will save results in the specified spatial reference. If you are saving the results to the spatiotemporal data store, all results will be projected to World Geographic Coordinate System 1984 after analysis for storage, and the outSR will not be used. Set the spatial reference that results will be analyzed in using the Process Spatial Reference property.
GeoAnalytics operations use the following context parameters defined in the arcgis.env module:
Context Parameter | Description |
---|---|
out_spatial_reference | Used for setting the output spatial reference. |
process_spatial_reference | Used for setting the processing spatial reference. |
analysis_extent | Used for setting the analysis extent. |
output_datastore | Used for setting the output datastore to be used. |
#example
context = {
"extent": {
"xmin": -122.68,
"ymin": 45.53,
"xmax": -122.45,
"ymax": 45.6,
"spatialReference": {
"wkid": 4326
}
},
"outSR" : {"wkid" : 3857},
"dataStore" : "relational"
}
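As an illustration, the context dictionary above can be assembled with a small helper. This helper is our own sketch, not part of the arcgis API; the key names ("extent", "outSR", "dataStore") follow the example above.

```python
# Hypothetical helper for assembling a GeoAnalytics context dictionary.
def build_context(extent=None, out_sr_wkid=None, datastore=None):
    context = {}
    if extent is not None:
        context["extent"] = extent
    if out_sr_wkid is not None:
        context["outSR"] = {"wkid": out_sr_wkid}
    if datastore is not None:
        # Guard against unsupported datastore values.
        if datastore not in ("relational", "spatiotemporal"):
            raise ValueError("datastore must be 'relational' or 'spatiotemporal'")
        context["dataStore"] = datastore
    return context

ctx = build_context(
    extent={"xmin": -122.68, "ymin": 45.53, "xmax": -122.45, "ymax": 45.6,
            "spatialReference": {"wkid": 4326}},
    out_sr_wkid=3857,
    datastore="relational",
)
```

The resulting dictionary can then be passed as the context argument of a GeoAnalytics tool.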
Executing a GeoAnalytics tool
In the previous guide, you learned how to register a big data file share with your ArcGIS GeoAnalytics Server. When you add a big data file share, a corresponding item gets created on your portal. You can search for it like any other portal Item and query its layers.
# connect to Enterprise GIS
from arcgis.gis import GIS
import arcgis.geoanalytics
portal_gis = GIS("your_enterprise_profile")
When no parameters are specified with geoanalytics methods, they use the active GIS connection, which you can query with the arcgis.env.active_gis property. However, if you are working with more than one GIS object, you can specify the desired GIS object as the gis parameter of the method. For example, let us create a connection to an Enterprise deployment and check if GeoAnalytics is supported.
Ensure your GIS supports GeoAnalytics
After connecting to the Enterprise portal, you need to ensure the ArcGIS Enterprise GIS is set up with a licensed GeoAnalytics server. To do so, we will call the is_supported() method.
arcgis.geoanalytics.is_supported(gis=portal_gis)
True
Search big data file share item
Adding a big data file share to the GeoAnalytics server adds a corresponding big data file share item in the portal. We can search for these types of items using the item_type parameter.
search_result = portal_gis.content.search("bigDataFileShares_ServiceCallsOrleans",
item_type = "big data file share",
max_items=40)
search_result
[<Item title:"bigDataFileShares_ServiceCallsOrleans" type:Big Data File Share owner:portaladmin>]
data_item = search_result[0]
data_item
Querying the layers property of the item returns a list of layers representing the data; each one is an API Layer object.
#displays layers in the item
data_item.layers
[<Layer url:"https://pythonapi.playground.esri.com/ga/rest/services/DataStoreCatalogs/bigDataFileShares_ServiceCallsOrleans/BigDataCatalogServer/yearly_calls">]
calls = data_item.layers[0] #select first layer
calls
<Layer url:"https://pythonapi.playground.esri.com/ga/rest/services/DataStoreCatalogs/bigDataFileShares_ServiceCallsOrleans/BigDataCatalogServer/yearly_calls">
Access the aggregate_points() tool through the summarize_data module. This example uses the Aggregate Points tool to aggregate the point features representing service calls into one-meter hexagonal bins, matching the parameters used below. The tool creates an output feature layer in your portal that you can access once processing is complete.
from arcgis.geoanalytics.summarize_data import aggregate_points
from datetime import datetime as dt
Sync execution
By default, all the tools have the future parameter set to False. The tools return output results as feature layer items.
agg_result1 = aggregate_points(calls,
bin_type='Hexagon',
bin_size=1,
bin_size_unit='Meters',
output_name="aggregate results of call" + str(dt.now().microsecond))
agg_result1
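The output_name in the call above appends the current microsecond so that repeated runs do not collide with an existing item name. That pattern can be sketched as a small helper (the helper name is ours, not part of the API); using a full timestamp makes collisions even less likely than the microsecond alone:

```python
from datetime import datetime as dt

def unique_output_name(prefix):
    # Append a timestamp so repeated runs produce distinct item names.
    return prefix + dt.now().strftime("%Y%m%d%H%M%S%f")

name = unique_output_name("aggregate results of call")
```

The returned name can be passed directly as the output_name argument of a tool.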
Async execution
If future=True, a GPJob is returned rather than results. The GPJob can be queried for the status of the execution.
agg_result2 = aggregate_points(calls,
bin_type='Hexagon',
bin_size=1,
bin_size_unit='Meters',
output_name="aggregate results of call" + str(dt.now().microsecond),
future=True)
agg_result2
<AggregatePoints GA Job: jd47e5f0d6f82413fb31be8bd6ec476d7>
agg_result2.result()
{"messageCode":"BD_101054","message":"Some records have either missing or invalid geometries."} {"messageCode":"BD_101088","message":"Some result features were clipped to the valid extent of the resulting spatial reference."}
The aggregate points tool returns a feature layer item that contains the processed results.
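If you submit several jobs with future=True, you can poll each GPJob before collecting its results. A minimal sketch, assuming the job object exposes done() and result() as shown above (the helper function itself is ours, not part of the API):

```python
import time

def wait_for_result(job, poll_interval=2.0):
    # Poll the job until the server reports completion, then fetch the result.
    while not job.done():
        time.sleep(poll_interval)
    return job.result()
```

For example, `agg_layer = wait_for_result(agg_result2)` blocks until the aggregation finishes and returns the output item. Calling result() directly achieves the same end, but a polling loop lets you interleave other work or report progress between checks.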
Apply spatial filter
The context parameter helps to set spatial and temporal filters. It takes the following keys to set an extent or time filter.
The tool output above shows that some data points fall outside New Orleans because of missing or invalid geometries. We want to explore data points within New Orleans city limits, so we will run the tool only in the zoomed extent. Let's set our area of interest to the zoomed extent of the map.
ext = m1.extent
ext
{'spatialReference': {'latestWkid': 3857, 'wkid': 102100}, 'xmin': -10022118.236961203, 'ymin': 3491517.7562587974, 'xmax': -10017417.35972154, 'ymax': 3493428.6819659774}
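Per the earlier context example, the extent is expected under an "extent" key of the context dictionary, so the map widget's extent dictionary should be wrapped before being passed to the tool. A trivial sketch of that wrapping (the helper name is ours):

```python
def extent_to_context(ext):
    # Wrap a map widget extent dictionary in the "extent" key
    # that the GeoAnalytics context parameter expects.
    return {"extent": ext}

ext = {'spatialReference': {'latestWkid': 3857, 'wkid': 102100},
       'xmin': -10022118.24, 'ymin': 3491517.76,
       'xmax': -10017417.36, 'ymax': 3493428.68}
ctx = extent_to_context(ext)
```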
agg_result3 = aggregate_points(calls,
                               bin_type='Hexagon',
                               bin_size=1,
                               bin_size_unit='Meters',
                               output_name="aggregate results of call" + str(dt.now().microsecond),
                               context={'extent': ext})
agg_result3
agg_result3
Attaching log redirect Log level set to DEBUG Detaching log redirect
Apply filter by field value
Using the filter property, you can apply a filter on feature layers to run your analysis only on a subset of the data.
item = portal_gis.content.get('67908048c99f44998dfd464de004bffa')
item
fl = item.layers[0]
fl.query(as_df=True).columns
Index(['BLOCK_ADDRESS', 'Disposition', 'DispositionText', 'INSTANT_DATETIME', 'Location', 'MapX', 'MapY', 'NOPD_Item', 'OBJECTID', 'PoliceDistrict', 'Priority', 'SHAPE', 'TimeArrive', 'TimeClosed', 'TimeCreate', 'TimeDispatch', 'TypeText', 'Type_', 'Zip', 'globalid'], dtype='object')
# Apply a filter on Zip field
fl.filter = 'Zip=70119'
agg_result4 = aggregate_points(fl,
bin_type='Hexagon',
bin_size=1,
bin_size_unit='Meters',
output_name="aggregate results of call" + str(dt.now().microsecond))
agg_result4
Attaching log redirect Log level set to DEBUG {"messageCode":"BD_101068","message":"Bin generation and analysis requires a projected coordinate system and a default projection of World Cylindrical Equal Area has been applied."} Detaching log redirect
The screenshot above displays the aggregated results for the 70119 zip code.
Apply time filter
You can also apply a time filter using the time_filter property, which filters a time-enabled feature layer by datetime. When you apply the filter, the analysis will only be performed on the time-filtered features. Refer to this page for more details.
# Apply a filter by datetime
fl.time_filter = '2017'
In this guide, we have learned about the analysis capabilities available in the arcgis.geoanalytics module and how some common concepts, such as environment settings, sync execution, filters, etc., can be applied across all tools. In the next guide, we will learn in more detail about the tools available in the arcgis.geoanalytics.summarize_data submodule.