Analysing the factors of growth and spatial distribution of Airbnb properties across New York City

Introduction

Airbnb properties across cities are a great alternative for travellers to find comparatively cheaper accommodation. It also provides homeowners opportunities to utilize spare or unused rooms as an additional income source. However in recent times the alarming spread of Airbnb properties has become a topic of debate among the public and the city authorities across the world.

Considering the above, a study is carried out in this sample notebook to understand the factors that are fuelling widespread growth in the number of Airbnb listings. These might include location characteristics of concerned neighbourhoods (which in this case, NYC census tracts) and as well as qualitative information about the inhabitants residing in them. The goal is to help city planners deal with the negative externalities of the Airbnb phenomenon (and similar short term rentals) by making informed decision on framing suitable policies.

The primary data is downloaded from the Airbnb website for the city of New York. Other data includes 2019 and 2017 census data using Esri's enrichment services, and various other datasets from the NYCOpenData portal.

Note: We need to install pillow version 9.0.0, seaborn and scikit-learn for this notebook

Necessary Imports

if you have pillow version 9.0.0, seaborn and scikit-learn installed you can skip running the next three cells

pip install pillow==9.0.0

pip install seaborn

pip install scikit-learn

%matplotlib inline
import matplotlib.pyplot as plt


from datetime import datetime as dt
import pandas as pd
import numpy as np
from IPython.display import display, HTML
from IPython.core.pylabtools import figsize
import seaborn as sns


# Machine Learning models
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
import sklearn.metrics as metrics
from sklearn import preprocessing

# Arcgis api imports
import arcgis
from arcgis.geoenrichment import Country
from arcgis.features import summarize_data
from arcgis.features.enrich_data import enrich_layer
from arcgis.features import use_proximity 
from arcgis.gis import GIS
from arcgis.features import summarize_data

gis = GIS(profile='your_online_profile')

Access the NYC Airbnb and Tracts dataset

Airbnb Data - It contains information about 48,000 Airbnb properties available in New York as of 2019. These include location of the property, its neighbourhood characters and transit facilities available, information about the owner, details of the room including number of bedrooms etc., and rental price per night.

NYC Tracts - It is a polygon shapefile consisting 2167 tracts of New York City, including area of the tracts along with unique id for each tract.

# Accessing NYCTracts
nyc_tract_full = gis.content.search('NYCTractData owner:api_data_owner', 'feature layer')[0]
nyc_tract_full

NYCTractData

Feature Layer Collection by api_data_owner
Last Modified: August 14, 2019
0 comments, 558 views

nyc_tracts_layer = nyc_tract_full.layers[0]

# Accessing airbnb NYC
airbnb_nyc2019 = gis.content.search('AnBNYC2019 owner:api_data_owner', 'feature layer')[0]
airbnb_nyc2019

AnBNYC2019

Feature Layer Collection by api_data_owner
Last Modified: September 30, 2019
0 comments, 133 views

airbnb_layer = airbnb_nyc2019.layers[0]

Visualizing dataset

# NYC Tracts
m1 = gis.map('New York City')
m1.content.add(nyc_tracts_layer)
m1

# NYC Airbnb Properties
m = gis.map('Springfield Gardens, NY')
m.content.add(airbnb_layer)
m

# extracting the dataframe from the layer and visualize it as a pandas dataframe
pd.set_option('display.max_columns', 110)
sdf_airbnb_layer = pd.DataFrame.spatial.from_layer(airbnb_layer)
sdf_airbnb_layer.head(2)

	FID	id	scrape_id	last_scrap	name	summary	space	descriptio	experience	neighborho	notes	transit	access	interactio	house_rule	host_since	host_locat	host_respo	host_res_1	host_accep	host_is_su	host_neigh	host_listi	host_total	host_has_p	host_ident	street	neighbourh	neighbou_1	neighbou_2	city	state	zipcode	market	smart_loca	country_co	country	latitude	longitude	is_locatio	property_t	room_type	accommodat	bathrooms	bedrooms	beds	bed_type	amenities	square_fee	price	weekly_pri	monthly_pr	security_d	cleaning_f	guests_inc	extra_peop	minimum_ni	maximum_ni	minimum_mi	maximum_mi	minimum_ma	maximum_ma	minimum__1	maximum__1	calendar_u	has_availa	availabili	availabi_1	availabi_2	availabi_3	calendar_l	number_of_	number_of1	first_revi	last_revie	review_sco	review_s_1	review_s_2	review_s_3	review_s_4	review_s_5	review_s_6	requires_l	license	jurisdicti	instant_bo	is_busines	cancellati	require_gu	require__1	calculated	calculat_1	calculat_2	calculat_3	reviews_pe	airbnb	SHAPE
0	1	121861	20200000000000.0	2019-06-03	Park Slope Apt:, Spacious 2 bedroom		Imagine a quiet, spacious apartment, with beau...	Imagine a quiet, spacious apartment, with beau...	none						No pets, no smoking. The $25/night for each gu...	2011-05-22	New York, New York, United States	within a day	100%	N/A	t	Park Slope	2	2	t	t	Brooklyn, NY, United States	Brooklyn	Park Slope	Brooklyn	Brooklyn	NY	11215	New York	Brooklyn, NY	US	United States	40.67644	-73.98082	t	Apartment	Entire home/apt	4	2.0	2	2	Real Bed	{TV,"Cable TV",Internet,Wifi,"Air conditioning...	1500	165.0	$1,050.00		$250.00	$60.00	2	$0.00	2	730	2	2	730	730	2.0	730.0	4 weeks ago	t	0	0	0	0	2019-06-03	23	0	2011-05-29	2016-05-02	99	10	10	10	10	10	10	f			f	f	moderate	f	f	2	1	1	0	0.24	1	{'x': -8235507.210868829, 'y': 4964733.1453062...
1	2	123784	20200000000000.0	2019-06-03	NYC Studio for Rent in Townhouse	Comfortable, spacious studio in one of the mos...	This is a large studio room with a private bat...	Comfortable, spacious studio in one of the mos...	none	The new restaurants, stores and cafes. Everyth...	45.00 dollar fee for air-conditioner in the su...		Everything in the studio is for their use.	As much as the guest would like.	no loud music no pets no children $300 dollar...	2011-05-23	New York, New York, United States	within an hour	100%	N/A	f	Harlem	2	2	t	f	New York, NY, United States	Harlem	Harlem	Manhattan	New York	NY	10027	New York	New York, NY	US	United States	40.80481	-73.94794	t	Apartment	Entire home/apt	3	1.0	0	2	Real Bed	{TV,"Cable TV",Wifi,"Air conditioning","Paid p...	0	110.0	$735.00	$3,200.00	$500.00	$60.00	1	$25.00	2	365	2	2	365	365	2.0	365.0	7 months ago	t	8	28	43	296	2019-06-03	138	42	2011-05-30	2019-05-17	94	10	9	10	10	10	9	f			t	f	strict_14_with_grace_period	t	t	2	1	1	0	1.41	1	{'x': -8231847.026011546, 'y': 4983593.6741002...

Aggregating number of Airbnb properties by Tracts for NYC

Number of Airbnb properties per tract is to be estimated using the polygon tract layer and the Airbnb point layer.

The Aggregate Points tool uses area features to summarize a set of point features. The boundaries from the area feature are used to collect the points within each area and use them to calculate statistics. The resulting layer displays the count of points within each area. Here, the polygon tract layer is used as the area feature, and the Airbnb point layer is used as the point feature.

agg_result = summarize_data.aggregate_points(point_layer=airbnb_layer,
                                             polygon_layer=nyc_tracts_layer,
                                             output_name='airbnb_counts'+ str(dt.now().microsecond))

{"cost": 50.968}

agg_result

airbnb_counts131815

Feature Layer Collection by arcgis_python
Last Modified: April 30, 2024
0 comments, 0 views

# mapping the aggregated airbnb data with darker areas showing more airbnb properties per tract
aggr_map = gis.map('New York, NY')
aggr_map.content.add(agg_result)
aggr_map

agg_result_renderer = aggr_map.content.renderer(0)
agg_result_renderer.smart_mapping().class_breaks_renderer(break_type="color", field="Point_Cloud")

airbnb_count_by_tract = agg_result.layers[0]

sdf_airbnb_count_by_tract = airbnb_count_by_tract.query().sdf

sdf_airbnb_count_by_tract = sdf_airbnb_count_by_tract.sort_values('geoid')
sdf_airbnb_count_by_tract.head()

	OBJECTID	statefp	countyfp	tractce	geoid	name	namelsad	mtfcc	funcstat	aland	awater	intptlat	intptlon	Point_Count	AnalysisArea	SHAPE
2095	2096	36	005	000100	36005000100	1	Census Tract 1	G5020	S	1579361	1125765	+40.7934921	-073.8835318	0	2.705062	{"rings": [[[-8226256.9418, 4982172.581], [-82...
2059	2060	36	005	000200	36005000200	2	Census Tract 2	G5020	S	455322	926899	+40.8045733	-073.8568585	0	1.382228	{"rings": [[[-8222638.612, 4985024.3226], [-82...
2067	2068	36	005	000400	36005000400	4	Census Tract 4	G5020	S	912392	602945	+40.8089152	-073.8504884	15	1.515336	{"rings": [[[-8222012.885, 4985135.2266], [-82...
1668	1669	36	005	001600	36005001600	16	Census Tract 16	G5020	S	485079	0	+40.8188478	-073.8580764	1	0.485076	{"rings": [[[-8222181.7567, 4986069.1354], [-8...
2127	2128	36	005	001900	36005001900	19	Census Tract 19	G5020	S	1643654	1139660	+40.8009990	-073.9093729	24	2.783331	{"rings": [[[-8230028.8927, 4984061.5402], [-8...

Here the Point_Count field from the above aggregated dataframe returns the number of Airbnb properties per tract. This would form the target variable for this problem.

Enriching tracts with demographic data using geoenrichment service from Esri

The feature data is now created using selected demographics information for each tracts. This is accomplished accessing the geoenrichment services from Esri, which consists the latest census data. The entire data repository is first visualized, out of which the relevant variables are finalized from a literature study. These selected variables are searched for adding in the feature set.

# Displaying the various data topic available for geoenrichment for USA in the Esri database
usa = Country.get('US')
type(usa)
usa_data = usa.data_collections
df_usa_data = pd.DataFrame(usa_data)
df_usa_data.head()

	analysisVariable	alias	fieldCategory	vintage
dataCollectionID
1yearincrements	1yearincrements.AGE0_CY	2023 Population Age <1	2023 Age: 1 Year Increments (Esri)	2023
1yearincrements	1yearincrements.AGE1_CY	2023 Population Age 1	2023 Age: 1 Year Increments (Esri)	2023
1yearincrements	1yearincrements.AGE2_CY	2023 Population Age 2	2023 Age: 1 Year Increments (Esri)	2023
1yearincrements	1yearincrements.AGE3_CY	2023 Population Age 3	2023 Age: 1 Year Increments (Esri)	2023
1yearincrements	1yearincrements.AGE4_CY	2023 Population Age 4	2023 Age: 1 Year Increments (Esri)	2023

All the data topics are visualized that are available in the geoenrichment services.

# Filtering the unique topic under dataCollectionID
df_usa_data.reset_index(inplace=True)
list(df_usa_data.dataCollectionID.unique())

['1yearincrements',
 '5yearincrements',
 'Age',
 'agebyracebysex',
 'agebyracebysex2010',
 'agebyracebysex2020',
 'AgeDependency',
 'AtRisk',
 'AutomobilesAutomotiveProducts',
 'BabyProductsToysGames',
 'basicFactsForMobileApps',
 'businesses',
 'CivicActivitiesPoliticalAffiliation',
 'classofworker',
 'clothing',
 'ClothingShoesAccessories',
 'commute',
 'crime',
 'DaytimePopulation',
 'disability',
 'disposableincome',
 'DniRates',
 'education',
 'educationalattainment',
 'ElectronicsInternet',
 'employees',
 'EmploymentUnemployment',
 'entertainment',
 'financial',
 'FinancialInsurance',
 'food',
 'foodstampsSNAP',
 'gender',
 'Generations',
 'GroceryAlcoholicBeverages',
 'groupquarters',
 'Health',
 'healthinsurancecoverage',
 'HealthPersonalCare',
 'HealthPersonalCareCEX',
 'heatingfuel',
 'hispanicorigin',
 'HistoricalHouseholds',
 'HistoricalHousing',
 'HistoricalPopulation',
 'HomeImprovementGardenLawn',
 'homevalue',
 'HouseholdGoodsFurnitureAppliances',
 'householdincome',
 'households',
 'householdsbyageofhouseholder',
 'HouseholdsByIncome',
 'householdsbyraceofhouseholder',
 'householdsbysize',
 'householdtotals',
 'householdtype',
 'housingbyageofhouseholder',
 'housingbyraceofhouseholder',
 'housingbysize',
 'housingcosts',
 'HousingHousehold',
 'housingunittotals',
 'incomebyage',
 'industry',
 'industrybynaicscode',
 'industrybysiccode',
 'InternetComputerUsage',
 'KeyGlobalFacts',
 'KeyUSFacts',
 'language',
 'LeisureActivitiesLifestyle',
 'LifeInsurancePensions',
 'lifemodegroupsNEW',
 'maritalstatustotals',
 'MediaMagazinesNewspapers',
 'MediaRadioOtherAudio',
 'MediaTVViewing',
 'miscellaneous',
 'networth',
 'NonHispanicOrigin',
 'occupation',
 'OwnerRenter',
 'PetsPetProducts',
 'PhonesYellowPages',
 'Policy',
 'population',
 'populationtotals',
 'presenceofchildren',
 'PsychographicsAdvertising',
 'PsychographicsFood',
 'PsychographicsLifestyle',
 'PsychographicsMedia',
 'PsychographicsShopping',
 'RaceAndEthnicity',
 'raceandhispanicorigin',
 'restaurants',
 'RetailDemandbyNAICS',
 'sales',
 'schoolenrollment',
 'shopping',
 'spendingFactsForMobileApps',
 'SpendingTotal',
 'sports',
 'tapestryadultsNEW',
 'tapestryhouseholdsNEW',
 'TapestryNEW',
 'transportation',
 'TravelCEX',
 'travelMPI',
 'unitsinstructure',
 'urbanizationgroupsNEW',
 'vacant',
 'vehiclesavailable',
 'veterans',
 'Wealth',
 'women',
 'yearbuilt',
 'yearmovedin']

Items can be searched using alias field, for the related analysis variable name -- here as an example a variable with 'Nonprofit' is searched. Out of the these the relevant 'Nonprofit' data is to be selected.

df_usa_data[df_usa_data['alias'].str.contains('Nonprofit')]

	dataCollectionID	analysisVariable	alias	fieldCategory	vintage
3840	classofworker	classofworker.ACSMPRIVNP	2021 Civ Male 16+:Priv Nonprofit (ACS 5-Yr)	2017-2021 Class of Worker (ACS)	2017-2021
3841	classofworker	classofworker.MOEMPRIVNP	2021 Civ Male 16+:Priv Nonprofit MOE (ACS 5-Yr)	2017-2021 Class of Worker (ACS)	2017-2021
3848	classofworker	classofworker.RELMPRIVNP	2021 Civ Male 16+:Priv Nonprofit REL (ACS 5-Yr)	2017-2021 Class of Worker (ACS)	2017-2021
3873	classofworker	classofworker.ACSFPRIVNP	2021 Civ Female 16+:Priv Nonprofit (ACS 5-Yr)	2017-2021 Class of Worker (ACS)	2017-2021
3874	classofworker	classofworker.MOEFPRIVNP	2021 Civ Female 16+:Priv Nonprofit MOE (ACS 5-Yr)	2017-2021 Class of Worker (ACS)	2017-2021
3875	classofworker	classofworker.RELFPRIVNP	2021 Civ Female 16+:Priv Nonprofit REL (ACS 5-Yr)	2017-2021 Class of Worker (ACS)	2017-2021

Adding data using enrichment - At this stage a literature study is undertaken to narrow down the various factors that might impact opening of new Airbnb properties in NYC.

Subsequently these factors are identified from the USA geoenrichment database as shown above. These variable names are then compiled in a dictionary for passing them to the enrichment tool.

enrichment_variables = {'classofworker.ACSCIVEMP':      'Employed Population Age 16+',
 'classofworker.ACSMCIVEMP':                      'Employed Male Pop Age 16+',
 'classofworker.ACSMPRIVNP':                      'Male 16+Priv Nonprofit',
 'classofworker.ACSMEPRIVP':                      'Male 16+:Priv Profit Empl',
 'classofworker.ACSMSELFI':                       'Male 16+:Priv Profit Self Empl',
 'classofworker.ACSMSTGOV':                       'Male 16+:State Govt Wrkr',
 'classofworker.ACSMFEDGOV':                      'Male 16+:Fed Govt Wrkr',
 'classofworker.ACSMSELFNI':                      'Male 16+:Self-Emp Not Inc',
 'classofworker.ACSMUNPDFM':                      'Male 16+:Unpaid Family Wrkr',              
 'classofworker.ACSFCIVEMP':                      'Female Pop Age 16+',
 'classofworker.ACSFEPRIVP':                      'Female 16+:Priv Profit Empl',
 'classofworker.ACSFSELFI':                       'Female 16+:Priv Profit Self Empl',                      
 'classofworker.ACSFPRIVNP':                      'Female 16+:Priv Nonprofit',
 'classofworker.ACSFLOCGOV':                      'Female 16+:Local Govt Wrkr',
 'classofworker.ACSFSTGOV':                       'Female 16+:State Govt Wrkr',
 'classofworker.ACSFFEDGOV':                      'Female 16+:Fed Govt Wrkr',                      
 'classofworker.ACSFSELFNI':                      'Female 16+:Self-Emp Not Inc',                      
 'classofworker.ACSFUNPDFM':                      'Female 16+:Unpaid Family Wrkr',                      
 'gender.MEDAGE_CY':                              '2019 Median Age',
 'Generations.GENALPHACY':                        '2019 Generation Alpha Population',
 'Generations.GENZ_CY':                           '2019 Generation Z Population',
 'Generations.MILLENN_CY':                        '2019 Millennial Population',
 'Generations.GENX_CY':                           '2019 Generation X Population',
 'Generations.BABYBOOMCY':                        '2019 Baby Boomer Population',
 'Generations.OLDRGENSCY':                        '2019 Silent & Greatest Generations Population',
 'Generations.GENBASE_CY':                        '2019 Population by Generation Base',
 'populationtotals.POPDENS_CY':                   '2019 Population Density',
 'DaytimePopulation.DPOP_CY':                     '2019 Total Daytime Population',
 'raceandhispanicorigin.WHITE_CY':                '2019 White Population',
 'raceandhispanicorigin.BLACK_CY':                '2019 Black Population',
 'raceandhispanicorigin.AMERIND_CY':              '2019 American Indian Population',
 'raceandhispanicorigin.ASIAN_CY':                '2019 Asian Population',
 'raceandhispanicorigin.PACIFIC_CY':              '2019 Pacific Islander Population',
 'raceandhispanicorigin.OTHRACE_CY':              '2019 Other Race Population',
 'raceandhispanicorigin.DIVINDX_CY':              '2019 Diversity Index',
 'households.ACSHHBPOV':                          'HHs: Inc Below Poverty Level',
 'households.ACSHHAPOV':                          'HHs:Inc at/Above Poverty Level',
 'households.ACSFAMHH':                           'ACS Family Households',
 'businesses.S01_BUS':                            'Total Businesses (SIC)',
 'businesses.N05_BUS':                            'Construction Businesses (NAICS)',
 'businesses.N08_BUS':                            'Retail Trade Businesses (NAICS)',
 'businesses.N21_BUS':                            'Transportation/Warehouse Bus (NAICS)',
 'ElectronicsInternet.MP09147a_B':                'Own any tablet',
 'ElectronicsInternet.MP09148a_B':                'Own any e-reader',
 'ElectronicsInternet.MP19001a_B':                'Have access to Internet at home',                
 'ElectronicsInternet.MP19070a_I':                'Index: Spend 0.5-0.9 hrs online(excl email/IM .',               
 'ElectronicsInternet.MP19071a_B':                'Spend <0.5 hrs online (excl email/IM time) daily',
 'populationtotals.TOTPOP_CY':                    '2019 Total Population',              
 'gender.MALES_CY':                               '2019 Male Population',
 'gender.FEMALES_CY':                             '2019 Female Population',
 'industry.EMP_CY':                               '2019 Employed Civilian Pop 16+',
 'industry.UNEMP_CY':                             '2019 Unemployed Population 16+',                     
 'industry.UNEMPRT_CY':                           '2019 Unemployment Rate',
 'commute.ACSWORKERS':                            'ACS Workers Age 16+',
 'commute.ACSDRALONE':                            'ACS Workers 16+: Drove Alone',
 'commute.ACSCARPOOL':                            'ACS Workers 16+: Carpooled',
 'commute.ACSPUBTRAN':                            'ACS Workers 16+: Public Transportation',
 'commute.ACSBUS':                                'ACS Workers 16+: Bus',
 'commute.ACSSTRTCAR':                            'ACS Workers 16+: Streetcar',
 'commute.ACSSUBWAY':                             'ACS Workers 16+: Subway',
 'commute.ACSRAILRD':                             'ACS Workers 16+: Railroad',
 'commute.ACSFERRY':                              'ACS Workers 16+: Ferryboat',
 'commute.ACSTAXICAB':                            'ACS Workers 16+: Taxicab',           
 'commute.ACSMCYCLE':                             'ACS Workers 16+: Motorcycle',
 'commute.ACSBICYCLE':                            'ACS Workers 16+: Bicycle',                             
 'commute.ACSWALKED':                             'ACS Workers 16+: Walked',
 'commute.ACSOTHTRAN':                            'ACS Workers 16+: Other Means',
 'commute.ACSWRKHOME':                            'ACS Wrkrs 16+: Worked at Home',
 'OwnerRenter.OWNER_CY':                          '2019 Owner Occupied HUs', 
 'OwnerRenter.RENTER_CY':                         '2019 Renter Occupied HUs', 
 'vacant.VACANT_CY':                              '2019 Vacant Housing Units', 
 'homevalue.MEDVAL_CY':                           '2019 Median Home Value',
 'housingunittotals.TOTHU_CY':                    '2019 Total Housing Units',
 'yearbuilt.ACSMEDYBLT':                          'ACS Median Year Structure Built: HUs',
 'SpendingTotal.X1001_X':                         '2019 Annual Budget Exp',
 'transportation.X6001_X':                        '2019 Transportation',
 'households.ACSTOTHH':                           'ACS Total Households',
 'DaytimePopulation.DPOPWRK_CY':                  '2019 Daytime Pop: Workers',
 'DaytimePopulation.DPOPRES_CY':                  '2019 Daytime Pop: Residents',
 'DaytimePopulation.DPOPDENSCY':                  '2019 Daytime Pop Density',
 'occupation.OCCPROT_CY':                         '2019 Occupation: Protective Service',
 'occupation.OCCFOOD_CY':                         '2019 Occupation: Food Preperation',
 'occupation.OCCPERS_CY':                         '2019 Occupation: Personal Care',
 'occupation.OCCADMN_CY':                         '2019 Occupation: Office/Admin',
 'occupation.OCCCONS_CY':                         '2019 Occupation: Construction/Extraction',
 'occupation.OCCPROD_CY':                         '2019 Occupation: Production'
                  }

# Enrichment operation using ArcGIS API for Python 
enrichment_variables_df = pd.DataFrame.from_dict(enrichment_variables, orient='index',columns=['Variable Definition'])
enrichment_variables_df.reset_index(level=0, inplace=True)
enrichment_variables_df.columns = ['AnalysisVariable','Variable Definition']
enrichment_variables_df.head()

	AnalysisVariable	Variable Definition
0	classofworker.ACSCIVEMP	Employed Population Age 16+
1	classofworker.ACSMCIVEMP	Employed Male Pop Age 16+
2	classofworker.ACSMPRIVNP	Male 16+Priv Nonprofit
3	classofworker.ACSMEPRIVP	Male 16+:Priv Profit Empl
4	classofworker.ACSMSELFI	Male 16+:Priv Profit Self Empl

# Convertng the variables names to list for passing them to the enrichment tool
variable_names = enrichment_variables_df['AnalysisVariable'].tolist()

# checking the firt few values of the list
variable_names[1:5]

['classofworker.ACSMCIVEMP',
 'classofworker.ACSMPRIVNP',
 'classofworker.ACSMEPRIVP',
 'classofworker.ACSMSELFI']

# Data Enriching operation
airbnb_count_by_tract_enriched = enrich_layer(airbnb_count_by_tract,
                                              analysis_variables = variable_names,
                                              output_name='airbnb_tract_enrich1'+ str(dt.now().microsecond))

{"messageCode": "AO_100047", "message": "Enrichment may not be available for some features."}
{"messageCode": "AO_100288", "message": "Unable to detect the country for one or more features."}
{"messageCode": "AO_100047", "message": "Enrichment may not be available for some features."}
{"messageCode": "AO_100000", "message": "Variables [commute.ACSRAILRD] are not defined for country 'US'."}
{"messageCode": "AO_100000", "message": "Variables [households.ACSFAMHH, households.ACSTOTHH] are not defined for country 'US'."}
{"cost": -1}

# Extracting the resulting enriched dataframe after the geoenrichment method
sdf_airbnb_count_by_tract_enriched = airbnb_count_by_tract_enriched.layers[0].query().sdf

# Visualizing the data as a pandas dataframe
print(sdf_airbnb_count_by_tract_enriched.columns)
sdf_airbnb_count_by_tract_enriched_sorted = sdf_airbnb_count_by_tract_enriched.sort_values('geoid')
sdf_airbnb_count_by_tract_enriched_sorted.head()

Index(['OBJECTID', 'statefp', 'countyfp', 'tractce', 'geoid', 'name',
       'namelsad', 'mtfcc', 'funcstat', 'aland',
       ...
       'DPOPWRK_CY', 'DPOPRES_CY', 'DPOPDENSCY', 'OCCPROT_CY', 'OCCFOOD_CY',
       'OCCPERS_CY', 'OCCADMN_CY', 'OCCCONS_CY', 'OCCPROD_CY', 'SHAPE'],
      dtype='object', length=106)

	OBJECTID	statefp	countyfp	tractce	geoid	name	namelsad	mtfcc	funcstat	aland	awater	intptlat	intptlon	Point_Count	AnalysisArea	ID	sourceCountry	ENRICH_FID	aggregationMethod	populationToPolygonSizeRating	apportionmentConfidence	HasData	ACSCIVEMP	ACSMCIVEMP	ACSMPRIVNP	ACSMEPRIVP	ACSMSELFI	ACSMSTGOV	ACSMFEDGOV	ACSMSELFNI	ACSMUNPDFM	ACSFCIVEMP	ACSFEPRIVP	ACSFSELFI	ACSFPRIVNP	ACSFLOCGOV	ACSFSTGOV	ACSFFEDGOV	ACSFSELFNI	MEDAGE_CY	GENALPHACY	GENZ_CY	MILLENN_CY	GENX_CY	BABYBOOMCY	OLDRGENSCY	GENBASE_CY	POPDENS_CY	DPOP_CY	WHITE_CY	BLACK_CY	AMERIND_CY	ASIAN_CY	PACIFIC_CY	OTHRACE_CY	DIVINDX_CY	ACSHHBPOV	ACSHHAPOV	S01_BUS	N05_BUS	N08_BUS	N21_BUS	MP09147a_B	MP09148a_B	MP19001a_B	MP19070a_I	MP19071a_B	TOTPOP_CY	MALES_CY	FEMALES_CY	EMP_CY	UNEMP_CY	UNEMPRT_CY	ACSWORKERS	ACSDRALONE	ACSCARPOOL	ACSPUBTRAN	ACSBUS	ACSSTRTCAR	ACSSUBWAY	ACSFERRY	ACSTAXICAB	ACSBICYCLE	ACSWALKED	ACSOTHTRAN	ACSWRKHOME	OWNER_CY	RENTER_CY	VACANT_CY	MEDVAL_CY	TOTHU_CY	ACSMEDYBLT	X1001_X	X6001_X	DPOPWRK_CY	DPOPRES_CY	DPOPDENSCY	OCCPROT_CY	OCCFOOD_CY	OCCPERS_CY	OCCADMN_CY	OCCCONS_CY	OCCPROD_CY	SHAPE
2095	2096	36	005	000100	36005000100	1	Census Tract 1	G5020	S	1579361	1125765	+40.7934921	-073.8835318	0	2.705062	45	US	2096	BlockApportionment:US.BlockGroups;PointsLayer:...	2.191	2.576	1	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	34.5	0.0	698.0	1997.0	921.0	127.0	5.0	3748.0	3588.6	663.0	629.0	2330.0	17.0	67.0	1.0	672.0	75.2	0.0	0.0	24.0	3.0	4.0	1.0	0.0	0.0	0.0	0.0	0.0	3748.0	3407.0	341.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	1.0	0.0	0.0	0.0	663.0	0.0	634.8	0.0	0.0	0.0	0.0	0.0	0.0	{"rings": [[[-8226256.9418, 4982172.581], [-82...
2059	2060	36	005	000200	36005000200	2	Census Tract 2	G5020	S	455322	926899	+40.8045733	-073.8568585	0	1.382228	9	US	2060	BlockApportionment:US.BlockGroups;PointsLayer:...	2.191	2.576	1	1664.0	1068.0	51.0	709.0	0.0	43.0	0.0	38.0	0.0	596.0	316.0	15.0	78.0	115.0	60.0	12.0	0.0	35.8	430.0	1218.0	1245.0	913.0	823.0	168.0	4797.0	8988.5	3210.0	438.0	1195.0	69.0	217.0	7.0	2150.0	83.6	278.0	1114.0	42.0	2.0	2.0	5.0	2177.0	437.0	3469.0	86.0	123.0	4797.0	2249.0	2548.0	1834.0	192.0	9.5	1664.0	766.0	13.0	619.0	235.0	0.0	362.0	22.0	39.0	0.0	200.0	0.0	27.0	767.0	766.0	78.0	558333.0	1611.0	1957.0	132707872.0	13576654.0	196.0	3014.0	6014.9	48.0	106.0	15.0	162.0	33.0	130.0	{"rings": [[[-8222638.612, 4985024.3226], [-82...
2067	2068	36	005	000400	36005000400	4	Census Tract 4	G5020	S	912392	602945	+40.8089152	-073.8504884	15	1.515336	17	US	2068	BlockApportionment:US.BlockGroups;PointsLayer:...	2.191	2.576	1	3128.0	1675.0	253.0	932.0	21.0	38.0	104.0	75.0	0.0	1453.0	804.0	37.0	276.0	281.0	16.0	39.0	0.0	35.7	506.0	1589.0	1589.0	1226.0	945.0	166.0	6021.0	10291.0	3758.0	641.0	1803.0	60.0	240.0	6.0	2278.0	85.4	107.0	2092.0	57.0	3.0	9.0	3.0	2754.0	730.0	4455.0	103.0	172.0	6021.0	2937.0	3084.0	2807.0	120.0	4.1	3082.0	1628.0	61.0	1077.0	250.0	0.0	776.0	33.0	0.0	0.0	130.0	64.0	122.0	1548.0	537.0	108.0	552034.0	2193.0	2001.0	197321413.0	20238565.0	471.0	3287.0	6423.1	239.0	83.0	96.0	429.0	99.0	42.0	{"rings": [[[-8222012.885, 4985135.2266], [-82...
1668	1669	36	005	001600	36005001600	16	Census Tract 16	G5020	S	485079	0	+40.8188478	-073.8580764	1	0.485076	18	US	1669	BlockApportionment:US.BlockGroups;PointsLayer:...	2.191	2.576	1	2513.0	959.0	31.0	620.0	29.0	0.0	18.0	20.0	0.0	1554.0	927.0	7.0	159.0	296.0	90.0	51.0	24.0	36.6	477.0	1384.0	1313.0	996.0	997.0	350.0	5517.0	29457.3	5334.0	497.0	2127.0	70.0	123.0	3.0	1900.0	84.6	526.0	1661.0	57.0	0.0	7.0	2.0	2392.0	491.0	3950.0	80.0	114.0	5517.0	2394.0	3123.0	2087.0	183.0	8.1	2422.0	880.0	73.0	1154.0	604.0	0.0	538.0	12.0	19.0	12.0	110.0	29.0	145.0	295.0	1712.0	103.0	650463.0	2110.0	1973.0	117903160.0	11830562.0	1831.0	3503.0	28480.2	101.0	104.0	35.0	299.0	3.0	21.0	{"rings": [[[-8222181.7567, 4986069.1354], [-8...
2127	2128	36	005	001900	36005001900	19	Census Tract 19	G5020	S	1643654	1139660	+40.8009990	-073.9093729	24	2.783331	27	US	2128	BlockApportionment:US.BlockGroups;PointsLayer:...	2.191	2.576	1	1790.0	887.0	144.0	580.0	46.0	0.0	7.0	41.0	17.0	903.0	585.0	15.0	133.0	64.0	17.0	0.0	89.0	36.1	362.0	946.0	1286.0	885.0	553.0	134.0	4166.0	3876.6	8595.0	637.0	1656.0	62.0	67.0	5.0	1236.0	85.8	323.0	938.0	409.0	75.0	45.0	26.0	1836.0	347.0	3093.0	80.0	94.0	4166.0	2182.0	1984.0	2152.0	114.0	5.0	1751.0	279.0	89.0	1054.0	142.0	14.0	898.0	0.0	16.0	21.0	75.0	35.0	182.0	118.0	1486.0	124.0	381395.0	1728.0	1964.0	118882651.0	12251786.0	6516.0	2079.0	7998.0	32.0	127.0	162.0	177.0	58.0	64.0	{"rings": [[[-8230028.8927, 4984061.5402], [-8...

The field name of the enriched dataframe are code words which needs to be elaborated. Hence these are renamed with their actual definition from the variable definition of the list that was first created during selection of the variables.

enrichment_variables_df.head()

	AnalysisVariable	Variable Definition
0	classofworker.ACSCIVEMP	Employed Population Age 16+
1	classofworker.ACSMCIVEMP	Employed Male Pop Age 16+
2	classofworker.ACSMPRIVNP	Male 16+Priv Nonprofit
3	classofworker.ACSMEPRIVP	Male 16+:Priv Profit Empl
4	classofworker.ACSMSELFI	Male 16+:Priv Profit Self Empl

enrichment_variables_copy = enrichment_variables_df.copy()
enrichment_variables_copy.head(2)

	AnalysisVariable	Variable Definition
0	classofworker.ACSCIVEMP	Employed Population Age 16+
1	classofworker.ACSMCIVEMP	Employed Male Pop Age 16+

enrichment_variables_copy['AnalysisVariable'] = enrichment_variables_copy.AnalysisVariable.str.split(pat='.', expand=True)[1]
enrichment_variables_copy

	AnalysisVariable	Variable Definition
0	ACSCIVEMP	Employed Population Age 16+
1	ACSMCIVEMP	Employed Male Pop Age 16+
2	ACSMPRIVNP	Male 16+Priv Nonprofit
3	ACSMEPRIVP	Male 16+:Priv Profit Empl
4	ACSMSELFI	Male 16+:Priv Profit Self Empl
...	...	...
81	OCCFOOD_CY	2019 Occupation: Food Preperation
82	OCCPERS_CY	2019 Occupation: Personal Care
83	OCCADMN_CY	2019 Occupation: Office/Admin
84	OCCCONS_CY	2019 Occupation: Construction/Extraction
85	OCCPROD_CY	2019 Occupation: Production

86 rows × 2 columns

enrichment_variables_copy.set_index("AnalysisVariable", drop=True, inplace=True)
dictionary = enrichment_variables_copy.to_dict()
new_columns = dictionary['Variable Definition']

# Field renamed and new dataframe visualized
pd.set_option('display.max_columns', 150)
sdf_airbnb_count_by_tract_enriched_sorted.rename(columns=new_columns, inplace=True)
sdf_airbnb_count_by_tract_enriched_sorted.head()

	OBJECTID	statefp	countyfp	tractce	geoid	name	namelsad	mtfcc	funcstat	aland	awater	intptlat	intptlon	Point_Count	AnalysisArea	ID	sourceCountry	ENRICH_FID	aggregationMethod	populationToPolygonSizeRating	apportionmentConfidence	HasData	Employed Population Age 16+	Employed Male Pop Age 16+	Male 16+Priv Nonprofit	Male 16+:Priv Profit Empl	Male 16+:Priv Profit Self Empl	Male 16+:State Govt Wrkr	Male 16+:Fed Govt Wrkr	Male 16+:Self-Emp Not Inc	Male 16+:Unpaid Family Wrkr	Female Pop Age 16+	Female 16+:Priv Profit Empl	Female 16+:Priv Profit Self Empl	Female 16+:Priv Nonprofit	Female 16+:Local Govt Wrkr	Female 16+:State Govt Wrkr	Female 16+:Fed Govt Wrkr	Female 16+:Self-Emp Not Inc	2019 Median Age	2019 Generation Alpha Population	2019 Generation Z Population	2019 Millennial Population	2019 Generation X Population	2019 Baby Boomer Population	2019 Silent & Greatest Generations Population	2019 Population by Generation Base	2019 Population Density	2019 Total Daytime Population	2019 White Population	2019 Black Population	2019 American Indian Population	2019 Asian Population	2019 Pacific Islander Population	2019 Other Race Population	2019 Diversity Index	HHs: Inc Below Poverty Level	HHs:Inc at/Above Poverty Level	Total Businesses (SIC)	Construction Businesses (NAICS)	Retail Trade Businesses (NAICS)	Transportation/Warehouse Bus (NAICS)	Own any tablet	Own any e-reader	Have access to Internet at home	Index: Spend 0.5-0.9 hrs online(excl email/IM .	Spend <0.5 hrs online (excl email/IM time) daily	2019 Total Population	2019 Male Population	2019 Female Population	2019 Employed Civilian Pop 16+	2019 Unemployed Population 16+	2019 Unemployment Rate	ACS Workers Age 16+	ACS Workers 16+: Drove Alone	ACS Workers 16+: Carpooled	ACS Workers 16+: Public Transportation	ACS Workers 16+: Bus	ACS Workers 16+: Streetcar	ACS Workers 16+: Subway	ACS Workers 16+: Ferryboat	ACS Workers 16+: Taxicab	ACS Workers 16+: Bicycle	ACS Workers 16+: Walked	ACS Workers 16+: Other Means	ACS Wrkrs 16+: Worked at Home	2019 Owner Occupied HUs	2019 Renter Occupied HUs	2019 Vacant Housing Units	2019 Median Home Value	2019 Total Housing Units	ACS Median Year Structure Built: HUs	2019 Annual Budget Exp	2019 Transportation	2019 Daytime Pop: Workers	2019 Daytime Pop: Residents	2019 Daytime Pop Density	2019 Occupation: Protective Service	2019 Occupation: Food Preperation	2019 Occupation: Personal Care	2019 Occupation: Office/Admin	2019 Occupation: Construction/Extraction	2019 Occupation: Production	SHAPE
2095	2096	36	005	000100	36005000100	1	Census Tract 1	G5020	S	1579361	1125765	+40.7934921	-073.8835318	0	2.705062	45	US	2096	BlockApportionment:US.BlockGroups;PointsLayer:...	2.191	2.576	1	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	34.5	0.0	698.0	1997.0	921.0	127.0	5.0	3748.0	3588.6	663.0	629.0	2330.0	17.0	67.0	1.0	672.0	75.2	0.0	0.0	24.0	3.0	4.0	1.0	0.0	0.0	0.0	0.0	0.0	3748.0	3407.0	341.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	1.0	0.0	0.0	0.0	663.0	0.0	634.8	0.0	0.0	0.0	0.0	0.0	0.0	{"rings": [[[-8226256.9418, 4982172.581], [-82...
2059	2060	36	005	000200	36005000200	2	Census Tract 2	G5020	S	455322	926899	+40.8045733	-073.8568585	0	1.382228	9	US	2060	BlockApportionment:US.BlockGroups;PointsLayer:...	2.191	2.576	1	1664.0	1068.0	51.0	709.0	0.0	43.0	0.0	38.0	0.0	596.0	316.0	15.0	78.0	115.0	60.0	12.0	0.0	35.8	430.0	1218.0	1245.0	913.0	823.0	168.0	4797.0	8988.5	3210.0	438.0	1195.0	69.0	217.0	7.0	2150.0	83.6	278.0	1114.0	42.0	2.0	2.0	5.0	2177.0	437.0	3469.0	86.0	123.0	4797.0	2249.0	2548.0	1834.0	192.0	9.5	1664.0	766.0	13.0	619.0	235.0	0.0	362.0	22.0	39.0	0.0	200.0	0.0	27.0	767.0	766.0	78.0	558333.0	1611.0	1957.0	132707872.0	13576654.0	196.0	3014.0	6014.9	48.0	106.0	15.0	162.0	33.0	130.0	{"rings": [[[-8222638.612, 4985024.3226], [-82...
2067	2068	36	005	000400	36005000400	4	Census Tract 4	G5020	S	912392	602945	+40.8089152	-073.8504884	15	1.515336	17	US	2068	BlockApportionment:US.BlockGroups;PointsLayer:...	2.191	2.576	1	3128.0	1675.0	253.0	932.0	21.0	38.0	104.0	75.0	0.0	1453.0	804.0	37.0	276.0	281.0	16.0	39.0	0.0	35.7	506.0	1589.0	1589.0	1226.0	945.0	166.0	6021.0	10291.0	3758.0	641.0	1803.0	60.0	240.0	6.0	2278.0	85.4	107.0	2092.0	57.0	3.0	9.0	3.0	2754.0	730.0	4455.0	103.0	172.0	6021.0	2937.0	3084.0	2807.0	120.0	4.1	3082.0	1628.0	61.0	1077.0	250.0	0.0	776.0	33.0	0.0	0.0	130.0	64.0	122.0	1548.0	537.0	108.0	552034.0	2193.0	2001.0	197321413.0	20238565.0	471.0	3287.0	6423.1	239.0	83.0	96.0	429.0	99.0	42.0	{"rings": [[[-8222012.885, 4985135.2266], [-82...
1668	1669	36	005	001600	36005001600	16	Census Tract 16	G5020	S	485079	0	+40.8188478	-073.8580764	1	0.485076	18	US	1669	BlockApportionment:US.BlockGroups;PointsLayer:...	2.191	2.576	1	2513.0	959.0	31.0	620.0	29.0	0.0	18.0	20.0	0.0	1554.0	927.0	7.0	159.0	296.0	90.0	51.0	24.0	36.6	477.0	1384.0	1313.0	996.0	997.0	350.0	5517.0	29457.3	5334.0	497.0	2127.0	70.0	123.0	3.0	1900.0	84.6	526.0	1661.0	57.0	0.0	7.0	2.0	2392.0	491.0	3950.0	80.0	114.0	5517.0	2394.0	3123.0	2087.0	183.0	8.1	2422.0	880.0	73.0	1154.0	604.0	0.0	538.0	12.0	19.0	12.0	110.0	29.0	145.0	295.0	1712.0	103.0	650463.0	2110.0	1973.0	117903160.0	11830562.0	1831.0	3503.0	28480.2	101.0	104.0	35.0	299.0	3.0	21.0	{"rings": [[[-8222181.7567, 4986069.1354], [-8...
2127	2128	36	005	001900	36005001900	19	Census Tract 19	G5020	S	1643654	1139660	+40.8009990	-073.9093729	24	2.783331	27	US	2128	BlockApportionment:US.BlockGroups;PointsLayer:...	2.191	2.576	1	1790.0	887.0	144.0	580.0	46.0	0.0	7.0	41.0	17.0	903.0	585.0	15.0	133.0	64.0	17.0	0.0	89.0	36.1	362.0	946.0	1286.0	885.0	553.0	134.0	4166.0	3876.6	8595.0	637.0	1656.0	62.0	67.0	5.0	1236.0	85.8	323.0	938.0	409.0	75.0	45.0	26.0	1836.0	347.0	3093.0	80.0	94.0	4166.0	2182.0	1984.0	2152.0	114.0	5.0	1751.0	279.0	89.0	1054.0	142.0	14.0	898.0	0.0	16.0	21.0	75.0	35.0	182.0	118.0	1486.0	124.0	381395.0	1728.0	1964.0	118882651.0	12251786.0	6516.0	2079.0	7998.0	32.0	127.0	162.0	177.0	58.0	64.0	{"rings": [[[-8230028.8927, 4984061.5402], [-8...

The renamed data frame above is now self explanatory hence more interpretable.

Estimating distances of tracts from various city features

The next set of feature data set will be the distances of each of the tract from various city features. These distance variables accomplishes two important tasks.

First is that they include the spatial components of the Airbnb development phenomenon into the model.

Secondly each Airbnb properties are impacted by unique locational factors. This is reflected from the Airbnb reviews where the most highly rated in demand Airbnb property are located in neighbourhood with good transit accessibility. Hence these are accounted into the model by including the distances of different public transit options from the tracts.

The hypothesis formed here is that tracts located near transit hubs which could be subway station, bus stops, railroad lines, subway routes etc., might attract more Airbnb property. Similarly the central business district which for New York is located at lower Manhattan might also influence Airbnb properties, since this is the city's main business hub. In the following these various distances are estimated using ArcGIS API for Python proximity method.

gis.content.search('NYCBusStop owner:api_data_owner')

[]

busi_distr = gis.content.search('BusinessDistricts owner:api_data_owner', 'feature layer')[0]
cbd = gis.content.search('NYCBD owner:api_data_owner', 'feature layer')[0]
bus_stop = gis.content.search('NYCBusStop owner:api_data_owner', 'feature layer')[0]
hotels = gis.content.search('NYCHotels owner:api_data_owner', 'feature layer')[0]
railroad = gis.content.search('NYCRailroad owner:api_data_owner', 'feature layer')[0]
subwy_rt = gis.content.search('NYCSubwayRoutes owner:api_data_owner', 'feature layer')[0]
subwy_stn = gis.content.search('NYCSubwayStation owner:api_data_owner', 'feature layer')[0]

# accessing the various city feature shapefile from arcgis portal
busi_distr, cbd, bus_stop, hotels, railroad, subwy_rt, subwy_stn

(<Item title:"tract_busi_distrs_dist874392" type:Feature Layer Collection owner:api_data_owner>,
 <Item title:"NYCBD" type:Feature Layer Collection owner:api_data_owner>,
 <Item title:"ny_tract_bus_stop_dist452536" type:Feature Layer Collection owner:api_data_owner>,
 <Item title:"ny_tract_hotel_dist128626" type:Feature Layer Collection owner:api_data_owner>,
 <Item title:"tract_railroad_dist613260" type:Feature Layer Collection owner:api_data_owner>,
 <Item title:"ny_tract_subway_routes_dist709262" type:Feature Layer Collection owner:api_data_owner>,
 <Item title:"ny_tract_subway_station_dist691504" type:Feature Layer Collection owner:api_data_owner>)

bus_stop_lyr = bus_stop.layers[0]
cbd_lyr = cbd.layers[0] 
hotels_lyr = hotels.layers[0] 
subwy_stn_lyr =subwy_stn.layers[0]
subwy_rt_lyr = subwy_rt.layers[0] 
railroad_lyr = railroad.layers[0]
busi_distrs_lyr = busi_distr.layers[0]

# Avoid warning for chain operation
pd.set_option('mode.chained_assignment', None) 

# Estimating Tract to hotel distances
tract_hotel_dist = use_proximity.find_nearest(nyc_tracts_layer,
                                              hotels_lyr,
                                              measurement_type='StraightLine',
                                              max_count=1,
                                              output_name='ny_tract_hotel_dist1' + str(dt.now().microsecond))

{"cost": 2.555}

tract_hotel_dist.layers

[<FeatureLayer url:"https://services7.arcgis.com/JEwYeAy2cc8qOe3o/arcgis/rest/services/ny_tract_hotel_dist1396959/FeatureServer/0">,
 <FeatureLayer url:"https://services7.arcgis.com/JEwYeAy2cc8qOe3o/arcgis/rest/services/ny_tract_hotel_dist1396959/FeatureServer/1">]

tract_hotel_dist_lyr = tract_hotel_dist.layers[1]
sdf_tract_hotel_dist_lyr = pd.DataFrame.spatial.from_layer(tract_hotel_dist_lyr)
sdf_tract_hotel_dist_lyr.head()

	From_ID	From_Name	From_aland	From_countyfp	From_funcstat	From_geoid	From_intptlat	From_intptlon	From_mtfcc	From_namelsad	From_statefp	From_tractce	NearRank	OBJECTID	SHAPE	To_BLDG_DESC	To_BLDG_STYLE	To_CALC_ACRES	To_COUNTY	To_CT_NAME	To_CT_SWIS	To_DEPTH	To_FRONT	To_GRID_EAST	To_GRID_NORTH	To_ID	To_LAND_AV	To_LOC_STREET	To_LOC_ST_NBR	To_LOC_ZIP	To_MUNI_NAME	To_ORIG_FID	To_OWNER_TYPE	To_PARCELADDR	To_PRMY_OWNER	To_PROP_CLASS	To_ROLL_YR	To_SBL	To_SCH_CODE	To_SPATIAL_YR	To_SQFT_LIV	To_SQ_FT	To_SWIS	To_SWISSBLID	To_Shape__Area	To_Shape__Length	To_TOTAL_AV	To_YR_BLT	Total_Miles
0	1332	227.03	41502	005	S	36005022703	+40.8440198	-073.9104999	G5020	Census Tract 227.03	36	022703	1	1	{"paths": [[[-8227552.1448, 4989516.0602], [-8...	Miscellaneous	H9	0.162924	Bronx	Bronx	600100	159	42	1011886.0	248352.0	237	202500.0	WEBSTER AVENUE	1930	10457	Bronx	30	8	1930 WEBSTER AVENUE	WEBSTER TREMONT EQUIT	05	2017	2030270010	09	2017	19110.0	6753.0	600100	6001002030270010	1154.0625	166.305662	1531800.0	1931	0.519338
1	501	207.01	47273	061	S	36061020701	+40.8089775	-073.9584600	G5020	Census Tract 207.01	36	020701	1	2	None	Dormitories	H8	0.237481	New York	Manhattan	620100	100	101	995721.0	234272.0	231	389700.0	AMSTERDAM AVENUE	1235	10027	Manhattan	701	8	1235 AMSTERDAM AVENUE	BARNARD COLLEGE	08	2017	1019630030	03	2017	79036.0	10092.0	620100	6201001019630030	1680.277344	163.99296	6787350.0	1968	0.0
2	469	174.02	50730	061	S	36061017402	+40.7968026	-073.9471624	G5020	Census Tract 174.02	36	017402	1	3	None	Dormitories	H8	0.338975	New York	Manhattan	620100	101	142	998966.0	229316.0	273	1291050.0	EAST 110 STREET	55	10029	Manhattan	695	8	55 EAST 110 STREET	EDWIN GOULD RESIDENCE	08	2017	1016160024	04	2017	37570.0	14347.0	620100	6201001016160024	2397.410156	198.095165	3051450.0	2004	0.0
3	454	160.02	51422	061	S	36061016002	+40.7878787	-073.9536853	G5020	Census Tract 160.02	36	016002	1	4	None	Dormitories	H8	0.742213	New York	Manhattan	620100	101	305	997421.0	226355.0	366	6925950.0	EAST 98 STREET	50	10029	Manhattan	689	8	50 EAST 98 STREET	MSMC RESIDENTIAL REAL	08	2017	1016030039	02	2017	240000.0	30781.0	620100	6201001016030039	5248.046875	332.868992	8974350.0	1984	0.0
4	440	150.01	51643	061	S	36061015001	+40.7801987	-073.9592834	G5020	Census Tract 150.01	36	015001	1	5	{"paths": [[[-8232878.2271, 4979973.2013], [-8...	Transient Occupancy - Midtown Manhattan Area	H3	0.036426	New York	Manhattan	620100	101	26	996841.0	223341.0	29	1395000.0	EAST 87 STREET	164	10128	Manhattan	687	8	164 EAST 87 STREET	164 EAST 87TH ST LLC	05	2017	1015150045	02	2017	18300.0	2571.0	620100	6201001015150045	257.5	94.764453	4627350.0	1930	0.139874

In the above dataframe the Total_Miles field returns the distances of the tract from hotels in miles. Hence this field is converted into feet and retained. This is then repeated for each of the other distance estimation.

# Final hotel Distances in feet — Here in each row column "hotel_dist" returns the distance of the nearest hotel from that tract indicated by its geoids.
# For example in the first row the tract with ID 36005000100 has a nearest hotel at 5571.75 feet away from it. 
sdf_tract_hotel_dist_lyr_new = sdf_tract_hotel_dist_lyr[['From_geoid', 'Total_Miles']]

# 1 mile = 5280 feet
sdf_tract_hotel_dist_lyr_new['hotel_dist'] = round(sdf_tract_hotel_dist_lyr_new['Total_Miles'] * 5280, 2)
sdf_tract_hotel_dist_lyr_new.sort_values('From_geoid').head()

	From_geoid	Total_Miles	hotel_dist
2095	36005000100	1.055256	5571.75
2059	36005000200	1.039099	5486.44
2067	36005000400	0.472664	2495.67
1668	36005001600	0.585977	3093.96
2127	36005001900	0.0	0.0

# Estimating Busstop distances from tracts
tract_bustop_dist = use_proximity.find_nearest(nyc_tracts_layer,
                                               bus_stop_lyr,
                                               measurement_type='StraightLine',
                                               max_count=1,
                                               output_name='ny_tract_bus_stop_dist'+ str(dt.now().microsecond))
tract_bustop_dist_lyr = tract_bustop_dist.layers[1]
sdf_tract_bustop_dist_lyr =tract_bustop_dist_lyr.query().sdf

{"cost": 3.846}

sdf_tract_bustop_dist_lyr

	OBJECTID	From_ID	To_ID	NearRank	From_statefp	From_countyfp	From_tractce	From_geoid	From_Name	From_namelsad	From_mtfcc	From_funcstat	From_aland	From_awater	From_intptlat	From_intptlon	To_CounDist	To_BoroCD	To_AssemDist	To_the_geom	To_CongDist	To_StSenDist	To_SHELTER_ID	To_LOCATION	To_AT_BETWEEN	To_LONGITUDE	To_LATITUDE	To_AssetID	To_BoroCode	To_BoroName	To_Street	To_SegmentID	To_PhysicalID	To_NODEID	To_ORIG_FID	Total_Miles	SHAPE
0	1	1332	1049	1	36	005	022703	36005022703	227.03	Census Tract 227.03	G5020	S	41502	0	+40.8440198	-073.9104999	16	204	77	POINT (-73.91154799999998 40.84352500000006)	15	33	BX0293	Grand Concourse	MT EDEN AV EAST	-73.911548	40.843525	966	2	Bronx	GRAND CONCOURSE	190988	166310	0	1969	0.0	None
1	2	501	35	1	36	061	020701	36061020701	207.01	Census Tract 207.01	G5020	S	47273	0	+40.8089775	-073.9584600	7	109	70	POINT (-73.95978799999995 40.80835700000006)	10	30	MN0273	AMSTERDAM AV	W 118 ST	-73.959788	40.808357	2049	1	Manhattan	AMSTERDAM AVENUE	38361	3837	0	49	0.0	None
2	3	469	413	1	36	061	017402	36061017402	174.02	Census Tract 174.02	G5020	S	50730	0	+40.7968026	-073.9471624	8	111	68	POINT (-73.94691999999998 40.796970000000044)	13	30	MN01334	MADISON AV	E 111 ST	-73.94692	40.79697	1783	1	Manhattan	MADISON AVENUE	37970	3538	0	784	0.0	None
3	4	454	108	1	36	061	016002	36061016002	160.02	Census Tract 160.02	G5020	S	51422	0	+40.7878787	-073.9536853	4	111	68	POINT (-73.95376199999998 40.787168000000065)	12	29	MN0868	E 96 ST	MADISON AV	-73.953762	40.787168	2201	1	Manhattan	EAST 96 STREET	294131	172107	9045175	201	0.0	None
4	5	440	398	1	36	061	015001	36061015001	150.01	Census Tract 150.01	G5020	S	51643	0	+40.7801987	-073.9592834	5	108	76	POINT (-73.95346799999999 40.779304000000025)	12	28	MN01271	3 AV	E 87 ST	-73.953468	40.779304	1739	1	Manhattan	3 AVENUE	37669	3682	23647	740	0.204742	{"paths": [[[-8232878.2271, 4979973.2013], [-8...
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
2162	2163	2037	1653	1	36	081	071600	36081071600	716	Census Tract 716	G5020	S	18242414	1117802	+40.6476943	-073.7860589	31	413	31	POINT (-73.77444099999997 40.660487000000046)	5	10	QN04677	ROCKAWAY BLVD	147 AV	-73.774441	40.660487	2913	4	Queens	ROCKAWAY BOULEVARD	59422	8417	0	3341	0.035317	{"paths": [[[-8212568.4035, 4962325.3532], [-8...
2163	2164	287	973	1	36	047	990100	36047990100	9901	Census Tract 9901	G5020	S	0	17793513	+40.5649933	-074.0148865	43	311	46	POINT (-73.99949999999995 40.595282000000054)	11	23	BR0471	SHORE PKWY	BAY PKWY	-73.9995	40.595282	845	3	Brooklyn	SHORE PARKWAY	9009667	167800	0	1847	0.33537	{"paths": [[[-8238249.5277, 4952568.7722], [-8...
2164	2165	2145	238	1	36	081	107202	36081107202	1072.02	Census Tract 1072.02	G5020	S	7504745	16636750	+40.6252538	-073.8137646	32	414	23	POINT (-73.82648099999994 40.58301400000005)	5	15	QN04053	ROCKAWAY BEACH BLVD	BEACH 105 ST	-73.826481	40.583014	2390	4	Queens	ROCKAWAY BEACH BOULEVARD	147266	14870	0	390	0.332147	{"paths": [[[-8218474.4488, 4951720.8592], [-8...
2165	2166	691	1445	1	36	085	990100	36085990100	9901	Census Tract 9901	G5020	S	0	80255169	+40.5255512	-074.1085829	50	595	64	POINT (-74.07749299999995 40.57932000000005)	11	24	SI05038	FR CAPODANNO BLVD	SEAVIEW AV	-74.077493	40.57932	3280	5	Staten Island	FR CAPODANNO BOULEVARD	145481	44856	92770	2780	0.433019	{"paths": [[[-8245592.2067, 4949864.8834], [-8...
2166	2167	581	1620	1	36	081	990100	36081990100	9901	Census Tract 9901	G5020	S	0	122602887	+40.5401732	-073.8909698	32	414	23	POINT (-73.86499899999995 40.56884100000008)	5	15	QN04568	ROCKAWAY BEACH BLVD	BEACH 149 ST	-73.864999	40.568841	2820	4	Queens	ROCKAWAY BEACH BOULEVARD	114195	15135	0	3248	0.54502	{"paths": [[[-8222119.6803, 4947903.8282], [-8...

2167 rows × 37 columns

# Final Bustop Distances in feet — Here in each row column "busstop_dist" returns the distance of the nearest bus stop 
# from that tract indicated by its geoids 
sdf_tract_bustop_dist_lyr_new = sdf_tract_bustop_dist_lyr[['From_geoid', 'Total_Miles']]
sdf_tract_bustop_dist_lyr_new['busstop_dist'] = round(sdf_tract_bustop_dist_lyr_new['Total_Miles'] * 5280, 2)
sdf_tract_bustop_dist_lyr_new.sort_values('From_geoid').head()

	From_geoid	Total_Miles	busstop_dist
2095	36005000100	0.744344	3930.14
2059	36005000200	0.005983	31.59
2067	36005000400	0.0	0.0
1668	36005001600	0.0	0.0
2127	36005001900	0.0	0.0

# estimating number of bus stops per tract
num_bustops_tracts = summarize_data.aggregate_points(point_layer=bus_stop_lyr,
                                                   polygon_layer=nyc_tracts_layer,
                                                   output_name='bustops_by_tracts'+ str(dt.now().microsecond))

{"cost": 3.846}

num_bustops_tracts_lyr = num_bustops_tracts.layers[0]
sdf_num_bustops_tracts_lyr = pd.DataFrame.spatial.from_layer(num_bustops_tracts_lyr)
sdf_num_bustops_tracts_lyr.head()

	AnalysisArea	OBJECTID	Point_Count	SHAPE	aland	countyfp	funcstat	geoid	intptlat	intptlon	mtfcc	name	namelsad	statefp	tractce
0	0.016024	1	1	{"rings": [[[-8227813.3004, 4989345.3624], [-8...	41502	005	S	36005022703	+40.8440198	-073.9104999	G5020	227.03	Census Tract 227.03	36	022703
1	0.018252	2	1	{"rings": [[[-8233183.0202, 4984115.3687], [-8...	47273	061	S	36061020701	+40.8089775	-073.9584600	G5020	207.01	Census Tract 207.01	36	020701
2	0.019587	3	1	{"rings": [[[-8231989.6748, 4982433.1457], [-8...	50730	061	S	36061017402	+40.7968026	-073.9471624	G5020	174.02	Census Tract 174.02	36	017402
3	0.019854	4	1	{"rings": [[[-8232691.8783, 4981159.0609], [-8...	51422	061	S	36061016002	+40.7878787	-073.9536853	G5020	160.02	Census Tract 160.02	36	016002
4	0.019939	5	0	{"rings": [[[-8233292.0018, 4980071.8459], [-8...	51643	061	S	36061015001	+40.7801987	-073.9592834	G5020	150.01	Census Tract 150.01	36	015001

# Number of Bus stops per tract — Here in each row column "num_bustop" returns the number of bus stops inside respective tracts 
sdf_num_bustops_tracts_lyr_new = sdf_num_bustops_tracts_lyr[['geoid', 'Point_Count']] 
sdf_num_bustops_tracts_lyr_new = sdf_num_bustops_tracts_lyr_new.rename(columns={'Point_Count':'num_bustop'})
sdf_num_bustops_tracts_lyr_new.sort_values('geoid').head()

	geoid	num_bustop
2095	36005000100	0
2059	36005000200	0
2067	36005000400	1
1668	36005001600	1
2127	36005001900	1

# estimating tracts distances from CBD 
tract_cbd_dist=use_proximity.find_nearest(nyc_tracts_layer,
                                          cbd_lyr,
                                          measurement_type='StraightLine',
                                          max_count=1,
                                          output_name='ny_tract_cbd_dist'+ str(dt.now().microsecond))
tract_cbd_dist_lyr = tract_cbd_dist.layers[1]
sdf_tract_cbd_dist_lyr = tract_cbd_dist_lyr.query().sdf
sdf_tract_cbd_dist_lyr.head()

{"cost": 2.168}

	OBJECTID	From_ID	To_ID	NearRank	From_statefp	From_countyfp	From_tractce	From_geoid	From_Name	From_namelsad	From_mtfcc	From_funcstat	From_aland	From_intptlat	From_intptlon	To_bid	To_borough	To_date_creat	To_time_creat	To_date_modif	To_time_modif	To_objectid	To_shape_area	To_shape_len	Total_Miles	SHAPE
0	1	1332	1	1	36	005	022703	36005022703	227.03	Census Tract 227.03	G5020	S	41502	+40.8440198	-073.9104999	Bryant Park BID	Manhattan	2008-11-19	00:00:00.000	2016-10-31	00:00:00.000	58.0	1225783.27084	5837.603557	7.102363	{"paths": [[[-8227840.685, 4989242.9453], [-82...
1	2	501	1	1	36	061	020701	36061020701	207.01	Census Tract 207.01	G5020	S	47273	+40.8089775	-073.9584600	Bryant Park BID	Manhattan	2008-11-19	00:00:00.000	2016-10-31	00:00:00.000	58.0	1225783.27084	5837.603557	3.809966	{"paths": [[[-8233012.3673, 4983978.5918], [-8...
2	3	469	1	1	36	061	017402	36061017402	174.02	Census Tract 174.02	G5020	S	50730	+40.7968026	-073.9471624	Bryant Park BID	Manhattan	2008-11-19	00:00:00.000	2016-10-31	00:00:00.000	58.0	1225783.27084	5837.603557	3.363737	{"paths": [[[-8231927.236, 4982389.3817], [-82...
3	4	454	1	1	36	061	016002	36061016002	160.02	Census Tract 160.02	G5020	S	51422	+40.7878787	-073.9536853	Bryant Park BID	Manhattan	2008-11-19	00:00:00.000	2016-10-31	00:00:00.000	58.0	1225783.27084	5837.603557	2.658677	{"paths": [[[-8232719.7081, 4981110.3946], [-8...
4	5	440	1	1	36	061	015001	36061015001	150.01	Census Tract 150.01	G5020	S	51643	+40.7801987	-073.9592834	Bryant Park BID	Manhattan	2008-11-19	00:00:00.000	2016-10-31	00:00:00.000	58.0	1225783.27084	5837.603557	2.055165	{"paths": [[[-8233341.205, 4979982.316], [-823...

# Final CBD distances in feet — Here in each row the column "cbd_dst" returns the distance of the CBD from respective tracts
sdf_tract_cbd_dist_lyr_new = sdf_tract_cbd_dist_lyr[['From_geoid', 'Total_Miles']]
sdf_tract_cbd_dist_lyr_new['cbd_dist'] = round(sdf_tract_cbd_dist_lyr_new['Total_Miles'] * 5280, 2) 
sdf_tract_cbd_dist_lyr_new.sort_values('From_geoid').head()

	From_geoid	Total_Miles	cbd_dist
2095	36005000100	4.999247	26396.02
2059	36005000200	6.858514	36212.95
2067	36005000400	7.321927	38659.77
1668	36005001600	7.525535	39734.83
2127	36005001900	4.33359	22881.35

# Estimating NYCSubwayStation distances from tracts 
tract_subwy_stn_dist = use_proximity.find_nearest(nyc_tracts_layer,
                                                  subwy_stn_lyr,
                                                  measurement_type='StraightLine',
                                                  max_count=1,
                                                  output_name='ny_tract_subway_station_dist'+ str(dt.now().microsecond))
tract_subwy_stn_dist_lyr = tract_subwy_stn_dist.layers[1]
sdf_tract_subwy_stn_dist_lyr = pd.DataFrame.spatial.from_layer(tract_subwy_stn_dist_lyr)
sdf_tract_subwy_stn_dist_lyr.head()

{"cost": 2.599}

	From_ID	From_Name	From_aland	From_countyfp	From_funcstat	From_geoid	From_intptlat	From_intptlon	From_mtfcc	From_namelsad	From_statefp	From_tractce	NearRank	OBJECTID	SHAPE	To_ID	To_Name	To_ORIG_FID	To_line	To_notes	To_objectid	To_url	Total_Miles
0	1332	227.03	41502	005	S	36005022703	+40.8440198	-073.9104999	G5020	Census Tract 227.03	36	022703	1	1	{"paths": [[[-8227646.1872, 4989522.1588], [-8...	21	174th-175th Sts	21	B-D	B-rush hours, D-all times, skips rush hours AM...	21.0	http://web.mta.info/nyct/service/	0.054525
1	501	207.01	47273	061	S	36061020701	+40.8089775	-073.9584600	G5020	Census Tract 207.01	36	020701	1	2	{"paths": [[[-8233201.4992, 4984081.6891], [-8...	159	116th St - Columbia University	167	1	1-all times	167.0	http://web.mta.info/nyct/service/	0.211254
2	469	174.02	50730	061	S	36061017402	+40.7968026	-073.9471624	G5020	Census Tract 174.02	36	017402	1	3	{"paths": [[[-8231617.2328, 4982254.7552], [-8...	413	110th St	450	4-6-6 Express	4 nights, 6-all times, 6 Express-weekdays AM s...	450.0	http://web.mta.info/nyct/service/	0.09727
3	454	160.02	51422	061	S	36061016002	+40.7878787	-073.9536853	G5020	Census Tract 160.02	36	016002	1	4	{"paths": [[[-8232360.48, 4980909.7037], [-823...	33	96th St	33	4-6-6 Express	4 nights, 6-all times, 6 Express-weekdays AM s...	33.0	http://web.mta.info/nyct/service/	0.098659
4	440	150.01	51643	061	S	36061015001	+40.7801987	-073.9592834	G5020	Census Tract 150.01	36	015001	1	5	{"paths": [[[-8232878.9945, 4979971.8169], [-8...	414	86th St	451	4-5-6-6 Express	4,6-all times, 5-all times exc nights, 6 Expre...	451.0	http://web.mta.info/nyct/service/	0.09711

# Final Tract to NYC Subway Station distances in feet — Here in each row, column "subwy_stn_dist" returns the distance of
# the nearest subway station from that tract
sdf_tract_subwy_stn_dist_lyr_new = sdf_tract_subwy_stn_dist_lyr[['From_geoid', 'Total_Miles']]
sdf_tract_subwy_stn_dist_lyr_new['subwy_stn_dist'] = round(sdf_tract_subwy_stn_dist_lyr_new['Total_Miles'] * 5280, 2) 
sdf_tract_subwy_stn_dist_lyr_new.sort_values('From_geoid').head()

	From_geoid	Total_Miles	subwy_stn_dist
2095	36005000100	0.946226	4996.07
2059	36005000200	1.108173	5851.15
2067	36005000400	1.191505	6291.15
1668	36005001600	0.729661	3852.61
2127	36005001900	0.080063	422.73

# Estimating distances to NYCSubwayRoutes
tract_subwy_rt_dist=use_proximity.find_nearest(nyc_tracts_layer,
                                               subwy_rt_lyr,
                                               measurement_type='StraightLine',
                                               max_count=1,
                                               output_name='ny_tract_subway_routes_dist'+ str(dt.now().microsecond))
tract_subwy_rt_dist_lyr = tract_subwy_rt_dist.layers[1]
sdf_tract_subwy_rt_dist_lyr = tract_subwy_rt_dist_lyr.query().sdf
sdf_tract_subwy_rt_dist_lyr.head()

{"cost": 2.191}

	OBJECTID	From_ID	To_ID	NearRank	From_statefp	From_countyfp	From_tractce	From_geoid	From_Name	From_namelsad	From_mtfcc	From_funcstat	From_aland	From_intptlat	From_intptlon	To_route_id	To_route_shor	To_route_long	To_group_	To_Shape__Length	To_ORIG_FID	Total_Miles	SHAPE
0	1	1332	19	1	36	005	022703	36005022703	227.03	Census Tract 227.03	G5020	S	41502	+40.8440198	-073.9104999	B	B	6 Avenue Express	BDFM	51293.284426	12	0.0	None
1	2	501	9	1	36	061	020701	36061020701	207.01	Census Tract 207.01	G5020	S	47273	+40.8089775	-073.9584600	1	1	Broadway - 7 Avenue Local	123	31208.137831	3	0.169658	{"paths": [[[-8233150.5149, 4984174.9334], [-8...
2	3	469	10	1	36	061	017402	36061017402	174.02	Census Tract 174.02	G5020	S	50730	+40.7968026	-073.9471624	6	6	Lexington Avenue Express/Local	456	31863.545339	9	0.09692	{"paths": [[[-8231635.9013, 4982220.9583], [-8...
3	4	454	10	1	36	061	016002	36061016002	160.02	Census Tract 160.02	G5020	S	51422	+40.7878787	-073.9536853	6	6	Lexington Avenue Express/Local	456	31863.545339	9	0.096942	{"paths": [[[-8232305.5995, 4981008.652], [-82...
4	5	440	10	1	36	061	015001	36061015001	150.01	Census Tract 150.01	G5020	S	51643	+40.7801987	-073.9592834	6	6	Lexington Avenue Express/Local	456	31863.545339	9	0.096767	{"paths": [[[-8232878.2271, 4979973.2013], [-8...

# Final Tract to NYCSubwayRoutes distances in feet — Here in each row, column "subwy_rt_dist" returns the distance of
# the nearest subway route from that tract
sdf_tract_subwy_rt_dist_lyr_new = sdf_tract_subwy_rt_dist_lyr[['From_geoid', 'Total_Miles']]
sdf_tract_subwy_rt_dist_lyr_new['subwy_rt_dist'] = round(sdf_tract_subwy_rt_dist_lyr_new['Total_Miles'] * 5280, 2) 
sdf_tract_subwy_rt_dist_lyr_new.sort_values('From_geoid').head()

	From_geoid	Total_Miles	subwy_rt_dist
2095	36005000100	0.90531	4780.04
2059	36005000200	1.108726	5854.07
2067	36005000400	1.192022	6293.88
1668	36005001600	0.724321	3824.42
2127	36005001900	0.002853	15.06

# Estimating distances to NYCRailroad
tract_railroad_dist = use_proximity.find_nearest(nyc_tracts_layer,
                                           railroad_lyr,
                                           measurement_type='StraightLine',
                                           max_count=1,
                                           output_name='tract_railroad_dist'+ str(dt.now().microsecond))
tract_railroad_dist_lyr = tract_railroad_dist.layers[1]
sdf_tract_railroad_dist_lyr = pd.DataFrame.spatial.from_layer(tract_railroad_dist_lyr)
sdf_tract_railroad_dist_lyr.head()

{"cost": 2.168}

	From_ID	From_Name	From_aland	From_countyfp	From_funcstat	From_geoid	From_intptlat	From_intptlon	From_mtfcc	From_namelsad	From_statefp	From_tractce	NearRank	OBJECTID	SHAPE	To_ID	To_ORIG_FID	To_Shape__Length	Total_Miles
0	1332	227.03	41502	005	S	36005022703	+40.8440198	-073.9104999	G5020	Census Tract 227.03	36	022703	1	1	{"paths": [[[-8227770.665, 4989475.2983], [-82...	1	1	2194198.77038	0.140554
1	501	207.01	47273	061	S	36061020701	+40.8089775	-073.9584600	G5020	Census Tract 207.01	36	020701	1	2	{"paths": [[[-8232997.3392, 4984450.7008], [-8...	1	1	2194198.77038	0.166535
2	469	174.02	50730	061	S	36061017402	+40.7968026	-073.9471624	G5020	Census Tract 174.02	36	017402	1	3	None	1	1	2194198.77038	0.0
3	454	160.02	51422	061	S	36061016002	+40.7878787	-073.9536853	G5020	Census Tract 160.02	36	016002	1	4	None	1	1	2194198.77038	0.0
4	440	150.01	51643	061	S	36061015001	+40.7801987	-073.9592834	G5020	Census Tract 150.01	36	015001	1	5	{"paths": [[[-8232883.8965, 4979976.3551], [-8...	1	1	2194198.77038	0.559931

# Final Tract to NYCRailroad distances in feet — Here in each row, column "railroad_dist" returns the distance of
# the nearest rail road route from that tract
sdf_tract_railroad_dist_lyr_new = sdf_tract_railroad_dist_lyr[['From_geoid', 'Total_Miles']]
sdf_tract_railroad_dist_lyr_new['railroad_dist'] = round(sdf_tract_railroad_dist_lyr_new['Total_Miles'] * 5280, 2) 
sdf_tract_railroad_dist_lyr_new.sort_values('From_geoid').head()

	From_geoid	Total_Miles	railroad_dist
2095	36005000100	0.403054	2128.12
2059	36005000200	0.215395	1137.29
2067	36005000400	0.708551	3741.15
1668	36005001600	0.614506	3244.59
2127	36005001900	0.0	0.0

# Estimating distances to NYC Business Districts
tract_busi_distrs_dist = use_proximity.find_nearest(nyc_tracts_layer,
                                                      busi_distrs_lyr,
                                                      measurement_type='StraightLine',
                                                      max_count=1,
                                                      output_name='tract_busi_distrs_dist'+ str(dt.now().microsecond))
tract_busi_distrs_dist_lyr = tract_busi_distrs_dist.layers[1]
sdf_tract_busi_distrs_dist_lyr = pd.DataFrame.spatial.from_layer(tract_busi_distrs_dist_lyr)
sdf_tract_busi_distrs_dist_lyr.head()

{"cost": 2.241}

	From_ID	From_Name	From_aland	From_countyfp	From_funcstat	From_geoid	From_intptlat	From_intptlon	From_mtfcc	From_namelsad	From_statefp	From_tractce	NearRank	OBJECTID	SHAPE	To_ID	To_ORIG_FID	To_Shape__Area	To_Shape__Length	To_bid	To_borough	To_date_creat	To_date_modif	To_objectid	To_time_creat	To_time_modif	Total_Miles
0	1332	227.03	41502	005	S	36005022703	+40.8440198	-073.9104999	G5020	Census Tract 227.03	36	022703	1	1	{"paths": [[[-8227809.0396, 4989358.3475], [-8...	15	14	180282.226562	3281.5747	Washington Heights BID	Manhattan	2008-11-19	2016-10-25	69.0	00:00:00.000	00:00:00.000	1.034897
1	501	207.01	47273	061	S	36061020701	+40.8089775	-073.9584600	G5020	Census Tract 207.01	36	020701	1	2	{"paths": [[[-8232865.8707, 4984369.9553], [-8...	29	16	269468.507812	4849.117421	125th Street BID	Manhattan	2008-11-19	2016-10-25	67.0	00:00:00.000	00:00:00.000	0.159359
2	469	174.02	50730	061	S	36061017402	+40.7968026	-073.9471624	G5020	Census Tract 174.02	36	017402	1	3	{"paths": [[[-8231888.4853, 4982612.3984], [-8...	29	16	269468.507812	4849.117421	125th Street BID	Manhattan	2008-11-19	2016-10-25	67.0	00:00:00.000	00:00:00.000	0.604451
3	454	160.02	51422	061	S	36061016002	+40.7878787	-073.9536853	G5020	Census Tract 160.02	36	016002	1	4	{"paths": [[[-8232539.9271, 4981009.3871], [-8...	65	70	409639.414062	8507.99606	Madison Avenue BID	Manhattan	2008-11-19	2016-10-26	64.0	00:00:00.000	00:00:00.000	0.502002
4	440	150.01	51643	061	S	36061015001	+40.7801987	-073.9592834	G5020	Census Tract 150.01	36	015001	1	5	None	65	70	409639.414062	8507.99606	Madison Avenue BID	Manhattan	2008-11-19	2016-10-26	64.0	00:00:00.000	00:00:00.000	0.0

# Final Tract to NYC Businesss Districts distances in feet — Here in each row, column "busi_distr_dist" returns the distance of the CBD from respective tracts
sdf_tract_busi_distrs_dist_lyr_new = sdf_tract_busi_distrs_dist_lyr[['From_geoid', 'Total_Miles']]
sdf_tract_busi_distrs_dist_lyr_new['busi_distr_dist'] = round(sdf_tract_busi_distrs_dist_lyr_new['Total_Miles'] * 5280, 2) 
sdf_tract_busi_distrs_dist_lyr_new.sort_values('From_geoid').head()

	From_geoid	Total_Miles	busi_distr_dist
2095	36005000100	1.308636	6909.6
2059	36005000200	1.292505	6824.43
2067	36005000400	1.596395	8428.97
1668	36005001600	1.23762	6534.63
2127	36005001900	0.510611	2696.02

Importing Borough Info for each Tracts

# Name of the borough, inside which the tracts are located 
ny_tract_boro = gis.content.search('NYCTractBorough owner:api_data_owner', 'feature layer')[0]
ny_tract_boro_lyr = ny_tract_boro.layers[0]
sdf_ny_tract_boro_lyr = pd.DataFrame.spatial.from_layer(ny_tract_boro_lyr)
sdf_ny_tract_boro_lyr_new = sdf_ny_tract_boro_lyr[['geoid', 'boro_name']]
sdf_ny_tract_boro_lyr_new.sort_values('geoid').head()

	geoid	boro_name
0	36005000100	Bronx
2	36005000200	Bronx
5	36005000400	Bronx
7	36005001600	Bronx
9	36005001900	Bronx

Merging all the above estimated data set of features

tract_merge_dist = sdf_tract_hotel_dist_lyr_new.merge(sdf_tract_subwy_rt_dist_lyr_new,
                                                           on='From_geoid', suffixes=('_left1', '_right1')).merge(sdf_tract_railroad_dist_lyr_new,
                                                           on='From_geoid', suffixes=('_left2', '_right2')).merge(sdf_tract_subwy_stn_dist_lyr_new,
                                                           on='From_geoid', suffixes=('_left3', '_right3')).merge(sdf_tract_busi_distrs_dist_lyr_new,
                                                           on='From_geoid', suffixes=('_left4', '_right4')).merge(sdf_tract_cbd_dist_lyr_new, on='From_geoid')
tract_merge_dist_new = tract_merge_dist[['From_geoid',
                                         'hotel_dist',
                                         'subwy_rt_dist',
                                         'railroad_dist',
                                         'subwy_stn_dist',
                                         'busi_distr_dist',
                                         'cbd_dist']]
tract_merge_dist_new = tract_merge_dist_new.rename(columns={'From_geoid':'geoid'})
tract_merge_dist_new.sort_values('geoid').head()

	geoid	hotel_dist	subwy_rt_dist	railroad_dist	subwy_stn_dist	busi_distr_dist	cbd_dist
2095	36005000100	5571.75	4780.04	2128.12	4996.07	6909.6	26396.02
2059	36005000200	5486.44	5854.07	1137.29	5851.15	6824.43	36212.95
2067	36005000400	2495.67	6293.88	3741.15	6291.15	8428.97	38659.77
1668	36005001600	3093.96	3824.42	3244.59	3852.61	6534.63	39734.83
2127	36005001900	0.0	15.06	0.0	422.73	2696.02	22881.35

# merging number of bus stop and borough name
tract_merge_dist_new = tract_merge_dist_new.merge(sdf_num_bustops_tracts_lyr_new,
                                                 on='geoid').merge(sdf_ny_tract_boro_lyr_new,
                                                 on='geoid') 
tract_merge_dist_new = tract_merge_dist_new.sort_values('geoid')
tract_merge_dist_new.head()

	geoid	hotel_dist	subwy_rt_dist	railroad_dist	subwy_stn_dist	busi_distr_dist	cbd_dist	num_bustop	boro_name
2095	36005000100	5571.75	4780.04	2128.12	4996.07	6909.6	26396.02	0	Bronx
2059	36005000200	5486.44	5854.07	1137.29	5851.15	6824.43	36212.95	0	Bronx
2067	36005000400	2495.67	6293.88	3741.15	6291.15	8428.97	38659.77	1	Bronx
1668	36005001600	3093.96	3824.42	3244.59	3852.61	6534.63	39734.83	1	Bronx
2127	36005001900	0.0	15.06	0.0	422.73	2696.02	22881.35	1	Bronx

# Accessing the airbnb count for each tract
sdf_airbnb_count_by_tract_new = sdf_airbnb_count_by_tract[['geoid','Point_Count']]
sdf_airbnb_count_by_tract_new = sdf_airbnb_count_by_tract_new.rename(columns={'Point_Count':'total_airbnb'})
sdf_airbnb_count_by_tract_new.head()

	geoid	total_airbnb
2095	36005000100	0
2059	36005000200	0
2067	36005000400	15
1668	36005001600	1
2127	36005001900	24

# preparing the final distance table with airbnb count by tract
tract_merge_dist_all = sdf_airbnb_count_by_tract_new.merge(tract_merge_dist_new, on='geoid')
tract_merge_dist_all.head()

	geoid	total_airbnb	hotel_dist	subwy_rt_dist	railroad_dist	subwy_stn_dist	busi_distr_dist	cbd_dist	num_bustop	boro_name
0	36005000100	0	5571.75	4780.04	2128.12	4996.07	6909.6	26396.02	0	Bronx
1	36005000200	0	5486.44	5854.07	1137.29	5851.15	6824.43	36212.95	0	Bronx
2	36005000400	15	2495.67	6293.88	3741.15	6291.15	8428.97	38659.77	1	Bronx
3	36005001600	1	3093.96	3824.42	3244.59	3852.61	6534.63	39734.83	1	Bronx
4	36005001900	24	0.0	15.06	0.0	422.73	2696.02	22881.35	1	Bronx

tract_merge_dist_all.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2167 entries, 0 to 2166
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   geoid            2167 non-null   string 
 1   total_airbnb     2167 non-null   Int32  
 2   hotel_dist       2167 non-null   Float64
 3   subwy_rt_dist    2167 non-null   Float64
 4   railroad_dist    2167 non-null   Float64
 5   subwy_stn_dist   2167 non-null   Float64
 6   busi_distr_dist  2167 non-null   Float64
 7   cbd_dist         2167 non-null   Float64
 8   num_bustop       2167 non-null   Int32  
 9   boro_name        2167 non-null   string 
dtypes: Float64(6), Int32(2), string(2)
memory usage: 169.4 KB

Borough column being an important location indicator is converted into numerical variable and inlcuded in the feature data

tract_merge_dist_final = pd.get_dummies(tract_merge_dist_all, columns=['boro_name'])
tract_merge_dist_final.head()

	geoid	total_airbnb	hotel_dist	subwy_rt_dist	railroad_dist	subwy_stn_dist	busi_distr_dist	cbd_dist	num_bustop	boro_name_Bronx	boro_name_Brooklyn	boro_name_Manhattan	boro_name_Queens	boro_name_Staten Island
0	36005000100	0	5571.75	4780.04	2128.12	4996.07	6909.6	26396.02	0	True	False	False	False	False
1	36005000200	0	5486.44	5854.07	1137.29	5851.15	6824.43	36212.95	0	True	False	False	False	False
2	36005000400	15	2495.67	6293.88	3741.15	6291.15	8428.97	38659.77	1	True	False	False	False	False
3	36005001600	1	3093.96	3824.42	3244.59	3852.61	6534.63	39734.83	1	True	False	False	False	False
4	36005001900	24	0.0	15.06	0.0	422.73	2696.02	22881.35	1	True	False	False	False	False

Adding census data 2019 obtained using geoenrichment

The above distance data set is now added with the census data to form the final feature set for the model

sdf_airbnb_count_by_tract_enriched_sorted_new = sdf_airbnb_count_by_tract_enriched_sorted.drop(['AnalysisArea',
                                                                                                'ENRICH_FID',
                                                                                                'HasData',
                                                                                                'ID',
                                                                                                'OBJECTID',
                                                                                                'Point_Count',
                                                                                                'SHAPE',                      
                                                                                                'aggregationMethod',
                                                                                                'aland',
                                                                                                'apportionmentConfidence',
                                                                                                'awater',
                                                                                                'countyfp',
                                                                                                'funcstat',
                                                                                                'intptlat',
                                                                                                'intptlon',
                                                                                                'mtfcc',
                                                                                                'name',
                                                                                                'namelsad',
                                                                                                'populationToPolygonSizeRating',
                                                                                                'sourceCountry',
                                                                                                'statefp','tractce'], axis=1)
sdf_airbnb_count_by_tract_enriched_sorted_new.shape

(2167, 84)

# checking the rows of the table for nan values
row_with_null = sdf_airbnb_count_by_tract_enriched_sorted_new.isnull().any(axis=1)

# printing the row which has nan values
sdf_airbnb_count_by_tract_enriched_sorted_new[row_with_null]

	geoid	Employed Population Age 16+	Employed Male Pop Age 16+	Male 16+Priv Nonprofit	Male 16+:Priv Profit Empl	Male 16+:Priv Profit Self Empl	Male 16+:State Govt Wrkr	Male 16+:Fed Govt Wrkr	Male 16+:Self-Emp Not Inc	Male 16+:Unpaid Family Wrkr	Female Pop Age 16+	Female 16+:Priv Profit Empl	Female 16+:Priv Profit Self Empl	Female 16+:Priv Nonprofit	Female 16+:Local Govt Wrkr	Female 16+:State Govt Wrkr	Female 16+:Fed Govt Wrkr	Female 16+:Self-Emp Not Inc	Female 16+:Unpaid Family Wrkr	2019 Median Age	2019 Generation Alpha Population	2019 Generation Z Population	2019 Millennial Population	2019 Generation X Population	2019 Baby Boomer Population	2019 Silent & Greatest Generations Population	2019 Population by Generation Base	2019 Population Density	2019 Total Daytime Population	2019 White Population	2019 Black Population	2019 American Indian Population	2019 Asian Population	2019 Pacific Islander Population	2019 Other Race Population	2019 Diversity Index	HHs: Inc Below Poverty Level	HHs:Inc at/Above Poverty Level	Total Businesses (SIC)	Construction Businesses (NAICS)	Retail Trade Businesses (NAICS)	Transportation/Warehouse Bus (NAICS)	Own any tablet	Own any e-reader	Have access to Internet at home	Index: Spend 0.5-0.9 hrs online(excl email/IM .	Spend <0.5 hrs online (excl email/IM time) daily	2019 Total Population	2019 Male Population	2019 Female Population	2019 Employed Civilian Pop 16+	2019 Unemployed Population 16+	2019 Unemployment Rate	ACS Workers Age 16+	ACS Workers 16+: Drove Alone	ACS Workers 16+: Carpooled	ACS Workers 16+: Public Transportation	ACS Workers 16+: Bus	ACS Workers 16+: Streetcar	ACS Workers 16+: Subway	ACS Workers 16+: Ferryboat	ACS Workers 16+: Taxicab	ACS Workers 16+: Motorcycle	ACS Workers 16+: Bicycle	ACS Workers 16+: Walked	ACS Workers 16+: Other Means	ACS Wrkrs 16+: Worked at Home	2019 Owner Occupied HUs	2019 Renter Occupied HUs	2019 Vacant Housing Units	2019 Median Home Value	2019 Total Housing Units	ACS Median Year Structure Built: HUs	2019 Annual Budget Exp	2019 Transportation	2019 Daytime Pop: Workers	2019 Daytime Pop: Residents	2019 Daytime Pop Density	2019 Occupation: Protective Service	2019 Occupation: Food Preperation	2019 Occupation: Personal Care	2019 Occupation: Office/Admin	2019 Occupation: Construction/Extraction	2019 Occupation: Production
2163	36047990100	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>
2165	36085990100	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>

# checking total number of nan values
nan_test = sdf_airbnb_count_by_tract_enriched_sorted_new.drop(['geoid'], axis=1)
np.isnan(nan_test).sum().sum()

These two tracts area actually are water areas within NYC, hence have nan values and are filled with zeros

sdf_airbnb_count_by_tract_enriched_sorted_fill = sdf_airbnb_count_by_tract_enriched_sorted_new.fillna(0)

#nan rechecked
nan_test = sdf_airbnb_count_by_tract_enriched_sorted_fill.drop(['geoid'], axis=1)
np.isnan(nan_test).sum().sum()

Merging the distance data with the enriched data

final_df = pd.merge(tract_merge_dist_final,
                    sdf_airbnb_count_by_tract_enriched_sorted_fill,
                    left_on = 'geoid',
                    right_on = 'geoid',
                    how = 'left')

print(final_df.shape)
final_df.head()

(2167, 97)

	geoid	total_airbnb	hotel_dist	subwy_rt_dist	railroad_dist	subwy_stn_dist	busi_distr_dist	cbd_dist	num_bustop	boro_name_Bronx	boro_name_Brooklyn	boro_name_Manhattan	boro_name_Queens	boro_name_Staten Island	Employed Population Age 16+	Employed Male Pop Age 16+	Male 16+Priv Nonprofit	Male 16+:Priv Profit Empl	Male 16+:Priv Profit Self Empl	Male 16+:State Govt Wrkr	Male 16+:Fed Govt Wrkr	Male 16+:Self-Emp Not Inc	Male 16+:Unpaid Family Wrkr	Female Pop Age 16+	Female 16+:Priv Profit Empl	Female 16+:Priv Profit Self Empl	Female 16+:Priv Nonprofit	Female 16+:Local Govt Wrkr	Female 16+:State Govt Wrkr	Female 16+:Fed Govt Wrkr	Female 16+:Self-Emp Not Inc	2019 Median Age	2019 Generation Alpha Population	2019 Generation Z Population	2019 Millennial Population	2019 Generation X Population	2019 Baby Boomer Population	2019 Silent & Greatest Generations Population	2019 Population by Generation Base	2019 Population Density	2019 Total Daytime Population	2019 White Population	2019 Black Population	2019 American Indian Population	2019 Asian Population	2019 Pacific Islander Population	2019 Other Race Population	2019 Diversity Index	HHs: Inc Below Poverty Level	HHs:Inc at/Above Poverty Level	Total Businesses (SIC)	Construction Businesses (NAICS)	Retail Trade Businesses (NAICS)	Transportation/Warehouse Bus (NAICS)	Own any tablet	Own any e-reader	Have access to Internet at home	Index: Spend 0.5-0.9 hrs online(excl email/IM .	Spend <0.5 hrs online (excl email/IM time) daily	2019 Total Population	2019 Male Population	2019 Female Population	2019 Employed Civilian Pop 16+	2019 Unemployed Population 16+	2019 Unemployment Rate	ACS Workers Age 16+	ACS Workers 16+: Drove Alone	ACS Workers 16+: Carpooled	ACS Workers 16+: Public Transportation	ACS Workers 16+: Bus	ACS Workers 16+: Streetcar	ACS Workers 16+: Subway	ACS Workers 16+: Ferryboat	ACS Workers 16+: Taxicab	ACS Workers 16+: Bicycle	ACS Workers 16+: Walked	ACS Workers 16+: Other Means	ACS Wrkrs 16+: Worked at Home	2019 Owner Occupied HUs	2019 Renter Occupied HUs	2019 Vacant Housing Units	2019 Median Home Value	2019 Total Housing Units	ACS Median Year Structure Built: HUs	2019 Annual Budget Exp	2019 Transportation	2019 Daytime Pop: Workers	2019 Daytime Pop: Residents	2019 Daytime Pop Density	2019 Occupation: Protective Service	2019 Occupation: Food Preperation	2019 Occupation: Personal Care	2019 Occupation: Office/Admin	2019 Occupation: Construction/Extraction	2019 Occupation: Production
0	36005000100	0	5571.75	4780.04	2128.12	4996.07	6909.6	26396.02	0	True	False	False	False	False	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	34.5	0.0	698.0	1997.0	921.0	127.0	5.0	3748.0	3588.6	663.0	629.0	2330.0	17.0	67.0	1.0	672.0	75.2	0.0	0.0	24.0	3.0	4.0	1.0	0.0	0.0	0.0	0.0	0.0	3748.0	3407.0	341.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	1.0	0.0	0.0	0.0	663.0	0.0	634.8	0.0	0.0	0.0	0.0	0.0	0.0
1	36005000200	0	5486.44	5854.07	1137.29	5851.15	6824.43	36212.95	0	True	False	False	False	False	1664.0	1068.0	51.0	709.0	0.0	43.0	0.0	38.0	0.0	596.0	316.0	15.0	78.0	115.0	60.0	12.0	0.0	35.8	430.0	1218.0	1245.0	913.0	823.0	168.0	4797.0	8988.5	3210.0	438.0	1195.0	69.0	217.0	7.0	2150.0	83.6	278.0	1114.0	42.0	2.0	2.0	5.0	2177.0	437.0	3469.0	86.0	123.0	4797.0	2249.0	2548.0	1834.0	192.0	9.5	1664.0	766.0	13.0	619.0	235.0	0.0	362.0	22.0	39.0	0.0	200.0	0.0	27.0	767.0	766.0	78.0	558333.0	1611.0	1957.0	132707872.0	13576654.0	196.0	3014.0	6014.9	48.0	106.0	15.0	162.0	33.0	130.0
2	36005000400	15	2495.67	6293.88	3741.15	6291.15	8428.97	38659.77	1	True	False	False	False	False	3128.0	1675.0	253.0	932.0	21.0	38.0	104.0	75.0	0.0	1453.0	804.0	37.0	276.0	281.0	16.0	39.0	0.0	35.7	506.0	1589.0	1589.0	1226.0	945.0	166.0	6021.0	10291.0	3758.0	641.0	1803.0	60.0	240.0	6.0	2278.0	85.4	107.0	2092.0	57.0	3.0	9.0	3.0	2754.0	730.0	4455.0	103.0	172.0	6021.0	2937.0	3084.0	2807.0	120.0	4.1	3082.0	1628.0	61.0	1077.0	250.0	0.0	776.0	33.0	0.0	0.0	130.0	64.0	122.0	1548.0	537.0	108.0	552034.0	2193.0	2001.0	197321413.0	20238565.0	471.0	3287.0	6423.1	239.0	83.0	96.0	429.0	99.0	42.0
3	36005001600	1	3093.96	3824.42	3244.59	3852.61	6534.63	39734.83	1	True	False	False	False	False	2513.0	959.0	31.0	620.0	29.0	0.0	18.0	20.0	0.0	1554.0	927.0	7.0	159.0	296.0	90.0	51.0	24.0	36.6	477.0	1384.0	1313.0	996.0	997.0	350.0	5517.0	29457.3	5334.0	497.0	2127.0	70.0	123.0	3.0	1900.0	84.6	526.0	1661.0	57.0	0.0	7.0	2.0	2392.0	491.0	3950.0	80.0	114.0	5517.0	2394.0	3123.0	2087.0	183.0	8.1	2422.0	880.0	73.0	1154.0	604.0	0.0	538.0	12.0	19.0	12.0	110.0	29.0	145.0	295.0	1712.0	103.0	650463.0	2110.0	1973.0	117903160.0	11830562.0	1831.0	3503.0	28480.2	101.0	104.0	35.0	299.0	3.0	21.0
4	36005001900	24	0.0	15.06	0.0	422.73	2696.02	22881.35	1	True	False	False	False	False	1790.0	887.0	144.0	580.0	46.0	0.0	7.0	41.0	17.0	903.0	585.0	15.0	133.0	64.0	17.0	0.0	89.0	36.1	362.0	946.0	1286.0	885.0	553.0	134.0	4166.0	3876.6	8595.0	637.0	1656.0	62.0	67.0	5.0	1236.0	85.8	323.0	938.0	409.0	75.0	45.0	26.0	1836.0	347.0	3093.0	80.0	94.0	4166.0	2182.0	1984.0	2152.0	114.0	5.0	1751.0	279.0	89.0	1054.0	142.0	14.0	898.0	0.0	16.0	21.0	75.0	35.0	182.0	118.0	1486.0	124.0	381395.0	1728.0	1964.0	118882651.0	12251786.0	6516.0	2079.0	7998.0	32.0	127.0	162.0	177.0	58.0	64.0

# rechecking nan values of the final dataframe
final_nan_test = final_df.drop('geoid', axis=1)
np.isnan(final_nan_test).sum().sum()

Model Building

The goal here is to find the factors contributing towards the development of new Airbnb properties in New York City. Thus a model is fitted predicting the number of Airbnb properties per tract with the feature set composed of the distance and demographics characteristics of each tract. Once a good fit is obtained the most important predictors of the model are estimated which is our main ask.

# Creating feature data 
X = final_df.drop(['geoid','total_airbnb'], axis=1)

# Creating target data  -- the number airbnb per tract
y = pd.DataFrame(final_df['total_airbnb'])

split the dataframe into train - test of 90% to 10%

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.10, random_state = 20)

print(X_train.shape)
print(y_train.shape)

print(X_test.shape)
print(y_test.shape)

# Converting the target into 1d array
y_train_array = y_train.values.flatten()
y_test_array = y_test.values.flatten() 

print(y_train_array.shape)
print(y_test_array.shape)

(1950, 95)
(1950, 1)
(217, 95)
(217, 1)
(1950,)
(217,)

As a best practice since scaled data performs well for model fitting, the features are normalized using Robust scaler

scaler = preprocessing.RobustScaler()

X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train), columns = X_train.columns) 
X_test_scaled = pd.DataFrame(scaler.fit_transform(X_test), columns=X_test.columns)

RandomForest Regressor Model

The modelling is first started using a linear regression. However the linear model was failing to fit the data well. Hence it was carried out with a non linear algorithm as follows. This could be tested by the user to see the improvement of using Random Forest over a linear regression.

The accuracy metrics of mean absoute error and r-square is used

# Random forest with scaled data
# for the best parameters a grid search could be done which could take some time
# however this model uses the default parameters of RF algorithm, while the estimators are changed till the best fit is obtained
model_RF = RandomForestRegressor(n_estimators = 500, random_state=43)

# Train the model
model_RF.fit(X_train_scaled, y_train_array)

# Training metrics for Random forest model
print('Training metrics for Random forest model using scaled data')
ypred_RF_train = model_RF.predict(X_train_scaled)
print('r-square_RF_Train: ', round(model_RF.score(X_train_scaled, y_train_array), 2))

mse_RF_train = metrics.mean_squared_error(y_train_array, ypred_RF_train)  
print('RMSE_RF_train: ', round(np.sqrt(mse_RF_train),4))

mean_absolute_error_RF_train = metrics.mean_absolute_error(y_train_array, ypred_RF_train)
print('MAE_RF_train: ', round(mean_absolute_error_RF_train, 4)) 

# Test metrics for Random Forest model
print('\nTest metrics for Random Forest model scaled data')
ypred_RF_test = model_RF.predict(X_test_scaled)
print('r-square_RF_test: ', round(model_RF.score(X_test_scaled, y_test_array), 2))

mse_RF_test = metrics.mean_squared_error(y_test_array, ypred_RF_test) 
print('RMSE_RF_test: ', round(np.sqrt(mse_RF_test), 4))

mean_absolute_error_RF_test = metrics.mean_absolute_error(y_test_array, ypred_RF_test)
print('MAE_RF_test: ', round(mean_absolute_error_RF_test, 4))

Training metrics for Random forest model using scaled data
r-square_RF_Train:  0.97
RMSE_RF_train:  6.9747
MAE_RF_train:  3.4844

Test metrics for Random Forest model scaled data
r-square_RF_test:  0.77
RMSE_RF_test:  22.2146
MAE_RF_test:  10.2689

The result shows that the model is returning an r-square of 0.85 with a mean absolute error of 9.28

Feature importance for the RF model

feature_imp_RF = model_RF.feature_importances_

#relative feature importance  
rel_feature_imp = 100 * (feature_imp_RF / max(feature_imp_RF)) 
rel_feature_imp = pd.DataFrame({'features':list(X_train.columns),
                                'rel_importance':rel_feature_imp })

rel_feature_imp = rel_feature_imp.sort_values('rel_importance', ascending=False)


#plotting the top twenty important features
top20_features = rel_feature_imp.head(20) 

plt.figure(figsize=[20,10])
plt.yticks(fontsize=15)
ax = sns.barplot(x="rel_importance", y="features",
                 data=top20_features,
                 palette="Accent_r")

plt.xlabel("Relative Importance", fontsize=25)
plt.ylabel("Features", fontsize=25)
plt.show()

rel_feature_imp.head()

	features	rel_importance
5	cbd_dist	100.000000
77	ACS Wrkrs 16+: Worked at Home	54.125825
33	2019 Millennial Population	38.377447
74	ACS Workers 16+: Bicycle	18.789531
80	2019 Vacant Housing Units	11.898179

The feature importance plot reveals that distance from the city centre (cbd_dist) is the most important predictor of the number of Airbnb formation in NYC. This is expected since hotel rents near the cbd are quite high, rental income from Airbnb properties would be high as well, hence setting up Airbnb property would be a lucrative option, compared to long term rental income in areas near the cbd.

This is followed by the number of millennial population, or the tracts having most number of people in the age group of 25 to 40 years old. One reason might be that these group of population are more active online and are comfortable with internet technologies which is in a way a necessary prerequisite for setting up Airbnb properties. This is supported by the presence of another interesting predictor variable of -- 0.5-0.9 hrs online activity, in the top twenty.

This is followed by the tracts having workers who commute by bicycle and is the third most important predictor, which is followed by the number of generation alpha population, who are person born after 2011, and then by tracts having people commuting by subway, and so on. The median home value of the tracts is also an interesting predictor.

Gradient Boosting Regressor Model

Here trial shows that the gradient boosting model performs better with non scale data

# GradientBoosting with non scaled data
# this model uses the default parameters of GB algorithm, while the estimators are changed to obtain the best fit 
model_GB_nonscale = GradientBoostingRegressor(n_estimators=500, random_state=60)

# Train the model
model_GB_nonscale.fit(X_train, y_train_array)

# Training metrics for Gradient Boosting Regressor model
print('Training metrics for Gradient Boosting Regressor model using scaled data')

ypred_GB_train = model_GB_nonscale.predict(X_train)
print('r-square_GB_Train: ', round(model_GB_nonscale.score(X_train, y_train_array), 2))

mse_RF_train = metrics.mean_squared_error(y_train_array, ypred_GB_train)
print('RMSE_GB_Train: ', round(np.sqrt(mse_RF_train), 4))

mean_absolute_error_RF_train = metrics.mean_absolute_error(y_train_array, ypred_GB_train)
print('MAE_GB_Train: ', round(mean_absolute_error_RF_train, 4))

#Test metrics for Gradient Boosting Regressor model
print('\nTest metrics for Gradient Boosting Regressor model using scaled data')

ypred_GB_test = model_GB_nonscale.predict(X_test)
print('r-square_GB_Test: ', round(model_GB_nonscale.score(X_test, y_test_array),2))

mse_RF_Test = metrics.mean_squared_error(y_test_array, ypred_GB_test)  
print('RMSE_GB_Test: ', round(np.sqrt(mse_RF_Test),4))

mean_absolute_error_GB_Test = metrics.mean_absolute_error(y_test_array, ypred_GB_test)
print('MAE_GB_Test: ', round(mean_absolute_error_GB_Test, 4))

Training metrics for Gradient Boosting Regressor model using scaled data
r-square_GB_Train:  0.99
RMSE_GB_Train:  3.1363
MAE_GB_Train:  2.3217

Test metrics for Gradient Boosting Regressor model using scaled data
r-square_GB_Test:  0.81
RMSE_GB_Test:  20.0925
MAE_GB_Test:  9.5355

The result shows that the Gradient boosting regressor model is performing slightly better both in terms of Mean Absolute error and r-square than the random forest model.

Feature Importance of Gradient Boosting Model

#checking the feature importance for the Gradient Boosting regressor
feature_imp_GB = model_GB_nonscale.feature_importances_
rel_feature_imp_GB = 100 * feature_imp_GB / max(feature_imp_GB)
rel_feature_imp_GB = pd.DataFrame({'features':list(X_train.columns),
                                   'rel_importance':rel_feature_imp_GB})
rel_feature_imp_GB = rel_feature_imp_GB.sort_values('rel_importance', ascending=False)
rel_feature_imp_GB.head()

	features	rel_importance
5	cbd_dist	100.000000
77	ACS Wrkrs 16+: Worked at Home	57.804037
74	ACS Workers 16+: Bicycle	48.524487
33	2019 Millennial Population	37.625884
80	2019 Vacant Housing Units	24.284889

# Plot  feature importance for the Gradient Boosting regressor
top20_features_GB = rel_feature_imp_GB.head(20) 

plt.figure(figsize=[20,10])
plt.yticks(fontsize=15)
ax = sns.barplot(x="rel_importance", y="features", data = top20_features_GB, palette="Accent_r")
plt.xlabel("Relative Importance", fontsize=25)
plt.ylabel("Features", fontsize=25)
plt.show()

The feature importance shown by the Gradient boosting model are almost identical to the one returned by the random forest model, which is expected.

Running cross validation

The above model is fitted and accuracy measured on a particular train and test split of the data. However the model accuracy for multiple split of the data remains to be seen. This is accomplished using k fold cross validation which splits the data into k different train-test splits and fit the model for each of them. Hence a 10 fold cross validation is run to check the overall model accuracy which is measured here as the mean absolute error for model fit accross the 10 different splits.

# Validating with a 10 fold cross validation for the Gradient Boosting models
y_array = y.values.flatten()

modelGB_cross_val = GradientBoostingRegressor(n_estimators=500, random_state=60) 

modelGB_cross_val_scores = cross_val_score(modelGB_cross_val,
                                           X, 
                                           y_array,
                                           cv=10,
                                           scoring='neg_mean_absolute_error')

print("All Model Scores: ", modelGB_cross_val_scores)

print("Negative Mean Absolute Error: {}".format(np.mean(modelGB_cross_val_scores)))

All Model Scores:  [ -6.8058649  -11.62520912 -11.02025161 -21.53356819  -4.63056795
 -41.11872138 -14.46560863  -9.15669676  -6.10330726  -4.9869393 ]
Negative Mean Absolute Error: -13.1446735095419

# Validating with a 10 fold cross validation for the Random forest models
y_array = y.values.flatten()

modelRF_cross_val = RandomForestRegressor(n_estimators=500, random_state=43)

modelRF_cross_val_scores = cross_val_score(modelRF_cross_val,
                                           X, 
                                           y_array,
                                           cv=10,
                                           scoring='neg_mean_absolute_error')

print("All Model Scores: ", modelRF_cross_val_scores)

print("Negative Mean Absolute Error: {}".format(np.mean(modelRF_cross_val_scores)))

All Model Scores:  [-10.648       -9.98169585 -10.22411982 -22.20942857  -4.53489401
 -37.69979724 -16.13404608 -12.30612963  -6.35166667  -4.42249074]
Negative Mean Absolute Error: -13.45122686038573

Final Result Visualization

# Plotting a kernel density map of the predicted vs. observed data
plt.figure(figsize=[15,5])

# plotting the prediction
sns.kdeplot(ypred_RF_test, label = 'Predictions', color = 'orange')
y_observed = np.array(y_test).reshape((-1, ))
sns.kdeplot(y_observed, label = 'Observation', color = 'green')

# label the plot
plt.xlabel('No. of Airbnb listings per census tract', fontsize=15)
plt.ylabel('Density', fontsize=15)
plt.title('Density Plot: Predicted vs Observed', fontsize=15)
plt.xticks(range(0,500,25), fontsize=10)
plt.yticks(fontsize=10)
plt.legend(fontsize=15)
plt.show()

# Converting the predicted and observed values to dataframe and plotting the observed vs predicted
y_test_df = y_test.copy()
y_test_df['Predicted'] = (ypred_RF_test)  
y_test_df.head()

	total_airbnb	Predicted
20	4	6.006
910	6	3.980
685	2	4.572
450	1	2.000
1044	2	8.080

# plotting the actual observed vs predicted airbnb properties by tract
plt.figure(figsize = [25,12])
sns.set(style = 'whitegrid')
sns.lineplot(data = y_test_df, markers=True) 

#label the plot
plt.xlabel('Tract ID', fontsize=15)
plt.ylabel('Total No. of Airbnb', fontsize=15)
plt.title('Actual No. of Airbnb by Tract: Predicted vs Observed', fontsize=15)
plt.xticks(range(0,2000,100), fontsize=15)
plt.yticks(fontsize=15)
plt.legend(fontsize='x-large', title_fontsize='10')
plt.legend(fontsize=15)
plt.show()

The plot shows that the predicted values closely matches the observed values. However there are instances of underprediction for tracts with extremely high number of airbnb properties, and also overprediction instances for some tracts with comparatively lower number of airbnb properties.

Conclusion

The study shows that the location factor of distance from CBD is the foremost important factor which stimulates creation of Airbnb properties.

The proximity tool from the ArcGIS API for Python was used to perform this significant task for all the distance estimation. Other factors as returned by the feature importance result could be dealt individually. Another interesting capability of Esri utilized in the study is that of Esri's data repository, elaborated here via the geoenrichment services. The data enrichment service could provide the analyst an wide array of data that could be used for critical analysis. Further analysis would be done in the next study on this dataset.

Summary of methods used

Method	Question	Examples
aggregate_points	How many points within each polygon?	Counting the number of airbnb rentals within each NYC tracts
Data Enrichment	Which demographic attribute are relevant for the problem?	Population of Millennials for each tract
find_nearest	Which distances from city features are relevant for the problem?	Distance of the CBD from each tract

Data resources

Shapefile	Source	Link
airbnb_nyc2019	NYC Airbnb Data Inside Airbnb:Get the Data	http://insideairbnb.com/get-the-data.html
nyc_tract_fulll	NYC Open Data: 2010 Census Tracts (water areas included)	https://data.cityofnewyork.us/City-Government/2010-Census-Tracts-water-areas-included-/gx7x-82rk
busi_distr	NYC Open Data: Business Improvement Districts	https://data.cityofnewyork.us/Business/Business-Improvement-Districts/ejxk-d93y
cbd	NYC Open Data: Business Improvement Districts	https://data.cityofnewyork.us/Business/Business-Improvement-Districts/ejxk-d93y
bus_stop	NYC Open Data: Bus Stop Shelters	https://data.cityofnewyork.us/Transportation/Bus-Stop-Shelters/qafz-7myz
hotels	NYC Open Data: Facilities Database	https://data.cityofnewyork.us/City-Government/Facilities-Database-Shapefile/2fpa-bnsx
railroad	NYC Open Data: Railroad Line	https://data.cityofnewyork.us/Transportation/Railroad-Line/i7a5-bsik
subwy_rt	NYC Open Data: Subway Lines	https://data.cityofnewyork.us/Transportation/Subway-Lines/3qz8-muuu
subwy_stn	NYC Open Data: Subway Stations	https://data.cityofnewyork.us/Transportation/Subway-Stations/arq3-7z49