Introduction
COVID-19 forecasting has been vital for efficiently planning health care policy during the pandemic. Many forecasting models exist, some of which require explanatory variables such as population or social distancing measures. This notebook uses the deep learning TimeSeriesModel from arcgis.learn to model the data and forecast future trends.
To demonstrate the utility of this method, this notebook will analyze confirmed cases for all counties in Alabama. The dataset contains the unique county FIPS ID, county name, state ID, and the cumulative confirmed cases by date for each county. The dataset ranges from January 2020 to February 2022, with the data from January 2022 to February 2022 being used to validate the quality of the forecast. The approach used in this analysis for forecasting future COVID-19 cases involves: (a) data processing (calculating the seven-day moving average to remove noise and vertically stacking the county data), (b) creating functions for train-test splitting, tabular data preparation, model fitting using InceptionTime with a sequence length of 15, and forecasting, and (c) validating and visualizing the predicted data and results.
Importing Libraries
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
import sklearn.metrics as metrics
from arcgis.gis import GIS
from arcgis.learn import TimeSeriesModel, prepare_tabulardata
Connecting to your GIS
gis = GIS("home")
Accessing the dataset
The latest dataset can be downloaded from USAFacts: https://usafacts.org/visualizations/coronavirus-covid-19-spread-map/
# Access the data table
data_table = gis.content.get("b222748b885e4741839f3787f207b2b1")
data_table
# Download the csv and save it in a local folder
data_path = data_table.get_data()
# Read the csv file
confirmed = pd.read_csv(data_path)
confirmed.head()
| | countyFIPS | County Name | State | StateFIPS | 2020-01-22 | 2020-01-23 | 2020-01-24 | 2020-01-25 | 2020-01-26 | 2020-01-27 | ... | 2022-01-23 | 2022-01-24 | 2022-01-25 | 2022-01-26 | 2022-01-27 | 2022-01-28 | 2022-01-29 | 2022-01-30 | 2022-01-31 | 2022-02-01 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | Statewide Unallocated | AL | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 1001 | Autauga County | AL | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 13019 | 13251 | 13251 | 13251 | 13251 | 13251 | 13251 | 13251 | 13251 | 14826 |
2 | 1003 | Baldwin County | AL | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 49168 | 50313 | 50313 | 50313 | 50313 | 50313 | 50313 | 50313 | 50313 | 53083 |
3 | 1005 | Barbour County | AL | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 4902 | 5054 | 5054 | 5054 | 5054 | 5054 | 5054 | 5054 | 5054 | 5297 |
4 | 1007 | Bibb County | AL | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 5663 | 5795 | 5795 | 5795 | 5795 | 5795 | 5795 | 5795 | 5795 | 6158 |
5 rows × 746 columns
Raw data cleaning
# Extract the data for the state of Alabama (county FIPS codes 1001-1133)
confirmed_AL = confirmed.loc[
(confirmed["countyFIPS"] >= 1000) & (confirmed["countyFIPS"] <= 1133)]
# Stack the table for cumulative confirmed cases
confirmed_AL = confirmed_AL.set_index(["countyFIPS"])
confirmed_AL = confirmed_AL.drop(columns=["State", "County Name", "StateFIPS"])
confirmed_stacked_df = (
confirmed_AL.stack()
.reset_index()
.rename(columns={"level_1": "OriginalDate", 0: "ConfirmedCases"})
)
confirmed_stacked_df
| | countyFIPS | OriginalDate | ConfirmedCases |
---|---|---|---|
0 | 1001 | 2020-01-22 | 0 |
1 | 1001 | 2020-01-23 | 0 |
2 | 1001 | 2020-01-24 | 0 |
3 | 1001 | 2020-01-25 | 0 |
4 | 1001 | 2020-01-26 | 0 |
... | ... | ... | ... |
49709 | 1133 | 2022-01-28 | 6323 |
49710 | 1133 | 2022-01-29 | 6323 |
49711 | 1133 | 2022-01-30 | 6323 |
49712 | 1133 | 2022-01-31 | 6323 |
49713 | 1133 | 2022-02-01 | 7057 |
49714 rows × 3 columns
# Convert the OriginalDate strings into a datetime field
confirmed_stacked_df["DateTime"] = pd.to_datetime(
confirmed_stacked_df["OriginalDate"], infer_datetime_format=True
)
confirmed_stacked_df = confirmed_stacked_df.drop(columns=["OriginalDate"])
confirmed_stacked_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49714 entries, 0 to 49713
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   countyFIPS      49714 non-null  int64
 1   ConfirmedCases  49714 non-null  int64
 2   DateTime        49714 non-null  datetime64[ns]
dtypes: datetime64[ns](1), int64(2)
memory usage: 1.1 MB
Calculate the moving average for confirmed cases
Here, we calculate a 7-day simple moving average to smooth out the data and remove noise caused by spikes in testing results.
# Set the moving average window to 7 days
SMA_Window = 7
# Copy the dataframe and map the positional index of each column to be smoothed
df = confirmed_stacked_df.copy()
cols = {1: "ConfirmedCases"}
# Calculate the rolling mean for each county separately
for fips in df.countyFIPS.unique():
    for col in cols:
        field = f"{cols[col]}_SMA{SMA_Window}"
        df.loc[df["countyFIPS"] == fips, field] = (
            df.loc[df["countyFIPS"] == fips]
            .iloc[:, col]
            .rolling(window=SMA_Window)
            .mean()
        )
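For reference, the same smoothing can also be written without the explicit loop by grouping on the county field. The commented-out cell below is an equivalent pandas sketch of the calculation above, not an additional step in the workflow.
# Equivalent vectorized form: compute the 7-day rolling mean per county via groupby
# df["ConfirmedCases_SMA7"] = (
#     df.groupby("countyFIPS")["ConfirmedCases"]
#     .transform(lambda s: s.rolling(window=SMA_Window).mean())
# )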
Cut off the first 6 days
As the first moving average value starts from the seventh day, we will disregard the first 6 days.
firstMADay = df["DateTime"].iloc[0] + pd.DateOffset(days=SMA_Window - 1)
firstMADay
Timestamp('2020-01-28 00:00:00')
df_FirstMADay = df.loc[df["DateTime"] >= firstMADay]
df_FirstMADay.reset_index(drop=True, inplace=True)
df_FirstMADay
| | countyFIPS | ConfirmedCases | DateTime | ConfirmedCases_SMA7 |
---|---|---|---|---|
0 | 1001 | 0 | 2020-01-28 | 0.000000 |
1 | 1001 | 0 | 2020-01-29 | 0.000000 |
2 | 1001 | 0 | 2020-01-30 | 0.000000 |
3 | 1001 | 0 | 2020-01-31 | 0.000000 |
4 | 1001 | 0 | 2020-02-01 | 0.000000 |
... | ... | ... | ... | ... |
49307 | 1133 | 6323 | 2022-01-28 | 6248.714286 |
49308 | 1133 | 6323 | 2022-01-29 | 6285.857143 |
49309 | 1133 | 6323 | 2022-01-30 | 6323.000000 |
49310 | 1133 | 6323 | 2022-01-31 | 6323.000000 |
49311 | 1133 | 7057 | 2022-02-01 | 6427.857143 |
49312 rows × 4 columns
Time series data preprocessing
Preprocessing the data for time series modeling includes selecting the required columns, converting the time field into the date-time format, and collecting all of the counties in the state.
# Selecting the required columns for modeling
df = df_FirstMADay[["DateTime", "ConfirmedCases_SMA7", "countyFIPS"]].copy()
df.columns = ["date", "cases", "countyFIPS"]
df.date = pd.to_datetime(df.date, format="%Y-%m-%d")
df.tail()
| | date | cases | countyFIPS |
---|---|---|---|
49307 | 2022-01-28 | 6248.714286 | 1133 |
49308 | 2022-01-29 | 6285.857143 | 1133 |
49309 | 2022-01-30 | 6323.000000 | 1133 |
49310 | 2022-01-31 | 6323.000000 | 1133 |
49311 | 2022-02-01 | 6427.857143 | 1133 |
Collecting the counties of Alabama
# This cell collects all of the counties by their unique FIPS IDs.
counties = df.countyFIPS.unique()
counties = [county for county in counties if county != 0]
len(counties)
67
The next cell can be used to forecast for a specific county. You can declare the county to forecast by using its FIPS ID.
# counties = df.countyFIPS.unique()
# counties = [county for county in counties if county == 1001]
Time series modeling and forecasting
Here, we will create the functions for preparing the tabular data, modeling, and forecasting that will later be called for each county.
# This function selects the specified county's data and splits it into train and test sets
def CountyData(county, test_size):
data_file = df[df["countyFIPS"] == county]
data_file.reset_index(inplace=True, drop=True)
train, test = train_test_split(data_file, test_size=test_size, shuffle=False)
return train, test
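For example, the commented-out cell below holds out the last 14 days of a single county for validation; Autauga County (FIPS 1001) is used purely for illustration.
# Example usage: the final 14 days of FIPS 1001 become the test set
# train, test = CountyData(1001, test_size=14)
# len(train), len(test)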
The next function prepares the tabular data and initializes the model from the available set of backbones (InceptionTime, ResCNN, ResNet, and FCN). The sequence length here is set to 15, a value found by performing a grid search (a sketch of such a search appears after the helper definitions below). The model is then trained using the model.fit method, which is provided with the number of training epochs and the learning rate.
def Model(train, seq_len, test_size):
data = prepare_tabulardata(
train, variable_predict="cases", index_field="date", seed=42
) # Preparing the tabular data
tsmodel = TimeSeriesModel(
data, seq_len=seq_len, model_arch="InceptionTime"
) # Model initialization
lr_rate = tsmodel.lr_find() # Finding the Learning rate
tsmodel.fit(100, lr=lr_rate, checkpoint=False) # Model training
sdf_forecasted = tsmodel.predict(
train, prediction_type="dataframe", number_of_predictions=test_size
) # Forecasting using the trained TimeSeriesModel
return sdf_forecasted
# This function evaluates the model metrics and returns them in a dictionary;
# the last 14 forecasted values correspond to the 14-day test window
def evaluate(test, sdf_forecasted):
r2_test = r2_score(test["cases"], sdf_forecasted["cases_results"][-14:])
mse = metrics.mean_squared_error(
test["cases"], sdf_forecasted["cases_results"][-14:]
)
mae = metrics.mean_absolute_error(
test["cases"], sdf_forecasted["cases_results"][-14:]
)
return {
"DATE": test["date"],
"cases_actual": test["cases"],
"cases_predicted": sdf_forecasted["cases_results"][-14:],
"R-square": round(r2_test, 2),
"V_RMSE": round(np.sqrt(mse), 4),
"MAE": round(mae, 4),
}
# This class calls all the defined functions
class CovidModel(object):
seq_len = 15
test_size = 14
def __init__(self, county):
self.county = county
def CountyData(self):
self.train, self.test = CountyData(self.county, self.test_size)
def Model(self):
self.sdf_forecasted = Model(self.train, self.seq_len, self.test_size)
def evaluate(self):
return evaluate(self.test, self.sdf_forecasted)
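As mentioned earlier, the sequence length of 15 was chosen via a grid search. The commented-out cell below is a minimal sketch of how such a search could be run with the helpers defined above; the candidate window sizes and the example county FIPS are illustrative assumptions, not the exact search performed for this notebook.
# Hypothetical grid search over candidate sequence lengths (illustrative values)
# scores = {}
# for sl in [7, 15, 30, 60]:
#     train, test = CountyData(1001, test_size=14)  # example county (Autauga)
#     forecasted = Model(train, sl, test_size=14)
#     scores[sl] = evaluate(test, forecasted)["V_RMSE"]
# best_seq_len = min(scores, key=scores.get)  # keep the lowest-RMSE window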
Training the model for all of the counties and saving the metrics in a dictionary.
dct = {}
for county in counties:
covidmodel = CovidModel(county)
covidmodel.CountyData()
covidmodel.Model()
dct[county] = covidmodel.evaluate()
epoch | train_loss | valid_loss | time |
---|---|---|---|
0 | 0.022296 | 0.072083 | 00:00 |
1 | 0.021528 | 0.056567 | 00:00 |
2 | 0.019676 | 0.056948 | 00:00 |
3 | 0.017958 | 0.063523 | 00:00 |
4 | 0.016059 | 0.051571 | 00:00 |
5 | 0.014255 | 0.022249 | 00:00 |
6 | 0.012279 | 0.007439 | 00:00 |
7 | 0.010936 | 0.004679 | 00:00 |
8 | 0.010135 | 0.003242 | 00:00 |
9 | 0.008722 | 0.002114 | 00:00 |
10 | 0.007479 | 0.001448 | 00:00 |
11 | 0.006344 | 0.000820 | 00:00 |
12 | 0.005539 | 0.000626 | 00:00 |
13 | 0.004810 | 0.000156 | 00:00 |
14 | 0.004176 | 0.000091 | 00:00 |
15 | 0.003613 | 0.000100 | 00:00 |
16 | 0.003132 | 0.000077 | 00:00 |
17 | 0.002894 | 0.000230 | 00:00 |
18 | 0.002629 | 0.000161 | 00:00 |
19 | 0.002283 | 0.000087 | 00:00 |
20 | 0.002073 | 0.000320 | 00:00 |
21 | 0.001855 | 0.000131 | 00:00 |
22 | 0.001651 | 0.000139 | 00:00 |
23 | 0.001483 | 0.000202 | 00:00 |
24 | 0.001333 | 0.000078 | 00:00 |
25 | 0.001308 | 0.000239 | 00:00 |
26 | 0.001160 | 0.000315 | 00:00 |
27 | 0.001031 | 0.000212 | 00:00 |
28 | 0.000994 | 0.000078 | 00:00 |
29 | 0.001058 | 0.000162 | 00:00 |
30 | 0.000948 | 0.000865 | 00:00 |
31 | 0.000862 | 0.000587 | 00:00 |
32 | 0.000797 | 0.000192 | 00:00 |
33 | 0.000722 | 0.000051 | 00:00 |
34 | 0.000668 | 0.000104 | 00:00 |
35 | 0.000648 | 0.000066 | 00:00 |
36 | 0.000594 | 0.000543 | 00:00 |
37 | 0.000642 | 0.000080 | 00:00 |
38 | 0.000594 | 0.000081 | 00:00 |
39 | 0.000528 | 0.000192 | 00:00 |
40 | 0.000490 | 0.000520 | 00:00 |
41 | 0.000479 | 0.000175 | 00:00 |
42 | 0.000474 | 0.000509 | 00:00 |
43 | 0.000467 | 0.000559 | 00:00 |
44 | 0.000547 | 0.000214 | 00:00 |
45 | 0.000499 | 0.000077 | 00:00 |
46 | 0.000451 | 0.000152 | 00:00 |
47 | 0.000444 | 0.001184 | 00:00 |
48 | 0.000469 | 0.000031 | 00:00 |
49 | 0.000413 | 0.000155 | 00:00 |
50 | 0.000397 | 0.000246 | 00:00 |
51 | 0.000362 | 0.000137 | 00:00 |
52 | 0.000329 | 0.000027 | 00:00 |
53 | 0.000295 | 0.000019 | 00:00 |
54 | 0.000265 | 0.000049 | 00:00 |
55 | 0.000239 | 0.000032 | 00:00 |
56 | 0.000224 | 0.000048 | 00:00 |
57 | 0.000247 | 0.000253 | 00:00 |
58 | 0.000257 | 0.000026 | 00:00 |
59 | 0.000266 | 0.000036 | 00:00 |
60 | 0.000248 | 0.000051 | 00:00 |
61 | 0.000224 | 0.000021 | 00:00 |
62 | 0.000213 | 0.000013 | 00:00 |
63 | 0.000207 | 0.000080 | 00:00 |
64 | 0.000229 | 0.000025 | 00:00 |
65 | 0.000210 | 0.000038 | 00:00 |
66 | 0.000195 | 0.000048 | 00:00 |
67 | 0.000183 | 0.000049 | 00:00 |
68 | 0.000206 | 0.000037 | 00:00 |
69 | 0.000188 | 0.000019 | 00:00 |
70 | 0.000172 | 0.000014 | 00:00 |
71 | 0.000169 | 0.000020 | 00:00 |
72 | 0.000158 | 0.000017 | 00:00 |
73 | 0.000149 | 0.000028 | 00:00 |
74 | 0.000158 | 0.000022 | 00:00 |
75 | 0.000174 | 0.000066 | 00:00 |
76 | 0.000180 | 0.000097 | 00:00 |
77 | 0.000176 | 0.000171 | 00:00 |
78 | 0.000170 | 0.000046 | 00:00 |
79 | 0.000161 | 0.000037 | 00:00 |
80 | 0.000170 | 0.000012 | 00:00 |
81 | 0.000164 | 0.000024 | 00:00 |
82 | 0.000158 | 0.000030 | 00:00 |
83 | 0.000155 | 0.000033 | 00:00 |
84 | 0.000147 | 0.000022 | 00:00 |
85 | 0.000137 | 0.000019 | 00:00 |
86 | 0.000136 | 0.000018 | 00:00 |
87 | 0.000126 | 0.000011 | 00:00 |
88 | 0.000122 | 0.000013 | 00:00 |
89 | 0.000117 | 0.000016 | 00:00 |
90 | 0.000109 | 0.000013 | 00:00 |
91 | 0.000107 | 0.000010 | 00:00 |
92 | 0.000106 | 0.000008 | 00:00 |
93 | 0.000099 | 0.000010 | 00:00 |
94 | 0.000103 | 0.000011 | 00:00 |
95 | 0.000099 | 0.000019 | 00:00 |
96 | 0.000099 | 0.000036 | 00:00 |
97 | 0.000103 | 0.000018 | 00:00 |
98 | 0.000097 | 0.000012 | 00:00 |
99 | 0.000109 | 0.000014 | 00:00 |
Result Visualization
Finally, the actual and forecasted values are plotted to visualize their distribution over the validation period, with the orange line representing the forecasted values and the blue line representing the actual values.
# Specifying a few counties for visualizing the results
viz_counties = [1007, 1113]
for county in viz_counties:
result_df = pd.DataFrame(dct[county])
plt.figure(figsize=(20, 5))
plt.plot(result_df["DATE"], result_df[["cases_actual", "cases_predicted"]])
plt.xlabel("Date")
plt.ylabel("Covid Cases")
plt.legend(["Cases_Actual", "Cases_Predicted"], loc="upper left")
plt.title(str(county) + ": Covid Forecast Result")
plt.show()
# Here, the Alabama counties feature layer is accessed and converted to a spatial dataframe
item = gis.content.get("41e8eb46285d4e1f85ee6e826b05e077")
flayer = item.layers[0]
f_sdf = flayer.query().sdf
# Adding the RMSE and MAE from the output dictionary to the spatial dataframe
RMSE = []
MAE = []
for county in counties:
MAE.append(dct[county]["MAE"])
RMSE.append(dct[county]["V_RMSE"])
f_sdf = f_sdf.assign(RMSE=RMSE, MAE=MAE)
Next, we will publish this spatial dataframe as a feature layer.
published_sdf = gis.content.import_data(f_sdf, title='Alabama Covid Time Series Model Metrics')
published_sdf
Next, we will access the published output layer by its item id and add it to a map.
item = gis.content.get("9d197a4870a1479c81ddfd6b739816da")
map1 = gis.map("Alabama")
map1.content.add(item)
map1.legend.enabled = True
map1
From the map, it can be seen that most of the counties have an RMSE ranging from 18-400 cases, represented by the blue polygons. The few green and cream-colored counties have a higher RMSE, and the one red county has the maximum RMSE. This indicates that InceptionTime is performing well for this state, and that other backbones can be tried to further reduce the RMSE in the counties where it is higher.
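As a sketch of that follow-up, a higher-RMSE county could be re-fit with another backbone and its validation RMSE compared. In the commented-out cell below, the county FIPS (1073) and the choice of the ResCNN backbone are illustrative assumptions.
# Hypothetical example: re-fit one county with the ResCNN backbone and compare RMSE
# train, test = CountyData(1073, test_size=14)
# data = prepare_tabulardata(train, variable_predict="cases", index_field="date", seed=42)
# tsmodel = TimeSeriesModel(data, seq_len=15, model_arch="ResCNN")
# tsmodel.fit(100, lr=tsmodel.lr_find(), checkpoint=False)
# forecasted = tsmodel.predict(train, prediction_type="dataframe", number_of_predictions=14)
# evaluate(test, forecasted)["V_RMSE"]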
Conclusion
This study conducted a univariate time series analysis using the deep learning TimeSeriesModel from the arcgis.learn library and forecasted the COVID-19 confirmed cases for the counties in Alabama. The raw data was first smoothed with a seven-day moving average to remove sudden spikes. The methodology then included preparing a time series dataset using the prepare_tabulardata() method, followed by modeling, predicting, and validating against the test dataset. The TimeSeriesModel from arcgis.learn includes backbones, such as InceptionTime, ResCNN, ResNet, and FCN, that do not need fine-tuning of multiple hyperparameters before fitting the model. Our method produced reasonably accurate results, and users can change the sequence length or backbone when forecasting for other areas.