Introduction
COVID-19 forecasting has been vital for efficiently planning health care policy during the pandemic. Many forecasting models exist, some of which require explanatory variables such as population or social distancing measures. This notebook uses the deep learning TimeSeriesModel from arcgis.learn to model the data and forecast future trends.
To demonstrate the utility of this method, this notebook will analyze confirmed cases for all counties in Alabama. The dataset contains the unique county FIPS ID, county name, state ID, and the cumulative confirmed cases by date for each county. The dataset ranges from January 2020 to February 2022, with the data from January 2022 to February 2022 being used to validate the quality of the forecast. The approach used in this analysis for forecasting future COVID-19 cases involves: (a) data processing (calculating the seven-day moving average to remove noise and vertically stacking the county data), (b) creating functions for train-test splitting, tabular data preparation, model fitting using InceptionTime with a sequence length of 15, and forecasting, and (c) validating and visualizing the predicted data and results.
Importing Libraries
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
import sklearn.metrics as metrics
from arcgis.gis import GIS
from arcgis.learn import TimeSeriesModel, prepare_tabulardata
Connecting to your GIS
gis = GIS("home")
Accessing the dataset
The latest dataset can be downloaded from USAFacts: https://usafacts.org/visualizations/coronavirus-covid-19-spread-map/
# Access the data table
data_table = gis.content.get("b222748b885e4741839f3787f207b2b1")
data_table
# Download the csv and save it in a local folder
data_path = data_table.get_data()
# Read the csv file
confirmed = pd.read_csv(data_path)
confirmed.head()
| | countyFIPS | County Name | State | StateFIPS | 2020-01-22 | 2020-01-23 | 2020-01-24 | 2020-01-25 | 2020-01-26 | 2020-01-27 | ... | 2022-01-23 | 2022-01-24 | 2022-01-25 | 2022-01-26 | 2022-01-27 | 2022-01-28 | 2022-01-29 | 2022-01-30 | 2022-01-31 | 2022-02-01 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | Statewide Unallocated | AL | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 1001 | Autauga County | AL | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 13019 | 13251 | 13251 | 13251 | 13251 | 13251 | 13251 | 13251 | 13251 | 14826 |
2 | 1003 | Baldwin County | AL | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 49168 | 50313 | 50313 | 50313 | 50313 | 50313 | 50313 | 50313 | 50313 | 53083 |
3 | 1005 | Barbour County | AL | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 4902 | 5054 | 5054 | 5054 | 5054 | 5054 | 5054 | 5054 | 5054 | 5297 |
4 | 1007 | Bibb County | AL | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 5663 | 5795 | 5795 | 5795 | 5795 | 5795 | 5795 | 5795 | 5795 | 6158 |
5 rows × 746 columns
Raw data cleaning
# Extract the data for the state of Alabama (county FIPS codes 1001-1133)
confirmed_AL = confirmed.loc[
(confirmed["countyFIPS"] >= 1000) & (confirmed["countyFIPS"] <= 1133)]
# Stack the table for cumulative confirmed cases
confirmed_AL = confirmed_AL.set_index(["countyFIPS"])
confirmed_AL = confirmed_AL.drop(columns=["State", "County Name", "StateFIPS"])
confirmed_stacked_df = (
confirmed_AL.stack()
.reset_index()
.rename(columns={"level_1": "OriginalDate", 0: "ConfirmedCases"})
)
confirmed_stacked_df
| | countyFIPS | OriginalDate | ConfirmedCases |
---|---|---|---|
0 | 1001 | 2020-01-22 | 0 |
1 | 1001 | 2020-01-23 | 0 |
2 | 1001 | 2020-01-24 | 0 |
3 | 1001 | 2020-01-25 | 0 |
4 | 1001 | 2020-01-26 | 0 |
... | ... | ... | ... |
49709 | 1133 | 2022-01-28 | 6323 |
49710 | 1133 | 2022-01-29 | 6323 |
49711 | 1133 | 2022-01-30 | 6323 |
49712 | 1133 | 2022-01-31 | 6323 |
49713 | 1133 | 2022-02-01 | 7057 |
49714 rows × 3 columns
# Convert the OriginalDate strings into a datetime field
confirmed_stacked_df["DateTime"] = pd.to_datetime(
confirmed_stacked_df["OriginalDate"], infer_datetime_format=True
)
confirmed_stacked_df = confirmed_stacked_df.drop(columns=["OriginalDate"])
confirmed_stacked_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49714 entries, 0 to 49713
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   countyFIPS      49714 non-null  int64
 1   ConfirmedCases  49714 non-null  int64
 2   DateTime        49714 non-null  datetime64[ns]
dtypes: datetime64[ns](1), int64(2)
memory usage: 1.1 MB
Calculate the moving average for confirmed cases
Here, we calculate a 7-day simple moving average to smooth out the data and remove noise caused by spikes in testing results.
# Set the moving average window to 7 days
SMA_Window = 7
# Copy the dataframe and map the positional index of each column to be smoothed
df = confirmed_stacked_df.copy()
cols = {1: "ConfirmedCases"}
# Calculate the rolling mean for each county separately
for fips in df.countyFIPS.unique():
    for col in cols:
        field = f"{cols[col]}_SMA{SMA_Window}"
        df.loc[df["countyFIPS"] == fips, field] = (
            df.loc[df["countyFIPS"] == fips]
            .iloc[:, col]
            .rolling(window=SMA_Window)
            .mean()
        )
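For reference, the same smoothing can also be written without the explicit loop by grouping on the county field. The commented-out cell below is an equivalent pandas sketch of the calculation above, not an additional step in the workflow.
# Equivalent vectorized form: compute the 7-day rolling mean per county via groupby
# df["ConfirmedCases_SMA7"] = (
#     df.groupby("countyFIPS")["ConfirmedCases"]
#     .transform(lambda s: s.rolling(window=SMA_Window).mean())
# )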
Cut off the first 6 days
As the first moving average value starts from the seventh day, we will disregard the first 6 days.
firstMADay = df["DateTime"].iloc[0] + pd.DateOffset(days=SMA_Window - 1)
firstMADay
Timestamp('2020-01-28 00:00:00')
df_FirstMADay = df.loc[df["DateTime"] >= firstMADay]
df_FirstMADay.reset_index(drop=True, inplace=True)
df_FirstMADay
| | countyFIPS | ConfirmedCases | DateTime | ConfirmedCases_SMA7 |
---|---|---|---|---|
0 | 1001 | 0 | 2020-01-28 | 0.000000 |
1 | 1001 | 0 | 2020-01-29 | 0.000000 |
2 | 1001 | 0 | 2020-01-30 | 0.000000 |
3 | 1001 | 0 | 2020-01-31 | 0.000000 |
4 | 1001 | 0 | 2020-02-01 | 0.000000 |
... | ... | ... | ... | ... |
49307 | 1133 | 6323 | 2022-01-28 | 6248.714286 |
49308 | 1133 | 6323 | 2022-01-29 | 6285.857143 |
49309 | 1133 | 6323 | 2022-01-30 | 6323.000000 |
49310 | 1133 | 6323 | 2022-01-31 | 6323.000000 |
49311 | 1133 | 7057 | 2022-02-01 | 6427.857143 |
49312 rows × 4 columns
Time series data preprocessing
Preprocessing the data for time series modeling includes selecting the required columns, converting the time field into the date-time format, and collecting all of the counties in the state.
# Selecting the required columns for modeling
df = df_FirstMADay[["DateTime", "ConfirmedCases_SMA7", "countyFIPS"]].copy()
df.columns = ["date", "cases", "countyFIPS"]
df.date = pd.to_datetime(df.date, format="%Y-%m-%d")
df.tail()
| | date | cases | countyFIPS |
---|---|---|---|
49307 | 2022-01-28 | 6248.714286 | 1133 |
49308 | 2022-01-29 | 6285.857143 | 1133 |
49309 | 2022-01-30 | 6323.000000 | 1133 |
49310 | 2022-01-31 | 6323.000000 | 1133 |
49311 | 2022-02-01 | 6427.857143 | 1133 |
Collecting the counties of Alabama
# This cell collects all of the counties by their unique FIPS IDs.
counties = df.countyFIPS.unique()
counties = [county for county in counties if county != 0]
len(counties)
67
The next cell can be used to forecast for a specific county. You can declare the county to forecast by using its FIPS ID.
# counties = df.countyFIPS.unique()
# counties = [county for county in counties if county == 1001]
Time series modeling and forecasting
Here, we will create the functions for preparing the tabular data, modeling, and forecasting that will later be called for each county.
# This function selects the specified county's data and splits it into train and test sets
def CountyData(county, test_size):
data_file = df[df["countyFIPS"] == county]
data_file.reset_index(inplace=True, drop=True)
train, test = train_test_split(data_file, test_size=test_size, shuffle=False)
return train, test
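For example, the commented-out cell below holds out the last 14 days of a single county for validation; Autauga County (FIPS 1001) is used purely for illustration.
# Example usage: the final 14 days of FIPS 1001 become the test set
# train, test = CountyData(1001, test_size=14)
# len(train), len(test)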
The next function prepares the tabular data and initializes the model from the available set of backbones (InceptionTime, ResCNN, ResNet, and FCN). The sequence length here is set to 15, a value found by performing a grid search (a sketch of such a search appears after the helper definitions below). The model is then trained using the model.fit method, which is provided with the number of training epochs and the learning rate.
def Model(train, seq_len, test_size):
data = prepare_tabulardata(
train, variable_predict="cases", index_field="date", seed=42
) # Preparing the tabular data
tsmodel = TimeSeriesModel(
data, seq_len=seq_len, model_arch="InceptionTime"
) # Model initialization
lr_rate = tsmodel.lr_find() # Finding the Learning rate
tsmodel.fit(100, lr=lr_rate, checkpoint=False) # Model training
sdf_forecasted = tsmodel.predict(
train, prediction_type="dataframe", number_of_predictions=test_size
) # Forecasting using the trained TimeSeriesModel
return sdf_forecasted
# This function evaluates the model metrics and returns them in a dictionary;
# the last 14 forecasted values correspond to the 14-day test window
def evaluate(test, sdf_forecasted):
r2_test = r2_score(test["cases"], sdf_forecasted["cases_results"][-14:])
mse = metrics.mean_squared_error(
test["cases"], sdf_forecasted["cases_results"][-14:]
)
mae = metrics.mean_absolute_error(
test["cases"], sdf_forecasted["cases_results"][-14:]
)
return {
"DATE": test["date"],
"cases_actual": test["cases"],
"cases_predicted": sdf_forecasted["cases_results"][-14:],
"R-square": round(r2_test, 2),
"V_RMSE": round(np.sqrt(mse), 4),
"MAE": round(mae, 4),
}
# This class calls all the defined functions
class CovidModel(object):
seq_len = 15
test_size = 14
def __init__(self, county):
self.county = county
def CountyData(self):
self.train, self.test = CountyData(self.county, self.test_size)
def Model(self):
self.sdf_forecasted = Model(self.train, self.seq_len, self.test_size)
def evaluate(self):
return evaluate(self.test, self.sdf_forecasted)
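As mentioned earlier, the sequence length of 15 was chosen via a grid search. The commented-out cell below is a minimal sketch of how such a search could be run with the helpers defined above; the candidate window sizes and the example county FIPS are illustrative assumptions, not the exact search performed for this notebook.
# Hypothetical grid search over candidate sequence lengths (illustrative values)
# scores = {}
# for sl in [7, 15, 30, 60]:
#     train, test = CountyData(1001, test_size=14)  # example county (Autauga)
#     forecasted = Model(train, sl, test_size=14)
#     scores[sl] = evaluate(test, forecasted)["V_RMSE"]
# best_seq_len = min(scores, key=scores.get)  # keep the lowest-RMSE window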
Training the model for all of the counties and saving the metrics in a dictionary.
dct = {}
for county in counties:
covidmodel = CovidModel(county)
covidmodel.CountyData()
covidmodel.Model()
dct[county] = covidmodel.evaluate()
epoch | train_loss | valid_loss | time |
---|---|---|---|
0 | 0.022296 | 0.072083 | 00:00 |
1 | 0.021528 | 0.056567 | 00:00 |
2 | 0.019676 | 0.056948 | 00:00 |
3 | 0.017958 | 0.063523 | 00:00 |
4 | 0.016059 | 0.051571 | 00:00 |
5 | 0.014255 | 0.022249 | 00:00 |
6 | 0.012279 | 0.007439 | 00:00 |
7 | 0.010936 | 0.004679 | 00:00 |
8 | 0.010135 | 0.003242 | 00:00 |
9 | 0.008722 | 0.002114 | 00:00 |
10 | 0.007479 | 0.001448 | 00:00 |
11 | 0.006344 | 0.000820 | 00:00 |
12 | 0.005539 | 0.000626 | 00:00 |
13 | 0.004810 | 0.000156 | 00:00 |
14 | 0.004176 | 0.000091 | 00:00 |
15 | 0.003613 | 0.000100 | 00:00 |
16 | 0.003132 | 0.000077 | 00:00 |
17 | 0.002894 | 0.000230 | 00:00 |
18 | 0.002629 | 0.000161 | 00:00 |
19 | 0.002283 | 0.000087 | 00:00 |
20 | 0.002073 | 0.000320 | 00:00 |
21 | 0.001855 | 0.000131 | 00:00 |
22 | 0.001651 | 0.000139 | 00:00 |
23 | 0.001483 | 0.000202 | 00:00 |
24 | 0.001333 | 0.000078 | 00:00 |
25 | 0.001308 | 0.000239 | 00:00 |
26 | 0.001160 | 0.000315 | 00:00 |
27 | 0.001031 | 0.000212 | 00:00 |
28 | 0.000994 | 0.000078 | 00:00 |
29 | 0.001058 | 0.000162 | 00:00 |
30 | 0.000948 | 0.000865 | 00:00 |
31 | 0.000862 | 0.000587 | 00:00 |
32 | 0.000797 | 0.000192 | 00:00 |
33 | 0.000722 | 0.000051 | 00:00 |
34 | 0.000668 | 0.000104 | 00:00 |
35 | 0.000648 | 0.000066 | 00:00 |
36 | 0.000594 | 0.000543 | 00:00 |
37 | 0.000642 | 0.000080 | 00:00 |
38 | 0.000594 | 0.000081 | 00:00 |
39 | 0.000528 | 0.000192 | 00:00 |
40 | 0.000490 | 0.000520 | 00:00 |
41 | 0.000479 | 0.000175 | 00:00 |
42 | 0.000474 | 0.000509 | 00:00 |
43 | 0.000467 | 0.000559 | 00:00 |
44 | 0.000547 | 0.000214 | 00:00 |
45 | 0.000499 | 0.000077 | 00:00 |
46 | 0.000451 | 0.000152 | 00:00 |
47 | 0.000444 | 0.001184 | 00:00 |
48 | 0.000469 | 0.000031 | 00:00 |
49 | 0.000413 | 0.000155 | 00:00 |
50 | 0.000397 | 0.000246 | 00:00 |
51 | 0.000362 | 0.000137 | 00:00 |
52 | 0.000329 | 0.000027 | 00:00 |
53 | 0.000295 | 0.000019 | 00:00 |
54 | 0.000265 | 0.000049 | 00:00 |
55 | 0.000239 | 0.000032 | 00:00 |
56 | 0.000224 | 0.000048 | 00:00 |
57 | 0.000247 | 0.000253 | 00:00 |
58 | 0.000257 | 0.000026 | 00:00 |
59 | 0.000266 | 0.000036 | 00:00 |
60 | 0.000248 | 0.000051 | 00:00 |
61 | 0.000224 | 0.000021 | 00:00 |
62 | 0.000213 | 0.000013 | 00:00 |
63 | 0.000207 | 0.000080 | 00:00 |
64 | 0.000229 | 0.000025 | 00:00 |
65 | 0.000210 | 0.000038 | 00:00 |
66 | 0.000195 | 0.000048 | 00:00 |
67 | 0.000183 | 0.000049 | 00:00 |
68 | 0.000206 | 0.000037 | 00:00 |
69 | 0.000188 | 0.000019 | 00:00 |
70 | 0.000172 | 0.000014 | 00:00 |
71 | 0.000169 | 0.000020 | 00:00 |
72 | 0.000158 | 0.000017 | 00:00 |
73 | 0.000149 | 0.000028 | 00:00 |
74 | 0.000158 | 0.000022 | 00:00 |
75 | 0.000174 | 0.000066 | 00:00 |
76 | 0.000180 | 0.000097 | 00:00 |
77 | 0.000176 | 0.000171 | 00:00 |
78 | 0.000170 | 0.000046 | 00:00 |
79 | 0.000161 | 0.000037 | 00:00 |
80 | 0.000170 | 0.000012 | 00:00 |
81 | 0.000164 | 0.000024 | 00:00 |
82 | 0.000158 | 0.000030 | 00:00 |
83 | 0.000155 | 0.000033 | 00:00 |
84 | 0.000147 | 0.000022 | 00:00 |
85 | 0.000137 | 0.000019 | 00:00 |
86 | 0.000136 | 0.000018 | 00:00 |
87 | 0.000126 | 0.000011 | 00:00 |
88 | 0.000122 | 0.000013 | 00:00 |
89 | 0.000117 | 0.000016 | 00:00 |
90 | 0.000109 | 0.000013 | 00:00 |
91 | 0.000107 | 0.000010 | 00:00 |
92 | 0.000106 | 0.000008 | 00:00 |
93 | 0.000099 | 0.000010 | 00:00 |
94 | 0.000103 | 0.000011 | 00:00 |
95 | 0.000099 | 0.000019 | 00:00 |
96 | 0.000099 | 0.000036 | 00:00 |
97 | 0.000103 | 0.000018 | 00:00 |
98 | 0.000097 | 0.000012 | 00:00 |
99 | 0.000109 | 0.000014 | 00:00 |
Result Visualization
Finally, the actual and forecasted values are plotted to visualize their distribution over the validation period, with the orange line representing the forecasted values and the blue line representing the actual values.
# Specifying a few counties for visualizing the results
viz_counties = [1007, 1113]
for county in viz_counties:
result_df = pd.DataFrame(dct[county])
plt.figure(figsize=(20, 5))
plt.plot(result_df["DATE"], result_df[["cases_actual", "cases_predicted"]])
plt.xlabel("Date")
plt.ylabel("Covid Cases")
plt.legend(["Cases_Actual", "Cases_Predicted"], loc="upper left")
plt.title(str(county) + ": Covid Forecast Result")
plt.show()
# Here, the Alabama counties feature layer is accessed and converted to a spatial dataframe
item = gis.content.get("41e8eb46285d4e1f85ee6e826b05e077")
flayer = item.layers[0]
f_sdf = flayer.query().sdf
# Adding the RMSE and MAE from the output dictionary to the spatial dataframe
RMSE = []
MAE = []
for county in counties:
MAE.append(dct[county]["MAE"])
RMSE.append(dct[county]["V_RMSE"])
f_sdf = f_sdf.assign(RMSE=RMSE, MAE=MAE)
Next, we will publish this spatial dataframe as a feature layer.
published_sdf = gis.content.import_data(f_sdf, title='Alabama Covid Time Series Model Metrics')
published_sdf
Next, we will access the published output layer by its item id and add it to a map.
item = gis.content.get("9d197a4870a1479c81ddfd6b739816da")
map1 = gis.map("Alabama")
map1.content.add(item)
map1.legend.enabled = True
map1
From the map, it can be seen that most of the counties have an RMSE ranging from 18-400 cases, represented by the blue polygons. The few green and cream-colored counties have a higher RMSE, and the one red county has the maximum RMSE. This indicates that InceptionTime is performing well for this state, and that other backbones can be tried to further reduce the RMSE in the counties where it is higher.
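As a sketch of that follow-up, a higher-RMSE county could be re-fit with another backbone and its validation RMSE compared. In the commented-out cell below, the county FIPS (1073) and the choice of the ResCNN backbone are illustrative assumptions.
# Hypothetical example: re-fit one county with the ResCNN backbone and compare RMSE
# train, test = CountyData(1073, test_size=14)
# data = prepare_tabulardata(train, variable_predict="cases", index_field="date", seed=42)
# tsmodel = TimeSeriesModel(data, seq_len=15, model_arch="ResCNN")
# tsmodel.fit(100, lr=tsmodel.lr_find(), checkpoint=False)
# forecasted = tsmodel.predict(train, prediction_type="dataframe", number_of_predictions=14)
# evaluate(test, forecasted)["V_RMSE"]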
Conclusion
This study conducted a univariate time series analysis using the deep learning TimeSeriesModel from the arcgis.learn library and forecasted the COVID-19 confirmed cases for the counties in Alabama. The raw data was first smoothed with a seven-day moving average to remove sudden spikes. The methodology then included preparing a time series dataset using the prepare_tabulardata() method, followed by modeling, predicting, and validating against the test dataset. The TimeSeriesModel from arcgis.learn includes backbones, such as InceptionTime, ResCNN, ResNet, and FCN, that do not need fine-tuning of multiple hyperparameters before fitting the model. Our method produced reasonably accurate results, and users can change the sequence length or backbone when forecasting for other areas.