Introduction
Reducing carbon footprints by moving from fossil fuels to renewable energy sources has recently become a major priority for cities. Local governments around the world, such as the City of Calgary in Canada, are leading this change by installing solar power plants on rooftops or within the grounds of their city utilities, to power their own operations and become energy independent.
With this in mind, this notebook predicts the daily, and hence the annual, solar energy generation of a solar power station at a site from local weather information and site characteristics. The hypothesis is that weather parameters such as temperature, wind speed, vapor pressure, solar radiation, day length, precipitation, and snowfall, along with the altitude of the site, affect the daily generation of solar energy.
Accordingly, these variables are used to train a model on the actual power generated by solar stations located in Calgary, which can then be used to predict solar generation for prospective solar plants at other locations. The total energy generated also depends on the capacity of the station: a 100 kWp plant will generate more energy than a 50 kWp plant, so the capacity of the plant is taken into account in the final output.
Imports
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
from pandas import read_csv
from datetime import datetime
from IPython.display import Image, HTML
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn.preprocessing import MinMaxScaler
from sklearn.compose import make_column_transformer
from sklearn.metrics import r2_score
import arcgis
from arcgis.gis import GIS
from arcgis.learn import FullyConnectedNetwork, MLModel, prepare_tabulardata
Connecting to ArcGIS
gis = GIS(profile="your_online_profile")
Accessing & Visualizing datasets
The primary data used for this sample are as follows:
Out of the several solar photovoltaic power plants in the City of Calgary, 11 were selected for the study. The dataset contains two components:
1) Daily solar energy production for each power plant from September 2015 to December 2019.
2) Corresponding daily weather measurements for the given sites.
The datasets were obtained from multiple sources (see Data resources) and preprocessed to obtain the main dataset used here. Two feature layers were subsequently created from it, one for training and the other for validation.
Training Set
It consists of data from 10 solar sites for training the model. The feature layer containing the data is accessed from the ArcGIS portal and visualized as follows:
# Access the solar dataset feature layer for training, without the Southland solar plant, which is held out for validation
calgary_no_southland_solar = gis.content.search('calgary_no_southland_solar owner:api_data_owner', 'feature layer')[0]
calgary_no_southland_solar
# Access the layer from the feature layer
calgary_no_southland_solar_layer = calgary_no_southland_solar.layers[0]
# Plot location of the 10 Solar sites in Calgary to be used for training
m1 = gis.map('calgary', zoomlevel=10)
m1.add_layer(calgary_no_southland_solar_layer)
m1
The map above shows the 10 power plant locations that are used for collecting the training data.
# Visualize the dataframe
calgary_no_southland_solar_layer_sdf = calgary_no_southland_solar_layer.query().sdf
calgary_no_southland_solar_layer_sdf = calgary_no_southland_solar_layer_sdf[['FID','date','ID','solar_plan','altitude_m',
                                                                             'latitude','longitude','wind_speed','dayl__s_',
                                                                             'prcp__mm_d','srad__W_m_','swe__kg_m_', 'tmax__deg',
                                                                             'tmin__deg','vp__Pa_','kWh_filled','capacity_f',
                                                                             'SHAPE']]
calgary_no_southland_solar_layer_sdf.head()
 | FID | date | ID | solar_plan | altitude_m | latitude | longitude | wind_speed | dayl__s_ | prcp__mm_d | srad__W_m_ | swe__kg_m_ | tmax__deg | tmin__deg | vp__Pa_ | kWh_filled | capacity_f | SHAPE |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2017-12-24 | 355827 | Glenmore Water Treatment Plant | 1095 | 51.003078 | -114.100571 | 7.20467 | 27648.0 | 1 | 108.800003 | 12 | -10.5 | -21.0 | 120 | 1.242357 | 0.000177 | {"x": -12701617.407282012, "y": 6621838.159138... |
1 | 2 | 2017-12-25 | 355827 | Glenmore Water Treatment Plant | 1095 | 51.003078 | -114.100571 | 3.385235 | 27648.0 | 1 | 115.199997 | 12 | -18.0 | -29.5 | 40 | 2.477714 | 0.000354 | {"x": -12701617.407282012, "y": 6621838.159138... |
2 | 3 | 2017-12-26 | 355827 | Glenmore Water Treatment Plant | 1095 | 51.003078 | -114.100571 | 5.076316 | 27648.0 | 0 | 118.400002 | 12 | -20.0 | -32.0 | 40 | 3.713071 | 0.00053 | {"x": -12701617.407282012, "y": 6621838.159138... |
3 | 4 | 2017-12-27 | 355827 | Glenmore Water Treatment Plant | 1095 | 51.003078 | -114.100571 | 5.617623 | 27648.0 | 0 | 96.0 | 12 | -18.0 | -26.5 | 80 | 4.948429 | 0.000707 | {"x": -12701617.407282012, "y": 6621838.159138... |
4 | 5 | 2017-12-28 | 355827 | Glenmore Water Treatment Plant | 1095 | 51.003078 | -114.100571 | 2.561512 | 27648.0 | 0 | 118.400002 | 12 | -17.0 | -28.5 | 40 | 6.183786 | 0.000883 | {"x": -12701617.407282012, "y": 6621838.159138... |
In the table above, each row represents one day at one solar site between September 2015 and December 2019. The date is given in the column named date, and the field solar_plan contains the name of the solar site.
The primary information is the daily energy generation in kilowatt-hours (kWh), given in the field kWh_filled, for each of the 10 selected solar photovoltaic power plants in the City of Calgary. The field capacity_f is the capacity factor, obtained by normalizing kWh_filled by the peak capacity of each solar photovoltaic site; it is used here as the dependent variable.
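The normalization can be illustrated with a minimal sketch. It assumes that the capacity factor is the daily energy divided by the energy the plant would deliver at peak output for 24 hours; this definition is inferred from the rescaling by 24 and by the peak capacity used later in this notebook, and the function names are illustrative only. The inputs are two kWh_filled values from the Southland Leisure Centre table shown later, whose peak capacity is 153 kWp.

```python
HOURS_PER_DAY = 24


def capacity_factor(daily_kwh, peak_capacity_kw):
    """Daily energy as a fraction of the yield at peak output for 24 hours (assumed definition)."""
    return daily_kwh / (peak_capacity_kw * HOURS_PER_DAY)


def daily_energy_kwh(cap_factor, peak_capacity_kw):
    """Inverse of capacity_factor: rescale a capacity factor back to kWh per day."""
    return cap_factor * peak_capacity_kw * HOURS_PER_DAY


# Southland Leisure Centre: peak capacity of 153 kWp, two daily kWh_filled values
southland_peak_kw = 153
cf_day1 = capacity_factor(309.644, southland_peak_kw)
cf_day2 = capacity_factor(679.785, southland_peak_kw)
print(round(cf_day1, 6), round(cf_day2, 6))

# Round trip back to the daily energy in kWh
print(round(daily_energy_kwh(cf_day1, southland_peak_kw), 3))
```

Under this assumption the two results agree with the capacity_f values listed for the same days in the Southland table (0.084326 and 0.185127).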
In addition, the table contains the weather variables for each day at the corresponding solar plant. All of them except wind speed were obtained from MODIS and Daymet observations. These variables are as follows:
- wind_speed : Wind speed (m/sec)
- dayl__s_ : Day length (sec/day)
- prcp__mm_d : Precipitation (mm/day)
- srad__W_m_ : Shortwave radiation (W/m^2)
- swe__kg_m_ : Snow water equivalent (kg/m^2)
- tmax__deg : Maximum air temperature (degrees C)
- tmin__deg : Minimum air temperature (degrees C)
- vp__Pa_ : Water vapor pressure (Pa)
To understand how the variables are distributed over these years and how they relate to the dependent variable of daily energy produced, the data from one of the stations is plotted below.
# plot and Visualize the variables from the training set for one solar station - Hillhurst Sunnyside Community Association
hillhurst_solar = calgary_no_southland_solar_layer_sdf[calgary_no_southland_solar_layer_sdf['solar_plan']=='Hillhurst Sunnyside Community Association'].copy()
hillhurst_datetime = hillhurst_solar.set_index(hillhurst_solar['date'])
hillhurst_datetime = hillhurst_datetime.sort_index()
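# columns 7 onward are the weather variables, kWh_filled and capacity_f; the last column (SHAPE) is skipped
for i in range(7,hillhurst_datetime.shape[1]-1):
    plt.figure(figsize=(20,3))
    plt.title(hillhurst_datetime.columns[i])
    plt.plot(hillhurst_datetime[hillhurst_datetime.columns[i]])
    plt.show()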
In the plots above, each variable shows strong seasonality, and there appears to be a relationship between the dependent variable kWh_filled and the other variables. A correlation matrix is therefore computed next to check the correlation between the variables.
# checking the correlation matrix between the predictors and the dependent variable capacity_f
corr_test = calgary_no_southland_solar_layer_sdf.drop(['FID','date','ID','latitude','longitude','solar_plan','kWh_filled','SHAPE'], axis=1)
corr = corr_test.corr()
corr.style.background_gradient(cmap='Greens').format(precision=2)
 | altitude_m | wind_speed | dayl__s_ | prcp__mm_d | srad__W_m_ | swe__kg_m_ | tmax__deg | tmin__deg | vp__Pa_ | capacity_f |
---|---|---|---|---|---|---|---|---|---|---|
altitude_m | 1.00 | -0.01 | 0.04 | 0.01 | 0.03 | 0.02 | 0.02 | 0.02 | 0.02 | 0.03 |
wind_speed | -0.01 | 1.00 | -0.41 | -0.17 | -0.26 | 0.02 | -0.03 | -0.06 | -0.13 | -0.24 |
dayl__s_ | 0.04 | -0.41 | 1.00 | 0.20 | 0.78 | -0.18 | 0.72 | 0.73 | 0.60 | 0.77 |
prcp__mm_d | 0.01 | -0.17 | 0.20 | 1.00 | -0.18 | -0.07 | -0.03 | 0.10 | 0.20 | -0.04 |
srad__W_m_ | 0.03 | -0.26 | 0.78 | -0.18 | 1.00 | 0.04 | 0.69 | 0.50 | 0.28 | 0.82 |
swe__kg_m_ | 0.02 | 0.02 | -0.18 | -0.07 | 0.04 | 1.00 | -0.45 | -0.48 | -0.46 | -0.19 |
tmax__deg | 0.02 | -0.03 | 0.72 | -0.03 | 0.69 | -0.45 | 1.00 | 0.93 | 0.75 | 0.75 |
tmin__deg | 0.02 | -0.06 | 0.73 | 0.10 | 0.50 | -0.48 | 0.93 | 1.00 | 0.85 | 0.65 |
vp__Pa_ | 0.02 | -0.13 | 0.60 | 0.20 | 0.28 | -0.46 | 0.75 | 0.85 | 1.00 | 0.45 |
capacity_f | 0.03 | -0.24 | 0.77 | -0.04 | 0.82 | -0.19 | 0.75 | 0.65 | 0.45 | 1.00 |
The matrix shows that the shortwave radiation received at the site (srad__W_m_) has the highest correlation (0.82) with the dependent variable, the solar energy produced expressed as the capacity factor (capacity_f), which is expected. It is followed by day length (dayl__s_, 0.77): the longer the day, the more energy is produced. These two are closely followed by the daily maximum (tmax__deg, 0.75) and minimum (tmin__deg, 0.65) temperatures, and then by the remaining variables.
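For reference, pandas `DataFrame.corr` computes the Pearson correlation coefficient by default. The sketch below spells out that formula in plain Python. The function name is illustrative, and the inputs are the tmax__deg and tmin__deg values from the five training rows displayed earlier; five rows are far too few for a meaningful estimate, so the point is only the mechanics.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# tmax__deg and tmin__deg from the first five rows of the training table
tmax = [-10.5, -18.0, -20.0, -18.0, -17.0]
tmin = [-21.0, -29.5, -32.0, -26.5, -28.5]

r = pearson_r(tmax, tmin)
print(round(r, 2))
```

Even on this tiny sample the two temperature fields are strongly correlated, in line with the value of 0.93 reported for the full training set.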
Validation Set
This set consists of daily solar generation data from September 2015 to December 2019 for one solar site, the Southland Leisure Centre, and is used to validate the trained model:
# Access the Southland Solar Plant Dataset feature layer for validation
southland_solar = gis.content.search('southland_solar owner:api_data_owner', 'feature layer')[0]
southland_solar
# Access the layer from the feature layer
southland_solar_layer = southland_solar.layers[0]
# Plot location of the Southland Solar site in Calgary to be used for validation
m1 = gis.map('calgary', zoomlevel=10)
m1.add_layer(southland_solar_layer)
m1
# visualize the southland dataframe here
southland_solar_layer_sdf = southland_solar_layer.query().sdf
southland_solar_layer_sdf.head(2)
 | FID | Field1 | ID | solar_plan | altitude_m | latitude | longitude | wind_speed | dayl__s_ | prcp__mm_d | ... | tmin__deg | vp__Pa_ | kWh_filled | capacity_f | GlobalID | CreationDate | Creator | EditDate | Editor | SHAPE |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2019-10-03 | 164440 | Southland Leisure Centre | 1100.0 | 50.962485 | -114.108472 | 5.332239 | 40089.601562 | 0.0 | ... | -3.0 | 480.0 | 309.644 | 0.084326 | e9b0f671-d6ba-4560-b912-d635a0a129f8 | 2020-04-27 11:58:02.992 | arcgis_python | 2020-04-27 11:58:02.992 | arcgis_python | {"x": -12702497.020502415, "y": 6614660.374377... |
1 | 2 | 2019-10-04 | 164440 | Southland Leisure Centre | 1100.0 | 50.962485 | -114.108472 | 6.304829 | 40089.601562 | 0.0 | ... | -1.0 | 560.0 | 679.785 | 0.185127 | 7bde5210-a8c2-4731-9c23-e5f77c1ebc56 | 2020-04-27 11:58:02.992 | arcgis_python | 2020-04-27 11:58:02.992 | arcgis_python | {"x": -12702497.020502415, "y": 6614660.374377... |
2 rows × 23 columns
# check the total number of samples
southland_solar_layer_sdf.shape
(1590, 23)
Model Building
Once the training and validation datasets have been processed and analyzed, they are ready to be used for modeling.
In this sample, two methodologies are used for modeling:
1) FullyConnectedNetwork
- First, a deep learning framework called FullyConnectedNetwork, available in the arcgis.learn module of the ArcGIS API for Python, is used.
2) MLModel
- In the second option, a regression model from scikit-learn is implemented via the MLModel framework in arcgis.learn. This framework can deploy any regression or classification model from the library just by passing the name of the algorithm and its relevant parameters as keyword arguments.
Finally, the two methods are compared in terms of model training and validation accuracy.
Further details on FullyConnectedNetwork and MLModel are available in the Deep Learning with ArcGIS section.
1 — FullyConnectedNetwork
This is an artificial neural network model from the arcgis.learn module, which is used here for modeling.
Data Preprocessing
First, a list is made of the feature fields that will be used to predict daily solar energy generation. Fields are treated as continuous by default; a categorical variable is passed as a tuple of the field name and the value True. Here all the variables are continuous.
# Here a list is created naming all fields containing the predictors from the input feature layer
X = ['altitude_m', 'wind_speed', 'dayl__s_', 'prcp__mm_d','srad__W_m_','swe__kg_m_','tmax__deg','tmin__deg','vp__Pa_']
# importing the libraries from arcgis.learn for data preprocessing
from arcgis.learn import prepare_tabulardata
Once the explanatory variables are identified, the main preprocessing of the data is carried out by the prepare_tabulardata method from the arcgis.learn module in the ArcGIS API for Python. The function takes a feature layer or a spatial dataframe containing the dataset as input and returns a TabularDataObject that can be fed into the model.
The input parameters required for the tool are:
- input_features : feature layer or spatial dataframe containing the primary dataset
- variable_predict : field name of the y-variable in the input feature layer/dataframe
- explanatory_variables : list of the field names of the explanatory variables; a categorical variable is given as a 2-sized tuple, as mentioned above
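The shape of such a list can be sketched in plain Python. The field `panel_type` below is hypothetical and does not exist in the Calgary layers; it only shows how a categorical variable would sit next to the continuous ones. The helper `split_variables` is likewise illustrative and not part of arcgis.learn.

```python
# Continuous predictors used in this notebook
continuous = ['altitude_m', 'wind_speed', 'dayl__s_', 'prcp__mm_d', 'srad__W_m_',
              'swe__kg_m_', 'tmax__deg', 'tmin__deg', 'vp__Pa_']

# Hypothetical categorical field, marked by a (name, True) tuple
X_with_categorical = continuous + [('panel_type', True)]


def split_variables(variables):
    """Separate plain field names (continuous) from (name, True) tuples (categorical)."""
    cont, cat = [], []
    for item in variables:
        if isinstance(item, tuple):
            name, is_categorical = item
            (cat if is_categorical else cont).append(name)
        else:
            cont.append(item)
    return cont, cat


cont_fields, cat_fields = split_variables(X_with_categorical)
print(len(cont_fields), cat_fields)
```

Since every predictor in this sample is continuous, the list X defined above contains plain field names only.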
# preprocessing data using the prepare_tabulardata method - it handles imputing missing values, normalization and train-test split
data = prepare_tabulardata(calgary_no_southland_solar_layer,
                           'capacity_f',
                           explanatory_variables=X)
# visualizing the prepared data
data.show_batch()
 | altitude_m | capacity_f | dayl__s_ | prcp__mm_d | srad__W_m_ | swe__kg_m_ | tmax__deg | tmin__deg | vp__Pa_ | wind_speed |
---|---|---|---|---|---|---|---|---|---|---|
1640 | 1055 | 0.106612 | 37324.800781 | 0 | 281.600006 | 0 | 18.5 | 3.5 | 800 | 6.391703 |
1788 | 1095 | 0.070813 | 55641.601562 | 7 | 144.0 | 0 | 7.5 | 2.5 | 720 | 5.120847 |
1825 | 1095 | 0.224948 | 58752.0 | 1 | 387.200012 | 0 | 20.0 | 7.5 | 880 | 3.322512 |
4140 | 1070 | 0.016279 | 40780.800781 | 0 | 300.799988 | 36 | -11.0 | -19.5 | 120 | 3.128044 |
7897 | 1096 | 0.237411 | 50457.601562 | 0 | 441.600006 | 0 | 27.0 | 4.0 | 680 | 4.591178 |
Model Initialization
Once the data has been prepared by the prepare_tabulardata method, it is ready to be passed to the ANN for training. First, the ANN, known as FullyConnectedNetwork, is imported from arcgis.learn and initialized as follows:
# importing the model from arcgis.learn
from arcgis.learn import FullyConnectedNetwork
# Initialize the model with the data where the weights are randomly allocated
fcn = FullyConnectedNetwork(data, layers=[200,130])
Learning Rate Search
# searching for an optimal learning rate using the lr_find for passing it to the final model fitting
lr = fcn.lr_find()
lr
0.0005754399373371565
Here the learning rate suggested by the lr_find method is about 0.0006. The automatic finder takes a conservative estimate of the learning rate; an experienced user may interpret the graph and choose a better learning rate for the final training of the model.
Model Training
The model is now ready for training. The model.fit method is given the number of epochs and the learning rate returned by lr_find in the previous step:
# the model is trained for 100 epochs
fcn.fit(100, lr=lr)
epoch | train_loss | valid_loss | time |
---|---|---|---|
0 | 0.003993 | 0.002826 | 00:00 |
1 | 0.002635 | 0.002453 | 00:00 |
2 | 0.002427 | 0.002265 | 00:00 |
3 | 0.002233 | 0.002168 | 00:00 |
4 | 0.002187 | 0.002147 | 00:00 |
5 | 0.002235 | 0.002035 | 00:00 |
6 | 0.002045 | 0.001987 | 00:00 |
7 | 0.002112 | 0.001990 | 00:00 |
8 | 0.002057 | 0.001926 | 00:00 |
9 | 0.001978 | 0.001881 | 00:00 |
10 | 0.002029 | 0.001880 | 00:00 |
11 | 0.002024 | 0.001987 | 00:00 |
12 | 0.002015 | 0.001850 | 00:00 |
13 | 0.002048 | 0.001914 | 00:00 |
14 | 0.001991 | 0.001904 | 00:00 |
15 | 0.001982 | 0.001887 | 00:00 |
16 | 0.002081 | 0.002071 | 00:00 |
17 | 0.002040 | 0.001817 | 00:00 |
18 | 0.002070 | 0.001904 | 00:00 |
19 | 0.002146 | 0.001811 | 00:00 |
20 | 0.001904 | 0.001841 | 00:00 |
21 | 0.001959 | 0.001968 | 00:00 |
22 | 0.002009 | 0.001970 | 00:00 |
23 | 0.002022 | 0.002442 | 00:00 |
24 | 0.001946 | 0.001834 | 00:00 |
25 | 0.001912 | 0.001784 | 00:00 |
26 | 0.001892 | 0.001913 | 00:00 |
27 | 0.001904 | 0.001747 | 00:00 |
28 | 0.001798 | 0.001908 | 00:00 |
29 | 0.001891 | 0.001750 | 00:00 |
30 | 0.001889 | 0.001807 | 00:00 |
31 | 0.001788 | 0.001696 | 00:00 |
32 | 0.001762 | 0.001752 | 00:00 |
33 | 0.001781 | 0.001764 | 00:00 |
34 | 0.001835 | 0.001880 | 00:00 |
35 | 0.001821 | 0.001746 | 00:00 |
36 | 0.001795 | 0.001697 | 00:00 |
37 | 0.001764 | 0.001691 | 00:00 |
38 | 0.001736 | 0.001668 | 00:00 |
39 | 0.001741 | 0.001779 | 00:00 |
40 | 0.001739 | 0.001725 | 00:00 |
41 | 0.001673 | 0.001603 | 00:00 |
42 | 0.001782 | 0.001699 | 00:00 |
43 | 0.001685 | 0.001604 | 00:00 |
44 | 0.001685 | 0.001614 | 00:00 |
45 | 0.001684 | 0.001674 | 00:00 |
46 | 0.001691 | 0.001676 | 00:00 |
47 | 0.001638 | 0.001619 | 00:00 |
48 | 0.001606 | 0.001607 | 00:00 |
49 | 0.001650 | 0.001614 | 00:00 |
50 | 0.001567 | 0.001628 | 00:00 |
51 | 0.001589 | 0.001608 | 00:00 |
52 | 0.001630 | 0.001535 | 00:00 |
53 | 0.001618 | 0.001553 | 00:00 |
54 | 0.001570 | 0.001650 | 00:00 |
55 | 0.001576 | 0.001602 | 00:00 |
56 | 0.001517 | 0.001692 | 00:00 |
57 | 0.001590 | 0.001583 | 00:00 |
58 | 0.001536 | 0.001549 | 00:00 |
59 | 0.001552 | 0.001539 | 00:00 |
60 | 0.001490 | 0.001507 | 00:00 |
61 | 0.001539 | 0.001529 | 00:00 |
62 | 0.001535 | 0.001516 | 00:00 |
63 | 0.001519 | 0.001523 | 00:00 |
64 | 0.001490 | 0.001523 | 00:00 |
65 | 0.001423 | 0.001511 | 00:00 |
66 | 0.001533 | 0.001707 | 00:00 |
67 | 0.001460 | 0.001521 | 00:00 |
68 | 0.001451 | 0.001519 | 00:00 |
69 | 0.001492 | 0.001487 | 00:00 |
70 | 0.001487 | 0.001496 | 00:00 |
71 | 0.001455 | 0.001467 | 00:00 |
72 | 0.001456 | 0.001469 | 00:00 |
73 | 0.001427 | 0.001491 | 00:00 |
74 | 0.001451 | 0.001449 | 00:00 |
75 | 0.001385 | 0.001458 | 00:00 |
76 | 0.001434 | 0.001466 | 00:00 |
77 | 0.001484 | 0.001487 | 00:00 |
78 | 0.001406 | 0.001484 | 00:00 |
79 | 0.001442 | 0.001484 | 00:00 |
80 | 0.001355 | 0.001475 | 00:00 |
81 | 0.001411 | 0.001499 | 00:00 |
82 | 0.001367 | 0.001466 | 00:00 |
83 | 0.001446 | 0.001466 | 00:00 |
84 | 0.001396 | 0.001467 | 00:00 |
85 | 0.001399 | 0.001462 | 00:00 |
86 | 0.001330 | 0.001451 | 00:00 |
87 | 0.001401 | 0.001473 | 00:00 |
88 | 0.001393 | 0.001444 | 00:00 |
89 | 0.001313 | 0.001443 | 00:00 |
90 | 0.001350 | 0.001472 | 00:00 |
91 | 0.001339 | 0.001466 | 00:00 |
92 | 0.001339 | 0.001468 | 00:00 |
93 | 0.001404 | 0.001437 | 00:00 |
94 | 0.001313 | 0.001497 | 00:00 |
95 | 0.001341 | 0.001470 | 00:00 |
96 | 0.001346 | 0.001494 | 00:00 |
97 | 0.001317 | 0.001455 | 00:00 |
98 | 0.001334 | 0.001459 | 00:00 |
99 | 0.001373 | 0.001492 | 00:00 |
The training and validation losses are plotted to check whether the model is overfitting. The plot shows that the model has trained well: the losses are still decreasing gradually, but no longer significantly.
# the train vs valid losses are plotted to check the quality of the trained model
fcn.plot_losses()
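The same check can be made numerically on the training log. The sketch below is one simple heuristic, not something arcgis.learn provides: it finds the epoch with the lowest validation loss and the average gap between validation and training loss, using the last ten epochs of the log above. A gap that keeps widening while the validation loss rises would be a common sign of overfitting.

```python
# (epoch, train_loss, valid_loss) copied from the last ten epochs of the training log
history = [
    (90, 0.001350, 0.001472),
    (91, 0.001339, 0.001466),
    (92, 0.001339, 0.001468),
    (93, 0.001404, 0.001437),
    (94, 0.001313, 0.001497),
    (95, 0.001341, 0.001470),
    (96, 0.001346, 0.001494),
    (97, 0.001317, 0.001455),
    (98, 0.001334, 0.001459),
    (99, 0.001373, 0.001492),
]


def best_epoch(rows):
    """Epoch number with the lowest validation loss."""
    return min(rows, key=lambda row: row[2])[0]


def mean_gap(rows):
    """Average of (valid_loss - train_loss) over the given epochs."""
    return sum(valid - train for _, train, valid in rows) / len(rows)


print(best_epoch(history), round(mean_gap(history), 6))
```

Over these epochs the validation loss stays within a narrow band and the gap to the training loss remains small, which is consistent with the reading of the plot.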
Finally, the training results are printed to assess the trained model's predictions on the test set.
# the values predicted by the trained model are printed for the test set
fcn.show_results()
 | altitude_m | capacity_f | dayl__s_ | prcp__mm_d | srad__W_m_ | swe__kg_m_ | tmax__deg | tmin__deg | vp__Pa_ | wind_speed | prediction_results |
---|---|---|---|---|---|---|---|---|---|---|---|
3150 | 1095 | 0.256616 | 58406.398438 | 0 | 480.0 | 0 | 22.0 | 6.0 | 520 | 3.570236 | 0.205817 |
158 | 1095 | 0.120955 | 35942.398438 | 0 | 240.0 | 0 | 13.0 | -3.0 | 480 | 7.964495 | 0.124888 |
5927 | 1070 | 0.288044 | 58752.0 | 2 | 470.399994 | 0 | 23.0 | 6.5 | 560 | 3.339548 | 0.243402 |
7428 | 1094 | 0.08343 | 58752.0 | 23 | 201.600006 | 0 | 20.5 | 12.5 | 1360 | 6.224127 | 0.111588 |
8900 | 1051 | 0.007171 | 41126.398438 | 11 | 41.599998 | 0 | -1.5 | -3.0 | 480 | 7.412553 | 0.001019 |
In the table above, the values predicted by the model on the test set, in the last column named prediction_results, are close to the actual values of the target variable in the column named capacity_f.
Next, the metrics of the trained model are estimated using the model.score function, which returns the r-square of the model fit as follows:
# the model.score method from the tabular learner returns r-square
r_Square_fcn_test = fcn.score()
print('r_Square_fcn_test: ', round(r_Square_fcn_test,5))
r_Square_fcn_test: 0.84108
An r-square of about 0.84 on the test set indicates that the model has been trained well.
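As a reminder of what this metric measures, the sketch below computes the coefficient of determination in its usual form, one minus the ratio of the residual sum of squares to the total sum of squares. Whether `fcn.score()` uses exactly this formula internally is an assumption. The inputs are the five rows displayed by show_results above, so the result is only illustrative and will differ from the score on the full test set.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# capacity_f and prediction_results from the five rows shown by show_results
capacity_f = [0.256616, 0.120955, 0.288044, 0.08343, 0.007171]
prediction_results = [0.205817, 0.124888, 0.243402, 0.111588, 0.001019]

r2_sample = r_squared(capacity_f, prediction_results)
print(round(r2_sample, 2))
```

A value of 1 means the predictions match the actual values exactly, while a value of 0 means the model does no better than always predicting the mean.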
Solar Energy Generation Forecast & Validation
The trained model (FullyConnectedNetwork) will now be used to predict the daily solar energy generation over the lifetime of the solar plant at the Southland Leisure Centre, which was installed in 2015. The aim is to validate the trained model and measure how well it estimates solar output using only the weather variables from the Southland Leisure Centre.
Accordingly, the model.predict method from arcgis.learn is used with the daily weather variables for this site from September 2015 to December 2019 as input, to predict the daily solar energy output for the same period (as a capacity factor, which is rescaled to kWh afterwards). The predictors are picked up automatically from the input feature layer southland_solar_layer by the trained model, without being listed explicitly, since their names are exactly the same as those used while training the model.
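Because the predictors are matched by field name, it can be worth confirming that every training field is present in the prediction layer before calling predict. The following is a minimal sketch of such a check; the helper `missing_predictors` is illustrative and not part of arcgis.learn, and the field list is a subset of the Southland dataframe columns typed out by hand.

```python
predictors = ['altitude_m', 'wind_speed', 'dayl__s_', 'prcp__mm_d', 'srad__W_m_',
              'swe__kg_m_', 'tmax__deg', 'tmin__deg', 'vp__Pa_']

# Subset of the field names of the Southland dataframe
southland_fields = ['FID', 'Field1', 'ID', 'solar_plan', 'altitude_m', 'latitude', 'longitude',
                    'wind_speed', 'dayl__s_', 'prcp__mm_d', 'srad__W_m_', 'swe__kg_m_',
                    'tmax__deg', 'tmin__deg', 'vp__Pa_', 'kWh_filled', 'capacity_f', 'SHAPE']


def missing_predictors(required, available):
    """Return the required field names that are absent from the available ones."""
    available = set(available)
    return [name for name in required if name not in available]


missing = missing_predictors(predictors, southland_fields)
print(missing)
```

In the notebook itself, `southland_solar_layer_sdf.columns` could be passed as the second argument instead of the hand-typed list. An empty result means all predictors can be matched by name.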
# predicting using the predict function
southland_solar_layer_predicted = fcn.predict(southland_solar_layer, output_layer_name='prediction_layer')
# print the predicted layer
southland_solar_layer_predicted
# Access & visualize the dataframe from the predicted layer
test_pred_layer = southland_solar_layer_predicted.layers[0]
test_pred_layer_sdf = test_pred_layer.query().sdf
test_pred_layer_sdf.head()
 | FID | FID_1 | Field1 | ID | solar_plan | altitude_m | latitude | longitude | wind_speed | dayl__s_ | ... | Creator | EditDate | Editor | zone3_id | zone4_id | zone5_id | zone6_id | zone7_id | prediction | SHAPE |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1201.0 | 2018-09-10 | 164440 | Southland Leisure Centre | 1100.0 | 50.962485 | -114.108472 | 3.71462 | 45619.199219 | ... | arcgis_python | 2020-04-27 | arcgis_python | 8312ccfffffffff | 8412ccdffffffff | 8512ccc3fffffff | 8612ccd57ffffff | 8712ccd52ffffff | 0.129638 | {"x": -12702497.020502415, "y": 6614660.374377... |
1 | 2 | 1202.0 | 2018-09-11 | 164440 | Southland Leisure Centre | 1100.0 | 50.962485 | -114.108472 | 3.663262 | 45619.199219 | ... | arcgis_python | 2020-04-27 | arcgis_python | 8312ccfffffffff | 8412ccdffffffff | 8512ccc3fffffff | 8612ccd57ffffff | 8712ccd52ffffff | 0.149016 | {"x": -12702497.020502415, "y": 6614660.374377... |
2 | 3 | 1203.0 | 2018-09-12 | 164440 | Southland Leisure Centre | 1100.0 | 50.962485 | -114.108472 | 3.847847 | 45273.601562 | ... | arcgis_python | 2020-04-27 | arcgis_python | 8312ccfffffffff | 8412ccdffffffff | 8512ccc3fffffff | 8612ccd57ffffff | 8712ccd52ffffff | 0.0457 | {"x": -12702497.020502415, "y": 6614660.374377... |
3 | 4 | 1204.0 | 2018-09-13 | 164440 | Southland Leisure Centre | 1100.0 | 50.962485 | -114.108472 | 3.958236 | 44928.0 | ... | arcgis_python | 2020-04-27 | arcgis_python | 8312ccfffffffff | 8412ccdffffffff | 8512ccc3fffffff | 8612ccd57ffffff | 8712ccd52ffffff | 0.04043 | {"x": -12702497.020502415, "y": 6614660.374377... |
4 | 5 | 1205.0 | 2018-09-14 | 164440 | Southland Leisure Centre | 1100.0 | 50.962485 | -114.108472 | 4.275449 | 44582.398438 | ... | arcgis_python | 2020-04-27 | arcgis_python | 8312ccfffffffff | 8412ccdffffffff | 8512ccc3fffffff | 8612ccd57ffffff | 8712ccd52ffffff | 0.046646 | {"x": -12702497.020502415, "y": 6614660.374377... |
5 rows × 30 columns
test_pred_layer_sdf.shape
(1590, 30)
The table above holds the predicted values for the Southland photovoltaic power plant in the field called prediction, which contains the model-estimated daily capacity factor of energy generation, whereas the actual capacity factor is in the field named capacity_f. The code below accepts either prediction or prediction_results as the name of the predicted field.
The capacity factor is a normalized value, which is now rescaled back to the original unit of kWh using the peak capacity of the Southland photovoltaic power plant, 153 kWp.
test_pred_layer_sdf.columns
Index(['FID', 'FID_1', 'Field1', 'ID', 'solar_plan', 'altitude_m', 'latitude', 'longitude', 'wind_speed', 'dayl__s_', 'prcp__mm_d', 'srad__W_m_', 'swe__kg_m_', 'tmax__deg', 'tmin__deg', 'vp__Pa_', 'kWh_filled', 'capacity_f', 'GlobalID', 'CreationDa', 'Creator', 'EditDate', 'Editor', 'zone3_id', 'zone4_id', 'zone5_id', 'zone6_id', 'zone7_id', 'prediction', 'SHAPE'], dtype='object')
# the predicted field may be named 'prediction_results' or 'prediction'; use whichever is present
optional_columns = ['prediction_results','prediction']
pred_col = None
for opt_col in optional_columns:
    if opt_col in test_pred_layer_sdf.columns:
        pred_col = opt_col
        break
# inverse scaling from capacity factor to actual generation in kWh - peak capacity of Southland Leisure Centre is 153 kWp
test_pred_datetime = test_pred_layer_sdf[['Field1','capacity_f',pred_col]].copy()
test_pred_datetime = test_pred_datetime.rename(columns={'Field1':'date'})
test_pred_datetime['date'] = pd.to_datetime(test_pred_datetime['date'])
test_pred_datetime = test_pred_datetime.set_index(test_pred_datetime['date'])
test_pred_datetime['Actual_generation(KWh)'] = test_pred_datetime['capacity_f']*24*153
test_pred_datetime['predicted_generation(KWh)'] = test_pred_datetime[pred_col]*24*153
test_pred_datetime = test_pred_datetime.drop(['date','capacity_f',pred_col], axis=1).sort_index()
test_pred_datetime
date | Actual_generation(KWh) | predicted_generation(KWh) |
---|---|---|
2015-09-01 | 286.013 | 645.847283 |
2015-09-02 | 681.646 | 573.81391 |
2015-09-03 | 647.906 | 550.952354 |
2015-09-04 | 102.448 | 188.131957 |
2015-09-05 | 93.432 | 105.872621 |
... | ... | ... |
2019-12-27 | 1.349 | 40.980099 |
2019-12-28 | 1.965 | 22.588846 |
2019-12-29 | 1.616 | 74.11202 |
2019-12-30 | 7.44 | 110.594964 |
2019-12-31 | 8.323 | 57.998556 |
1590 rows × 2 columns
The table above shows the actual versus the model-predicted daily solar energy generated by the Southland plant from late 2015 to the end of 2019. These values are now used to compute the r-square, to understand the predictive power of the model.
# estimate the r-square between the actual and predicted values of daily energy generation
from sklearn.metrics import r2_score
r2_test = r2_score(test_pred_datetime['Actual_generation(KWh)'],test_pred_datetime['predicted_generation(KWh)'])
print('R-Square: ', round(r2_test, 2))
R-Square: 0.86
The comparison returns a high r-square of 0.86, showing close agreement between the actual and predicted values.
# Comparison of the actual and predicted total energy generated, averaged per year (the total is divided by 4 as an approximate number of years) and converted from kWh to MWh
actual = (test_pred_datetime['Actual_generation(KWh)'].sum()/4/1000).round(2)
predicted = (test_pred_datetime['predicted_generation(KWh)'].sum()/4/1000).round(2)
print('Actual annual Solar Energy Generated by Southland Solar Station: {} MWh'.format(actual))
print('Predicted annual Solar Energy Generated by Southland Solar Stations: {} MWh'.format(predicted))
Actual annual Solar Energy Generated by Southland Solar Station: 170.03 MWh
Predicted annual Solar Energy Generated by Southland Solar Stations: 170.44 MWh
Result Visualization
Finally, the actual and predicted values are plotted to visualize their distribution across the entire lifetime of the power plant.
plt.figure(figsize=(30,6))
plt.plot(test_pred_datetime['Actual_generation(KWh)'], linewidth=1, label= 'Actual')
plt.plot(test_pred_datetime['predicted_generation(KWh)'], linewidth=1, label= 'Predicted')
plt.ylabel('Solar Energy in KWh', fontsize=14)
plt.legend(fontsize=14,loc='upper right')
plt.title('Actual Vs Predicted Solar Energy Generated by Southland Solar-FullyConnectedNetwork Model', fontsize=14)
plt.grid()
plt.show()
Summarizing the values, the actual average annual energy generated by the solar plant (170.03 MWh) is very close to the predicted annual average (170.44 MWh), which shows high accuracy.
In the plot above, the blue line indicates the actual generation and the orange line the predicted values. The two overlap to a high degree, showing the high predictive capacity of the model.
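How close the two annual figures are can be put into a single number. The short sketch below computes the relative error of the predicted annual average against the actual one, using the two values printed above; the function name is illustrative.

```python
actual_mwh = 170.03     # actual average annual generation printed above
predicted_mwh = 170.44  # predicted average annual generation printed above


def relative_error(actual, predicted):
    """Signed error of the prediction as a fraction of the actual value."""
    return (predicted - actual) / actual


rel_err = relative_error(actual_mwh, predicted_mwh)
print(f"Relative error: {rel_err:.2%}")
```

The annual totals differ by roughly a quarter of a percent, even though individual days can deviate much more, because daily over- and under-predictions largely cancel out in the sum.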
MLModel
In the second methodology, a machine learning model is applied to the same data using the MLModel framework from arcgis.learn. This framework can be used to import and apply any machine learning model from the scikit-learn library to the data returned by the prepare_tabulardata function from arcgis.learn.
# importing the scikit-learn utilities for data preprocessing for the machine learning model
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn.preprocessing import MinMaxScaler
from sklearn.compose import make_column_transformer
Data Preprocessing
As in the data preparation for the neural network, a list is first made of the feature fields that will be used to predict daily solar energy generation. Fields are treated as continuous by default, whereas a categorical variable is passed as a tuple of the field name and the value True. The variables are then scaled with the MinMaxScaler function from scikit-learn, by passing it together with the variable list into the column transformer function as follows:
# scaling the feature data using MinMaxScaler(), the default is Normalizer from scikit learn
X = ['altitude_m', 'wind_speed', 'dayl__s_', 'prcp__mm_d','srad__W_m_','swe__kg_m_','tmax__deg','tmin__deg','vp__Pa_']
numerical_transformer = make_pipeline(MinMaxScaler())
preprocessors = make_column_transformer((numerical_transformer, X))
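With its default settings, MinMaxScaler maps each feature linearly onto the range 0 to 1 using the minimum and maximum seen during fitting. The plain-Python sketch below shows that arithmetic on the five altitude_m values from the show_batch table earlier in the notebook; it is only an illustration of the transformation, not a replacement for the scikit-learn pipeline, and the function name is illustrative.

```python
def min_max_scale(values):
    """Scale values linearly so that the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no spread; map every value to 0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# altitude_m values from the five rows displayed by show_batch
altitudes = [1055, 1095, 1095, 1070, 1096]
scaled = min_max_scale(altitudes)
print([round(s, 3) for s in scaled])
```

In the actual pipeline the minimum and maximum are learned from the whole training column, so the scaled values of these rows would differ from the ones printed here.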
Once the list of explanatory variables is defined and the preprocessors are set up, they are used as input for the prepare_tabulardata method in arcgis.learn. The method takes a feature layer or a spatial dataframe containing the dataset and returns a TabularDataObject that can be fed into the model.
The input parameters required for the tool are similar to the ones mentioned previously, with the addition of preprocessors:
# importing the library from arcgis.learn for prepare data
from arcgis.learn import prepare_tabulardata
# preprocessing data using the prepare_tabulardata method for MLModel
data = prepare_tabulardata(calgary_no_southland_solar_layer,
                           'capacity_f',
                           explanatory_variables=X,
                           preprocessors=preprocessors)
Model Initialization
Once the data has been prepared by the prepare_tabulardata method, it is ready to be passed to the selected machine learning model for training. Here the GradientBoostingRegressor model from scikit-learn is used; it is passed into MLModel along with its parameters as follows:
# importing the MLModel framework from arcgis.learn and the model from scikit learn
from arcgis.learn import MLModel
# defining the model along with the parameters
model = MLModel(data, 'sklearn.ensemble.GradientBoostingRegressor',loss ='absolute_error', learning_rate=0.02, n_estimators=117, random_state=43)
Model Training
The model is now ready for training. The model.fit method fits the machine learning model with the parameters defined in the previous step.
model.fit()
The training results are printed to compute some model metrics and assess the quality of the trained model.
model.show_results()
| | altitude_m | capacity_f | dayl__s_ | prcp__mm_d | srad__W_m_ | swe__kg_m_ | tmax__deg | tmin__deg | vp__Pa_ | wind_speed | capacity_f_results |
---|---|---|---|---|---|---|---|---|---|---|---|
1489 | 1055 | 0.019555 | 29376.0 | 0 | 96.0 | 0 | -4.5 | -12.0 | 240 | 5.819128 | 0.015684 |
3502 | 1112 | 0.253015 | 53568.0 | 0 | 473.600006 | 0 | 24.5 | 8.0 | 680 | 5.097813 | 0.227703 |
4304 | 1070 | 0.248061 | 50112.0 | 0 | 422.399994 | 0 | 29.0 | 7.5 | 800 | 3.733651 | 0.213505 |
5491 | 1090 | 0.018597 | 34905.601562 | 0 | 265.600006 | 28 | 3.0 | -14.5 | 200 | 8.435382 | 0.047954 |
7679 | 1096 | 0.112015 | 44582.398438 | 0 | 288.0 | 0 | 22.5 | 10.5 | 1280 | 4.886889 | 0.152136 |
In the table above, the last column, capacity_f_results, holds the values predicted by the model on the test set. These can be compared with the actual values of the target variable in the capacity_f column, which they broadly follow.
Subsequently, the model metrics of the trained model are estimated using the model.score() function, which currently returns the R-squared of the model fit as follows:
# r-square is estimated using the inbuilt model.score() from the tabular learner
print('r_square_test_rf: ', round(model.score(), 5))
r_square_test_rf: 0.76784
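An R-squared of about 0.77 indicates that the model explains roughly three quarters of the variance in the capacity factor, which suggests that the model has been trained reasonably well.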
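For reference, R-squared can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below applies that formula to the five sample rows displayed by show_results() above. The helper name `r_squared` is an assumption for illustration, and the value for five rows will differ from the score reported by model.score(), which is computed on many more rows.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# capacity_f and capacity_f_results from the five sample rows shown above.
actual = [0.019555, 0.253015, 0.248061, 0.018597, 0.112015]
predicted = [0.015684, 0.227703, 0.213505, 0.047954, 0.152136]
print(round(r_squared(actual, predicted), 3))
```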
feature_imp_RF = model.feature_importances_
Solar Energy Generation Forecast & Validation
The trained GradientBoostingRegressor model implemented via the MLModel framework will now be used to predict the daily solar energy generation of the solar plant installed at the Southland Leisure Centre over its lifetime, that is, since its installation in 2015. The aim is to validate its performance and compare it with that obtained by the FullyConnectedNetwork model.
To recapitulate, the model.predict method from arcgis.learn is used with the daily weather variables of the site, ranging from September 2015 to December 2019, as input to predict the daily capacity factor, and hence the daily solar energy output in KWh, for the same period. The predictors are chosen automatically from the input feature layer southland_solar_layer by the trained model without mentioning them explicitly, since their names are exactly the same as those used for training the model.
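Because prediction relies on matching field names, it can be useful to confirm beforehand that every explanatory variable is present in the input layer. The following is a minimal sketch of such a check in plain Python. The helper `missing_fields` is hypothetical, and the field list is a subset of the Southland layer field names shown in this notebook, typed out by hand for illustration.

```python
# Explanatory variables used to train the model (same names as the list X).
explanatory_variables = ['altitude_m', 'wind_speed', 'dayl__s_', 'prcp__mm_d', 'srad__W_m_',
                         'swe__kg_m_', 'tmax__deg', 'tmin__deg', 'vp__Pa_']

def missing_fields(required, available):
    """Return the required field names that are absent from the available ones."""
    available = set(available)
    return [name for name in required if name not in available]

# Subset of the field names of the Southland layer, typed out for illustration.
southland_fields = ['FID', 'FID_1', 'Field1', 'ID', 'solar_plan', 'altitude_m', 'latitude',
                    'longitude', 'wind_speed', 'dayl__s_', 'prcp__mm_d', 'srad__W_m_',
                    'swe__kg_m_', 'tmax__deg', 'tmin__deg', 'vp__Pa_', 'kWh_filled', 'capacity_f']

missing = missing_fields(explanatory_variables, southland_fields)
print(missing or 'all explanatory variables are present')
```

In practice the available names would presumably be read from the layer itself, for example from the columns of its spatial dataframe, rather than typed by hand.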
southland_solar_layer_predicted_rf = model.predict(southland_solar_layer, output_layer_name='prediction_layer_rf')
# print the predicted layer
southland_solar_layer_predicted_rf
# Access & visualize the dataframe from the predicted layer
valid_pred_layer = southland_solar_layer_predicted_rf.layers[0]
valid_pred_layer_sdf = valid_pred_layer.query().sdf
valid_pred_layer_sdf.head()
| | FID | FID_1 | Field1 | ID | solar_plan | altitude_m | latitude | longitude | wind_speed | dayl__s_ | ... | vp__Pa_ | kWh_filled | capacity_f | GlobalID | CreationDa | Creator | EditDate | Editor | prediction | SHAPE |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 401.0 | 2016-07-05 | 164440 | Southland Leisure Centre | 1100.0 | 50.962485 | -114.108472 | 3.787571 | 58060.800781 | ... | 920.0 | 593.275 | 0.161567 | 3cbc6ed1-6504-4f41-8e4b-b95c62816070 | 2020-04-27 | arcgis_python | 2020-04-27 | arcgis_python | 0.15771 | {"x": -12702497.020502415, "y": 6614660.374377... |
1 | 2 | 402.0 | 2016-07-06 | 164440 | Southland Leisure Centre | 1100.0 | 50.962485 | -114.108472 | 3.30231 | 58060.800781 | ... | 920.0 | 575.397 | 0.156699 | d19858dc-1caa-4893-9e98-1540753eec4a | 2020-04-27 | arcgis_python | 2020-04-27 | arcgis_python | 0.206897 | {"x": -12702497.020502415, "y": 6614660.374377... |
2 | 3 | 403.0 | 2016-07-07 | 164440 | Southland Leisure Centre | 1100.0 | 50.962485 | -114.108472 | 3.923609 | 58060.800781 | ... | 880.0 | 886.423 | 0.241401 | 0ede54fa-3541-45be-9b64-5cdf259f69aa | 2020-04-27 | arcgis_python | 2020-04-27 | arcgis_python | 0.222833 | {"x": -12702497.020502415, "y": 6614660.374377... |
3 | 4 | 404.0 | 2016-07-08 | 164440 | Southland Leisure Centre | 1100.0 | 50.962485 | -114.108472 | 4.37531 | 57715.199219 | ... | 1000.0 | 976.136 | 0.265832 | c7259240-55a6-40fe-8700-844fafc12b8f | 2020-04-27 | arcgis_python | 2020-04-27 | arcgis_python | 0.221457 | {"x": -12702497.020502415, "y": 6614660.374377... |
4 | 5 | 405.0 | 2016-07-09 | 164440 | Southland Leisure Centre | 1100.0 | 50.962485 | -114.108472 | 2.816725 | 57715.199219 | ... | 1000.0 | 490.25 | 0.13351 | 6e59869a-e4e3-473a-b98f-a57ec6a5b480 | 2020-04-27 | arcgis_python | 2020-04-27 | arcgis_python | 0.205337 | {"x": -12702497.020502415, "y": 6614660.374377... |
5 rows × 25 columns
The table above shows the values predicted by the MLModel for the Southland plant, stored in the field prediction, whereas the actual capacity factor is in the field named capacity_f.
The capacity factor is a normalized value. It is now rescaled back to the original unit of KWh by multiplying it by 24 hours and by the peak capacity of the Southland photovoltaic power plant, which is 153 KWp.
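As a worked example of this rescaling, the sketch below converts the capacity factor of the first row in the table above back to daily energy and compares it with the recorded kWh_filled value of 593.275. The function and constant names are assumptions introduced for illustration.

```python
HOURS_PER_DAY = 24
SOUTHLAND_PEAK_KWP = 153  # peak capacity of the Southland plant, as stated in the text

def capacity_factor_to_kwh(capacity_factor, peak_kwp=SOUTHLAND_PEAK_KWP):
    """Daily energy in KWh = capacity factor x 24 hours x peak capacity in KWp."""
    return capacity_factor * HOURS_PER_DAY * peak_kwp

# First row of the table above: capacity_f = 0.161567, kWh_filled = 593.275
daily_kwh = capacity_factor_to_kwh(0.161567)
print(round(daily_kwh, 2))
```

The small remaining difference from the recorded value comes from the rounding of the displayed capacity factor.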
valid_pred_layer_sdf.columns
Index(['FID', 'FID_1', 'Field1', 'ID', 'solar_plan', 'altitude_m', 'latitude', 'longitude', 'wind_speed', 'dayl__s_', 'prcp__mm_d', 'srad__W_m_', 'swe__kg_m_', 'tmax__deg', 'tmin__deg', 'vp__Pa_', 'kWh_filled', 'capacity_f', 'GlobalID', 'CreationDa', 'Creator', 'EditDate', 'Editor', 'prediction', 'SHAPE'], dtype='object')
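# inverse scaling from capacity factor to actual generation in KWh - peak capacity of Southland Leisure Centre is 153KWp
# name of the field holding the MLModel predictions (see the column list above)
pred_col = 'prediction'
valid_pred_datetime = valid_pred_layer_sdf[['Field1','capacity_f',pred_col]].copy()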
valid_pred_datetime = valid_pred_datetime.rename(columns={'Field1':'date'})
valid_pred_datetime['date'] = pd.to_datetime(valid_pred_datetime['date'])
valid_pred_datetime = valid_pred_datetime.set_index(valid_pred_datetime['date'])
valid_pred_datetime['Actual_generation(KWh)'] = valid_pred_datetime['capacity_f']*24*153
valid_pred_datetime['predicted_generation(KWh)'] = valid_pred_datetime[pred_col]*24*153
valid_pred_datetime = valid_pred_datetime.drop(['date','capacity_f',pred_col], axis=1)
valid_pred_datetime = valid_pred_datetime.sort_index()
valid_pred_datetime.head()
date | Actual_generation(KWh) | predicted_generation(KWh) |
---|---|---|
2015-09-01 | 286.013 | 736.742935 |
2015-09-02 | 681.646 | 673.932795 |
2015-09-03 | 647.906 | 516.336426 |
2015-09-04 | 102.448 | 200.73517 |
2015-09-05 | 93.432 | 179.647032 |
The table above shows the actual versus the MLModel-predicted daily solar energy generated by the Southland plant from late 2015 to the end of 2019. These values are now used to estimate model metrics to understand the predictive power of the MLModel.
# estimate the R-squared between the actual and predicted values for daily energy generation
from sklearn.metrics import r2_score
r2_test = r2_score(valid_pred_datetime['Actual_generation(KWh)'],valid_pred_datetime['predicted_generation(KWh)'])
print('R-Square: ', round(r2_test, 2))
R-Square: 0.81
The comparison returns an R-squared of 0.81, showing close agreement between the actual and predicted daily values.
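Error metrics expressed in the original unit can complement R-squared. The sketch below computes the root mean squared error (RMSE) and the mean absolute error (MAE) in plain Python for the five days shown in the table above. The helper names are assumptions for illustration, and five rows are far too few to characterize the model, so the numbers only demonstrate the calculation.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error, in the same unit as the inputs."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error, in the same unit as the inputs."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Actual and predicted daily generation (KWh) for 2015-09-01 to 2015-09-05.
actual_kwh = [286.013, 681.646, 647.906, 102.448, 93.432]
predicted_kwh = [736.742935, 673.932795, 516.336426, 200.73517, 179.647032]

print('RMSE (KWh):', round(rmse(actual_kwh, predicted_kwh), 1))
print('MAE (KWh):', round(mae(actual_kwh, predicted_kwh), 1))
```

RMSE penalizes large misses such as the first day more heavily than MAE does, so reporting both gives a sense of how much of the error comes from a few poorly predicted days.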
# Comparison between the actual and the MLModel-predicted total energy generated,
# averaged over 4 years and converted from KWh to MWh
actual = (valid_pred_datetime['Actual_generation(KWh)'].sum()/4/1000).round(2)
predicted = (valid_pred_datetime['predicted_generation(KWh)'].sum()/4/1000).round(2)
print('Actual annual Solar Energy Generated by Southland Solar Station: {} MWh'.format(actual))
print('Predicted annual Solar Energy Generated by Southland Solar Station: {} MWh'.format(predicted))
Actual annual Solar Energy Generated by Southland Solar Station: 170.03 MWh
Predicted annual Solar Energy Generated by Southland Solar Station: 170.04 MWh
Summarizing the values, the predicted average annual energy generation (170.04 MWh) is very close to the actual average annual generation of the solar plant (170.03 MWh), which indicates that over- and under-predictions on individual days largely cancel out over the period.
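The arithmetic behind this comparison can be sketched as follows: a total in KWh is divided by the number of years and by 1000 to obtain an average annual value in MWh, and the two annual values are then compared as a percentage difference. The function names are assumptions for illustration, and the total of 680120 KWh is back-calculated from the rounded printed value rather than taken from the data.

```python
def average_annual_mwh(total_kwh, years):
    """Average annual energy in MWh from a multi-year total in KWh."""
    return total_kwh / years / 1000.0

def percent_difference(actual, predicted):
    """Absolute difference between predicted and actual, as a percentage of actual."""
    return abs(predicted - actual) / actual * 100.0

# Illustrative total, back-calculated from the printed 170.03 MWh over 4 years.
print(average_annual_mwh(680120.0, 4))

# Annual values printed by the notebook for the Southland plant.
actual_mwh = 170.03
predicted_mwh = 170.04
print(round(percent_difference(actual_mwh, predicted_mwh), 4))
```

A small difference in annual totals shows that the model has little overall bias; it does not by itself measure day-to-day accuracy, which is what the R-squared on daily values captures.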
Result Visualization
Finally, the actual and predicted values are plotted to visualize their distribution across the entire lifetime of the power plant.
plt.figure(figsize=(30,6))
plt.plot(valid_pred_datetime['Actual_generation(KWh)'], linewidth=1, label= 'Actual')
plt.plot(valid_pred_datetime['predicted_generation(KWh)'], linewidth=1, label= 'Predicted')
plt.ylabel('Solar Energy in KWh', fontsize=14)
plt.legend(fontsize=14,loc='upper right')
plt.title('Actual Vs Predicted Solar Energy Generated by Southland Solar-MLModel (GradientBoostingRegressor)', fontsize=14)
plt.grid()
plt.show()
Conclusion
The goal of the project is to create a model that can predict the daily solar energy efficiency, and hence the actual output, of a photovoltaic solar plant at a location using daily weather variables of the site as input, and thereby demonstrate the application of the newly implemented FullyConnectedNetwork artificial neural network and the MLModel machine learning framework available in the arcgis.learn module of the ArcGIS API for Python.
Accordingly, data from 10 solar energy installation sites in the City of Calgary in Canada are used to train two different models: first the FullyConnectedNetwork model and second the MLModel framework from the arcgis.learn module. These were then used to predict the daily solar output of a different solar plant in Calgary that was held out from the training set. The steps for implementing these models are discussed and elaborated in the notebook, including data preprocessing, model training and final inferencing.
Comparison of the results shows that both models successfully predicted the solar energy output of the test solar plant, with predicted values of 171.76 MWh and 171.51 MWh by the FullyConnectedNetwork and the MLModel algorithm respectively, compared to the actual average annual solar generation of 170.74 MWh for the station.
Going further, it would be interesting to apply this model to other solar generation plants located across different geographies and record its performance to understand the generalizability of the model.
Summary of methods used
Method | Description | Examples |
---|---|---|
prepare_tabulardata | prepare data including imputation, normalization and train-test split | prepare data ready for fitting a MLModel or FullyConnectedNetwork model |
FullyConnectedNetwork() | set up a fully connected neural network for the data | initialize a FullyConnectedNetwork model with prepared data |
model.lr_find() | find an optimal learning rate | finalize a good learning rate for training the FullyConnectedNetwork model |
MLModel() | select the ML algorithm to be used for fitting | any regression or classification model from scikit learn can be used |
model.fit() | train a model with epochs & learning rate as input | training the FullyConnectedNetwork model with suitable input |
model.score() | find the model metric of R-squared of the trained model | returns R-squared value after training the MLModel and FullyConnectedNetwork model |
model.predict() | predict on a test set | predict values using the trained models on test input |
Data resources
Dataset | Source | Link |
---|---|---|
Calgary solar energy | Calgary daily solar energy generation | https://data.calgary.ca/Environment/Solar-Energy-Production/ytdn-2qsp |
Calgary Photovoltaic Sites | Location of Calgary Solar sites in Lat & Lon | https://data.calgary.ca/dataset/City-of-Calgary-Solar-Photovoltaic-Sites/vrdj-ycb5 |
Calgary Daily weather data | Daymet - Daily Surface Weather Data on a 1-km Grid for North America, Version 3 | https://daac.ornl.gov/DAYMET/guides/Daymet_V3_CFMosaics.html |