Overview
This is a time series prediction tutorial using Pytorch. Pytorch is an open-source machine learning library, used particularly for training neural networks. This tutorial uses a Pytorch recurrent neural network model to step through the basic workflow of a machine learning project.
Installing Pytorch
Installing deep learning software stacks can be tricky. Pytorch and other deep learning frameworks like Tensorflow and JAX are under rapid development and often require specific versions of their dependencies.
Google Colab is the easiest way to do the following activity because Pytorch comes preinstalled. Otherwise, use Anaconda to install Pytorch in a new conda environment.
UH Professor Yinphan Tsang’s lab monitors rainfall and stream flow in Manoa Valley. Sensors at Lyon Arboretum record data at regular 15-minute intervals. Raw data can be downloaded from her website.
A subset of the data has already been downloaded for this workshop. In particular, we will use the pyranometer data, which quantifies the amount of solar radiation reaching the ground. We will perform time series forecasting with a recurrent neural network. Forecasting solar irradiance is important for managing the variability in renewable energy production.
Exercise: Import dataset, check configuration
To start, import all the relevant libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tqdm import tqdm
from sklearn.preprocessing import MinMaxScaler
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
%matplotlib inline
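As a quick configuration check (this snippet is optional and not part of the forecasting workflow), you can print the installed Pytorch version and see whether a GPU is visible; the tutorial runs fine on a CPU.

# Optional configuration check: confirm Pytorch imports and report the compute device.
print(torch.__version__)            # installed Pytorch version
print(torch.cuda.is_available())    # True if a CUDA-capable GPU is available (not required here)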
Download the data from this link. This .csv file contains three columns: a timestamp, a measurement of shortwave radiation (in Watts/meter^2) from the sun, and a measurement of rainfall (in mm). Measurements are every 15 minutes from 2018-02-23 12:30:00 to 2020-12-03 12:30:00.
Move or upload the file into your JupyterLab workspace. Load the data into a pandas dataframe.
df = pd.read_csv('./lyons_radiation.csv')
Now let’s preprocess the data. First we set the index of the dataframe to be a series of DateTime objects. This lets pandas know that we are working with timestamps. We then fill in any missing timesteps using the forward fill method, which simply copies previous observations forward to fill any gaps. Forward filling is probably not the best approach; ordinarily we would inspect the data and choose a more careful imputation method, but it is simple, fast, and good enough for this demo.
df['timestamp'] = pd.to_datetime(df['timestamp'])
df = df.set_index(['timestamp'])
df = df.resample('15min').last().ffill()
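As an optional sanity check (not part of the original workflow), we can confirm that the resampled index is evenly spaced and that no missing values remain after forward filling.

# Optional check: every step should be exactly 15 minutes, and no NaNs should remain.
steps = df.index.to_series().diff().dropna()
print((steps == pd.Timedelta('15min')).all())
print(df['radiation'].isna().sum())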
Now we convert the radiation column into a numpy vector (removing metadata), and scale the values to be between zero and one. Large data values make it much more difficult to train the neural network, so it is critical to make the data (both inputs and outputs) neural network friendly. The Scikit-Learn package has nice tools for doing this. Remember that after training, any model outputs will need to undergo the inverse transform to be converted back into physical units.
We also convert the timeseries vector into a torch FloatTensor, since that is the data format expected by Pytorch.
data = df['radiation'].to_numpy(dtype='float32')
scaler = MinMaxScaler(feature_range=(0, 1))
data_normalized = scaler.fit_transform(data.reshape(-1, 1))
data_normalized = torch.FloatTensor(data_normalized).view(-1)
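To see the inverse transform in action, a quick roundtrip check (optional) confirms that scaler.inverse_transform recovers the original W/m^2 values.

# Optional check: un-scaling the normalized data should reproduce the original values.
roundtrip = scaler.inverse_transform(data_normalized.numpy().reshape(-1, 1)).ravel()
print(np.allclose(roundtrip, data, atol=1e-3))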
Training a neural network model is done by presenting sets of (input, output) example pairs. We need to specify what those examples should look like. Our inputs could in principle be of arbitrary length, but we choose a finite window of 100 timesteps because this is approximately one day of data (4 samples per hour x 24 hours = 96 samples, roughly 100 timesteps).
def create_in_out_pairs(input_data, tw):
    inout_seq = []
    L = len(input_data)
    for i in range(L-tw):
        train_seq = input_data[i:i+tw]
        train_label = input_data[i+tw:i+tw+1]
        inout_seq.append((train_seq, train_label))
    return inout_seq
train_window = 100
in_out_data = create_in_out_pairs(data_normalized, train_window)
We select a small sample of the total dataset as our training set, leaving some examples for evaluating the model.
train_in_out_data = in_out_data[100:1000:10]
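It is worth checking how many examples we ended up with and what one (input, output) pair looks like; this snippet is just an illustrative check.

print(len(in_out_data), len(train_in_out_data))   # total pairs vs. training pairs
x0, y0 = train_in_out_data[0]
print(x0.shape, y0.shape)                         # torch.Size([100]) and torch.Size([1])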
Creating Train/Test Splits
In machine learning, it is critical to split the data into “training” and “test” sets. The training set is used to train the model, while the test set is used to evaluate it. Machine learning models tend to memorize their training data, so the training error can be driven close to zero and tells you little on its own; the only way to know whether the model generalizes to new data is to evaluate its performance on the held-out test set (the basic idea behind cross-validation).
In time series data, there is a danger of information leakage, where the training set is highly correlated with the test set (even if the exact examples differ). Thus, it is common to use chrono-cross-validation, where the training set consists of examples from before a certain time T and the test set consists of examples from after time T.
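As a concrete sketch (the split index below is an assumption, not a prescribed value), a chronological split simply takes pairs from before an index T for training and pairs from after it for testing, leaving a gap of one window so that no test input overlaps the training data.

T = 5000                                       # assumed split index, well inside the record
train_pairs = in_out_data[:T]                  # examples ending before time T
test_pairs = in_out_data[T + train_window:]    # examples starting after time T, with a one-window gap to avoid leakage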
A Long Short-Term Memory (LSTM) neural network is a particular neural network architecture adept at processing sequence data. Because of its recurrent structure, it can take input sequences of arbitrary length (though in practice we limit the length due to computational constraints). See this blog post for a good introduction to recurrent neural networks (RNNs).
Pytorch provides a convenient way to define a complex model in a couple lines of code.
class LSTM(nn.Module):
    def __init__(self, input_size=1, hidden_layer_size=100, output_size=1):
        super().__init__()
        self.hidden_layer_size = hidden_layer_size
        self.lstm = nn.LSTM(input_size, hidden_layer_size)
        self.linear = nn.Linear(hidden_layer_size, output_size)
        self.hidden_cell = (torch.zeros(1, 1, self.hidden_layer_size),
                            torch.zeros(1, 1, self.hidden_layer_size))

    def forward(self, input_seq):
        lstm_out, self.hidden_cell = self.lstm(input_seq.view(len(input_seq), 1, -1), self.hidden_cell)
        predictions = self.linear(lstm_out.view(len(input_seq), -1))
        return predictions[-1]
model = LSTM(input_size=1, hidden_layer_size=10, output_size=1)
We now have a Pytorch neural network model that takes in a sequence of values, and outputs a prediction (a single scalar). The model parameters have been initialized randomly, so the predictions are random. Now we need to train the model using our dataset.
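As a quick sanity check (optional), we can run the untrained model on one example pair and confirm that the output has the same shape as the label; the prediction itself is meaningless at this point because the parameters are random.

example_seq, example_label = in_out_data[0]
with torch.no_grad():
    untrained_pred = model(example_seq)       # forward pass only, no gradients needed
print(untrained_pred.shape, example_label.shape)   # both torch.Size([1])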
Designing a neural network architecture
The neural network architecture can have a large impact on performance. The inductive bias of a model refers to the set of assumptions and prior beliefs that are inherent in this choice. All machine learning models have inductive bias. This is not a bad thing, but it means some models are more appropriate for some problems than others. This is formalized in what is known as The No Free Lunch Theorem.
Experienced practitioners know which architectures tend to work best for certain types of problems. Bigger models usually work better, but are slower to train and run because of computational constraints. In practice, it is common to try many different models and perform model selection, picking the one that performs best on a test set.
Training the model is an iterative process. Starting from randomly initialized parameters, we iterate through the training examples, make predictions, and update the weight parameters to minimize the loss function. In this case, we use the mean squared error (MSE) loss. We report the loss each time we finish iterating through the training dataset; each full pass through the training set is called an epoch.
loss_function = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
epochs = 20
for i in range(epochs):
    for x, y in train_in_out_data:
        optimizer.zero_grad()
        model.hidden_cell = (torch.zeros(1, 1, model.hidden_layer_size),
                             torch.zeros(1, 1, model.hidden_layer_size))
        y_pred = model(x)
        single_loss = loss_function(y_pred, y)
        single_loss.backward()
        optimizer.step()

    print(f'epoch: {i:3} loss: {single_loss.item():10.8f}')

print(f'epoch: {i:3} loss: {single_loss.item():10.10f}')
epoch:   0 loss: 0.04469438
epoch:   1 loss: 0.04004482
epoch:   2 loss: 0.03381182
epoch:   3 loss: 0.02361814
epoch:   4 loss: 0.01961462
epoch:   5 loss: 0.01725977
epoch:   6 loss: 0.01528491
epoch:   7 loss: 0.01404397
epoch:   8 loss: 0.01311346
epoch:   9 loss: 0.01235864
epoch:  10 loss: 0.01172257
epoch:  11 loss: 0.01118899
epoch:  12 loss: 0.01075610
epoch:  13 loss: 0.01042009
epoch:  14 loss: 0.01016921
epoch:  15 loss: 0.00998601
epoch:  16 loss: 0.00985228
epoch:  17 loss: 0.00975257
epoch:  18 loss: 0.00967518
epoch:  19 loss: 0.00961193
epoch:  19 loss: 0.0096119326
How do I know if it’s working?
Training a neural network is notoriously difficult. The problem is that we are performing stochastic gradient descent on a very non-convex function. You should check that the loss decreases during the first few epochs, indicating that the model is improving. If it doesn’t, try decreasing the learning rate. If that doesn’t help, you likely have a bug or the model is not appropriate for your problem.
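For example, you could rebuild the optimizer with a smaller learning rate and train again (the value below is just a suggestion, not a tuned setting).

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # ten times smaller than before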
How long do I train?
It is hard to know when to stop training, because even when the model seems to have converged, it might suddenly have “a revelation” that allows it to explain the training data perfectly. To build your intuition, we highly recommend playing with the toy neural networks in Tensorflow Playground.
In practice, we train until we run out of patience, or until performance on a held-out validation set stops improving (a criterion known as early stopping).
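Here is a minimal early-stopping sketch. The validation slice (val_pairs) and the patience value are assumptions for illustration, not something defined earlier in this lesson; the inner training pass is the same loop shown above.

val_pairs = in_out_data[2000:2200:10]          # assumed validation slice, later in time than the training data

best_val, bad_epochs, patience = float('inf'), 0, 3
for epoch in range(epochs):
    for x, y in train_in_out_data:             # one training pass, exactly as in the loop above
        optimizer.zero_grad()
        model.hidden_cell = (torch.zeros(1, 1, model.hidden_layer_size),
                             torch.zeros(1, 1, model.hidden_layer_size))
        single_loss = loss_function(model(x), y)
        single_loss.backward()
        optimizer.step()

    with torch.no_grad():                      # measure the loss on the held-out validation pairs
        val_losses = []
        for x, y in val_pairs:
            model.hidden_cell = (torch.zeros(1, 1, model.hidden_layer_size),
                                 torch.zeros(1, 1, model.hidden_layer_size))
            val_losses.append(loss_function(model(x), y).item())
    val_loss = float(np.mean(val_losses))

    if val_loss < best_val:                    # keep training while validation loss improves
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs > patience:              # stop once it has stalled for `patience` epochs
            print(f'stopping early after epoch {epoch}')
            break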
Typically we will evaluate the model on a clean, held-out test set that has not been used for training. Evaluation is done by comparing the test set predictions with ground truth labels and computing metrics such as the MSE. Here, we simply demonstrate predictions and visualize them.
The model should be set to eval mode; then we can pass in a test input sequence to get a prediction for the next value. Appending each prediction to the inputs and feeding the sequence back into the model gives a predicted sequence.
test_inputs = data_normalized[-train_window:].tolist()
print(test_inputs)
model.eval()
fut_pred = 12 # Number of predictions to make.
for i in range(fut_pred):
    seq = torch.FloatTensor(test_inputs[-train_window:])
    with torch.no_grad():
        model.hidden_cell = (torch.zeros(1, 1, model.hidden_layer_size),
                             torch.zeros(1, 1, model.hidden_layer_size))
        test_inputs.append(model(seq).item())
# Predictions need to be scaled back into W/m^2
actual_predictions = scaler.inverse_transform(np.array(test_inputs[train_window:]).reshape(-1, 1))
We can plot the observed and predicted values. Pandas datetimes can be used to label the x-axis.
plt.title('Observed and predicted solar irradiance')
plt.ylabel('Solar Irradiance $(W/m^2)$')
plt.grid(True)
plt.autoscale(axis='x', tight=True)
plt.plot(data[-train_window:], label='Observed')
plt.plot(np.arange(train_window, train_window + fut_pred), actual_predictions, label='Predicted')
plt.xticks(ticks=np.arange(0, train_window, 20),
           labels=df['radiation'][-train_window:].index[np.arange(0, train_window, 20)],
           rotation=45)
plt.legend()
plt.show()
Improving the model
If we want to improve the model, the first question to ask is whether we are underfitting or overfitting. If we are underfitting (the training error is still high), we should increase the size of the model and train longer. If we are overfitting (the training error is low but the test error is high), we should add regularization or get more training data.
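For example (these values are illustrative, not tuned for this dataset), underfitting can be addressed with a larger hidden state, while a simple form of regularization is to add weight decay (an L2 penalty) to the optimizer.

bigger_model = LSTM(input_size=1, hidden_layer_size=50, output_size=1)                  # more capacity if underfitting
optimizer = torch.optim.Adam(bigger_model.parameters(), lr=0.001, weight_decay=1e-4)    # L2 regularization if overfitting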