
TCAN: Advanced Time Series Forecasting

The paper introduces Temporal Convolutional Attention Neural Networks (TCAN), a novel approach for time series forecasting that combines temporal convolutions with an attention mechanism to improve interpretability and accuracy. TCAN effectively addresses the limitations of traditional Temporal Convolutional Neural Networks (TCNNs) by requiring fewer convolutional layers while maintaining a large receptive field and providing insights into important timesteps for predictions. Evaluation on solar power forecasting datasets demonstrates that TCAN outperforms several state-of-the-art models, including TCNN, in terms of accuracy and training speed.

Uploaded by

Sagar Vathar

Temporal Convolutional Attention Neural Networks for Time Series Forecasting

Yang Lin, School of Computer Science, University of Sydney, Sydney, Australia, ylin4015@[Link]
Irena Koprinska, School of Computer Science, University of Sydney, Sydney, Australia, [Link]@[Link]
Mashud Rana, Data61, CSIRO, Sydney, Australia, [Link]@[Link]

Abstract: Temporal Convolutional Neural Networks (TCNNs) have been applied to various sequence modelling tasks, including time series forecasting. However, TCNNs may require many convolutional layers if the input sequence is long, and they are not able to provide interpretable results. In this paper, we present TCAN, a novel deep learning approach that combines an attention mechanism with temporal convolutions for probabilistic forecasting, and demonstrate its performance in a case study on solar power forecasting. TCAN uses the hierarchical convolutional structure of TCNN to extract temporal dependencies and then uses sparse attention to focus on the important timesteps. The sparse attention layer of TCAN enables an extended receptive field without requiring a deeper architecture and allows for interpretability of the forecasting results. An evaluation using three large solar power data sets demonstrates that TCAN outperforms several state-of-the-art deep learning forecasting models, including TCNN, in terms of accuracy. TCAN requires fewer convolutional layers than TCNN for an extended receptive field, is faster to train and is able to visualize the most important timesteps for the prediction.

Index Terms: time series forecasting, deep learning, temporal convolutional neural network, sparse attention

I. INTRODUCTION

Time series forecasting is an essential task in many areas, e.g. in industry (predicting electricity demand), in finance (predicting stock prices and exchange rates), in retail (predicting sales, supply and demand) and in health (predicting immune response, disease progression and hospital length of stay).

Statistical methods such as linear regression, ARIMA and exponential smoothing [1] are well-established and widely used by industry forecasters. However, they require domain knowledge for model selection, fit each time series independently and are not able to infer shared patterns from related time series [2], [3].

On the other hand, deep learning methods have been increasingly applied to time series forecasting, showing very promising results. They are able to learn from raw data with less domain knowledge and feature engineering, and can extract complex patterns, including shared patterns, from related time series. For example, Salinas et al. [4] proposed DeepAR, a probabilistic forecasting model based on Long Short Term Memory (LSTM) networks. The Transformer architecture [3] is a new sequence model that uses only an attention mechanism for data processing, without any recurrent or convolutional layers, and can access any part of the historical sequence regardless of temporal distance. Li et al. [5] proposed the LogSparse Transformer, which addresses the locality-agnostic and memory bottleneck problems of the Transformer.

Another prominent class of deep learning methods, convolutional neural networks, has also been applied to time series forecasting. They are attractive due to their ability to represent repeated patterns in the time series via convolutional filters and to extract useful features from raw data without prior knowledge or feature engineering [6]–[8].

Temporal Convolutional Neural Networks (TCNNs) [9] are specifically designed for sequence modelling tasks and have been applied to different types of data: image, language and music [9], solar power forecasting [8], and retail, electricity and traffic [10]. Probabilistic TCNNs were proposed in [10]. TCNN is a hierarchical architecture consisting of several convolutional layers. It uses causal convolutions, dilated convolutions and residual connections to enable a larger receptive field, reduce the unstable gradient problem and boost training speed [9]. However, when the input sequence is long, TCNN may need many temporal convolutional layers in order to have a sufficiently large receptive field that covers the input sequence. In addition, TCNN is a black-box architecture that is not able to provide interpretable results. Many applications of time series forecasting involve critical decisions; providing explanations improves the confidence of decision makers and is hence highly desirable. In this paper, we present a new approach to address these issues.

The contributions of our work are as follows:

1) We propose the Temporal Convolutional Attention Neural Network (TCAN), which combines a convolutional architecture with an attention mechanism. TCAN learns the temporal dependencies via a hierarchical convolutional architecture and generates forecasting results with a sparse attention layer. The sparse attention layer enables TCAN to access all historical input steps regardless of the sequence length, to focus on the most important timesteps from the input sequence and to provide visualization of the results for interpretability.

2) We evaluate TCAN for time series forecasting on three real-world solar power data sets (two sets contain multivariate series and one set contains related series). The results show that TCAN outperforms the state-of-the-art deep learning models DeepAR, LogSparse Transformer, N-BEATS and TCNN, and a persistence baseline. The attention mappings visualize the important timesteps for interpretability of the results and demonstrate that accessing longer previous history is important. Our results show that TCAN with the sparse attention layer requires fewer convolutional layers than TCNN for an extended receptive field and performs better than TCNN in terms of accuracy and training speed.
II. CASE STUDY: SOLAR POWER FORECASTING

As a case study we consider solar power forecasting: given an input sequence of previous PV solar power generation data (e.g. hourly or half-hourly), predict the solar power for the next time step.

Solar power forecasting is needed for optimal scheduling of generators and for the integration of solar into the electricity grid. The penetration of solar energy into the electricity grid is rapidly increasing, but since solar has a variable and intermittent nature, there is a need for accurate forecasting in order to ensure stability and efficient operation of the electricity grid.

A. Data

We use three publicly available data sets: Sanyo [11], Hanergy [12] and Solar [13].

Sanyo and Hanergy contain solar power generation data from two PV plants in Australia, from 1/2011 to 12/2016 (6 years) for Hanergy and from 1/2011 to 12/2017 (7 years) for Sanyo. Only the data between 7am and 5pm was considered and it was aggregated at half-hourly intervals. For both datasets, weather and weather forecast data was also collected (see [14] for more details) and used as covariates.

Solar contains solar power data from 137 PV plants in Alabama, USA, from 01/2006 to 08/2006. The Solar data is aggregated into 1-hour intervals.

Following [5], [14], calendar features were also added according to the granularity of the datasets: Sanyo and Hanergy use month, hour-of-the-day and minute-of-the-hour, and Solar uses month, hour-of-the-day and age.

All data was normalized to have zero mean and unit variance.

B. Problem Statement

Given:

1) a set of N univariate time series $\{Y_{i,1:T_l}\}_{i=1}^{N}$ (PV solar power in our case study, N = 1 for Sanyo and Hanergy and N = 137 for Solar), where $Y_{i,1:T_l} \triangleq [y_{i,1}, y_{i,2}, \ldots, y_{i,T_l}]$, $T_l$ is the input sequence length, and $y_{i,t} \in \mathbb{R}$ is the PV power of the i-th series generated at time t;

2) a set of associated time-based multi-dimensional covariate vectors $\{X_{i,1:T_l+T_h}\}_{i=1}^{N}$, where $T_h$ is the length of the forecasting horizon. For our case study, the covariates for the Sanyo and Hanergy datasets include weather $\{W1_{i,1:T_l}\}_{i=1}^{N}$, weather forecasts $\{WF_{i,T_l+1:T_l+T_h}\}_{i=1}^{N}$ and calendar features $\{Z_{i,1:T_l+T_h}\}_{i=1}^{N}$, while the covariates for the Solar dataset include calendar features only.

Task: Predict $\{Y_{i,T_l+1:T_l+T_h}\}_{i=1}^{N}$, the PV power for the next $T_h$ time steps after $T_l$.

The input of TCAN at step t is the concatenation of $y_{i,t-1}$ and $x_{i,t}$. TCAN produces the probability distribution of future values, given the past history:

$$
p(Y_{i,T_l+1:T_l+T_h} \mid Y_{i,1:T_l}, X_{i,1:T_l+T_h}; \Phi) = \prod_{t=T_l+1}^{T_l+T_h} p(y_{i,t} \mid Y_{i,1:t-1}, X_{i,1:t}; \Phi) \tag{1}
$$

where $\Phi$ denotes the parameters of TCAN.

Note that the subscript i is omitted in the rest of the paper for simplicity.

III. BACKGROUND

A. Temporal Convolutional Neural Network

Bai et al. [9] proposed a generic architecture of TCNN, which is informed by CNN architectures for sequential data such as WaveNet [15] but is specifically designed to be simpler and to combine autoregressive prediction with a long memory.

TCNN is a hierarchical architecture, consisting of several convolutional hidden layers with the same size as the input layer, as shown in Fig. 1. It is designed to process data element by element.

Fig. 1. TCNN architecture (hidden layers with dilation 1, 2 and 4 over the inputs y0;x1, y1;x2, ..., yt-2;xt-1, yt-1;xt, producing the output yt).

TCNN utilizes three main techniques: causal convolutions, dilated convolutions and residual connections.

Causal convolutions. The output at time t is convolved only with elements from time t or earlier time steps from the previous layer. This concept has been used in Waibel's time-delay network [16] and the WaveNet architecture [15]. Zero padding is used in hidden layers to ensure that the hidden layers have the same dimensionality as the input layer to facilitate the convolutions.

Dilated convolutions. This technique was introduced in [15], [17] to enable large receptive fields, and consequently to capture a long memory, which is not possible with causal convolutions alone as they require a very deep NN.
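
To make the interplay of causal and dilated convolutions concrete, the following minimal NumPy sketch implements a single-channel dilated causal convolution, $y_s = \sum_{i=0}^{k-1} f(i) \cdot x_{s - d \cdot i}$, with left zero padding, and stacks three such layers with dilations 1, 2 and 4 as in Fig. 1. The function name `dilated_causal_conv`, the all-ones filters and the impulse input are assumptions chosen for illustration; this is not the authors' implementation, which also includes weight normalization, ReLU, dropout and residual connections.

```python
import numpy as np

def dilated_causal_conv(x, f, d):
    """Return y with y[s] = sum_i f[i] * x[s - d*i]; indices before the start count as zero."""
    x = np.asarray(x, dtype=float)
    f = np.asarray(f, dtype=float)
    k = len(f)
    pad = d * (k - 1)
    # Left zero padding keeps the output length equal to the input length
    # and guarantees that y[s] never depends on x[s+1], x[s+2], ...
    xp = np.concatenate([np.zeros(pad), x])
    y = np.zeros_like(x)
    for i in range(k):
        start = pad - d * i              # position of x[s - d*i] inside xp for s = 0
        y += f[i] * xp[start:start + len(x)]
    return y

# Stack three layers (k = 3, dilations 1, 2, 4) and push an impulse through them.
impulse = np.zeros(20)
impulse[0] = 1.0
h = impulse
for d in (1, 2, 4):
    h = dilated_causal_conv(h, np.ones(3), d)

# The impulse at step 0 reaches outputs 0..14, so each output sees
# the current step plus (k - 1) * (1 + 2 + 4) = 14 past steps.
reach = int(np.count_nonzero(h))
```

With a 20-step input window, the last five outputs of this three-layer stack are therefore unaffected by the first input step, which is the gap that either a fourth layer or an attention layer has to close.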
The dilated convolutional operator F on the sequence element s is defined as:

$$
F(s) = \sum_{i=0}^{k-1} f(i) \cdot x_{s - d \cdot i} \tag{2}
$$

where $f : \{0, \ldots, k-1\} \to \mathbb{R}$ is the convolution filter, x is the sequential input (the concatenation of the solar time series and covariates in our case study), k is the filter size, and d is the dilation factor.

The convolution kernel remains the same for all layers but the dilation factor increases exponentially with the depth of the network: $d_l = 2^l$, where l is the network level. For example, as shown in Fig. 1, the dilation factor is 1 at the first layer (l = 0, corresponding to regular convolutions) and then increases at each layer, reaching 4 at the last hidden layer. This pyramidal structure and aggregation mechanism effectively increases the receptive field of TCNN, allowing it to cover a long input sequence.

Residual connections. Residual blocks [18] help to overcome the vanishing gradient problem in networks with many layers. The main idea is to add the input x of a block of stacked layers (a series of transformations $\mathcal{F}$) to the output of this block by using a shortcut connection:

$$
o = \sigma(x + \mathcal{F}(x)) \tag{3}
$$

where $\sigma$ is the activation function.

Fig. 2 illustrates a residual block of TCNN [9]. There are two branches: the first one transforms the input x through a series of stacked layers including two dilated causal convolution layers, while the second one is the shortcut connection for the input x. However, the original input x and the output of the residual block $\mathcal{F}$ could have different widths, in which case the addition cannot be performed. This can be rectified by using a 1 × 1 convolution layer on the shortcut branch to ensure the same widths.

Fig. 2. TCNN's residual block (two stacked groups of Dilated Causal Conv, WeightNorm, ReLU and Dropout, plus a 1x1 Conv on the shortcut branch).

In this work, we implemented an autoregressive TCNN [9]. Given an input sequence with a pre-defined size (input window), it predicts the next value, then adds the predicted value to the input window and shifts the input window by one step, makes the next prediction and so on until all values from the forecasting horizon are predicted.

In summary, compared to recurrent architectures such as RNN, LSTM and GRU, TCNNs have the advantage of larger receptive field size, more stable gradients and parallelism [9].

However, although the use of dilated convolutions enables larger receptive fields, it also requires more temporal convolutional layers to cover long input sequences, which makes the architecture more complex and slower to train, and may also lead to overfitting. In this paper we propose TCAN, which gives access to all input timesteps without increasing the number of temporal convolutional layers, and preserves the other advantages of TCNN.

B. Attention Mechanism

The attention mechanism [19] was initially proposed for sequence-to-sequence (seq2seq) tasks in natural language processing but has since been successfully used in other domains as well. It is applied in an encoder-decoder framework and automatically identifies the parts of the encoder input sequence that are important for the decoder outputs.

In the seq2seq framework, the encoder and decoder take the sequential steps as input and generate the hidden state $h_t \in \mathbb{R}^{1 \times d_{hid}}$ at each step, where $d_{hid}$ is the hidden layer size. Soft attention takes as input the encoder hidden states $h_{1:T_l}$ and the decoder hidden state $h_t$ at time step t, and generates a context vector $c_i$ by calculating their dot product:

$$
c_i = h_{1:T_l} \cdot h_t^{\top} \tag{4}
$$

Then, the context vector is normalised by the softmax function to produce attention weights. The weight $a_i$ of each encoder hidden state $h_i$ is given by:

$$
a_i = \frac{\exp(c_i)}{\sum_{j=1}^{T_l} \exp(c_j)} \tag{5}
$$

The weight represents how important the encoder step i is to the decoder output at step t.

Finally, the attention layer output is computed by the dot product between the attention weights $a_{1:T_l}$ and the encoder hidden states $h_{1:T_l}$. The weighted output is concatenated with the decoder hidden state to generate the decoder output.

Intuitively, the attention mechanism helps the decoder to pay attention to the parts of the historical sequence that are important, regardless of the length of the input, and thus overcomes the limitation of the seq2seq framework of encoding a whole sequence into a single fixed-size vector. The use of attention helps the seq2seq model to perform better when processing longer sequences.

However, attention using softmax always results in positive attention weights for all timesteps, including the irrelevant steps, which can be harmful for long sequences [20], [21]. To overcome this issue, recent studies have proposed sparse attention mechanisms to learn sparse attention mappings [5], [20]–[22]. The sparse approach shows advantages over dense attention on multiple sequential modelling tasks in terms of accuracy and also interpretability of the results, as it tends to generate less scattered attention maps [5].
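
To illustrate the difference between dense and sparse attention weights, the sketch below computes dot-product scores as in Eq. (4), normalises them with softmax as in Eq. (5), and contrasts the result with sparsemax [20], one sparse alternative. Sparsemax is used here only because it has a short closed-form solution; the helper names, the toy hidden states and the choice of sparsemax are assumptions made for illustration and not the authors' code.

```python
import numpy as np

def softmax(scores):
    """Dense weights: every entry is strictly positive."""
    z = np.asarray(scores, dtype=float)
    e = np.exp(z - z.max())              # subtract the max for numerical stability
    return e / e.sum()

def sparsemax(scores):
    """Euclidean projection of the scores onto the probability simplex."""
    z = np.asarray(scores, dtype=float)
    z_sorted = np.sort(z)[::-1]          # descending order
    cumsum = np.cumsum(z_sorted)
    ks = np.arange(1, len(z) + 1)
    support = 1.0 + ks * z_sorted > cumsum
    k = ks[support][-1]                  # size of the support
    tau = (cumsum[k - 1] - 1.0) / k      # threshold
    return np.maximum(z - tau, 0.0)

# Toy encoder hidden states (T_l = 3, d_hid = 2) and one decoder hidden state.
h_enc = np.array([[1.0, 0.0], [0.8, 0.0], [-1.0, 0.0]])
h_t = np.array([[1.0, 0.0]])

scores = (h_enc @ h_t.T).ravel()         # Eq. (4): dot-product scores
a_dense = softmax(scores)                # Eq. (5): all three weights are > 0
a_sparse = sparsemax(scores)             # the least relevant step gets exactly 0
context = a_sparse @ h_enc               # weighted sum of encoder hidden states
```

Both weight vectors sum to one, but only the sparse one removes the third step from the weighted sum entirely, which is what makes the resulting attention map easier to read.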
IV. TEMPORAL CONVOLUTIONAL ATTENTION NEURAL NETWORKS

A. Motivation and Novelty

TCAN aims to:

1) Enable large receptive fields without increasing the number of convolutional layers.
2) Focus on the timesteps from the input sequence that are important for prediction and ignore the irrelevant ones to improve accuracy.
3) Provide visualization of the most relevant timesteps to facilitate understanding and interpretability of the results.

Although TCNN [8]–[10] uses exponentially dilated convolutions to extend the receptive field, if the input sequence is long, it may need many convolutional layers, which increases the complexity and training time.

Given a TCNN with $n_L$ temporal convolutional layers, convolutional filters of size k and dilation factor $d_l = 2^l$, the effective history of layer l ($l \in \mathbb{Z}$, $0 \le l < n_L$) that is used by TCNN to generate predictions is $(k-1) \times d_l$ [9]. The number of historical input steps that TCNN could use to make predictions is the sum of the effective histories of all convolutional layers: $\sum_{l=0}^{n_L-1} (k-1) \times d_l$.

It can be shown that to have a receptive field covering an input sequence with length $T_l$, TCNN needs at least $n_L = \lceil \log_2(\frac{T_l}{k-1} + 1) \rceil$ convolutional layers:

$$
\begin{aligned}
\sum_{l=0}^{n_L-1} (k-1) \times d_l &\ge T_l \\
\sum_{l=0}^{n_L-1} (k-1) \times 2^l &\ge T_l \\
\sum_{l=0}^{n_L-1} 2^l &\ge \frac{T_l}{k-1} \\
2^{n_L} - 1 &\ge \frac{T_l}{k-1} \\
2^{n_L} &\ge \frac{T_l}{k-1} + 1 \\
n_L &\ge \log_2\Big(\frac{T_l}{k-1} + 1\Big)
\end{aligned}
\tag{6}
$$

In our case study for solar power forecasting, the length of the input sequence (previous history) is 20 or 24 steps per day, and we use a small kernel with a filter size of k = 3 to focus on the local content [9]. This means that a TCNN with at least 4 convolutional layers ($n_L = \lceil \log_2(\frac{24}{3-1} + 1) \rceil = 4$) is needed to take into account all input steps. When the size of the input sequence increases (e.g. due to a different data granularity or the need to use more previous timesteps), more convolutional layers will be needed. This increases the complexity of the architecture, may lead to overfitting and also increases the training time.

Below we present our proposed approach, TCAN, which uses an attention mechanism to enable an extended receptive field without adding more temporal convolutional layers.

The attention mechanism also allows TCAN to focus on the important input timesteps and ignore the irrelevant ones when making predictions, which may improve accuracy. In addition, the attention layer is used to provide visualization of the most relevant timesteps for the prediction of each instance. This enables interpretability and justification of the results, which is important for real-world time series forecasting applications.

B. Model Architecture

As illustrated in Fig. 3, TCAN consists of three parts: 1) temporal convolution layers, 2) sparse attention layer and 3) output layer.

Fig. 3. TCAN architecture (the inputs y0;x1, y1;x2, ..., yt-1;xt pass through temporal convolution layers with dilation 1 and 2 to give h1, h2, ..., ht, followed by the sparse attention layer and the output layer producing yt).

TCAN employs a hierarchical convolutional architecture to encode the input sequence and extract temporal patterns as latent variables. The latent variables are then used by an attention mechanism to learn the most relevant features (previous timesteps) and generate the final prediction. The latent variables encode information from the whole input window, enabling large receptive fields without adding more convolutional layers. In addition, the attention mechanism makes it possible to visualise the most important timesteps for each instance to facilitate understanding and interpretability of the results.

TCAN uses the benefits of the sparse attention mechanism to: 1) access all previous timesteps without the requirement of a deep architecture, 2) focus on the important input timesteps and ignore the irrelevant ones when making predictions, and 3) provide visualization of the most relevant timesteps for the prediction.

Temporal Convolution Layers. Firstly, TCAN extracts the temporal latent factors $h_{t-T_l:t}$ via the multiple dilated temporal convolutional layers (TC) from the historical data within the input window $T_l$ ($y_{t-T_l:t}$, $x_{t-T_l:t}$) as:

$$
h_{t-T_l:t} = \mathrm{TC}(y_{t-T_l:t}, x_{t-T_l:t}) \tag{7}
$$

where the extracted latent factors encode all information of the input sequence within the rolling window.
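
As a quick numerical check of the layer-count bound in Eq. (6), which motivates replacing extra convolutional layers with attention, the following sketch computes the effective history of a TCNN with dilations $d_l = 2^l$ and the smallest number of layers that covers a given input length. The function names are hypothetical and the values used are the ones quoted in the text (k = 3, input lengths 20 and 24).

```python
import math

def effective_history(n_layers, k):
    """Sum of (k - 1) * 2**l over layers l = 0 .. n_layers - 1."""
    return sum((k - 1) * 2**l for l in range(n_layers))

def min_layers(T_l, k):
    """Smallest number of layers whose effective history covers T_l input steps."""
    n = 0
    while effective_history(n, k) < T_l:
        n += 1
    return n

def min_layers_closed_form(T_l, k):
    """Closed form from Eq. (6): ceil(log2(T_l / (k - 1) + 1))."""
    return math.ceil(math.log2(T_l / (k - 1) + 1))

history_3 = effective_history(3, k=3)    # 14 steps: too short for 20 or 24 inputs
history_4 = effective_history(4, k=3)    # 30 steps: covers both input lengths
layers_sanyo = min_layers(20, k=3)       # 4
layers_solar = min_layers(24, k=3)       # 4
```

Because the effective history roughly doubles with each added layer, longer input windows need only logarithmically more layers, but each one still adds parameters and training time.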
Sparse Attention Layer. The sparse attention layer takes the temporal latent factors ($h_{t-T_l:t}$) as input and generates the attention vector ($\tilde{h}_t$) that is used to make the prediction. Standard attention scores used in Transformer and RNN architectures are computed by the softmax function. However, softmax never assigns a probability of zero to any previous timestep, so it never fully rules out the unimportant parts of the input sequence [21]. In sequence modelling tasks, including solar power forecasting, the future timestep is typically strongly related to a few historical timesteps and it is desirable to increase the focus on them. For example, the solar power at step t is more related to the solar power at the previous hours on the same day and at the same time on the previous day (step $t - T_l$) than to the other timesteps. This is supported by Fig. 4, which shows the partial autocorrelation plot of the solar series for all data sets for 30 lags; we can see two strong linear dependencies: the first is at lag 1 and the second is at lag 20 for Sanyo and Hanergy and lag 24 for Solar.

Recent studies have developed variations such as sparse attention which increase the focus on the most relevant input timesteps. Specifically, we applied α-entmax attention [21], defined as:

$$
\begin{aligned}
a_{t-T_l:t-1} &= \alpha\text{-entmax}(h_{t-T_l:t-1} \cdot h_t^{\top}) \\
&= \mathrm{ReLU}\big((\alpha - 1) \times (h_{t-T_l:t-1} \cdot h_t^{\top}) - \tau \mathbf{1}\big)^{1/(\alpha-1)}
\end{aligned}
\tag{8}
$$

where $\tau$ is the Lagrange multiplier, $\mathbf{1}$ is the all-one vector and $\alpha$ is a hyperparameter. α-entmax maps latent variables into sparse attention scores $a_{t-T_l:t-1} \in \mathbb{R}^{T_l \times 1}$ based on the dot product similarity between the historical steps ($h_{t-T_l:t-1} \in \mathbb{R}^{T_l \times d_{hid}}$) and the current step ($h_t \in \mathbb{R}^{1 \times d_{hid}}$).

Note that α-entmax is equivalent to softmax when α = 1 and to sparsemax when α = 2 [20]. In TCAN we set α to 1.5 for a balance between softmax and sparsemax as in [21].

Then we employ concatenation to combine the information from the attention context vector $c_t$ and the target hidden state $h_t$ to produce the attention vector $\tilde{h}_t$. The context vector is the dot product between the attention scores and the hidden states of the historical steps and can be considered as the weighted sum of the historical hidden states:

$$
\tilde{h}_t = [c_t \circ h_t] = [(a_{t-T_l:t-1}^{\top} \cdot h_{t-T_l:t-1}) \circ h_t] \tag{9}
$$

Output Layer. The output layer uses the attention vectors to make the final predictions. In this work, we assume that the data follows a Gaussian distribution, which is commonly used in real-world time series modelling [4]. We transform the attention vector into the forecasting results, namely the mean and variance of the distribution, as shown in Eq. (10) and (11). In Eq. (11), the softplus function guarantees that the variance is always positive.

$$
\hat{y}_t = \mathrm{linear}(\tilde{h}_t) \tag{10}
$$

$$
\sigma_t^2 = \mathrm{softplus}(\mathrm{linear}(\tilde{h}_t)) = \log(1 + \exp(\mathrm{linear}(\tilde{h}_t))) \tag{11}
$$

Eq. (10) and (11) form the Gaussian distribution $\mathcal{N}(\hat{y}_t, \sigma_t^2)$, and predictions can be sampled from this distribution. The ρ-quantile outputs are generated via the inverse cumulative probability distribution: $\hat{y}_t = F_t^{-1}(\rho)$.

Loss Function. Finally, the parameters of TCAN are optimised by minimising the loss function shown in Eq. (12) below, where $\hat{y}_t$ is the point forecast. To provide both accurate point and probabilistic forecasts, it combines the Mean Absolute Error (MAE) and the Negative Log-Likelihood (NLL) using the regularization parameter a. A higher a increases the weight of the probabilistic forecast; we used a = 0.5.

$$
\begin{aligned}
L(\hat{y}_{T_l+1:T}, \sigma^2_{T_l+1:T}, y_{T_l+1:T}, a)
&= a \times \mathrm{NLL}(\hat{y}_{T_l+1:T}, \sigma^2_{T_l+1:T}, y_{T_l+1:T}) + \mathrm{MAE}(\hat{y}_{T_l+1:T}, y_{T_l+1:T}) \\
&= \frac{a}{2T_h} \times \Big( T_h \log(2\pi) + \sum_{t=T_l+1}^{T} \log \sigma_t^2 + \sum_{t=T_l+1}^{T} (y_t - \hat{y}_t)^2 \sigma_t^{-2} \Big) + \frac{1}{T_h} \sum_{t=T_l+1}^{T} |y_t - \hat{y}_t|
\end{aligned}
\tag{12}
$$

In summary, the information flow in TCAN includes several temporal convolution layers, a sparse attention layer and an output layer; the network is trained end-to-end by optimizing the loss function from Eq. (12).

V. EXPERIMENTAL SETUP

A. Methods Used for Comparison

We compare the performance of TCAN with five state-of-the-art deep learning models (DeepAR, N-BEATS-G, N-BEATS-I, LogSparse Transformer and TCNN) and a persistence model.

• DeepAR [4] is a widely used sequence-to-sequence probabilistic forecasting model.
• N-BEATS [2] is based on backward and forward residual links and stacks of fully connected layers. N-BEATS-G provides generic forecasting results, while N-BEATS-I provides interpretable results by decomposing the time series into trend and seasonality. We introduced covariates to N-BEATS at the input of each block to facilitate multivariate series forecasting.
• LogSparse Transformer [5] is a recently proposed variation of the Transformer architecture for time series forecasting. It is denoted as "LS Transformer" in Table II.
• TCNN [8]–[10] is a novel convolutional architecture and has been successfully applied to solar power forecasting [8]. As we consider probabilistic forecasts, we selected the autoregressive probabilistic TCNN from [10] for the comparison and used a fixed-length input sequence. TCNN-3 and TCNN-4 correspond to TCNN with 3 and 4 temporal convolutional layers respectively.
• Persistence is a typical baseline in forecasting. It considers the time series of the previous day as the prediction for the next day.
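
The persistence baseline in the last bullet is simple enough to state in a few lines. The sketch below is one possible reading of it, assuming a one-dimensional array of past values and a fixed number of steps per day (20 for Sanyo and Hanergy, 24 for Solar); the function name and the made-up data are illustrative only.

```python
import numpy as np

def persistence_forecast(history, steps_per_day):
    """Predict the next day by repeating the last observed day."""
    history = np.asarray(history, dtype=float)
    if history.shape[0] < steps_per_day:
        raise ValueError("need at least one full day of history")
    return history[-steps_per_day:].copy()

steps_per_day = 20                                 # half-hourly data, 7am to 5pm
day1 = np.linspace(0.0, 1.0, steps_per_day)        # made-up values for the last observed day
day2 = day1 + 0.1                                  # made-up values for the day to predict
forecast = persistence_forecast(day1, steps_per_day)
mae = np.abs(day2 - forecast).mean()               # point-forecast error of the baseline
```

Since the baseline returns a single value per step and no distribution, only point-forecast measures can be computed for it.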
Fig. 4. Partial autocorrelation for (a) Sanyo, (b) Hanergy and (c) Solar (lags 0 to 30).

B. Data Split and Hyperparameter Tuning

All models were implemented with PyTorch 1.6 on a Tesla P100 16GB GPU under a Linux environment. The deep learning models were optimized by mini-batch gradient descent with the Adam optimizer and a maximum of 200 epochs. We used Bayesian optimization for hyperparameter search with a maximum of 20 iterations. The models used for comparison were tuned based on the recommendations of the authors in the respective papers. The hyperparameters which obtained the minimum loss on the validation set were selected and used to evaluate the performance on the test set.

Following the experimental setup in [14] and [5], we used the following training, validation and test split: for Sanyo and Hanergy, the data from the last year as test set, the second last year as validation set for early stopping and the remaining data (5 years for Sanyo and 4 years for Hanergy) as training set; for Solar, the last week of data as test set (from 25/08/2006) and the week before as validation set. For all data sets, the data preceding the validation set is split in the same way into three subsets and the corresponding validation set is used to select the best hyperparameters.

For TCAN, the learning rate λ is fixed to 0.005 for all data sets, the batch size $n_{batch}$ is 256 for the Sanyo set and 512 for the Hanergy and Solar sets, the regularization parameter a and the entmax attention α are set to 0.5 and 1.5 respectively, the dropout rate δ is chosen from {0, 0.1, 0.2} and the kernel size $d_k$ from {3, 4}; the convolutional layer size $d_{hid}$ and the number of convolutional layers are chosen from the descending-sorted permutations of {20, 16, 12, 8, 6, 4} and from {2, 3} respectively.

For TCNN, all settings are the same as for TCAN except that we consider TCNN with both 3 and 4 convolutional layers to allow for comparable receptive fields with TCAN. The selected best hyperparameters for TCAN and TCNNs with 3 (TCNN-3) and 4 (TCNN-4) layers are listed in Table I and used for the evaluation on the test set.

TABLE I. HYPERPARAMETERS FOR TCAN AND TCNN

Model    Dataset   δ     d_hid          d_k
TCNN-3   Sanyo     0.1   [20,16,8]      3
TCNN-3   Hanergy   0.1   [16,12,6]      3
TCNN-3   Solar     0.1   [16,12,8]      3
TCNN-4   Sanyo     0.2   [20,16,12,8]   3
TCNN-4   Hanergy   0.1   [12,8,6,4]     3
TCNN-4   Solar     0.1   [16,12,8,6]    3
TCAN     Sanyo     0.1   [12,6]         2
TCAN     Hanergy   0.1   [12,8,4]       3
TCAN     Solar     0.1   [12,6,4]       3

C. Evaluation Measures

Following [4], [23], we report the standard ρ0.5 and ρ0.9-quantile losses. Note that ρ0.5 is equivalent to the Mean Absolute Percentage Error (MAPE) [24]. Given the ground truth y and the ρ-quantile of the predicted distribution ŷ, the ρ-quantile loss is given by:

$$
QL_\rho(y, \hat{y}) = \frac{2 \times \sum_t P_\rho(y_t, \hat{y}_t)}{\sum_t |y_t|}, \qquad
P_\rho(y, \hat{y}) =
\begin{cases}
\rho (y - \hat{y}) & \text{if } y > \hat{y} \\
(1 - \rho)(\hat{y} - y) & \text{otherwise}
\end{cases}
\tag{13}
$$

VI. RESULTS AND DISCUSSION

The ρ0.5 and ρ0.9-losses are shown in Table II. Since N-BEATS and Persistence do not produce probabilistic forecasts, only the ρ0.5-loss (equivalent to MAE) is reported for them.

TABLE II. ACCURACY RESULTS: ρ0.5/ρ0.9-LOSS. * DENOTES RESULTS FROM [5].

Model            Sanyo         Hanergy       Solar
Persistence      0.154/-       0.242/-       0.256/-
DeepAR           0.070/0.031   0.092/0.045   0.222*/0.093
LS Transformer   0.067/0.036   0.124/0.066   0.210*/0.082
N-BEATS-I        0.091/-       0.154/-       0.215/-
N-BEATS-G        0.077/-       0.132/-       0.212/-
TCNN-3           0.066/0.032   0.088/0.045   0.230/0.088
TCNN-4           0.069/0.031   0.078/0.041   0.222/0.080
TCAN             0.062/0.031   0.068/0.035   0.209/0.081

Overall, the most accurate model is TCAN: it outperforms all other methods for both point (ρ0.5) and probabilistic (ρ0.9) forecasts on all datasets except for one case, ρ0.9 for Solar, where it is ranked second after TCNN-4. For the point forecasts, the second-best performing model is LogSparse Transformer, followed by TCNN-4 and TCNN-3 (equal rank), then DeepAR, N-BEATS-G, N-BEATS-I and finally Persistence. For the probabilistic forecasts, TCAN and TCNN-4 are equally first, followed by TCNN-3, DeepAR and LogSparse Transformer. Hence, overall TCNN-4 is the second-best performing model.
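
The ρ-quantile loss of Eq. (13), on which the comparison above is based, can be sketched as follows. The sketch also shows how a ρ-quantile forecast could be read off a Gaussian output $\mathcal{N}(\hat{y}_t, \sigma_t^2)$ through the inverse cumulative distribution. The function names and the toy numbers are assumptions for illustration; the authors' evaluation code is not given in the text.

```python
import numpy as np
from statistics import NormalDist

def quantile_loss(y, y_hat, rho):
    """rho-quantile loss of Eq. (13), normalised by the sum of |y|."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    diff = y - y_hat
    # rho * (y - y_hat) when under-predicting, (1 - rho) * (y_hat - y) otherwise
    pinball = np.where(diff > 0, rho * diff, (1.0 - rho) * (-diff))
    return 2.0 * pinball.sum() / np.abs(y).sum()

def gaussian_quantile(mean, variance, rho):
    """rho-quantile of N(mean, variance) via the inverse CDF."""
    return mean + np.sqrt(variance) * NormalDist().inv_cdf(rho)

y_true = np.array([1.0, 2.0, 3.0, 4.0])            # toy ground truth
y_pred = np.array([1.5, 2.0, 2.0, 5.0])            # toy quantile forecasts

loss_50 = quantile_loss(y_true, y_pred, rho=0.5)   # equals sum|y - y_hat| / sum|y|
loss_90 = quantile_loss(y_true, y_pred, rho=0.9)   # penalises under-prediction more
median = gaussian_quantile(2.0, 4.0, rho=0.5)      # the median of a Gaussian is its mean
```

For ρ = 0.5 the loss reduces to the summed absolute error divided by the summed absolute ground truth, which is why a single number can serve as the point-forecast measure for models without a predictive distribution.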
Fig. 5. TCAN case study: (1) actual vs predicted values and (2) attention map for two samples from each dataset. First, second and third rows correspond to the Sanyo, Hanergy and Solar datasets respectively. Panels (a), (c), (e), (g), (i) and (k) plot Power (kW) against Time (hours), showing the ground truth, the forecasts and the 95% interval; panels (b), (d), (f), (h), (j) and (l) are attention maps with Encoder steps on the y-axis and Decoder steps on the x-axis.

By comparing the two TCNN models, we can see that TCNN-4 outperformed TCNN-3 for both point and probabilistic forecasts on all datasets except for Sanyo for ρ0.5. While the receptive field of TCNN-4 is able to cover all values from the input sequence (20 for Sanyo and Hanergy and 24 for Solar), the receptive field of TCNN-3 covers only 14 steps. This shows that enabling a larger receptive field in TCNN-4 was beneficial.

TCAN is more accurate than both TCNN-4 and TCNN-3, which shows the effectiveness of the sparse attention mechanism for improving accuracy and extending the receptive field without adding more convolutional layers. The receptive fields of both TCAN and TCNN-4 can cover all values from the input sequence, but TCAN uses fewer convolutional layers than TCNN-4.

Fig. 5 illustrates TCAN’s forecasting results for two consecutive days from the test set of each dataset: (i) actual vs predicted values for each day and (ii) the corresponding sparse attention map. The attention map shows the pairwise attention scores, which represent the importance of the previous time series steps (y-axis) in predicting the future steps (x-axis). The plots show that TCAN is able to model the solar power series accurately, and the attention maps show that: 1) the dependency between future and past steps is sparse and 2) accessing longer previous history is important. For example, all maps show high attention scores for some early time steps. Fig. 5 (d) shows an extreme case: the first future prediction is determined by the second input step only.

The attention-map visualization is useful for understanding the importance of the input features for each instance and can be used for explanation and justification of decisions to increase trust in the system in practical applications.

We also compare the training speed of TCAN and TCNN-4; both models can cover all input steps via their receptive fields. Both are trained on the same device, and the average elapsed time per batch and standard deviation are reported in Fig. 6. TCAN is considerably faster than TCNN-4 on all datasets because it has fewer temporal convolutional layers and trainable parameters.

[Fig. 6 plot: bar chart of elapsed time (milliseconds) for TCAN and TCNN on the Sanyo, Hanergy and Solar datasets. Bar labels in the extracted text: 31630, 17594, 15368, 7646, 8977, 7425.]

Fig. 6. Comparison of training time of TCAN and TCNN

Overall, the superior performance of TCAN indicates its effectiveness for capturing sparse dependency between future and past steps and enabling an extended receptive field without a deep architecture. TCAN also provides explainable results by showing which timesteps are most important for the prediction.

VII. CONCLUSION

In this work, we present TCAN, a new approach for time series forecasting. TCAN employs multiple temporal convolutional layers to learn temporal patterns and a sparse attention layer to enable an extended receptive field without adding more layers. The sparse attention layer uses the latent factors generated by the temporal convolutional layers and identifies the important time steps to produce the attention vectors for the output layer and compute the final forecasts. The performance of TCAN is evaluated on three solar power data sets as a case study. The results show that TCAN outperforms the state-of-the-art deep learning models DeepAR, LogSparse Transformer, N-BEATS and TCNN and a persistent baseline in terms of accuracy. TCAN requires fewer convolutional layers than TCNN to cover the input sequence and is also much faster to train. The sparse attention maps facilitate understanding and interpretability of the results by showing the most relevant timesteps for the prediction of each instance.

In future work, we will investigate (1) the importance of the input time steps identified by the attention mechanism and the covariates to further improve interpretability, and (2) the application of TCAN to other time series forecasting tasks.

REFERENCES

[1] H. T. Pedro and C. F. Coimbra, “Assessment of forecasting techniques for solar power production with no exogenous inputs,” Solar Energy, vol. 86, no. 7, pp. 2017–2028, 2012.
[2] B. N. Oreshkin, D. Carpov, N. Chapados, and Y. Bengio, “N-BEATS: Neural basis expansion analysis for interpretable time series forecasting,” in Proceedings of the International Conference on Learning Representations (ICLR), 2020.
[3] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Proceedings of the Conference on Neural Information Processing Systems (NeurIPS), 2017.
[4] D. Salinas, V. Flunkert, J. Gasthaus, and T. Januschowski, “DeepAR: Probabilistic forecasting with autoregressive recurrent networks,” International Journal of Forecasting, vol. 36, no. 3, pp. 1181–1191, 2020.
[5] S. Li, X. Jin, Y. Xuan, X. Zhou, W. Chen, Y.-X. Wang, and X. Yan, “Enhancing the locality and breaking the memory bottleneck of Transformer on time series forecasting,” in Proceedings of the Conference on Neural Information Processing Systems (NeurIPS), 2019.
[6] M. Bińkowski, G. Marti, and P. Donnat, “Autoregressive convolutional neural networks for asynchronous time series,” in Proceedings of the Time Series Workshop at International Conference on Machine Learning (ICML), 2017.
[7] I. Koprinska, D. Wu, and Z. Wang, “Convolutional neural networks for energy time series forecasting,” in Proceedings of the International Joint Conference on Neural Networks (IJCNN), 2018.
[8] Y. Lin, I. Koprinska, and M. Rana, “Temporal convolutional neural networks for solar power forecasting,” in Proceedings of the International Joint Conference on Neural Networks (IJCNN), 2020.
[9] S. Bai, J. Z. Kolter, and V. Koltun, “An empirical evaluation of generic convolutional and recurrent networks for sequence modeling,” arXiv preprint arXiv:1803.01271, 2018.
[10] Y. Chen, Y. Kang, Y. Chen, and Z. Wang, “Probabilistic forecasting with temporal convolutional neural network,” Neurocomputing, vol. 399, pp. 491–501, 2020.
[11] D1, “Sanyo dataset,” [Link] source/alice-springs/dka-m4-b-phase, 2020.
[12] D2, “Hanergy dataset,” [Link] springs/dka-m16-b-phase, 2020.
[13] D3, “Solar dataset,” [Link] 2014.
[14] Y. Lin, I. Koprinska, and M. Rana, “SpringNet: Transformer and Spring DTW for time series forecasting,” in Proceedings of the International Conference on Neural Information Processing (ICONIP), 2020.
[15] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “Wavenet: A generative model for raw audio,” arXiv preprint arXiv:1609.03499, 2016.
[16] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. J. Lang, “Phoneme recognition using time-delay neural networks,” IEEE Transactions on Acoustics, Speech, and Signal Processing, 1989.
[17] F. Yu and V. Koltun, “Multi-scale context aggregation by dilated convolutions,” arXiv preprint arXiv:1511.07122, 2015.
[18] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” arXiv preprint arXiv:1512.03385, 2015.
[19] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in Proceedings of the International Conference on Learning Representations (ICLR), 2016.
[20] A. Martins and R. Astudillo, “From softmax to sparsemax: A sparse model of attention and multi-label classification,” in Proceedings of The International Conference on Machine Learning (ICML), 2016.
[21] B. Peters, V. Niculae, and A. F. T. Martins, “Sparse sequence-to-sequence models,” in Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2019.
[22] G. M. Correia, V. Niculae, and A. F. T. Martins, “Adaptively sparse transformers,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing and the International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019.
[23] S. S. Rangapuram, M. W. Seeger, J. Gasthaus, L. Stella, Y. Wang, and T. Januschowski, “Deep state space models for time series forecasting,” in Proceedings of the Conference on Neural Information Processing Systems (NeurIPS), 2018.
[24] H.-F. Yu, N. Rao, and I. S. Dhillon, “Temporal regularized matrix factorization for high-dimensional time series prediction,” in Proceedings of the Conference on Neural Information Processing Systems (NeurIPS), 2016.
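
As a supplementary illustration of the receptive-field comparison between TCNN-3 and TCNN-4 in the results discussion, the sketch below computes the receptive field of a stack of causal dilated convolutions. The kernel size of 3, the doubling dilation schedule (1, 2, 4, ...) and one convolution per layer are assumptions chosen for illustration; this section does not state the configuration actually used, so the numbers should not be read as reproducing the reported ones.

```python
def receptive_field(kernel_size, dilations, convs_per_layer=1):
    """Input steps (including the current one) visible to the top layer
    of a stack of causal dilated convolutions."""
    return 1 + convs_per_layer * (kernel_size - 1) * sum(dilations)

def layers_needed(kernel_size, input_length, convs_per_layer=1):
    """Smallest number of layers with dilations 1, 2, 4, ... whose
    receptive field covers `input_length` steps."""
    dilations = []
    while receptive_field(kernel_size, dilations, convs_per_layer) < input_length:
        dilations.append(2 ** len(dilations))
    return len(dilations)

# Assumed configuration: kernel size 3, dilations doubling per layer.
rf_three_layers = receptive_field(3, [1, 2, 4])
rf_four_layers = receptive_field(3, [1, 2, 4, 8])
layers_for_20 = layers_needed(3, 20)   # input length used for Sanyo and Hanergy
layers_for_24 = layers_needed(3, 24)   # input length used for Solar
```

Under these assumptions, three layers fall short of input lengths of 20 and 24 while four layers cover them, which matches the qualitative point of the comparison: a purely convolutional model has to grow deeper to see the whole input, whereas an attention layer can reach all steps without extra convolutional layers.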
