TCAN: Advanced Time Series Forecasting
Abstract: Temporal Convolutional Neural Networks (TCNNs) have been applied to various sequence modelling tasks, including time series forecasting. However, TCNNs may require many convolutional layers if the input sequence is long, and they are not able to provide interpretable results. In this paper, we present TCAN, a novel deep learning approach that combines an attention mechanism with temporal convolutions for probabilistic forecasting, and demonstrate its performance in a case study for solar power forecasting. TCAN uses the hierarchical convolutional structure of TCNN to extract temporal dependencies and then uses sparse attention to focus on the important timesteps. The sparse attention layer of TCAN enables an extended receptive field without requiring a deeper architecture and allows for interpretability of the forecasting results. An evaluation using three large solar power data sets demonstrates that TCAN outperforms several state-of-the-art deep learning forecasting models, including TCNN, in terms of accuracy. TCAN requires fewer convolutional layers than TCNN for an extended receptive field, is faster to train and is able to visualize the most important timesteps for the prediction.

Index Terms: time series forecasting, deep learning, temporal convolutional neural network, sparse attention

I. INTRODUCTION
Time series forecasting is an essential task in many areas, e.g. in industry (predicting electricity demand), in finance (predicting stock prices and exchange rates), in retail (predicting sales, supply and demand) and in health (predicting immune response, disease progression and hospital length of stay). Statistical methods such as linear regression, ARIMA and exponential smoothing [1] are well-established and widely used by industry forecasters. However, they require domain knowledge for model selection, fit each time series independently and are not able to infer shared patterns from related time series [2], [3].

On the other hand, deep learning methods have been increasingly applied to time series forecasting, showing very promising results. They are able to learn from raw data with less domain knowledge and feature engineering, and can extract complex patterns, including shared patterns, from related time series. For example, Salinas et al. [4] proposed DeepAR, a probabilistic forecasting model based on Long Short Term Memory (LSTM) networks. The Transformer architecture [3] is a new sequence model that uses only the attention mechanism for data processing, without any recurrent or convolutional layers, and can access any part of the historical sequence regardless of temporal distance. Li et al. [5] proposed the LogSparse Transformer, which addresses the locality-agnostic and memory bottleneck problems of the Transformer.

Another prominent class of deep learning methods, convolutional neural networks, has also been applied to time series forecasting. They are attractive due to their ability to represent repeated patterns in the time series via convolutional filters and to extract useful features from raw data without prior knowledge or feature engineering [6]–[8].

Temporal Convolutional Neural Networks (TCNNs) [9] are specifically designed for sequence modeling tasks and have been applied to different types of data: image, language and music [9], solar power forecasting [8], and retail, electricity and traffic [10]. Probabilistic TCNNs were proposed in [10]. TCNN is a hierarchical architecture consisting of several convolutional layers. It uses causal convolutions, dilated convolutions and residual connections to enable a larger receptive field, reduce the unstable gradient problem and boost training speed [9]. However, when the input sequence is long, TCNN may need many temporal convolutional layers in order to have a sufficiently large receptive field that covers the input sequence. In addition, TCNN is a black-box architecture, not able to provide interpretable results. Many applications of time series forecasting involve critical decisions; providing explanations improves the confidence of decision makers and is hence highly desirable. In this paper, we present a new approach to address these issues.

The contributions of our work are as follows:
1) We propose the Temporal Convolutional Attention Neural Network (TCAN), which employs a convolutional architecture and an attention mechanism. TCAN learns the temporal dependency via a hierarchical convolutional architecture and generates forecasting results with a sparse attention layer. The use of the sparse attention layer enables TCAN to access all historical input steps regardless of the sequence length, to focus on the most important timesteps from the input sequence and to provide visualization of the results for interpretability.
2) We evaluate TCAN for time series forecasting on three real-world solar power data sets (two sets contain multivariate series and one set contains related series). The results show that TCAN outperforms the state-of-the-art deep learning models DeepAR, LogSparse Transformer, N-BEATS and TCNN, and a persistence baseline. The attention mappings visualize the important timesteps for interpretability of the results and demonstrate that accessing longer previous history is important. Our results show that TCAN with the sparse attention layer requires fewer convolutional layers than TCNN for an extended receptive field and performs better than TCNN in terms of accuracy and training speed.
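The point that a TCNN needs more layers as the input grows can be illustrated with a small calculation. The sketch below is not the authors' implementation: it assumes one causal convolution per layer and dilations that double from layer to layer (1, 2, 4, ...), and the function names and the kernel size are hypothetical choices made for illustration.

```python
from typing import Sequence


def receptive_field(kernel_size: int, dilations: Sequence[int]) -> int:
    """Input steps visible to one output of a stack of causal convolutions.

    Assumption: one convolution per layer, so each layer widens the
    receptive field by (kernel_size - 1) * dilation steps.
    """
    return 1 + (kernel_size - 1) * sum(dilations)


def layers_needed(seq_len: int, kernel_size: int) -> int:
    """Smallest number of layers (dilations 1, 2, 4, ...) covering seq_len steps."""
    n_layers = 0
    while receptive_field(kernel_size, [2 ** i for i in range(n_layers)]) < seq_len:
        n_layers += 1
    return n_layers


if __name__ == "__main__":
    # Hypothetical input lengths; the layer count grows with the logarithm
    # of the sequence length when dilations double.
    for seq_len in (24, 96, 384):
        print(seq_len, layers_needed(seq_len, kernel_size=2))
```

Under these assumptions the depth grows only logarithmically with the input length, but it still grows. An attention layer that reads all input steps directly avoids that growth.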
II. CASE STUDY: SOLAR POWER FORECASTING
As a case study we consider solar power forecasting: given an input sequence of previous PV solar power generation data (e.g. hourly or half-hourly), predict the solar power for the next time step.

Solar power forecasting is needed for optimal scheduling of generators and for the integration of solar into the electricity grid. The penetration of solar energy into the electricity grid is rapidly increasing, but since solar has a variable and intermittent nature, there is a need for accurate forecasting in order to ensure stability and efficient operation of the electricity grid.

[Figure 1: stacked convolutional layers with dilation=1, dilation=2 and dilation=4; inputs (y0; x1), (y1; x2), ..., (yt-2; xt-1), (yt-1; xt); output yt.]
Fig. 1. TCNN architecture

and calendar features $\{Z_{i,1:T_l+T_h}\}_{i=1}^{N}$, while the covariates for the Solar dataset include calendar features only.

Task: Predict $\{Y_{i,T_l+1:T_l+T_h}\}_{i=1}^{N}$, the PV power for the next $T_h$ time steps after $T_l$.

The input of TCAN at step $t$ is the concatenation of $y_{i,t-1}$ and $x_{i,t}$. TCAN produces the probability distribution of future values, given the past history:
solar series for all data sets for 30 lags; we can see two strong linear dependencies: the first is at lag 1 and the second is at lag 20 for Sanyo and Hanergy and lag 24 for Solar.

Recent studies have developed variations such as sparse attention which increase the focus on the most relevant input timesteps. Specifically, we applied α-entmax attention [21], defined as:

$$
\cdots + \mathrm{MAE}(\hat{y}_{T_l+1:T},\, y_{T_l+1:T})
= -\frac{1}{2T_h} \times \Big( T_h \log(2\pi) + \sum_{t=T_l+1}^{T} \log \sigma_t^{2} + \sum_{t=T_l+1}^{T} (y_t - \hat{y}_t)^2 \sigma_t^{-2} \Big) + \frac{1}{T_h} \sum_{t=T_l+1}^{T} |y_t - \hat{y}_t| \qquad (12)
$$
Fig. 4. Partial autocorrelation for (a) Sanyo, (b) Hanergy and (c) Solar.
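A partial autocorrelation analysis of the kind summarized in Fig. 4 can be reproduced with a few lines of NumPy. The sketch below uses the Durbin-Levinson recursion on the sample autocorrelations; the function names are hypothetical, and the demonstration series is a synthetic AR(1) process rather than any of the solar data sets.

```python
import numpy as np


def sample_acf(x, nlags):
    """Sample autocorrelation for lags 0..nlags."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([np.dot(x[: len(x) - k], x[k:]) / denom for k in range(nlags + 1)])


def pacf_durbin_levinson(x, nlags):
    """Partial autocorrelation for lags 0..nlags via the Durbin-Levinson recursion."""
    r = sample_acf(x, nlags)
    pacf = np.zeros(nlags + 1)
    pacf[0] = 1.0
    phi_prev = np.zeros(0)  # AR coefficients of the previous order
    for k in range(1, nlags + 1):
        if k == 1:
            phi_kk = r[1]
            phi = np.array([phi_kk])
        else:
            num = r[k] - np.dot(phi_prev, r[k - 1:0:-1])
            den = 1.0 - np.dot(phi_prev, r[1:k])
            phi_kk = num / den
            phi = np.append(phi_prev - phi_kk * phi_prev[::-1], phi_kk)
        pacf[k] = phi_kk
        phi_prev = phi
    return pacf


if __name__ == "__main__":
    # Synthetic AR(1) series with coefficient 0.7 (illustrative data only).
    rng = np.random.default_rng(0)
    noise = rng.standard_normal(5000)
    series = np.zeros_like(noise)
    for t in range(1, len(series)):
        series[t] = 0.7 * series[t - 1] + noise[t]
    values = pacf_durbin_levinson(series, nlags=30)
    strongest = np.argsort(-np.abs(values[1:]))[:3] + 1
    print("lags with the largest partial autocorrelation:", strongest)
```

For a series with a daily cycle, the lag of the second strong peak would be expected to match the number of readings per day, which is the kind of pattern the text reports for lag 20 and lag 24.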
[Figure 5, panels (a)-(l). Forecast panels: Power (kW) against Time (hours), showing ground truth, forecasts and the 95% interval. Attention-map panels: Encoder steps (y-axis) against Decoder steps (x-axis), with a colour scale for the attention score.]
Fig. 5. TCAN case study: (1) actual vs predicted values and (2) attention map for two samples from each dataset. First, second and third rows correspond
to the Sanyo, Hanergy and Solar datasets respectively.
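The sparsity of the attention maps in Fig. 5 comes from the sparse attention transform. As a minimal sketch, the code below implements sparsemax [20], which is the α = 2 member of the α-entmax family [21] (α = 1 recovers softmax). It is not the general α-entmax and not the authors' code; the value of α used in TCAN is not stated in this section, and the attention scores in the example are made up.

```python
import numpy as np


def sparsemax(scores):
    """Euclidean projection of a score vector onto the probability simplex."""
    z = np.asarray(scores, dtype=float)
    z_sorted = np.sort(z)[::-1]              # scores in descending order
    cumsum = np.cumsum(z_sorted)
    ks = np.arange(1, z.size + 1)
    support = 1.0 + ks * z_sorted > cumsum   # True for entries kept non-zero
    k = ks[support][-1]                      # size of the support
    tau = (cumsum[k - 1] - 1.0) / k          # threshold subtracted from scores
    return np.maximum(z - tau, 0.0)


def softmax(scores):
    """Dense baseline: every input step receives a strictly positive weight."""
    z = np.asarray(scores, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()


if __name__ == "__main__":
    # Hypothetical attention scores of one future step over three past steps.
    scores = [1.0, 0.8, 0.1]
    print("softmax  :", softmax(scores))
    print("sparsemax:", sparsemax(scores))
```

With these scores sparsemax assigns exactly zero weight to the third step, whereas softmax spreads some weight over all steps. Exact zeros are what make an attention map readable as a statement about which past timesteps were used.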
By comparing the two TCNN models, we can see that TCNN-4 outperformed TCNN-3 for both point and probabilistic forecasts on all datasets except for Sanyo for ρ0.5. While the receptive field of TCNN-4 is able to cover all values from the input sequence (20 for Sanyo and Hanergy and 24 for Solar), the receptive field of TCNN-3 covers only 14 steps. This shows that enabling a larger receptive field in TCNN-4 was beneficial.

TCAN is more accurate than both TCNN-4 and TCNN-3, which shows the effectiveness of the sparse attention mechanism for improving accuracy and extending the receptive field without adding more convolutional layers. The receptive fields of both TCAN and TCNN-4 can cover all values from the input sequence but TCAN uses a smaller number of convolutional layers than TCNN-4.

Fig. 5 illustrates TCAN's forecasting results for two consecutive days from the test set of each dataset: (i) actual vs predicted values for each day and (ii) the corresponding sparse attention map. The attention map shows the pair attention scores which represent the importance of the previous time series steps (y-axis) in predicting the future steps (x-axis). The plots show that TCAN is able to model the solar power series accurately, and the attention maps show that: 1) the dependency between future and past steps is sparse and 2) accessing longer previous history is important. For example, all maps show high attention scores for some early time steps. Fig. 5 (d) shows an extreme case: the first future prediction is determined by the second input step only.

The attention-map visualization is useful to understand the importance of the input features for each instance and can be used for explanation and justification of decisions to increase trust in the system in practical applications.

We also compare the training speed of TCAN and TCNN-4; both models can cover all input steps via their receptive fields. Both are trained on the same device, and the average elapsed time per batch and standard deviation are reported in Fig. 6. TCAN is considerably faster than TCNN-4 on all datasets because it has fewer temporal convolutional layers and trainable parameters.

Overall, the superior performance of TCAN indicates its effectiveness for capturing sparse dependency between future and past steps and enabling an extended receptive field without a deep architecture. TCAN also provides explainable results by showing which timesteps are most important for the prediction.

VII. CONCLUSION
In this work, we present TCAN, a new approach for time series forecasting. TCAN employs multiple temporal
convolutional layers to learn temporal patterns and a sparse attention layer to enable an extended receptive field without adding more layers. The sparse attention layer uses the latent factors generated by the temporal convolutional layers and identifies the important time steps to produce the attention vectors for the output layer and compute the final forecasts.

The performance of TCAN is evaluated on three solar power data sets as a case study. The results show that TCAN outperforms the state-of-the-art deep learning models DeepAR, LogSparse Transformer, N-BEATS and TCNN and a persistence baseline in terms of accuracy. TCAN requires fewer convolutional layers than TCNN to cover the input sequence and is also much faster to train. The sparse attention maps facilitate understanding and interpretability of the results by showing the most relevant timesteps for the prediction of each instance.

In future work, we will investigate (1) the importance of the input time steps identified by the attention mechanism and the covariates to further improve interpretability, and (2) the application of TCAN to other time series forecasting tasks.

[Figure 6: bar chart of elapsed time (milliseconds) for TCAN and TCNN on the Sanyo, Hanergy and Solar datasets; bar labels in the extracted text: 31630, 17594, 15368, 7646, 8977, 7425.]
Fig. 6. Comparison of training time of TCAN and TCNN

REFERENCES
[1] H. T. Pedro and C. F. Coimbra, “Assessment of forecasting techniques for solar power production with no exogenous inputs,” Solar Energy, vol. 86, no. 7, pp. 2017–2028, 2012.
[2] B. N. Oreshkin, D. Carpov, N. Chapados, and Y. Bengio, “N-BEATS: Neural basis expansion analysis for interpretable time series forecasting,” in Proceedings of the International Conference on Learning Representations (ICLR), 2020.
[3] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Proceedings of the Conference on Neural Information Processing Systems (NeurIPS), 2017.
[4] D. Salinas, V. Flunkert, J. Gasthaus, and T. Januschowski, “DeepAR: Probabilistic forecasting with autoregressive recurrent networks,” International Journal of Forecasting, vol. 36, no. 3, pp. 1181–1191, 2020.
[5] S. Li, X. Jin, Y. Xuan, X. Zhou, W. Chen, Y.-X. Wang, and X. Yan, “Enhancing the locality and breaking the memory bottleneck of Transformer on time series forecasting,” in Proceedings of the Conference on Neural Information Processing Systems (NeurIPS), 2019.
[6] M. Bińkowski, G. Marti, and P. Donnat, “Autoregressive convolutional neural networks for asynchronous time series,” in Proceedings of the Time Series Workshop at International Conference on Machine Learning (ICML), 2017.
[7] I. Koprinska, D. Wu, and Z. Wang, “Convolutional neural networks for energy time series forecasting,” in Proceedings of the International Joint Conference on Neural Networks (IJCNN), 2018.
[8] Y. Lin, I. Koprinska, and M. Rana, “Temporal convolutional neural networks for solar power forecasting,” in Proceedings of the International Joint Conference on Neural Networks (IJCNN), 2020.
[9] S. Bai, J. Z. Kolter, and V. Koltun, “An empirical evaluation of generic convolutional and recurrent networks for sequence modeling,” arXiv preprint arXiv:1803.01271, 2018.
[10] Y. Chen, Y. Kang, Y. Chen, and Z. Wang, “Probabilistic forecasting with temporal convolutional neural network,” Neurocomputing, vol. 399, pp. 491–501, 2020.
[11] D1, “Sanyo dataset,” [Link] source/alice-springs/dka-m4-b-phase, 2020.
[12] D2, “Hanergy dataset,” [Link] springs/dka-m16-b-phase, 2020.
[13] D3, “Solar dataset,” [Link] 2014.
[14] Y. Lin, I. Koprinska, and M. Rana, “SpringNet: Transformer and Spring DTW for time series forecasting,” in Proceedings of the International Conference on Neural Information Processing (ICONIP), 2020.
[15] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “Wavenet: A generative model for raw audio,” arXiv preprint arXiv:1609.03499, 2016.
[16] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. J. Lang, “Phoneme recognition using time-delay neural networks,” IEEE Transactions on Acoustics, Speech, and Signal Processing, 1989.
[17] F. Yu and V. Koltun, “Multi-scale context aggregation by dilated convolutions,” arXiv preprint arXiv:1511.07122, 2015.
[18] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” arXiv preprint arXiv:1512.03385, 2015.
[19] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in Proceedings of the International Conference on Learning Representations (ICLR), 2016.
[20] A. Martins and R. Astudillo, “From softmax to sparsemax: A sparse model of attention and multi-label classification,” in Proceedings of The International Conference on Machine Learning (ICML), 2016.
[21] B. Peters, V. Niculae, and A. F. T. Martins, “Sparse sequence-to-sequence models,” in Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2019.
[22] G. M. Correia, V. Niculae, and A. F. T. Martins, “Adaptively sparse transformers,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing and the International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019.
[23] S. S. Rangapuram, M. W. Seeger, J. Gasthaus, L. Stella, Y. Wang, and T. Januschowski, “Deep state space models for time series forecasting,” in Proceedings of the Conference on Neural Information Processing Systems (NeurIPS), 2018.
[24] H.-F. Yu, N. Rao, and I. S. Dhillon, “Temporal regularized matrix factorization for high-dimensional time series prediction,” in Proceedings of the Conference on Neural Information Processing Systems (NeurIPS), 2016.