Frequency Modulated Transformer Self-Attention for Advanced
Infectious Disease Prediction
ASMITA MAHAJAN∗ and DURGA TOSHNIWAL∗ , Indian Institute of Technology Roorkee, India
Time series forecasting of infectious diseases is crucial to addressing significant global health challenges. The application
of Artificial Intelligence (AI), particularly deep learning (DL) algorithms, has demonstrated substantial success in sequence
modeling; however, their performance in epidemiological forecasting remains constrained by the non-stationary nature and
complex frequency dynamics of disease transmission. This research introduces a novel Frequency Modulated Transformer
(FMT) framework to address these challenges. The FMT framework decomposes time series data into distinct
frequency-modulated signals and uses self-attention mechanisms to capture their temporal frequencies. A transformer
encoder-decoder architecture then captures multi-scale temporal dependencies and generates the forecasts. The novelty of
the proposed approach lies in decomposing the input time series into Intrinsic Mode Functions (IMFs) and integrating the
frequency-specific components into a Transformer architecture via entropy-based feature selection. The FMT framework
significantly reduces Root Mean Square Error (RMSE) and Mean Absolute Error (MAE) by approximately 50% and 65%,
respectively, compared to conventional methods. Additionally, it achieves an 8% increase in the R2 score, demonstrating
enhanced predictive accuracy. The proposed methodology is evaluated using COVID-19 datasets from multiple countries
along with an influenza dataset, and benchmarked against statistical, machine learning, and state-of-the-art deep learning
baselines. The contribution of entropy-based IMF integration is systematically examined by comparing results with and
without this component, underscoring its importance in improving predictive accuracy. This work highlights substantial
improvements in predictive accuracy and computational efficiency, advancing epidemiological forecasting and supporting
real-time public health decision-making and AI-driven disease surveillance systems.
CCS Concepts: • Computing methodologies → Machine learning algorithms; Neural networks; • Applied computing
→ Health care information systems.
Additional Key Words and Phrases: Transformer, Deep Learning, Time Series, Sequence Modeling, Infectious Disease, Predictive
Analytics, Public Health
1 INTRODUCTION
In recent years, time series forecasting has become a crucial tool across various domains, including finance,
weather prediction, and particularly infectious disease (ID) forecasting [32]. Infectious diseases significantly
contribute to the global health burden, causing millions of deaths annually [25]. Understanding the factors driving
disease spread is essential for implementing effective control and prevention measures. Accurate forecasting is
essential for informing policymakers, healthcare professionals, and the public, especially regarding influenza
outbreaks and the more recent challenges posed by the COVID-19 pandemic [41].
Artificial intelligence (AI) in healthcare originated with expert systems that relied on rules derived from
expert interviews for data analysis, learning, and inference [24]. Machine-learning (ML) algorithms are adept
at analyzing complex datasets and identifying patterns, making them ideal for predicting infectious diseases
influenced by various factors [34]. Recent studies have shown promising results, but a significant challenge is the
availability of high-quality data. Surveillance systems collect diverse data, yet it is often incomplete, biased, or
noisy, affecting model performance [29]. To address the challenges, ML techniques like decision trees, random
forests, and support vector machines have been used to analyze intricate datasets, aiding in identifying patterns
essential for disease outbreak prediction [3]. Deep learning networks have further revolutionized this area by
capturing complex data relationships, leading to more precise and timely forecasts [20, 22].
∗ Both authors contributed equally to this research.
Authors’ Contact Information: Asmita Mahajan, a_mahajan@[Link]; Durga Toshniwal, [Link]@[Link], Indian Institute of
Technology Roorkee, Roorkee, Uttarakhand, India.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that
copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page.
Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy
otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from
permissions@[Link].
© 2025 Copyright held by the owner/author(s).
ACM 2157-6912/2025/9-ART
[Link]
ACM Trans. Intell. Syst. Technol.
The multifaceted nature of infectious disease transmission creates challenges in accurately capturing the
dynamics of disease spread through mathematical or computational models [54]. Real-world infectious disease
time series, such as COVID-19 or influenza outbreaks, are characterized by abrupt surges, seasonality, long-term
waves, and irregular noise induced by policy changes, underreporting, and behavioral shifts. Standard attention
mechanisms in Transformers are not inherently designed to disentangle such overlapping temporal dynamics,
often leading to poor generalization. This presents a critical knowledge gap: there is a lack of
transformer-based architectures that explicitly incorporate frequency decomposition to isolate and model disease dynamics
across different temporal patterns. Decomposing the time series signal into different frequencies, such as using
intrinsic mode functions (IMF), offers a valuable approach to address the challenges in modeling infectious disease
transmission. This method isolates various temporal patterns, allowing for a deeper understanding of disease
dynamics at different temporal scales [11]. IMF-based decomposition improves predictive models and targeted
interventions by identifying short-term variations and long-term patterns [9, 11].
A novel frequency-modulated transformer (FMT) framework is proposed in this study. The time series’ temporal
dynamics are first broken down into different IMFs representing frequency components. Subsequently, the IMFs
undergo preprocessing utilizing an entropy-based approach. The IMFs are analyzed and processed to extract
relevant features that contribute to the underlying dynamics of the ID time series. This preprocessing step aims to
enhance the quality and informativeness of the IMF signals before further analysis. In the proposed framework,
self-attention mechanisms are used to effectively recognize and incorporate the temporal frequencies inherent to
the input IMF signals. The transformer encoder-decoder architecture is leveraged to predict each IMF and forecast
infectious disease data. The encoder processes the preprocessed IMFs as input, extracting important temporal
features, while the decoder generates forecasts for each IMF. This proposed approach enables the model to make
forecasts based on the frequency-modulated characteristics of the time series, providing valuable insights for
disease prediction and control efforts.
The main technical contributions of this study are as follows:
(1) This research introduces FMT, a novel framework that decomposes ID time-series data into
frequency-modulated signals and leverages a self-attention Transformer architecture to capture multi-scale
temporal dependencies, providing a deeper understanding of disease dynamics.
(2) The paper proposes an entropy-based feature selection mechanism that filters out noisy or redundant
IMFs to retain only the most informative frequency components, improving the quality and relevance of
frequency-modulated signals.
(3) The paper further extends FMT by introducing the Frequency Modulated Respective Transformer (FMRT),
which treats each IMF as a separate sequence and processes them independently using distinct Transformer
models. This multi-transformer architecture improves interpretability and allows researchers to analyze
individual frequency components separately.
(4) The proposed FMT and FMRT models achieve superior predictive accuracy and efficiency, outperforming
state-of-the-art (SOTA) techniques such as LSTM, GRU, and vanilla Transformers. They reduce RMSE by
approximately 50% and MAE by approximately 65%, and improve the R2 score by 8% while lowering
computational cost by 12%. This balance of accuracy and efficiency makes FMT and FMRT robust
frameworks for real-time disease prediction and public health decision-making.
The rest of the paper is structured as follows: Section 2 reviews related work in statistical, machine learning,
deep learning, and transformer-based forecasting. Section 3 covers the theoretical foundations of frequency
decomposition and self-attention. Section 4 details the proposed FMT and FMRT architectures, including
preprocessing and model design. Section 5 presents comprehensive experimental results, comparing the models against
baseline techniques. Finally, Section 6 summarizes key findings and discusses potential extensions to broader
epidemiological applications.
2 RELATED WORK
This section reviews prior statistical, machine learning, deep learning, and transformer-based approaches for
time-series forecasting, identifying their limitations and motivating the need for the proposed method.
2.1 Statistical Methods
The study of time series analysis has evolved significantly over the years, beginning with [51], in which the authors
proposed a novel statistical method for investigating periodicities in distributed time series, focusing on Wolfer’s
sunspot numbers. This work laid the foundation for understanding and analyzing periodic patterns in temporal
data. The authors [7] introduced a systematic approach to forecasting and control in time series analysis. Their
contributions were further formalized in their book [8], which established the ARIMA model as a fundamental tool
for time series forecasting. Later, the authors [53] introduced a hybrid ARIMA-neural network model, integrating
linear and non-linear methods to improve forecasting accuracy, showcasing a significant progression toward
incorporating machine learning techniques into traditional statistical models. These traditional statistical models
rely on linearity and stationarity assumptions, making them ineffective for capturing the complexity, nonlinearity,
and variability of real-world time-series data while requiring extensive manual preprocessing.
2.2 Machine Learning-based Approaches
Machine learning (ML) techniques have been increasingly applied to time series forecasting in healthcare, where
they model temporal dependencies and uncover complex, non-linear patterns that often go undetected by traditional
statistical methods for predicting disease incidence and progression [6, 26, 42]. ML-based ensemble models, such
as Random Forest (RF) and Extreme Gradient Boosting (XGB), are particularly favored in healthcare research for
their interpretability and ability to model non-linear interactions through decision tree-based frameworks [31, 35].
RF utilizes bagging with random data subsets and feature sampling to build an ensemble of trees, while XGB
employs a gradient boosting framework for sequential tree addition [13, 16, 35]. Recent studies have explored
the integration of machine learning in healthcare [4, 19, 43], particularly for infectious and chronic disease
prediction. These models have shown success in infectious disease forecasting [13, 16], highlighting their utility
in addressing complex real-world forecasting challenges. ML-based approaches, while effective, face limitations
such as reliance on large, high-quality datasets, susceptibility to overfitting, and difficulties in interpretability,
which can hinder their practical application in healthcare settings.
2.3 Deep Learning-based Approaches
With the advancement of deep learning (DL), its ability to handle complex and high-dimensional data has made it
a powerful tool in infectious disease forecasting. As a subset of ML, DL techniques like Recurrent Neural Networks
(RNN) [28], Long Short-Term Memory (LSTM) [18], Bidirectional LSTM (BI-LSTM) [30], and Gated Recurrent
Unit (GRU) [14] have been widely used for predicting COVID-19 trends. Studies have successfully forecasted daily
deaths [52], confirmed and recovered cases [46, 50], and long-term trends [18, 30]. One effective model [1] combined
LSTM with CNN and Bayesian optimization, achieving high accuracy as measured by SMAPE, while another
[5] extended predictions up to 30 days, demonstrating the robustness of DL approaches. However, the sequential
nature of recurrent architectures hinders parallelization within each training sequence, while
memory limitations restrict batching across samples, particularly for long sequences.
2.4 Transformer Self-Attention Module
Transformer models, leveraging the self-attention mechanism, have demonstrated significant advancements
in time-series forecasting by enabling parallel computation and effectively modeling long-range dependencies.
TEST-Net enhances spatio-temporal infectious disease prediction by integrating the Transformer architecture
[12], while Oriented Transformer focuses on accurate case prediction in infectious diseases [45]. Ensemble and
Transformer models have also been explored for improved infectious disease prediction accuracy [2]. The authors
[10] proposed AD-Autoformer, which uses decomposition and attention distilling for long-sequence forecasting.
Informer introduced a more efficient Transformer design tailored for long time-series sequences [55], while
Generative Pretrained Hierarchical Transformer has been utilized for general time-series forecasting tasks [23].
Additionally, Transformer models have been applied to large-scale healthcare data, such as claims data, for
predicting severe COVID-19 progression [21], showcasing their versatility and effectiveness in diverse forecasting
challenges. Recent decomposition-based forecasting approaches have demonstrated significant improvements
in time series modeling. The authors [9] proposed M-EDEM, an ensemble method leveraging multiscale neural
networks and empirical decomposition to enhance prediction accuracy across complex sequences. The authors
[47] introduced a two-stage decomposition with an attention-based multi-input LSTM framework, effectively
capturing temporal dependencies in wind speed data.
Transformers rely on the quality of the input features and struggle to capture the multifaceted dynamics
arising from the interplay of short- and long-term patterns in infectious disease spread. To address this limitation,
this study proposes decomposing time-series signals into IMF features, which isolate frequency components and
uncover multi-scale temporal patterns. Integrating these IMFs into a Transformer module enhances predictive
accuracy by combining frequency decomposition with advanced attention mechanisms.
3 BACKGROUND
This section provides the theoretical foundations of frequency decomposition, entropy analysis, and transformer
architectures, which form the basis of the proposed FMT and FMRT frameworks.
3.1 Frequency Modulated Decomposition
Signal decomposition is a crucial step in time series analysis for identifying underlying patterns in complex,
noisy signals. Complementary Ensemble Empirical Mode Decomposition (CEEMD) [49] mitigates mode mixing
by leveraging positive and negative white noise pairs, effectively canceling noise during averaging. Complete
Ensemble Empirical Mode Decomposition with Adaptive Noise (CEEMDAN) [39] extends CEEMD by
incorporating a signal-to-noise ratio (SNR) parameter, which balances noise suppression and signal feature retention.
Additionally, CEEMDAN refines frequency modulation analysis, making it particularly effective in reconstructing
signals with minimal noise interference. This capability makes CEEMDAN an ideal technique for analyzing
non-linear and non-stationary signals, such as infectious disease incidence data, where dynamic trends and
anomalies are extracted precisely.
We generate $N$ noisy realizations $ID_i(t)$ of the original time series signal $ID(t)$ by adding white Gaussian
noise $\epsilon_i(t)$, where $\epsilon_i(t) \sim \mathcal{N}(0, \sigma^2)$ with zero mean and standard deviation $\sigma$. Here, the index $i$ corresponds to the
realization number, such that $i = 1, 2, \ldots, N$. Each noisy realization is decomposed using the standard Empirical
Mode Decomposition (EMD) algorithm to extract a series of IMFs represented as $c_{i,k}(t)$ along with a residue
component $r_i(t)$. We begin the iterative decomposition process with:
\[ r_i^{(0)}(t) = ID_i(t) \tag{1} \]
At each iteration $j$, the following steps are performed:
\begin{align*}
&\text{Identify upper and lower envelopes } U_i^{(j)}(t) \text{ and } L_i^{(j)}(t) \\
&\text{Compute mean envelope } m_i^{(j)}(t) = \frac{U_i^{(j)}(t) + L_i^{(j)}(t)}{2} \\
&\text{Obtain } j\text{th IMF } c_{i,j}(t) = r_i^{(j-1)}(t) - m_i^{(j)}(t) \\
&\text{Update residue } r_i^{(j)}(t) = r_i^{(j-1)}(t) - c_{i,j}(t) \tag{2}
\end{align*}
After processing all noisy realizations, the final CEEMDAN-based IMFs and the residual component are
computed using ensemble averaging:
\[ IMF_k(t) = \frac{1}{N} \sum_{i=1}^{N} c_{i,k}(t), \quad k = 1, 2, \ldots, K \tag{3} \]
\[ Residue(t) = \frac{1}{N} \sum_{i=1}^{N} r_i^{(J)}(t) \tag{4} \]
where 𝐾 represents the total number of IMFs obtained from the decomposition process across all realizations,
and 𝐽 is the total number of iterations required for convergence in EMD. To optimize 𝜎, CEEMDAN dynamically
adjusts it based on the SNR, ensuring an optimal balance between noise suppression and feature preservation.
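As a rough illustration of Eqs. (1) to (4), the following Python sketch adds Gaussian noise to a signal $N$ times, applies the per-iteration update of Eq. (2) to each realization, and averages the results. It is a minimal sketch under simplifying assumptions and not a full CEEMDAN implementation: the envelopes use linear interpolation between local extrema rather than the cubic splines typical of EMD, each IMF is extracted in a single pass without a sifting stopping criterion, the number of IMFs is fixed in advance, and $\sigma$ is not adapted to the SNR. The function names and default parameter values are illustrative and do not come from the paper.

```python
import numpy as np

def mean_envelope(r):
    """Mean of the upper and lower envelopes of r (linear interpolation between extrema)."""
    n = len(r)
    t = np.arange(n)
    interior = np.arange(1, n - 1)
    mid, left, right = r[1:-1], r[:-2], r[2:]
    maxima = interior[(mid > left) & (mid >= right)]
    minima = interior[(mid < left) & (mid <= right)]
    # Anchor both envelopes at the endpoints so interpolation spans the whole series.
    upper_idx = np.concatenate(([0], maxima, [n - 1]))
    lower_idx = np.concatenate(([0], minima, [n - 1]))
    upper = np.interp(t, upper_idx, r[upper_idx])
    lower = np.interp(t, lower_idx, r[lower_idx])
    return 0.5 * (upper + lower)

def decompose(x, n_imfs):
    """Apply the update of Eq. (2) n_imfs times, starting from r^(0) = x as in Eq. (1)."""
    r = np.asarray(x, dtype=float).copy()
    imfs = []
    for _ in range(n_imfs):
        m = mean_envelope(r)
        c = r - m          # j-th IMF
        r = r - c          # updated residue
        imfs.append(c)
    return np.array(imfs), r

def ensemble_decompose(x, n_realizations=20, sigma=0.1, n_imfs=3, seed=0):
    """Average IMFs and residues over noisy realizations, as in Eqs. (3) and (4)."""
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    imf_sum = np.zeros((n_imfs, len(x)))
    residue_sum = np.zeros(len(x))
    for _ in range(n_realizations):
        noisy = x + rng.normal(0.0, sigma, size=len(x))
        imfs, residue = decompose(noisy, n_imfs)
        imf_sum += imfs
        residue_sum += residue
    return imf_sum / n_realizations, residue_sum / n_realizations
```

Because every step is a subtraction, the averaged IMFs and residue sum back to the average of the noisy realizations, which approaches the original signal as the noise cancels across realizations.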
3.1.1 Clarification on K Selection and its Impact. The number of IMFs, 𝐾, is not fixed a priori but is determined
adaptively by CEEMDAN based on the intrinsic oscillatory behavior of the signal. The decomposition continues
until the residue contains no further oscillatory components, effectively separating high-frequency fluctuations
from long-term trends. The choice of 𝐾 has a direct influence on forecasting accuracy. A large 𝐾 may lead to
over-decomposition, increasing computational complexity and introducing redundant information. Conversely,
a small 𝐾 may merge critical frequency components, limiting the model’s ability to capture both short-term
variations and long-term trends.
To mitigate the effect of high-frequency noise, we apply entropy-based feature selection, ensuring that only
the most informative IMFs are retained. This step significantly improves the stability and predictive accuracy of
the proposed model, as shown in the experimental evaluations.
3.2 Entropy Analysis
Entropy, in general, is a measure of uncertainty or randomness. In the context of time series, entropy helps to
determine how unpredictable or irregular the series is. High entropy indicates a high degree of randomness, while
low entropy indicates more predictability and regularity [33]. Sample Entropy (SampEn) is a widely used metric
for assessing the complexity and predictability of a time series. It is an improved version of Approximate Entropy
(ApEn), addressing key limitations such as data length dependency and estimation bias [15]. SampEn provides
a more accurate and unbiased estimate of time series complexity, making it particularly suitable for analyzing
dynamic, real-world datasets. Let 𝑚 be the length of sequences to be compared (embedding dimension), 𝑟 be the
tolerance level (similarity criterion), typically a fraction of the standard deviation of the time series, and 𝑁 be
the total number of data points. For a given time series of daily infectious disease incidences $\{ID_1, ID_2, \ldots, ID_N\}$,
construct vectors of length $m$:
\[ u(i) = \{ID_i, ID_{i+1}, \ldots, ID_{i+m-1}\}, \quad i = 1, 2, \ldots, N - m + 1 \tag{5} \]
Count pairs of vectors $u(i)$ and $u(j)$, with $i \neq j$, that are similar within the tolerance $r$. Two vectors are considered similar if the
maximum distance between their corresponding elements is less than $r$. Calculate the probability $A^{m}(r)$ that any
two vectors of length $m$ are similar. Repeat the process for vectors of length $m + 1$ to calculate the probability
$A^{m+1}(r)$. Sample Entropy is then defined as:
\[ SampEn(m, r, N) = -\ln \frac{A^{m+1}(r)}{A^{m}(r)} \tag{6} \]
This is the negative natural logarithm of the ratio of the number of similar vector pairs of length $m + 1$ to
those of length $m$.
By applying SampEn to decomposed IMFs, we can effectively identify the most informative components of a
time series while filtering out redundant or noisy elements. This entropy-based selection process plays a crucial
role in feature extraction and model optimization, ensuring that only the most relevant frequency components
contribute to forecasting.
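The SampEn computation described above can be sketched in a few lines of Python. This is a minimal, unoptimized illustration and not the authors' implementation: it assumes the common defaults of embedding dimension 2 and a tolerance of 0.2 times the standard deviation, uses the maximum (Chebyshev) distance with a strict "less than" comparison, excludes self-matches, and follows the widespread convention of using the same number of template vectors for both lengths so that the two counts are comparable. The function name `sample_entropy` and the parameter `r_frac` are illustrative.

```python
import numpy as np

def sample_entropy(x, m=2, r_frac=0.2):
    """Sample Entropy of a 1-D series; r is r_frac times the standard deviation."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    r = r_frac * np.std(x)

    def count_similar_pairs(length):
        # Use the first n - m start positions for both template lengths.
        templates = np.array([x[i:i + length] for i in range(n - m)])
        # Chebyshev distance between every pair of templates.
        dist = np.max(np.abs(templates[:, None, :] - templates[None, :, :]), axis=2)
        # Subtract the diagonal (self-matches) and count each unordered pair once.
        return (np.sum(dist < r) - len(templates)) / 2

    b = count_similar_pairs(m)        # similar pairs of length m
    a = count_similar_pairs(m + 1)    # similar pairs of length m + 1
    return -np.log(a / b)

# A regular signal should score lower than white noise.
rng = np.random.default_rng(0)
regular = np.sin(2 * np.pi * np.arange(300) / 50)
irregular = rng.normal(size=300)
print(sample_entropy(regular), sample_entropy(irregular))
```

The pairwise distance matrix grows quadratically with the series length, so this form is only suitable for short series such as individual IMFs; if no pairs of length $m + 1$ match, the ratio is zero and the result is infinite.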
3.3 Transformer
The Transformer architecture [40] represents a major breakthrough in sequence modeling and time series
forecasting by eliminating the need for recurrence or convolution. Unlike traditional Recurrent Neural Networks
(RNNs) and Convolutional Neural Networks (CNNs), Transformers rely entirely on self-attention mechanisms to
model dependencies within a sequence, enabling parallelized computation and improved efficiency [36, 38]. In a
Transformer module, the self-attention mechanism assigns attention scores to elements in the input sequence,
determining their relative importance in forming the output representation. Within our proposed Frequency
Modulated Transformer (FMT) and Frequency Modulated Respective Transformer (FMRT) frameworks, the
Transformer module plays a central role in learning temporal dependencies across multiple intrinsic frequency
components of the time series, as shown in Fig. 3 (a) & (b). Positional encodings are added to the input embeddings
to retain positional information, which is inherently captured in RNNs but absent in self-attention, allowing the
model to distinguish sequential order. The encoder layer consists of multi-head self-attention and feed-forward
networks, while the decoder layer processes the encoder’s output and introduces additional masked self-attention
to facilitate autoregressive forecasting [17].
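The text does not state which positional encoding is used. As one common choice, the fixed sinusoidal encoding from the original Transformer design can be sketched as follows; treating it as the encoding used in FMT or FMRT is an assumption, and the function name is illustrative.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Fixed sinusoidal encoding: sine on even dimensions, cosine on odd dimensions."""
    positions = np.arange(seq_len)[:, None]          # shape (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]         # even dimension indices
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles[:, : d_model // 2])
    return pe

# Added element-wise to the embedded window before the first attention layer.
pe = sinusoidal_positional_encoding(seq_len=23, d_model=16)
```

Each position receives a distinct pattern, which lets the otherwise order-agnostic attention layers tell early time steps in a window from late ones.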
Given an input time series of daily infectious disease incidences $(id_1, id_2, \ldots, id_n)$, we obtain the corresponding
embedding matrix $X = (x_1, x_2, \ldots, x_n)$, where each $x_i \in \mathbb{R}^d$ denotes the $d$-dimensional embedding of the scalar
incidence value $id_i$. The self-attention mechanism computes the attention output as follows:
\[ \mathrm{Atten}(Q, K, V) = \mathrm{softmax}\left( \frac{QK^{\top}}{\sqrt{d_k}} \right) V \tag{7} \]
where Query $Q = XW_Q$, Key $K = XW_K$, and Value $V = XW_V$ are the linear projections of $X$ using the learned weight
matrices $W_Q$, $W_K$, and $W_V$, and $d_k$ is the dimensionality of the Keys ($K$). The term $QK^{\top}/\sqrt{d_k}$ computes the scaled similarity scores between
Queries and Keys, and softmax normalizes these scores across the Keys, ensuring a probability distribution
over the attention weights. The final output is a weighted sum of the Values $V$, where the attention weights define
the contribution of each input element.
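Eq. (7) can be written directly in NumPy. The following single-head sketch uses random matrices as stand-ins for the embedded window and the learned projection weights; the dimensions are arbitrary illustrative values.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention, following Eq. (7)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # scaled query-key similarities
    weights = softmax(scores, axis=-1)    # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d, d_k = 23, 8, 4                      # window length, embedding size, key size
X = rng.normal(size=(n, d))               # stand-in for the embedded window
W_q, W_k, W_v = (rng.normal(size=(d, d_k)) for _ in range(3))
output, weights = self_attention(X, W_q, W_k, W_v)
```

Row $i$ of `weights` shows how strongly time step $i$ attends to every other step in the window, which is the quantity the softmax turns into a probability distribution.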
3.4 Model Evaluation
For experimentation and evaluation, we have utilized Mean Absolute Error (MAE), Root Mean Square Error
(RMSE), and the R-squared (R2) score to compare model performances [37].
3.4.1 MAE. is a measure used to evaluate the accuracy of a predictive model [37] by calculating the average of
the absolute differences between predicted and actual values. It is defined as:
\[ MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i| \tag{8} \]
where 𝑛 is the number of data points, 𝑦𝑖 represents the actual value, 𝑦ˆ𝑖 represents the predicted value and both 𝑦𝑖
and 𝑦ˆ𝑖 are scalars corresponding to individual time-series observations.
3.4.2 RMSE. in regression analysis [37] is the square root of the average squared difference between predicted and actual
values, measuring the model’s fit quality.
\[ RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \tag{9} \]
3.4.3 R2. is a statistical metric indicating the proportion of variance in a dependent variable explained by one or
more independent variables within a regression model [37]. In forecasting, it quantifies the extent to which the
variability of the actual values is accounted for by the predicted values.
\[ R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \tag{10} \]
where $\bar{y}$ is the mean of the actual values.
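The three metrics in Eqs. (8) to (10) translate directly into code. The sketch below is a plain NumPy version with illustrative function names; it is provided for clarity and is not taken from the authors' codebase.

```python
import numpy as np

def mae(y, y_hat):
    """Mean Absolute Error, Eq. (8)."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return np.mean(np.abs(y - y_hat))

def rmse(y, y_hat):
    """Root Mean Square Error, Eq. (9)."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return np.sqrt(np.mean((y - y_hat) ** 2))

def r2_score(y, y_hat):
    """Coefficient of determination, Eq. (10)."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

actual = [1.0, 2.0, 3.0]
predicted = [1.0, 2.0, 4.0]
print(mae(actual, predicted), rmse(actual, predicted), r2_score(actual, predicted))
```

For this toy example a single error of 1 over three points gives an MAE of 1/3, an RMSE of the square root of 1/3, and an R2 of 0.5, since the residual sum of squares is 1 and the total sum of squares is 2.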
Algorithm 3.1 Frequency Modulated Transformer Framework: Data Preprocessing and IMF Selection
Require: Time series data TS, window size w, learning rate η, total epochs E
Ensure: Forecasted values Y
Step 1: Frequency Decomposition using CEEMDAN
    [IMFs, Residual] ← apply_CEEMDAN(TS)
Step 2: Entropy-Based IMF Filtering
    Processed_IMFs ← [ ]
    for each IMF_i ∈ IMFs do
        entropy ← compute_entropy(IMF_i)
        if entropy > threshold then
            Processed_IMFs.append(IMF_i)    ⊲ Retain informative IMFs
        end if
    end for
Step 3: IMF Integration and Normalization
    Grouped_IMFs ← integrate_IMFs(Processed_IMFs)    ⊲ Apply 2-3-3 IMF grouping
    Normalized_IMFs ← normalize(Grouped_IMFs)    ⊲ Ensures numerical stability
Fig. 1. Proposed Frequency Modulated Transformer Framework
4 PROPOSED METHODOLOGY
This section details the FMT and FMRT frameworks, outlining the dataset, preprocessing steps, feature selection,
model architecture, and implementation strategy.
4.1 Input Data Preparation
The first step in implementing the Transformer model involves preparing the time series data to ensure meaningful
input representations for the model, shown in Algorithm 3.1. The proposed framework achieves this through a
systematic data preprocessing pipeline that includes time series windowing, feature engineering using frequency
decomposition, and entropy-based IMF feature integration. The paper [48] investigated information leakage
issues in decomposition-based forecasting, emphasizing the importance of proper training-test separation to
prevent biased performance estimates. To preserve temporal integrity and prevent data leakage, the original
dataset is divided into train and test sets in strict chronological order, with 273 observations (20%) in the test set
and 1090 in the train set (80%). The train set is further divided to construct a validation set (10% of 1090) with 109
observations, used for tuning model hyperparameters, selecting the best architecture, and preventing overfitting
through early stopping.
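The chronological split described above can be sketched as follows. The total of 1363 observations is inferred from the stated 273 test and 1090 train observations. Taking the validation block from the end of the training period is an assumption, since the text gives only its size; the function name is illustrative.

```python
def chronological_split(series, test_frac=0.2, val_frac=0.1):
    """Split a series into train, validation, and test blocks without shuffling."""
    n = len(series)
    n_test = round(n * test_frac)            # most recent observations
    n_train_full = n - n_test
    n_val = round(n_train_full * val_frac)   # assumed: last part of the training period
    train = series[: n_train_full - n_val]
    val = series[n_train_full - n_val : n_train_full]
    test = series[n_train_full:]
    return train, val, test

series = list(range(1363))                   # placeholder for the daily incidence series
train, val, test = chronological_split(series)
print(len(train), len(val), len(test))
```

With 1363 observations this yields 273 test points and 109 validation points, matching the counts in the text, and leaves 981 points for fitting. Every validation point precedes every test point, so no future information reaches the model during training or tuning.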
Algorithm 3.2 Frequency Modulated Transformer Framework: Sequence Preparation and Transformer Modeling
Step 4: Sequence Preparation
    if model_type == FMT then
        X ← prepare_input_sequences(Normalized_IMFs, w)
    else if model_type == FMRT then
        for each IMF_i ∈ Normalized_IMFs do
            X_i ← prepare_input_sequences(IMF_i, w)
        end for
    end if
Step 5: Transformer Model
    if model_type == FMT then
        Add positional encoding to X
        for each transformer layer do
            X ← MultiHeadAttention(Q, K, V)
            X ← Add & Normalize(X)
            X ← FeedForward(X)
            X ← Add & Normalize(X)
        end for
        Y ← decoder(X)
    else if model_type == FMRT then
        for each X_i do
            Add positional encoding to X_i
            for each transformer layer do
                X_i ← MultiHeadAttention(Q, K, V)
                X_i ← Add & Normalize(X_i)
                X_i ← FeedForward(X_i)
                X_i ← Add & Normalize(X_i)
            end for
            Y_i ← decoder(X_i)
        end for
        Y ← merge_predictions([Y_1, Y_2, . . . , Y_n])
    end if
4.1.1 Time Series Windowing. The raw time series data, after splitting into train and test sets, is segmented into
fixed-length, overlapping windows with a designated window size of 23. Each window serves as an independent
input sample to the proposed Transformer model, enabling the model to capture temporal dependencies over a
fixed time horizon. The model can learn patterns and correlations essential for accurate forecasting by focusing
on a specific time range within each window.
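The windowing step of Section 4.1.1 can be sketched as a sliding window of size 23 with a stride of one. The pairing of each window with a one-step-ahead target is an assumption made for illustration, as the forecasting horizon is not stated in this section; `make_windows` and its parameters are hypothetical names.

```python
import numpy as np

def make_windows(series, window=23, horizon=1):
    """Build overlapping input windows and the value `horizon` steps after each window."""
    series = np.asarray(series, dtype=float)
    n_samples = len(series) - window - horizon + 1
    X = np.stack([series[i:i + window] for i in range(n_samples)])
    y = np.array([series[i + window + horizon - 1] for i in range(n_samples)])
    return X, y

series = np.arange(30)                 # toy series: 0, 1, ..., 29
X, y = make_windows(series, window=23)
print(X.shape, y.shape)
```

The same function accepts a two-dimensional array of shape (time, channels), in which case each sample has shape (23, channels); this matches the multichannel input formed from the grouped IMFs. Windows should be built separately within the train, validation, and test blocks so that no window straddles a split boundary.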
4.1.2 Feature Engineering. To address the challenges posed by the non-stationary and noisy nature of time
series data, the CEEMDAN technique is applied. CEEMDAN decomposes the time series into a set of IMFs, each
representing oscillatory components at different frequency levels (Algorithm 3.1). CEEMDAN decomposition
and entropy-based IMF selection are applied exclusively on the training set, and the IMFs retained based on
training-time entropy thresholds are then used to filter corresponding components in the validation and test sets.
This ensures that the model is trained and evaluated in a manner that strictly respects the temporal boundaries
of the data, avoiding any forward-looking bias. The improper handling of decomposition-based preprocessing
can lead to data leakage, as addressed in several prior studies [48]. The original dataset is broken down into
seven IMFs and one residual component, as shown in Fig. 2 (a). Each IMF represents a distinct frequency range,
helping to separate different temporal patterns in the data. High-frequency IMFs (IMF 0 - IMF 1) capture rapid
fluctuations, short-term variations, and noise, such as inconsistencies in daily reporting or sudden external shocks.
Medium-frequency IMFs (IMF 2 - IMF 4) contain periodic trends that reflect seasonality, event-driven changes, or
short-term interventions (e.g., policy adjustments). Finally, low-frequency IMFs (IMF 5 - IMF 6), along with the
residual, represent slow-moving trends that reflect the fundamental progression of the dataset over time. This
decomposition allows the model to focus on patterns at multiple temporal scales, comprehensively analyzing the
underlying data structure.
Algorithm 3.3 Frequency Modulated Transformer Framework: Training and Evaluation
Step 6: Model Training
    for epoch = 1 to E do
        L ← compute_loss(model, X)
        update_weights(model, L, η)
        Apply early stopping if no improvement
    end for
Step 7: Evaluation
    [RMSE, MAE, R2] ← compute_metrics(Y, actual_values)
    return Y
4.1.3 Entropy-Based IMF Feature Integration. Once the time series is decomposed into IMFs, entropy-based
feature selection is employed to identify and retain the most informative components. Sample entropy (SampEn)
is calculated for each IMF to quantify its complexity, with higher entropy values indicating more irregular and
potentially meaningful patterns. Based on the entropy analysis (Fig. 2 (b)), the selected IMFs are then grouped as
per their frequency range: The first two IMFs (IMF0 and IMF1) have the highest entropy values and are integrated
(Ori-Co-IMF0) to capture high-frequency components, isolating rapid fluctuations and noise while preserving
significant short-term variations. The next three IMFs (IMF2, IMF3, and IMF4) are combined (Ori-Co-IMF1) to
reflect medium-frequency components, providing a smoothed representation of intermediate-term trends and
cycles. The final three components (IMF5, IMF6, and IMF7, where IMF7 denotes the residual) are integrated (Ori-Co-IMF2) to capture low-frequency
components, representing long-term trends and the overall trajectory of COVID-19 incidence with minimal
short-term variability.
This structured 2-3-3 grouping strategy preserves multi-scale temporal patterns while reducing
feature dimensionality. This integration is performed by concatenating the selected IMFs along the temporal
dimension and then applying normalization to ensure scale uniformity across frequency bands. The resulting
three-channel input, capturing short-, mid-, and long-term dynamics, is used as the multivariate input to the FMT
or FMRT model. Normalization is applied after entropy-based IMF selection using the MinMaxScaler() function
(Fig. 2 (c)) to ensure that only the most informative components, those retained for training, are scaled. This
prevents amplitude disparities across IMFs from skewing the learning process, enhances gradient stability, and
ensures balanced attention across frequency bands. Applying normalization post-selection avoids unnecessary
computation on discarded IMFs and supports consistent model convergence.
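The 2-3-3 integration and the subsequent scaling can be sketched as follows. This is an illustration under two assumptions that the text does not confirm: each group is integrated by element-wise summation of its IMFs, and each resulting channel is scaled independently to [0, 1], which is the default range of MinMaxScaler(). The helper names group_imfs and min_max_scale are hypothetical.

```python
def group_imfs(imfs, sizes=(2, 3, 3)):
    """Integrate consecutive IMFs into len(sizes) channels.

    Assumption: a group is integrated by summing its IMFs element-wise.
    """
    if sum(sizes) != len(imfs):
        raise ValueError("group sizes must cover every IMF exactly once")
    groups, start = [], 0
    for size in sizes:
        block = imfs[start:start + size]
        groups.append([sum(values) for values in zip(*block)])
        start += size
    return groups


def min_max_scale(channel):
    """Scale one channel to [0, 1]; a constant channel maps to zeros."""
    lo, hi = min(channel), max(channel)
    if hi == lo:
        return [0.0 for _ in channel]
    return [(value - lo) / (hi - lo) for value in channel]


# Eight toy IMFs of length 4 (IMF k holds the values k, k + 1, k + 2, k + 3).
imfs = [[float(k + t) for t in range(4)] for k in range(8)]
integrated = group_imfs(imfs)                      # high-, mid-, low-frequency
channels = [min_max_scale(group) for group in integrated]
```

Scaling after integration means that only three channels are normalized instead of eight individual IMFs, in line with the post-selection normalization described above.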
This unique combination of time series windowing, frequency decomposition, and entropy-based IMF integration
effectively addresses challenges in forecasting dynamic, non-stationary datasets like healthcare time series.
Windowing captures temporal dependencies by segmenting data into manageable samples. CEEMDAN decomposition
separates meaningful signal components from noise, preserving multi-scale patterns. Entropy-based
feature selection refines IMFs by retaining only the most informative ones, reducing computational complexity
and enhancing predictive accuracy. This robust and novel pipeline optimally prepares data for the Transformer
model, ensuring better generalization, faster convergence, and improved interpretability in forecasting tasks.
ACM Trans. Intell. Syst. Technol.
Table 1. Key Differences: FMT and FMRT Framework
Aspect               FMT                         FMRT
Input Handling       Treats all IMFs together    Treats each IMF separately
Prediction Process   Single transformer model    Separate transformer for each IMF
Output Combination   Direct output               Merges individual IMF predictions
Complexity           Moderate                    Higher due to multiple models
Computational Cost   Lower                       Higher due to parallel processing
4.2 Frequency Modulated Transformer Module (FMT)
The proposed architecture follows a sequence-to-sequence modeling approach, where input sequences are
constructed from the frequency-modulated dataset, passed through a Transformer encoder-decoder network,
and used for forecasting future time steps. Once the dataset is frequency-modulated, it is organized into input
sequences, each consisting of a window of 23 consecutive time steps. Each time step in the sequence contains the
values of all three integrated IMFs at that specific time. This transforms the data into a multidimensional input
representation, where each time step is a vector containing frequency-based information instead of a single scalar
value. Once the input sequences are formed, they are passed through the encoder of the proposed FMT module
(Fig. 1 & Fig. 3 (a)), which consists of multi-head self-attention layers and feed-forward networks. The process
starts with positional encoding, which helps the model retain information about the order of time steps. The
encoded sequences then enter multi-head self-attention layers (Algorithm 3.2 & 3.3), where the model dynamically
learns which time steps are most relevant for forecasting future values. Following this, residual connections and
normalization layers stabilize training, while a feed-forward network (FFN) enhances feature transformation and
pattern extraction. The decoder of the FMT module is responsible for generating future predictions using the
encoded representations from the encoder. The decoder begins by receiving the previously generated predictions,
ensuring that it relies only on past information rather than future data. A fully connected feed-forward network
and a softmax activation layer then process the decoder’s output, generating the final predictions for future time steps.
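The construction of the input sequences can be sketched as below. It is a minimal illustration, assuming one-step-ahead forecasting and a separate, aligned target series; neither detail is stated explicitly in the text, and the helper name make_windows is hypothetical.

```python
def make_windows(channels, target, window=23):
    """Build (inputs, labels) pairs for one-step-ahead forecasting.

    channels: list of equal-length sequences, one per integrated IMF group.
    target:   sequence to forecast, aligned with the channels.
    Each input holds `window` rows of len(channels) values; its label is the
    target value immediately after the window (an assumed horizon of 1).
    """
    length = len(target)
    if any(len(channel) != length for channel in channels):
        raise ValueError("all channels must match the target length")
    inputs, labels = [], []
    for start in range(length - window):
        rows = [[channel[t] for channel in channels]
                for t in range(start, start + window)]
        inputs.append(rows)
        labels.append(target[start + window])
    return inputs, labels


# Toy data: 30 time steps, 3 channels (channel c holds 10 * c + t at step t).
steps = 30
channels = [[float(10 * c + t) for t in range(steps)] for c in range(3)]
target = [float(t) for t in range(steps)]
inputs, labels = make_windows(channels, target)
```

With T time steps and a window of 23, this scheme yields T - 23 samples, each of shape 23 x 3.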
4.3 Frequency Modulated Respective Transformer Module (FMRT)
The FMRT module is a modification of FMT designed to treat each IMF as an independent sequence
within the Transformer architecture. Rather than integrating IMFs into three frequency-based groups, as in FMT,
each IMF is windowed individually (Fig. 1) into sequences of a fixed number of consecutive time steps.
Consequently, instead of a single Transformer input sequence containing multiple IMF values per time step,
the proposed FMRT model trains separate Transformer layers for each IMF,
as shown in Fig. 3 (b), allowing the architecture to learn frequency-specific representations without mixing
different frequency scales. The encoder network in FMRT consists of six Transformer layers, each with
eight attention heads. Following the encoding process, each IMF-specific encoded representation
generates independent future predictions for the corresponding IMF. As depicted in Algorithm 3.2 & 3.3, the
proposed model architecture includes a decoder layer to process the encoded outputs and generate separate
forecasts for each IMF sequence. The individual IMF predictions are then passed through a dense layer with a
sigmoid activation function, which refines the aggregated forecast by adjusting the relative importance of each
IMF’s contribution. The key differences between the two frameworks are summarized in Table 1.
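The final merging step can be illustrated with a single sigmoid unit applied to the per-IMF forecasts of one time step. This is only a sketch of what a dense layer with sigmoid activation computes. The weights and bias below are placeholder values, not learned parameters, and the function name merge_imf_forecasts is hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def merge_imf_forecasts(imf_forecasts, weights, bias=0.0):
    """Combine per-IMF forecasts with one dense unit followed by a sigmoid.

    In a trained model the weights and bias would be learned; here they are
    supplied by the caller purely for illustration.
    """
    if len(imf_forecasts) != len(weights):
        raise ValueError("one weight is required per IMF forecast")
    z = sum(w * p for w, p in zip(weights, imf_forecasts)) + bias
    return sigmoid(z)


# Placeholder forecasts for eight IMFs at one time step, with equal weights.
imf_forecasts = [0.10, -0.05, 0.20, 0.15, 0.30, 0.25, 0.40, 0.35]
weights = [1.0] * 8
merged = merge_imf_forecasts(imf_forecasts, weights)
neutral = merge_imf_forecasts([0.0] * 8, weights)
```

Because the sigmoid output lies in (0, 1), it is compatible with targets that have been min-max normalized to the same range.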
Table 2. Statistical Summary of COVID-19 Incidence Data
Statistic                       Value
Mean                            33209.53
Standard Deviation              67190.77
Minimum                         0
Maximum                         414188
25th Percentile (Q1)            640.5
50th Percentile (Q2)            9216
75th Percentile (Q3)            35525
ADF Test
  Test Statistic                -3.528
  P-value                       0.007284
  Selected Lags                 23
  1% Critical Value             -3.435
  5% Critical Value             -2.863
  10% Critical Value            -2.567
Ljung-Box Test
  Autocorrelation (p-values)    All lags below 0.05
Jarque-Bera Test
  P-value                       0.00
  Skewness                      3.483
  Kurtosis                      15.739
The proposed models (FMT and FMRT) are trained using the Mean Squared Error (MSE) loss function for
regression-based time series forecasting. MSE penalizes larger errors more heavily, thereby encouraging the
model to prioritize precise predictions.
$$
\begin{aligned}
\mathcal{L}_{\text{FMT/FMRT}}(\theta) &= \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - \hat{y}_i(\theta)\bigr)^2 \\
\text{where}\quad \hat{y}_i(\theta) &= M_{\text{FMT/FMRT}}(\tilde{X}_i;\theta) \\
\text{and}\quad \tilde{X}_i &= \mathrm{Norm}\bigl(\mathrm{Group}\bigl(\mathrm{Filter}_{\text{entropy}}\bigl(\mathrm{CEEMDAN}(TS_i)\bigr)\bigr)\bigr)
\end{aligned}
\tag{11}
$$
with $\theta \in \Theta$ representing model parameters. Here, $\tilde{X}_i$ denotes the input vector at time step $i$,
composed of entropy-filtered, grouped, and normalized IMFs derived from CEEMDAN decomposition. Specifically,
$\mathrm{CEEMDAN}(TS_i)$ represents the CEEMDAN decomposition applied to the original time series segment $TS_i$;
$\mathrm{Filter}_{\text{entropy}}(\cdot)$ denotes the entropy-based selection of informative IMFs;
$\mathrm{Group}(\cdot)$ refers to the integration of selected IMFs into 2-3-3 frequency-based groups capturing high-, mid-,
and low-frequency dynamics; $\mathrm{Norm}(\cdot)$ applies normalization to ensure numerical stability and scale invariance
across IMFs; and $M_{\text{FMT/FMRT}}(\cdot)$ denotes the Transformer-based mapping function that predicts future values based
on $\tilde{X}_i$.
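As a small worked example of the loss in Eq. (11), the sketch below evaluates the MSE for two toy prediction vectors with the same total absolute error. The function name mse_loss and the numbers are illustrative only.

```python
def mse_loss(y_true, y_pred):
    """Mean squared error: the average of (y_i - y_hat_i)^2 over n samples."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


y_true = [1.0, 2.0, 3.0, 4.0]
# Four errors of size 1 (total absolute error 4).
spread_errors = mse_loss(y_true, [2.0, 3.0, 4.0, 5.0])
# One error of size 4 (total absolute error 4 as well).
single_large_error = mse_loss(y_true, [1.0, 2.0, 3.0, 8.0])
```

The single large error yields an MSE of 4.0 against 1.0 for the evenly spread errors, which is the sense in which MSE penalizes larger errors more heavily.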
Fig. 2. Frequency Modulation Results: (a) Original IMFs; (b) Entropy Analysis; (c) Integrated IMFs
Table 3. Patience and Epochs Configuration
Epochs   Patience   Early Stopping   MAE
100      10         50               1454.92
300      30         150              1437.86
1000     100        500              769.23
Table 4. Hyperparameter Configuration for FMT and FMRT Models
Hyperparameter                 Value
Learning Rate                  0.001
Batch Size                     64
Optimizer                      Adam
Loss Function                  MSE
Dropout Rate                   0.1
Embedding Dimension            256
Feedforward Layer Dimension    512
# of Transformer Layers        6
# of Attention Heads           8
Sequence Window Size           23
Fig. 3. Frequency-Aware Proposed Transformer Architectures: (a) FMT with grouped IMF integration; (b) FMRT with separate
Transformer pipelines per IMF
Fig. 4. COVID-19 Time Series Dataset (India): (a) COVID-19 Distribution; (b) ACF and PACF Plots
5 RESULTS AND DISCUSSIONS
This section presents an extensive comparative evaluation of FMT and FMRT against SOTA deep learning
baselines, demonstrating their superior predictive accuracy and computational efficiency.
5.1 Dataset Description
This research utilizes data from Our World in Data (OWID) spanning from January 3, 2020, to September 26,
2023 [27], focusing on daily COVID-19 statistics for India. The COVID-19 daily incidence time series
for India (Fig. 4 (a)) consists of 1,363 observations. The mean value of the series is 33,209.53,
with a standard deviation of 67,190.77, indicating significant variability in the data. The minimum recorded
value is 0, while the maximum value reaches 414,188. The daily incidence summary statistics in Table 2 show
that 25% of the values are below 640.5, the median (50th percentile) is 9,216, and 75% of the values fall below
35,525. These statistics reflect the substantial fluctuations in COVID-19 cases over the observed period, with
notable peaks and a wide range of values. The graph shows two prominent peaks occurring around mid-2021
and early 2022, indicating significant increases in incidence during these periods. Using the Statsmodels module,
this study examines the COVID-19 incidence data through statistical tests. The Augmented Dickey-Fuller (ADF) test
yields a p-value of 0.0073 and a test statistic of -3.528, which lies below the 1% critical value of -3.435 and
therefore rejects the unit-root null hypothesis. The test automatically selects a lag of 23, suggesting the use of
data from more than 23 days for input features. For autocorrelation, the Ljung-Box test indicates strong
autocorrelation, with all lags having p-values below 0.05. The Jarque-Bera test for normality shows a p-value of
0.00, skewness of 3.483, and kurtosis of 15.739, rejecting the null hypothesis of a normal distribution.
Additionally, the Statsmodels module computes the Autocorrelation Function (ACF) and the Partial Autocorrelation
Function (PACF), as shown in Fig. 4 (b). Although the ADF test rejects a unit root, the steady decline in the ACF
plot suggests a strong temporal dependency in the data, indicative of non-stationary behavior with potential
seasonal patterns or trends. The PACF shows significant spikes at the first lags and declines sharply after lag
one, indicating that the primary influence comes from immediate past values, with weaker correlations in
subsequent lags.
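To illustrate what a steadily declining ACF looks like numerically, the following dependency-free sketch computes the sample autocorrelation for two toy series. It is not the Statsmodels routine used in the study; the function name autocorrelation and the toy inputs are assumptions made for illustration.

```python
def autocorrelation(series, max_lag):
    """Sample ACF: r_k = sum_t (x_t - mean)(x_{t+k} - mean) / sum_t (x_t - mean)^2."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    denominator = sum(c * c for c in centered)
    return [
        sum(centered[t] * centered[t + lag] for t in range(n - lag)) / denominator
        for lag in range(max_lag + 1)
    ]


# A trending series: autocorrelation starts near 1 and decays slowly with lag.
acf_trend = autocorrelation([float(t) for t in range(50)], max_lag=5)
# An alternating series: neighbouring values are negatively correlated.
acf_alternating = autocorrelation([1.0, -1.0] * 25, max_lag=5)
```

A slowly decaying sequence such as acf_trend is the pattern associated above with strong temporal dependency.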
5.2 Experimental Setup
The experiments in this study are conducted using a computer setup with Anaconda3 Individual Edition, Anaconda
Navigator 2.0.4, and Jupyter Notebook 6.3.0, with Python 3.8.8 as the primary programming language and CUDA
11.2 for GPU acceleration. The hardware includes a multi-core CPU and a GPU compatible with CUDA 11.2.
Various Python libraries are utilized, including NumPy 1.19.5 for numerical computations, Pandas 1.2.4 for
data manipulation, Matplotlib 3.3.4 for visualization, Scikit-learn 0.24.1 for model evaluation, Statsmodels 0.12.2
for statistical analysis, and EMD-signal 1.0.0 for empirical mode decomposition (CEEMDAN). TensorFlow 2.5.0
and TensorFlow-GPU 2.5.0 are used for deep learning model development. The “Cost (sec)” metric reported in
Table 6 refers specifically to the total training time required for each model configuration until convergence. This
measurement is computed using a consistent hardware environment equipped with CUDA 11.2-enabled GPUs
and includes the full duration from model initialization to the final epoch (or early stopping). The training setup
is configured with different epoch-to-patience ratios shown in Table 3. The patience value is determined using
the formula:
$$
\text{patience} = \frac{\text{Epochs}}{10} \tag{12}
$$
Early stopping is applied when validation loss does not improve for a threshold of:
$$
\text{early\_stop} = 5 \times \text{patience} \tag{13}
$$
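Equations (12) and (13) can be checked against the three configurations in Table 3 with a short sketch. The function name early_stopping_config is hypothetical, and integer division is assumed because the epoch counts used in the study are multiples of ten.

```python
def early_stopping_config(epochs):
    """Return (patience, early_stop) following Eq. (12) and Eq. (13)."""
    patience = epochs // 10      # Eq. (12): patience = Epochs / 10
    early_stop = 5 * patience    # Eq. (13): early_stop = 5 * patience
    return patience, early_stop


# One entry per epoch setting listed in Table 3.
configs = {epochs: early_stopping_config(epochs) for epochs in (100, 300, 1000)}
```

The resulting pairs match the Patience and Early Stopping columns of Table 3 for 100, 300, and 1000 epochs.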
The final model training and all reported results are obtained using the configuration with 1000 epochs, as
specified in Table 3. This configuration gives the lowest MAE (769.23) on the dataset. The proposed Transformer
modules (FMT & FMRT) are implemented with key components such as positional encoding to capture sequence
order, multi-head attention to compute attention scores via query, key, and value matrices, and a feed-forward
network for introducing non-linearity. The Transformer hyperparameter configuration, given in Table
4, includes an embedding dimension of 256, eight attention heads, a feed-forward layer with 512 units, a dropout
rate of 0.1, and a learning rate of 0.001; the model is trained using the Adam optimizer with the mean squared error
(MSE) loss function. A batch size of 64 is used to balance memory usage and training efficiency. For time series
forecasting, the dataset is preprocessed by applying CEEMDAN to decompose the data into IMFs, followed by
entropy-based selection to retain the most informative IMFs and sequence preparation with a window size of
23. The Transformer encoder processes the data with multiple layers consisting of self-attention
and normalization operations. In the case of the proposed FMRT technique, individual IMF sequences are
handled separately, and their predictions are merged.
5.3 Performance Comparison with SOTA Models
To evaluate the effectiveness of Transformer-based architectures for time series forecasting, the Vanilla
Transformer is compared with state-of-the-art (SOTA) deep learning models, including Recurrent Neural Networks
(RNN) [28], Long Short-Term Memory (LSTM) [18], Bidirectional LSTM (Bi-LSTM), Stacked LSTM [30],
Encoder-Decoder LSTM (ED-LSTM) [44], and Gated Recurrent Unit (GRU) [14]. The RMSE and R2 score comparison plots,
shown in Fig. 5 (a) and (b), illustrate the predictive performance and explanatory power of these models. In the
RMSE comparison plot, GRU and Transformer demonstrate lower RMSE values, indicating that they
minimize forecasting errors more effectively than RNN and the LSTM variants. Likewise, the R2 score comparison
plot shows that LSTM, GRU, and Transformer achieve higher scores, signifying better explanatory power and
model fit. GRU outperforms the LSTM variants, which may be attributed to its simplified architecture with fewer
gating mechanisms, leading to faster convergence and reduced overfitting in time series forecasting. While the
Vanilla Transformer performs competitively against recurrent architectures, the results suggest that further
enhancements could improve its forecasting capabilities. The configuration of the baseline models used in the
study is shown in Table 5.
Fig. 5. Performance Comparison Across Baseline Models: (a) RMSE; (b) R2 Score
Table 5. Configuration of Baseline Models Used in the Study
Model                     Layer Type                                         Hidden Units   Activation    Optimizer   Epochs   Batch Size   𝜂        Dropout
RNN                       SimpleRNN                                          50             ReLU          Adam        50       32           0.0001   0.1
LSTM                      LSTM                                               50             ReLU          Adam        50       32           0.0001   0.1
Bi-LSTM                   Bidirectional LSTM                                 50             ReLU          Adam        50       32           0.0001   0.1
Stacked LSTM              2 LSTM Layers                                      50             ReLU          Adam        50       32           0.0001   0.1
Encoder-Decoder LSTM      LSTM                                               50             ReLU          Adam        50       32           0.0001   0.1
GRU                       GRU                                                50             ReLU          Adam        50       32           0.0001   0.1
Informer                  Encoder-Decoder Transformer                        256            GeLU & ReLU   Adam        50       32           0.0001   0.1
Autoformer                Encoder-Decoder Transformer with Auto-correlation  256            GeLU          Adam        50       32           0.0001   0.1
Time-Series Transformer   Transformer Encoder                                256            GeLU          Adam        50       32           0.0001   0.1
Note: 𝜂 denotes the learning rate. ReLU (Rectified Linear Unit) and GeLU (Gaussian Error Linear Unit) are activation functions.
GeLU tends to yield smoother activation compared to ReLU, especially beneficial in Transformer-based models.
Fig. 6. Vanilla and FM Transformer Forecasting Results: (a) Vanilla Transformer Loss Chart; (b) Vanilla Transformer
Forecasting; (c) FM Transformer Loss Chart; (d) FM Transformer Forecasting
5.4 Performance Evaluation: Accuracy vs. Computational Efficiency
A comprehensive analysis of the training and validation loss curves, along with the forecasting results on test
data, provides strong evidence for the superiority of the proposed FMT model over the Vanilla Transformer. As
depicted in Fig. 6, the FMT model achieves consistently lower loss values throughout training and validation,
demonstrating its improved learning efficiency. Additionally, the forecasting curves on test data confirm that FMT
produces more precise and stable predictions, effectively capturing both short-term fluctuations and long-term
trends.
Similarly, the training and validation loss curves for the FMRT model, illustrated in Fig. 7, highlight its
robust convergence behavior. The smooth decline in loss across training epochs indicates that FMRT effectively
minimizes prediction errors, ensuring strong generalization to unseen data. By processing each IMF separately,
FMRT provides granular insights into the underlying short-term and long-term dynamics of the time series. This
allows for a more detailed decomposition of complex patterns, particularly in applications like infectious disease
modeling, where both short-lived outbreaks and long-term epidemiological trends must be accurately forecasted.
The loss analysis confirms that both FMT and FMRT outperform the Vanilla Transformer. FMT offers a
computationally efficient solution, while FMRT provides highly detailed IMF-wise predictions for specialized
forecasting tasks.
A comparative evaluation of key performance metrics for the proposed FMT and FMRT modules, summarized
in Table 6 against baseline models including Transformer (Vanilla), ARIMA [8], XGBoost (XGB), and Random
Forest (RF) [31, 35], further supports these findings. FMT reduces RMSE and MAE by approximately 50% and
65%, respectively, for India, compared to the Vanilla Transformer, confirming its ability to effectively minimize
forecasting errors. Additionally, FMT raises the R2 score for India from 0.89 to 0.97, an absolute gain of 0.08,
highlighting its enhanced predictive accuracy and ability to explain variance in the data. Another key advantage
of FMT is its computational efficiency: it achieves a 12% reduction in training time (computational cost) for
India, although its training time is slightly higher than that of the Vanilla Transformer on the France, Brazil,
and Influenza datasets (Table 6). Similarly, FMRT delivers competitive performance improvements but with a
different trade-off. Compared to the Vanilla Transformer, FMRT achieves a 45% reduction in RMSE and a 60%
reduction in MAE for India, demonstrating its effectiveness in capturing IMF-specific temporal dependencies. The
FMRT model shows substantial improvements on Brazil’s dataset (R2 score of 0.62 vs. -0.98 for Vanilla), highlighting
its robustness in modeling volatile epidemic dynamics. For Influenza, FMRT also outperforms all models with the
lowest RMSE (40.04), lowest MAE (22.19), and highest R2 score (0.90), demonstrating its capability to model finer
temporal patterns through frequency-specific processing. However, its computational time is noticeably higher,
representing a trade-off between predictive accuracy and computational efficiency. Compared to traditional
statistical models (ARIMA) and machine learning baselines (XGB, RF), both FMT and FMRT deliver higher R2 scores
and lower RMSE across all datasets, although XGB and RF attain lower MAE on the India and France datasets.
FMRT remains highly valuable in analytical and retrospective settings, such as epidemiological studies and public
health policy evaluation, where model transparency and component-level insight matter more than latency. For
time-sensitive deployments, the FMT model is better suited, offering a favorable balance between predictive
accuracy and computational efficiency. This distinction allows the framework to be adapted based on specific
application requirements.
Fig. 7. FM Respective Transformer Forecasting Results
Table 6. Performance Metric Evaluation across Diverse Regions & Datasets
Country     Metric   Vanilla     FMT        FMRT        FMT(w/o)    FMRT(w/o)   ARIMA       XGB         RF
India       RMSE     849.27      422.59     467.91      9178.53     2974.25     2680.91     602.83      577.52
            MAE      769.23      272.58     307.25      8215.56     2867.25     1191.59     270.66      225.87
            R²       0.89        0.97       0.96        -11.68      -0.33       -0.11       0.94        0.95
            Cost     2823.30     2477.02    7627.24     2003.54     22314.02    37.30       0.60        0.64
France      RMSE     9185.41     5236.58    5650.87     6099.67     6365.23     49254.96    16663.83    16700.60
            MAE      4381.53     2806.54    4357.39     3915.63     3953.73     47834.72    2272.62     2130.36
            R²       0.60        0.87       0.85        0.82        0.81        -10.90      -0.36       -0.36
            Cost     4501.85     4973.80    15040.36    5398.97     37461.01    13.58       0.22        0.23
Brazil      RMSE     16252.04    9475.73    7143.37     11013.65    10220.99    23322.06    13162.42    12985.97
            MAE      7489.96     5615.65    4493.84     6503.37     7038.04     22012.94    7358.74     6824.24
            R²       -0.98       0.33       0.62        0.09        0.22        -2.61       -0.26       -0.22
            Cost     3955.44     4344.91    14352.50    3713.86     3309.52     21.93       0.32        0.67
Influenza   RMSE     48.21       40.69      40.04       47.54       45.78       126.08      51.95       49.94
            MAE      24.73       22.81      22.19       24.63       25.78       65.55       25.54       25.40
            R²       0.84        0.89       0.90        0.85        0.86        -0.04       0.82        0.84
            Cost     103.58      108.28     1056.87     287.55      2272.19     4.08        0.23        0.30
Note: FMT(w/o) and FMRT(w/o) refer to models trained without IMF integration. “Cost” reflects training time in seconds.
The best values per row are bolded, and the second-best is italicized. XGB denotes the XGBoost model, and RF denotes
Random Forest.
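The relative improvements quoted for India can be reproduced from the Table 6 entries with the short calculation below. The helper name percent_reduction is illustrative; the input values are copied from the India rows of Table 6.

```python
def percent_reduction(baseline, improved):
    """Relative reduction of `improved` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - improved) / baseline


# India rows of Table 6 (Cost is training time in seconds).
vanilla = {"RMSE": 849.27, "MAE": 769.23, "Cost": 2823.30}
fmt = {"RMSE": 422.59, "MAE": 272.58, "Cost": 2477.02}
fmrt = {"RMSE": 467.91, "MAE": 307.25, "Cost": 7627.24}

fmt_reduction = {k: percent_reduction(vanilla[k], fmt[k]) for k in vanilla}
fmrt_reduction = {k: percent_reduction(vanilla[k], fmrt[k]) for k in vanilla}
```

The FMT values come out close to the approximately 50%, 65%, and 12% stated in the text, and the FMRT values close to 45% and 60%. The negative cost reduction for FMRT reflects its longer training time.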
5.5 Impact of Entropy-based IMF Integration
A significant decline in forecasting accuracy is observed when models are trained without integrating IMFs
based on their entropy distribution. As illustrated in Fig. 9 (a)-(h), when each IMF is predicted separately without
integration, the resulting forecasts fail to align effectively with the actual time series data. The FMT (without)
and FMRT (without) models, which are trained without combining IMFs and are shown in Fig. 9 (i) and
(j), respectively, exhibit higher forecasting errors and scattered, misaligned predictions, further reinforcing the
importance of entropy-based IMF selection. However, not all IMFs contribute equally to the predictive power of the
model. High-entropy IMFs typically represent irregular, information-rich components, often reflecting dynamic,
short-term fluctuations relevant to abrupt epidemiological changes (e.g., outbreaks or policy shocks). In contrast,
low-entropy IMFs may capture smoother, long-term trends with limited immediate predictive utility. Merging
individual IMFs based on entropy distribution is a crucial step of the proposed approach. It ensures a balanced
representation of the time series, allowing the model to effectively capture both short-term fluctuations and
long-term dependencies. Without this step, the forecasting models struggle to differentiate between noisy
high-frequency IMFs and slow-moving low-frequency IMFs, leading to inconsistent predictions. Entropy-based
integration enhances the signal-to-noise ratio, reduces model overfitting, and improves generalization, thereby
ensuring that the Transformer focuses on the most informative temporal features. This is further evident in
Table 6, where models without entropy-based IMF integration (FMT (without) & FMRT (without)) exhibit higher RMSE
than their integrated counterparts on every dataset and, for India, very high RMSE and MAE values along with
negative R2 scores (-11.68 and -0.33).
Fig. 8. Performance Evaluation between Transformer variants across diverse Regions: (a) India; (b) France; (c) Brazil
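A negative R2 score, as reported in Table 6 for some configurations, means that the forecasts fit the data worse than a constant prediction at the mean. The sketch below shows this with toy numbers, assuming the standard definition R2 = 1 - SS_res / SS_tot; the function name r_squared is illustrative.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]
perfect = r_squared(y_true, [1.0, 2.0, 3.0, 4.0])        # exact forecasts
mean_only = r_squared(y_true, [2.5, 2.5, 2.5, 2.5])      # constant mean forecast
misaligned = r_squared(y_true, [4.0, 3.0, 2.0, 1.0])     # trend predicted backwards
```

Here the exact forecast scores 1.0, the mean-only forecast scores 0.0, and the misaligned forecast scores -3.0, so a negative value signals forecasts that are systematically out of line with the observations.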
5.6 Assessment of Model Efficacy Across Transformer Variants
From a computational efficiency perspective, the FMT model requires less training time than the Vanilla
Transformer on the India dataset (2477.02 s vs. 2823.30 s; Table 6). This improvement underscores FMT’s efficiency
in handling time series data, making it a practical choice for real-time forecasting applications. The forecasting
curves on test data, shown in Fig. 10, further validate FMT’s effectiveness. The FMT model closely tracks the
behavior of the actual data, demonstrating its superior forecasting capability and alignment with real-world
trends. Fig. 8 presents a comparative evaluation of the proposed FMT and FMRT models against state-of-the-art
transformer-based baselines, namely Informer, Autoformer, and the Time-Series Transformer [10, 55], across three
geographically diverse regions: India, France, and Brazil. The results indicate that both FMT and FMRT outperform
the baseline models in learning the temporal dynamics and reproducing real-world epidemiological patterns. FMT
achieves a favorable balance between predictive performance and computational efficiency, while FMRT provides
granular frequency-specific insights but at a higher computational cost. This trade-off makes FMT the preferred
choice for large-scale forecasting, whereas FMRT is better suited for applications requiring a detailed breakdown
of time series dynamics. This underscores the effectiveness of incorporating frequency-modulated decomposition and
entropy-based feature selection into transformer architectures for robust and interpretable infectious disease
forecasting.
Fig. 9. Transformer Forecasting Results without Entropy-based Integration: (a) IMF0; (b) IMF1; (c) IMF2; (d) IMF3;
(e) IMF4; (f) IMF5; (g) IMF6; (h) IMF7; (i) FMT (without); (j) FMRT (without)
Fig. 10. Model Efficacy Across Transformer Variants for India
6 CONCLUSION AND FUTURE WORK
This study presents the Frequency Modulated Transformer (FMT) framework, which advances infectious disease
forecasting by leveraging frequency-modulated signals and entropy-based preprocessing. By isolating
high-frequency noise, capturing intermediate trends, and delineating long-term trajectories, the FMT framework
improves predictive accuracy and provides a comprehensive understanding of the time series data’s behavior across
different temporal scales. The results show a notable reduction in prediction errors and an increase in the R2
score, highlighting the model’s ability to accurately capture both short-term fluctuations and long-term trends in
infectious disease data. On the India dataset, these gains are achieved at a lower training cost than the Vanilla
Transformer, which supports the framework’s suitability for real-time forecasting and public health
decision-making. The results validate the effectiveness of integrating frequency-aware feature engineering with
self-attention mechanisms to enhance transformer-based sequence modeling for complex, non-stationary time-series
data. For future work, the FMT framework can be extended to broader epidemiological applications, such as
multi-disease forecasting and cross-region predictive modeling, by incorporating multi-source epidemiological
data. This would involve adapting the CEEMDAN and entropy-based filtering to multivariate inputs and designing
spatiotemporal frequency-aware attention mechanisms. Further advancements could involve hybridizing FMT with
probabilistic generative transformers, such as Bayesian Transformers or Latent Diffusion-based Transformers, to
provide confidence intervals and scenario-based forecasting for improved uncertainty quantification and
long-horizon forecasting. This would be particularly beneficial in policy-sensitive domains such as epidemic
planning. While the same hyperparameter configuration was employed for both FMT and FMRT across all countries and
datasets to maintain consistency, future work could explore model- and dataset-specific hyperparameter
optimization. Bayesian optimization could be incorporated to tune these parameters efficiently, which may further
improve forecasting accuracy and computational efficiency.
ACKNOWLEDGMENTS
This research received no grant from public, commercial, or not-for-profit funding agencies.
REFERENCES
[1] Hossein Abbasimehr and Reza Paki. 2021. Prediction of COVID-19 confirmed cases combining deep learning methods and Bayesian
optimization. Chaos, Solitons & Fractals 142 (2021), 110511.
[2] Blessing Isoyiza Adeika, Joseph Aina, Temileye Ibirinde, Tijesunimi Adeyemi, Md Mahmudur Rahman, and Saroj Pramanik. 2023.
Ensemble and Transformer Models for Infectious Disease Prediction. In 2023 IEEE 23rd International Conference on Bioinformatics and
Bioengineering (BIBE). IEEE, 377–384.
[3] Sunday Adeola Ajagbe and Matthew O Adigun. 2024. Deep learning techniques for detection and prediction of pandemic diseases: a
systematic literature review. Multimedia Tools and Applications 83, 2 (2024), 5893–5927.
[4] Ahmad Z Al Meslamani, Isidro Sobrino, and José de la Fuente. 2024. Machine learning in infectious diseases: potential applications and
limitations. Annals of Medicine 56, 1 (2024), 2362869.
[5] Madini O Alassafi, Mutasem Jarrah, and Reem Alotaibi. 2022. Time series predicting of COVID-19 based on deep learning. Neurocomputing
468 (2022), 335–344.
[6] Gianluca Bontempi, Souhaib Ben Taieb, and Yann-Aël Le Borgne. 2013. Machine learning strategies for time series forecasting. Business
Intelligence: Second European Summer School, eBISS 2012, Brussels, Belgium, July 15-21, 2012, Tutorial Lectures 2 (2013), 62–77.
[7] George EP Box and Gwilym M Jenkins. 1968. Some recent advances in forecasting and control. Journal of the Royal Statistical Society.
Series C (Applied Statistics) 17, 2 (1968), 91–109.
[8] George EP Box, Gwilym M Jenkins, Gregory C Reinsel, and Greta M Ljung. 2015. Time series analysis: forecasting and control. John
Wiley & Sons.
[9] Xiangjun Cai and Dagang Li. 2024. M-EDEM: A MNN-based Empirical Decomposition Ensemble Method for improved time series
forecasting. Knowledge-Based Systems 283 (2024), 111157.
[10] Danyang Cao and Shuai Zhang. 2024. AD-autoformer: decomposition transformers with attention distilling for long sequence time-series
forecasting. The Journal of Supercomputing (2024), 1–21.
[11] Jian Cao, Zhi Li, and Jian Li. 2019. Financial time series forecasting model based on CEEMDAN and LSTM. Physica A: Statistical
mechanics and its applications 519 (2019), 127–139.
[12] Kai Chen, Yao Liu, Tianjiao Ji, Guanyu Yang, Yang Chen, Chunfeng Yang, and Yu Zheng. 2024. TEST-Net: transformer-enhanced
Spatio-temporal network for infectious disease prediction. Multimedia Systems 30, 6 (2024), 312.
[13] Hao-Yuan Cheng, Yu-Chun Wu, Min-Hau Lin, Yu-Lun Liu, Yue-Yang Tsai, Jo-Hua Wu, Ke-Han Pan, Chih-Jung Ke, Chiu-Mei Chen,
Ding-Ping Liu, et al. 2020. Applying machine learning models with an ensemble approach for accurate real-time influenza forecasting
in Taiwan: Development and validation study. Journal of medical Internet research 22, 8 (2020), e15394.
[14] Kyunghyun Cho, Bart Van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation:
Encoder-decoder approaches. arXiv preprint arXiv:1409.1259 (2014).
[15] Alfonso Delgado-Bonal and Alexander Marshak. 2019. Approximate entropy and sample entropy: A comprehensive tutorial. Entropy
21, 6 (2019), 541.
[16] Xinyu Fang, Wendong Liu, Jing Ai, Mike He, Ying Wu, Yingying Shi, Wenqi Shen, and Changjun Bao. 2020. Forecasting incidence of
infectious diarrhea using random forest in Jiangsu Province, China. BMC infectious diseases 20 (2020), 1–8.
[17] Shibo Feng, Chunyan Miao, Zhong Zhang, and Peilin Zhao. 2024. Latent diffusion transformer for probabilistic time series forecasting.
In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 11979–11987.
[18] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation 9, 8 (1997), 1735–1780.
[19] Rakibul Islam, Azrin Sultana, and Mohammad Rashedul Islam. 2024. A comprehensive review for chronic disease prediction using
machine learning algorithms. Journal of Electrical Systems and Information Technology 11, 1 (2024), 27.
[20] Sangwon Lee, Junho Hong, Ling Liu, and Wonik Choi. 2024. TS-Fastformer: Fast Transformer for Time-Series Forecasting. ACM
Transactions on Intelligent Systems and Technology 15, 2 (2024), 1–20.
[21] Manuel Lentzen, Thomas Linden, Sai Veeranki, Sumit Madan, Diether Kramer, Werner Leodolter, and Holger Fröhlich. 2023. A
transformer-based model trained on large scale claims data for prediction of severe COVID-19 disease progression. IEEE Journal of
Biomedical and Health Informatics 27, 9 (2023), 4548–4558.
[22] Qian Li, Hao Peng, Jianxin Li, Congying Xia, Renyu Yang, Lichao Sun, Philip S. Yu, and Lifang He. 2022. A Survey on Text Classification:
From Traditional to Deep Learning. ACM Trans. Intell. Syst. Technol. 13, 2 (2022).
[23] Zhiding Liu, Jiqian Yang, Mingyue Cheng, Yucong Luo, and Zhi Li. 2024. Generative pretrained hierarchical transformer for time series
forecasting. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 2003–2013.
[24] Asmita Mahajan, Nonita Sharma, Silvia Aparicio-Obregon, Hashem Alyami, Abdullah Alharbi, Divya Anand, Manish Sharma, and Nitin
Goyal. 2022. A Novel Stacking-Based Deterministic Ensemble Model for Infectious Disease Prediction. Mathematics 10, 10 (2022).
[25] Asmita Mahajan and Durga Toshniwal. 2023. A Novel Ensemble-Based Framework for Feature Identification and Classification of
COVID-19 Electronic Health Record Data. In 2023 IEEE International Conference on Big Data (BigData). IEEE, 3711–3720.
[26] Spyros Makridakis, Evangelos Spiliotis, and Vassilios Assimakopoulos. 2018. Statistical and Machine Learning forecasting methods:
Concerns and ways forward. PloS one 13, 3 (2018), e0194889.
[27] Edouard Mathieu, Hannah Ritchie, Lucas Rodés-Guirao, Cameron Appel, Charlie Giattino, Joe Hasell, Bobbie Macdonald, Saloni
Dattani, Diana Beltekian, Esteban Ortiz-Ospina, and Max Roser. 2020. Coronavirus Pandemic (COVID-19). Our World in Data (2020).
[Link]
[28] Larry Medsker and Lakhmi C Jain. 1999. Recurrent neural networks: design and applications. CRC press.
[29] Marta C Nunes, Edward Thommes, and Holger Fröhlich. 2024. Infectious Disease Modelling. (2024).
[30] Satya Prakash, Anand Singh Jalal, and Pooja Pathak. 2023. Forecasting COVID-19 Pandemic using Prophet, LSTM, hybrid GRU-LSTM,
CNN-LSTM, Bi-LSTM and Stacked-LSTM for India. In 2023 6th International Conference on Information Systems and Computer Networks
(ISCON). 1–6.
[31] Hooman H Rashidi, Nam K Tran, Elham Vali Betts, Lydia P Howell, and Ralph Green. 2019. Artificial intelligence and machine learning
in pathology: the present landscape of supervised methods. Academic pathology 6 (2019), 2374289519873088.
[32] Hafiz Tayyab Rauf, M Ikram Ullah Lali, Muhammad Attique Khan, Seifedine Kadry, Hanan Alolaiyan, Abdul Razaq, and Rizwana Irfan.
2023. Time series forecasting of COVID-19 transmission in Asia Pacific countries using deep neural networks. Personal and Ubiquitous
Computing (2023), 1–18.
[33] Joshua S. Richman, Douglas E. Lake, and J. Randall Moorman. 2004. Sample Entropy. In Numerical Computer Methods, Part E. Methods
in Enzymology, Vol. 384. Academic Press, 172–184.
[34] Omar Enzo Santangelo, Vito Gentile, Stefano Pizzo, Domiziana Giordano, and Fabrizio Cedrone. 2023. Machine learning and prediction
of infectious diseases: a systematic review. Machine Learning and Knowledge Extraction 5, 1 (2023), 175–198.
[35] Gregor Stiglic, Primoz Kocbek, Nino Fijacko, Marinka Zitnik, Katrien Verbert, and Leona Cilar. 2020. Interpretability of machine
learning-based prediction models in healthcare. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 10, 5 (2020),
e1379.
[36] Yan Tang, Yu Zhang, and Jiaxi Li. 2024. A time series driven model for early sepsis prediction based on transformer module. BMC
Medical Research Methodology 24, 1 (2024), 23.
[37] Nguyen Van Thieu. 2023. PerMetrics: A Framework of Performance Metrics for Machine Learning Models.
[38] Junlong Tong, Liping Xie, and Kanjian Zhang. 2023. Probabilistic decomposition transformer for time series forecasting. In Proceedings
of the 2023 SIAM International Conference on Data Mining (SDM). SIAM, 478–486.
[39] María E Torres, Marcelo A Colominas, Gaston Schlotthauer, and Patrick Flandrin. 2011. A complete ensemble empirical mode
decomposition with adaptive noise. In 2011 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE,
4144–4147.
[40] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017.
Attention is all you need. Advances in neural information processing systems 30 (2017).
[41] Liam Vaughan, Muyang Zhang, Haoran Gu, Joan B Rose, Colleen C Naughton, Gertjan Medema, Vajra Allan, Anne Roiko, Linda Blackall,
and Arash Zamyadi. 2023. An exploration of challenges associated with machine learning for time series forecasting of COVID-19
community spread using wastewater-based epidemiological data. Science of The Total Environment 858 (2023), 159748.
[42] Cyril Voyant, Gilles Notton, Soteris Kalogirou, Marie-Laure Nivet, Christophe Paoli, Fabrice Motte, and Alexis Fouilloy. 2017. Machine
learning methods for solar radiation forecasting: A review. Renewable energy 105 (2017), 569–582.
[43] Xinye Wang, Yi Yang, Yitian Xu, Qian Chen, Hongmei Wang, and Huafang Gao. 2020. Predicting hypoglycemic drugs of type 2 diabetes
based on weighted rank support vector machine. Knowledge-Based Systems 197 (2020), 105868.
[44] Zhumei Wang, Xing Su, and Zhiming Ding. 2021. Long-Term Traffic Prediction Based on LSTM Encoder-Decoder Architecture. IEEE
Transactions on Intelligent Transportation Systems 22, 10 (2021), 6561–6571.
[45] Zhijin Wang, Pesiong Zhang, Yaohui Huang, Guoqing Chao, Xijiong Xie, and Yonggang Fu. 2023. Oriented transformer for infectious
disease case prediction. Applied Intelligence 53, 24 (2023), 30097–30112.
[46] Dong Xue, Ming Wang, Fangzhou Liu, and Martin Buss. 2024. Time series modeling and forecasting of epidemic spreading processes
using deep transfer learning. Chaos, Solitons & Fractals 185 (2024), 115092.
[47] Dongchuan Yang, Mingzhu Li, Ju-e Guo, and Pei Du. 2024. An attention-based multi-input LSTM with sliding window-based two-stage
decomposition for wind speed forecasting. Applied Energy 375 (2024), 124057.
[48] Xinyi Yang, Jingyi Li, and Xuchu Jiang. 2024. Research on information leakage in time series prediction based on empirical mode
decomposition. Scientific Reports 14, 1 (2024), 28362.
[49] Jia-Rong Yeh, Jiann-Shing Shieh, and Norden E Huang. 2010. Complementary ensemble empirical mode decomposition: A novel noise
enhanced data analysis method. Advances in adaptive data analysis 2, 02 (2010), 135–156.
[50] Shuo Yu, Feng Xia, Shihao Li, Mingliang Hou, and Quan Z. Sheng. 2023. Spatio-temporal Graph Learning for Epidemic Prediction. ACM
Trans. Intell. Syst. Technol. 14, 2 (2023).
[51] George Udny Yule. 1927. VII. On a method of investigating periodicities in disturbed series, with special reference to Wolfer’s sunspot
numbers. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character
226, 636-646 (1927), 267–298.
[52] Abdelhafid Zeroual, Fouzi Harrou, Abdelkader Dairi, and Ying Sun. 2020. Deep learning methods for forecasting COVID-19 time-Series
data: A Comparative study. Chaos, solitons & fractals 140 (2020), 110121.
[53] G Peter Zhang. 2003. Time series forecasting using a hybrid ARIMA and neural network model. Neurocomputing 50 (2003), 159–175.
[54] Feite Zhou, Zhehao Huang, and Changhong Zhang. 2022. Carbon price forecasting based on CEEMDAN and LSTM. Applied energy 311
(2022), 118601.
[55] Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. 2021. Informer: Beyond efficient
transformer for long sequence time-series forecasting. In Proceedings of the AAAI conference on artificial intelligence, Vol. 35. 11106–11115.