ICCK Transactions on Machine Intelligence
RESEARCH ARTICLE
Emotion Detection from Speech Using CNN-BiLSTM
with Feature Rich Audio Inputs
Shreya Tiwari¹,*, Devansh Kumar¹, Akshit Mahajan¹ and Silky Sachar¹
¹ Amity School of Engineering and Technology, Amity University Punjab, Mohali 140306, India
Abstract
In the age of increasing machine-mediated communication, the ability to detect emotional nuances in speech has become a critical competency for intelligent systems. This paper presents a robust Speech Emotion Recognition (SER) framework that integrates a hybrid deep learning architecture with a real-time web-based inference interface. Utilizing the RAVDESS dataset, the proposed pipeline encompasses comprehensive preprocessing, data augmentation techniques, and feature extraction based on Mel-Frequency Cepstral Coefficients (MFCCs), Chroma features, and Mel-spectrograms. A comparative experiment was run against standard machine learning classifiers such as K-Nearest Neighbors (KNN), Support Vector Machine (SVM), Random Forest, and XGBoost. The experimental results indicate that the proposed CNN-BiLSTM-Conv1D model clearly outperforms the conventional models, with a state-of-the-art classification accuracy of 94%. The model was further evaluated using ROC-AUC curves and per-class performance metrics. It was subsequently deployed using a Flask-based web interface that enables users to upload voice inputs and receive real-time emotion predictions. This end-to-end system addresses the shortcomings of earlier SER approaches, such as limited temporal modeling and reduced generalization, and showcases practical applicability in domains like mental health monitoring, virtual assistants, and affective computing.

Keywords: speech emotion recognition, deep learning, CNN-BiLSTM, RAVDESS, MFCC, real-time prediction, human-computer interaction, audio processing, web deployment, affective computing.

1 Introduction
Speech Emotion Recognition (SER) has never before played such a pivotal role as it does in today’s fast-changing digital landscape. With the expansion of human interaction with artificial intelligence platforms (virtual assistants, customer support bots, and social robots), the necessity of emotionally intelligent AI cannot be denied. Most existing platforms are "emotionally dumb," reacting the same way whether a user is seething with rage, thrilled, or depressed. Having emotion sensing embedded in AI [1] makes interactions feel more organic, empathic, and human-like by allowing machines to adjust responses based on the emotional states of people.
In addition to human-computer interaction, SER finds significant application in the healthcare industry, especially in the monitoring of mental health [2].
Submitted: 25 June 2025
Accepted: 30 July 2025
Published: 14 September 2025
Citation
Tiwari, S., Kumar, D., Mahajan, A., & Sachar, S. (2025). Emotion Detection from Speech Using CNN-BiLSTM with Feature Rich Audio Inputs. ICCK Transactions on Machine Intelligence, 1(2), 80–89.
Vol. 1, No. 2, 2025.
10.62762/TMI.2025.306750
*Corresponding author: Shreya Tiwari, shreya.tiwari6@[Link]
© 2025 ICCK (Institute of Central Computation and Knowledge)
Speech emotions have the potential to be the first signs of depression, anxiety, or cognitive impairment. Furthermore, combining speech with other modalities, such as visual cues, can enhance emotion recognition accuracy in such applications [11]. Using SER, speech-based systems can non-intrusively monitor emotional trends over time. This assists medical professionals in early diagnosis and intervention, while also providing real-time emotional support for vulnerable groups such as the elderly.

In education and learning, where face-to-face social cues [3] are missing, SER can be used to evaluate learners’ attention by monitoring emotions like confusion, boredom, or excitement, allowing the educator to adapt the teaching methods accordingly. Likewise, sectors like entertainment, gaming, automotive safety, and call center management are starting to adopt emotion-aware implementations to enhance the user experience and improve personalization and customer satisfaction.

Ultimately, Speech Emotion Recognition fills a key gap: it enables machines to hear not just what we say but also the emotions behind it, leading towards increasingly intelligent, empathetic and effective technologies across a wide range of aspects of contemporary life. Therefore, our research’s primary objective is to design and develop an efficient and accessible system capable of identifying a speaker’s emotional state (such as happy, sad, angry, neutral, and others) using machine learning models. By doing so, we aim to bridge the emotional gap between humans and machines, facilitating more meaningful and adaptive interactions. In the current paper, we introduce a new framework for Speech Emotion Recognition which merges the advantages of convolutional neural networks and bidirectional recurrent networks with a lightweight attention mechanism. Unlike previous models, which consider either spatial or temporal characteristics alone, we merge both in order to extract more emotional information from the speech. We also propose an improved preprocessing pipeline that combines data augmentation and feature extraction techniques to enhance performance in speaker-independent settings. We balance accuracy against computational efficiency by optimizing the architecture and parameters of the model, which makes our solution applicable in real-life settings with limited resources. The key contributions of this paper are as follows:
• We design a hybrid deep learning architecture that combines Convolutional Neural Networks (CNN) for spatial feature extraction with Bidirectional Long Short-Term Memory (BiLSTM) layers to capture temporal dependencies in speech signals.
• We implement a lightweight attention mechanism to enhance focus on emotionally salient parts of the spectrogram without significantly increasing computational overhead.
• We employ extensive data augmentation techniques to improve model generalization across speaker-independent scenarios.
• We conduct comprehensive experiments and comparative evaluations to demonstrate the effectiveness and efficiency of the proposed model in contrast with existing approaches.
The remainder of this paper is organized as follows: Section 2 presents a review of related work in the domain of speech emotion recognition. Section 3 describes the dataset. Section 4 covers the data preparation phase and the methodologies used, including preprocessing, data augmentation, and model design. It also explains the experimental setup, evaluation metrics, model results, and interpretation of the findings. Section 5 presents the comparative study and novelty justification. Finally, Section 6 concludes the study and outlines directions for future research. Section 7 presents the declaration, stating that the authors have no competing interests.

2 Literature Review
SER is an important part of affective computing since it helps machines understand human emotions by listening to what is spoken. Here, we summarize existing research and discuss its problems, before describing how our method addresses them. Various approaches have been explored. Ververidis et al. [4] utilized hand-crafted features such as MFCCs, pitch, and energy with classifiers like Gaussian Mixture Models (GMMs) and k-Nearest Neighbors (k-NN), which, while foundational, struggled to capture the nature of speech and showed poor generalizability. Eyben et al. [5] applied Support Vector Machines (SVMs) with MFCCs and prosodic features. Although effective in speaker-dependent tasks, the model performed poorly in speaker-independent conditions. In contrast, we enhance generalizability across speakers through extensive data augmentation and robust feature extraction.
Zhao et al. [6] proposed a CNN-based architecture using spectrogram images of audio signals. CNNs effectively extract spatial features but are inadequate at modeling temporal dependencies. In contrast, we incorporate BiLSTM layers after the CNN to model both spatial and temporal features.

Zhang et al. [7] introduced attention mechanisms into CNNs to focus on emotionally salient regions in the spectrogram. However, the models were computationally intensive. In contrast, we use lightweight attention mechanisms and perform hyperparameter optimization for efficient computation.

Barhoumi et al. [18] proposed an SER system that combines deep learning with traditional augmentation and feature extraction over several datasets. Our work differs in the model design, the depth of the augmentation design, and the real-world usability of the models. We introduce a new hybrid CNN-BiLSTM-Conv1D model with attention layers and a variety of features in the pipeline, accompanied by a deployed Flask web-based interface to infer emotions in real time. This approach provides better performance scores, can be scaled, and can be generalized through targeted multi-dataset expansion.

Askari et al. [19] proposed a hybrid R-CNN-BLSTM model that combines autoencoder-based denoising and self-attention for recognizing emotion on CREMA-D using a single crop. By contrast, our scheme prioritizes real-time applicability, with a computationally reasonable CNN-BiLSTM-Conv1D framework supplemented with a simple attention mechanism. Although we share the vision of a hybrid architecture, our work is distinguished by its focus on deployment, a broad array of augmentation mechanisms, and general evaluation measures, which makes it particularly relevant to applied environments.

3 Dataset Description
For the purposes of this study, we used the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) dataset [8]. RAVDESS is a scientifically validated dataset that supports the study of how we identify different emotions through speech and song. The dataset is made up of 1440 speech files, with each of 24 professional actors (12 men and 12 women) voicing two similar statements in a neutral North American accent under eight distinct emotions: neutral, calm, happy, sad, angry, fearful, disgusted and surprised. Every emotional expression was captured at both strong and normal levels of intensity, except for the neutral expression.

We received all the audio files in .wav format at 48 kHz with good clarity. Labeling files consistently across all speakers and emotions makes preprocessing and emotion extraction much easier. It is vital for training strong models that the dataset contains emotions in balance.

To ensure reliability and preserve controlled experimental conditions, only the RAVDESS dataset was used in this experiment. RAVDESS offers well-balanced and high-quality audio recordings with clearly marked emotion labels, which makes it suitable for assessing the baseline performance of speech emotion recognition systems. The uniform recording conditions and well-organized labeling allow the model behavior to be analyzed in a focused way, without unreliable external noise and demographic variance. Still, we recognize the constraints of using a single dataset. Additional benchmark datasets such as IEMOCAP [12], CREMA-D, and the FAU Aibo Emotion Corpus [13] will be used in the future to verify this model and enhance its robustness and generalizability. These datasets also involve more diverse speakers, spontaneous speech, and a broader range of acoustic conditions, which will make it possible to assess the model's performance in real-life situations more thoroughly.

4 Methodology
The methodology adopted in this research includes a comprehensive and structured pipeline covering data preprocessing, augmentation, feature extraction, modeling, and evaluation. A lightweight attention mechanism was added after the BiLSTM layer to improve the model's temporal sensitivity. This mechanism applies attention weights to each time step of the BiLSTM output sequence, enabling the model to give more attention to the frames that are of emotional significance in the speech signal. The inclusion of this mechanism contributes to improved classification performance by enabling the model to selectively emphasize features that are more relevant to emotion recognition, while maintaining computational efficiency suitable for real-time deployment.
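The exact form of the attention layer is not spelled out in the text, so the following is only a minimal sketch of one common formulation, written in plain Python for readability: each BiLSTM time step receives a relevance score, the scores are normalized with a softmax, and the step outputs are averaged using those weights. The function name `attention_pool` and the parameters `w` and `b` are hypothetical; they stand in for weights that would be learned during training.

```python
import math

def attention_pool(h, w, b=0.0):
    """Pool a sequence of per-frame vectors into one context vector.

    h: list of T frames, each a list of D floats (e.g. BiLSTM outputs).
    w: list of D floats, a hypothetical learned scoring vector.
    b: scalar bias (hypothetical).
    Returns (context, alpha): a D-dim weighted average and T weights.
    """
    # One relevance score per time step.
    scores = [sum(math.tanh(x) * wi for x, wi in zip(frame, w)) + b
              for frame in h]
    # Softmax over time, shifted by the max for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]
    # Attention-weighted average of the frames.
    dim = len(h[0])
    context = [sum(a * frame[d] for a, frame in zip(alpha, h))
               for d in range(dim)]
    return context, alpha

# Toy usage: three frames with two features each (made-up numbers).
frames = [[0.1, 0.2], [0.9, 0.8], [0.0, -0.1]]
context, alpha = attention_pool(frames, w=[1.0, 1.0])
```

In this toy input the second frame gets the largest weight because it has the largest score; with an all-zero `w` the weights become uniform and the result reduces to a plain average over time, which is the behavior the attention layer is meant to improve on.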
Figure 1. Schematic diagram for speech emotion recognition.
4.1 Data Preparation Phase
4.1.1 Preprocessing and Labeling
Audio files were initially converted to mono channel format and standardized for consistency. Label encoding [9] was used to transform categorical emotion labels into numerical values. A stratified splitting strategy was applied to ensure proportional class distribution across the training (70%), testing (20%), and validation (10% of training) sets.

4.1.2 Data Augmentation
To mitigate overfitting and increase data diversity, several augmentation techniques were utilized, as shown in Table 1: pitch shifting [10], background noise addition, time stretching, audio rolling, and time/frequency masking. These augmentations helped the model generalize better by simulating varied acoustic environments.

Table 1. Summary of data augmentation techniques.
Technique | Description
Pitch Shifting | Modifying pitch while retaining tempo
Background Noise | Injecting noise to simulate real-world conditions
Time Stretching | Speeding up or slowing down the audio
Audio Rolling | Shifting audio content circularly
Time/Frequency Masking | Randomly masking parts of time/frequency domain

4.1.3 Feature Extraction
Speech signal features were extracted using Mel-Frequency Cepstral Coefficients (MFCCs) with 40 coefficients, Chroma features, Mel-spectrograms, and normalization techniques. These features effectively captured the phonetic and tonal nuances essential for emotion recognition.

To improve clarity, a schematic block diagram (Figure 1) illustrates the complete pipeline of the proposed Speech Emotion Recognition (SER) system. The diagram shows the main phases of the pipeline: raw audio input, preprocessing, data augmentation, feature extraction (MFCC, Chroma, Mel-spectrograms), the model structure (CNN-BiLSTM-Conv1D) and, optionally, deployment through a real-time Flask interface. In addition, a lightweight attention mechanism was applied after the BiLSTM to sharpen temporal focus by giving greater weight to frames that are more emotionally important. This makes the model responsive to subtle differences in speech patterns that are pertinent to emotion classification, which benefits both performance and interpretability.

4.2 Modeling
4.2.1 Traditional Machine Learning Models (Benchmarking Phase)
Initially, traditional models were evaluated to establish baseline performance. Each model was trained using the extracted features (MFCCs, Chroma, Mel-spectrograms), and the results are summarized below:
• K-Nearest Neighbors (KNN): An algorithm that assigns a data point to the class to which the majority of its k nearest neighbors in the feature space belong.
It classified emotions by calculating distances in feature space. However, the model struggled with higher-dimensional features.
• Support Vector Machine (SVM): A model that identifies the hyperplane separating two classes with the maximum margin. It used hyperplanes for class separation but underperformed due to the non-linear nature of emotional speech boundaries. While SVM handles linear or slightly non-linear data well, SER involves highly non-linear emotional boundaries in time sequences, which reduced its performance.
• Random Forest (RF): An ensemble method that improves generalization performance and reduces overfitting by combining several decision trees. It leveraged an ensemble of decision trees to model non-linear relationships. Despite improved performance, it lacked the ability to model sequential patterns.
• Multilayer Perceptron (MLP): A fully connected neural network with one or more layers of "hidden" neurons between the input and output layers, trained to learn non-linear mappings from input to output. It attempted to learn non-linear feature relationships and to explore whether deep feature transformations could help learn emotional patterns in the extracted features.
• XGBoost: A high-performance gradient boosting algorithm [14] known for modeling complex feature interactions efficiently. It was the highest-performing traditional model, owing to its gradient-boosting capabilities. It was used to test whether boosting-based ensemble learning could improve classification accuracy on extracted features.
• Stacked Decision Trees (SDT): An ensemble of decision trees [15] stacked in layers. This was another ensemble technique, used to see whether deeper decision structures improved accuracy.

While these models served as useful benchmarks, their fundamental limitation was the lack of temporal modeling capabilities. Speech Emotion Recognition (SER) relies heavily on time-sequenced variations in tone, pitch, and rhythm, elements that traditional methods fail to exploit.

Table 2. Summary of models and key characteristics.
Model Type | Model Name | Key Characteristics
Traditional ML | KNN | Distance-based classification; effective on low-dimensional data; baseline model.
Traditional ML | SVM | Uses hyperplanes for class separation; struggles with non-linear emotion boundaries.
Traditional ML | Random Forest | Ensemble of decision trees; handles non-linearity; lacks sequential modeling.
Traditional ML | MLP | Fully connected layers; learns nonlinear patterns; fails to capture temporal context.
Traditional ML | XGBoost | Gradient boosting; captures complex feature interactions; best traditional model.
Traditional ML | SDT | Stacked decision trees; poor performance on time-dependent data.
Deep Learning | CNN-BiLSTM | Learns spatial and sequential features; combines CNN and BiLSTM for SER.
Deep Learning | Conv1D | 1D convolutions for fast sequential modeling; supports primary architecture.

4.2.2 Deep Learning-Based Hybrid Architecture
To address the limitations of the conventional models, a bespoke deep learning architecture was designed, combining Convolutional Neural Network (CNN) layers, Conv1D [16] layers, and Bidirectional Long Short-Term Memory (BiLSTM) layers. The model pipeline includes:
• CNN Layers: Extract spatial features from input spectrograms, identifying local emotion-related frequency patterns.
• BiLSTM Layers: Capture long-term temporal dependencies in speech, improving
context-awareness in emotion detection.
• Conv1D Layers: Handle sequential 1D data efficiently and reinforce temporal learning.
• Dense Layers + Batch Normalization + Dropout: Enhance generalization while reducing overfitting.
• SoftMax Output Layer: Performs multi-class emotion classification.
This hybrid approach significantly outperformed traditional models due to its ability to learn both spatial and temporal representations, as shown in Table 2. It forms the backbone of the final system integrated in the later phase.

4.3 Training and Evaluation
The model was trained with the Adam optimizer, the categorical cross-entropy loss, accuracy as the tracked metric, and learning rate scheduling. Early stopping and model checkpointing techniques were employed to prevent overfitting.
The evaluation was carried out using several performance metrics:
• Accuracy, Precision, Recall, and F1-score
• Confusion Matrix and ROC-AUC Curves

Figure 2. Confusion matrix illustrating classification accuracy per emotion class using the proposed deep learning model.

Figure 2 displays the confusion matrix for the proposed CNN-BiLSTM model. The model demonstrates high accuracy for emotions such as happy, calm, and neutral, while some confusion persists between fear and surprise, which is common due to overlapping acoustic features. This visualization provides detailed insight into misclassification trends and highlights areas for future improvement.

Figure 3. Model performance (F1-score) across individual emotions.

As depicted in Figure 3, the F1-scores vary across emotion classes. Emotions like happy and neutral achieved F1-scores above 85%, while emotions such as fear and disgust scored relatively lower, indicating challenges in distinguishing subtle or less frequent emotional cues. This bar chart highlights the effectiveness and class-wise limitations of the SER system.

Figure 4. Training and validation accuracy and loss curves showing learning progression.

The accuracy curve (see Figure 4) illustrates the model’s learning behavior over epochs. A consistent rise in validation accuracy indicates stable generalization without overfitting. The loss curve complements this by showing a steady decline in both training and validation loss, validating the model’s convergence and robustness.

To evaluate the robustness and generalizability of the proposed CNN-BiLSTM-Conv1D architecture, a stratified 5-fold cross-validation was conducted on the RAVDESS dataset. The model achieved a mean classification accuracy of 76.85% with a standard deviation of ±10.92%, and a corresponding 95% confidence interval of [61.69%, 92.01%]. The weighted F1-score was 76.37% ± 10.99%, with a confidence interval of [61.12%, 91.62%], indicating overall balanced performance across emotion classes. These results confirm that the model significantly outperforms baseline random classification (12.5% for 8 classes), validating its ability to learn discriminative emotional patterns from speech.
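The text does not say how the confidence intervals were computed. The reported numbers are, however, consistent with a Student-t interval over the five fold scores, under the assumption that the quoted standard deviation is the population form (dividing by n) while the interval uses the sample form (dividing by n - 1). The short sketch below checks that arithmetic. The helper name `t_interval` is hypothetical, and 2.776 is the two-sided 95% critical value of the t-distribution with 4 degrees of freedom.

```python
import math

T_CRIT_95_DF4 = 2.776  # two-sided 95% Student-t critical value, 4 d.o.f.

def t_interval(mean, std_pop, n=5, t_crit=T_CRIT_95_DF4):
    """95% t-interval for a mean of n fold scores.

    std_pop is assumed to be the population standard deviation
    (ddof=0); it is rescaled to the sample form before use.
    """
    s = std_pop * math.sqrt(n / (n - 1))   # sample standard deviation
    half_width = t_crit * s / math.sqrt(n)
    return mean - half_width, mean + half_width

acc_lo, acc_hi = t_interval(76.85, 10.92)  # rounds to (61.69, 92.01)
f1_lo, f1_hi = t_interval(76.37, 10.99)    # rounds to (61.12, 91.62)
```

With only five folds the critical value is large (2.776 rather than the 1.96 of a normal approximation), which is one reason the intervals span roughly 30 percentage points.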
However, the relatively high standard deviation and wide confidence intervals suggest variability in performance across different data splits, which may be attributed to class imbalance, limited training data, or sensitivity to speaker-dependent features. Despite this, the consistent performance between accuracy and F1-score indicates that the model maintains a fair balance between precision and recall. Future improvements could include training on additional datasets (e.g., IEMOCAP, CREMA-D), enhanced data augmentation, or architectural tuning to further stabilize performance and enhance generalization.

As shown in Table 3, the cross-validation results reveal the model’s performance metrics, including accuracy and weighted F1-score, along with their variability across folds.

Table 3. Cross-validation results with variability metrics.
Metric | Mean (%) | Std. Dev. (±) | 95% CI (%)
Accuracy | 76.85 | 10.92 | [61.69, 92.01]
F1-Score (Weighted) | 76.37 | 10.99 | [61.12, 91.62]

4.4 ROC Analysis
To evaluate performance of the model across multiple emotion classes, a multi-class ROC curve was plotted using a one-vs-rest strategy. Figure 5 illustrates the ROC curves for each of the eight emotion classes along with the micro-average and macro-average curves.

Figure 5. Multi-class ROC Curve for the proposed deep learning model. ROC curves are plotted for each of the 8 emotion classes using one-vs-rest strategy. The model achieves a micro-average and macro-average AUC of 1.00, with most individual classes also attaining AUC ≥ 0.99, demonstrating high separability across emotional states.

The Area Under the Curve (AUC) values for all individual classes were at least 0.99, with the emotions angry, calm, fearful, happy, neutral, sad, and surprised achieving a perfect AUC of 1.00. The emotion disgust had a slightly lower but still strong AUC of 0.99. Furthermore, the model yielded both micro-average and macro-average AUC values of 1.00, signifying excellent overall performance and consistent separability between emotion classes. These results confirm the robustness of our architecture in distinguishing between emotional states and reinforce its effectiveness for real-world speech emotion recognition applications.

Hyperparameter tuning was conducted using grid search in conjunction with validation set performance to optimize model configuration.

4.5 Integration Phase
The last step of the project was the implementation of the trained model in a web-based interface. To demonstrate the real-time applicability of the proposed Speech Emotion Recognition system, a web-based interface was created with the Flask framework, as shown in Figure 6. The trained deep learning model was served through Flask, a lightweight Python web framework, which handles interaction with the user. Users can upload an audio file or a speech spectrogram, which is processed on the server side: features are extracted and the corresponding emotion is predicted by the trained CNN-BiLSTM-Conv1D model. The resulting emotion label is shown immediately on the page, as shown in Figure 7. Such integration allows the research
Table 4. Comparative analysis with existing work.
Aspect | Base Paper Approach | Our Proposed Approach
Model Type | Random Forest, AdaBoost, Gradient Boosting (Ensemble Learning) | Hybrid Deep Learning: CNN + BiLSTM + Conv1D
Temporal Modeling | Not supported (no memory of sequence) | Bidirectional LSTM captures sequential dependencies
Data Augmentation | Not mentioned | Pitch shifting, background noise, time stretching, masking
Feature Extraction | MFCCs | MFCCs, Chroma, Mel-Spectrogram, Normalization
Accuracy Achieved | 85% | 94%
Evaluation Metrics | Accuracy only | Accuracy, Precision, Recall, F1-score, Confusion Matrix, ROC-AUC
Deployment | Not implemented | Deployed with Flask Web Interface for real-time inference
to be carried into an application, allowing end users to interact with the model in real time. The web application, in turn, performed well in tests of responsiveness and accuracy, which suggests it can be deployed in fields that include, but are not limited to, emotion-aware assistants, educational platforms, and mental health tracking programs.

Figure 6. System landing page.

Figure 7. System workflow overview.

5 Comparative Study and Novelty Justification
The differences between the proposed method and the approach taken in the base paper [17] are shown in Table 4. The original work depended mostly on Random Forest, AdaBoost and Gradient Boosting for emotion recognition, while we rely on CNN, BiLSTM and Conv1D in our hybrid deep learning setup. The new approach, based on a deep neural network, helps the model capture both the spatial features and the time-varying aspects of speech.

Also, to improve performance, we include additional data augmentation measures such as pitch shifting, background noise injection, time stretching and masking, methods that the base paper did not address. These techniques expose the model to a broader range of acoustic situations.

A limitation of the base paper is that it cannot model relationships between events in time, since traditional machine learning classifiers lack memory for this purpose. In contrast, our method uses BiLSTM layers, which allow the model to learn temporal features, making it more suitable for detecting emotions from speech.

Concerning accuracy, our model reached 94%, a substantial improvement over the 85% recorded in the base paper. Moreover, the results produced by our model are broken down by each of the 8 emotion categories and report the accuracy, recall and F1-score of every emotion.

We also make it possible for users to upload voice clips and instantly learn the emotions through a web
page, because we use Flask to deploy our model. This practical deployment narrows the gap between developing the model and using it in society.

Overall, our approach improves on the base paper in accuracy and robustness, providing a complete SER solution that can be deployed.

6 Conclusion
This project demonstrates the application of deep learning to Speech Emotion Recognition, supported by a scalable and user-friendly web interface. By leveraging the RAVDESS dataset and implementing a hybrid CNN-BiLSTM architecture, we achieved high accuracy in classifying a diverse range of emotions. Conventional machine learning algorithms set the baseline performance, whereas our deep learning system consistently outperformed them, highlighting the significance of temporal and spatial feature extraction for audio-based emotion detection. Moreover, the real-time implementation as a web application demonstrates the usability of the system in various practical applications like healthcare monitoring, virtual assistants, education, and customer service. In the future, several upgrades can be incorporated to enhance the versatility and impact of the system. One improvement is multi-format audio support, where a module that automatically decodes MP3 or other formats to WAV would simplify the input pipeline for users. Furthermore, widening the training data to encompass broader and multilingual datasets will extend the system’s generalizability to various languages, accents, and cultural aspects. Following this, adding automatic language detection would enable the system to dynamically adjust preprocessing and model inference based on the user’s spoken language, to support smooth multilingual use.

Cross-platform deployment, such as to mobile and desktop platforms, is also a natural extension, exposing the system to environments outside of the web. Deploying it in this fashion would make it more usable in areas such as in-car voice interfaces and offline medical applications. Furthermore, providing context awareness by keeping track of historical conversations could enable the model to grasp not only isolated statements but also the affective path throughout a conversation, resulting in more refined and accurate emotion detection.

By persisting in these areas of innovation, the project has significant potential to develop into a dynamic and responsive emotional AI tool that has the potential to greatly improve human-computer interaction across the board.

Data Availability Statement
Data will be made available on request.

Funding
This work was supported without any funding.

Conflicts of Interest
The authors declare no conflicts of interest.

Ethical Approval and Consent to Participate
Not applicable.

References
[1] Fayek, H. M., Lech, M., & Cavedon, L. (2017). Evaluating deep learning architectures for speech emotion recognition. Neural Networks, 92, 60-68. [Crossref]
[2] Singla, C., Singh, S., Sharma, P., Mittal, N., & Gared, F. (2024). Emotion recognition for human–computer interaction using high-level descriptors. Scientific reports, 14(1), 12122. [Crossref]
[3] Devillers, L., Vidrascu, L., & Lamel, L. (2005). Challenges in real-life emotion annotation and machine learning based detection. Neural Networks, 18(4), 407-422. [Crossref]
[4] Ververidis, D., & Kotropoulos, C. (2006). Emotional speech recognition: Resources, features, and methods. Speech communication, 48(9), 1162-1181. [Crossref]
[5] Eyben, F., Wöllmer, M., & Schuller, B. (2010, October). Opensmile: the munich versatile and fast open-source audio feature extractor. In Proceedings of the 18th ACM international conference on Multimedia (pp. 1459-1462). [Crossref]
[6] Zhao, J., Mao, X., & Chen, L. (2019). Speech emotion recognition using deep 1D & 2D CNN LSTM networks. Biomedical signal processing and control, 47, 312-323. [Crossref]
[7] Zhang, Y., Du, J., Wang, Z., Zhang, J., & Tu, Y. (2018, November). Attention based fully convolutional network for speech emotion recognition. In 2018 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC) (pp. 1771-1775). IEEE. [Crossref]
[8] RAVDESS Emotional Speech Audio Dataset. (2025, July 13). RAVDESS Emotional Speech Audio [Dataset]. Retrieved from [Link]
ler/ravdess-emotional-speech-audio
[9] scikit-learn. (n.d.). LabelEncoder. Retrieved July 13,
2025, from [Link] [Link]
[10] Data augmentation using pitch shifting. (2023).
Applied Acoustics. Retrieved July 13, 2025, from
https://[Link]/resource/speech-data-augmentation-voice-audio/
[11] Tzirakis, P., Trigeorgis, G., Nicolaou, M. A., Schuller,
B. W., & Zafeiriou, S. (2017). End-to-end multimodal
emotion recognition using deep neural networks. IEEE
Journal of Selected Topics in Signal Processing, 11(8),
1301-1309. [Crossref]
[12] Busso, C., Bulut, M., Lee, C. C., Kazemzadeh, A.,
Mower, E., Kim, S., ... & Narayanan, S. S. (2008).
IEMOCAP: Interactive emotional dyadic motion
capture database. Language Resources and Evaluation,
42(4), 335-359. [Crossref]
[13] Batliner, A., Steidl, S., & Nöth, E. (2008). Releasing
a thoroughly annotated and processed spontaneous
emotional database: the FAU Aibo Emotion Corpus.
[14] Shyam, R., Ayachit, S. S., Patil, V., & Singh, A. (2020,
December). Competitive analysis of the top gradient
boosting machine learning algorithms. In 2020 2nd
international conference on advances in computing,
communication control and networking (ICACCCN) (pp.
191-196). IEEE. [Crossref]
[15] Kumar, M., Singhal, S., Shekhar, S., Sharma, B., &
Srivastava, G. (2022). Optimized stacking ensemble
learning model for breast cancer detection and
classification using machine learning. Sustainability,
14(21), 13998. [Crossref]
[16] Trigeorgis, G., Ringeval, F., Brueckner, R., Marchi, E.,
Nicolaou, M. A., Schuller, B., & Zafeiriou, S. (2016,
March). Adieu features? End-to-end speech emotion
recognition using a deep convolutional recurrent
network. In 2016 IEEE international conference on
acoustics, speech and signal processing (ICASSP) (pp.
5200-5204). IEEE. [Crossref]
[17] Guo, Y., Xiong, X., Liu, Y., Xu, L., & Li, Q. (2022). A
novel speech emotion recognition method based on
feature construction and ensemble learning. PLoS One,
17(8), e0267132. [Crossref]
[18] Barhoumi, C., & BenAyed, Y. (2024). Real-time speech
emotion recognition using deep learning and data
augmentation. Artificial Intelligence Review, 58(2), 49.
[Crossref]
[19] Askari, M. H., Shahzad, A., Faraz, A., Fuzail, M.,
Aslam, N., & Tariq, M. A. (2025). Effective
speech emotion recognition using R-CNN
& BLSTM. Kashf Journal of Multidisciplinary Research,
2(06), 293-309. [Crossref]

Shreya Tiwari is currently pursuing a
Bachelor of Technology ([Link]) degree in
Computer Science and Engineering with a
specialization in Artificial Intelligence and
Machine Learning at Amity University, Mohali,
Punjab. Her research interests lie in the field
of affective computing, with a particular focus
on speech emotion recognition, machine
learning, and deep learning techniques
for human-centered AI systems. (Email:
shreya.tiwari6@[Link])

Devansh Kumar is currently pursuing a
Bachelor of Technology ([Link]) degree in
Computer Science and Engineering with a
specialization in Artificial Intelligence and
Machine Learning at Amity University, Mohali,
Punjab. His research interests lie in the field
of affective computing, with a particular focus
on speech emotion recognition, machine
learning, and deep learning techniques
for human-centered AI systems. (Email:
devansh.kumar2@[Link])

Akshit Mahajan is currently pursuing a
Bachelor of Technology ([Link]) degree in
Computer Science and Engineering with a
specialization in Artificial Intelligence and
Machine Learning at Amity University, Mohali,
Punjab. His research interests lie in the field
of affective computing, with a particular focus
on speech emotion recognition, machine
learning, and deep learning techniques
for human-centered AI systems. (Email:
[Link]@[Link])

Dr. Silky Sachar is an Assistant Professor
in Computer Science at Amity University,
India. She holds a Ph.D. in Computer
Science with a research focus on machine
learning, image processing, and metaheuristic
optimization. Her work integrates classical ML
techniques with deep learning and attention
mechanisms for real-world applications.
(Email: ssachar@[Link])