Emotion Recognition
N Sai Satwik Reddy∗ , V Venkata Alluri Rohith∗ , V Poorna Muni Sasidhar Reddy∗ ,
Y Shashank Reddy∗ , Jyothish Lal G∗
∗ Amrita School of Artificial Intelligence, Coimbatore, Amrita Vishwa Vidyapeetham, India
{satwikreddy987, vennaalluri1, vpoornareddy2004, ysrpersom}@[Link],
g jyothishlal@[Link]
Abstract—
Index Terms—Speech emotion recognition
I. INTRODUCTION
Speech signals, a fundamental aspect of human communication, carry an abundance of emotional data that enriches the meaning and significance of our conversations. Because they convey both the linguistic content and the emotional state of the speaker, speech signals are extremely valuable. The recognition of these emotional cues has become highly significant; thus, Speech Emotion Recognition (SER) has emerged as a research area with applications in many important aspects of our daily lives, including computer and robot interfaces, legally and socially acceptable applications, counselling, therapy, etc. Speech emotion recognition plays an essential role in Human-Computer Interaction (HCI) by discriminating between the emotional nuances of human communication. As interaction with technology becomes more conversational and personalized, incorporating emotional intelligence into these systems becomes indispensable. SER serves as a bridge for machines to comprehend and respond appropriately to human emotions, fostering a more natural and empathetic connection between humans and computers.
However, the problem of speech emotion recognition is inherently challenging due to the dynamic and subjective nature of emotions. Unlike other modalities such as hand gestures or facial expressions, speech emotions are often subtle and context-dependent, making their identification a complex task. Various modalities, including facial expressions, body language, physiological signals, and even textual analysis, contribute to a holistic understanding of emotions. Nevertheless, the prevalence of emotional information in audio waves makes speech a relevant modality for emotion recognition. The prevalence of the audio modality in emotion recognition can be attributed to its unique ability to capture the nuances of human expression, including prosody, intonation, and other acoustic features.
Speech emotion recognition holds promise for diagnosing and aiding patients with speech-related disorders such as dysarthria or stuttering. The subtle variations in speech patterns can provide valuable insights for medical professionals, aiding in the assessment and treatment of these disorders. In the realm of mental health, speech emotion recognition extends its utility to detecting conditions like depression, anxiety, and even suicidal thoughts. Analyzing acoustic features, such as pitch, intensity, and speech rate, can unveil emotional distress markers, enabling timely intervention and support for individuals in need. Real-life applications of speech emotion recognition span diverse industries, from customer service and virtual assistants to education and entertainment. Advancements in these fields leverage SER to create more intuitive and responsive systems, ultimately enhancing user experiences. Industries benefit from improved customer satisfaction, personalized learning experiences, and emotionally engaging entertainment content, leading to a paradigm shift in how we interact with technology.
II. RELATED WORK
Numerous deep learning (DL) methodologies have been proposed for emotion recognition in recent years.
A lightweight dual-stream conformer fusion network is designed with convolution kernels of sizes (3×3), (1×11), and (11×1) to extract a diverse set of features from mel-spectrograms and mel-frequency cepstral coefficients (MFCCs) obtained from the audio signals [1]. The features extracted from these three different methods are then fed into the second part of the overall network for emotion classification. In [2], constant-Q transform-based modulation spectrograms are extracted from the voice records of two well-known databases, EmoDB and RAVDESS, and fed into two different deep neural networks (DNNs) to classify the emotions. The variant in which a support vector machine (SVM) classifies the embeddings produced by the DNN outperformed the plain DNN. In [3], a combination of MFCCs and time-domain features is extracted and input into a convolutional neural network (CNN) for emotion recognition, and this approach also outperformed the standard machine learning (ML) approaches. Complex MFCCs are used as input to the sequential DNN in [4], and the metrics improved significantly when tested using gender-integrated differentiation in the RAVDESS dataset. Multiple acoustic features, including MFCCs, linear prediction cepstral coefficients (LPCCs), wavelet packet transform (WPT), and other time-domain features, are obtained from EmoDB and RAVDESS in [5], and a one-dimensional CNN is utilized for classification. The architecture of the SER system proposed in [6] is designed for three tasks: estimating the intensity of the emotion, identifying the type of emotion, and identifying the gender of the speaker. Time-domain and spectral-domain filters are applied to the mel-spectrograms extracted from the voice records and input into
the CNNs and long short-term memory (LSTM) for feature learning to perform the aforementioned tasks. In [7], the input into the VGG network is chaograms, which represent the 3-dimensional tensor obtained from the speech records in RGB color space. The gray wolf optimization method is used for fine-tuning the hyperparameters.
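Several of the works above rely on mel-spectrograms and MFCCs, both of which are built on the mel frequency scale. The following is a minimal sketch of that scale, not taken from any of the cited papers. It assumes the widely used HTK-style formula m = 2595 log10(1 + f/700); the function names, the number of filters, and the 8 kHz upper limit are illustrative choices only.

```python
import math


def hz_to_mel(f_hz: float) -> float:
    """Convert a frequency in Hz to mels (HTK-style formula, assumed here)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_center_frequencies(n_filters: int, f_min: float, f_max: float) -> list:
    """Center frequencies (Hz) of n_filters bands equally spaced in mels."""
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_max - m_min) / (n_filters + 1)
    return [mel_to_hz(m_min + step * (i + 1)) for i in range(n_filters)]


if __name__ == "__main__":
    # Hypothetical setting: 10 filters between 0 Hz and 8 kHz.
    centers = mel_center_frequencies(n_filters=10, f_min=0.0, f_max=8000.0)
    print([round(c) for c in centers])
```

Because the bands are equally spaced in mels, their spacing in Hz widens with frequency, which gives low frequencies finer resolution than high ones. A practical system would more likely compute these features with an audio library than by hand.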
The authors of [8] applied data augmentation, adding white Gaussian noise to the speech records and generating pitch-shifted and time-stretched versions of them. Subsequently, multiple time-domain and frequency-domain features such as zero-crossing rate (ZCR), MFCCs, chromagrams, etc., were extracted and fed into multiple DL models such as ensemble models, attention-based models, and transfer learning-based models for emotion recognition. A Raspberry Pi-based hardware implementation of the SER system is proposed in [9], utilizing a multi-layer perceptron neural network that uses MFCCs for classifying emotions. A blend of 2-dimensional CNN and LSTM networks with MFCC features as input is proposed in [10] and evaluated on a dataset comprising records from the RAVDESS, SAVEE, and TESS datasets to detect eight classes of emotions. In [11], log-mel spectrograms are extracted from the audio signals
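The noise-based augmentation and the zero-crossing rate feature mentioned for [8] can be illustrated with a short sketch. This is not the implementation used in [8]: the function names, the signal-to-noise ratio parameter `snr_db`, and the 440 Hz test tone are assumptions made for the example, and pitch shifting and time stretching are not covered.

```python
import math
import random


def add_white_gaussian_noise(signal, snr_db, rng=None):
    """Return a noisy copy of `signal` at approximately `snr_db` dB SNR."""
    if not signal:
        return []
    rng = rng or random.Random()
    signal_power = sum(x * x for x in signal) / len(signal)
    # SNR (dB) = 10 * log10(signal_power / noise_power)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0.0) != (b >= 0.0)
    )
    return crossings / (len(frame) - 1)


if __name__ == "__main__":
    # Hypothetical input: one second of a 440 Hz tone sampled at 16 kHz.
    sample_rate = 16000
    tone = [
        math.sin(2.0 * math.pi * 440.0 * n / sample_rate)
        for n in range(sample_rate)
    ]
    noisy = add_white_gaussian_noise(tone, snr_db=10.0, rng=random.Random(0))
    print(zero_crossing_rate(tone), zero_crossing_rate(noisy))
```

Passing a seeded `random.Random` keeps the augmentation reproducible, which helps when comparing models trained on the same augmented data.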
III. METHODOLOGY
IV. RESULTS AND DISCUSSION
V. CONCLUSION
REFERENCES
[1] M. Tellai, L. Gao, and Q. Mao, “An efficient speech emotion recognition based on a dual-stream CNN-transformer fusion network,” International Journal of Speech Technology, vol. 26, no. 2, pp. 541–557, 2023.
[2] P. Singh, M. Sahidullah, and G. Saha, “Modulation spectral features for speech emotion recognition using deep neural networks,” Speech Communication, vol. 146, pp. 53–69, 2023.
[3] A. S. Alluhaidan, O. Saidani, R. Jahangir, M. A. Nauman, and O. S. Neffati, “Speech emotion recognition through hybrid features and convolutional neural network,” Applied Sciences, vol. 13, no. 8, p. 4750, 2023.
[4] S. Patnaik, “Speech emotion recognition by using complex MFCC and deep sequential model,” Multimedia Tools and Applications, vol. 82, no. 8, pp. 11897–11922, 2023.
[5] K. Bhangale and M. Kothandaraman, “Speech emotion recognition based on multiple acoustic features and deep convolutional neural network,” Electronics, vol. 12, no. 4, p. 839, 2023.
[6] Z.-T. Liu, M.-T. Han, B.-H. Wu, and A. Rehman, “Speech emotion recognition based on convolutional neural network with attention-based bidirectional long short-term memory network and multi-task learning,” Applied Acoustics, vol. 202, p. 109178, 2023.
[7] M. R. Falahzadeh, F. Farokhi, A. Harimi, and R. Sabbaghi-Nadooshan, “Deep convolutional neural network and gray wolf optimization algorithm for speech emotion recognition,” Circuits, Systems, and Signal Processing, vol. 42, no. 1, pp. 449–492, 2023.
[8] M. R. Ahmed, S. Islam, A. M. Islam, and S. Shatabda, “An ensemble 1D-CNN-LSTM-GRU model with data augmentation for speech emotion recognition,” Expert Systems with Applications, vol. 218, p. 119633, 2023.
[9] S. Kumar, M. A. Haq, A. Jain, C. A. Jason, N. R. Moparthi, N. Mittal, and Z. S. Alzamil, “Multilayer neural network based speech emotion recognition for smart assistance,” Computers, Materials & Continua, vol. 75, no. 1, 2023.
[10] J. Singh, L. B. Saheer, and O. Faust, “Speech emotion recognition using attention model,” International Journal of Environmental Research and Public Health, vol. 20, no. 6, p. 5140, 2023.
[11] Z. Chen, J. Li, H. Liu, X. Wang, H. Wang, and Q. Zheng, “Learning multi-scale features for speech emotion recognition with connection attention mechanism,” Expert Systems with Applications, vol. 214, p. 118943, 2023.