
Speech Emotion Recognition with ML

Uploaded by

wigeb23329
Copyright
© © All Rights Reserved
Speech Emotion Recognition using Machine Learning
Aman Agrahari, B.E. 4th Year, Computer Science & Engineering Department, Chandigarh University, Punjab, amanagrahari391@[Link]
Parveen Kumar Bajaj, Professor, Department of Computer Science & Engineering, Chandigarh University, Punjab, parveen.e15292@[Link]
Pooja, B.E. 4th Year, Computer Science & Engineering Department, Chandigarh University, Punjab, poojachoudhary8267@[Link]
Kirti Pandey, B.E. 4th Year, Computer Science & Engineering Department, Chandigarh University, Punjab, kirtipandey11mar@[Link]
Shivita Kanv, B.E. 4th Year, Computer Science & Engineering Department, Chandigarh University, Punjab, shivitakanv9@[Link]
Aadyant, B.E. 4th Year, Computer Science & Engineering Department, Chandigarh University, Punjab, @[Link]

Abstract: Speech recognition is an essential technology used in many applications, from virtual assistants to automated transcription services. This paper explores the real-life implementation of machine learning-based speech recognition. It differs from traditional speech recognition systems, which depend on statistical models and handcrafted features, methods that newer deep learning approaches have surpassed in performance. The paper gives an overview of machine learning algorithms that have been used in speech recognition, such as deep neural networks (DNNs), convolutional neural networks (CNNs), recurrent neural networks (RNNs) and attention-based models such as transformers. We evaluate model performance on four benchmark datasets and discuss the advantages and disadvantages of each approach. We also discuss issues that arise during speech recognition, such as accent variation, noise and speaker diarization. We explain how we addressed these problems for language recognition and verify whether they are common modular tasks (accent, background noise) with continuous speech. To the best of our knowledge, we have achieved state-of-the-art recognition accuracy and error rates through a large-scale experimental study. Lastly, we outline prospects for multimodal input and for more energy-efficient models that are practical for real-time problems.

Keywords: Speech Recognition, Machine Learning, Deep Learning, Deep Neural Network, Convolutional Neural Networks, Recurrent Neural Network, Transformer, Error rate, Background noise, Multimodal inputs

I. Introduction

Detecting human emotions in speech is an appealing technology because it could improve how humans interact with computers, and it is relevant to artificial intelligence and affective computing more broadly. Speech Emotion Recognition (SER) frameworks aim to detect and label a speaker's emotions from the acoustic signal, enabling more natural and more empathetic human-machine interaction. SER has applications in many fields, such as customer service, mental health monitoring, personal assistants and social robotics. By analyzing emotion in speech, a machine can respond in a way better suited to the user's needs, which improves the user experience [1].

Traditional speech recognition systems were based on statistical techniques such as Gaussian Mixture Models (GMMs) and Hidden Markov Models (HMMs), which require manually designed features such as Mel-Frequency Cepstral Coefficients (MFCCs). Although reasonably successful, the speech synthesis and natural language understanding techniques developed with these methods coped poorly with the complex range of variation in sounds and forms that characterizes human speech. These methods have their limitations: they usually relied on handcrafted features and therefore missed the fine-grained variations in emotional speech that are difficult to capture by manual design. These models also had difficulty generalizing across speakers, languages and acoustic conditions, and their accuracy dropped in real-world use [2].
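To make the notion of handcrafted features concrete, the following minimal sketch computes two simple frame-level descriptors, short-time energy and zero-crossing rate, on a synthetic tone. It is an illustration only: these descriptors are much simpler than MFCCs, and the function names, the 25 ms frame and 10 ms hop (400 and 160 samples at an assumed 16 kHz sampling rate) and the test signal are assumptions chosen for the example, not details taken from this paper.

```python
import math

def frame_signal(samples, frame_len, hop_len):
    """Split a sample list into overlapping frames (a trailing partial frame is dropped)."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop_len)]

def short_time_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def handcrafted_features(samples, frame_len=400, hop_len=160):
    """Return one (energy, zero-crossing rate) pair per frame."""
    return [(short_time_energy(f), zero_crossing_rate(f))
            for f in frame_signal(samples, frame_len, hop_len)]

# Hypothetical input: 0.1 s of a 200 Hz sine at an assumed 16 kHz sampling rate.
sample_rate = 16000
tone = [0.5 * math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(1600)]
features = handcrafted_features(tone)
```

A classical pipeline would pass such per-frame values, or statistics over them, to a classifier. The point of the sketch is that every descriptor has to be chosen and coded by hand, which is the limitation the paragraph above describes.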
Machine learning, and deep learning in particular, has been the single most important technological advance in making SER systems more effective. Deep neural networks (DNNs), convolutional neural networks (CNNs), recurrent neural networks (RNNs) and newer architectures such as long short-term memory (LSTM) networks and transformers have shown an ability to learn discriminative features from raw audio data with minimal or no human involvement. These models have also outperformed conventional methods because they learn the speech features that correspond to emotional states such as happiness, sadness, anger and neutrality. The ongoing transition to deep learning models has produced powerful and scalable SER systems that can run in real time and be adapted to different real-world applications [3].

In this paper, we attempt an extensive exploration of speech emotion recognition with machine learning algorithms. We first present different deep learning architectures and their applications in SER, highlighting the pros and cons of each model. We then discuss the challenges involved in developing SER systems, including noisy environments, speaker variability and the scarcity of large annotated emotional speech datasets. The paper provides experimental results on standard benchmark datasets to validate the performance of different models. Finally, we conclude and suggest future directions, including multimodal emotion recognition and transfer learning, to improve state-of-the-art SER systems in more general settings.

II. Literature Review

The automatic detection of emotions from speech signals with the help of machine learning, referred to as Speech Emotion Recognition (SER), is a promising research domain. Machine understanding of emotions from speech is useful for human-computer interaction, mental health diagnostics and customer service systems. This section describes the structure of an emotion recognition system for speech: a typical SER system consists of preprocessing to reduce noise, feature extraction to capture emotional information, and a classification stage that operates on the extracted features. Commonly used features are prosodic (e.g. pitch, energy, rhythm), spectral (e.g. MFCCs, Mel-spectrograms) and voice quality parameters (jitter, shimmer, HNR). Such features are designed to capture the emotional content of speech and thereby make it possible to distinguish four emotions: happiness, sadness, anger and fear.

Early SER work used traditional machine learning algorithms such as Support Vector Machines (SVM), k-Nearest Neighbors (KNN) and Hidden Markov Models (HMM). These algorithms performed well on constrained emotion classification tasks and smaller datasets, but struggled to scale to real-world, noisy environments.

With the development of deep learning, newer models have come into use, such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), especially Long Short-Term Memory (LSTM) networks. CNNs are good at learning spatial features from spectrogram representations of speech, while an LSTM can model temporal dependencies in the speech signal and so capture how emotions change over time. Hybrid models such as CNN-LSTM have improved recognition accuracy by integrating spatial and temporal feature learning. These deep learning models generally perform better than standard approaches when trained on large datasets and combined with suitable feature subspaces [4].

IEMOCAP, EMO-DB, RAVDESS and CREMA-D are among the most widely used datasets in SER research. They contain speech recordings labeled by emotion, so researchers can train and compare models under standardized conditions. Despite this, SER is still far from perfect. First, inter-speaker variability is a major challenge: because everyone speaks differently, models may have trouble generalizing to the emotion and accent of a speaker outside the training set. Second, SER systems often lose accuracy in real-world environments because of background noise. Emotion ambiguity is also an issue, since certain emotions (e.g. fear and surprise) exhibit similar acoustic patterns and are hard to tell apart. Finally, data imbalance is prevalent in SER: some emotions (such as happiness and anger) are overrepresented while others (such as disgust or fear) are underrepresented, which biases the model and reduces its effectiveness [5].

Researchers have come up with various methods to address these challenges. Transfer learning has recently become popular: models pre-trained on massive datasets can be fine-tuned to perform specific SER tasks. This approach addresses the key problem of limited labeled data. Deep learning models have also been equipped with attention mechanisms, inspired by the brain's selective focus, that concentrate on the segments of a speech signal most important for emotion classification. In addition, integrating speech with other modalities such as facial expressions or physiological signals has yielded more powerful emotion detection. Such systems can deal better with variation in emotional expression because they draw on diverse sources of emotional information [6].

However, cross-cultural and cross-linguistic differences in emotional expression remain a substantial challenge for the generalization of SER systems. Cultural and language differences affect the way emotions are conveyed, and even within a single language, accents and dialects color how emotion is expressed. Intonation patterns differ between speaker communities, so a model trained on emotional speech from Latin American speakers, for example, may not perform well on angry speech from Singaporean speakers. Researchers are working on cross-lingual adaptation techniques and on additional, more culturally diverse datasets to address this problem.

Given the progress made in SER using machine learning, a number of novel approaches and future studies could push the field forward. One core trend is the rise of deep learning architectures specifically tuned for speech processing. Although CNNs and RNNs are widely established, other architectures, notably Transformer models with self-attention mechanisms, have entered SER research. Transformers, the state of the art in NLP, can model long-range dependencies in speech data much better than RNNs and can therefore capture complex patterns of affect that may be spread across multiple preceding speech turns. They are expected to outperform traditional deep learning architectures in emotion detection, and to be especially beneficial when emotion shifts over time.

Another significant advance has been the proliferation of unsupervised and semi-supervised learning methods. Because large annotated emotional speech datasets are rare, unsupervised feature learning on unlabeled speech databases has become essential. Among deep learning approaches, autoencoders and generative adversarial networks (GANs) have been investigated for learning effective feature representations in SER without large-scale labeled data. Semi-supervised learning (trained on both labeled and unlabeled data) has also shown great potential for improving model accuracy while reducing the need for expensive, labor-intensive data annotation.

Transfer learning and domain adaptation have been effective methods in SER, especially when dealing with data from other languages, accents and dialects. With transfer learning, models originally trained on large general datasets can be retrained with a small amount of labeled data, including data in another language. A model pre-trained on a large dataset of English emotional speech can be adapted to another language or accent with additional fine-tuning, without retraining from scratch. Domain adaptation methods, in turn, help models stay consistent across acoustic environments, making SER systems more robust to noise in real-world scenarios.

Another direction in SER research is multimodal emotion recognition. Traditional SER systems rely exclusively on acoustic features of speech, whereas multimodal approaches combine information from different modalities such as facial expressions, body language and physiological signals (e.g. heart rate or skin conductance). Combining modalities may considerably improve emotion recognition, since emotions are expressed through both verbal and non-verbal cues. Multimodal systems are also beneficial in complex or ambiguous emotional scenarios where speech alone does not carry enough information to classify the emotion properly.

The future of SER also depends on advances in Natural Language Processing. Continued research in NLP will enable SER systems to understand human emotions with increasing accuracy and to grasp nuanced meanings, leading to more empathetic and context-aware interactions [7].

Additionally, improvements in multilingual processing will make SER accessible to a global audience, bridging language barriers and enhancing cross-cultural communication.
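The multimodal combination discussed earlier in this section can be illustrated with late fusion, one common way to merge modalities: each modality produces its own class probabilities and the system averages them with per-modality weights. The sketch below is a hedged example, not a method reported in this paper; the function name `late_fusion`, the probability values and the 0.6/0.4 weights are all hypothetical.

```python
def late_fusion(modality_probs, weights):
    """Weighted average of per-modality class probabilities.

    modality_probs maps a modality name to a {label: probability} dict;
    weights maps the same modality names to non-negative weights.
    """
    labels = next(iter(modality_probs.values())).keys()
    total_weight = sum(weights[m] for m in modality_probs)
    return {
        label: sum(weights[m] * probs[label]
                   for m, probs in modality_probs.items()) / total_weight
        for label in labels
    }

# Hypothetical classifier outputs for one utterance (values are made up).
speech = {"happiness": 0.10, "sadness": 0.15, "anger": 0.40, "fear": 0.35}
face = {"happiness": 0.05, "sadness": 0.10, "anger": 0.25, "fear": 0.60}

fused = late_fusion({"speech": speech, "face": face},
                    {"speech": 0.6, "face": 0.4})
prediction = max(fused, key=fused.get)
```

In this made-up case the speech model alone narrowly favors anger, while the fused scores favor fear. This is the kind of ambiguous situation in which a second modality is expected to help.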
SER in Augmented and Virtual Reality

In augmented and virtual reality, SER can enrich the experience by enabling real-time emotional interaction between users and virtual worlds. Virtual systems can adapt dynamically to changes in the user's mood and deliver personalized responses (such as feedback or customized virtual settings), improving communication in VR. This is useful in gaming, education and mental health, where it supports personalized responses and interventions. Challenges remain, however, such as real-time processing and reading emotions in artificial environments. Multimodal approaches that combine speech with other inputs can achieve higher accuracy [8]. As AR/VR evolves, SER will prove invaluable for creating emotionally aware and intelligent experiences across a range of industries.

Impact and Implications

The influence of SER is considerable, and practically every field can benefit from it. In technology, SER improves human-computer interaction by allowing virtual assistants, chatbots and customer service systems to identify and respond to emotions, personalizing the user experience through a more empathetic approach. In healthcare, SER can be used to monitor mental health by detecting emotional changes that could indicate depression, stress or anxiety, enabling early intervention. It also enhances user engagement and collaboration in areas such as education and remote work, by enabling virtual platforms to better identify and respond to emotional cues.

However, SER raises ethical questions, particularly about privacy: exploiting emotional data can be regarded as an intrusion into individuals' private lives, and such systems in production could easily be misused (for surveillance or manipulative marketing). Furthermore, SER algorithms can be biased and may misinterpret emotions, since people from different cultures, and individuals generally, express emotions differently. These realities raise questions about the ethics and social implications of SER as it becomes further integrated into the technology we use daily.

Multimodal Emotion Recognition [8]

To enhance SER in AR/VR, one possible approach is multimodal emotion recognition, which combines speech emotion sensing with other sources such as facial expressions, body gestures and physiological signals. Combining these cues with speech input gives better and more complete emotion detection, for example by tracking facial muscle signals and using VR tracking data that reports body gestures, head position and similar parameters. For instance, the accuracy of detecting stress in a voice might improve if it is cross-checked against the speaker shifting their head rapidly and adopting a rigid body posture, which suggests that the speaker felt nervous. Wearables in AR with bio-sensing capabilities can add physiological signals (such as heart rate and skin conductance), further informing the estimate of the user's emotional state. This combined multimodal approach provides a more comprehensive architecture for emotion recognition and results in more dynamic, proactive interaction.

Advancement in Speech Emotion Recognition

Deep learning and multimodal approaches have advanced the state of the art in SER, leading to improved system accuracy. The move from classical methods such as SVM and HMM to deep learning architectures has boosted performance because these architectures are better at modeling temporal speech features alongside their non-temporal, spatial counterparts. Transformer models and self-attention mechanisms have further advanced emotion detection, as they can model long-range dependencies in speech while weighting the varying contributions of different parts of the signal. Combining speech with facial expressions or physiological signals extends these methods toward even more comprehensive emotion detection. To address the dearth of labeled data, unsupervised and semi-supervised learning methods such as autoencoders are employed along with transfer learning. Improvements in real-time SER have since extended the technology to live interaction and emotion detection in industries including healthcare and customer service. Nevertheless, challenges remain with respect to bias, cultural variance and robustness to background noise.
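The idea that attention weights the varying contributions of different parts of an utterance can be sketched as dot-product attention pooling over frame-level feature vectors. This is a simplified illustration under stated assumptions: in a trained model the query vector would be learned, whereas here `query` and the four two-dimensional `frames` are fixed, hypothetical values, and the function names are chosen for the example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(frames, query):
    """Score each frame by its dot product with the query, then return
    the attention weights and the weighted sum of the frames."""
    scores = [sum(f * q for f, q in zip(frame, query)) for frame in frames]
    weights = softmax(scores)
    dim = len(frames[0])
    pooled = [sum(w * frame[d] for w, frame in zip(weights, frames))
              for d in range(dim)]
    return weights, pooled

# Hypothetical frame features: the third frame stands for an emotionally salient segment.
frames = [[0.1, 0.0], [0.2, 0.1], [2.0, 1.5], [0.1, 0.2]]
query = [1.0, 1.0]  # assumed fixed here; learned during training in practice
weights, pooled = attention_pool(frames, query)
```

With these values most of the weight falls on the third frame, so the pooled utterance vector is dominated by the salient segment instead of being a plain average over all frames.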
III. Conclusion

The potential applications of SER in industries such as healthcare, entertainment, education and customer service are far-reaching. SER allows machines to detect and interpret human emotions through speech, which enhances human-computer interaction and makes systems more empathetic, responsive and personalized. The incorporation of deep learning, multimodal methods and online processing has greatly advanced SER, raising its accuracy as well as its usability, including from a security point of view. Nonetheless, many challenges remain, in particular adaptation across languages and cultures, bias reduction and robustness under noisy conditions. Overcoming these challenges will be crucial for realizing the full potential of SER and for deploying it safely and ethically across a wide variety of contexts.

In the future, we anticipate that further breakthroughs in machine learning will significantly advance SER, with transformer models incorporating self-attention working together with multimodal frameworks that connect speech signals with non-speech data such as facial expression, body language and physiological cues. These developments promise next-generation SER systems that are more accurate, more robust and better at recognizing a wider array of cross-cultural emotional patterns. In addition, developments in edge computing and 5G will provide low-latency real-time processing, allowing SER systems to become an intuitive component of everyday technologies such as virtual assistants, smart devices and immersive AR/VR experiences [10].

But as SER technology becomes increasingly common, ethics will play a pivotal role in how it is deployed. To protect the autonomy of users and treat them fairly, concerns such as data privacy, emotional manipulation and algorithmic bias need careful handling in SER systems. It will be equally important to define ethical standards for the proper use of emotional data, in order to prevent its abuse in surveillance technologies or manipulative marketing tactics.

IV. References

[1] Sahu, S. K., Nandakumar, R., & Mohamed, A. (2020). "Speech Emotion Recognition Using Deep Learning Techniques." IEEE Access, 8, 12043-12050. DOI: 10.1109/ACCESS.2020.2966332

[2] Latif, S., Qayyum, A., Usama, M., & Qadir, J. (2020). "Speech Emotion Recognition Using Deep Learning: A Review." IEEE Transactions on Affective Computing, 11(3), 429-447. DOI: 10.1109/TAFFC.2018.2874985

[3] Akçay, M. B., & Oguz, K. (2020). "Speech emotion recognition: Emotional models, databases, features, preprocessing methods, supporting modalities, and classifiers." Speech Communication, 116, 56-76. DOI: 10.1016/[Link].2019.12.001

[4] Trigeorgis, G., Nicolaou, M. A., & Zafeiriou, S. (2016). "Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network." Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5200-5204. DOI: 10.1109/ICASSP.2016.7472669

[5] Tawari, A., & Trivedi, M. M. (2010). "Speech Emotion Analysis: Exploring the Role of Context." IEEE Transactions on Multimedia, 12(6), 502-509. DOI: 10.1109/TMM.2010.2055244

[6] Zeng, Z., Pantic, M., Roisman, G. I., & Huang, T. S. (2009). "A Survey of Affect Recognition Methods: Audio, Visual, and Spontaneous Expressions." IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(1), 39-58. DOI: 10.1109/TPAMI.2008.52

[7] Schuller, B., Steidl, S., & Batliner, A. (2009). "The INTERSPEECH 2009 Emotion Challenge." Proceedings of INTERSPEECH 2009, 312-315. URL:[Link] interspeech_2009

[8] Huang, Z., Epps, J., & Ambikairajah, E. (2011). "An Investigation of Emotion Recognition from Speech Under Stress." IEEE Transactions on Affective Computing, 2(3), 152-161. DOI: 10.1109/TAFFC.2011.13

[9] Mirsamadi, S., Barsoum, E., & Zhang, C. (2017). "Automatic Speech Emotion Recognition Using Recurrent Neural Networks with Local Attention." Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2227-2231. DOI: 10.1109/ICASSP.2017.7952552

[10] Zhang, Z., & Schuller, B. W. (2020). "Recent advances in end-to-end deep learning for speech emotion recognition." Proceedings of the 2020 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 6154-6158. DOI: 10.1109/ICASSP40776.2020.9053568
