Multimodal Speech Emotion Recognition

This paper presents a novel deep dual recurrent encoder model for speech emotion recognition that simultaneously utilizes audio and text data to improve classification accuracy. The model outperforms previous state-of-the-art methods on the IEMOCAP dataset, achieving accuracies between 68.8% and 71.8% across four emotion categories. The study highlights the importance of integrating high-level textual input with low-level audio signals to enhance the understanding of emotional content in speech.

MULTIMODAL SPEECH EMOTION RECOGNITION USING AUDIO AND TEXT

Seunghyun Yoon, Seokhyun Byun, and Kyomin Jung

Dept. of Electrical and Computer Engineering, Seoul National University, Seoul, Korea
{mysmilesh, byuns9334, kjung}@[Link]

arXiv:1810.04635v1 [[Link]] 10 Oct 2018

ABSTRACT

Speech emotion recognition is a challenging task, and extensive reliance has been placed on models that use audio features in building well-performing classifiers. In this paper, we propose a novel deep dual recurrent encoder model that utilizes text data and audio signals simultaneously to obtain a better understanding of speech data. As emotional dialogue is composed of sound and spoken content, our model encodes the information from audio and text sequences using dual recurrent neural networks (RNNs) and then combines the information from these sources to predict the emotion class. This architecture analyzes speech data from the signal level to the language level, and it thus utilizes the information within the data more comprehensively than models that focus on audio features. Extensive experiments are conducted to investigate the efficacy and properties of the proposed model. Our proposed model outperforms previous state-of-the-art methods in assigning data to one of four emotion categories (i.e., angry, happy, sad and neutral) when the model is applied to the IEMOCAP dataset, as reflected by accuracies ranging from 68.8% to 71.8%.

Index Terms: speech emotion recognition, computational paralinguistics, deep learning, natural language processing

1. INTRODUCTION

Recently, deep learning algorithms have successfully addressed problems in various fields, such as image classification, machine translation, speech recognition, text-to-speech generation and other machine learning related areas [1, 2, 3]. Similarly, substantial improvements in performance have been obtained when deep learning algorithms have been applied to statistical speech processing [4]. These fundamental improvements have led researchers to investigate additional topics related to human nature, which have long been objects of study. One such topic involves understanding human emotions and reflecting them through machine intelligence, such as emotional dialogue models [5, 6].

In developing emotionally aware intelligence, the very first step is building robust emotion classifiers that display good performance regardless of the application; this outcome is considered to be one of the fundamental research goals in affective computing [7]. In particular, the speech emotion recognition task is one of the most important problems in the field of paralinguistics. This field has recently broadened its applications, as it is a crucial factor in optimal human-computer interactions, including dialog systems. The goal of speech emotion recognition is to predict the emotional content of speech and to classify speech according to one of several labels (i.e., happy, sad, neutral, and angry). Various types of deep learning methods have been applied to increase the performance of emotion classifiers; however, this task is still considered to be challenging for several reasons. First, insufficient data for training complex neural network-based models are available, due to the costs associated with human involvement. Second, the characteristics of emotions must be learned from low-level speech signals. Feature-based models display limited skill when applied to this problem.

To overcome these limitations, we propose a model that uses high-level text transcription, as well as low-level audio signals, to utilize the information contained within low-resource datasets to a greater degree. Given recent improvements in automatic speech recognition (ASR) technology [8, 3, 9, 10], speech transcription can be carried out using audio signals with considerable skill. The emotional content of speech is clearly indicated by the emotion words contained in a sentence [11], such as “lovely” and “awesome,” which carry strong emotions compared to generic (non-emotion) words, such as “person” and “day.” Thus, we hypothesize that the speech emotion recognition model will benefit from the incorporation of high-level textual input.

In this paper, we propose a novel deep dual recurrent encoder model that simultaneously utilizes audio and text data in recognizing emotions from speech. Extensive experiments are conducted to investigate the efficacy and properties of the proposed model. Our proposed model outperforms previous state-of-the-art methods, achieving accuracies of 68.8% to 71.8% when applied to the IEMOCAP dataset, which is one of the most well-studied datasets. Based on an error analysis of the models, we show that our proposed model accurately identifies emotion classes. Moreover, the neutral class misclassification bias frequently exhibited by previous models, which focus on audio features, is less pronounced in our model.

To appear in Proc. SLT2018, Dec 18-21, 2018, Athens, Greece. © IEEE 2018
2. RELATED WORK

Classical machine learning algorithms, such as hidden Markov models (HMMs), support vector machines (SVMs), and decision tree-based methods, have been employed in speech emotion recognition problems [12, 13, 14]. Recently, researchers have proposed various neural network-based architectures to improve the performance of speech emotion recognition. An initial study utilized deep neural networks (DNNs) to extract high-level features from raw audio data and demonstrated its effectiveness in speech emotion recognition [15]. With the advancement of deep learning methods, more complex neural-based architectures have been proposed. Convolutional neural network (CNN)-based models have been trained on information derived from raw audio signals using spectrograms or audio features such as Mel-frequency cepstral coefficients (MFCCs) and low-level descriptors (LLDs) [16, 17, 18]. These neural network-based models are combined to produce higher-complexity models [19, 20], and these models achieved the best-recorded performance when applied to the IEMOCAP dataset.

Another line of research has focused on adopting variant machine learning techniques combined with neural network-based models. One study used a multi-task learning approach, with gender and naturalness as auxiliary tasks, so that the neural network-based model learned more features from a given dataset [21]. Another study investigated transfer learning methods, leveraging external data from related domains [22].

As emotional dialogue is composed of sound and spoken content, researchers have also investigated the combination of acoustic features and language information, built belief network-based methods of identifying emotional key phrases, and assessed the emotional salience of verbal cues from both phoneme sequences and words [23, 24]. However, none of these studies have utilized information from speech signals and text sequences simultaneously in an end-to-end learning neural network-based model to classify emotions.

3. MODEL

This section describes the methodologies that are applied to the speech emotion recognition task. We start by introducing the recurrent encoder model for the audio and text modalities individually. We then propose a multimodal approach that encodes both audio and textual information simultaneously via a dual recurrent encoder.

Fig. 1. Multimodal dual recurrent encoder. The upper part shows the ARE, which encodes audio signals, and the lower part shows the TRE, which encodes textual information.

3.1. Audio Recurrent Encoder (ARE)

Motivated by the architecture used in [25, 26], we build an audio recurrent encoder (ARE) to predict the class of a given audio signal. Once MFCC features have been extracted from an audio signal, a subset of the sequential features is fed into the RNN (i.e., gated recurrent units (GRUs)), which leads to the formation of the network’s internal hidden state $h_t$ to model the time series patterns. This internal hidden state is updated at each time step with the input data $x_t$ and the hidden state of the previous time step $h_{t-1}$ as follows:

$$h_t = f_\theta(h_{t-1}, x_t), \tag{1}$$

where $f_\theta$ is the RNN function with weight parameter $\theta$, $h_t$ represents the hidden state at the $t$-th time step, and $x_t$ represents the $t$-th MFCC features in $x = \{x_{1:t_a}\}$. After encoding the audio signal $x$ with the RNN, the last hidden state of the RNN, $h_{t_a}$, is considered to be the representative vector that contains all of the sequential audio data. This vector is then concatenated with another prosodic feature vector, $p$, to generate a more informative vector representation of the signal, $e = \mathrm{concat}\{h_{t_a}, p\}$. The MFCC and the prosodic features are extracted from the audio signal using the openSMILE toolkit [27], with $x_t \in \mathbb{R}^{39}$ and $p \in \mathbb{R}^{35}$, respectively. Finally, the emotion class is predicted by applying the softmax function to the vector $e$. For a given audio sample $i$, we assume that $y_i$ is the true label vector, which is all zeros except for a one at the correct class, and $\hat{y}_i$ is the predicted probability distribution from the softmax layer. The training objective then takes the following form:

$$\hat{y}_i = \mathrm{softmax}(e^\top M + b), \qquad \mathcal{L} = -\log \prod_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log(\hat{y}_{i,c}), \tag{2}$$

where $e$ is the calculated representative vector of the audio signal with dimensionality $e \in \mathbb{R}^{d}$. The matrix $M \in \mathbb{R}^{d \times C}$ and the bias $b$ are learned model parameters. $C$ is the total number of classes, and $N$ is the total number of samples used in training. The upper part of Figure 1 shows the architecture of the ARE model.

3.2. Text Recurrent Encoder (TRE)

We assume that speech transcripts can be extracted from audio signals with high accuracy, given the advancement of
ASR technologies [8]. We attempt to use the processed textual information as another modality in predicting the emotion class of a given signal. To use textual information, a speech transcript is tokenized and indexed into a sequence of tokens using the Natural Language Toolkit (NLTK) [28]. Each token is then passed through a word-embedding layer that converts a word index to a corresponding 300-dimensional vector that contains additional contextual meaning between words. The sequence of embedded tokens is fed into a text recurrent encoder (TRE) in the same way that the audio MFCC features are encoded by the ARE, as given by equation 1. In this case, $x_t$ is the $t$-th embedded token from the text input. Finally, the emotion class is predicted from the last hidden state of the text-RNN using the softmax function.

We use the same training objective as the ARE model, and the predicted probability distribution for the target class is as follows:

$$\hat{y}_i = \mathrm{softmax}(h_{\mathrm{last}}^\top M + b), \tag{3}$$

where $h_{\mathrm{last}}$ is the last hidden state of the text-RNN, $h_{\mathrm{last}} \in \mathbb{R}^{d}$, and $M \in \mathbb{R}^{d \times C}$ and the bias $b$ are learned model parameters. The lower part of Figure 1 indicates the architecture of the TRE model.

3.3. Multimodal Dual Recurrent Encoder (MDRE)

We present a novel architecture called the multimodal dual recurrent encoder (MDRE) to overcome the limitations of existing approaches. In this study, we consider multiple modalities, such as MFCC features, prosodic features and transcripts, which contain sequential audio information, statistical audio information and textual information, respectively. These types of data are the same as those used in the ARE and TRE cases. The MDRE model employs two RNNs to encode data from the audio signal and textual inputs independently. The audio-RNN encodes MFCC features from the audio signal using equation 1. The last hidden state of the audio-RNN is concatenated with the prosodic features to form the final vector representation $e$, and this vector is then passed through a fully connected neural network layer to form the audio encoding vector $A$. On the other hand, the text-RNN encodes the word sequence of the transcript using equation 1. The final hidden state of the text-RNN is also passed through another fully connected neural network layer to form a textual encoding vector $T$. Finally, the emotion class is predicted by applying the softmax function to the concatenation of the vectors $A$ and $T$. We use the same training objective as the ARE model, and the predicted probability distribution for the target class is as follows:

$$A = g_\theta(e), \quad T = g'_\theta(h_{\mathrm{last}}), \qquad \hat{y}_i = \mathrm{softmax}(\mathrm{concat}(A, T)^\top M + b), \tag{4}$$

where $g_\theta$ and $g'_\theta$ are feed-forward neural networks with weight parameter $\theta$, and $A$ and $T$ are the final encoding vectors from the audio-RNN and text-RNN, respectively. $M \in \mathbb{R}^{d \times C}$ and the bias $b$ are learned model parameters.

3.4. Multimodal Dual Recurrent Encoder with Attention (MDREA)

Inspired by the concept of the attention mechanism used in neural machine translation [29], we propose a novel multimodal attention method to focus on the specific parts of a transcript that contain strong emotional information, conditioning on the audio information. Figure 2 shows the architecture of the MDREA model. First, the audio data and text data are encoded with the audio-RNN and text-RNN using equation 1. We then consider the final audio encoding vector $e$ as a context vector. As seen in equation 5, at each time step $t$, the dot product between the context vector $e$ and the hidden state of the text-RNN, $h_t$, is evaluated to calculate a similarity score $a_t$. Using this score $a_t$ as a weight parameter, the weighted sum of the sequence of hidden states of the text-RNN, $h_t$, is calculated to generate an attention-application vector $Z$. This attention-application vector is concatenated with the final encoding vector of the audio-RNN $A$ (equation 4), and the result is passed through the softmax function to predict the emotion class. We use the same training objective as the ARE model, and the predicted probability distribution for the target class is as follows:

$$a_t = \frac{\exp(e^\top h_t)}{\sum_t \exp(e^\top h_t)}, \quad Z = \sum_t a_t h_t, \qquad \hat{y}_i = \mathrm{softmax}(\mathrm{concat}(Z, A)^\top M + b), \tag{5}$$

where $M \in \mathbb{R}^{d \times C}$ and the bias $b$ are learned model parameters.

Fig. 2. Architecture of the MDREA model. The weighted sum of the sequence of the hidden states of the text-RNN $h_t$ is taken using the attention weight $a_t$; $a_t$ is calculated as the dot product of the final encoding vector of the audio-RNN $e$ and $h_t$.
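The attention step of Section 3.4 (equation 5) can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation. It assumes that the context vector e and every text hidden state h_t share the same length d (needed for the dot product) and that the output matrix has d + k rows, where k is the length of the audio encoding vector A. The function names (`softmax`, `dot`, `attention_fusion`) and all toy values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def attention_fusion(e, hidden_states, audio_vec, M, b):
    """Sketch of equation 5 (shapes are assumptions, see the text above).

    e             : audio context vector, length d
    hidden_states : text-RNN states h_1..h_T, each of length d
    audio_vec     : audio encoding vector A, length k
    M             : (d + k) x C weight matrix, given as a list of rows
    b             : bias vector, length C
    """
    # a_t = exp(e^T h_t) / sum_t exp(e^T h_t)
    weights = softmax([dot(e, h) for h in hidden_states])
    # Z = sum_t a_t * h_t
    Z = [sum(a * h[j] for a, h in zip(weights, hidden_states))
         for j in range(len(e))]
    # concat(Z, A), then a linear layer followed by softmax
    fused = Z + list(audio_vec)
    logits = [dot(fused, [row[c] for row in M]) + b[c]
              for c in range(len(b))]
    return weights, Z, softmax(logits)


# Toy inputs (hypothetical values, chosen only to exercise the code).
C = 4  # angry, happy, sad, neutral
e = [0.5, -1.0, 0.25]
hidden_states = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0],
                 [2.0, 0.0, 0.0], [0.5, 0.5, 0.5]]
audio_vec = [0.1, -0.2]
M = [[0.1 * (r + 1) * (c - 1.5) for c in range(C)]
     for r in range(len(e) + len(audio_vec))]
b = [0.0] * C

weights, Z, y_hat = attention_fusion(e, hidden_states, audio_vec, M, b)
print("attention weights:", [round(w, 3) for w in weights])
print("class probabilities:", [round(p, 3) for p in y_hat])
```

In this toy run the third hidden state has the largest dot product with e, so it receives the largest attention weight. In the model itself, the hidden states would come from the text-RNN and e from the audio-RNN.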

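All four models in Section 3 share the training objective introduced with the ARE (equation 2). The sketch below assumes that this objective is read as the standard categorical cross-entropy summed over samples, with one-hot label vectors. That reading is an assumption made for illustration, and the helper names and toy logits are hypothetical rather than taken from the paper.

```python
import math

CLASSES = ["angry", "happy", "sad", "neutral"]  # the four categories used in the paper


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [x / total for x in exps]


def cross_entropy(one_hot_labels, predicted):
    """Sum over samples i of -sum_c y[i][c] * log(y_hat[i][c]).

    Assumed reading of the training objective; only the entry for the
    true class contributes because the labels are one-hot.
    """
    loss = 0.0
    for y, y_hat in zip(one_hot_labels, predicted):
        loss -= sum(y_c * math.log(p_c) for y_c, p_c in zip(y, y_hat) if y_c)
    return loss


# Two toy samples (hypothetical logits): one fairly confident, one uniform.
logits = [[2.0, 0.5, 0.1, -1.0],
          [0.0, 0.0, 0.0, 0.0]]
labels = [[1, 0, 0, 0],   # angry
          [0, 0, 1, 0]]   # sad
probs = [softmax(row) for row in logits]
loss = cross_entropy(labels, probs)
print(f"loss = {loss:.4f}")
```

The second sample has uniform predictions over the four classes, so it contributes log(4) to the loss, which is the value an uninformed four-way classifier would incur per sample.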
4. EXPERIMENTAL SETUP AND DATASET

Model                  WAP
ACNN [31]              0.561
LLD RNN-attn [26]      0.635
RNN(prop.)-ELM [34]    0.628
3CNN-LSTM10H [20]      0.688
ARE                    0.546 ± 0.009
TRE                    0.635 ± 0.018
MDRE                   0.718 ± 0.019
MDREA                  0.690 ± 0.019
TRE-ASR                0.593 ± 0.022
MDRE-ASR               0.691 ± 0.019
MDREA-ASR              0.677 ± 0.013

Table 1. Model performance comparisons. The top 2 best-performing models (according to the unweighted average recall) are marked in bold. The “-ASR” models are trained with processed transcripts from the Google Cloud Speech API.

4.1. Dataset

We evaluate our model using the Interactive Emotional Dyadic Motion Capture (IEMOCAP) [19] dataset. This dataset was collected following theatrical theory in order to simulate natural dyadic interactions between actors. We use categorical evaluations with majority agreement. We use only four emotional categories (happy, sad, angry, and neutral) to compare the performance of our model with other research using the same categories. The IEMOCAP dataset includes five sessions, and each session contains utterances from two speakers (one male and one female). This data collection process resulted in 10 unique speakers. For consistent comparison with previous work, we merge the excitement dataset with the happiness dataset. The final dataset contains a total of 5531 utterances (1636 happy, 1084 sad, 1103 angry, 1708 neutral).
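As a quick check of the class counts reported in Section 4.1, the following sketch sums them and prints each class's share of the data. The majority-class share is included only as a rough reference point for reading the results; it is a derived figure, not a number reported in the paper.

```python
# Class counts as reported in Section 4.1 (excitement merged into happy).
counts = {"happy": 1636, "sad": 1084, "angry": 1103, "neutral": 1708}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}
majority_label = max(counts, key=counts.get)

for label, n in counts.items():
    print(f"{label:>8}: {n:5d} utterances ({shares[label]:.1%})")
print(f"   total: {total:5d} utterances")
print(f"majority class: {majority_label} ({shares[majority_label]:.1%})")
```

The counts add up to the stated 5531 utterances, and the classes are moderately imbalanced, with neutral the largest at roughly 31% of the data.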

4.2. Feature extraction

To extract speech information from audio signals, we use MFCC values, which are widely used in analyzing audio signals. The MFCC feature set contains a total of 39 features, which include 12 MFCC parameters (1-12) computed from 26 Mel-frequency bands and a log-energy parameter, together with 13 delta and 13 acceleration coefficients. The frame size is set to 25 ms with a 10 ms frame shift, using a Hamming window. Because the wave files differ in length, the number of sequential MFCC steps varies across files. To extract additional information from the data, we also use prosodic features, which show effectiveness in affective computing. The prosodic features are composed of 35 features, which include the F0 frequency, the voicing probability, and the loudness contours. All of these MFCC and prosodic features are extracted from the data using the openSMILE toolkit [27].

4.3. Implementation details

Among the variants of the RNN function, we use GRUs as they yield comparable performance to that of the LSTM and include a smaller number of weight parameters [30]. We use a max encoder step of 750 for the audio input, based on the implementation choices presented in [31], and 128 for the text input because it covers the maximum length of the transcripts. The vocabulary size of the dataset is 3,747, including the “UNK” token, which represents unknown words, and the “PAD” token, which is used to indicate padding information added while preparing mini-batch data. The number of hidden units and the number of layers in the RNN for each model (ARE, TRE, MDRE and MDREA) are selected based on extensive hyperparameter search experiments. The weights of the hidden units are initialized using orthogonal weights [32], and the text embedding layer is initialized from pretrained word-embedding vectors [33].

In preparing the textual dataset, we first use the released transcripts of the IEMOCAP dataset for simplicity. To investigate the practical performance, we then process all of the IEMOCAP audio data using an ASR system (the Google Cloud Speech API) and retrieve the transcripts. The performance of the Google ASR system is reflected by its word error rate (WER) of 5.53%.

5. EMPIRICAL RESULTS

5.1. Performance evaluation

As the dataset is not explicitly split beforehand into training, development, and testing sets, we perform 5-fold cross validation to determine the overall performance of the model. The data in each fold are split into training, development, and testing datasets (8:0.5:1.5, respectively). After training the model, we measure the weighted average precision (WAP) over the 5-fold dataset. We train and evaluate the model 10 times per fold, and the model performance is assessed in terms of the mean score and standard deviation.

We examine the WAP values, which are shown in Table 1. First, our ARE model shows the baseline performance because we use minimal audio features, such as the MFCC and prosodic features, with simple architectures. On the other hand, the TRE model shows a higher performance gain compared to the ARE. From this result, we note that textual data are informative in emotion prediction tasks, and the recurrent encoder model is effective in understanding these types of sequential data. Second, the newly proposed model, MDRE, shows a substantial performance gain. It thus achieves the state-of-the-art performance with a WAP value of 0.718. This result shows that multimodal information is a key factor in affective computing. Lastly, the attention model, MDREA, also outperforms the best existing research results (WAP of 0.690 versus 0.688) [20]. However, the MDREA model does not match the performance of the MDRE model, even though it utilizes a more complex architecture. We believe that this result arises because insufficient data are available to properly determine the complex model parameters in the MDREA model. Moreover, we presume that this model will show better performance when the audio signals are aligned with the textual sequence while applying the attention mechanism. We leave the implementation of this point as a future research direction.

To investigate the practical performance of the proposed models, we conduct further experiments with the ASR-processed transcript data (see “-ASR” models in Table 1). The processed transcripts have a WER of 5.53%. The TRE-ASR, MDRE-ASR and MDREA-ASR models show degraded performance compared to that of the TRE, MDRE and MDREA models. However, the performance of these models is still competitive; in particular, the MDRE-ASR model outperforms the previous best-performing model, 3CNN-LSTM10H (WAP of 0.691 versus 0.688).

5.2. Error analysis

We analyze the predictions of the ARE, TRE, and MDRE models. Figure 3 shows the confusion matrix of each model. The ARE model (Fig. 3(a)) most often misclassifies instances of happy as neutral (43.51%); thus, it shows reduced accuracy (35.15%) in predicting the happy class. Overall, most of the emotion classes are frequently confused with the neutral class. This observation is in line with the findings of [31], who noted that the neutral class is located in the center of the activation-valence space, complicating its discrimination from the other classes.

Interestingly, the TRE model (Fig. 3(b)) shows a large gain in predicting the happy class when compared to the ARE model (from 35.15% to 75.73%). This result seems plausible because the model can benefit from the differences among the distributions of words in happy and neutral expressions, which give more emotional information to the model than the audio signal data do. On the other hand, it is striking that the TRE model incorrectly predicts instances of the sad class as the happy class 16.20% of the time, even though these emotional states are opposites of one another.

The MDRE model (Fig. 3(c)) compensates for the weaknesses of the previous two models (ARE and TRE) and benefits from their strengths to a surprising degree. The values arranged along the diagonal axis show that the accuracy for every correctly predicted class has increased. Furthermore, the occurrence of the incorrect “sad-to-happy” cases in the TRE model is reduced from 16.20% to 9.15%.

Fig. 3. Confusion matrix of each model: (a) ARE, (b) TRE, (c) MDRE.

6. CONCLUSIONS

In this paper, we propose a novel multimodal dual recurrent encoder model that simultaneously utilizes text data, as well as audio signals, to obtain a better understanding of speech data. Our model encodes the information from audio and text sequences using dual RNNs and then combines the information from these sources using a feed-forward neural model to predict the emotion class. Extensive experiments show that our proposed model outperforms other state-of-the-art methods in classifying the four emotion categories, and accuracies ranging from 68.8% to 71.8% are obtained when the model is applied to the IEMOCAP dataset. In particular, it alleviates the issue in which predictions frequently incorrectly yield the neutral class, as occurs in previous models that focus on audio features.

In future work, we aim to extend the modalities to audio, text and video inputs. Furthermore, we plan to investigate the application of the attention mechanism to data derived from multiple modalities. This approach seems likely to uncover enhanced learning schemes that will increase performance in both speech emotion recognition and other multimodal classification tasks.

Acknowledgments

K. Jung is with the Department of Electrical and Computer Engineering, ASRI, Seoul National University, Seoul, Korea. This work was supported by the Ministry of Trade, Industry & Energy (MOTIE, Korea) under Industrial Technology Innovation Program (No.10073144).

7. REFERENCES

[1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.

[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.

[3] Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, et al., “Deep speech 2: End-to-end speech recognition in english and mandarin,” in International Conference on Machine Learning, 2016, pp. 173–182.

[4] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in Proceedings of the 23rd international conference on Machine learning. ACM, 2006, pp. 369–376.

[5] Hao Zhou, Minlie Huang, Tianyang Zhang, Xiaoyan Zhu, and Bing Liu, “Emotional chatting machine: Emotional conversation generation with internal and external memory,” 2018.

[6] Chenyang Huang, Osmar Zaiane, Amine Trabelsi, and Nouha Dziri, “Automatic dialogue generation with expressed emotions,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2018, vol. 2, pp. 49–54.

[7] Carlos Busso, Murtaza Bulut, and Shrikanth Narayanan, “Toward effective automatic recognition systems of emotion in speech,” Social Emotions in Nature and Artifact, p. 110, 2014.

[8] Dong Yu and Li Deng, Automatic Speech Recognition, Springer, 2016.

[9] Google, “Cloud speech-to-text,” [Link] 2018.

[10] Microsoft, “Microsoft speech api,” [Link] services/speech/home, 2018.

[11] Linhong Xu, Hongfei Lin, Yu Pan, Hui Ren, and Jianmei Chen, “Constructing the affective lexicon ontology,” Journal of the China Society for Scientific and Technical Information, vol. 27, no. 2, pp. 180–185, 2008.

[12] Thapanee Seehapoch and Sartra Wongthanavasu, “Speech emotion recognition using support vector machines,” in Knowledge and Smart Technology (KST), 2013 5th International Conference on. IEEE, 2013, pp. 86–91.

[13] Björn Schuller, Gerhard Rigoll, and Manfred Lang, “Hidden markov model-based speech emotion recognition,” in Multimedia and Expo, 2003. ICME’03. Proceedings. 2003 International Conference on. IEEE, 2003, vol. 1, pp. I–401.

[14] Chi-Chun Lee, Emily Mower, Carlos Busso, Sungbok Lee, and Shrikanth Narayanan, “Emotion recognition using a hierarchical binary decision tree approach,” Speech Communication, vol. 53, no. 9-10, pp. 1162–1171, 2011.

[15] Kun Han, Dong Yu, and Ivan Tashev, “Speech emotion recognition using deep neural network and extreme learning machine,” in Fifteenth Annual Conference of the International Speech Communication Association, 2014.

[16] Dario Bertero and Pascale Fung, “A first look into a convolutional neural network for speech emotion detection,” in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017, pp. 5115–5119.

[17] Abdul Malik Badshah, Jamil Ahmad, Nasir Rahim, and Sung Wook Baik, “Speech emotion recognition from spectrograms with deep convolutional neural network,” in Platform Technology and Service (PlatCon), 2017 International Conference on. IEEE, 2017, pp. 1–5.

[18] Zakaria Aldeneh and Emily Mower Provost, “Using regional saliency for speech emotion recognition,” in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017, pp. 2741–2745.
[19] Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N Chang, Sungbok Lee, and Shrikanth S Narayanan, “Iemocap: Interactive emotional dyadic motion capture database,” Language resources and evaluation, vol. 42, no. 4, pp. 335, 2008.

[20] Aharon Satt, Shai Rozenberg, and Ron Hoory, “Efficient emotion recognition from speech using deep learning on spectrograms,” Proc. Interspeech 2017, pp. 1089–1093, 2017.

[21] Jaebok Kim, Gwenn Englebienne, Khiet P Truong, and Vanessa Evers, “Towards speech emotion recognition “in the wild” using aggregated corpora and deep multi-task learning,” in 18th Annual Conference of the International Speech Communication Association, INTERSPEECH 2017: Situated interaction. International Speech Communication Association (ISCA), 2017.

[22] John Gideon, Soheil Khorram, Zakaria Aldeneh, Dimitrios Dimitriadis, and Emily Mower Provost, “Progressive neural networks for transfer learning in emotion recognition,” Proc. Interspeech 2017, pp. 1098–1102, 2017.

[23] Björn Schuller, Gerhard Rigoll, and Manfred Lang, “Speech emotion recognition combining acoustic features and linguistic information in a hybrid support vector machine-belief network architecture,” in Acoustics, Speech, and Signal Processing, 2004. Proceedings.(ICASSP’04). IEEE International Conference on. IEEE, 2004, vol. 1, pp. I–577.

[24] Kalani Wataraka Gamage, Vidhyasaharan Sethu, and Eliathamby Ambikairajah, “Salience based lexical features for emotion recognition,” in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017, pp. 5830–5834.

[25] Yun Wang, Leonardo Neves, and Florian Metze, “Audio-based multimedia event detection using deep recurrent neural networks,” in Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, 2016, pp. 2742–2746.

[26] Seyedmahdad Mirsamadi, Emad Barsoum, and Cha Zhang, “Automatic speech emotion recognition using recurrent neural networks with local attention,” in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017, pp. 2227–2231.

[27] Florian Eyben, Felix Weninger, Florian Gross, and Björn Schuller, “Recent developments in opensmile, the munich open-source multimedia feature extractor,” in Proceedings of the 21st ACM international conference on Multimedia. ACM, 2013, pp. 835–838.

[28] Steven Bird and Edward Loper, “Nltk: the natural language toolkit,” in Proceedings of the ACL 2004 on Interactive poster and demonstration sessions. Association for Computational Linguistics, 2004, p. 31.

[29] Thang Luong, Hieu Pham, and Christopher D Manning, “Effective approaches to attention-based neural machine translation,” in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015, pp. 1412–1421.

[30] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” arXiv preprint arXiv:1412.3555, 2014.

[31] Michael Neumann and Ngoc Thang Vu, “Attentive convolutional neural network based speech emotion recognition: A study on the impact of input features, signal length, and acted speech,” Proc. Interspeech 2017, pp. 1263–1267, 2017.

[32] Andrew M Saxe, James L McClelland, and Surya Ganguli, “Exact solutions to the nonlinear dynamics of learning in deep linear neural networks,” arXiv preprint arXiv:1312.6120, 2013.

[33] Jeffrey Pennington, Richard Socher, and Christopher Manning, “Glove: Global vectors for word representation,” in Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), 2014, pp. 1532–1543.

[34] Jinkyu Lee and Ivan Tashev, “High-level feature representation using recurrent neural network for speech emotion recognition,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015.

Common questions

What methodological innovations were introduced in these models to enhance the predictive accuracy of emotion recognition in spoken content?

The models introduce dual recurrent encoders that process audio and text simultaneously, and they combine prosodic features with sequential audio features into a single representation. They use GRUs to model time-series data and an attention mechanism that weights textual inputs by emotional salience, conditioned on the audio context. Together, these choices aim at a more accurate depiction of emotional states in spoken content than audio-only models provide.

What key challenges do researchers face when combining acoustic features and linguistic information for emotion recognition?

One challenge is aligning features from modalities that may have different sampling rates and time dependencies. Another is weighting and fusing both types of data within a single coherent model, since each modality may carry distinct yet crucial emotional cues. Noise and variability in speech and text, such as different accents or expressions, further complicate building robust models that capture emotional states accurately across diverse datasets.

What role does the IEMOCAP dataset play in the evaluation of these emotion recognition models?

The IEMOCAP dataset provides a standard and diverse set of emotional expressions captured through dyadic interactions. It includes audio and visual recordings alongside transcriptions, which are used to test how well the models recognize emotions such as happy, sad, angry, and neutral. Because it is widely used, researchers can benchmark their models against previously published results.

How does the softmax function support the classification tasks in the described models?

The softmax function transforms the outputs of the final layer into a probability distribution over all emotion classes. Each class receives a normalized score, and the scores sum to one, so the most likely emotion can be selected as the classification output.
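The normalization described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the logit values and the order of the four emotion labels are made up for the example.

```python
import math

def softmax(logits):
    """Map real-valued scores to probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]

# Hypothetical label order and logits, chosen only for illustration.
EMOTIONS = ["angry", "happy", "sad", "neutral"]
scores = [2.0, 0.5, -1.0, 0.1]

probs = softmax(scores)
predicted = EMOTIONS[probs.index(max(probs))]
print(predicted, [round(p, 3) for p in probs])
```

Because softmax preserves the ordering of the scores, the predicted class is always the one with the largest logit; the normalization matters when the probabilities themselves are needed, for example in a cross-entropy loss.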

What are the advantages of using a dual recurrent encoder for speech emotion recognition?

A dual recurrent encoder encodes audio and textual information simultaneously, so the classifier can draw on both modalities. Earlier methods had not used speech signals and text sequences together in an end-to-end neural network model. By combining the two encodings, the dual recurrent encoder forms a more informative vector representation, which can lead to more accurate emotion classification.
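One simple way to combine two encodings is to concatenate them and apply a linear layer. The sketch below assumes that form of fusion purely for illustration; the vectors, the weight matrix, and the helper names `linear` and `fuse_and_score` are hypothetical and do not come from the paper.

```python
def linear(vector, weights, bias):
    """Return weights @ vector + bias for a row-major weight matrix."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]

def fuse_and_score(audio_vec, text_vec, weights, bias):
    """Concatenate the two encoder outputs and project them to class logits."""
    joint = audio_vec + text_vec  # list concatenation, length 2 + 3 = 5
    return linear(joint, weights, bias)

# Made-up encoder outputs: a 2-dim audio vector and a 3-dim text vector.
audio_vec = [0.2, -0.1]
text_vec = [0.7, 0.3, -0.5]

# Made-up 4 x 5 weight matrix, one row per emotion class.
weights = [
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 1.0, 0.0],
]
bias = [0.0, 0.0, 0.0, 0.0]

logits = fuse_and_score(audio_vec, text_vec, weights, bias)
best_class = logits.index(max(logits))
print(best_class, logits)
```

In a trained model the weights and bias would be learned jointly with both encoders, which is what makes the setup end-to-end.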

Why might speech and text features be complementary in emotion recognition tasks?

Speech features provide prosodic and acoustic cues such as pitch, loudness, and rhythm, which convey emotion non-verbally. Text features offer semantic and syntactic information that captures the emotional content of the spoken words. Together, the two modalities cover both what is said and how it is said, which can improve recognition when both carry emotional significance.

How does the attention mechanism used in the MDREA model draw inspiration from neural machine translation methods?

The attention mechanism in MDREA adopts the idea, from neural machine translation (NMT), of focusing on the parts of the input sequence that are most relevant to the current context. In NMT, attention helps align and translate sequences by weighing the importance of each input token relative to the output prediction. Similarly, MDREA evaluates the contribution of each part of the textual input before combining it with the audio data, so the model attends selectively to emotionally salient textual segments conditioned on the audio context.
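The weighting step can be sketched as dot-product attention. The code below assumes that the score of each text hidden state is its dot product with the audio vector, that the scores are normalized with softmax, and that the result is a weighted sum of the hidden states; the function name `attend` and all numbers are hypothetical.

```python
import math

def attend(audio_vec, text_states):
    """Weight each text hidden state by its similarity to the audio vector."""
    # One score per text position: dot product with the audio context.
    scores = [sum(a * h for a, h in zip(audio_vec, state))
              for state in text_states]
    # Softmax over positions (max-shifted for numerical stability).
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    weights = [v / total for v in exps]
    # Weighted sum of the hidden states gives the attended text vector.
    dim = len(text_states[0])
    context = [sum(w * state[i] for w, state in zip(weights, text_states))
               for i in range(dim)]
    return weights, context

# Made-up audio vector and three text hidden states.
audio_vec = [1.0, 0.0]
text_states = [[0.0, 1.0], [2.0, 0.0], [0.0, -1.0]]

weights, context = attend(audio_vec, text_states)
print([round(w, 3) for w in weights], [round(c, 3) for c in context])
```

Here the second hidden state aligns best with the audio vector, so it receives the largest weight and dominates the attended vector.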

In the context of speech emotion recognition, why is integrating external data from related domains through transfer learning beneficial?

Transfer learning lets a model draw on additional datasets that may contain relevant emotional features absent from the original dataset, which can improve robustness and generalization across different types of emotional expression. The model reuses pre-trained weights and knowledge from similar tasks, such as sentiment analysis or other emotional speech corpora. This can improve performance, especially when the target dataset is small or lacks diversity.

How does the multimodal dual recurrent encoder with attention (MDREA) improve emotion prediction compared to its predecessors?

MDREA adds an attention mechanism that focuses on the parts of a transcript containing strong emotional information, conditioned on the audio context vector. The model weights the hidden states of the text encoder by emotional salience and uses their weighted sum for prediction, which lets it capture emotional nuances in the input more effectively.

In what way does the use of gated recurrent units (GRUs) contribute to the design of the audio recurrent encoder (ARE)?

GRUs model the time-series patterns of audio signals efficiently. Their gating mechanisms control the flow of information, which helps capture long-term dependencies with fewer parameters than an LSTM. In the ARE, the GRU updates its internal hidden state sequentially as it reads the audio features, and the resulting state serves as a vector representation of the sequential patterns used for emotion classification.
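The sequential update can be illustrated with a single GRU step written out in plain Python. This sketch uses one common form of the GRU equations, in which the new state is `(1 - z) * h + z * candidate`; some references swap the roles of `z` and `1 - z`. The parameter names, sizes, and values are hypothetical.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def gru_step(x, h, p):
    """One GRU update: gates decide how much of the old state to keep."""
    z = [sigmoid(a + b + c) for a, b, c in
         zip(matvec(p["Wz"], x), matvec(p["Uz"], h), p["bz"])]  # update gate
    r = [sigmoid(a + b + c) for a, b, c in
         zip(matvec(p["Wr"], x), matvec(p["Ur"], h), p["br"])]  # reset gate
    rh = [ri * hi for ri, hi in zip(r, h)]
    cand = [math.tanh(a + b + c) for a, b, c in
            zip(matvec(p["Wh"], x), matvec(p["Uh"], rh), p["bh"])]
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]

def encode(frames, params, hidden_size):
    """Run the GRU over all frames and return the final hidden state."""
    h = [0.0] * hidden_size
    for x in frames:
        h = gru_step(x, h, params)
    return h

def const(rows, cols, value):
    return [[value] * cols for _ in range(rows)]

hidden_size, input_size = 2, 3
params = {  # made-up constant weights, only to make the sketch runnable
    "Wz": const(hidden_size, input_size, 0.1), "Uz": const(hidden_size, hidden_size, 0.1), "bz": [0.0] * hidden_size,
    "Wr": const(hidden_size, input_size, 0.1), "Ur": const(hidden_size, hidden_size, 0.1), "br": [0.0] * hidden_size,
    "Wh": const(hidden_size, input_size, 0.1), "Uh": const(hidden_size, hidden_size, 0.1), "bh": [0.0] * hidden_size,
}
frames = [[0.5, -0.2, 0.1], [0.3, 0.4, -0.6]]  # two made-up feature frames
audio_vec = encode(frames, params, hidden_size)
print(audio_vec)
```

In practice a framework layer would replace this hand-written loop; the point of the sketch is only that the final hidden state summarizes the whole frame sequence.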


