MULTIMODAL SPEECH EMOTION RECOGNITION USING AUDIO AND TEXT
Seunghyun Yoon, Seokhyun Byun, and Kyomin Jung
Dept. of Electrical and Computer Engineering, Seoul National University, Seoul, Korea
{mysmilesh, byuns9334, kjung}@[Link]
arXiv:1810.04635v1 [[Link]] 10 Oct 2018

ABSTRACT

Speech emotion recognition is a challenging task, and extensive reliance has been placed on models that use audio features in building well-performing classifiers. In this paper, we propose a novel deep dual recurrent encoder model that utilizes text data and audio signals simultaneously to obtain a better understanding of speech data. As emotional dialogue is composed of sound and spoken content, our model encodes the information from audio and text sequences using dual recurrent neural networks (RNNs) and then combines the information from these sources to predict the emotion class. This architecture analyzes speech data from the signal level to the language level, and it thus utilizes the information within the data more comprehensively than models that focus on audio features. Extensive experiments are conducted to investigate the efficacy and properties of the proposed model. Our proposed model outperforms previous state-of-the-art methods in assigning data to one of four emotion categories (i.e., angry, happy, sad and neutral) when the model is applied to the IEMOCAP dataset, as reflected by accuracies ranging from 68.8% to 71.8%.

Index Terms: speech emotion recognition, computational paralinguistics, deep learning, natural language processing

1. INTRODUCTION

Recently, deep learning algorithms have successfully addressed problems in various fields, such as image classification, machine translation, speech recognition, text-to-speech generation and other machine learning related areas [1, 2, 3]. Similarly, substantial improvements in performance have been obtained when deep learning algorithms have been applied to statistical speech processing [4]. These fundamental improvements have led researchers to investigate additional topics related to human nature, which have long been objects of study. One such topic involves understanding human emotions and reflecting them through machine intelligence, such as emotional dialogue models [5, 6].

In developing emotionally aware intelligence, the very first step is building robust emotion classifiers that display good performance regardless of the application; this outcome is considered to be one of the fundamental research goals in affective computing [7]. In particular, the speech emotion recognition task is one of the most important problems in the field of paralinguistics. This field has recently broadened its applications, as it is a crucial factor in optimal human-computer interactions, including dialog systems. The goal of speech emotion recognition is to predict the emotional content of speech and to classify speech according to one of several labels (e.g., happy, sad, neutral, and angry). Various types of deep learning methods have been applied to increase the performance of emotion classifiers; however, this task is still considered to be challenging for several reasons. First, insufficient data for training complex neural network-based models are available, due to the costs associated with human involvement. Second, the characteristics of emotions must be learned from low-level speech signals. Feature-based models show limited effectiveness when applied to this problem.

To overcome these limitations, we propose a model that uses high-level text transcription, as well as low-level audio signals, to utilize the information contained within low-resource datasets to a greater degree. Given recent improvements in automatic speech recognition (ASR) technology [8, 3, 9, 10], speech can be transcribed from audio signals with considerable accuracy. The emotional content of speech is clearly indicated by the emotion words contained in a sentence [11], such as “lovely” and “awesome,” which carry strong emotions compared to generic (non-emotion) words, such as “person” and “day.” Thus, we hypothesize that the speech emotion recognition model will benefit from the incorporation of high-level textual input.

In this paper, we propose a novel deep dual recurrent encoder model that simultaneously utilizes audio and text data in recognizing emotions from speech. Extensive experiments are conducted to investigate the efficacy and properties of the proposed model. Our proposed model outperforms previous state-of-the-art methods, with accuracies ranging from 68.8% to 71.8%, when applied to the IEMOCAP dataset, which is one of the most well-studied datasets. Based on an error analysis of the models, we show that our proposed model accurately identifies emotion classes. Moreover, the neutral class misclassification bias frequently exhibited by previous models, which focus on audio features, is less pronounced in our model.
To appear in Proc. SLT2018, Dec 18-21, 2018, Athens, Greece © IEEE 2018
2. RELATED WORK

Classical machine learning algorithms, such as hidden Markov models (HMMs), support vector machines (SVMs), and decision tree-based methods, have been employed in speech emotion recognition problems [12, 13, 14]. Recently, researchers have proposed various neural network-based architectures to improve the performance of speech emotion recognition. An initial study utilized deep neural networks (DNNs) to extract high-level features from raw audio data and demonstrated their effectiveness in speech emotion recognition [15]. With the advancement of deep learning methods, more complex neural-based architectures have been proposed. Convolutional neural network (CNN)-based models have been trained on information derived from raw audio signals using spectrograms or audio features such as Mel-frequency cepstral coefficients (MFCCs) and low-level descriptors (LLDs) [16, 17, 18]. These neural network-based models are combined to produce higher-complexity models [19, 20], and these models achieved the best-recorded performance when applied to the IEMOCAP dataset.

Another line of research has focused on adopting variant machine learning techniques combined with neural network-based models. One study used a multi-task learning approach, with gender and naturalness as auxiliary tasks, so that the neural network-based model learned more features from a given dataset [21]. Another study investigated transfer learning methods, leveraging external data from related domains [22].

As emotional dialogue is composed of sound and spoken content, researchers have also investigated the combination of acoustic features and language information, built belief network-based methods of identifying emotional key phrases, and assessed the emotional salience of verbal cues from both phoneme sequences and words [23, 24]. However, none of these studies have utilized information from speech signals and text sequences simultaneously in an end-to-end learning neural network-based model to classify emotions.

3. MODEL

This section describes the methodologies that are applied to the speech emotion recognition task. We start by introducing the recurrent encoder model for the audio and text modalities individually. We then propose a multimodal approach that encodes both audio and textual information simultaneously via a dual recurrent encoder.

3.1. Audio Recurrent Encoder (ARE)

Motivated by the architecture used in [25, 26], we build an audio recurrent encoder (ARE) to predict the class of a given audio signal. Once MFCC features have been extracted from an audio signal, a subset of the sequential features is fed into the RNN (i.e., gated recurrent units (GRUs)), which leads to the formation of the network's internal hidden state $h_t$ to model the time series patterns. This internal hidden state is updated at each time step with the input data $x_t$ and the hidden state of the previous time step $h_{t-1}$ as follows:

\[ h_t = f_\theta(h_{t-1}, x_t), \tag{1} \]

where $f_\theta$ is the RNN function with weight parameter $\theta$, $h_t$ represents the hidden state at the $t$-th time step, and $x_t$ represents the $t$-th MFCC features in $x = \{x_{1:t_a}\}$. After encoding the audio signal $x$ with the RNN, the last hidden state of the RNN, $h_{t_a}$, is considered to be the representative vector that contains all of the sequential audio data. This vector is then concatenated with another prosodic feature vector, $p$, to generate a more informative vector representation of the signal, $e = \mathrm{concat}\{h_{t_a}, p\}$. The MFCC and the prosodic features are extracted from the audio signal using the openSMILE toolkit [27], with $x_t \in \mathbb{R}^{39}$ and $p \in \mathbb{R}^{35}$, respectively. Finally, the emotion class is predicted by applying the softmax function to the vector $e$. For a given audio sample $i$, we assume that $y_i$ is the true label vector, which contains all zeros except for a one at the correct class, and $\hat{y}_i$ is the predicted probability distribution from the softmax layer. The training objective then takes the following form:

\[ \hat{y}_i = \mathrm{softmax}(e^{\top} M + b), \]
\[ \mathcal{L} = -\log \prod_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log(\hat{y}_{i,c}), \tag{2} \]

where $e$ is the calculated representative vector of the audio signal with dimensionality $e \in \mathbb{R}^{d}$. The matrix $M \in \mathbb{R}^{d \times C}$ and the bias $b$ are learned model parameters. $C$ is the total number of classes, and $N$ is the total number of samples used in training. The upper part of Figure 1 shows the architecture of the ARE model.

Fig. 1. Multimodal dual recurrent encoder. The upper part shows the ARE, which encodes audio signals, and the lower part shows the TRE, which encodes textual information.

3.2. Text Recurrent Encoder (TRE)

We assume that speech transcripts can be extracted from audio signals with high accuracy, given the advancement of
ASR technologies [8]. We attempt to use the processed textual information as another modality in predicting the emotion class of a given signal. To use textual information, a speech transcript is tokenized and indexed into a sequence of tokens using the Natural Language Toolkit (NLTK) [28]. Each token is then passed through a word-embedding layer that converts a word index to a corresponding 300-dimensional vector that contains additional contextual meaning between words. The sequence of embedded tokens is fed into a text recurrent encoder (TRE) in the same way that the audio MFCC features are encoded by the ARE, as represented by equation 1. In this case, $x_t$ is the $t$-th embedded token from the text input. Finally, the emotion class is predicted from the last hidden state of the text-RNN using the softmax function.

We use the same training objective as the ARE model, and the predicted probability distribution for the target class is as follows:

\[ \hat{y}_i = \mathrm{softmax}(h_{last}^{\top} M + b), \tag{3} \]

where $h_{last}$ is the last hidden state of the text-RNN, $h_{last} \in \mathbb{R}^{d}$, and $M \in \mathbb{R}^{d \times C}$ and the bias $b$ are learned model parameters. The lower part of Figure 1 indicates the architecture of the TRE model.

3.3. Multimodal Dual Recurrent Encoder (MDRE)

We present a novel architecture called the multimodal dual recurrent encoder (MDRE) to overcome the limitations of existing approaches. In this study, we consider multiple modalities, namely MFCC features, prosodic features and transcripts, which contain sequential audio information, statistical audio information and textual information, respectively. These types of data are the same as those used in the ARE and TRE cases. The MDRE model employs two RNNs to encode data from the audio signal and textual inputs independently. The audio-RNN encodes MFCC features from the audio signal using equation 1. The last hidden state of the audio-RNN is concatenated with the prosodic features to form the final vector representation $e$, and this vector is then passed through a fully connected neural network layer to form the audio encoding vector $A$. In parallel, the text-RNN encodes the word sequence of the transcript using equation 1. The final hidden state of the text-RNN is also passed through another fully connected neural network layer to form a textual encoding vector $T$. Finally, the emotion class is predicted by applying the softmax function to the concatenation of the vectors $A$ and $T$. We use the same training objective as the ARE model, and the predicted probability distribution for the target class is as follows:

\[ A = g_\theta(e), \quad T = g'_\theta(h_{last}), \]
\[ \hat{y}_i = \mathrm{softmax}(\mathrm{concat}(A, T)^{\top} M + b), \tag{4} \]

where $g_\theta$ and $g'_\theta$ are feed-forward neural networks with weight parameter $\theta$, and $A$ and $T$ are the final encoding vectors from the audio-RNN and text-RNN, respectively. $M \in \mathbb{R}^{d \times C}$ and the bias $b$ are learned model parameters.

Fig. 2. Architecture of the MDREA model. The weighted sum of the sequence of the hidden states of the text-RNN $h_t$ is taken using the attention weight $a_t$; $a_t$ is calculated as the dot product of the final encoding vector of the audio-RNN $e$ and $h_t$.

3.4. Multimodal Dual Recurrent Encoder with Attention (MDREA)

Inspired by the concept of the attention mechanism used in neural machine translation [29], we propose a novel multimodal attention method to focus on the specific parts of a transcript that contain strong emotional information, conditioning on the audio information. Figure 2 shows the architecture of the MDREA model. First, the audio data and text data are encoded with the audio-RNN and text-RNN using equation 1. We then consider the final audio encoding vector $e$ as a context vector. As seen in equation 5, at each time step $t$, the dot product between the context vector $e$ and the hidden state of the text-RNN $h_t$ is evaluated to calculate a similarity score $a_t$. Using this score $a_t$ as a weight parameter, the weighted sum of the sequence of hidden states of the text-RNN, $h_t$, is calculated to generate an attention-application vector $Z$. This attention-application vector is concatenated with the final encoding vector of the audio-RNN $A$ (equation 4), and the result is passed through the softmax function to predict the emotion class. We use the same training objective as the ARE model, and the predicted probability distribution for the target class is as follows:

\[ a_t = \frac{\exp(e^{\top} h_t)}{\sum_{t} \exp(e^{\top} h_t)}, \quad Z = \sum_{t} a_t h_t, \]
\[ \hat{y}_i = \mathrm{softmax}(\mathrm{concat}(Z, A)^{\top} M + b), \tag{5} \]

where $M \in \mathbb{R}^{d \times C}$ and the bias $b$ are learned model parameters.
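The two fusion steps in equations 4 and 5 can be illustrated with a minimal pure-Python sketch. It is not the implementation used in the experiments: the helper names (`softmax`, `dot`, `attend`, `classify`, `random_vector`) are hypothetical, the encoder outputs are replaced by random stand-in vectors, the projection M is untrained, and the context vector e is assumed to have the same dimensionality as the text-RNN hidden states so that the dot product in equation 5 is defined.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [v / total for v in exps]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def attend(context, hidden_states):
    """Equation 5, first line: a_t = softmax_t(e^T h_t), Z = sum_t a_t h_t."""
    weights = softmax([dot(context, h) for h in hidden_states])
    dim = len(hidden_states[0])
    pooled = [sum(w * h[k] for w, h in zip(weights, hidden_states))
              for k in range(dim)]
    return weights, pooled


def classify(features, M, b):
    """softmax(features^T M + b); M is stored as len(features) rows of C columns."""
    logits = [dot(features, [row[c] for row in M]) + b[c] for c in range(len(b))]
    return softmax(logits)


def random_vector(rng, size, scale=1.0):
    """Random stand-in for a learned vector (assumption, for illustration only)."""
    return [rng.gauss(0.0, scale) for _ in range(size)]


rng = random.Random(0)
d, steps, C = 4, 6, 4                                   # toy sizes; C = 4 emotion classes

e = random_vector(rng, d)                               # stand-in audio context vector e
H = [random_vector(rng, d) for _ in range(steps)]       # stand-in text-RNN states h_t
A = random_vector(rng, d)                               # stand-in for A = g_theta(e)
T = random_vector(rng, d)                               # stand-in for T = g'_theta(h_last)
M = [random_vector(rng, C, 0.1) for _ in range(2 * d)]  # untrained projection matrix
b = [0.0] * C

y_mdre = classify(A + T, M, b)                          # equation 4: concat(A, T)
a, Z = attend(e, H)
y_mdrea = classify(Z + A, M, b)                         # equation 5: concat(Z, A)

print("attention weights:", [round(w, 3) for w in a])
print("MDRE probabilities:", [round(p, 3) for p in y_mdre])
print("MDREA probabilities:", [round(p, 3) for p in y_mdrea])
```

In this sketch the only difference between the two variants is the text-side input to the classifier: MDRE uses the encoding T of the last text hidden state, whereas MDREA replaces it with the audio-conditioned weighted sum Z over all text hidden states.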
4. EXPERIMENTAL SETUP AND DATASET

4.1. Dataset

We evaluate our model using the Interactive Emotional Dyadic Motion Capture (IEMOCAP) [19] dataset. This dataset was collected following theatrical theory in order to simulate natural dyadic interactions between actors. We use categorical evaluations with majority agreement. We use only four emotional categories (happy, sad, angry, and neutral) to compare the performance of our model with other research using the same categories. The IEMOCAP dataset includes five sessions, and each session contains utterances from two speakers (one male and one female). This data collection process resulted in 10 unique speakers. For consistent comparison with previous work, we merge the excitement dataset with the happiness dataset. The final dataset contains a total of 5531 utterances (1636 happy, 1084 sad, 1103 angry, 1708 neutral).

4.2. Feature extraction

To extract speech information from audio signals, we use MFCC values, which are widely used in analyzing audio signals. The MFCC feature set contains a total of 39 features, which include 12 MFCC parameters (1-12) from the 26 Mel-frequency bands and a log-energy parameter, 13 delta coefficients and 13 acceleration coefficients. The frame size is set to 25 ms with a frame shift of 10 ms, using a Hamming window. The number of sequential steps of the MFCC features therefore varies with the length of each wave file. To extract additional information from the data, we also use prosodic features, which show effectiveness in affective computing. The prosodic features are composed of 35 features, which include the F0 frequency, the voicing probability, and the loudness contours. All of these MFCC and prosodic features are extracted from the data using the openSMILE toolkit [27].

4.3. Implementation details

Among the variants of the RNN function, we use GRUs as they yield comparable performance to that of the LSTM and include a smaller number of weight parameters [30]. We use a max encoder step of 750 for the audio input, based on the implementation choices presented in [31], and 128 for the text input because it covers the maximum length of the transcripts. The vocabulary size of the dataset is 3,747, including the “UNK” token, which represents unknown words, and the “PAD” token, which is used to indicate padding information added while preparing mini-batch data. The number of hidden units and the number of layers in the RNN for each model (ARE, TRE, MDRE and MDREA) are selected based on extensive hyperparameter search experiments. The weights of the hidden units are initialized using orthogonal weights [32], and the text embedding layer is initialized from pretrained word-embedding vectors [33].

In preparing the textual dataset, we first use the released transcripts of the IEMOCAP dataset for simplicity. To investigate the practical performance, we then process all of the IEMOCAP audio data using an ASR system (the Google Cloud Speech API) and retrieve the transcripts. The performance of the Google ASR system is reflected by its word error rate (WER) of 5.53%.

Model                   WAP
ACNN [31]               0.561
LLD RNN-attn [26]       0.635
RNN(prop.)-ELM [34]     0.628
3CNN-LSTM10H [20]       0.688
ARE                     0.546 ±0.009
TRE                     0.635 ±0.018
MDRE                    0.718 ±0.019
MDREA                   0.690 ±0.019
TRE-ASR                 0.593 ±0.022
MDRE-ASR                0.691 ±0.019
MDREA-ASR               0.677 ±0.013

Table 1. Model performance comparisons. The top 2 best-performing models (according to the unweighted average recall) are marked in bold. The “-ASR” models are trained with processed transcripts from the Google Cloud Speech API.

5. EMPIRICAL RESULTS

5.1. Performance evaluation

As the dataset is not explicitly split beforehand into training, development, and testing sets, we perform 5-fold cross validation to determine the overall performance of the model. The data in each fold are split into training, development, and testing datasets (8:0.5:1.5, respectively). After training the model, we measure the weighted average precision (WAP) over the 5-fold dataset. We train and evaluate the model 10 times per fold, and the model performance is assessed in terms of the mean score and standard deviation.

We examine the WAP values, which are shown in Table 1. First, our ARE model shows the baseline performance because we use minimal audio features, such as the MFCC and prosodic features, with simple architectures. On the other hand, the TRE model shows a higher performance than the ARE. From this result, we note that textual data are informative in emotion prediction tasks, and the recurrent encoder model is effective in understanding these types of sequential data. Second, the newly proposed model, MDRE, shows a substantial performance gain. It thus achieves state-of-the-art performance with a WAP value of 0.718. This result shows that multimodal information is a key factor in affective computing. Lastly, the attention model, MDREA, also outperforms the best existing research results (WAP 0.690 vs. 0.688) [20]. However, the MDREA model does not match the performance of the MDRE model, even though it utilizes a more complex architecture. We believe that this result arises because insufficient data are available to properly determine the complex model parameters in the MDREA model. Moreover, we presume that this model will show better performance when the audio signals are aligned with the textual sequence while applying the attention mechanism. We leave the implementation of this point as a future research direction.

To investigate the practical performance of the proposed models, we conduct further experiments with the ASR-processed transcript data (see “-ASR” models in Table 1). The WER of the processed transcripts is 5.53%. The TRE-ASR, MDRE-ASR and MDREA-ASR models show degraded performance compared to that of the TRE, MDRE and MDREA models. However, the performance of these models is still competitive; in particular, the MDRE-ASR model outperforms the previous best-performing model, 3CNN-LSTM10H (WAP 0.691 vs. 0.688).

5.2. Error analysis

We analyze the predictions of the ARE, TRE, and MDRE models. Figure 3 shows the confusion matrix of each model. The ARE model (Fig. 3(a)) incorrectly classifies the largest share of happy instances as neutral (43.51%); thus, it shows reduced accuracy (35.15%) in predicting the happy class. Overall, most of the emotion classes are frequently confused with the neutral class. This observation is in line with the findings of [31], who noted that the neutral class is located in the center of the activation-valence space, complicating its discrimination from the other classes.

Interestingly, the TRE model (Fig. 3(b)) shows greater prediction gains in predicting the happy class when compared to the ARE model (35.15% to 75.73%). This result seems plausible because the model can benefit from the differences among the distributions of words in happy and neutral expressions, which give more emotional information to the model than the audio signal data do. On the other hand, it is striking that the TRE model incorrectly predicts instances of the sad class as the happy class 16.20% of the time, even though these emotional states are opposites of one another.

The MDRE model (Fig. 3(c)) compensates for the weaknesses of the previous two models (ARE and TRE) and benefits from their strengths to a surprising degree. The values arranged along the diagonal axis show that the accuracies of all correctly predicted classes have increased. Furthermore, the occurrence of the incorrect “sad-to-happy” cases in the TRE model is reduced from 16.20% to 9.15%.

Fig. 3. Confusion matrix of each model: (a) ARE, (b) TRE, (c) MDRE.

6. CONCLUSIONS

In this paper, we propose a novel multimodal dual recurrent encoder model that simultaneously utilizes text data, as well as audio signals, to permit a better understanding of speech data. Our model encodes the information from audio and text sequences using dual RNNs and then combines the information from these sources using a feed-forward neural model to predict the emotion class. Extensive experiments show that our proposed model outperforms other state-of-the-art methods in classifying the four emotion categories, and accuracies ranging from 68.8% to 71.8% are obtained when the model is applied to the IEMOCAP dataset. In particular, it resolves the issue in which predictions frequently incorrectly yield the neutral class, as occurs in previous models that focus on audio features.

In future work, we aim to extend the modalities to audio, text and video inputs. Furthermore, we plan to investigate the application of the attention mechanism to data derived from multiple modalities. This approach seems likely to uncover enhanced learning schemes that will increase performance in both speech emotion recognition and other multimodal classification tasks.

Acknowledgments

K. Jung is with the Department of Electrical and Computer Engineering, ASRI, Seoul National University, Seoul, Korea. This work was supported by the Ministry of Trade, Industry & Energy (MOTIE, Korea) under Industrial Technology Innovation Program (No.10073144).

7. REFERENCES

[1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.

[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.

[3] Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, et al., “Deep speech 2: End-to-end speech recognition in english and mandarin,” in International Conference on Machine Learning, 2016, pp. 173–182.

[4] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in Proceedings of the 23rd international conference on Machine learning. ACM, 2006, pp. 369–376.

[5] Hao Zhou, Minlie Huang, Tianyang Zhang, Xiaoyan Zhu, and Bing Liu, “Emotional chatting machine: Emotional conversation generation with internal and external memory,” 2018.

[6] Chenyang Huang, Osmar Zaiane, Amine Trabelsi, and Nouha Dziri, “Automatic dialogue generation with expressed emotions,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2018, vol. 2, pp. 49–54.

[7] Carlos Busso, Murtaza Bulut, and Shrikanth Narayanan, “Toward effective automatic recognition systems of emotion in speech,” Social Emotions in Nature and Artifact, p. 110, 2014.

[8] Dong Yu and Li Deng, Automatic Speech Recognition, Springer, 2016.

[9] Google, “Cloud speech-to-text,” [Link] 2018.

[10] Microsoft, “Microsoft speech api,” [Link] services/speech/home, 2018.

[11] Linhong Xu, Hongfei Lin, Yu Pan, Hui Ren, and Jianmei Chen, “Constructing the affective lexicon ontology,” Journal of the China Society for Scientific and Technical Information, vol. 27, no. 2, pp. 180–185, 2008.

[12] Thapanee Seehapoch and Sartra Wongthanavasu, “Speech emotion recognition using support vector machines,” in Knowledge and Smart Technology (KST), 2013 5th International Conference on. IEEE, 2013, pp. 86–91.

[13] Björn Schuller, Gerhard Rigoll, and Manfred Lang, “Hidden markov model-based speech emotion recognition,” in Multimedia and Expo, 2003. ICME’03. Proceedings. 2003 International Conference on. IEEE, 2003, vol. 1, pp. I–401.

[14] Chi-Chun Lee, Emily Mower, Carlos Busso, Sungbok Lee, and Shrikanth Narayanan, “Emotion recognition using a hierarchical binary decision tree approach,” Speech Communication, vol. 53, no. 9-10, pp. 1162–1171, 2011.

[15] Kun Han, Dong Yu, and Ivan Tashev, “Speech emotion recognition using deep neural network and extreme learning machine,” in Fifteenth Annual Conference of the International Speech Communication Association, 2014.

[16] Dario Bertero and Pascale Fung, “A first look into a convolutional neural network for speech emotion detection,” in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017, pp. 5115–5119.

[17] Abdul Malik Badshah, Jamil Ahmad, Nasir Rahim, and Sung Wook Baik, “Speech emotion recognition from spectrograms with deep convolutional neural network,” in Platform Technology and Service (PlatCon), 2017 International Conference on. IEEE, 2017, pp. 1–5.

[18] Zakaria Aldeneh and Emily Mower Provost, “Using regional saliency for speech emotion recognition,” in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017, pp. 2741–2745.
[19] Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N Chang, Sungbok Lee, and Shrikanth S Narayanan, “Iemocap: Interactive emotional dyadic motion capture database,” Language resources and evaluation, vol. 42, no. 4, pp. 335, 2008.

[20] Aharon Satt, Shai Rozenberg, and Ron Hoory, “Efficient emotion recognition from speech using deep learning on spectrograms,” Proc. Interspeech 2017, pp. 1089–1093, 2017.

[21] Jaebok Kim, Gwenn Englebienne, Khiet P Truong, and Vanessa Evers, “Towards speech emotion recognition “in the wild” using aggregated corpora and deep multi-task learning,” in 18th Annual Conference of the International Speech Communication Association, INTERSPEECH 2017: Situated interaction. International Speech Communication Association (ISCA), 2017.

[22] John Gideon, Soheil Khorram, Zakaria Aldeneh, Dimitrios Dimitriadis, and Emily Mower Provost, “Progressive neural networks for transfer learning in emotion recognition,” Proc. Interspeech 2017, pp. 1098–1102, 2017.

[23] Björn Schuller, Gerhard Rigoll, and Manfred Lang, “Speech emotion recognition combining acoustic features and linguistic information in a hybrid support vector machine-belief network architecture,” in Acoustics, Speech, and Signal Processing, 2004. Proceedings.(ICASSP’04). IEEE International Conference on. IEEE, 2004, vol. 1, pp. I–577.

[24] Kalani Wataraka Gamage, Vidhyasaharan Sethu, and Eliathamby Ambikairajah, “Salience based lexical features for emotion recognition,” in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017, pp. 5830–5834.

[25] Yun Wang, Leonardo Neves, and Florian Metze, “Audio-based multimedia event detection using deep recurrent neural networks,” in Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, 2016, pp. 2742–2746.

[26] Seyedmahdad Mirsamadi, Emad Barsoum, and Cha Zhang, “Automatic speech emotion recognition using recurrent neural networks with local attention,” in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017, pp. 2227–2231.

[27] Florian Eyben, Felix Weninger, Florian Gross, and Björn Schuller, “Recent developments in opensmile, the munich open-source multimedia feature extractor,” in Proceedings of the 21st ACM international conference on Multimedia. ACM, 2013, pp. 835–838.

[28] Steven Bird and Edward Loper, “Nltk: the natural language toolkit,” in Proceedings of the ACL 2004 on Interactive poster and demonstration sessions. Association for Computational Linguistics, 2004, p. 31.

[29] Thang Luong, Hieu Pham, and Christopher D Manning, “Effective approaches to attention-based neural machine translation,” in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015, pp. 1412–1421.

[30] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” arXiv preprint arXiv:1412.3555, 2014.

[31] Michael Neumann and Ngoc Thang Vu, “Attentive convolutional neural network based speech emotion recognition: A study on the impact of input features, signal length, and acted speech,” Proc. Interspeech 2017, pp. 1263–1267, 2017.

[32] Andrew M Saxe, James L McClelland, and Surya Ganguli, “Exact solutions to the nonlinear dynamics of learning in deep linear neural networks,” arXiv preprint arXiv:1312.6120, 2013.

[33] Jeffrey Pennington, Richard Socher, and Christopher Manning, “Glove: Global vectors for word representation,” in Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), 2014, pp. 1532–1543.

[34] Jinkyu Lee and Ivan Tashev, “High-level feature representation using recurrent neural network for speech emotion recognition,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015.