Utterance-Level Multimodal Sentiment Analysis

Verónica Pérez-Rosas and Rada Mihalcea
Computer Science and Engineering
University of North Texas
veronicaperezrosas@[Link], rada@[Link]

Louis-Philippe Morency
Institute for Creative Technologies
University of Southern California
morency@[Link]

Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 973–982, Sofia, Bulgaria, August 4-9 2013. © 2013 Association for Computational Linguistics

Abstract

During real-life interactions, people are naturally gesturing and modulating their voice to emphasize specific points or to express their emotions. With the recent growth of social websites such as YouTube, Facebook, and Amazon, video reviews are emerging as a new source of multimodal and natural opinions that has been left almost untapped by automatic opinion analysis techniques. This paper presents a method for multimodal sentiment classification, which can identify the sentiment expressed in utterance-level visual datastreams. Using a new multimodal dataset consisting of sentiment annotated utterances extracted from video reviews, we show that multimodal sentiment analysis can be effectively performed, and that the joint use of visual, acoustic, and linguistic modalities can lead to error rate reductions of up to 10.5% as compared to the best performing individual modality.

1 Introduction

Video reviews represent a growing source of consumer information that has gained increasing interest from companies, researchers, and consumers. Popular web platforms such as YouTube, Amazon, Facebook, and ExpoTV have reported a significant increase in the number of consumer reviews in video format over the past five years. Compared to traditional text reviews, video reviews provide a more natural experience as they allow the viewer to better sense the reviewer’s emotions, beliefs, and intentions through richer channels such as intonations, facial expressions, and body language.

Much of the work to date on opinion analysis has focused on textual data, and a number of resources have been created including lexicons (Wiebe and Riloff, 2005; Esuli and Sebastiani, 2006) or large annotated datasets (Maas et al., 2011). Given the accelerated growth of other media on the Web and elsewhere, which includes massive collections of videos (e.g., YouTube, Vimeo, VideoLectures), images (e.g., Flickr, Picasa), and audio clips (e.g., podcasts), the ability to address the identification of opinions in the presence of diverse modalities is becoming increasingly important. This has motivated researchers to start exploring multimodal clues for the detection of sentiment and emotions in video content (Morency et al., 2011; Wagner et al., 2011).

In this paper, we explore the addition of speech and visual modalities to text analysis in order to identify the sentiment expressed in video reviews. Given the non-homogeneous nature of full-video reviews, which typically include a mixture of positive, negative, and neutral statements, we decided to perform our experiments and analyses at the utterance level. This is in line with earlier work on text-based sentiment analysis, where it has been observed that full-document reviews often contain both positive and negative comments, which led to a number of methods addressing opinion analysis at sentence level. Our results show that relying on the joint use of linguistic, acoustic, and visual modalities allows us to better sense the sentiment being expressed as compared to the use of only one modality at a time.

Another important aspect of this paper is the introduction of a new multimodal opinion database annotated at the utterance level which is, to our knowledge, the first of its kind. In our work, this dataset enabled a wide range of multimodal sentiment analysis experiments, addressing the relative importance of modalities and individual features.

The following section presents related work in text-based sentiment analysis and audio-visual emotion recognition. Section 3 describes our new multimodal dataset with utterance-level sentiment annotations. Section 4 presents our multimodal sentiment analysis approach, including details about our linguistic, acoustic, and visual features. Our experiments and results on multimodal sentiment classification are presented in Section 5, with a detailed discussion and analysis in Section 6.

2 Related Work

In this section we provide a brief overview of related work in text-based sentiment analysis, as well as audio-visual emotion analysis.

2.1 Text-based Subjectivity and Sentiment Analysis

The techniques developed so far for subjectivity and sentiment analysis have focused primarily on the processing of text, and consist of either rule-based classifiers that make use of opinion lexicons, or data-driven methods that assume the availability of a large dataset annotated for polarity. These tools and resources have already been used in a large number of applications, including expressive text-to-speech synthesis (Alm et al., 2005), tracking sentiment timelines in on-line forums and news (Balog et al., 2006), analysis of political debates (Carvalho et al., 2011), question answering (Oh et al., 2012), conversation summarization (Carenini et al., 2008), and citation sentiment detection (Athar and Teufel, 2012).

One of the first lexicons used in sentiment analysis is the General Inquirer (Stone, 1968). Since then, many methods have been developed to automatically identify opinion words and their polarity (Hatzivassiloglou and McKeown, 1997; Turney, 2002; Hu and Liu, 2004; Taboada et al., 2011), as well as n-gram and more linguistically complex phrases (Yang and Cardie, 2012).

For data-driven methods, one of the most widely used datasets is the MPQA corpus (Wiebe et al., 2005), which is a collection of news articles manually annotated for opinions. Other datasets are also available, including two polarity datasets consisting of movie reviews (Pang and Lee, 2004; Maas et al., 2011), and a collection of newspaper headlines annotated for polarity (Strapparava and Mihalcea, 2007).

While difficult problems such as cross-domain (Blitzer et al., 2007; Li et al., 2012) or cross-language (Mihalcea et al., 2007; Wan, 2009; Meng et al., 2012) portability have been addressed, not much has been done in terms of extending the applicability of sentiment analysis to other modalities, such as speech or facial expressions.

The only exceptions that we are aware of are the findings reported in (Somasundaran et al., 2006; Raaijmakers et al., 2008; Mairesse et al., 2012; Metze et al., 2009), where speech and text have been analyzed jointly for the purpose of subjectivity or sentiment identification, without, however, addressing other modalities such as visual cues; and the work reported in (Morency et al., 2011; Perez-Rosas et al., 2013), where multimodal cues have been used for the analysis of sentiment in product reviews, but where the analysis was done at the much coarser level of full videos rather than individual utterances as we do in our work.

2.2 Audio-Visual Emotion Analysis

Also related to our work is the research done on emotion analysis. Emotion analysis of speech signals aims to identify the emotional or physical states of a person by analyzing his or her voice (Ververidis and Kotropoulos, 2006). Proposed methods for emotion recognition from speech focus both on what is being said and how it is being said, and rely mainly on the analysis of the speech signal by sampling the content at utterance or frame level (Bitouk et al., 2010). Several researchers used prosody (e.g., pitch, speaking rate, Mel frequency coefficients) for speech-based emotion recognition (Polzin and Waibel, 1996; Tato et al., 2002; Ayadi et al., 2011).

There are also studies that analyzed visual cues, such as facial expressions and body movements (Calder et al., 2001; Rosenblum et al., 1996; Essa and Pentland, 1997). Facial expressions are among the most powerful and natural means for human beings to communicate their emotions and intentions (Tian et al., 2001). Emotions can also be expressed unconsciously, through subtle movements of facial muscles such as smiling or eyebrow raising, often measured and described using the Facial Action Coding System (FACS) (Ekman et al., 2002).

De Silva et al. (De Silva et al., 1997) and Chen et al. (Chen et al., 1998) presented some of the earliest work integrating both acoustic and visual information for emotion recognition. In addition to work that considered individual modalities, there is also a growing body of work concerned with multimodal emotion analysis (Silva et al., 1997; Sebe et al., 2006; Zhihong et al., 2009; Wollmer et al., 2010).
Utterance transcription (English translation) | Label
En este color, creo que era el color frambuesa. (In this color, I think it was raspberry.) | neu
Pinta hermosisimo. (It looks beautiful.) | pos
Sinceramente, con respecto a lo que pinta y a que son hidratante, si son muy hidratantes. (Honestly, regarding how they look and how they hydrate, yes, they are very hydrating.) | pos
Pero el problema de estos labiales es que cuando uno se los aplica, te dejan un gusto asqueroso en la boca. (But the problem with those lipsticks is that when you apply them, they leave a very nasty taste.) | neg
Sinceramente, es no es que sea el olor sino que es mas bien el gusto. (Honestly, it is not the smell, it is the taste.) | neg
Table 1: Sample utterance-level annotations. The labels used are: pos(itive), neg(ative), neu(tral).
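A row of Table 1 can be pictured as a small record. The sketch below is only an illustration: the `Utterance` class, its field names, and the `label_counts` helper are hypothetical and are not part of the dataset release; the two example rows are copied from Table 1.

```python
from collections import Counter
from dataclasses import dataclass

VALID_LABELS = {"pos", "neg", "neu"}  # label set from the Table 1 caption


@dataclass(frozen=True)
class Utterance:
    """One annotated utterance (hypothetical field names)."""
    transcription: str  # Spanish transcription
    translation: str    # English gloss
    label: str          # "pos", "neg", or "neu"

    def __post_init__(self):
        # Reject labels outside the annotation scheme.
        if self.label not in VALID_LABELS:
            raise ValueError(f"unknown label: {self.label!r}")


def label_counts(utterances):
    """Count how many utterances carry each sentiment label."""
    return Counter(u.label for u in utterances)


sample = [
    Utterance("Pinta hermosisimo.", "It looks beautiful.", "pos"),
    Utterance("En este color, creo que era el color frambuesa.",
              "In this color, I think it was raspberry.", "neu"),
]
```

Calling `label_counts(sample)` on these two rows yields one `pos` and one `neu` utterance.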
More recently, two challenges have been organized focusing on the recognition of emotions using audio and visual cues (Schuller et al., 2011a; Schuller et al., 2011b), which included sub-challenges on audio-only, video-only, and audio-video, and drew the participation of many teams from around the world. Note however that most of the previous work on audio-visual emotion analysis has focused exclusively on the audio and video modalities, and did not consider textual features, as we do in our work.

3 MOUD: Multimodal Opinion Utterances Dataset

For our experiments, we created a dataset of utterances (named MOUD) containing product opinions expressed in Spanish.1 We chose to work with Spanish because it is a widely used language, and it is the native language of the main author of this paper.

1 Publicly available from the authors' webpage.

We started by collecting a set of videos from the social media web site YouTube, using several keywords likely to lead to a product review or recommendation. Starting with the YouTube search page, videos were found using the following keywords: mis productos favoritos (my favorite products), productos que no recomiendo (non recommended products), mis perfumes favoritos (my favorite perfumes), peliculas recomendadas (recommended movies), peliculas que no recomiendo (non recommended movies), libros recomendados (recommended books), and libros que no recomiendo (non recommended books). Notice that the keywords are not targeted at a specific product type; rather, we used a variety of product names, so that the dataset has some degree of generality within the broad domain of product reviews.

Among all the videos returned by the YouTube search, we selected only videos that respected the following guidelines: the speaker should be in front of the camera; her face should be clearly visible, with a minimum amount of face occlusion during the recording; there should not be any background music or animation. The final video set includes 80 videos randomly selected from the videos retrieved from YouTube that also met the guidelines above. The dataset includes 15 male and 65 female speakers, with their age approximately ranging from 20 to 60 years.

All the videos were first pre-processed to eliminate introductory titles and advertisements. Since the reviewers often switched topics when expressing their opinions, we manually selected a 30-second opinion segment from each video to avoid having multiple topics in a single review.

3.1 Segmentation and Transcription

All the video clips were manually processed to transcribe the verbal statements and also to extract the start and end time of each utterance. Since the reviewers utter expressive sentences that are naturally segmented by speech pauses, we decided to use these pauses (>0.5 seconds) to identify the beginning and the end of each utterance. The transcription and segmentation were performed using the Transcriber software.

Each video was segmented into an average of six utterances, resulting in a final dataset of 498 utterances. Each utterance is linked to the corresponding audio and video stream, as well as its manual transcription. The utterances have an average duration of 5 seconds, with a standard deviation of 1.2 seconds.
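The pause-based rule from Section 3.1 can be sketched as follows. This only illustrates the 0.5-second criterion and is not the procedure actually used, since segmentation was done manually in Transcriber. The input format (a sorted list of `(start, end)` speech intervals in seconds), the function name, and the example intervals are all assumptions made for the illustration.

```python
PAUSE_THRESHOLD = 0.5  # seconds; a longer pause marks an utterance boundary


def split_into_utterances(speech_intervals, threshold=PAUSE_THRESHOLD):
    """Merge speech intervals separated by short pauses into utterances.

    speech_intervals: list of (start, end) times in seconds, sorted by start.
    Returns a list of (start, end) utterance boundaries.
    """
    utterances = []
    for start, end in speech_intervals:
        if utterances and start - utterances[-1][1] <= threshold:
            # Pause too short to count as a boundary: extend the last utterance.
            utterances[-1] = (utterances[-1][0], end)
        else:
            # First interval, or a pause longer than the threshold.
            utterances.append((start, end))
    return utterances


# Made-up intervals: pauses of 0.2 s, 0.9 s, and 0.2 s between them.
intervals = [(0.0, 1.8), (2.0, 4.1), (5.0, 7.5), (7.7, 9.0)]
boundaries = split_into_utterances(intervals)
```

With these made-up intervals only the 0.9-second pause exceeds the threshold, so the four intervals collapse into two utterances.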
Figure 1: Multimodal feature extraction

3.2 Sentiment Annotation

To enable the use of this dataset for sentiment detection, we performed sentiment annotations at utterance level. Annotations were done using Elan,2 which is a widely used tool for the annotation of video and audio resources. Two annotators independently labeled each utterance as positive, negative, or neutral. The annotation was done after seeing the video corresponding to an utterance (along with the corresponding audio source). The transcription of the utterance was also made available. Thus, the annotation process included all three modalities: visual, acoustic, and linguistic. The annotators were allowed to watch the video segment and its corresponding transcription as many times as needed. The inter-annotator agreement was measured at 88%, with a Kappa of 0.81, which represents good agreement. All the disagreements were reconciled through discussions.

2 [Link]

Table 1 shows the five utterances obtained from a video in our dataset, along with their corresponding sentiment annotations. As this example illustrates, a video can contain a mix of positive, negative, and neutral utterances. Note also that sentiment is not always explicit in the text: for example, the last utterance “Honestly, it is not the smell, it is the taste” has an implicit reference to the “nasty taste” expressed in the previous utterance, and thus it was also labeled as negative by both annotators.

4 Multimodal Sentiment Analysis

The main advantage that comes with the analysis of video opinions, as compared to their textual counterparts, is the availability of visual and speech cues. In textual opinions, the only source of information consists of words and their dependencies, which may sometimes prove insufficient to convey the exact sentiment of the user. In contrast, video opinions naturally contain multiple modalities, consisting of visual, acoustic, and linguistic datastreams. We hypothesize that the simultaneous use of these three modalities will help create a better opinion analysis model.
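The agreement statistics reported for the annotation step in Section 3.2 (raw agreement and Kappa) are commonly computed as Cohen's kappa for two annotators. The sketch below assumes that definition, which the text does not spell out; the function name and the six toy labels are made up for illustration and are not the MOUD annotations.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("need two non-empty label lists of equal length")
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' label proportions.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy labels for six utterances (not taken from the dataset).
annotator_1 = ["pos", "pos", "neg", "neg", "neu", "neg"]
annotator_2 = ["pos", "neg", "neg", "neg", "neu", "neg"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

In this toy case the annotators agree on five of six items, and the chance-corrected score is lower than the raw agreement, which is the usual relationship between the two figures.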
4.1 Feature Extraction

This section describes the process of automatically extracting linguistic, acoustic and visual features from the video reviews. First, we obtain the stream corresponding to each modality, followed by the extraction of a representative set of features for each modality, as described in the following subsections. These features are then used as cues to build a classifier of positive or negative sentiment. Figure 1 illustrates this process.

4.1.1 Linguistic Features

We use a bag-of-words representation of the video transcriptions of each utterance to derive unigram counts, which are then used as linguistic features. First, we build a vocabulary consisting of all the words, including stopwords, occurring in the transcriptions of the training set. We then remove those words that have a frequency below 10 (value determined empirically on a small development set). The remaining words represent the unigram features, which are then associated with a value corresponding to the frequency of the unigram inside each utterance transcription. These simple weighted unigram features have been successfully used in the past to build sentiment classifiers on text, and in conjunction with Support Vector Machines (SVM) have been shown to lead to state-of-the-art performance (Maas et al., 2011).

4.1.2 Acoustic Features

Acoustic features are automatically extracted from the speech signal of each utterance. We used the open source software OpenEAR (Schuller, 2009) to automatically compute a set of acoustic features. We include prosody, energy, voicing probabilities, spectrum, and cepstral features.

• Prosody features. These include intensity, loudness, and pitch that describe the speech signal in terms of amplitude and frequency.

• Energy features. These features describe the human loudness perception.

• Voice probabilities. These are probabilities that represent an estimate of the percentage of voiced and unvoiced energy in the speech.

• Spectral features. The spectral features are based on the characteristics of the human ear, which uses a nonlinear frequency unit to simulate the human auditory system. These features describe the speech formants, which model spoken content and represent speaker characteristics.

• Cepstral features. These features emphasize changes or periodicity in the spectrum features measured by frequencies; we model them using 12 Mel-frequency cepstral coefficients that are calculated based on the Fourier transform of a speech frame.

Overall, we have a set of 28 acoustic features. During the feature extraction, we use a frame sampling of 25 ms. Speaker normalization is performed using z-standardization. The voice intensity is thresholded to identify samples with and without speech, with the same threshold being used for all the experiments and all the speakers. The features are averaged over all the frames in an utterance, to obtain one feature vector for each utterance.

4.1.3 Facial Features

Facial expressions can provide important clues for affect recognition, which we use to complement the linguistic and acoustic features extracted from the speech stream.

The most widely used system for measuring and describing facial behaviors is the Facial Action Coding System (FACS), which allows for the description of face muscle activities through the use of a set of Action Units (AUs). According to (Ekman, 1993), there are 64 AUs that involve the upper and lower face, including several face positions and movements.3 AUs can occur either by themselves or in combination, and can be used to identify a variety of emotions. While AUs are frequently annotated by certified human annotators, automatic tools are also available. In our work, we use the Computer Expression Recognition Toolbox (CERT) (Littlewort et al., 2011), which allows us to automatically extract the following visual features:

3 [Link]

• Smile and head pose estimates. The smile feature is an estimate for smiles. Head pose detection consists of three-dimensional estimates of the head orientation, i.e., yaw, pitch, and roll. These features provide information about changes in smiles and face positions while uttering positive and negative opinions.

• Facial AUs. These features are the raw estimates for 30 facial AUs related to muscle movements for the eyes, eyebrows, nose, lips,
and chin. They provide detailed information about facial behaviors from which we expect to find differences between positive and negative states.

• Eight basic emotions. These are estimates for the following emotions: anger, contempt, disgust, fear, joy, sadness, surprise, and neutral. These features describe the presence of two or more AUs that define a specific emotion. For example, the unit AU12 describes the pulling of the lip corners, which usually suggests a smile but, when associated with a cheek raiser movement (unit AU6), represents a marker for the emotion of happiness.

Modality                      Accuracy
Baseline                      55.93%
One modality at a time
Linguistic                    70.94%
Acoustic                      64.85%
Visual                        67.31%
Two modalities at a time
Linguistic + Acoustic         72.88%
Linguistic + Visual           72.39%
Acoustic + Visual             68.86%
Three modalities at a time
Linguistic+Acoustic+Visual    74.09%
Table 2: Utterance-level sentiment classification with linguistic, acoustic, and visual features.
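The Baseline row of Table 2 is a majority-class score. As a rough illustration (not the Weka ZeroR run itself, which is averaged over ten folds), the sketch below computes the majority-class proportion from the class counts reported for the experimental dataset, 182 positive and 231 negative utterances. The function name is hypothetical.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    _, majority_count = counts.most_common(1)[0]
    return majority_count / len(labels)


# Class counts as reported for the experimental dataset.
labels = ["pos"] * 182 + ["neg"] * 231
baseline = majority_baseline_accuracy(labels)
```

With these counts the majority class is negative and the proportion comes to about 55.93%, in line with the Baseline row.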
We extract a total of 40 visual features, each of them obtained at frame level. Since only one person is present in each video clip, most of the time facing the camera, the facial tracking was successfully applied for most of our data. For the analysis, we use a sampling rate of 30 frames per second. The features extracted for each utterance are averaged over all the valid frames, which are automatically identified using the output of CERT.4 Segments with more than 60% of invalid frames are simply discarded.

4 There is a small number of frames that CERT could not process, mostly due to the brief occlusions that occur when the speaker is showing the product she is reviewing.

5 Experiments and Results

We run our sentiment classification experiments on the MOUD dataset introduced earlier. From the dataset, we remove utterances labeled as neutral, thus keeping only the positive and negative utterances with valid visual features. The removal of neutral utterances is done for two main reasons. First, the number of neutral utterances in the dataset is rather small. Second, previous work in subjectivity and sentiment analysis has demonstrated that a layered approach (where neutral statements are first separated from opinion statements followed by a separation between positive and negative statements) works better than a single three-way classification. After this process, we are left with an experimental dataset of 412 utterances, 182 of which are labeled as positive, and 231 are labeled as negative.

From each utterance, we extract the linguistic, acoustic, and visual features described above, which are then combined using the early fusion (or feature-level fusion) approach (Hall and Llinas, 1997; Atrey et al., 2010). In this approach, the features collected from all the multimodal streams are combined into a single feature vector, thus resulting in one vector for each utterance in the dataset which is used to make a decision about the sentiment orientation of the utterance.

We run several comparative experiments, using one, two, and three modalities at a time. We use the entire set of 412 utterances and run ten-fold cross-validation using an SVM classifier, as implemented in the Weka toolkit.5 In line with previous work on emotion recognition in speech (Haq and Jackson, 2009; Anagnostopoulos and Vovoli, 2010) where utterances are selected in a speaker-dependent manner (i.e., utterances from the same speaker are included in both training and test), as well as work on sentence-level opinion classification where document boundaries are not considered in the split performed between the training and test sets (Wilson et al., 2004; Wiegand and Klakow, 2009), the training/test split for each fold is performed at utterance level regardless of the video they belong to.

5 [Link]

Table 2 shows the results of the utterance-level sentiment classification experiments. The baseline is obtained using the ZeroR classifier, which assigns the most frequent label by default, averaged over the ten folds.

6 Discussion

The experimental results show that sentiment classification can be effectively performed on multimodal datastreams. Moreover, the integration of
visual, acoustic, and linguistic features can improve significantly over the use of one modality at a time, with incremental improvements observed for each added modality.

Figure 2: Visual and acoustic feature weights. This graph shows the relative importance of the information gain weights associated with the top most informative acoustic-visual features.

Among the individual classifiers, the linguistic classifier appears to be the most accurate, followed by the classifier that relies on visual clues, and by the audio classifier. Compared to the best individual classifier, the relative error rate reduction obtained with the tri-modal classifier is 10.5%. The results obtained with this multimodal utterance classifier are found to be significantly better than the best individual results (obtained with the text modality), with significance being tested with a t-test (p=0.05).

Feature analysis.
To determine the role played by each of the visual and acoustic features, we compare the feature weights assigned by the learning algorithm, as shown in Figure 2. Interestingly, a distressed brow is the strongest indicator of sentiment, followed, this time not surprisingly, by the smile feature. Other informative features for sentiment classification are the voice probability, representing the energy in speech, the combined visual features that represent an angry face, and two of the cepstral coefficients.

To reach a better understanding of the relation between features, we also calculate the Pearson correlation between the visual and acoustic features. Table 3 shows a subset of these correlation figures. As we expected, correlations between features of the same type are higher. For example, the correlation between features AU6 and AU12 or the correlation between intensity and loudness is higher than the correlation between AU6 and intensity. Nonetheless, we still find some significant correlations between features of different types, for instance AU12 and AU45, which are both significantly correlated with the intensity and loudness features. This gives us confidence about using them for further analysis.

Video-level sentiment analysis.
To understand the role played by the size of the video-segments considered in the sentiment classification experiments, as well as the potential effect of a speaker-independence assumption, we also run a set of experiments where we use full videos for the classification.

In these experiments, once again the sentiment annotation is done by two independent annotators, using the same protocol as in the utterance-based annotations. Videos that were ambivalent about the general sentiment were either labeled as neutral (and thus removed from the experiments), or labeled with the dominant sentiment. The inter-annotator agreement for this annotation was measured at 96.1%. As before, the linguistic, acoustic, and visual features are averaged over the entire video, and we use an SVM classifier in ten-fold cross-validation experiments.

Table 4 shows the results obtained in these video-level experiments. While the combination of modalities still helps, the improvement is smaller than the one obtained during the utterance-level classification. Specifically, the combined effect of acoustic and visual features improves significantly over the individual modalities. However, the combination of linguistic features with other modalities does not lead to clear improvements. This may be due to the smaller number of feature vectors used in the experiments (only 80, as compared to the 412 used in the previous setup). Another possible reason is the fact that the acoustic and visual modalities are significantly weaker than the linguistic modality, most likely due to the fact that the feature vectors are now speaker-independent, which makes it harder to improve over the linguistic modality alone.

7 Conclusions

In this paper, we presented a multimodal approach for utterance-level sentiment classification. We introduced a new multimodal dataset consisting
Feature            AU6     AU12    AU45    AUs 1,1+4  Pitch   Voice probability  Intensity  Loudness
AU6                1.00    0.46*   -0.03   -0.05      0.06    -0.14*             -0.04      -0.02
AU12                       1.00    -0.23*  -0.33*     0.04    0.05               0.15*      0.16*
AU45                               1.00    0.05       -0.05   -0.11*             -0.163*    0.16*
AUs 1,1+4                                  1.00       -0.11*  -0.16*             0.06       0.07
Pitch                                                 1.00    -0.04              -0.01      -0.08
Voice probability                                             1.00               0.19*      0.38*
Intensity                                                                        1.00       0.85*
Loudness                                                                                    1.00
Table 3: Correlations between several visual and acoustic features. Visual features: AU6 Cheek raise, AU12 Lip corner pull, AU45 Blink and eye closure, AU1,1+4 Distress brow. Acoustic features: Pitch, Voice probability, Intensity, Loudness. *Correlation is significant at the 0.05 level (1-tailed).
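Each entry in Table 3 is a Pearson correlation between two per-utterance feature series. The sketch below shows the standard formula together with the t statistic usually used to test whether a correlation differs from zero. The function names and the toy values are assumptions for illustration, and comparing the statistic against a critical value at the 0.05 level is left out because the text does not describe how that test was run.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


def t_statistic(r, n):
    """t statistic (n - 2 degrees of freedom) for the null hypothesis r = 0."""
    if abs(r) >= 1.0:
        return math.copysign(math.inf, r)
    return r * math.sqrt((n - 2) / (1.0 - r * r))


# Made-up per-utterance values for two acoustic features.
intensity = [0.2, 0.4, 0.5, 0.7, 0.9]
loudness = [0.1, 0.3, 0.6, 0.6, 1.0]
r = pearson_r(intensity, loudness)
```

The coefficient is symmetric in its two arguments, which is why only the upper triangle of Table 3 needs to be filled in.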
Modality                      Accuracy
Baseline                      55.93%
One modality at a time
Linguistic                    73.33%
Acoustic                      53.33%
Visual                        50.66%
Two modalities at a time
Linguistic + Acoustic         72.00%
Linguistic + Visual           74.66%
Acoustic + Visual             61.33%
Three modalities at a time
Linguistic+Acoustic+Visual    74.66%
Table 4: Video-level sentiment classification with linguistic, acoustic, and visual features.

of sentiment annotated utterances extracted from video reviews, where each utterance is associated with a video, acoustic, and linguistic datastream. Our experiments show that sentiment annotation of utterance-level visual datastreams can be effectively performed, and that the use of multiple modalities can lead to error rate reductions of up to 10.5% as compared to the use of one modality at a time. In future work, we plan to explore alternative multimodal fusion methods, such as decision-level and meta-level fusion, to improve the integration of the visual, acoustic, and linguistic modalities.

Acknowledgments

We would like to thank Alberto Castro for his help with the sentiment annotations. This material is based in part upon work supported by National Science Foundation awards #0917170 and #1118018, by DARPA-BAA-12-47 DEFT grant #12475008, and by a grant from U.S. RDECOM. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation, the Defense Advanced Research Projects Agency, or the U.S. Army Research, Development, and Engineering Command.

References

C. Alm, D. Roth, and R. Sproat. 2005. Emotions from text: Machine learning for text-based emotion prediction. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 347–354, Vancouver, Canada.

C. Anagnostopoulos and E. Vovoli. 2010. Sound processing features for speaker-dependent and phrase-independent emotion recognition in Berlin database. In Information Systems Development, pages 413–421. Springer.

A. Athar and S. Teufel. 2012. Context-enhanced citation sentiment detection. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Montréal, Canada, June.

P. K. Atrey, M. A. Hossain, A. El Saddik, and M. Kankanhalli. 2010. Multimodal fusion for multimedia analysis: a survey. Multimedia Systems, 16.

M. El Ayadi, M. Kamel, and F. Karray. 2011. Survey on speech emotion recognition: Features, classification schemes, and databases. Pattern Recognition, 44(3):572–587.

K. Balog, G. Mishne, and M. de Rijke. 2006. Why are they excited? identifying and explaining spikes in blog mood levels. In Proceedings of the 11th Meeting of the European Chapter of the Association for Computational Linguistics (EACL-2006).

Dmitri Bitouk, Ragini Verma, and Ani Nenkova. 2010. Class-level spectral features for emotion recognition. Speech Commun., 52(7-8):613–625, July.
J. Blitzer, M. Dredze, and F. Pereira. 2007. Biographies, Bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In Association for Computational Linguistics.

A. J. Calder, A. M. Burton, P. Miller, A. W. Young, and S. Akamatsu. 2001. A principal component analysis of facial expressions. Vision research, 41(9):1179–1208, April.

G. Carenini, R. Ng, and X. Zhou. 2008. Summarizing emails with conversational cohesion and subjectivity. In Proceedings of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT 2008), Columbus, Ohio.

P. Carvalho, L. Sarmento, J. Teixeira, and M. Silva. 2011. Liars and saviors in a sentiment annotated corpus of comments to political debates. In Proceedings of the Association for Computational Linguistics (ACL 2011), Portland, OR.

L. S. Chen, T. S. Huang, T. Miyasato, and R. Nakatsu. 1998. Multimodal human emotion/expression recognition. In Proceedings of the 3rd. International Conference on Face & Gesture Recognition, pages 366–, Washington, DC, USA. IEEE Computer Society.

L C De Silva, T Miyasato, and R Nakatsu, 1997. Facial emotion recognition using multi-modal information, volume 1, pages 397–401. IEEE Signal Processing Society.

P. Ekman, W. Friesen, and J. Hager. 2002. Facial action coding system.

M. Hu and B. Liu. 2004. Mining and summarizing customer reviews. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, Seattle, Washington.

F. Li, S. J. Pan, O. Jin, Q. Yang, and X. Zhu. 2012. Cross-domain co-extraction of sentiment and topic lexicons. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, Jeju Island, Korea.

G. Littlewort, J. Whitehill, Tingfan Wu, I. Fasel, M. Frank, J. Movellan, and M. Bartlett. 2011. The computer expression recognition toolbox (CERT). In Automatic Face Gesture Recognition and Workshops (FG 2011), 2011 IEEE International Conference on, pages 298–305, March.

A. Maas, R. Daly, P. Pham, D. Huang, A. Ng, and C. Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the Association for Computational Linguistics (ACL 2011), Portland, OR.

F. Mairesse, J. Polifroni, and G. Di Fabbrizio. 2012. Can prosody inform sentiment analysis? experiments on short spoken reviews. In Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on, pages 5093–5096, March.

X. Meng, F. Wei, X. Liu, M. Zhou, G. Xu, and H. Wang. 2012. Cross-lingual mixture model for sentiment classification. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, Jeju Island, Korea.
F. Metze, T. Polzehl, and M. Wagner. 2009. Fusion
P. Ekman. 1993. Facial expression of emotion. Ameri- of acoustic and linguistic features for emotion detec-
can Psychologist, 48:384–392. tion. In Semantic Computing, 2009. ICSC ’09. IEEE
International Conference on, pages 153 –160, sept.
I.A. Essa and A.P. Pentland. 1997. Coding, analy-
sis, interpretation, and recognition of facial expres- R. Mihalcea, C. Banea, and J. Wiebe. 2007. Learning
sions. Pattern Analysis and Machine Intelligence, multilingual subjective language via cross-lingual
IEEE Transactions on, 19(7):757 –763, jul. projections. In Proceedings of the Association for
Computational Linguistics, Prague, Czech Republic.
A. Esuli and F. Sebastiani. 2006. SentiWordNet: A L.P. Morency, R. Mihalcea, and P. Doshi. 2011. To-
publicly available lexical resource for opinion min- wards multimodal sentiment analysis: Harvesting
ing. In Proceedings of the 5th Conference on Lan- opinions from the web. In Proceedings of the In-
guage Resources and Evaluation (LREC 2006), Gen- ternational Conference on Multimodal Computing,
ova, IT. Alicante, Spain.
D.L. Hall and J. Llinas. 1997. An introduction to mul- J. Oh, K. Torisawa, C. Hashimoto, T. Kawada,
tisensor fusion. IEEE Special Issue on Data Fusion, S. De Saeger, J. Kazama, and Y. Wang. 2012.
85(1). Why question answering using sentiment analysis
and word classes. In Proceedings of the 2012 Joint
S. Haq and P. Jackson. 2009. Speaker-dependent Conference on Empirical Methods in Natural Lan-
audio-visual emotion recognition. In International guage Processing and Computational Natural Lan-
Conference on Audio-Visual Speech Processing. guage Learning, Jeju Island, Korea.
V. Hatzivassiloglou and K. McKeown. 1997. Predict- B. Pang and L. Lee. 2004. A sentimental education:
ing the semantic orientation of adjectives. In Pro- Sentiment analysis using subjectivity summarization
ceedings of the Conference of the European Chap- based on minimum cuts. In Proceedings of the 42nd
ter of the Association for Computational Linguistics, Meeting of the Association for Computational Lin-
pages 174–181. guistics, Barcelona, Spain, July.
V. Perez-Rosas, R. Mihalcea, and L.-P. Morency. 2013. Multimodal sentiment analysis of Spanish online videos. IEEE Intelligent Systems.

T. Polzin and A. Waibel. 1996. Recognizing emotions in speech. In ICSLP.

S. Raaijmakers, K. Truong, and T. Wilson. 2008. Multimodal subjectivity analysis of multiparty conversation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 466–474, Honolulu, Hawaii.

M. Rosenblum, Y. Yacoob, and L.S. Davis. 1996. Human expression recognition from motion using a radial basis function network architecture. Neural Networks, IEEE Transactions on, 7(5):1121–1138, Sep.

B. Schuller, M. Valstar, R. Cowie, and M. Pantic, editors. 2011a. Audio/Visual Emotion Challenge and Workshop (AVEC 2011).

B. Schuller, M. Valstar, F. Eyben, R. Cowie, and M. Pantic, editors. 2011b. Audio/Visual Emotion Challenge and Workshop (AVEC 2011).

F. Eyben, M. Wollmer, and B. Schuller. 2009. Openear introducing the Munich open-source emotion and affect recognition toolkit. In ACII.

N. Sebe, I. Cohen, T. Gevers, and T.S. Huang. 2006. Emotion recognition based on joint visual and audio cues. In ICPR.

D. Silva, T. Miyasato, and R. Nakatsu. 1997. Facial emotion recognition using multi-modal information. In Proceedings of the International Conference on Information and Communications Security.

S. Somasundaran, J. Wiebe, P. Hoffmann, and D. Litman. 2006. Manual annotation of opinion categories in meetings. In Proceedings of the Workshop on Frontiers in Linguistically Annotated Corpora 2006.

P. Stone. 1968. General Inquirer: Computer Approach to Content Analysis. MIT Press.

C. Strapparava and R. Mihalcea. 2007. Semeval-2007 task 14: Affective text. In Proceedings of the 4th International Workshop on the Semantic Evaluations (SemEval 2007), Prague, Czech Republic.

M. Taboada, J. Brooke, M. Tofiloski, K. Voli, and M. Stede. 2011. Lexicon-based methods for sentiment analysis. Computational Linguistics, 37(3).

R. Tato, R. Santos, R. Kompe, and J. M. Pardo. 2002. Emotional space improves emotion recognition. In Proc. ICSLP 2002, pages 2029–2032.

Y.-I. Tian, T. Kanade, and J.F. Cohn. 2001. Recognizing action units for facial expression analysis. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 23(2):97–115, Feb.

P. Turney. 2002. Thumbs up or thumbs down? semantic orientation applied to unsupervised classification of reviews. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL 2002), pages 417–424, Philadelphia.

D. Ververidis and C. Kotropoulos. 2006. Emotional speech recognition: Resources, features, and methods. Speech Communication, 48(9):1162–1181, September.

J. Wagner, E. Andre, F. Lingenfelser, and Jonghwa Kim. 2011. Exploring fusion methods for multimodal emotion recognition with missing data. Affective Computing, IEEE Transactions on, 2(4):206–218, Oct.-Dec.

X. Wan. 2009. Co-training for cross-lingual sentiment classification. In Proceedings of the Joint Conference of the Association of Computational Linguistics and the International Joint Conference on Natural Language Processing, Singapore, August.

J. Wiebe and E. Riloff. 2005. Creating subjective and objective sentence classifiers from unannotated texts. In Proceedings of the 6th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing-2005) (invited paper), Mexico City, Mexico.

J. Wiebe, T. Wilson, and C. Cardie. 2005. Annotating expressions of opinions and emotions in language. Language Resources and Evaluation, 39(2-3):165–210.

M. Wiegand and D. Klakow. 2009. The role of knowledge-based features in polarity classification at sentence level. In Proceedings of the International Conference of the Florida Artificial Intelligence Research Society.

T. Wilson, J. Wiebe, and R. Hwa. 2004. Just how mad are you? finding strong and weak opinion clauses. In Proceedings of the American Association for Artificial Intelligence.

M. Wollmer, B. Schuller, F. Eyben, and G. Rigoll. 2010. Combining long short-term memory and dynamic Bayesian networks for incremental emotion-sensitive artificial listening. IEEE Journal of Selected Topics in Signal Processing, 4(5), October.

B. Yang and C. Cardie. 2012. Extracting opinion expressions with semi-Markov conditional random fields. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Jeju Island, Korea.

Z. Zhihong, M. Pantic, G.I. Roisman, and T.S. Huang. 2009. A survey of affect recognition methods: Audio, visual, and spontaneous expressions. PAMI, 31(1).