Visual Keyword Generation for Image Caption

Uploaded by

poojitha.ch
Received January 10, 2021, accepted February 2, 2021, date of publication February 10, 2021, date of current version February 19, 2021.


Digital Object Identifier 10.1109/ACCESS.2021.3058425

VSAM-Based Visual Keyword Generation for Image Caption
SUYA ZHANG, YANA ZHANG, ZEYU CHEN, AND ZHAOHUI LI
State Key Laboratory of Media Convergence and Communication, Communication University of China, Beijing 100024, China
Corresponding author: Yana Zhang (zynjenny@[Link])
This work was supported in part by the High-quality and Cutting-edge Disciplines Construction Project for Universities in Beijing
(Internet Information, Communication University of China) under Grant GJ19013504, and in part by the National Key Task Team
Cultivation Project 2019 under Grant CUC19ZD003.

ABSTRACT Image caption aims to understand and describe visual content, and it is expected to be applied to automatic news reporting in the future. In recent years, there has been increasing interest in the Encoder-Decoder framework for image caption: the encoder is responsible for visual semantic comprehension, and the decoder is designed for sentence generation. In the Encoder-Decoder framework, the translation is based on the correspondence between image feature vectors and caption vectors. An attention mechanism helps make this correspondence more accurate. However, the attention model works with the decoder, and the focused content changes dynamically with each generated word. As a result, in many cases the salient contents are not described in the caption, or the objects described are not the salient ones. To improve the precision of image caption, to bridge the gap between image understanding and sentence generation in the Encoder-Decoder framework, and to better align visual information with semantic information, we propose the concept of a visual keyword as a gangplank between seeing and saying. This paper presents an image dataset derived from MSCOCO as the first collection of visual keywords: the Image Visual Keyword Dataset (IVKD). In addition, a Visual Semantic Attention Model (VSAM) is proposed to obtain visual keywords for generating the annotation. In VSAM, the object-level visual features are extracted by an object detector after pre-training on IVKD. The object features are then fed into an Optimized Pointer Network (OPN) to generate visual keywords. The experiments show that the precision of visual keyword generation reaches 91.7% with the proposed VSAM.

INDEX TERMS Visual semantic, attention model, image caption.

I. INTRODUCTION
Humans can easily describe an image in words, focusing on the important and interesting things in view. Such descriptions are rich, accurate, and concise: they explain who is present, what they are doing, where it is happening, and more. Since 2010, many systems [1], [2] have demonstrated the feasibility of having a computer describe visual content, and this technology is called image caption. Image caption is an interdisciplinary research field of computer vision and natural language processing. For image caption, the computer needs to detect the image features, recognize the objects and their relationships, understand the scene information and possible behavior, and finally translate the visual content into a human-readable sentence. Image caption is expected to be used in many artificial intelligence applications, such as automatic news reporting.

Traditionally, there were two general methods for image caption: the generation-based method and the retrieval-based method. The generation-based method extracts the image features and obtains the objects, attributes, and scene information with classifiers, such as CRF [2] and HMM [3]. The detected objects are then described by templates [2]–[4], parsing [5]–[7], or language models [8], [9]. Such captions are more grounded in the image. However, they lack diversity and naturalness, because they are limited by fixed templates, syntactic models, and language models. The retrieval-based method first finds a collection of images similar to the target image, and then composes the description of the target image from the captions of the images in the collection. This method relies on image similarity calculation and tends to generate similar captions for different images.

The associate editor coordinating the review of this manuscript and approving it for publication was Victor Sanchez.

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see [Link]
27638 VOLUME 9, 2021
S. Zhang et al.: VSAM-Based Visual Keyword Generation for Image Caption

Benefiting from deep neural networks, the Encoder-Decoder framework has become a general framework for image caption. In the Encoder-Decoder framework [10], an image is encoded into a hidden vector by a Convolutional Neural Network (CNN), and the caption is decoded word by word by a Recurrent Neural Network (RNN) or a Long Short-Term Memory (LSTM) network. Compared with the traditional image caption methods, in the Encoder-Decoder framework all the knowledge can be learned from data, and the generated sentences are much richer. However, this method does not match the way humans proceed from image understanding to sentence generation. The captions sound smooth and reasonable because the model tends to copy captions from the training data rather than accurately match the visual content. Furthermore, this method aligns all semantic features with a visual feature vector. Even prepositions, conjunctions, auxiliary words, and other words that have nothing to do with visual features are put in one-to-one correspondence with a visual feature vector, which is unreasonable.

As humans tend to describe the salient content, the attention mechanism has been widely adopted in the Encoder-Decoder framework (shown in Fig. 1). The decoder focuses on a variable hidden vector with attention weights, rather than a fixed hidden vector. A description generated with the attention mechanism is more relevant to the image and avoids describing what humans do not pay attention to. However, the attended content changes dynamically as the sentence is generated word by word. This dynamic attention mechanism increases the burden on the decoder of judging which objects are salient. Sometimes even the most salient object is not described in the caption, or the keywords in the caption do not correspond to salient objects. For example, in Fig. 2(a) the salient objects in the ground truth annotations are ‘person’ and ‘bike’, but the sentence generated by Neural Baby Talk [11] only describes ‘boat’. In Fig. 2(b), ‘a small airplane’ is described instead of the salient object ‘person’. As shown in Fig. 2(c), the sentence describes a ‘table’ that does not exist in the picture.

FIGURE 1. Attention mechanism based Encoder-Decoder framework. x_i is a pixel or a region in an image, c_t is salient content with an attention weight at time t, and y_j is a generated word in captions.

FIGURE 2. Image caption based on dynamic attention mechanism. The value on a bounding box is the probability that the object appears in a ground truth caption. Captions are generated by Neural Baby Talk [11].

Recently, research in image caption has been pushed forward to the stage of loose matching between images and sentences instead of close matching. Venugopalan et al. [12] used different external sources to train the encoding and decoding models independently and combined them to generate descriptions of novel objects. Ke et al. [13] explored the vocabulary coherence between words and the syntactic paradigm of sentences. These studies focus on generating grammatically standardized sentences, but the precision of the description is still limited.

Two major challenges therefore remain. First, the computer is expected to describe what humans are interested in. Second, the alignment between visual concepts and semantic concepts needs to be more precise. The motivation of this paper is to bridge the gap between the understanding of visual content and sentence generation in the Encoder-Decoder framework. The contributions are as follows:
• Considering that there is no need to describe every object in an image, the concept of a visual keyword, which determines what to describe, is proposed as a gangplank between seeing and saying. This paper presents an image dataset derived from MSCOCO as the first collection of visual keywords: the Image Visual Keyword Dataset (IVKD). At present, the disadvantage of popular attention models in image caption is that they are integrated with sentence generation. Our attention model is not designed to generate sentences directly, but only the visual keywords. In addition, two evaluation methods are proposed to evaluate the precision and recall of visual keyword generation.
• A Visual Semantic Attention Model (VSAM) is proposed in this paper to generate visual keywords as the bridge. In VSAM, object-level visual semantic features are detected by a pre-trained object detector. The attention mechanisms in VSAM include both a feature-driven attention mechanism (to detect the salient objects based on visual features) and a knowledge-driven attention mechanism (to determine what to describe based on prior knowledge).
• An Optimized Pointer Network (OPN) is proposed to support variable-length sequence input, rather than padding the sequences to a fixed length. Experiments showed that the model trained with variable-length sequences performed better. It is more sensitive to the sequence length and converges more easily under the same hyperparameters as the unoptimized pointer network.

II. RELATED WORK
It is believed that the human attention mechanism is divided into two phases [14], [15]: fast, subconscious, bottom-up, feature-driven saliency extraction, and slow, task-dependent, top-down, target-driven saliency extraction. The first phase makes it easy to notice where the features suddenly change. The second phase pays attention to familiar objects based on human prior knowledge, such as faces, vehicles, animals, and road signs.

The attention methods in the Encoder-Decoder framework are divided into two types [16]: the soft attention method and the hard attention method. For the former, the attention weight, ranging from 0 to 1, is interpreted as the importance of a feature. For the latter, the attention weight is interpreted as the attention probability of the feature, which is either 0 or 1. The experiments in [16] showed that the hard attention method performs better in image caption. However, the hard attention method needs a Monte Carlo method for gradient descent. This requires sampling the attention location each time, which makes it difficult to apply. The soft attention method has been widely used in image caption in recent years.

The attention models in the Encoder-Decoder framework can be divided into three levels according to the focused contents: pixel-level, region-level, and object-level. As shown in Fig. 3, these models can automatically focus on salient pixels, regions, or objects when generating a sentence word by word.

FIGURE 3. Three types of attention models for image caption.

Xu et al. [16] introduced a pixel-level attention model to the Encoder-Decoder framework. This model learns to focus on salient pixels while generating words (as shown in Fig. 3(a)). The attention weights on spatial locations are computed by a multilayer perceptron, conditioned on the previous hidden vector and the generated words. This method aligns every word in the sentence with image pixels. Lu et al. [17] proposed an adaptive attention model with a visual sentinel to consider the alignment of non-visual words. The visual sentinel is a latent representation of what the decoder already knows without an image. Based on the visual feature vectors and the visual sentinel, the adaptive attention model is able to determine whether to attend to the image when predicting the next word. It uses the soft attention method, and the non-visual words are still aligned with a visual feature vector. Yang et al. [18] argued that in the Encoder-Decoder framework a decoded word was conditioned on the attention of that moment but had no information about the future. They integrated discriminative supervision into the Encoder-Decoder framework by adding a review network. The review network performs a number of review steps with soft attention on the encoder hidden states and outputs a thought vector after each review step. The thought vector is introduced to capture global properties in a compact vector representation. S. J. Rennie et al. [33] used reinforcement learning to optimize image caption systems based on the pixel-level attention mechanism.

Anderson et al. [19] argued that the pixel-level attention mechanism took a representation of a partially completed caption as context, and that it was driven by non-visual or task-specific context. They proposed a region-level attention model that combined bottom-up and top-down attention mechanisms (as shown in Fig. 3(b)). The bottom-up mechanism points out an image region with an associated feature vector through the hard attention method, while the top-down mechanism determines the region feature weights by the soft attention method. Experiments showed that adding regional features improved the accuracy of annotations. Li et al. [20] used a region proposal network to generate regions of interest, and proposed a novel scheme with an object context encoding LSTM to generate captions for regions. Huang et al. [34] extended the region-level attention mechanism to determine the relevance between attention results and queries, so as to know whether, and how well, the attended vector and the given attention query are related. Guo et al. [32] incorporated an optimized self-attention module into the transformer structure for image caption.
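The soft and hard attention methods described earlier in this section can be illustrated with a minimal sketch. It is not the implementation of [16] or of any model cited here: the function names, the toy feature vectors, and the use of plain Python lists are all assumptions made for illustration. Soft attention returns a weighted average with weights between 0 and 1, while hard attention samples a single location, so its weights are each 0 or 1.

```python
import math
import random


def softmax(scores):
    """Turn raw attention scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_attention(features, scores):
    """Soft attention: weighted average of all feature vectors.

    Every weight lies in [0, 1] and expresses the importance of a feature.
    """
    weights = softmax(scores)
    dim = len(features[0])
    context = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return context, weights


def hard_attention(features, scores, rng):
    """Hard attention: sample one location from the attention distribution.

    The resulting weights are one-hot (each is 0 or 1), which is why
    training needs sampling-based (Monte Carlo) gradient estimates.
    """
    probs = softmax(scores)
    idx = rng.choices(range(len(features)), weights=probs, k=1)[0]
    weights = [1.0 if i == idx else 0.0 for i in range(len(features))]
    return list(features[idx]), weights


if __name__ == "__main__":
    # Hypothetical 2-dimensional features for three image locations.
    feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(soft_attention(feats, [0.1, 2.0, 0.3]))
    print(hard_attention(feats, [0.1, 2.0, 0.3], random.Random(0)))
```

The soft variant is differentiable end to end, which is consistent with the text's remark that it is the more widely used of the two.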


Li et al. [21] introduced an object-level attention model to address the problems of missing and mispredicted objects. They obtained global pixel-level features and local object-level features using deep CNNs, then integrated the local object features with the global image features through the soft attention method. The global feature is a 4096-dimension vector extracted from the fc7 layer of VGG16-net, and the local object-level feature is a 4096-dimension vector extracted from Faster R-CNN. To generate natural and accurate captions that are generally better grounded in images, Lu et al. [11] combined the bottom-up and top-down attention mechanisms with an adaptive object-level attention model. As shown in Fig. 3(c), the model generates a template with slot locations, then fills visual concepts into the slots. The visual concepts are chosen from object-level features by a pointer network. The focused objects change dynamically as the sentence is generated, and all words in the output sentence rely on visual features. The pointer network in [11] is only applied as a decoder to decide which input is selected as an attentional output, and the inputs of the pointer network are the independent object-level image features encoded by an object detector.

There are also multi-level attention models. Chen et al. [22] proposed SCA-CNN, which incorporates spatial and channel-wise attention in a CNN to dynamically modulate sentence generation in multi-layer feature maps. Chen et al. [23] considered co-occurrence dependencies among attributes before generating the caption. Their attention model provides two types of image features: region-based features for the inference module and attribute-based features for the generation module. Yao et al. [24] analyzed an image at three levels (pixel-level, region-level, and object-level) to reach a thorough image understanding for captioning. Pan et al. [25] proposed an X-Linear attention block to exploit both spatial and channel-wise bilinear attention distributions. This block captures the second-order interactions between the input single-modal or multi-modal features. Experiments showed that multi-modal inputs performed better than single-modal inputs in attention models.

The existing attention models in the Encoder-Decoder framework decode every word from image features, including prepositions, auxiliary words, and conjunctions. Moreover, the attention weights change dynamically at each decoding step, which leads to mistakes in judging which objects are salient. In this paper, we propose a novel visual semantic attention model to determine what to describe before generating sentences. The VSAM extracts visual keywords from an image based on both feature-driven visual attention and knowledge-driven semantic attention, considering the complex relations (semantic relations, positional relations, and attention order) between objects.

III. VISUAL SEMANTIC ATTENTION MODEL
In this paper, the words extracted from visual concepts are defined as visual words. Among them, the words that appear in the caption and correspond to the focused objects in an image are called visual keywords. A visual semantic attention model is proposed to obtain the visual keywords. It includes two parts: an object detector, which simulates the process by which humans extract object features from an image and is essentially feature-driven attention; and an optimized pointer network, which models the process by which humans select the most relevant and representative words among all visual words and is essentially knowledge-driven attention.

A. THE FRAMEWORK OF VSAM
The framework of VSAM is shown in Fig. 4. N objects are detected by an object detector (for example, a pre-trained Faster R-CNN). As shown in Fig. 5, the object features include labels, feature maps, and coordinates. For region $r_i$ ($i \in (1, N)$), a location vector $V_i^{l} \in \mathbb{R}^{1 \times 200}$ is obtained by projecting the region coordinates $[x_{min}, y_{min}, x_{max}, y_{max}]$ to m dimensions. An object visual feature vector $V_i^{P} \in \mathbb{R}^{1 \times 2048}$ is obtained by projecting the pooled feature maps of the RoI align layer in Faster R-CNN. A word embedding vector $V_i^{g} \in \mathbb{R}^{1 \times 300}$ is the GloVe vector [26] of the label. The location vector and the visual feature vector together represent the visual concepts of an object, while the word embedding vector represents the semantic concepts of an object. The three vectors are concatenated into an object-level visual semantic feature vector $V_i = [V_i^{P}; V_i^{l}; V_i^{g}]$.

Visual keywords of an image are identified from the caption by the keyword extractor. In the keyword extractor, the Stanford lemmatization toolbox [27] is used to obtain the base form of the words in a sentence. The base form of each word in an annotation is matched with the base form of the words in a visual keyword vocabulary (detailed in [Link].A). The inference from detected objects to described objects is learned by an optimized pointer network (OPN).

B. THE PRINCIPLE OF VISUAL KEYWORD GENERATION
Given an image I, VSAM generates the visual keywords $k = \{k_1, \cdots, k_T\}$ (T is the number of visual keywords), corresponding to a subset of salient objects. Following the standard supervised learning paradigm, we find the parameter $\theta^{*}$ of VSAM by maximizing the likelihood of the correct visual keywords:

$$\theta^{*} = \arg\max_{\theta} \sum_{(I,k)} \log p(k \mid I; \theta) \tag{1}$$

Using the chain rule, the joint probability distribution can be decomposed over a sequence of tokens:

$$p(k \mid I; \theta) = \prod_{t=1}^{T} p(k_t \mid k_{1:t-1}, I; \theta) \tag{2}$$

The visual keywords are chosen from the object-level visual semantic feature vectors $V_i$ by the OPN. Thus, the probability is decomposed as:

$$p(k_t \mid k_{1:t-1}, I; \theta) = p(k_t \mid k_{1:t-1}, V_i; \theta^{3})\, p(V_i \mid I; \theta^{1}, \theta^{2}) \tag{3}$$

where $\theta^{1}$ is the parameter of Faster R-CNN, $\theta^{2}$ is the parameter of the embedding layers that embed the detection results of Faster R-CNN into the visual semantic feature vectors $V_i$, and $\theta^{3}$ is the parameter of the OPN.
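As a rough illustration of two steps described in this section, the sketch below concatenates the three per-object vectors into one object-level feature vector and evaluates the chain-rule decomposition of Eq. (2) in log form. The vector sizes (2048, 200, 300) come from the text; the function names, the plain-list representation, and the toy probabilities are assumptions made for this example and do not reflect the authors' code.

```python
import math

# Component sizes stated in the text: visual feature, location, word embedding.
VISUAL_DIM, LOCATION_DIM, WORD_DIM = 2048, 200, 300


def build_object_feature(visual, location, word):
    """Concatenate the visual, location, and word vectors of one object.

    Mirrors V_i = [V_i^P; V_i^l; V_i^g]; inputs are plain lists of floats.
    """
    sizes = (len(visual), len(location), len(word))
    if sizes != (VISUAL_DIM, LOCATION_DIM, WORD_DIM):
        raise ValueError(f"unexpected component sizes: {sizes}")
    return list(visual) + list(location) + list(word)


def sequence_log_likelihood(step_probs):
    """Log form of Eq. (2): sum over t of log p(k_t | k_{1:t-1}, I).

    step_probs[t] is the probability assigned to the correct keyword at step t.
    """
    return sum(math.log(p) for p in step_probs)


if __name__ == "__main__":
    feature = build_object_feature([0.0] * 2048, [0.0] * 200, [0.0] * 300)
    print(len(feature))  # 2048 + 200 + 300 components
    # Hypothetical per-step probabilities for a two-keyword sequence.
    print(sequence_log_likelihood([0.9, 0.6]))
```

Working in log space turns the product of Eq. (2) into a sum, which is the quantity maximized in Eq. (1).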


FIGURE 4. The framework of VSAM.

FIGURE 5. Object-level visual semantic feature vector.

FIGURE 6. OPN (the blue part represents an encoder, and the pink part represents a decoder).

The objects in a description are related to one another. For example, when a ‘traffic light’ is salient in the image, ‘car’ is likely to be chosen as a keyword in the description. In our dataset IVKD, the attention order is represented by the order of the keywords: the first keyword receives more attention than the second. The OPN is composed of an encoder and a decoder. The encoder encodes the complex relations (semantic relations, positional relations, and attention order) between objects and maps an input sequence to a hidden vector; the decoder computes the conditional probability of an output sequence. Based on the attention mechanism, it forms a pointer that selects one of the inputs as an output.

As shown in Fig. 6, the hidden state vectors of the encoder and decoder are denoted $e_i$ and $d_t$, respectively. $e_i$ and $d_t$ are computed from the previous hidden state and the object feature $V_i$ as follows:

$$e_i = \mathrm{LSTM}(W_0 V_i, e_{i-1}), \quad i \in (1, \cdots, N) \tag{4}$$

$$d_t = \mathrm{LSTM}(W_0 V_i, d_{t-1}), \quad t \in (1, \cdots, T) \tag{5}$$

where $d_0 = e_N$ and $W_0 \in \mathbb{R}^{m \times d}$.

At each step t in the decoder, the conditional probability of the visual keywords is calculated as follows:

$$u_i^{t} = \omega^{\top} \tanh(W_1 e_i + W_2 d_t) \tag{6}$$

$$P(k_t \mid k_1, \ldots, k_{t-1}, I; \theta^{3}) = \mathrm{softmax}(u^{t}) \tag{7}$$

where $W_1 \in \mathbb{R}^{d \times d}$, $W_2 \in \mathbb{R}^{d \times d}$, and $\omega \in \mathbb{R}^{d \times 1}$ are parameters to be learned. Softmax normalizes the vector $u^{t}$ into a conditional distribution whose dictionary size equals the length of the input. The visual keyword with the highest probability is the one pointed out.

C. AN OPTIMIZED POINTER NETWORK (OPN)
In the original pointer network (PN), the input sequences in a batch need to be padded with specific characters to the same length, so that the matrix operations work with Mini-Batch Gradient Descent (MBGD). The padded characters take part in the network's calculations before they are masked. This distorts the results calculated for the real elements and makes the network insensitive to the true length of the sequences.
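One decoding step of the pointer attention in Eqs. (6) and (7) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' OPN: the helper names, the list-based matrix product, and the toy weights are hypothetical, and the LSTM updates of Eqs. (4) and (5) are taken as given inputs. The point of the sketch is that the softmax has exactly one entry per input object, so sequences of different lengths need no padding.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def pointer_distribution(encoder_states, decoder_state, w1, w2, omega):
    """Eqs. (6)-(7): score every encoder state, then normalize with softmax.

    The output has len(encoder_states) entries, whatever that length is.
    """
    projected_decoder = matvec(w2, decoder_state)
    scores = []
    for state in encoder_states:
        projected_encoder = matvec(w1, state)
        hidden = [math.tanh(a + b) for a, b in zip(projected_encoder, projected_decoder)]
        scores.append(sum(w * h for w, h in zip(omega, hidden)))
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def point(encoder_states, decoder_state, w1, w2, omega):
    """Return the index of the input with the highest pointer probability."""
    probs = pointer_distribution(encoder_states, decoder_state, w1, w2, omega)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]  # toy stand-ins for W1 and W2
    states = [[0.0, 0.0], [2.0, 2.0], [-1.0, -1.0]]  # three detected objects
    print(point(states, [0.0, 0.0], identity, identity, [1.0, 1.0]))
```

Calling `point` with three or with five encoder states works unchanged, which is the property the variable-length design relies on.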


We inferred that padding the sequences to a fixed length increases the pressure on the model to learn the extraction relationship between sequences.

To avoid the effect of variable sequence length on the loss $L(\theta)$, an optimized average loss over the output sequence length is calculated for gradient descent. In VSAM, given the ground truth $k^{*}_{1:T}$ and the parameter $\theta = \{\theta^{1}, \theta^{2}, \theta^{3}\}$, the average cross entropy loss of the OPN is defined as:

$$L(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \log\left( p(k_t^{*} \mid k^{*}_{1:t-1}, V_t; \theta^{3})\, p(V_t \mid I; \theta^{1}, \theta^{2}) \right) \tag{8}$$

For the k-th batch, the MBGD implementation updates the j-th parameter $\theta_j$ with:

$$\frac{\partial \bar{L}_k(\theta)}{\partial \theta} = \frac{1}{n} \sum_{i=1}^{n} \frac{\partial L_{k,i}(\theta)}{\partial \theta} \tag{9}$$

$$\theta_j = \theta_j - \alpha \frac{\partial \bar{L}_k(\theta)}{\partial \theta} \tag{10}$$

where n is the batch size, $\alpha$ is the learning rate, $\frac{\partial L(\theta)}{\partial \theta}$ is the partial derivative of $L(\theta)$ with respect to $\theta$, and $\frac{\partial \bar{L}_k(\theta)}{\partial \theta}$ is the average gradient of the k-th batch. To avoid matrix calculation when updating the parameters with a batch of samples, the gradients of all parameters $\frac{\partial L(\theta)}{\partial \theta}$ are calculated separately for each sample. The average gradients of a batch are calculated by (9). All parameters are updated by (10).

As shown in Fig. 7, the OPN converges easily and well, and it contributes to the performance of VSAM.

FIGURE 7. Loss and accuracy of OPN in training and testing.

IV. EXPERIMENT AND ANALYSIS
A. IMAGE VISUAL KEYWORD DATASET (IVKD)
An Image Visual Keyword Dataset (IVKD) derived from MSCOCO [28] is established. In MSCOCO, each image is annotated with five manual captions, as well as object labels and bounding boxes. Following [11], in which the 80 object categories in MSCOCO are subdivided into 410 fine-grained categories, an expanded visual keyword vocabulary with 1044 fine-grained categories is established based on WordNet [29].

Five sets of visual keywords are extracted separately from the five manual annotations by a keyword extractor. Because of the limits of object detection technology and of the keyword vocabulary, some sets of visual keywords are discarded when there are no matching object labels. In the end, IVKD contains 110535 sets of visual keywords for training, and 4736 and 4831 sets for validation and testing, respectively.

B. EXPERIMENT SETTINGS
The Faster R-CNN [30] with ResNet-101 is pre-trained on MSCOCO and is not fine-tuned during training. For VSAM, an image is resized to 512 × 512 as the input of ResNet-101. In Non-Maximum Suppression (NMS) for region selection, the IoU threshold is 0.7 for region proposal suppression and 0.3 for class suppression. The dimensions of $V_i^{l}$, $V_i^{P}$, and $V_i^{g}$ are 200, 2048, and 300, respectively.

For the OPN, the dimension of the input vector is 1024. The encoder and decoder are Bi-LSTMs with hidden size 256. We use the Adam [31] optimizer with an initial learning rate of $5 \times 10^{-5}$ and anneal the learning rate by a factor of 0.8 every three epochs. The batch size is 20, and we trained for 100 epochs with early stopping.

VSAM is built on PyTorch. Experiments are conducted on a GTX 1080Ti GPU with 12G.

C. EVALUATION METHODS
Two indicators, given in (11) and (12), are used to calculate the precision and recall of VSAM. In this paper, precision is the fraction of the predicted visual keywords that are ground truth visual keywords, and recall is the fraction of all ground truth visual keywords that are predicted. Essentially, precision measures whether the described objects are salient, and recall measures whether all salient objects are described. Since existing image captioning generates only one sentence rather than a paragraph for an image, the precision of visual keywords is more important than the recall.

$$\mathrm{precision} = \frac{TP}{TP + FP} \tag{11}$$

$$\mathrm{recall} = \frac{TP}{TP + FN} \tag{12}$$

where TP (True Positive) is the number of visual keywords that are in both the predicted and the ground truth visual keywords, FP (False Positive) is the number of visual keywords that are in the predicted visual keywords but not in the ground truth, and FN (False Negative) is the number of visual keywords that are in the ground truth but not in the predicted visual keywords.

To evaluate the performance of VSAM comprehensively and objectively, we propose two evaluation methods (VKE: Visual Keywords Evaluation). In VKE I, TP, FP, and FN are each counted between the predicted set and one set of ground truth visual keywords.
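The precision and recall of Eqs. (11) and (12) can be computed from one predicted keyword set and one ground truth set, as in the sketch below. It is an illustration only: the function names are hypothetical, keywords are compared as exact strings, and returning 0.0 when a denominator is zero is a convention assumed here, since the text does not specify that case.

```python
def count_keywords(predicted, ground_truth):
    """Count TP, FP, and FN between a predicted set and one ground truth set."""
    pred, truth = set(predicted), set(ground_truth)
    tp = len(pred & truth)  # keywords in both sets
    fp = len(pred - truth)  # predicted but not in the ground truth
    fn = len(truth - pred)  # in the ground truth but not predicted
    return tp, fp, fn


def precision_recall(tp, fp, fn):
    """Eqs. (11) and (12); 0.0 for an empty denominator is an assumption."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


if __name__ == "__main__":
    # Illustrative keyword sets, not results from the paper.
    tp, fp, fn = count_keywords(["donut", "table", "baby"], ["donut", "table", "knife"])
    print(tp, fp, fn)
    print(precision_recall(tp, fp, fn))
```

Counts from several (prediction, ground truth) pairs can be added before calling `precision_recall`, which gives pooled rather than per-image scores.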


the importance and the relevance of visual keywords. For TABLE 1. Precision and recall of different models.
example, if the ground truth sets are [donut, table, knife],
[donut, table], [donut, baby], donut is the highest frequency
words, therefore donut is the most important keyword. Words
‘donut’ and ‘table’ are most likely to appear in one sentence,
therefore they are closely related.

 PN P M
TP =

 TPi,j
i=1 j=1



N P M

 P
FP = FPi,j (13)
 i=1 j=1
are exactly what the ground truth captions have. This ben-



 PN PM
FN = FNi,j efits from that the attention model in VSAM is indepen-




i=1 j=1 dent of the sentence generation. The object detection and
where N is the number of testing images, M is the number of saliency judgment are not disturbed by the generation of
ground truth sets corresponding to an image. template words. What’s more, the visual keywords gener-
Method VKE II uses a union of M visual keywords sets as ated by VSAM are more relevant in visual concepts and
a ground truth set corresponding to an image. Then TP, FP semantic concepts, such as ‘bed’ and ‘teddy bear’ in (b).
and NP are calculated as formulation (14). This benefits from the feature-driven and knowledge-driven

N
hard attention mechanism implemented by OPN. The OPN
learns human ability to select interest objects to describe.
 P P
TP = vali,key × wi,key


i=1 key∈TPS This independent knowledge-driven attention is not inclined



N

to describe all visual salient objects as what other models tend
 P P
FP = vali,key (14)

 i=1 key∈FPS to do, but considers the semantic relevance between salient
objects.


 PN P
FN = vali,key × wi,key

In Table. 2, NBT described many unfocused objects,


i=1 key∈FNS
which are not existed in ground truth annotations, such
where TPS, FPS, FNS represents a set of true positive visual as ‘chair’ in (a) and (b). AOA falsely detected objects,
keywords, a set of false positive keywords, and a set of false such as ‘dog’ instead of ’horse’ in (d), and output it as
negative keywords respectively. If a keyword belongs to the a visual keyword. Object detection inaccuracy is due to
corresponding set, val is set to 1; otherwise, it is 0. wkey that the encoder-decoder framework adopted by NBT and
represents the frequency of a keyword. This method removes AOA generated sentence directly from images feature vec-
the relevance of visual keywords in one ground truth set. tor. The inaccuracy of judging the salience of objects is
For example, hwe could i get a union set [donut, table, knife] due to that AOA and NBT align all semantic concepts
with wkey ∈ 3 , 3 , 3 from three ground truth sets [donut,
3 2 1 (including visual words and template words) with visual
table, knife], [donut, table] and [donut]. If the predicted visual concepts, which exposed the shortcomings of their attention
keyword set is [donut, table, baby], the TPS is [donut, table], the FPS is [baby], and the FNS is [knife].

D. EXPERIMENT RESULTS AND ANALYSIS
We compare the performance of VSAM with that of five different models using the two evaluation methods (detailed in [Link].C). The compared models are pre-trained models obtained from GitHub and tested on the same dataset, MSCOCO.
As shown in Table 1, under evaluation method VKE I, the precision of VSAM is 0.112∼0.126 higher than that of the other models, while its recall is close to theirs. Under VKE II, the precision of VSAM reaches 91.7%, outperforming the other five models by 0.054∼0.069, while its recall falls slightly. This is because VSAM was trained with one set of visual keywords. The experiments show that VSAM generates visual keywords more precisely.
Some samples of three different models’ outputs are shown in Table 2. We can see that VSAM extracted more precise visual keywords, such as girl and table in (a), which

models. For VSAM, limited by the visual keyword vocabulary, the incomplete extraction of keywords from captions leads to imperfect ground truth visual keywords. In addition, the regions in different rectangles overlap too much, which confuses the judgment of object saliency. For example, the ‘dog’ in (e) is described in all human captions (the ground truth is [dog, motorcycle] with weights [5/5, 5/5]), but VSAM generates the visual keyword ‘person’ instead of ‘dog’. The reason is that in the OPN the same feature vector in the input sequence can be selected only once in the output sequence. When ‘motorcycle’ is selected, ‘dog’, which overlaps too much with ‘motorcycle’, is not selected. Also, the OPN tends to output semantically relevant visual keywords, and in most cases ‘person’ is more relevant to ‘motorcycle’ than ‘dog’ is. Furthermore, when there are too many visual keywords in one image, such as (c), it is difficult for VSAM to point out every one of them, because humans generally do not describe more than four objects in one sentence.


TABLE 2. Visual keywords generated by different models (The gray ones are visual words, the others are visual keywords.).


TABLE 3. Statistics of FN, FP, and TP of VSAM (green marks good results, the greener the better; red marks bad results, the redder the worse).
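
A tally of the kind reported in Table 3 could be computed from per-image results as sketched below. This is a minimal illustration under stated assumptions: each record holds the GTS (the number of ground truth visual keywords of an image) together with that image's FN, FP, and TP counts, the field and function names are hypothetical, and the records are made-up values, not the paper's data.

```python
from collections import defaultdict


def zero_rate_by_gts(records, key):
    """For each GTS value, return the fraction of images whose `key` count is zero."""
    totals = defaultdict(int)
    zeros = defaultdict(int)
    for rec in records:
        totals[rec["gts"]] += 1
        if rec[key] == 0:
            zeros[rec["gts"]] += 1
    return {gts: zeros[gts] / totals[gts] for gts in sorted(totals)}


# Made-up per-image results, for illustration only (not the paper's data).
records = [
    {"gts": 1, "tp": 1, "fp": 0, "fn": 0},
    {"gts": 1, "tp": 0, "fp": 1, "fn": 1},
    {"gts": 2, "tp": 2, "fp": 0, "fn": 0},
    {"gts": 2, "tp": 1, "fp": 0, "fn": 1},
    {"gts": 2, "tp": 1, "fp": 1, "fn": 1},
    {"gts": 3, "tp": 2, "fp": 0, "fn": 1},
]

fn_zero = zero_rate_by_gts(records, "fn")  # share of images with no missed keyword
fp_zero = zero_rate_by_gts(records, "fp")  # share of images with no false keyword
for gts in fn_zero:
    print(gts, round(fn_zero[gts], 3), round(fp_zero[gts], 3))
```

Grouping by GTS first keeps the percentages comparable across images with different numbers of ground truth keywords.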

FIGURE 8. Performance evaluation of VSAM through VKE II.

In Table 3, the numbers of images with each FN, FP, and TP value are counted under different GTS (the number of visual keywords in the ground truth) over the testing images. 88% of the images have one or two visual keywords each; that is, in most captions


FIGURE 9. VSAM outputs with TP = 0 (The red ones represent the visual keywords generated by VSAM.)

there are no more than two visual keywords. Under GTS=1, FN equals zero for 97.4% of the images. When GTS increases to 2, 3, and 4, the percentage of images whose FN equals zero drops to 84.5%, 74.2%, and 53.5%, respectively. If GTS≤2, FP equals zero for more than 90% of the images. When GTS increases to 3 and 4, the percentage of images whose FP equals zero also drops, to 81.7% and 77.5%, respectively. Thus VSAM shows excellent performance on missed detections under GTS=1 and good performance on false detections when GTS≤2. Under GTS=1, TP is not equal to zero for 94.6% of the images. Under GTS=2, VSAM correctly generates one visual keyword in 72.4% of the images and two visual keywords in 21% of the images. When GTS increases to 3 and 4, the percentages of images whose TP reaches the ideal value of 3 or 4 are only 1.8% and 0%, respectively. Therefore, VSAM correctly generates at least one visual keyword in most cases, and the TP remains to be improved when GTS≥2.
As shown in Fig. 8, 100 samples are randomly selected from the testing images. The positive numbers represent true predictions, and the negative numbers represent false predictions. The numbers are not integers because of the weight w_key in VKE II. The FP and FN of most samples are both equal to zero, which is why VSAM achieves high precision and recall. We also analyze the images whose TP equals zero, as shown in Fig. 9. We found that VSAM fails to generate visual keywords from extreme long shots, on which the object detector does not work well. VSAM is built on object detection and tries to determine what to describe, whereas for extreme long shots humans prefer to describe the location, the weather, the scene, the feeling, etc.
In summary, VSAM achieves much higher precision in visual keyword generation than all five compared models, especially when there are no more than two visual keywords in long/medium/close shots. It is less robust when one image contains too many objects or when the image is an extreme long shot.

V. CONCLUSION
Image caption is an interdisciplinary research topic spanning computer vision and natural language processing. Before translating the visual content into a human-readable sentence, the computer has to detect the image features, recognize the objects, and determine which objects to describe. The attention mechanism is useful for a more accurate correspondence. However, the existing attention models are built with the decoder, and the focused content changes dynamically with the generated words. As a result, in many cases the salient contents are not described in the caption, or the objects described are not the salient ones.
To generate an accurate and valuable caption for an image and to better align visual concepts with semantic concepts, we proposed the concept of visual keywords. A visual semantic attention model based on an optimized pointer network is proposed to extract visual keywords from an image. We also proposed two evaluation methods, which consider the importance and relevance of the visual keywords, to evaluate the precision and recall of attention models. The experiments show that both the precision and the recall of visual keyword generation by the proposed model VSAM are above or close to 90%.
It is believed that the precision of VSAM could be improved further as the object detector becomes more robust and as the categories of visual keywords in the vocabulary become rich enough. Expanding the visual keyword vocabulary not only allows more comprehensive visual keywords to be extracted from an image but also expands the categories of object detection. When objects overlap too much, VSAM pays more attention to semantic relevance than to the salient overlapping objects. Pixel-based labeling in object detection would improve the saliency judgment, which is currently dragged down by object area coverage.
In the future, VSAM is expected to contribute to a novel image caption framework for generating more accurate and abundant sentences. We are also working to generate One Sentence Photo News Reporting for an emergency. It is hoped that the research will arouse the


interest and attention in the technical application of image caption.

SUYA ZHANG was born in Henan, China, in 1995. She received the B.S. degree in broadcasting and television engineering from the Communication University of China, Beijing, China, in 2018, where she is currently pursuing the degree with the School of Information and Communication Engineering.
From 2017 to 2018, she had her researches in the college student innovation training program Research and Design of Intelligent Data Journalism. Since 2018, she has been a member of the State Key Laboratory of Media Convergence and Communication, Communication University of China. She got an opportunity to be an Intern with the Department of AI-Laboratory, Xinhua Zhiyun Ltd., Company, in 2020, and worked on the algorithm optimization of optical character recognition and harvesting raw tables from infographics. Her research interests include computer vision and neural language processing.
Ms. Zhang received the Zhongshi Guangxin Motivational Scholarship, School Motivational Scholarship during undergraduate education, the third-class scholarship during postgraduate education, and the overall second prize in the Competition on Harvesting Raw Tables from Infographics in 2020, 25th International Conference on Pattern Recognition.


YANA ZHANG was born in Hangzhou, Zhejiang, China, in 1980. She received the Ph.D. degree in information and communication engineering from the Communication University of China, Beijing, in 2013.
She was a Visiting Scholar with the Department of Electrical and Computer Science, University of California at San Diego, San Diego, CA, USA, in 2014. Since 2015, she has been an Associate Professor with the Department of Broadcasting and Television Engineering, Communication University of China. She is the author of two books, more than 30 papers, and more than four inventions. She holds two patents. Her research interests include intelligent video processing and media information security. She is a member of China DRM. She is a Reviewer of the journal China Communications.
Dr. Zhang was a recipient of the second prize Science and Technology Innovation Award of SARFT, in 2007, and the second and third rank of Film and Television Science and Technology Best Paper Award, in 2017. She received the title of Famous Teacher Award of Communication University of China, in 2020.

ZEYU CHEN was born in Hubei, Wuhan, China, in 1997. He received the B.E. degree from the Communication University of China, Beijing, China, in 2019, where he is currently pursuing the M.S. degree with the School of Information and Communication Engineering. He is a member of the State Key Laboratory of Media Convergence and Communication, Communication University of China. He is studying in camera motion detection and intelligent classification of shot scale. His research interests include intelligent video editing and computer vision. He received the second-class scholarship during the postgraduate education.

ZHAOHUI LI was born in Henan, Pingding Shan, China, in 1969. She received the B.S. degree in electrical engineering from Shandong University, Shandong, China, in 1987, and the M.S. degree in electrical engineering from the Beijing Institute of Technology, Beijing, China, in 2002, and the Ph.D. degree in electrical engineering from the Communication University of China, Beijing, in 2008.
Since 2008, she has been an Associate Professor with the Department of Broadcasting and Television Engineering, Communication University of China. Her research interests include video coding optimization-based on visual perception, video forgery detection, and video source identification.
Dr. Li received several awards and honors, including the second prize and the third prize of Science and Technology Innovation of the State Administration of Radio, Film and Television and the second prize of Excellent Achievement Project of Communication University of China.
