Visual Keyword Generation for Image Caption
ABSTRACT Image caption aims to understand and describe visual content, and is expected to be
applied to automatic news reporting in the future. In recent years, there has been increasing interest in the
Encoder-Decoder framework for image caption: the encoder is responsible for visual semantic
comprehension and the decoder is designed for sentence generation. In the Encoder-Decoder framework, the
translation is based on the correspondence between image feature vectors and caption vectors, and an attention
mechanism helps make this correspondence more accurate. However, the attention model works with
the decoder, and the focused content changes dynamically with each generated word. As a result, in many
cases the salient contents are not described in the caption, or the objects described are not the salient ones.
To improve the precision of image caption, to bridge the gap between image understanding and sentence
generation in the Encoder-Decoder framework, and to better align visual information with semantic information,
we propose the concept of the visual keyword as a gangplank between seeing and saying. This paper
presents an image dataset derived from MSCOCO as the first collection of visual keywords: the Image Visual
Keyword Dataset (IVKD). A Visual Semantic Attention Model (VSAM) is also proposed to obtain visual
keywords for generating the annotation. In VSAM, object-level visual features are extracted by an object
detector after pre-training on IVKD. The object features are then fed into an Optimized Pointer Network (OPN)
to generate visual keywords. The experiments show that the precision of visual keyword generation reaches
91.7% with the proposed model VSAM.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see [Link]
27638 VOLUME 9, 2021
S. Zhang et al.: VSAM-Based Visual Keyword Generation for Image Caption
FIGURE 2. Image caption based on dynamic attention mechanism. The value on a bounding box is the probability that the object appears in a
ground truth caption. Captions are generated by Neural Baby Talk [11].
Li et al. [21] introduced an object-level attention model to address the problems of missing and mispredicted objects. They obtained global pixel-level features and local object-level features using deep CNNs, then integrated the local object features with the global image features through soft attention. The global feature is a 4096-dimensional vector extracted from the fc7 layer of VGG16-net, and the local object-level feature is a 4096-dimensional vector extracted from Faster R-CNN. To generate natural and accurate captions that are better grounded in images, Lu et al. [11] combined the bottom-up and top-down attention mechanisms with an adaptive object-level attention model. As shown in Fig. 3(c), the model generates a template with slot locations, then fills visual concepts into the slots. Visual concepts are chosen from object-level features by a pointer network. The focused objects change dynamically as the sentence is generated, and all words in the output sentence rely on visual features. The pointer network in [11] is only applied as a decoder to decide which input is selected as an attentional output, and the inputs of the pointer network are the independent object-level image features encoded by an object detector.

There are also multi-level attention models. Chen et al. [22] proposed SCA-CNN, which incorporates Spatial and Channel-wise Attentions in a CNN to dynamically modulate the sentence generation in multi-layer feature maps. Chen et al. [23] considered co-occurrence dependencies among attributes before generating the caption. Their attention model provides two types of image features: region-based features for the inference module and attribute-based features for the generation module. Yao et al. [24] analyzed an image at three levels (pixel level, region level and object level) to reach a thorough image understanding for captioning. Pan et al. [25] proposed an X-Linear attention block to exploit both spatial and channel-wise bilinear attention distributions. This block captures the second-order interactions between the input single-modal or multi-modal features. Experiments showed that multi-modal inputs performed better than single-modal inputs in attention models.

The existing attention models in the Encoder-Decoder framework decode every word from image features, including prepositions, auxiliary words, and conjunctions. Moreover, the attention weights change dynamically at each step of decoding, which leads to mistakes in judging which objects are salient. In this paper, we propose a novel visual semantic attention model to determine what to describe before generating sentences. The VSAM extracts visual keywords from an image based on both feature-driven visual attention and knowledge-driven semantic attention, considering the complex relations (semantic relations, positional relations and attention order) between objects.

III. VISUAL SEMANTIC ATTENTION MODEL
In this paper, the words extracted from visual concepts are defined as visual words, among which the ones that appear in the caption and correspond to the focused/concerned objects in an image are called visual keywords. A visual semantic attention model is proposed to obtain the visual keywords. It includes two parts: an object detector that simulates the process by which humans extract object features from an image, which is essentially feature-driven attention; and an optimized pointer network that models the process by which humans select the most concerned and representative ones among all visual words, which is essentially knowledge-driven attention.

A. THE FRAMEWORK OF VSAM
The framework of VSAM is shown in Fig. 4. $N$ objects are detected by an object detector (for example, a pre-trained Faster R-CNN). As shown in Fig. 5, the object features include labels, feature maps and coordinates. In region $r_i$ ($i \in [1, N]$), a location vector $V_i^l \in \mathbb{R}^{1 \times 200}$ is obtained by projecting the region coordinates $[x_{min}, y_{min}, x_{max}, y_{max}]$ to $m$ dimensions. An object visual feature vector $V_i^P \in \mathbb{R}^{1 \times 2048}$ is obtained by projecting the pooled feature maps of the RoI align layer in Faster R-CNN. A word embedding vector $V_i^g \in \mathbb{R}^{1 \times 300}$ is the GloVe vector [26] of the label. The location vector and the visual feature vector together represent the visual concepts of an object, while the word embedding vector represents the semantic concepts of an object. The three vectors are concatenated into an object-level visual semantic feature vector $V_i = [V_i^P; V_i^l; V_i^g]$.

Visual keywords of an image are identified from the caption by the keyword extractor. In the keyword extractor, the Stanford lemmatization toolbox [27] is used to obtain the base form of the words in a sentence. The base form of each word in an annotation is matched with the base form of the words in a visual keyword vocabulary (detailed in [Link].A). The inference from detected objects to described objects is learned by an optimized pointer network (OPN).

B. THE PRINCIPLE OF VISUAL KEYWORD GENERATION
Given an image $I$, VSAM generates the visual keywords $k = \{k_1, \cdots, k_T\}$ ($T$ is the number of visual keywords), corresponding to a subset of salient objects. Following the standard supervised learning paradigm, we find the parameter $\theta^{*}$ of VSAM by maximizing the likelihood of the correct visual keywords:

$$\theta^{*} = \arg\max_{\theta} \sum_{(I,k)} \log p(k \mid I; \theta) \tag{1}$$

Using the chain rule, the joint probability distribution can be decomposed over a sequence of tokens:

$$p(k \mid I; \theta) = \prod_{t=1}^{T} p(k_t \mid k_{1:t-1}, I; \theta) \tag{2}$$

The visual keywords are chosen from the object-level visual semantic feature vectors $V_i$ by the OPN. Thus, the probability is decomposed as:

$$p(k_t \mid k_{1:t-1}, I; \theta) = p(k_t \mid k_{1:t-1}, V_i; \theta_3)\, p(V_i \mid I; \theta_1, \theta_2) \tag{3}$$

where $\theta_1$ is the parameter of Faster R-CNN, $\theta_2$ is the parameter of the embedding layers that embed the detection results of Faster R-CNN into the visual semantic feature vectors $V_i$, and $\theta_3$ is the parameter of the OPN.
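The feature construction of Section III-A and the chain-rule factorization of Eq. (2) can be illustrated with a small sketch. It is only an illustration under stated assumptions: the function names `build_object_feature` and `keyword_log_likelihood` are hypothetical, the feature values are dummy zeros, and the per-step probabilities are made-up numbers rather than outputs of the paper's model.

```python
import math

# Dimensions taken from the text; the feature values below are dummy placeholders.
VISUAL_DIM, LOCATION_DIM, GLOVE_DIM = 2048, 200, 300


def build_object_feature(visual, location, glove):
    """Concatenate V_i^P, V_i^l and V_i^g into one object-level vector V_i."""
    assert len(visual) == VISUAL_DIM
    assert len(location) == LOCATION_DIM
    assert len(glove) == GLOVE_DIM
    return list(visual) + list(location) + list(glove)


def keyword_log_likelihood(step_distributions, target_indices):
    """Chain rule of Eq. (2): log p(k | I) = sum_t log p(k_t | k_{1:t-1}, I).

    step_distributions[t][i] is the probability that step t points at object i.
    target_indices[t] is the index of the ground-truth keyword at step t.
    """
    return sum(
        math.log(dist[idx]) for dist, idx in zip(step_distributions, target_indices)
    )


# One object: 2048 + 200 + 300 = 2548 dimensions.
v_i = build_object_feature([0.0] * VISUAL_DIM, [0.0] * LOCATION_DIM, [0.0] * GLOVE_DIM)

# Two decoding steps over three detected objects (hypothetical probabilities).
dists = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]]
log_p = keyword_log_likelihood(dists, [0, 1])  # log(0.7) + log(0.6)
```

Maximizing the sum of such log-likelihoods over all training pairs $(I, k)$ corresponds to the objective in Eq. (1).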
There is relevance among the objects in a description; for example, when a "traffic light" is salient in the image, "car" is likely to be chosen as a keyword in the description. In our dataset IVKD, the attention order is represented by the order of the keywords: the first keyword receives more attention than the second. The OPN is composed of an encoder and a decoder: the encoder encodes the complex relations (semantic relations, positional relations and attention order) between objects and maps an input sequence to a hidden vector; the decoder computes the conditional probability of an output sequence. It uses a pointer to select one of the inputs as an output, based on the attention mechanism.
As shown in Fig. 6, the hidden state vectors of the encoder and the decoder are defined as $e_i$ and $d_t$ respectively. $e_i$ and $d_t$
where $W_1 \in \mathbb{R}^{d \times d}$, $W_2 \in \mathbb{R}^{d \times d}$ and $\omega \in \mathbb{R}^{d \times 1}$ are parameters to be learned. Softmax normalizes the vector $u_t$ into a conditional distribution whose dictionary size equals the length of the input. The visual keyword pointed out is the input with the highest probability.

C. AN OPTIMIZED POINTER NETWORK (OPN)
In the original PN, the input sequences in a batch need to be padded with specific characters to the same dimension so that the matrix operations work with Mini-Batch Gradient Descent (MBGD). The padded characters participate in the calculation of the network before being masked. This causes deviation in the calculation results of the real elements and makes the network insensitive to the true length of the sequences. We inferred
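The pointer step and the padding issue described above can be sketched as follows. This is not the paper's OPN, whose details are not available in this excerpt. It is a generic pointer step that assumes the common additive scoring form $u_t^i = \omega^{T} \tanh(W_1 e_i + W_2 d_t)$ and masks padded positions before the softmax, so that they receive zero probability. The names `matvec`, `pointer_distribution` and `valid_length` are hypothetical, and `valid_length` is assumed to be at least 1.

```python
import math


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def pointer_distribution(encoder_states, decoder_state, w1, w2, omega, valid_length):
    """Return a distribution over input positions for one decoding step.

    Assumed scoring form: u_i = omega^T tanh(W1 e_i + W2 d_t).
    Positions at or beyond valid_length are padding and get probability 0.
    """
    projected_decoder = matvec(w2, decoder_state)
    scores = []
    for position, e_i in enumerate(encoder_states):
        if position >= valid_length:
            scores.append(-math.inf)  # mask padding before the softmax
            continue
        hidden = [math.tanh(a + b) for a, b in zip(matvec(w1, e_i), projected_decoder)]
        scores.append(sum(o * h for o, h in zip(omega, hidden)))
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # exp(-inf) == 0.0
    total = sum(exps)
    return [x / total for x in exps]


# Toy example with d = 2 and three positions, the last one being padding.
identity = [[1.0, 0.0], [0.0, 1.0]]
encoder_states = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]
probs = pointer_distribution(
    encoder_states, [0.5, 0.0], identity, identity, [1.0, 1.0], valid_length=2
)
pointed = max(range(len(probs)), key=probs.__getitem__)
```

Because the padded position is excluded before normalization, its large dummy values cannot shift the probabilities of the real inputs, which is the kind of deviation the text attributes to padding in the original PN.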
the importance and the relevance of visual keywords. For example, if the ground truth sets are [donut, table, knife], [donut, table] and [donut, baby], donut is the most frequent word and is therefore the most important keyword. The words 'donut' and 'table' are the most likely to appear in one sentence, so they are closely related.

$$TP = \sum_{i=1}^{N} \sum_{j=1}^{M} TP_{i,j}, \qquad FP = \sum_{i=1}^{N} \sum_{j=1}^{M} FP_{i,j}, \qquad FN = \sum_{i=1}^{N} \sum_{j=1}^{M} FN_{i,j} \tag{13}$$

where $N$ is the number of testing images and $M$ is the number of ground truth sets corresponding to an image.
Method VKE II uses the union of the $M$ visual keyword sets as the ground truth set corresponding to an image. Then TP, FP and FN are calculated as in formulation (14).

$$TP = \sum_{i=1}^{N} \sum_{key \in TPS} val_{i,key} \times w_{i,key}, \qquad FP = \sum_{i=1}^{N} \sum_{key \in FPS} val_{i,key}, \qquad FN = \sum_{i=1}^{N} \sum_{key \in FNS} val_{i,key} \times w_{i,key} \tag{14}$$

where TPS, FPS and FNS represent the set of true positive visual keywords, the set of false positive keywords, and the set of false negative keywords respectively. If a keyword belongs to the corresponding set, $val$ is set to 1; otherwise, it is 0. $w_{key}$ (indexed as $w_{i,key}$ for image $i$) represents the frequency of a keyword. This method removes the relevance of visual keywords within one ground truth set. For example, we could get a union set [donut, table, knife] with $w_{key} \in [\frac{3}{3}, \frac{2}{3}, \frac{1}{3}]$ from three ground truth sets [donut, table, knife], [donut, table] and [donut]. If the predicted visual keyword set is [donut, table, baby], the TPS is [donut, table], the FPS is [baby], and the FNS is [knife].

D. EXPERIMENT RESULTS AND ANALYSIS
We compare the performance of VSAM with that of five different models through two evaluation methods (detailed in [Link].C). They are pre-trained models obtained from Github and tested on the same dataset, MSCOCO.
As shown in Table 1, when using evaluation method VKE I, the precision of VSAM is 0.112∼0.126 better than that of the others, while the recall of VSAM is close to the others'. When evaluated through VKE II, the precision of VSAM reaches 91.7%, which outperforms the other five models by 0.054∼0.069. The recall of VSAM falls a little. This is because VSAM was trained with one set of visual keywords. The experiments show that VSAM can generate visual keywords more precisely.

TABLE 1. Precision and recall of different models.

Some samples of the outputs of three different models are shown in Table 2. We can see that VSAM extracted more precise visual keywords, such as girl and table in (a), which are exactly what the ground truth captions have. This is because the attention model in VSAM is independent of sentence generation: object detection and saliency judgment are not disturbed by the generation of template words. Moreover, the visual keywords generated by VSAM are more relevant in both visual concepts and semantic concepts, such as 'bed' and 'teddy bear' in (b). This benefits from the feature-driven and knowledge-driven hard attention mechanism implemented by the OPN. The OPN learns the human ability to select objects of interest to describe. This independent knowledge-driven attention is not inclined to describe all visually salient objects, as other models tend to do, but considers the semantic relevance between salient objects.
In Table 2, NBT described many unfocused objects that do not exist in the ground truth annotations, such as 'chair' in (a) and (b). AOA falsely detected objects, such as 'dog' instead of 'horse' in (d), and output them as visual keywords. The object detection inaccuracy arises because the encoder-decoder framework adopted by NBT and AOA generates sentences directly from image feature vectors. The inaccuracy in judging the salience of objects arises because AOA and NBT align all semantic concepts (including visual words and template words) with visual concepts, which exposes the shortcomings of their attention models.
For VSAM, limited by the visual keyword vocabulary, the incomplete extraction of keywords from captions leads to imperfect ground truth visual keywords. In addition, the regions in different rectangles overlap too much, which confuses the saliency judgment of objects. For example, the 'dog' in (e) is described in all human captions (the ground truth is [dog, motorcycle] with weights $[\frac{5}{5}, \frac{5}{5}]$), but VSAM generates the visual keyword 'person' instead of 'dog'. The reason is that in the OPN the same feature vector in the input sequence can be selected only once in the output sequence. When the 'motorcycle' is selected, the 'dog', which overlaps too much with the 'motorcycle', is not selected. Also, the OPN tends to output semantically relevant visual keywords, and in most cases 'person' is more relevant to 'motorcycle' than 'dog' is. Finally, when there are too many visual keywords in one image, such as (c), it is difficult for VSAM to point out every one of them, because humans generally do not describe more than four objects in one sentence.
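The VKE II counting of formulation (14) can be illustrated with the donut example from Section C. The sketch below handles a single image and makes two assumptions: the weight of a keyword is its frequency across the $M$ ground truth sets divided by $M$ (as the weights 3/3, 2/3, 1/3 in the example suggest), and precision and recall follow the usual definitions TP/(TP+FP) and TP/(TP+FN). The function name `vke2_counts` is hypothetical.

```python
from collections import Counter


def vke2_counts(ground_truth_sets, predicted):
    """Weighted TP, FP, FN for one image under VKE II (formulation (14))."""
    m = len(ground_truth_sets)
    # Count in how many ground truth sets each keyword appears.
    freq = Counter(k for gt in ground_truth_sets for k in set(gt))
    weights = {k: c / m for k, c in freq.items()}  # w_key = frequency / M
    predicted = set(predicted)
    tp = sum(w for k, w in weights.items() if k in predicted)
    fp = float(sum(1 for k in predicted if k not in weights))  # unweighted
    fn = sum(w for k, w in weights.items() if k not in predicted)
    return tp, fp, fn


gts = [["donut", "table", "knife"], ["donut", "table"], ["donut"]]
tp, fp, fn = vke2_counts(gts, ["donut", "table", "baby"])
precision = tp / (tp + fp)
recall = tp / (tp + fn)
```

Under these assumptions the example gives TP = 5/3, FP = 1 and FN = 1/3, so the precision is 0.625 and the recall is about 0.833. The non-integer counts are a direct consequence of the keyword weights.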
TABLE 2. Visual keywords generated by different models (The gray ones are visual words, the others are visual keywords.).
TABLE 3. Statistics of FN, FP and TP of VSAM (the green ones are good results, the greener the better, while the red ones are bad results, the redder the
worse).
In Table 3, the numbers of images with each FN, FP and TP value are counted under different GTS (the quantity of visual keywords in the ground truth) over the testing images. 88% of the images have one or two visual keywords; that is, in most captions there are no more than two visual keywords. Under GTS=1, 97.4% of the images have FN equal to zero. When GTS increases to 2, 3, and 4, the percentage of images whose FN equals zero drops to 84.5%, 74.2% and 53.5% respectively. If GTS≤2, more than 90% of the images have FP equal to zero. When GTS increases to 3 and 4, the percentage of images whose FP equals zero also drops, to 81.7% and 77.5% respectively. Under GTS=1, VSAM shows excellent performance in missed detection testing. VSAM shows good performance in false detection testing when GTS≤2. Under GTS=1, 94.6% of the images have a nonzero TP. Under GTS=2, VSAM correctly generates one visual keyword in 72.4% of the images and two visual keywords in 21% of the images. When GTS increases to 3 and 4, the percentages of images whose TP reaches the ideal value of 3 or 4 are only 1.8% and 0% respectively. Therefore, VSAM can correctly generate at least one visual keyword in most cases, and it is hoped that the TP can be improved when GTS≥2.
As shown in Fig. 8, 100 samples are randomly selected from the testing images. The positive numbers represent true predictions, and the negative numbers represent false predictions. The numbers are not integers because of the $w_{key}$ in VKE II. The FP and FN of most samples are both equal to zero, which is why VSAM achieves high precision and recall. Here we analyze what happens in those images whose TP equals zero, as shown in Fig. 9. We found that VSAM fails to generate visual keywords from extreme long shots, on which the object detector does not work well. VSAM is built on object detection and tries to determine what to describe, while in extreme long shots humans prefer to describe the location, the weather, the scene, the feeling, etc.

FIGURE 9. VSAM outputs with TP = 0 (The red ones represent the visual keywords generated by VSAM.)

In summary, VSAM performs much better in the precision of visual keyword generation than all five compared models, especially when there are no more than two visual keywords in long/medium/close shots. It is less robust when an image contains too many objects, or in extreme long shots.

V. CONCLUSION
Image caption is an interdisciplinary research area of computer vision and natural language processing. Before translating the visual content into a human-readable sentence, the computer has to detect the image features, recognize the objects and determine which objects to describe. An attention mechanism helps make the correspondence more accurate. However, the existing attention models are built with the decoder, and the focused content changes dynamically with the generated words. As a result, in many cases the salient contents are not described in the caption, or the objects described are not the salient ones.
To generate accurate and valuable captions for an image and to better align visual concepts with semantic concepts, we proposed the concept of visual keywords. A visual semantic attention model based on an optimized pointer network is proposed to extract visual keywords from an image. We also proposed two evaluation methods, considering the importance and relevance of the visual keywords, to evaluate the precision and recall of attention models. The experiments show that both the precision and the recall of visual keyword generation are above or close to 90% with the proposed model VSAM.
It is believed that the precision of VSAM could be improved further as the object detector becomes more robust and the categories of visual keywords in the vocabulary become richer. Expanding the visual keyword vocabulary not only allows more comprehensive visual keywords to be extracted from an image but also expands the categories of object detection. When objects overlap too much, VSAM pays more attention to semantic relevance than to the salient overlapping objects. Pixel-based labeling in object detection will improve the saliency judgment that is currently dragged down by object area coverage.
In the future, VSAM is expected to contribute to a novel image caption framework for generating more accurate and abundant sentences. We are also working to generate One Sentence Photo News Reporting for emergencies. It is hoped that the research will arouse the
interest and attention in the technical application of image caption.

REFERENCES
[1] A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth, "Every picture tells a story: Generating sentences from images," in Proc. ECCV, Crete, Greece, 2010, pp. 15–29.
[2] G. Kulkarni, V. Premraj, V. Ordonez, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg, "Baby talk: Understanding and generating simple image descriptions," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 12, pp. 2891–2903, Dec. 2013.
[3] Y. Yang, C. L. Teo, H. Daumé, III, and Y. Aloimonos, "Corpus-guided sentence generation of natural images," in Proc. EMNLP, Edinburgh, Scotland, Jul. 2011, pp. 444–454.
[4] P. Kuznetsova, V. Ordonez, T. L. Berg, and Y. Choi, "TreeTalk: Composition and compression of trees for image descriptions," Trans. Assoc. Comput. Linguistics, vol. 2, pp. 351–362, Dec. 2014.
[5] M. Mitchell, J. Dodge, A. Goyal, K. Yamaguchi, K. Stratos, X. Han, A. Mensch, A. Berg, T. Berg, and H. Daumé, III, "Midge: Generating image descriptions from computer vision detections," in Proc. EACL, 2012, pp. 747–756.
[6] D. Elliott and F. Keller, "Image description using visual dependency representations," in Proc. EMNLP, Seattle, WA, USA, Oct. 2013, pp. 1292–1302.
[7] D. Elliott and A. de Vries, "Describing images using inferred visual dependency representations," in Proc. ACL, IJCNLP, Beijing, China, Jul. 2015, pp. 42–52.
[8] S. Li, G. Kulkarni, T. L. Berg, A. C. Berg, and Y. Choi, "Composing simple image descriptions using Web-scale N-grams," in Proc. CoNLL, Portland, OR, USA, Jun. 2011, pp. 220–228.
[9] H. Fang, S. Gupta, F. Iandola, R. K. Srivastava, L. Deng, P. Dollar, J. Gao, X. He, M. Mitchell, J. C. Platt, C. L. Zitnick, and G. Zweig, "From captions to visual concepts and back," in Proc. CVPR, Jun. 2015, pp. 1473–1482, doi: 10.1109/CVPR.2015.7298754.
[10] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, "Show and tell: A neural image caption generator," in Proc. CVPR, Jun. 2015, pp. 3156–3164.
[11] J. Lu, J. Yang, D. Batra, and D. Parikh, "Neural baby talk," in Proc. CVPR, Jun. 2018, pp. 7219–7228.
[12] S. Venugopalan, L. A. Hendricks, M. Rohrbach, R. Mooney, T. Darrell, and K. Saenko, "Captioning images with diverse objects," in Proc. CVPR, Jul. 2017, pp. 5753–5761.
[13] L. Ke, W. Pei, R. Li, X. Shen, and Y.-W. Tai, "Reflective decoding network for image captioning," in Proc. ICCV, Oct. 2019, pp. 8888–8897.
[14] T. J. Buschman and E. K. Miller, "Top-down versus bottom-up control of attention in the prefrontal and posterior parietal cortices," Science, vol. 315, no. 5820, pp. 1860–1862, Mar. 2007.
[15] M. Corbetta and G. L. Shulman, "Control of goal-directed and stimulus-driven attention in the brain," Nature Rev. Neurosci., vol. 3, no. 3, pp. 201–215, Mar. 2002.
[16] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio, "Show, attend and tell: Neural image caption generation with visual attention," in Proc. ICML, 2015, pp. 2048–2057.
[17] J. Lu, C. Xiong, D. Parikh, and R. Socher, "Knowing when to look: Adaptive attention via a visual sentinel for image captioning," in Proc. CVPR, Jul. 2017, pp. 375–383, doi: 10.1109/CVPR.2017.345.
[18] Z. Yang, Y. Yuan, Y. Wu, R. Salakhutdinov, and W. W. Cohen, "Review networks for caption generation," in Proc. NIPS, 2016, pp. 2361–2369.
[19] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang, "Bottom-up and top-down attention for image captioning and visual question answering," in Proc. CVPR, Jun. 2018, pp. 6077–6086, doi: 10.1109/CVPR.2018.00636.
[20] X. Li, S. Jiang, and J. Han, "Learning object context for dense captioning," in Proc. AAAI, Jul. 2019, pp. 8650–8657, doi: 10.1609/aaai.v33i01.33018650.
[21] L. Li, S. Tang, L. Deng, Y. Zhang, and Q. Tian, "Image caption with global-local attention," in Proc. AAAI, San Francisco, CA, USA, 2017, pp. 4133–4139.
[22] L. Chen, H. Zhang, J. Xiao, L. Nie, J. Shao, W. Liu, and T.-S. Chua, "SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning," in Proc. CVPR, Jul. 2017, pp. 5659–5667, doi: 10.1109/CVPR.2017.667.
[23] H. Chen, G. Ding, Z. Lin, S. Zhao, and J. Han, "Show, observe and tell: Attribute-driven attention model for image captioning," in Proc. IJCAI, Jul. 2018, pp. 606–612, doi: 10.24963/ijcai.2018/84.
[24] T. Yao, Y. Pan, Y. Li, and T. Mei, "Hierarchy parsing for image captioning," in Proc. ICCV, Oct. 2019, pp. 2621–2629.
[25] Y. Pan, T. Yao, Y. Li, and T. Mei, "X-linear attention networks for image captioning," in Proc. CVPR, Jun. 2020, pp. 10971–10980.
[26] J. Pennington, R. Socher, and C. Manning, "Glove: Global vectors for word representation," in Proc. EMNLP, 2014, pp. 1532–1543.
[27] C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard, and D. McClosky, "The Stanford CoreNLP natural language processing toolkit," in Proc. ACL, 2014, pp. 55–60.
[28] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in Proc. ECCV, vol. 8693, 2014, pp. 740–755, doi: 10.1007/978-3-319-10602-1_48.
[29] G. A. Miller, "WordNet: A lexical database for English," Commun. ACM, vol. 38, no. 11, pp. 39–41, 1995. [Online]. Available: [Link] [Link]/, doi: 10.1145/219717.219748.
[30] J. Yang, J. Lu, D. Batra, and D. Parikh. (Jul. 2017). A Faster PyTorch Implementation of Faster R-CNN. [Online]. Available: [Link] com/jwyang/[Link]
[31] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proc. ICLR, 2015, pp. 1–18. [Online]. Available: [Link] org/abs/1412.6980
[32] L. Guo, J. Liu, X. Zhu, P. Yao, S. Lu, and H. Lu, "Normalized and geometry-aware self-attention network for image captioning," in Proc. CVPR, Jun. 2020, pp. 10327–10336, doi: 10.1109/CVPR42600.2020.01034.
[33] S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel, "Self-critical sequence training for image captioning," in Proc. CVPR, Jul. 2017, pp. 7008–7024, doi: 10.1109/CVPR.2017.131.
[34] L. Huang, W. Wang, J. Chen, and X.-Y. Wei, "Attention on attention for image captioning," in Proc. ICCV, Seoul, South Korea, Oct. 2019, pp. 4634–4643.

SUYA ZHANG was born in Henan, China, in 1995. She received the B.S. degree in broadcasting and television engineering from the Communication University of China, Beijing, China, in 2018, where she is currently pursuing the degree with the School of Information and Communication Engineering.
From 2017 to 2018, she had her researches in the college student innovation training program Research and Design of Intelligent Data Journalism. Since 2018, she has been a member of the State Key Laboratory of Media Convergence and Communication, Communication University of China. She got an opportunity to be an Intern with the Department of AI-Laboratory, Xinhua Zhiyun Ltd., Company, in 2020, and worked on the algorithm optimization of optical character recognition and harvesting raw tables from infographics. Her research interests include computer vision and neural language processing.
Ms. Zhang received the Zhongshi Guangxin Motivational Scholarship, School Motivational Scholarship during undergraduate education, the third-class scholarship during postgraduate education, and the overall second prize in the Competition on Harvesting Raw Tables from Infographics in 2020, 25th International Conference on Pattern Recognition.
YANA ZHANG was born in Hangzhou, Zhejiang, China, in 1980. She received the Ph.D. degree in information and communication engineering from the Communication University of China, Beijing, in 2013.
She was a Visiting Scholar with the Department of Electrical and Computer Science, University of California at San Diego, San Diego, CA, USA, in 2014. Since 2015, she has been an Associate Professor with the Department of Broadcasting and Television Engineering, Communication University of China. She is the author of two books, more than 30 papers, and more than four inventions. She holds two patents. Her research interests include intelligent video processing and media information security. She is a member of China DRM. She is a Reviewer of the journal China Communications.
Dr. Zhang was a recipient of the second prize Science and Technology Innovation Award of SARFT, in 2007, and the second and third rank of Film and Television Science and Technology Best Paper Award, in 2017. She received the title of Famous Teacher Award of Communication University of China, in 2020.

ZHAOHUI LI was born in Henan, Pingding Shan, China, in 1969. She received the B.S. degree in electrical engineering from Shandong University, Shandong, China, in 1987, and the M.S. degree in electrical engineering from the Beijing Institute of Technology, Beijing, China, in 2002, and the Ph.D. degree in electrical engineering from the Communication University of China, Beijing, in 2008.
Since 2008, she has been an Associate Professor with the Department of Broadcasting and Television Engineering, Communication University of China. Her research interests include video coding optimization-based on visual perception, video forgery detection, and video source identification.
Dr. Li received several awards and honors, including the second prize and the third prize of Science and Technology Innovation of the State Administration of Radio, Film and Television and the second prize of Excellent Achievement Project of Communication University of China.