
Guiding Generation for Abstractive Text Summarization

based on Key Information Guide Network

Chenliang Li and Weiran Xu∗ and Si Li and Sheng Gao


Beijing University of Posts and Telecommunications, Beijing
chenliangli,xuweiran,lisi,gaosheng@[Link]

∗ Corresponding author: Weiran Xu

Proceedings of NAACL-HLT 2018, pages 55–60, New Orleans, Louisiana, June 1-6, 2018. © 2018 Association for Computational Linguistics

Abstract

Neural network models based on the attentional encoder-decoder architecture perform well in abstractive text summarization. However, these models are hard to control during generation, which leads to a lack of key information in the summary. We propose a guiding generation model that combines the extractive method and the abstractive method. Firstly, we obtain keywords from the text with an extractive model. Then, we introduce a Key Information Guide Network (KIGN), which encodes the keywords to the key information representation, to guide the process of generation. In addition, we use a prediction-guide mechanism, which can obtain the long-term value for future decoding, to further guide the summary generation. We evaluate our model on the CNN/Daily Mail dataset. The experimental results show that our model leads to significant improvements.

1 Introduction

Text summarization aims to generate a brief summary from an input document while retaining the key information. There are two broad approaches to summarization: extractive and abstractive. Extractive models (Mihalcea and Tarau, 2004; Yasunaga et al., 2017) usually extract a few sentences or keywords from the source text, while abstractive models (Rush et al., 2015; Nallapati et al., 2016) generate new words and phrases that are not in the source text to construct the summary.

Recently, inspired by the success of the encoder-decoder model (Sutskever et al., 2014), abstractive summarization models (Nallapati et al., 2016; See et al., 2017) have been able to generate summaries with high ROUGE scores. While these models have proved capable of capturing the regularities of text summarization, they are hard to control during generation. Without external guidance, these models just take the source text as input and then output the summary, which certainly leads to a lack of key information.

Zhou et al. (2017) propose a selective gate network to retain more key information in the summary. However, the selective gate network, which is controlled by the representation of the input text, controls the information flow from encoder to decoder only once. If some key information does not pass the network, it is hard for it to appear in the summary. See et al. (2017) propose a pointer-generator model, which uses the pointer mechanism (Vinyals et al., 2015) to copy words from the input text, to deal with out-of-vocabulary (OOV) words. Without external guidance, it is hard for the pointer to identify keywords. To address these problems, we combine the extractive model and the abstractive model and use the former to obtain keywords as guidance for the latter.

In this paper we propose a guiding generation model for abstractive text summarization. Firstly, we use an extractive method to obtain the keywords from the text. Then, we introduce a Key Information Guide Network (KIGN), which encodes the keywords to the key information representation and integrates it into the abstractive model, to guide the process of generation. The guidance acts mainly on two aspects: the attention mechanism (Bahdanau et al., 2014) and the pointer mechanism. In addition, we propose a novel prediction-guide mechanism based on He et al. (2017), which predicts the extent of key information covered in the final summary, to further guide the summary generation. Experiments show that our model achieves significant improvements.

2 Related work

Neural encoder-decoder models. Abstractive models (Rush et al., 2015; Chopra et al., 2016) have been widely used in text summarization. Nallapati et al. (2016) use a pointer network (Vinyals
et al., 2015) to deal with the unknown word problem.

Keywords extraction. The TextRank algorithm (Mihalcea and Tarau, 2004), which extracts keywords from the source text, is unsupervised.

Prediction-guide mechanism. Inspired by the success of AlphaGo, He et al. (2017) propose a prediction network to predict the long-term value of the final summary. Our prediction-guide mechanism is used to ensure that more key information is covered in the final summary.

3 Our Model

In this section, we describe (1) our baseline encoder-decoder model, (2) our key information guide network, and (3) our prediction-guide mechanism.

3.1 Attention-based encoder-decoder model

Our baseline model is similar to that of Nallapati et al. (2016). The tokens of the input article $x = \{x_1, x_2, ..., x_N\}$ are fed into the encoder, which maps the text into a sequence of encoder hidden states $\{h_1, h_2, ..., h_N\}$. At each decoding time step $t$, the decoder reads the previous word embedding $w_{t-1}$ and the previous context vector $c_{t-1}$ as inputs to obtain the decoder hidden state $s_t$. The context vector $c_t$ is calculated by using the attention mechanism:

$$e_{ti} = v^{T} \tanh(W_h h_i + W_s s_t) \tag{1}$$

$$\alpha^{e}_{t} = \mathrm{softmax}(e_t) \tag{2}$$

$$c_t = \sum_{i=1}^{N} \alpha^{e}_{ti} h_i \tag{3}$$

where $v$, $W_h$, $W_s$ are learnable parameters and $h_i$ is the hidden state of the input token $x_i$.

The context vector $c_t$, which represents what has been read from the source text, is concatenated with the decoder hidden state $s_t$ to predict the next word with a softmax layer over the whole vocabulary:

$$P(y_t \mid y_1, ..., y_{t-1}) = \mathrm{softmax}(f(s_t, c_t)) \tag{4}$$

where $f$ represents a linear function.

3.2 Key information guide network

Most encoder-decoder models (Zhou et al., 2017; See et al., 2017) just take the source text as input and then output the summary, which is hard to control during generation and leads to a lack of key information in the summary. We propose a key information guide network to guide the process of generation from two aspects: the attention mechanism and the pointer mechanism.

In detail, we extract keywords from the text by using the TextRank algorithm. As shown in Figure 1, the keywords are fed one-by-one into the key information guide network, and then we concatenate the last forward hidden state $\overrightarrow{h}_n$ and backward hidden state $\overleftarrow{h}_1$ as the key information representation $k$:

$$k = \begin{bmatrix} \overleftarrow{h}_1 \\ \overrightarrow{h}_n \end{bmatrix} \tag{5}$$

[Figure 1 (diagram): an encoder over the text "An earthquake magnitude struck southern iran on wednesday ...", a decoder starting from <START>, and a Key Information Guide Network over the keywords "earthquake struck wednesday iran" that feeds the attention, pointer and softmax components; example output distribution: struck (0.6), hit (0.3), shake (0.05).]

Figure 1: Our key information guide model. It consists of the key information guide network, the encoder and the decoder. In the key information guide network, we encode the keywords to the key information representation k.

Attention mechanism: The traditional attention mechanism, which just uses the decoder state as a query to get the attention distribution over the encoder hidden states, has difficulty identifying keywords. We use the key information representation k as
extra input to the attention mechanism, changing Equation (1) to:

$$e_{ti} = v^{T} \tanh(W_h h_i + W_s s_t + W_k k) \tag{6}$$

where $W_k$ is a learnable parameter. We use the new $e_{ti}$ to obtain a new attention distribution $\alpha^{e}_{t}$ (Equation 2) and a new context vector $c_t$ (Equation 3).

Our key information representation $k$ makes the attention mechanism focus more on the keywords. This is similar to introducing prior knowledge into the model.

Then, we apply the key information representation $k$ and use the new context vector $c_t$ to calculate a probability distribution over all words in the vocabulary, changing Equation (4) to:

$$P_v(y_t \mid y_1, ..., y_{t-1}) = \mathrm{softmax}(f(s_t, c_t, k)) \tag{7}$$

where the subscript $v$ indicates that $y_t$ is from the target vocabulary.

Pointer mechanism: Due to the limited vocabulary size, some keywords may not be in the target vocabulary, which will certainly cause them to be missing from the final summary. Therefore we take the key information representation $k$, the context vector $c_t$ and the decoder hidden state $s_t$ as inputs to calculate a soft switch $p_{sw}$, which is used to choose between generating a word from the target vocabulary and copying a word from the input text:

$$p_{sw} = \sigma(w_k^{T} k + w_c^{T} c_t + w_s^{T} s_t + b_{sw}) \tag{8}$$

where $w_k$, $w_c$, $w_s$ and $b_{sw}$ are parameters and $\sigma$ is the sigmoid function.

Our pointer mechanism, which is equipped with the key information representation, has the ability to identify the keywords. We use the new attention distribution $\alpha^{e}_{ti}$ as the probability of the input token $w_i$ and obtain the following probability distribution to predict the next word:

$$P(y_t = w) = p_{sw} P_v(y_t = w) + (1 - p_{sw}) \sum_{i: w_i = w} \alpha^{e}_{ti} \tag{9}$$

Note that if $w$ is an out-of-vocabulary word, $P_v(y_t = w)$ is zero.

During training, we minimize a maximum-likelihood loss at each decoding time step, which is the most widely used objective in sequence generation. We define $y_t^{*}$ as the target word for decoding time step $t$, and the overall loss is:

$$L = -\frac{1}{T} \sum_{t=0}^{T} \log P(y_t^{*} \mid y_1^{*}, ..., y_{t-1}^{*}, x) \tag{10}$$

3.3 Prediction-guide mechanism at test time

At test time, when predicting the next word, we consider not only the above probability (Equation 9), but also a long-term value predicted by the prediction-guide mechanism. The prediction-guide mechanism is based on He et al. (2017).

Our prediction-guide mechanism, which is a single-layer feed-forward network with a sigmoid activation function, predicts the extent of the key information covered in the final summary. At each decoding time step $t$, we take the mean pooling over the decoder hidden states $\bar{s}_t = \frac{1}{t} \sum_{l=1}^{t} s_l$, the mean pooling over the encoder hidden states $\bar{h}_N = \frac{1}{N} \sum_{i=1}^{N} h_i$ and the key information representation $k$ as inputs to calculate the long-term value.

We sample two partial summaries $y_{p1}$ and $y_{p2}$ for each $x$ with a random stop to get $\bar{s}_t$. Then, we finish the generation from $y_p$ to obtain $M$ average decoder hidden states $\bar{s}$ of the completed summaries $S(y_p)$ (using beam search), and compute the average score:

$$\mathrm{AvgCos}(x, y_p) = \frac{1}{M} \sum_{\bar{s} \in S(y_p)} \cos(\bar{s}, k) \tag{11}$$

where $\cos$ is the cosine similarity function.

We want the predicted value $v(x, y_{p1})$ to be larger than $v(x, y_{p2})$ if $\mathrm{AvgCos}(x, y_{p1}) > \mathrm{AvgCos}(x, y_{p2})$. Therefore, the loss function of the prediction-guide network is as follows:

$$L_{pg} = \sum_{(x, y_{p1}, y_{p2})} e^{v(x, y_{p2}) - v(x, y_{p1})} \tag{12}$$

where $\mathrm{AvgCos}(x, y_{p1}) > \mathrm{AvgCos}(x, y_{p2})$.

At test time, we first compute the normalized log probability of each candidate, and then linearly combine it with the value predicted by the prediction-guide network. In detail, given an abstractive model $P(y \mid x)$ (Equation 9), a prediction-guide network $v(x, y)$ and a hyperparameter $\alpha \in (0, 1)$, the score of a partial sequence $y$ for $x$ is computed by:

$$\alpha \times \log P(y \mid x) + (1 - \alpha) \times \log v(x, y) \tag{13}$$
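The scoring rule of Equation (13) can be illustrated with a minimal sketch. It is not the authors' implementation: the function and field names (`guided_score`, `rerank`, `token_log_probs`, `value`) are hypothetical, the candidate values are made up, and "normalized log probability" is assumed here to mean the average log probability per token, which the text does not spell out. The value v(x, y) is taken to be a sigmoid output in (0, 1].

```python
import math
from typing import Dict, List


def guided_score(token_log_probs: List[float], value: float, alpha: float = 0.9) -> float:
    """Score a partial summary as alpha * log P(y|x) + (1 - alpha) * log v(x, y).

    token_log_probs: per-token log probabilities of the partial summary.
    value: prediction-guide output v(x, y), assumed to lie in (0, 1].
    alpha: interpolation weight in (0, 1).
    """
    if not token_log_probs:
        raise ValueError("need at least one decoded token")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in (0, 1)")
    if not 0.0 < value <= 1.0:
        raise ValueError("value must lie in (0, 1]")
    # Assumption: normalise by length, i.e. average log probability per token.
    normalized_log_prob = sum(token_log_probs) / len(token_log_probs)
    return alpha * normalized_log_prob + (1.0 - alpha) * math.log(value)


def rerank(candidates: List[Dict], alpha: float = 0.9) -> List[Dict]:
    """Sort beam candidates from the best to the worst guided score."""
    return sorted(
        candidates,
        key=lambda c: guided_score(c["token_log_probs"], c["value"], alpha),
        reverse=True,
    )


# Illustrative beam: candidate A is slightly more probable under the model,
# candidate B is predicted to cover much more of the key information.
beam = [
    {"summary": "A", "token_log_probs": [-0.5, -0.5], "value": 0.2},
    {"summary": "B", "token_log_probs": [-0.6, -0.6], "value": 0.9},
]
best = rerank(beam, alpha=0.9)[0]["summary"]
```

With these illustrative numbers, candidate B overtakes A once the value term carries a tenth of the weight, whereas with α very close to 1 the ranking falls back to the model probability alone.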
Model                                        ROUGE-1   ROUGE-2   ROUGE-L
Enc-dec+attn baseline (50k vocab)              31.33     11.81     28.83
Abstractive model (Nallapati et al., 2016)     35.46     13.30     32.65
Baseline+pointer                               36.44     15.66     33.42
KIGN                                           37.76     16.56     34.49
Prediction-guide                               37.24     16.27     34.14
KIGN+Prediction-guide                          38.95     17.12     35.68

Table 1: ROUGE F1 scores for models on the CNN/Daily Mail test set. All our ROUGE scores have a 95% confidence interval of at most ±0.25 as reported by the official ROUGE script.
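The gain of each variant over the Baseline+pointer row of Table 1 can be recomputed with a few lines of arithmetic. The sketch below only copies the scores from the table; the names `ROUGE_F1` and `gains_over` are illustrative and not part of the paper.

```python
from typing import Dict, Tuple

# (ROUGE-1, ROUGE-2, ROUGE-L) F1 scores copied from Table 1.
ROUGE_F1: Dict[str, Tuple[float, float, float]] = {
    "Baseline+pointer": (36.44, 15.66, 33.42),
    "KIGN": (37.76, 16.56, 34.49),
    "Prediction-guide": (37.24, 16.27, 34.14),
    "KIGN+Prediction-guide": (38.95, 17.12, 35.68),
}


def gains_over(reference: str, model: str) -> Tuple[float, ...]:
    """Return the per-metric difference model - reference, rounded to 2 decimals."""
    return tuple(
        round(m - r, 2) for m, r in zip(ROUGE_F1[model], ROUGE_F1[reference])
    )


for name in ("KIGN", "Prediction-guide", "KIGN+Prediction-guide"):
    r1, r2, rl = gains_over("Baseline+pointer", name)
    print(f"{name}: +{r1:.2f} ROUGE-1, +{r2:.2f} ROUGE-2, +{rl:.2f} ROUGE-L")
```

Each component alone improves on the pointer baseline by roughly one ROUGE point or less, and the combined model improves by more than either component alone.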

4 Experiments

4.1 Experiment setting

We use the CNN/Daily Mail dataset (Nallapati et al., 2016; Hermann et al., 2015) and use scripts supplied by Nallapati et al. (2016) to obtain the same version of the data, which has 287,226 training pairs, 13,368 validation pairs and 11,490 test pairs. We use two 256-dimensional LSTMs for the bidirectional encoder and one 256-dimensional LSTM for the decoder. In our key information guide network, the keywords are encoded in the same way as in the encoder. In addition, we use a vocabulary of 50k words for both source and target and do not pre-train the word embeddings; they are learned from scratch during training. During training and testing, we truncate the text to 400 tokens and limit the length of the summary to 100 tokens. We train using Adagrad (Duchi et al., 2011) with learning rate 0.15 and an initial accumulator value of 0.1. The batch size is set to 16. Following previous work, our evaluation metric is the F-score of ROUGE (Lin and Hovy, 2003).

In addition, for the prediction-guide mechanism, we use a single-layer feed-forward network with 800 nodes. For the hyperparameter α, we test the performance of the KIGN+Prediction-guide model using different values of α during decoding. As can be seen from Figure 2, the performance is stable for α ranging from 0.8 to 0.95. When α is set to 0.9, we obtain the highest F-score of ROUGE. Besides, we set M to 8 and adopt mini-batch training with a batch size of 16. The network is trained with AdaDelta (Zeiler, 2012).

[Figure 2 (plot): ROUGE F score (y-axis) against the hyperparameter α (x-axis), with one curve each for ROUGE-1, ROUGE-2 and ROUGE-L.]

Figure 2: ROUGE-1, ROUGE-2 and ROUGE-L F1 scores of the KIGN+Prediction-guide model w.r.t. different values of the hyperparameter α.

During training and at test time we truncate the input tokens to 400 and limit the length of the output summary to 100 tokens for training and 120 tokens at test time, which is similar to See et al. (2017). We trained our keywords network model for less than 200,000 training iterations. Then we trained the single-layer feed-forward network based on the KIGN model. Finally, at test time, we combine the KIGN model and the prediction-guide mechanism to generate the summary.

4.2 Results and discussions

We compare our model with the baseline model (enc-dec+attn), hierarchical networks (Nallapati et al., 2016) and the baseline model equipped with the pointer mechanism, since we use the pointer mechanism in our model.

Table 1 shows that the scores of our key information guide network exceed those of the baseline model equipped with the pointer mechanism by (+1.3 ROUGE-1, +0.9 ROUGE-2, +1.0 ROUGE-L). In addition, to understand the contribution of each part, we add only the prediction-guide mechanism to the baseline model equipped with the pointer mechanism. The scores of this variant exceed those of the baseline model equipped with the pointer mechanism by (+0.8 ROUGE-1, +0.6 ROUGE-2, +0.7 ROUGE-L). Finally, combining the key information guide network and the prediction-guide mechanism, we achieve better performance. The scores of our best model exceed those of the baseline model with pointer-mechanism by (+2.5 ROUGE-1, +1.5 ROUGE-2, +2.2 ROUGE-L). In this paper, we do not implement the coverage mechanism in our model, which can greatly improve the ROUGE scores (See et al., 2017).

4.3 Case study

Figure 3 is an example showing the coverage of the key information between the text and the summary; the bold words are the key information of the text. We compare the output of two models and give the gold summary. It shows that the main idea of the text is about google handwriting input working on android handsets, together with an introduction to some of its functions. The baseline model equipped with the pointer mechanism produces a summary that only states that google has cracked the problem of reading handwriting, while the summary generated by our model covers almost all the key information of the text.

Text (truncated): google claims to have cracked a problem that has flummoxed anyone who has tried to read a doctor's note - how to read anyone's handwriting. the firm claims the latest update to its android handsets can under 82 languages in 20 distinct scripts, and works with both printed and cursive writing input with or without a stylus. it even allows users to simply draw emoji they want to send. scroll down for video. the california search giant claims the latest update to its android handsets can understand handwriting in 82 languages in 20 distinct scripts. google says its handwriting recognition works by building on large-scale language modeling, robust multi-language ocr.

Gold: google handwriting input works on android phones and tablets. handsets can under 82 languages in 20 distinct scripts. works with both printed and cursive writing input with or without a stylus.

Baseline+pointer-mechanism: google claims to have cracked a problem that has flummoxed anyone who has tried to read a doctor 's note how to read anyone 's handwriting.

Our model: google claims the latest update to its android handsets can under 82 languages in 20 distinct scripts, and works with both printed and cursive writing input with or without a stylus.

Figure 3: Comparison of the output of two models on a news article. Bold words in text are the key information. (Baseline: enc-dec+attn; Our model: KIGN+prediction-guide)

5 Conclusion

In this work, we propose a guiding generation model for abstractive text summarization. We combine the extractive model and the abstractive model. Firstly, we use the extractive method to obtain keywords from the input text. Then, we introduce a key information guide network, which encodes the keywords to the key information representation, to guide the process of generation. In addition, we propose a prediction-guide mechanism to further guide the generation at test time. Experiments show that our model leads to significant improvements.

Acknowledgments

We thank the anonymous reviewers for useful comments. This work was supported by Beijing Natural Science Foundation (4174098), National Natural Science Foundation of China (61702047), National Natural Science Foundation of China (61703234) and the Fundamental Research Funds for the Central Universities (2017RC02).

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473.

Sumit Chopra, Michael Auli, and Alexander M. Rush. 2016. Abstractive sentence summarization with attentive recurrent neural networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 93–98. Association for Computational Linguistics.

John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res., 12:2121–2159.

Di He, Hanqing Lu, Yingce Xia, Tao Qin, Liwei Wang, and Tieyan Liu. 2017. Decoding with value networks for neural machine translation. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 177–186. Curran Associates, Inc.

Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 1693–1701. Curran Associates, Inc.

Chin-Yew Lin and Eduard Hovy. 2003. Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics.

Rada Mihalcea and Paul Tarau. 2004. TextRank: Bringing order into texts. In Proceedings of EMNLP 2004, pages 404–411, Barcelona, Spain. Association for Computational Linguistics.

Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Caglar Gulcehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 280–290. Association for Computational Linguistics.

Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 379–389. Association for Computational Linguistics.

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083. Association for Computational Linguistics.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 3104–3112, Cambridge, MA, USA. MIT Press.

Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2692–2700. Curran Associates, Inc.

Michihiro Yasunaga, Rui Zhang, Kshitijh Meelu, Ayush Pareek, Krishnan Srinivasan, and Dragomir Radev. 2017. Graph-based neural multi-document summarization. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 452–462. Association for Computational Linguistics.

Matthew D. Zeiler. 2012. ADADELTA: an adaptive learning rate method. CoRR, abs/1212.5701.

Qingyu Zhou, Nan Yang, Furu Wei, and Ming Zhou. 2017. Selective encoding for abstractive sentence summarization. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1095–1104. Association for Computational Linguistics.