Abstract
Text summarization is a natural language processing technique used to extract the key ideas from a given document. Advanced summarization methods should be able to comprehend high-level semantics in the text. In this paper, prevalent language models (LMs) and the text summarization frameworks built on them are reviewed. It is shown that the best-quality summaries are achieved with recent LMs. Furthermore, some of the considered frameworks are suitable for multiple natural language processing tasks and are able to learn in an unsupervised way. Since the ability to learn and to communicate in natural language is an important characteristic of Artificial General Intelligence, this line of research is highly relevant to the development of AGI.
Introduction
Automatic text summarization is one of the important techniques of natural language processing (NLP): it produces a shorter version of a document that conveys its core information. Nowadays, information on many topics is available on the internet in abundance, so summarization can be useful for many purposes, including media monitoring, question answering, and recommendation systems. Text summarization also appears to be important for the development of artificial general intelligence (AGI), since it is required for communicating in natural language and learning from human-readable resources. The first approaches to this task used statistical language models (LMs); most advances have been achieved with the development of neural-network LMs. Accordingly, this work shows that a pre-trained language representation model is a key element of high-quality text summarization.
Text summarization problem and various approaches
The text summarization task is mainly divided into two types: extractive summarization, which aims to select only the most important subset of sentences, and abstractive summarization, which aims to produce a shorter text with new sentences describing the main ideas of the source document. The latter approach requires extensive natural language processing to obtain linguistically fluent text. Abstractive summarization is therefore the more complex task, and most works address extractive summarization [6].
Some of the early works on extractive summarization used hand-crafted sentence features such as position, presence of positive or negative keywords, sentence centrality, resemblance, relative sentence length, etc. [5]. These features were then used to build models based on mathematical regression, genetic algorithms, or probabilistic neural networks. Approaches based on graphs, trees, and rules have also been introduced [6]. Further improvement was achieved with an encoder-extractor architecture built on recurrent neural networks (RNNs) [22]. In that work, the hierarchical structure of a document is encoded by sentence-level and document-level encoders based on bidirectional gated recurrent units (GRUs) [2]; sentences are then extracted by a combination of another GRU network and a multilayer perceptron, as sketched below.
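The following minimal PyTorch sketch illustrates the general shape of such a hierarchical encoder-extractor; the layer sizes, the random token ids, and the single MLP scorer are illustrative assumptions rather than the exact configuration of [22].

```python
import torch
import torch.nn as nn

class HierarchicalExtractor(nn.Module):
    """Toy hierarchical extractor: word-level BiGRU -> sentence vectors,
    sentence-level BiGRU -> document-contextual states, MLP -> extraction scores."""

    def __init__(self, vocab_size=10000, emb_dim=128, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.word_gru = nn.GRU(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.sent_gru = nn.GRU(2 * hidden, hidden, bidirectional=True, batch_first=True)
        self.scorer = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.Tanh(), nn.Linear(hidden, 1))

    def forward(self, doc):                             # doc: (num_sentences, max_words) token ids
        words = self.embed(doc)                         # (S, W, E)
        _, h = self.word_gru(words)                     # final states, shape (2, S, H)
        sent_vecs = torch.cat([h[0], h[1]], dim=-1)     # one (2H) vector per sentence
        doc_states, _ = self.sent_gru(sent_vecs.unsqueeze(0))  # sentence-level encoding, (1, S, 2H)
        return self.scorer(doc_states).squeeze(-1).squeeze(0)  # (S,) one score per sentence

# Score 6 sentences of 20 random tokens each and extract the top 2.
model = HierarchicalExtractor()
scores = model(torch.randint(0, 10000, (6, 20)))
print(scores.topk(2).indices)
```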
Most of these works were tested on the DUC 2004 dataset [23], which contains 500 articles and their short human-written summaries. However, DUC 2004 is relatively small compared to the more recently introduced (2015) CNN/Daily Mail dataset [7], which consists of about 300,000 training pairs. Summarization quality is commonly evaluated with the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) metric [9], which compares automatically produced summaries against ideal human-written summaries by counting overlapping n-grams, word sequences, and word pairs.
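For illustration, a simplified ROUGE-1 score (unigram overlap only; the actual package [9] additionally covers stemming, ROUGE-2, ROUGE-L, and other variants) can be computed as follows:

```python
from collections import Counter

def rouge_1(candidate: str, reference: str) -> dict:
    """Simplified ROUGE-1: unigram overlap between a candidate and a reference summary."""
    cand, ref = Counter(candidate.lower().split()), Counter(reference.lower().split())
    overlap = sum((cand & ref).values())               # clipped count of shared unigrams
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-9)
    return {"recall": recall, "precision": precision, "f1": f1}

print(rouge_1("the cat sat on the mat", "a cat was sitting on the mat"))
```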
Text summarization advances with pre-trained language models
Pre-trained language representation models are useful in many NLP tasks, and with their recent development most work on text summarization relies on LMs. In this chapter, prevalent LMs and the summarization frameworks built on them are reviewed. Developing from Word2Vec [11] and GloVe [13], which used a fixed embedding for each word, LMs such as ELMo [14] learn contextualized word representations by pre-training a LM (a bidirectional LSTM in the case of ELMo) in an unsupervised way.
Following a similar idea to ELMo, OpenAI GPT [15] scales the unsupervised LM up to a much larger size by training on a giant collection of free text corpora.
ELMo's LM was bidirectional, but the OpenAI GPT Transformer trains only a forward LM; this limitation was addressed in BERT [3]. Compared to GPT, the largest difference and improvement of BERT is that training is bidirectional: the model learns to predict context on both the left and the right. According to the authors, the bidirectional nature of their model is its single most important new contribution.
For the task of extractive summarization, the following works have been proposed. In [12], the document is passed through a BERT model to obtain sentence embeddings, which are then clustered with K-Means; the sentences whose embeddings lie closest to the cluster centroids are selected as candidate summary sentences. The advantage of this approach is that it can produce summaries in an unsupervised mode.
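The embed-cluster-select idea can be sketched as follows. Note that [12] feeds sentences through BERT directly, whereas this sketch assumes the sentence-transformers and scikit-learn libraries; the encoder name and the fixed number of clusters are illustrative choices.

```python
import numpy as np
from sentence_transformers import SentenceTransformer   # assumed helper; [12] uses BERT embeddings directly
from sklearn.cluster import KMeans

def extractive_summary(sentences, n_clusters=3):
    """Embed sentences, cluster the embeddings, keep the sentence closest to each centroid."""
    encoder = SentenceTransformer("all-MiniLM-L6-v2")    # illustrative encoder choice
    embeddings = encoder.encode(sentences)               # (n_sentences, dim)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(embeddings)
    picked = set()
    for center in km.cluster_centers_:
        dists = np.linalg.norm(embeddings - center, axis=1)
        picked.add(int(np.argmin(dists)))                # index of the nearest sentence
    return [sentences[i] for i in sorted(picked)]        # keep original sentence order

# Usage: summary = extractive_summary(list_of_sentences), e.g. after splitting a document into sentences.
```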
In [10], adaptations were made during fine-tuning so that the model learns sentence representations (in the original BERT, output vectors are tied to tokens rather than sentences). The extractive model is built by stacking either a simple classification layer or additional RNN or Transformer layers [19] on top of the pre-trained encoder (BERT) to better capture document-level features; the best-performing model, with Transformer layers, was named BertSumExt. For abstractive summarization, a standard encoder-decoder framework [17] is used: the encoder is the pre-trained BertSum and the decoder is a randomly initialized 6-layer Transformer.
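A much-simplified version of the simplest extractive variant in [10] (a sigmoid classification layer over per-sentence [CLS] vectors from BERT) might look like the sketch below; encoding each sentence separately is a simplification, since BertSum packs all sentences into one sequence with interval segment embeddings.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class SimpleBertExtractor(nn.Module):
    """Score each sentence with a sigmoid layer on top of its [CLS] vector (simplified BertSum idea)."""

    def __init__(self, name="bert-base-uncased"):
        super().__init__()
        self.tokenizer = BertTokenizer.from_pretrained(name)
        self.bert = BertModel.from_pretrained(name)
        self.classifier = nn.Linear(self.bert.config.hidden_size, 1)

    def forward(self, sentences):
        # Simplification: encode each sentence on its own so that every sentence gets its own [CLS].
        batch = self.tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
        cls_vectors = self.bert(**batch).last_hidden_state[:, 0]        # (n_sentences, hidden)
        return torch.sigmoid(self.classifier(cls_vectors)).squeeze(-1)  # extraction probabilities

model = SimpleBertExtractor()
print(model(["First sentence of the document.", "Second sentence.", "A third, less relevant sentence."]))
```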
Neural approaches to abstractive summarization treat the task as a sequence-to-sequence problem, where target summaries contain words or phrases that are not necessarily found in the original text. In contrast to older works, many of the methods considered here are capable of both abstractive and extractive summarization.
The research in [20] is one of the first to extend BERT to sequence generation. BERT is utilized in both the encoder and the two-stage decoder of a sequence-to-sequence model for abstractive text summarization. The encoder produces document embeddings, while the decoder works in two stages. In the first stage, a Transformer-based decoder generates a draft output sequence (summary). In the second stage, each word of the draft sequence is masked and fed into BERT; combining the input sequence with the draft representation produced by BERT, a Transformer-based decoder then predicts the refined word for each masked position. The word-level refine decoder performs a task similar to BERT's pre-training task, so by exploiting the contextual LM it can generate more fluent and natural sequences.
MASS [18] proposed masked sequence-to-sequence pre-training for language generation tasks: a single randomly selected sentence fragment is reconstructed given the remaining part of the sentence in an encoder-decoder framework. The encoder reads the source sequence and generates a set of representations; the decoder estimates the conditional probability of each target token given the source representations and its preceding tokens. An attention mechanism [1] between the encoder and decoder determines which source representations to focus on when predicting the current token. MASS, however, still needs to be fine-tuned on the text summarization task.
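A toy construction of one MASS-style training pair (token-level, with an illustrative fragment length of roughly half the sentence) is shown below:

```python
import random

def mass_example(tokens, mask="[MASK]", frac=0.5, seed=0):
    """Mask one contiguous fragment: the encoder sees the rest, the decoder reconstructs the fragment."""
    random.seed(seed)
    k = max(1, int(len(tokens) * frac))              # fragment length (MASS masks about 50% of the tokens)
    start = random.randint(0, len(tokens) - k)
    encoder_input = tokens[:start] + [mask] * k + tokens[start + k:]
    decoder_target = tokens[start:start + k]         # generated left-to-right while attending to the encoder
    return encoder_input, decoder_target

enc, dec = mass_example("the quick brown fox jumps over the lazy dog".split())
print(enc)   # sentence with a masked fragment
print(dec)   # the fragment the decoder must generate
```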
UniLM [4] proposed jointly training on three types of LM tasks: unidirectional (left-to-right and right-to-left), bidirectional (word-level masking with next-sentence prediction), and sequence-to-sequence (word-level masking) prediction.
Text-to-Text Transfer Transformer (T5) [16] introduced a unified framework that converts every language problem into a text-to-text format. This framework demonstrated the advantage of scaling up both the model size (up to 11 billion parameters) and the pre-training corpus. T5 was pre-trained by randomly corrupting text spans with varying mask ratios and span sizes.
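With a publicly released checkpoint, summarization then reduces to prepending a task prefix; the sketch below assumes the Hugging Face transformers library and the t5-small checkpoint, which are not part of [16] itself:

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")              # small public checkpoint for illustration
model = T5ForConditionalGeneration.from_pretrained("t5-small")

article = "Automatic summarization condenses a document into a shorter text that keeps its core information."
inputs = tokenizer("summarize: " + article, return_tensors="pt", truncation=True)   # text-to-text task prefix
ids = model.generate(**inputs, max_length=60, num_beams=4, early_stopping=True)
print(tokenizer.decode(ids[0], skip_special_tokens=True))
```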
BART [8] introduced a denoising autoencoder for pre-training sequence-to-sequence models. Text is corrupted in two ways: by randomly shuffling the order of the original sentences, and by a novel in-filling scheme in which spans of text are replaced with a single mask token; the model is then trained to reconstruct the original text. For generation tasks, the noising function is text infilling, which masks randomly sampled spans of text with single mask tokens.
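A toy version of these two noising steps (sentence shuffling plus span infilling; span lengths here are fixed rather than Poisson-sampled as in [8]) could be:

```python
import random

def bart_noise(sentences, mask="<mask>", span_len=3, n_spans=2, seed=0):
    """Corrupt text BART-style: shuffle sentences, then replace random token spans with one mask token each."""
    random.seed(seed)
    shuffled = sentences[:]
    random.shuffle(shuffled)                          # sentence permutation
    tokens = " ".join(shuffled).split()
    for _ in range(n_spans):                          # text infilling: a whole span becomes a single mask token
        start = random.randrange(0, max(1, len(tokens) - span_len))
        tokens[start:start + span_len] = [mask]
    return " ".join(tokens)

original = ["The model is pre-trained as a denoising autoencoder.",
            "It reconstructs the original text from the corrupted input."]
print(bart_noise(original))    # encoder input; the decoder target is the uncorrupted original text
```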
PEGASUS [21] addresses the fact that earlier pre-trained Transformers relied on generic objectives and were only later fine-tuned on downstream NLP tasks, including text summarization. The authors propose pre-training large Transformer-based encoder-decoder models on massive text corpora with a self-supervised objective tailored to summarization. The key idea is to mask or remove whole sentences from a document and to generate these gap sentences from the rest of the document, which works well as a pre-training objective for downstream summarization tasks. This approach is the current state of the art on a large number of diverse summarization tasks.
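The gap-sentence objective can be illustrated with a toy pair construction; here the gap sentences are picked at random, whereas PEGASUS prefers "principal" sentences selected by ROUGE against the rest of the document:

```python
import random

def gsg_example(sentences, mask="[MASK1]", gap_ratio=0.3, seed=0):
    """Build a PEGASUS-style pair: document with masked sentences as input, removed sentences as target."""
    random.seed(seed)
    n_gaps = max(1, int(len(sentences) * gap_ratio))
    gap_ids = set(random.sample(range(len(sentences)), n_gaps))
    source = " ".join(mask if i in gap_ids else s for i, s in enumerate(sentences))
    target = " ".join(s for i, s in enumerate(sentences) if i in gap_ids)   # sentences to be generated
    return source, target

doc = ["Pre-training objectives matter for summarization.",
       "PEGASUS removes whole sentences from a document.",
       "The model learns to generate them from the remaining text.",
       "This transfers well to downstream summarization datasets."]
src, tgt = gsg_example(doc)
print(src)
print(tgt)
```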
Conclusion
The use of pre-trained LMs has become prevalent in recent state-of-the-art research. Applying LMs in text summarization frameworks not only improves the results but, in many cases, also makes the frameworks suitable for abstractive summarization. Moreover, some of these frameworks are able to learn downstream NLP tasks in an unsupervised manner, which is of particular interest for the development of AGI.
References
- Bahdanau, D. et al.: Neural Machine Translation by Jointly Learning to Align and Translate. arXiv:1409.0473 [cs, stat]. (2016).
- Cho, K. et al.: Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. arXiv:1406.1078 [cs, stat]. (2014).
- Devlin, J. et al.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). pp. 4171–4186 Association for Computational Linguistics, Minneapolis, Minnesota (2019). https://doi.org/10.18653/v1/N19-1423.
- Dong, L. et al.: Unified Language Model Pre-training for Natural Language Understanding and Generation. arXiv:1905.03197 [cs]. (2019).
- Fattah, M.A., Ren, F.: GA, MR, FFNN, PNN and GMM based models for automatic text summarization. Computer Speech & Language. 23, 1, 126–144 (2009). https://doi.org/10.1016/j.csl.2008.04.002.
- Gambhir, M., Gupta, V.: Recent automatic text summarization techniques: a survey. Artif Intell Rev. 47, 1, 1–66 (2017). https://doi.org/10.1007/s10462-016-9475-9.
- Hermann, K.M. et al.: Teaching Machines to Read and Comprehend. arXiv:1506.03340 [cs]. (2015).
- Lewis, M. et al.: BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. arXiv:1910.13461 [cs, stat]. (2019).
- Lin, C.-Y.: ROUGE: A Package for Automatic Evaluation of Summaries. In: Text Summarization Branches Out. pp. 74–81 Association for Computational Linguistics, Barcelona, Spain (2004).
- Liu, Y., Lapata, M.: Text Summarization with Pretrained Encoders. arXiv:1908.08345 [cs]. (2019).
- Mikolov, T. et al.: Distributed Representations of Words and Phrases and their Compositionality. In: Burges, C.J.C. et al. (eds.) Advances in Neural Information Processing Systems 26. pp. 3111–3119 Curran Associates, Inc. (2013).
- Miller, D.: Leveraging BERT for Extractive Text Summarization on Lectures. (2019).
- Pennington, J. et al.: Glove: Global Vectors for Word Representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 1532–1543 Association for Computational Linguistics, Doha, Qatar (2014). https://doi.org/10.3115/v1/D14-1162.
- Peters, M.E. et al.: Deep contextualized word representations. arXiv:1802.05365 [cs]. (2018).
- Radford, A. et al.: Improving Language Understanding by Generative Pre-Training. (2018).
- Raffel, C. et al.: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. arXiv:1910.10683 [cs, stat]. (2019).
- See, A. et al.: Get To The Point: Summarization with Pointer-Generator Networks. arXiv:1704.04368 [cs]. (2017).
- Song, K. et al.: MASS: Masked Sequence to Sequence Pre-training for Language Generation. arXiv:1905.02450 [cs]. (2019).
- Vaswani, A. et al.: Attention Is All You Need. arXiv:1706.03762 [cs]. (2017).
- Zhang, H. et al.: Pretraining-Based Natural Language Generation for Text Summarization. arXiv:1902.09243 [cs]. (2019).
- Zhang, J. et al.: PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization. arXiv:1912.08777 [cs]. (2019).
- Zhou, Q. et al.: Neural Document Summarization by Jointly Learning to Score and Select Sentences. arXiv:1807.02305 [cs]. (2018).
- DUC 2004 Documents for Summarization, Tasks, and Measures, https://duc.nist.gov/duc2004/, last accessed 2020/02/10.