
A Study on Heterogeneous Graph Neural Networks and Contextualized Language Models for Text Summarization

Advisor: 陳柏琳

Abstract


With the rapid growth of the Internet, tens of thousands of text messages are produced every day, but not everyone has time to read them all, so we need techniques that help us quickly grasp the key content of each article. Automatic text summarization arose to meet this need: it helps us extract key information quickly and accurately from one or more documents. Automatic summarization methods fall into two types: extractive summarization and abstractive summarization. The former selects the important sentences of an article to form the summary, while the latter first comprehends the article and then rewrites it into a concise summary rich in key points. The goal of this thesis is to build an extractive summarization model that captures semantics and yields little redundancy among the selected summary sentences. We use graph-based neural networks (GNNs) to learn sentence embeddings rich in contextual semantics. Because graphs in real-world applications usually contain nodes of several different types, we adopt a heterogeneous graph neural network (HGNN) as our foundation and improve the model from three aspects. First, in the encoding stage, we strive to incorporate more information; for example, we use language models based on bidirectional encoder representations from transformers (BERT) to supply the summarization model with richer contextual information. In addition, sentence properties such as inter-sentence and intra-sentence relationships are taken into account at this stage. Next, in the sentence rescoring stage, we provide several rescoring methods; among them, the position of each sentence within the document can be taken into account, after normalization to reduce bias. Finally, in the sentence selection stage, we improve the sentence selector to reduce redundancy. Experimental results show that the methods proposed in this thesis achieve quite good results on public summarization datasets.
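As a concrete illustration of the rescoring step, the sketch below blends a model's per-sentence scores with a normalized position prior; the interpolation weight `alpha` and the linear form of the prior are illustrative assumptions, not the thesis's exact formulation.

```python
def rescore_by_position(scores, alpha=0.1):
    """Blend each sentence's model score with a normalized position
    prior so earlier sentences get a mild, bounded boost.

    scores: per-sentence salience scores from the summarization model.
    alpha: interpolation weight for the position prior (assumed value).
    """
    n = len(scores)
    # Normalize positions to (0, 1]: 1.0 for the first sentence,
    # approaching 0 for the last, so the prior cannot dominate.
    priors = [(n - i) / n for i in range(n)]
    return [(1 - alpha) * s + alpha * p for s, p in zip(scores, priors)]
```

With `alpha = 0` the model scores pass through unchanged; increasing it strengthens the lead bias that news-style summarization often exploits.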

English Abstract


The explosive growth of big data requires methods that capture key information effectively. Automatic summarization helps us capture key information quickly and accurately from single or multiple documents. In general, automatic summarization can be classified into two types: extractive summarization extracts existing sentences to form the summary, while abstractive summarization reconstructs a meaningful summary after comprehending the document. The goal of this thesis is to generate semantically informed extractive summaries with less redundancy. To achieve this goal, the thesis uses graph-based neural networks to learn contextual sentence embeddings. Because real-world graph applications usually involve multiple node types, we implement a heterogeneous graph neural network as our baseline model and explore three aspects to improve its performance. First, efforts are made to incorporate more contextual information in the encoding stage. Language models based on bidirectional encoder representations from transformers are used to provide richer contextual representations, and additional sentence properties, such as inter- and intra-sentential relationships, are also considered. Second, this thesis provides several methods for sentence rescoring; the order of sentences is considered by balancing their positions across the document and normalizing them to mitigate bias. Finally, sentence selectors are improved to reduce redundancy. The experimental results show that all three methods significantly improve performance.
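The redundancy-aware selection step can be sketched as a greedy, maximal-marginal-relevance-style loop over sentence embeddings; the cosine-similarity penalty and the trade-off weight `lam` are illustrative assumptions rather than the thesis's exact selector.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def select_sentences(embeddings, scores, k=3, lam=0.7):
    """Greedily pick up to k sentences, trading relevance (scores)
    against redundancy with sentences already selected."""
    selected, candidates = [], list(range(len(scores)))
    while candidates and len(selected) < k:
        def mmr(i):
            # Penalize similarity to the most similar sentence chosen so far.
            redundancy = max((cosine(embeddings[i], embeddings[j])
                              for j in selected), default=0.0)
            return lam * scores[i] - (1 - lam) * redundancy
        best = max(candidates, key=mmr)
        selected.append(best)
        candidates.remove(best)
    return sorted(selected)  # present the summary in document order
```

With `lam` close to 1 the selector reduces to picking the top-scored sentences; lowering it pushes the summary toward more diverse content.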

