A News Event Summarization Method That Computes Sentence Centrality over a Sentence Similarity Network

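The core of extraction-based summarization lies in assessing how well each sentence represents the summary, which provides the basis for ranking sentences for extraction. This study treats sentences as nodes and uses sentence similarity to decide whether a link exists between two nodes, thereby constructing a sentence similarity network model. It then measures the importance of each node in the network, or its influence on the nodes connected to it, and proposes five node centrality analyses: (1) Degree Centrality, (2) Normalized Similarity-based Degree Centrality, (3) HITS Centrality, (4) PageRank Centrality, and (5) iSpreadRank Centrality. Sentence centrality serves as the measure of a sentence's summary representativeness and is used to rank sentences. Finally, CSIS (Cross-Sentence Information Subsumption) is applied to filter redundant information, and sentences are extracted in rank order to compose the summary. Experiments on the DUC 2004 dataset verify the feasibility of the method. Under the ROUGE-1 metric, the summarization performance obtained with the different sentence centralities ranks as follows: iSpreadRank > Normalized Similarity-based Degree > PageRank > Degree > HITS. Overall, the experiments show that computing sentence centrality over a sentence similarity network is a feasible summarization approach.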
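The network construction and two of the centrality measures named above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the bag-of-words cosine similarity, the `threshold` of 0.1, the `damping` factor of 0.85, and all function names are assumptions, since the abstract does not specify them.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase a sentence and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def build_network(sentences, threshold=0.1):
    """Return a symmetric weight matrix; an edge exists only when the
    similarity reaches the (assumed) threshold."""
    vectors = [Counter(tokenize(s)) for s in sentences]
    n = len(sentences)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            sim = cosine(vectors[i], vectors[j])
            if sim >= threshold:
                weights[i][j] = weights[j][i] = sim
    return weights


def degree_centrality(weights):
    """Number of neighbours of each sentence node."""
    return [sum(1 for w in row if w > 0) for row in weights]


def pagerank_centrality(weights, damping=0.85, iterations=50):
    """Similarity-weighted PageRank by power iteration."""
    n = len(weights)
    scores = [1.0 / n] * n
    strength = [sum(row) for row in weights]
    for _ in range(iterations):
        # Score held by isolated nodes is spread uniformly over all nodes.
        dangling = sum(scores[j] for j in range(n) if strength[j] == 0)
        scores = [
            (1 - damping) / n
            + damping
            * (
                sum(
                    scores[j] * weights[j][i] / strength[j]
                    for j in range(n)
                    if strength[j] > 0
                )
                + dangling / n
            )
            for i in range(n)
        ]
    return scores


if __name__ == "__main__":
    demo = ["storm hits coast", "storm floods coast", "markets rally"]
    net = build_network(demo)
    print(degree_centrality(net))
    print(pagerank_centrality(net))
```

Either score list can then be used to rank sentences, with higher values indicating sentences that are more central to the document set.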

Keywords

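multi-document summarization; extraction-based summarization; sentence similarity network; network node centrality; sentence ranking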

Parallel Abstract

Purpose: Sentence extraction, a widely adopted summarization paradigm, aims to extract important sentences and compose them into a summary. Its foundation is assessing the importance of each sentence so that sentences can be ranked for extraction. This paper employs graph-based text analysis to model documents and investigates measures of graph-based centrality as sentence salience in summarization. Design/methodology/approach: This paper models documents on the same (or related) topic as a sentence similarity network, in which a sentence is regarded as a node and a relationship between two sentences exists only if they are semantically related. Several methods for evaluating the importance of a node (i.e., a sentence) in the network are then proposed, namely: (1) Degree Centrality; (2) Normalized Similarity-based Degree Centrality; (3) HITS Centrality; (4) PageRank Centrality; and (5) iSpreadRank Centrality. All are designed on the basis of the idea that the importance of a node is determined not only by the number of nodes to which it connects, but also by the importance of its connected nodes. For summary generation, CSIS (Cross-Sentence Information Subsumption) is employed to reduce redundancy while sentences are extracted according to the ranking produced from their centrality. Findings: The proposed summarization method was evaluated using the ROUGE evaluation suite on the DUC 2004 news stories collection. Experimental results show that, under the ROUGE-1 metric, the performance ranking is: iSpreadRank > Normalized Similarity-based Degree > PageRank > Degree > HITS. Another experiment, which combined sentence centrality with surface-level features, also produced results competitive with the best participant in the DUC 2004 evaluation.
Research limitations/implications: Directions for future research include: (1) moving beyond symbolic-level analysis to take semantics into account, such as synonymy, polysemy, and term dependency, when determining whether two sentences are semantically related; (2) investigating graph-based centrality measures developed in social network analysis for evaluating sentence salience in summarization; and (3) improving the cohesion and coherence of summaries using natural language processing techniques, such as sentence planning and generation. Practical implications: The proposed summarization method is unsupervised, so no training dataset is required. Since no domain-specific knowledge or deep linguistic analysis is exploited, the method is domain- and language-independent. However, because it performs no deep natural language analysis, considers no discourse structure, and involves no domain-specific knowledge, it may understand the input texts poorly and consequently produce poor summaries. Originality/value: The contributions of this work are threefold. First, this paper offers a sentence similarity network to model topic-related documents. Second, novel graph-based sentence ranking methods are explored to rank the importance of sentences for extraction. Finally, the proposed method has been shown to be effective in a case study on the DUC 2004 benchmark dataset.
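The summary generation step described in the abstract, extracting sentences in rank order while filtering redundancy, can be sketched as a greedy loop. This is only an illustration under stated assumptions: the word-overlap ratio below is a simple stand-in for CSIS, whose exact formulation is not given in the abstract, and `max_words`, `max_overlap`, and the function names are hypothetical.

```python
import re


def tokens(sentence):
    """Set of lowercase alphanumeric tokens in a sentence."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def overlap_ratio(a, b):
    """Assumed redundancy measure: shared words relative to the
    combined vocabulary size of the two sentences (0.0 to 1.0)."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return 2 * len(ta & tb) / (len(ta) + len(tb))


def extract_summary(sentences, scores, max_words=100, max_overlap=0.5):
    """Pick sentences by descending centrality score, skipping any that
    exceed the length budget or overlap too much with a chosen one."""
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen, length = [], 0
    for i in ranked:
        n_words = len(sentences[i].split())
        if length + n_words > max_words:
            continue  # does not fit in the remaining budget
        if any(
            overlap_ratio(sentences[i], sentences[j]) > max_overlap for j in chosen
        ):
            continue  # too similar to an already selected sentence
        chosen.append(i)
        length += n_words
    return [sentences[i] for i in chosen]
```

Here `scores` would be any of the sentence centrality values, one per sentence; the returned sentences are in rank order rather than document order.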

Parallel Keywords

multi-document summarization; extraction-based summarization; sentence similarity network; network-based sentence centrality; sentence ranking

