Extracting Multi-Document Summaries with a Sentence-Network Clustering Framework

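Information technology has developed rapidly in recent years, and the number of electronic documents has grown sharply. To spare readers from spending too much time absorbing a document's meaning, a summary built from important sentences extracted from the documents can help them grasp the content quickly. However, traditional extractive summarization methods judge a sentence only by whether it contains important terms and lack higher-level concepts such as topics; moreover, the extracted sentences do not give a reasonably comprehensive account of the whole news event. This study extracts multi-document summaries with a graph-based summarization method, one of the indicator representation approaches, which split documents into smaller fragments; this study uses sentences as the fragments. After building a graph-based relation network from these fragments, the method scores the nodes using clustering and several link analysis methods, factors each node's cluster weight into its score, and builds the summary from the selected nodes. The experiments use the DUC 2002 and TAC 2010 datasets to test system performance and ROUGE to measure summary quality. The results show that the proposed multi-document summarization method achieves reasonable quality across different summarization tasks. On DUC 2002, the ROUGE-1 scores for 50-word and 100-word multi-document summaries reach 0.2996 and 0.3412, performance close to that of the participants in that year's conference, and the ROUGE-1 score for 200-word multi-document summaries is 0.4559, a mid-level result. On the first part of the TAC 2010 Guided Summarization task, the ROUGE-1 score reaches 0.3513, exceeding all of that year's participants, and the ROUGE-2 score reaches 0.0707, also a mid-level result.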
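The pipeline described in the abstract (sentence graph, clustering, link analysis, cluster-weighted scoring, selection) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not the thesis's actual design: the cosine-similarity threshold, the use of connected components as a stand-in for the clustering step, weighted PageRank as the link analysis method, the formula that combines the link score with the cluster weight, and all function names.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase a sentence and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_graph(sentences, threshold=0.2):
    """Connect two sentence nodes when their similarity reaches the (assumed) threshold."""
    vectors = [Counter(tokenize(s)) for s in sentences]
    n = len(sentences)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            sim = cosine(vectors[i], vectors[j])
            if sim >= threshold:
                weights[i][j] = weights[j][i] = sim
    return weights


def connected_components(weights):
    """Stand-in clustering (assumption): label each node with its connected component."""
    n = len(weights)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            node = stack.pop()
            for other in range(n):
                if weights[node][other] > 0 and labels[other] == -1:
                    labels[other] = current
                    stack.append(other)
        current += 1
    return labels


def pagerank(weights, damping=0.85, iterations=50):
    """Weighted PageRank by power iteration, one possible link analysis score."""
    n = len(weights)
    out = [sum(row) for row in weights]
    rank = [1.0 / n] * n
    for _ in range(iterations):
        rank = [
            (1 - damping) / n
            + damping * sum(rank[j] * weights[j][i] / out[j] for j in range(n) if out[j] > 0)
            for i in range(n)
        ]
    return rank


def summarize(sentences, word_limit=100):
    """Score nodes, boost them by cluster weight, and pick sentences within the word budget."""
    if not sentences:
        return []
    weights = build_graph(sentences)
    labels = connected_components(weights)
    rank = pagerank(weights)
    sizes = Counter(labels)
    # Assumed combination: link score scaled by the cluster's share of all sentences.
    scores = [rank[i] * (1 + sizes[labels[i]] / len(sentences)) for i in range(len(sentences))]
    chosen, used = [], 0
    for i in sorted(range(len(sentences)), key=lambda k: scores[k], reverse=True):
        length = len(sentences[i].split())
        if used + length <= word_limit:
            chosen.append(i)
            used += length
    # Return the selected sentences in their original order.
    return [sentences[i] for i in sorted(chosen)]
```

Calling `summarize(sentences, word_limit=50)` on the pooled sentences of a document set would mimic a 50-word task. A sentence that shares no vocabulary with the others ends up in a singleton cluster with a low link score, so it is picked last.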

Keywords

Text mining; graph network; clustering methods; multi-document summarization

Parallel Abstract

Information technology has developed rapidly in recent years, and the number of electronic documents has increased with it. To keep readers from spending too much time working out the content of an article, it is useful to extract important sentences and assemble them into a summary. However, traditional extraction methods judge only whether a sentence contains important terms and do not use the concept of a topic. They also do not address the news event as a whole to give a comprehensive account. This paper uses a graph-based summarization method to extract multi-document summaries. It is one of the indicator representation approaches, which divide documents into smaller fragments; this study uses sentences as the fragments. After building a graph-based network from these fragments, this paper uses clustering and several link analysis methods to score the nodes. The cluster weight is then taken into account in the scoring, and the selected sentence nodes are used to build the summary. The experiments use the DUC 2002 and TAC 2010 datasets and evaluate summary quality with ROUGE. The results show that the method reaches reasonable quality on all tasks. On DUC 2002, the ROUGE-1 scores for 50-word and 100-word summaries reach 0.2996 and 0.3412, close to those of the peers in DUC 2002. On the first part of TAC 2010 Guided Summarization, the ROUGE-1 score reaches 0.3513, higher than the other peers, and the ROUGE-2 score reaches 0.0707, a medium-quality result.
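For readers unfamiliar with the metric, ROUGE-1 measures unigram overlap between a system summary and a reference summary. The following is a simplified sketch against a single reference, with a hypothetical function name; the official ROUGE toolkit additionally offers options such as stemming, stopword removal, and multiple references, and the abstract does not state which variant (recall, precision, or F-score) the reported figures use.

```python
import re
from collections import Counter


def unigrams(text):
    """Count lowercase word tokens in a text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def rouge_1(candidate, reference):
    """Simplified ROUGE-1 against one reference: clipped unigram overlap."""
    cand, ref = unigrams(candidate), unigrams(reference)
    # Counter intersection keeps the minimum count per token (clipping).
    overlap = sum((cand & ref).values())
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(cand.values()) if cand else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"recall": recall, "precision": precision, "f1": f1}
```

For the candidate "the storm flooded the coast" and the reference "the storm hit the coast on monday", four unigrams overlap ("the" twice, "storm", "coast"), giving a recall of 4/7 and a precision of 4/5.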

Parallel Keywords

Text mining; Graph-based network; Clustering method; Multi-document summarization


