
Multi-Document Summarization System Based on Mutual Reinforcement Principle

Advisor: 李嘉晃

Abstract

Research reports indicate that the rapid development of the Internet has caused the volume of digital documents, video, and other data produced each year to multiply year over year. To help users absorb this material efficiently, this thesis develops an automatic summarization system that condenses large collections of digital documents while preserving their original information, so that users can grasp the content quickly and effectively. The proposed system scores sentences from three perspectives as the basis for selecting summary sentences: (1) the relationship between words and sentences; (2) the relationship between the title and sentences; and (3) the relationship between sentences. Before scoring, the system uses the Alignment algorithm and the Mutual Reinforcement principle to remove low-information sentences from the dataset, so that these sentences cannot be selected for the summary. The three perspectives are implemented with the HITS algorithm, cosine similarity, and the PageRank algorithm, respectively. The dataset used in this thesis is the DUC dataset, an English dataset composed of news articles. Evaluation with the ROUGE tool shows that the summaries generated by the system achieve good performance.
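The three scoring perspectives named in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation. The tokenizer, the bag-of-words representation, the iteration counts, the damping factor of 0.85, and the equal-weight combination in the hypothetical `score_sentences` function are all assumptions made for illustration; the abstract does not say how the three scores are combined. The sketch also omits the Alignment-based filtering step that the thesis applies before scoring.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumerics (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(count * b[word] for word, count in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def normalize(xs):
    """Scale a list so it sums to 1; leave it unchanged if the sum is zero."""
    total = sum(xs)
    return [x / total for x in xs] if total else list(xs)


def hits_scores(bags, iters=50):
    """Word-sentence mutual reinforcement in the style of HITS (assumed form)."""
    vocab = set().union(*bags)
    word = dict.fromkeys(vocab, 1.0)
    sent = [1.0] * len(bags)
    for _ in range(iters):
        # A sentence is important if it contains important words ...
        sent = normalize([sum(c * word[w] for w, c in bag.items()) for bag in bags])
        # ... and a word is important if it appears in important sentences.
        totals = dict.fromkeys(vocab, 0.0)
        for score, bag in zip(sent, bags):
            for w, c in bag.items():
                totals[w] += c * score
        norm = sum(totals.values())
        word = {w: v / norm for w, v in totals.items()} if norm else totals
    return sent


def pagerank_scores(bags, damping=0.85, iters=50):
    """PageRank over a sentence graph weighted by cosine similarity."""
    n = len(bags)
    if n == 0:
        return []
    sim = [[cosine(bags[i], bags[j]) if i != j else 0.0 for j in range(n)]
           for i in range(n)]
    rank = [1.0 / n] * n
    for _ in range(iters):
        new = [(1.0 - damping) / n] * n
        for i in range(n):
            out = sum(sim[i])
            for j in range(n):
                # A sentence with no similar neighbours spreads its rank uniformly.
                share = sim[i][j] / out if out else 1.0 / n
                new[j] += damping * rank[i] * share
        rank = new
    return rank


def score_sentences(title, sentences, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Combine the three signals; equal weights are an assumption, not the thesis's choice."""
    bags = [Counter(tokenize(s)) for s in sentences]
    title_bag = Counter(tokenize(title))
    word_part = hits_scores(bags)
    title_part = normalize([cosine(title_bag, bag) for bag in bags])
    graph_part = pagerank_scores(bags)
    a, b, c = weights
    return [a * x + b * y + c * z
            for x, y, z in zip(word_part, title_part, graph_part)]
```

A possible use: call `score_sentences(title, sentences)` on the sentences of a document cluster and keep the highest-scoring sentences as the summary until a length budget is reached. In practice the weights would be tuned against ROUGE on held-out data rather than fixed.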