With the coming of the information age and the prevalence of the Internet, data has become easier to obtain than in the past, and the demand for reading documents online keeps growing. Consequently, the ability to quickly extract useful information from large amounts of data is increasingly important; document summarization is exactly this task of extracting useful information from large numbers of repetitive documents.

This thesis proposes a method for summarizing domain-specific web pages on the Internet and presenting the results. First, related web pages are stored, paragraph by paragraph, as a large raw corpus, which is preprocessed with computational-linguistics techniques: word segmentation, splitting of special characters, and character-encoding conversion. Next, similarity comparisons between paragraphs and between sentences identify paragraphs on similar topics and cluster similarly worded paragraphs into groups, and keywords are computed for each group. Finally, the sentences containing the most keywords are selected as the presented summary.

Experimental results show that the summaries achieve satisfaction scores above 80% for readability, completeness of information, fluency, and absence of redundancy, indicating that users find the presented summaries acceptable. In addition, the satisfaction survey shows that applying sentence segmentation and threshold weighting to the generated summaries helps improve the summarization results.
With the advent of the information age and the prevalence of the Internet, information is more accessible than in years past. Selecting suitable content from this massive amount of material has become crucial, and document summarization has therefore grown in importance, since it can extract usable information from large volumes of data. This thesis proposes a process for summarizing domain-specific web pages on the Internet and presenting the results to users. The proposed system consists of four major steps. First, it collects the original corpus, organized by the distinct paragraph contents of the relevant web pages. Second, it preprocesses the corpus with computational-linguistics methods, including word segmentation and tagging. Third, it measures the similarity between paragraphs and between sentences to cluster similarly worded topic paragraphs into groups. Finally, it extracts the keywords of each group, and the sentences containing the most keywords become the summary. The results show that the system achieves satisfaction scores of 89-90% for readability, fluency, comprehensiveness, and non-redundancy of the extracted information, indicating that the summarization system is acceptable to users.