With the coming of the information age and the prevalence of the Internet, data has become easier to obtain than in the past, and the demand for reading documents online keeps growing. Consequently, the ability to quickly extract useful information from large amounts of data is increasingly important; document summarization is exactly this task of extracting useful information from large numbers of repetitive documents.

This thesis proposes a method for summarizing domain-specific web pages on the Internet and presenting the results. First, related web pages are stored, paragraph by paragraph, as a large raw corpus, which is preprocessed with computational-linguistics techniques: word segmentation, splitting of special characters, and character-encoding conversion. Next, similarity comparisons between paragraphs and between sentences identify paragraphs on similar topics and cluster similarly worded paragraphs into groups, and keywords are computed for each group. Finally, the sentences containing the most keywords are selected as the presented summary.

Experimental results show that the summaries achieve satisfaction scores above 80% for readability, completeness of information, fluency, and absence of redundancy, indicating that users find the presented summaries acceptable. In addition, the satisfaction survey shows that applying sentence segmentation and threshold weighting to the generated summaries helps improve the summarization results.
With the advent of the information age and the prevalence of the Internet, information is more accessible than in years past. Selecting suitable content from this massive amount of material has become crucial, and document summarization has therefore grown in importance, since it can extract usable information from large volumes of data. This thesis proposes a process for summarizing domain-specific web pages on the Internet and presenting the results to users. The proposed system consists of four major steps. First, it collects the original corpus, organized by the distinct paragraph contents of the relevant web pages. Second, it preprocesses the corpus with computational-linguistics methods, including word segmentation and tagging. Third, it measures the similarity between paragraphs and between sentences to cluster similarly worded topic paragraphs into groups. Finally, it extracts the keywords of each group, and the sentences containing the most keywords become the summary. The results show that the system achieves satisfaction scores of 89-90% for readability, fluency, comprehensiveness, and non-redundancy of the extracted information, indicating that the summarization system is acceptable to users.