  • Thesis


Summarizing Unstructured Documents Automatically based on Content Analysis and Hierarchical Clustering

Advisor: 賀嘉生


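When browsing large volumes of documents, the summaries produced by an automatic summarization system can save users much reading time. A summary consists of sentences that convey the content of a document, whether they are extracted from the text or rewritten from it. A good summary must cover the document's content completely and let the user grasp that content in a short time. A sentence is usually composed of nouns, pronouns, verbs, adjectives, adverbs, interjections, prepositions, and conjunctions; nouns usually carry the most meaning and can be modified by verbs, adjectives, or other nouns. In this study, part-of-speech tagging is used to analyze sentence composition and thereby identify important keywords. Keywords usually have implicit relations with one another, and these relations typically arise from two words co-occurring in a sentence. Hierarchical clustering groups similar sentences into clusters, and the clustering result is then passed to Formal Concept Analysis to find the implications and hierarchical relations among keywords. The position of a sentence in the document also affects its importance, so position must likewise be taken into account when computing sentence importance. The system computes the importance (weight) of each sentence from its number of keywords, its number of implied keywords, and its length, then adjusts the weight according to where the sentence appears in the document; finally, the system selects the sentences with high weight and presents them to the user as the summary. In this thesis, semantic analysis, text analysis, Formal Concept Analysis, and information retrieval techniques are combined to develop an effective summarization system with good accuracy (precision and recall) and user satisfaction.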
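The weighting step described in the abstract can be illustrated with a minimal Python sketch. The abstract names the inputs (keyword count, implied-keyword count, sentence length, position) but not the formula, so the combination below, the `0.5` coefficients, and the names `Sentence`, `sentence_weight`, and `summarize` are all assumptions made for illustration, not the thesis's actual method.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Sentence:
    text: str
    keywords: set[str]  # keywords found in the sentence, e.g. nouns from a POS tagger
    position: int       # 0-based index of the sentence in the document


def implied_keywords(keywords: set[str], implications: dict[str, set[str]]) -> set[str]:
    """Keywords reachable through implication rules but not already present."""
    implied: set[str] = set()
    for kw in keywords:
        implied |= implications.get(kw, set())
    return implied - keywords


def sentence_weight(sentence: Sentence, implications: dict[str, set[str]],
                    total_sentences: int) -> float:
    n_words = len(sentence.text.split())
    if n_words == 0:
        return 0.0
    n_kw = len(sentence.keywords)
    n_implied = len(implied_keywords(sentence.keywords, implications))
    # Assumed base score: keyword evidence normalised by sentence length.
    base = (n_kw + 0.5 * n_implied) / n_words
    # Assumed position adjustment: earlier sentences receive a larger boost.
    relative_pos = sentence.position / max(total_sentences - 1, 1)
    return base * (1.0 + 0.5 * (1.0 - relative_pos))


def summarize(sentences: list[Sentence], implications: dict[str, set[str]],
              k: int = 2) -> list[str]:
    """Pick the k highest-weight sentences and return them in document order."""
    ranked = sorted(sentences,
                    key=lambda s: sentence_weight(s, implications, len(sentences)),
                    reverse=True)
    chosen = sorted(ranked[:k], key=lambda s: s.position)
    return [s.text for s in chosen]
```

Returning the selected sentences in document order, rather than in weight order, is a design choice that keeps the extract readable; the abstract does not say which order the system uses.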


A text summarization system can save users time when they read large numbers of documents. A summary usually consists of meaningful sentences that represent the content of the text. It should cover the whole document and help the user understand it quickly. A sentence is usually composed of verbs, nouns, pronouns, adjectives, adverbs, prepositions, conjunctions, and interjections. Nouns usually carry the most information in a document and are modified by verbs, adjectives, and other nouns. In this research, a part-of-speech tagger is introduced to analyze the sentences in a document and find important keywords. The relations between keywords usually come from their co-occurrence in the document. This study clusters sentences with a hierarchical clustering method and applies Formal Concept Analysis to find the implications between keywords. The position at which a sentence appears in the document also influences its importance. Finally, the system selects the sentences that represent the document according to the weight of keywords, the implications between keywords, and sentence position. We present an automatic text summarization system that extracts important keywords from a document automatically and offers a short summary representing it. The system achieves high precision and recall, and users are satisfied with the resulting summaries.
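As a rough illustration of the clustering and implication steps, the following Python sketch groups sentences by keyword overlap and then derives simple keyword implications. The abstract does not specify the distance measure, linkage rule, or stopping criterion, so Jaccard distance, average linkage, and the `threshold` parameter are assumptions. Likewise, `keyword_implications` computes only single-premise attribute implications (keyword a implies keyword b when every keyword set containing a also contains b), which is a simplified stand-in for a full Formal Concept Analysis. All function names are hypothetical.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; two empty sets are treated as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def cluster_sentences(keyword_sets, threshold=0.7):
    """Average-linkage agglomerative clustering over sentence keyword sets.

    Returns clusters as lists of sentence indices. Merging stops once the
    closest pair of clusters is farther apart than `threshold`.
    """
    clusters = [[i] for i in range(len(keyword_sets))]

    def linkage(c1, c2):
        dists = [jaccard_distance(keyword_sets[i], keyword_sets[j])
                 for i in c1 for j in c2]
        return sum(dists) / len(dists)

    while len(clusters) > 1:
        # Find the closest pair of clusters (i < j is guaranteed by combinations).
        (i, j), best = min(
            (((i, j), linkage(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda pair: pair[1],
        )
        if best > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


def keyword_implications(contexts):
    """Map each keyword a to the keywords b that appear wherever a appears."""
    attributes = set().union(*contexts) if contexts else set()
    implications = {}
    for a in attributes:
        having_a = [c for c in contexts if a in c]
        implied = set.intersection(*having_a) - {a}
        if implied:
            implications[a] = implied
    return implications
```

The returned mapping has the same shape a sentence-weighting step could consume when counting implied keywords. Whether the implications are computed over individual sentences or over whole clusters is left open here; the function accepts any list of keyword sets.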


