A Novel Methodology for Automated Ontology-Based Patent Document Summarization

Advisor: 張瑞芬


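According to the World Intellectual Property Organization (WIPO, 1996), patent information contains 90% to 95% of the world's commercialized R&D results. Compared with other technical reports and journal articles, which contain only 5% to 10% of core technologies, patent documents are the only knowledge documents that fully disclose core technologies. A WIPO survey shows that a company that makes good use of patent information can save 40% of its R&D costs and shorten its R&D schedule by 60%. Patent documents therefore play an extremely important role in the era of the knowledge economy. However, as the number of patent documents grows rapidly, people can no longer read, organize, and fully understand them effectively. Patent documents also contain many technical and legal terms, which makes them even harder to read. How to effectively organize, understand, and extract important information from patent documents has thus become an important topic in knowledge management. In this thesis, we propose an ontology-based intelligent system for automatic patent document summarization, and we test its effectiveness on knowledge documents from the power hand tool and chemical mechanical polishing domains. First, the system uses predefined ontology tree structures for power hand tools and chemical mechanical polishing, together with a TF-IDF-based technique, to extract domain keywords and high-frequency terms from the patent document. Building on the extracted keywords, it mines the important terms in the content, then applies a recursive algorithm to extract important multi-word phrases and merges duplicated information. Next, the K-means clustering algorithm groups the paragraphs so that paragraphs sharing the same concept or topic are gathered together. Then, all of the previously extracted key terms are used to measure the information importance of each paragraph cluster and to select candidate summary paragraphs. Finally, the candidate summary is combined with a pre-built template to produce a text summary. Besides the text summary, the system marks and annotates the words in the document that correspond to nodes of the ontology tree and produces a visualized, graphical tree-form summary.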
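
The keyword-extraction step described above can be illustrated with a minimal sketch. It is not the thesis implementation: the flat `ONTOLOGY_TERMS` set is a hypothetical stand-in for the ontology tree, the function names and the `top_k` parameter are assumptions made for illustration, and the TF-IDF weighting uses the common tf × log(N/df) form, which the abstract does not specify. The recursive multi-word phrase extraction is not covered.

```python
import math
import re
from collections import Counter

# Hypothetical stand-in for a domain ontology: a flat set of concept labels.
ONTOLOGY_TERMS = {"motor", "chuck", "gear", "housing"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_scores(paragraphs):
    """Return one {term: tf-idf} dict per paragraph (tf * log(N / df))."""
    docs = [Counter(tokenize(p)) for p in paragraphs]
    n = len(docs)
    # Document frequency: in how many paragraphs each term appears.
    df = Counter(term for d in docs for term in d)
    scores = []
    for d in docs:
        total = sum(d.values())
        scores.append(
            {t: (c / total) * math.log(n / df[t]) for t, c in d.items()}
        )
    return scores


def extract_keywords(paragraphs, top_k=3):
    """Union of ontology matches and the top_k TF-IDF terms per paragraph."""
    keywords = set()
    for text, scores in zip(paragraphs, tfidf_scores(paragraphs)):
        keywords |= ONTOLOGY_TERMS & set(tokenize(text))
        # Sort by descending score; break ties alphabetically.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        keywords |= set(ranked[:top_k])
    return keywords
```

Terms that occur in every paragraph (such as "the") receive a TF-IDF score of zero under this weighting, so they only enter the keyword set when `top_k` is large enough to reach them. A real system would also need stop-word handling and the multi-word phrase step.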


According to the World Intellectual Property Organization (WIPO), patent documents are the only type of document that fully discloses core techniques: they contain 90% to 95% of commercialized R&D achievements, compared with a disclosure rate of 5% to 10% for other types of documents (e.g., technical reports and journal articles). A WIPO investigation found that a company that makes the best use of patent information can cut its R&D costs by 40% and shorten its R&D time by 60%. Patent information therefore plays an important role in the era of the knowledge-based economy. However, the number of patent documents is increasing dramatically, and most researchers cannot process, organize, and understand them effectively. Moreover, the large technical and legal vocabulary of patent documents makes them increasingly difficult for researchers to understand fully. In this thesis, we propose an ontology-based key-phrase recognition technology for constructing an automated summarization system. Patents in the Power Hand Tool and Chemical Mechanical Polishing domains are used to verify the effectiveness of the proposed summarization system. First, the system extracts domain key words using a pre-defined ontology and uses the TF-IDF method to extract high-frequency terms. Second, a clustering algorithm, K-means, groups content with similar concepts together. Third, candidate paragraphs are selected from each cluster by using the key words and phrases to measure the importance of every paragraph in the cluster. Finally, the candidate paragraphs are combined with a template defined in advance to generate the text summary. In addition, the system marks, annotates, and highlights the nodes of the ontology tree that correspond to words in the document and produces a visualized form of the summary.
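
The clustering and candidate-selection steps can be sketched as follows. This is an illustrative sketch under stated assumptions, not the system described in this thesis: paragraphs are represented as plain bag-of-words count vectors, K-means is a basic Lloyd's iteration seeded with the first `k` vectors so that the result is deterministic, and paragraph importance is approximated by counting key-word occurrences. All function names and parameters are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def vectorize(paragraphs):
    """Bag-of-words count vectors over a shared, sorted vocabulary."""
    counts = [Counter(tokenize(p)) for p in paragraphs]
    vocab = sorted(set().union(*counts))
    return [[c[t] for t in vocab] for c in counts]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20):
    """Basic Lloyd's algorithm; requires k <= len(vectors).

    Assumption: the first k vectors seed the centroids (for determinism).
    """
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: sq_dist(v, centroids[j]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def select_candidates(paragraphs, keywords, k=2):
    """Per cluster, pick the paragraph with the most key-word occurrences."""
    labels = kmeans(vectorize(paragraphs), k)
    best = {}  # cluster label -> (score, paragraph index)
    for idx, (text, lab) in enumerate(zip(paragraphs, labels)):
        score = sum(1 for tok in tokenize(text) if tok in keywords)
        if lab not in best or score > best[lab][0]:
            best[lab] = (score, idx)
    # Return the winners in their original document order.
    return [paragraphs[i] for i in sorted(idx for _, idx in best.values())]
```

Seeding with the first `k` vectors keeps the sketch reproducible but is a poor choice in general; a practical system would more likely use randomized or k-means++ initialization and TF-IDF-weighted vectors. The final template-filling step is omitted because the abstract does not describe the template.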


