
An automatic classification algorithm for a large number of patent categorization (大量專利類別自動分類演算法研究)

Advisor: 陳彥良

Abstract

An automatic patent categorization system can quickly identify possible conflicts with existing patents, saving inventors and patent attorneys much of the time and cost of manual comparison. In recent years it has become increasingly common to classify patent documents using the International Patent Classification (IPC), a complex hierarchical system comprising 8 sections, 128 classes, 648 subclasses, about 7,200 main groups, and approximately 72,000 subgroups. Although some studies have addressed automatic IPC classification, no existing method is suited to classifying patents at the subgroup level, the bottom level of the IPC. This dissertation therefore presents a novel method, the three phase categorization (TPC) algorithm, which classifies patents down to the subgroup level with reasonable accuracy. The method consists of three phases: the first two use support vector machines (SVM) to predict candidate categories, and the last uses a clustering algorithm to determine the final subgroup label. In experiments on the WIPO-alpha collection from the World Intellectual Property Organization (WIPO), the TPC algorithm achieves 36.07% accuracy at the subgroup level, approximately a 26,020-fold improvement over randomly guessing a subgroup label. In addition, we collected 96,654 patent documents that do not overlap with the WIPO-alpha collection and combined them with it; on this combined collection, the accuracy rises to 38.01%.
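The fold-improvement figure in the abstract can be checked with a short calculation. The sketch below assumes that "random guess" means picking one subgroup label uniformly at random; the function names are illustrative and do not come from the thesis. With the rounded count of 72,000 subgroups the ratio comes out near 25,970, so the reported 26,020 would correspond to a slightly larger exact subgroup count of roughly 72,100, which is consistent with "approximately 72,000".

```python
# Figures quoted in the abstract.
TPC_ACCURACY = 0.3607        # reported subgroup-level accuracy
APPROX_SUBGROUPS = 72_000    # "approximately 72,000 subgroups" (rounded)
REPORTED_FOLD = 26_020       # reported improvement over a random guess


def fold_improvement(accuracy: float, num_labels: int) -> float:
    """Ratio of an accuracy to the chance of a uniform random guess.

    Assumption: a random guess picks one of num_labels labels uniformly,
    so its chance of being correct is 1 / num_labels.
    """
    random_guess = 1.0 / num_labels
    return accuracy / random_guess


def implied_label_count(accuracy: float, fold: float) -> float:
    """Label count for which a uniform random guess yields the given fold."""
    return fold / accuracy


if __name__ == "__main__":
    # Fold improvement using the rounded subgroup count from the abstract.
    print(round(fold_improvement(TPC_ACCURACY, APPROX_SUBGROUPS)))
    # Subgroup count implied by the reported fold improvement.
    print(round(implied_label_count(TPC_ACCURACY, REPORTED_FOLD)))
```

The small gap between the two numbers comes only from rounding the subgroup count; it does not affect the reported accuracy.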

