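Author: 吳乙青 (Goh Ee Cheng)
Thesis title: 應用文字探勘技術建置專利自動分類系統
Using Text Mining Technology to Construct the Automatic Patent Classification System
Advisor: 陳灯能 (Chen, Deng-Neng)
Degree: Master
Department: College of Management - Department of Management Information Systems
Academic year of graduation: 107 (ROC calendar, i.e. 2018-2019)
Language: English
Number of pages: 58
Chinese keywords: Patent, IPC classification code, Support Vector Machine, Naive Bayes, XGBoost, Random Forest
English keywords: Patent, IPC, XGBoost, SVM, Naive Bayes, Random Forest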
DOI URL: http://doi.org/10.6346/NPUST201900329
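  • Patent applications are growing rapidly, and every day many inventors file applications with patent offices in countries and regions around the world. To make patent research more efficient, every patent is assigned to a category, so patent classification has long been one of the most important topics in research and practice. Many studies have focused on automatic classification systems for English patents and have overlooked the importance of Chinese patents. The number of Chinese patents has been growing gradually, and Chinese patents also offer a view of current technology in Asian countries. An automatic patent classification system can quickly compare and identify possible conflicts with existing patents, saving inventors and patent examiners much of the cost and time of manual comparison, which makes it a valuable research subject. In recent years, the International Patent Classification (IPC), a complex hierarchical classification system, has become increasingly common for classifying patent documents. This study therefore classifies Chinese patents by IPC code using Chinese word segmentation and text-mining techniques including support vector machines, random forest, XGBoost, and Naive Bayes. In the experiments on Chinese patents in Section H from the Taiwan patent office, XGBoost reached a precision of up to 93.52%.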

    The number of patents granted has increased rapidly, and every day many inventors file patent applications with patent offices in different countries and regions. To make patent research more efficient, every patent is classified, so patent classification has long been one of the most important topics in both research and practice. Many studies have focused on automatic classification of English patents and have overlooked the importance of Chinese patents, which allow us to better understand current technology in Asian countries. An automated patent classification system can quickly identify possible conflicts with existing patents, saving inventors and patent examiners labor costs and time, which makes it a valuable subject of study. In recent years, the International Patent Classification (IPC), a complex hierarchical classification system, has become increasingly common for classifying patent documents. In this thesis, we apply text-mining techniques, combining Chinese word segmentation with SVM, Random Forest, XGBoost, and Naive Bayes, to classify Chinese patent documents in Section H. These techniques also make it possible to identify similar patents within the same category. In the experiments, XGBoost achieves 93.52% precision at the class level on Section H of the Chinese patent documents.
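
    The classification step described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: it uses a pure-Python multinomial Naive Bayes (one of the four classifiers named above), substitutes character bigrams for a real Chinese word segmenter, and trains on four invented patent titles labelled with the IPC classes H01 and H04. The names `bigrams`, `NaiveBayes`, and `precision`, and all of the toy data, are assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict


def bigrams(text):
    """Overlapping character bigrams; an assumed stand-in for word segmentation."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.token_counts[label].update(bigrams(doc))
        self.vocab = {t for c in self.token_counts.values() for t in c}
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # log prior + sum of smoothed log likelihoods
            score = math.log(n_docs / total_docs)
            for tok in bigrams(doc):
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def precision(y_true, y_pred, label):
    """Precision for one class: true positives / all predictions of that class."""
    hits = [t for t, p in zip(y_true, y_pred) if p == label]
    return sum(t == label for t in hits) / len(hits) if hits else 0.0


# Invented toy titles (not from the thesis dataset), labelled at IPC class level.
train_docs = ["半導體裝置及其製造方法", "半導體封裝結構",
              "無線通訊方法及裝置", "無線網路通訊系統"]
train_labels = ["H01", "H01", "H04", "H04"]
model = NaiveBayes().fit(train_docs, train_labels)

test_docs = ["半導體元件", "無線通訊裝置"]
test_labels = ["H01", "H04"]
predictions = [model.predict(d) for d in test_docs]
print(predictions, precision(test_labels, predictions, "H01"))
```

    In a fuller pipeline, a dedicated segmenter and a stronger classifier would replace the bigram tokenizer and the hand-written model; the per-class precision function is the same measure regardless of which classifier produces the predictions.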

    Abstract (Chinese) I
    Abstract III
    Acknowledgements V
    Table of Contents VI
    List of Tables IX
    List of Figures X
    1. Introduction 1
    1.1 Background and Motivation 1
    1.2 Purpose 4
    1.3 Structure of the Thesis 6
    2. Literature Review 7
    2.1 Patent 7
    2.1.1 Patent Classification 8
    2.1.2 International Patent Classification (IPC) 9
    2.2 Natural Language Processing 11
    2.2.1 Chinese Word Segmentation 12
    2.2.2 FoolNLTK 14
    2.2.3 Jieba 15
    2.4 Word Embedding 17
    2.4.1 Vector Space Model 17
    2.4.2 fastText 18
    2.4.3 Word2vec 19
    2.5 Text Classification 22
    2.5.1 Support Vector Machine (SVM) 22
    2.5.2 Random Forest 23
    2.5.3 XGBoost 24
    2.5.4 Naive Bayes 26
    2.6 K-fold Cross-Validation 26
    3. Methodology 29
    3.1 Model Architectures 29
    3.2 Method 30
    3.2.1 Data Collection and Formatting 30
    3.2.2 Domain-Specific Thesaurus Generation 33
    3.2.3 Preprocessing 35
    3.2.4 Model Training 38
    3.2.5 Classification Model Evaluation 39
    4. Performance Evaluation 40
    4.1 System Environment 40
    4.2 Preparation of datasets 41
    4.3.1 Experiment 1: The Performance of Classification in the Description 45
    4.3.2 Experiment 2: The Performance of Classification in the combination of Description and Brief 46
    4.3.3 Experiment 3: The Performance of Classification in the combination of Description and Title 47
    4.3.4 Experiment 4: The Performance of Classification in the combination of Description, Brief, and Title 48
    4.4 Brief Comments on the Results 48
    5. Conclusions 50
    5.1 Research Result 50
    5.2 Research Contribution 51
    5.3 Research Limitations 52
    5.4 Recommendations for further studies 52
    6. References 54


    Off-campus public release: 2024/08/06