  • Thesis

建立專利資料之向量空間模型以支援跨語言檢索

Building Vector Space Model for Patent Data to Support Cross-Language Retrieval

Advisor: 陳彥良

Abstract


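Documents are unstructured data containing text and figures, and most carry no class labels. The vector space model (VSM) is a common way to represent documents, but traditional approaches have two problems. First, when selecting important terms as the features that form the vector basis, they consider only whether a term is the most discriminative within one specific document collection. Second, when applied to documents that carry class labels, they consider a term's ability to discriminate between classes only over a flat label structure.

To address these two problems, this research sets three objectives. Objective 1: design a new method that, when selecting the most representative features, takes into account how the features relate within hierarchical class labels. Objective 2: design a new method, an IPC-based (International Patent Classification) vector model, that uses features other than terms so that the resulting vector model represents documents more effectively. Objective 3: refine the IPC-based vector model so that it applies in multilingual settings, giving it broader use.

For Objective 1, an experiment tested whether taking the hierarchical relationships among class labels into account selects terms with greater discriminative and representational power. The results show that a vector model whose features are chosen by proportional selection achieves higher coverage, whereas one whose features are chosen by weighted-sum selection achieves higher accuracy. For Objective 2, another experiment tested whether using IPC codes as the vector basis improves performance. The results indicate that the IPC-based index term selection method achieves higher accuracy and satisfaction. Finally, for Objective 3, an experiment tested the performance of the cross-language patent document matching method. The experimental and evaluation results show that the IPC-based concept bridge outperforms traditional methods.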
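
For orientation, the traditional term-based vector space model that the abstract takes as its starting point can be sketched roughly as below. This is a generic TF-IDF and cosine-similarity illustration, not the thesis's own feature selection method; the function names `tfidf_vectors` and `cosine` and the three toy documents are assumptions made up for this example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (term -> weight dicts) for tokenized documents."""
    n = len(docs)
    # Document frequency: how many documents contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Weight = relative term frequency * log inverse document frequency.
        vectors.append({t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy, hypothetical documents (already tokenized).
docs = [
    ["battery", "electrode", "lithium"],
    ["battery", "charger", "circuit"],
    ["antenna", "signal", "circuit"],
]
vecs = tfidf_vectors(docs)

# Documents 0 and 1 share the term "battery"; documents 0 and 2 share nothing.
print(cosine(vecs[0], vecs[1]) > cosine(vecs[0], vecs[2]))
```

Note that the term weights here depend only on one flat document collection, which is the limitation the abstract describes; nothing in this sketch uses class labels or their hierarchy.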

Parallel Abstract


Documents are unstructured data containing text and diagrams, and most of them carry no class labels. Vector space model (VSM) methods are commonly used to represent documents, but they have two problems. First, when selecting important terms as the features that form the vector basis, they consider only a term's discriminative ability within a specific set of documents. Second, when a term occurs in documents with class labels, they consider its discriminative ability among class labels only in a flat structure. To address these problems, this research pursues three objectives. The first is to design a new approach that selects the most representative features (i.e., terms) to form a VSM while taking hierarchical class labels into account. The second is to design a new method that builds an IPC-based VSM, using features other than terms to represent documents more effectively. The third is to refine the IPC-based VSM so that it adapts to multilingual settings as an extended use. For the first objective, an experiment tested whether considering the hierarchical relations among class labels can sift out terms with greater representative and discriminative ability for representing patent documents. The results show that a VSM whose features are chosen by proportional selection has higher coverage, whereas a VSM whose features are chosen by weighted-sum selection has higher accuracy. For the second objective, another experiment examined whether using IPC codes as the indexing vocabulary improves the performance of retrieving similar documents. The results indicate that the IPC-based indexing vocabulary selection method achieves higher accuracy and is more satisfactory.
Finally, for the third objective, an experiment tested the performance of the proposed solution for cross-language patent document matching. The experimental and evaluation results demonstrate that the proposed IPC-based concept bridge outperforms the traditional methods.
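
One way to picture how IPC codes could serve as a language-independent vector basis is sketched below. The abstract does not describe the thesis's actual construction, so everything here is an assumption for illustration: the helper names `term_ipc_profile` and `to_ipc_vector`, the co-occurrence weighting, and the toy English and Chinese training data are all hypothetical. The idea shown is only that documents in different languages become comparable once each is projected onto the same IPC-code axes.

```python
import math
from collections import Counter, defaultdict

def term_ipc_profile(labeled_docs):
    """Count term/IPC-code co-occurrences for labeled documents in ONE language.

    labeled_docs: list of (tokens, ipc_codes) pairs.
    """
    profile = defaultdict(Counter)
    for tokens, ipc_codes in labeled_docs:
        for term in set(tokens):
            for code in ipc_codes:
                profile[term][code] += 1
    return profile

def to_ipc_vector(tokens, profile):
    """Project a tokenized document onto IPC-code axes (assumed weighting)."""
    vec = Counter()
    for term in tokens:
        counts = profile.get(term)
        if not counts:
            continue  # term never seen with any IPC code
        total = sum(counts.values())
        for code, c in counts.items():
            vec[code] += c / total  # share of this term's evidence for the code
    return vec

def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy, hypothetical training data: H01M (batteries) and H01Q (antennas).
en_profile = term_ipc_profile([
    (["battery", "electrode"], ["H01M"]),
    (["antenna", "signal"], ["H01Q"]),
])
zh_profile = term_ipc_profile([
    (["電池", "電極"], ["H01M"]),
    (["天線", "訊號"], ["H01Q"]),
])

# An English query and two Chinese documents, all mapped to IPC space.
query = to_ipc_vector(["battery", "electrode"], en_profile)
zh_battery = to_ipc_vector(["電池", "電極"], zh_profile)
zh_antenna = to_ipc_vector(["天線", "訊號"], zh_profile)

# The query should match the Chinese battery document more closely.
print(cosine(query, zh_battery) > cosine(query, zh_antenna))
```

No translation dictionary appears in this sketch; the shared IPC codes alone link the two vocabularies, which is the intuition behind using a classification scheme as a bridge between languages.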
