Folding-in、SVD-Updating、Folding-up、Eigenvalueanalysis演算法在改善隱含語意索引於大型語料庫之效能比較

隱含語意索引(Latent semantic indexing)為一種普遍的資訊檢索的技術，藉由著奇異值分解(Singular value decomposition, SVD)將詞彙-文件矩陣(term-document matrix)的維度縮減並藉此濾除文件中之雜訊，以增強檢索之準確度。LSI除了應用在資訊檢索上，影像檢索、Cluster、跨語言資訊檢索(Cross-Language Information Retrieval, CLIR)以及Crawler都可以使用LSI的技術。然而LSI受限於無法處理大量資料以及SVD運算時間過長的缺點使得無法普遍的被應用。本篇論文探討了先前先進針對LSI的缺失所提出的演算法進行實作與互相比對。首先針對無法處理大量資料的問題，參考了folding in、SVD-updating與folding up這三種演算法；針對SVD處理運算時間過長的問題參考了Eigenvalue analysis的方式。然而本篇論文將這幾種演算法進行互相的配對進行準確度與執行時間的比較。在進行比較的對象我們採用MED、CISI、CRAN三個小型的語料庫；大語料庫則使用NTCIR所提供的CIRB040r的大型語料庫以及我們從中文維基百科下載的維基語料。從這些實驗中希望得到在不同的演算法搭配在各種語料庫上是否能得到所期望的效能。

關鍵字

folding in ； SVD-updating ； folding up ；奇異值分解；隱含語意索引； Eigenvalue analysis

並列摘要

Latent semantic indexing (Latent semantic indexing) is a general information retrieval technology, which by singular value decomposition (Singular value decomposition, SVD) the terms- document matrix (term-document matrix) and to reduce the document dimensions and noise to enhance the retrieval accuracy. In addition to the application of LSI on information retrieval, image retrieval, Cluster, cross-language information retrieval (Cross-Language Information Retrieval, CLIR) and the Crawler can use LSI technology. However, LSI is subject to large amounts of data can not be processed, as well as computing time SVD shortcomings made it impossible to be applied generally. This paper discusses the previous lack of advanced LSI for the proposed implementation of the algorithms with each other than the right. First of all, can not be processed for a large number of information, with reference to folding in, SVD-updating with the folding up of the three algorithms; SVD processing algorithms for the problem of too much time with reference to the way Eigenvalue analysis. However, this paper several algorithms to match each other for accuracy and execution time comparison. Carrying out the object of comparison we use MED, CISI, CRAN three small corpus; NTCIR large corpus is used for large-scale corpus CIRB040r, as well as from the Chinese Wikipedia Wikipedia corpus downloaded. From these experiments I hope to be in different algorithms with a variety of corpora on the availability of the desired performance.

並列關鍵字

LSI ； Eigenvalue analysis ； SVD-updating ； folding up ； folding in ； SVD

參考文獻

[2] Gavin W. O''Brien. “Information management tools for updating an SVD-encoded indexing scheme.”, University of Tennessee, Knoxville, TN, 1994.

[7] Jing Gao, Jun Zhang, “Clustered SVD strategies in latent semantic indexing.” Information Processing and Management, 2005.

[8] Johanna Geis. Latent Semantic Indexing and Information Retrieval A quest with BosSE. 2006.

[9] Bruce Hendrickson, “Latent semantic analysis and Fiedler retrieval.” Linear Algebra and ITS Applications, 2006.

[11] April Kontostathis, "Essential Dimensions of Latent Semantic Indexing (LSI)." Proceedings of the 40th Hawaii International Conference on System Sciences, 2007.

國際替代計量

Folding-in、SVD-Updating、Folding-up、Eigenvalueanalysis演算法在改善隱含語意索引於大型語料庫之效能比較

主題瀏覽