LSI-based Document Retrieval

Latent Semantic Indexing (LSI) is a retrieval technique that applies Singular Value Decomposition (SVD) to map each document vector into a lower-dimensional space, so that documents are matched by concept rather than by exact term. LSI has been shown to outperform traditional lexical search methods and to mitigate the problems of synonymy and polysemy. Our aims were to construct an LSI model to support the retrieval process and to propose potential uses of LSI in education. We used five test collections, two Chinese and three English, to verify our LSI model. The standard test collection MED was used to verify the correctness of our system, and the ERIC and English educational abstract collections were used to test the feasibility of LSI on educational materials; in addition, two Chinese test collections were used to examine the usability of LSI on Chinese documents. Our major concerns in the tests were term weighting, stemming, the number of reduced dimensions, and relevance feedback. The results showed that the LSI model worked well not only for English documents but also for character-based Chinese documents, and that the LSI method could effectively group semantically related documents. The better-performing weighting schemes were log idf, log entropy, log gfidf, tf idf, and tf gfidf. The results also indicated a significant improvement in retrieval after stemming. Relevance feedback worked well under different weighting ratios. The best number of dimensions for the ERIC documents was around 50 or 60. We conclude that LSI is a suitable model for retrieving relevant documents.

Keywords

latent semantic indexing; information retrieval; singular value decomposition; relevance feedback
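To make the pipeline described in the abstract concrete, the following is a minimal sketch of LSI retrieval in Python with NumPy. It is not the authors' system: the toy vocabulary, the four tiny documents, the choice of k = 2, the function names, and the feedback weights alpha and beta are all assumptions made for illustration. It uses log entropy weighting (one of the schemes the abstract reports as performing well), a truncated SVD, the common query fold-in q^T U_k S_k^{-1}, cosine ranking, and a simple Rocchio-style relevance feedback step.

```python
import numpy as np

def log_entropy_weight(counts):
    """Weight a term-by-document count matrix with log entropy.

    Returns the weighted matrix and the per-term global (entropy) weights.
    Assumes more than one document, so that log(n_docs) is non-zero.
    """
    counts = np.asarray(counts, dtype=float)
    n_docs = counts.shape[1]
    gf = counts.sum(axis=1, keepdims=True)  # global frequency of each term
    p = np.divide(counts, gf, out=np.zeros_like(counts), where=gf > 0)
    # p * log(p), with 0 * log(0) treated as 0
    plogp = np.where(p > 0, p * np.log(np.where(p > 0, p, 1.0)), 0.0)
    entropy = 1.0 + plogp.sum(axis=1, keepdims=True) / np.log(n_docs)
    return np.log1p(counts) * entropy, entropy

def build_lsi(weighted, k):
    """Truncated SVD: keep the k largest singular triplets."""
    U, s, Vt = np.linalg.svd(weighted, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :].T  # rows of the last item are documents

def fold_in(query_counts, entropy, U_k, s_k):
    """Project a query into the k-dimensional space: q^T U_k S_k^{-1}."""
    q = np.log1p(np.asarray(query_counts, dtype=float)) * entropy.ravel()
    return (q @ U_k) / s_k

def cosine_scores(q_vec, doc_vecs):
    """Cosine similarity between the query and every document vector."""
    norms = np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec)
    return (doc_vecs @ q_vec) / np.maximum(norms, 1e-12)

def feedback(q_vec, doc_vecs, relevant, alpha=0.5, beta=0.5):
    """Rocchio-style update: mix the query with the mean relevant document."""
    return alpha * q_vec + beta * doc_vecs[relevant].mean(axis=0)

# Hypothetical toy collection (rows are terms, columns are documents).
terms = ["car", "automobile", "engine", "flower", "petal"]
counts = np.array([
    [1, 0, 0, 0],   # car
    [0, 1, 0, 0],   # automobile
    [1, 1, 0, 0],   # engine
    [0, 0, 1, 1],   # flower
    [0, 0, 1, 2],   # petal
])

weighted, entropy = log_entropy_weight(counts)
U_k, s_k, doc_vecs = build_lsi(weighted, k=2)

# Query containing only the term "car".
query = np.array([1, 0, 0, 0, 0])
q_vec = fold_in(query, entropy, U_k, s_k)
sims = cosine_scores(q_vec, doc_vecs)
ranking = np.argsort(-sims)

# Feedback that marks document 2 as relevant pulls the query toward it.
q_fb = feedback(q_vec, doc_vecs, relevant=[2], alpha=0.2, beta=0.8)
sims_fb = cosine_scores(q_fb, doc_vecs)

print("similarities:", np.round(sims, 3))
print("ranking:", ranking)
print("after feedback:", np.round(sims_fb, 3))
```

In this toy setup, document 1 contains "automobile" and "engine" but never "car", yet in the reduced space it scores about as high as document 0 for the query "car", which is the synonym effect the abstract refers to. On real collections, the number of dimensions would be tuned empirically, as the abstract reports for ERIC.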


