Latent Semantic Indexing (LSI) is a retrieval technique that employs Singular Value Decomposition (SVD) to map each document vector into a lower-dimensional space, so that documents are matched by concept rather than by exact term. LSI has been shown to outperform traditional lexical search methods and to mitigate the problems of synonymy and polysemy. Our aims were to construct an LSI model to facilitate retrieval and to propose potential uses of LSI in education. We used five test collections, two Chinese and three English, to verify our LSI model. The standard test collection, MED, was used to verify the correctness of our system; the ERIC collection and a collection of English educational abstracts were used to test the feasibility of LSI on educational materials; and the two Chinese test collections were used to examine the usability of LSI on Chinese documents. Our major concerns in the tests were term weighting, stemming, the number of reduced dimensions, and relevance feedback. Results showed that the LSI system model worked well not only for English documents but also for character-based Chinese documents, and that the LSI method could effectively group semantically related documents. The better-performing weighting types were log idf, log entropy, log gfidf, tf idf, and tf gfidf. Results also indicated a significant improvement in retrieval after stemming. Relevance feedback worked well across different weighting ratios. The best number of dimensions for the ERIC documents was around 50 or 60. We conclude that LSI is a suitable system model for retrieving relevant documents.

Keywords: latent semantic indexing (LSI), information retrieval (IR), singular value decomposition (SVD), relevance feedback.
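
The mapping described in the abstract, from document vectors into a lower-dimensional space via SVD, can be illustrated with a minimal Python sketch. This is not the system evaluated in the study: the function names, the toy corpus, and the choice of k = 2 are all assumptions made for illustration, and raw term counts stand in for the weighting schemes (log entropy, tf idf, and so on) that the study compares. NumPy is assumed to be available.

```python
import numpy as np

def build_term_document_matrix(docs):
    """Return a (terms x documents) raw-count matrix and a term -> row index map."""
    vocab = sorted({term for doc in docs for term in doc.split()})
    index = {term: i for i, term in enumerate(vocab)}
    matrix = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(docs):
        for term in doc.split():
            matrix[index[term], j] += 1.0
    return matrix, index

def lsi_fit(matrix, k):
    """Truncated SVD: keep the k largest singular triplets of the matrix."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    # Rows of the third return value are the documents in the k-dim space.
    return u[:, :k], s[:k], vt[:k, :].T

def fold_in_query(query, index, u_k, s_k):
    """Project a query into the same space: q_hat = q^T U_k S_k^{-1}."""
    q = np.zeros(u_k.shape[0])
    for term in query.split():
        if term in index:  # terms unseen in the corpus are ignored
            q[index[term]] += 1.0
    return (q @ u_k) / s_k

def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is zero."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def rank_documents(query, docs, k):
    """Score every document against the query in the reduced space."""
    matrix, index = build_term_document_matrix(docs)
    u_k, s_k, doc_vectors = lsi_fit(matrix, k)
    q_hat = fold_in_query(query, index, u_k, s_k)
    return [cosine(q_hat, d) for d in doc_vectors]

# Hypothetical toy corpus (not from the study's test collections).
toy_docs = [
    "car engine repair",
    "automobile engine repair",
    "car automobile dealer",
    "banana fruit salad",
    "fruit salad recipe",
]
scores = rank_documents("automobile", toy_docs, k=2)
for doc, score in zip(toy_docs, scores):
    print(f"{score:+.3f}  {doc}")
```

In this toy corpus the first document never contains the word "automobile", yet it shares a latent dimension with the documents that do, so it scores far above the two fruit documents. That is the synonym effect the abstract refers to, shown here only on made-up data.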