Improving the Effectiveness and Flexibility of a Sequence-Based Text Retrieval System

The purpose of a text retrieval system is to locate documents in a large textual document collection that meet a user’s needs. The SIR system is one such system, based on the sequence model. Because it was designed and implemented as a sequential rather than a parallel application, it becomes less efficient as the data collection grows. Another drawback of the SIR system is that the index must be rebuilt entirely whenever the data collection is modified. In addition, compared with other models, query evaluation in the sequence model is time-consuming. In this thesis, we seek improvements that address these problems. To facilitate parallel query processing, we implement three index partitioning schemes in the system and evaluate their load-balancing characteristics. To improve the scalability of index building, we design and implement a mechanism that allows the SIR system to support incremental index updates. We also make other improvements that make the system more flexible, such as support for queries with homophones and for more token types.

Keywords

Incremental Update; Index Partitioning Schemes; Information Retrieval; Parallel Inverted Index; Parallel Processing; Text Retrieval

