基於Elias-Fano編碼法的反向索引：版本文件的分析與研究

反向索引是一種重要而且也是最常使用在文件檢索上的一種技術。然而，現今文件的容量越來越大，對於要做出不同種有用的關鍵字查詢功能，例如，關鍵字出現在哪些文件裡，文件對於這個關鍵字來說的重要程度為何，關鍵字在文件中出現的位置等等，傳統的反向索引將會消耗很大的空間成本。舉例來說，在實務上，只是針對大約300MB的文件資料，我們要用傳統的反向索引，去做關鍵字在文件中出現位置的查詢，在不犧牲查詢時間的情況下，就要消耗大約1.5GB的硬碟空間去儲存它的索引。為了解決這個實務上的問題，我們研究了許多資料壓縮或是索引壓縮的方法。在這篇論文，我們實作出一個在實際空間消耗表現較佳的索引架構，這個架構是基於在現有方法 - Partitioned Elias-Fano 編碼法之上。我們用版本文件資料集 - 維基百科，在實務上去評量我們架構的效能，發現它只消耗了大約150MB的硬碟空間去儲存它的索引，就可以達成大約300MB的資料做出不同功能的查詢。針對這個架構上的查詢方法，我們也提出了兩種不同策略，並且做實驗去深入討論這兩種策略的優劣性與它們分別適合用的情況。

關鍵字

反向索引；資料壓縮；演算法；版本文件； Query

並列摘要

Inverted Index is an important and well-known method for document retrieval. However, as the volume of documents is growing very quickly nowadays, we have to pay much cost for the basic inverted index to achieve common useful functions like document listing, time-travel, top $k$, and the occurrence reporting of phrase queries. For example, for supporting the occurrence reporting query, we need around 1.5 GB index space to store the index in the disk just for around 300 MB data. To solve this space problem, many index compression techniques have been studied. In this thesis, we propose a practical index framework on good space performance with inverted index based on the recently proposed partitioned Elias-Fano encoding, and conduct experiments on real data sets. Our index can support the query functions correctly with only around 150 MB index space for 300 MB input real data. We develop two different methods to query our index, and from the results of our experiments, we discuss what kind of data is more suited for each of these two methods.

並列關鍵字

Inverted-Index ； Versioned-Document ； Query ； Data-Compression ； Algorithm

參考文獻

[5] Dirk Bahle, Hugh E Williams, and Justin Zobel. Compaction techniques for nextword indexes. In spire, pages 33–45, 2001.

[8] David C Blair and Melvin E Maron. An evaluation of retrieval effectiveness for a full-text document-retrieval system. Communications of the ACM, 28(3):289–299, 1985.

[9] Francisco Claude, Antonio Farina, Miguel A Mart ́ınez-Prieto, and Gonzalo Navarro. Compressed q-gram indexing for highly repetitive biological sequences. In BioInformat- ics and BioEngineering (BIBE), 2010 IEEE International Conference on, pages 86–91. IEEE, 2010.

[12] Peter Elias. Efficient storage and retrieval by content and address of static files. Journal of the ACM (JACM), 21(2):246–260, 1974.

[15] Wing-Kai Hon, Manish Patil, Rahul Shah, and Shih-Bin Wu. Efficient index for re- trieving top-k most frequent documents. Journal of Discrete Algorithms, 8(4):402–417, 2010.

國際替代計量

基於Elias-Fano編碼法的反向索引：版本文件的分析與研究

全文下載

主題瀏覽