A Study on Applying Short-String Indexing to Full-Text Retrieval of Chinese and English Data

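Traditional databases have difficulty extracting information from documents or other unstructured data, so full-text retrieval systems play a very important role in modern office automation. Because Chinese and English differ greatly in structure, semantics, and grammar, automatic term indexing is very difficult to apply to full-text retrieval of Chinese or mixed Chinese-English data. This paper studies how to generate an index from 2-grams in order to build a Chinese-English full-text retrieval system. The 2-grams of Chinese-English text fall into three classes (Chinese, English, and numeric), each with a different length; we study how to fix their length so that the primary key of the index file is easier to construct. Finally, we discuss how to compute the similarity between a query term and the document data. The study applies 2-gram indexing to build a full-text retrieval system for book titles in the library information system of Da-Yeh Institute of Technology (大葉工學院).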
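The indexing idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the character classes in `_RUNS`, the choice of character-level 2-grams for English and digits, and the 4-byte UTF-16-BE key layout in `fixed_key` are all assumptions made here for illustration, since the abstract does not say how the length is fixed.

```python
import re
from collections import defaultdict

# Assumed classification: runs of CJK ideographs, ASCII letters, or digits.
_RUNS = re.compile(r"[\u4e00-\u9fff]+|[A-Za-z]+|[0-9]+")

def bigrams(text):
    """Return overlapping 2-grams taken within each run of one character class."""
    grams = []
    for run in _RUNS.findall(text.lower()):
        if len(run) == 1:
            grams.append(run)  # a lone character is kept as a short gram
        grams.extend(run[i:i + 2] for i in range(len(run) - 1))
    return grams

def fixed_key(gram):
    """Encode a gram as a 4-byte key (hypothetical layout).

    Every character matched by _RUNS takes 2 bytes in UTF-16-BE, so a
    2-gram is 4 bytes; a single-character gram is padded with NUL bytes.
    """
    return gram.encode("utf-16-be").ljust(4, b"\x00")

def build_index(titles):
    """Map each fixed-length key to the set of title ids that contain it."""
    index = defaultdict(set)
    for doc_id, title in enumerate(titles):
        for gram in bigrams(title):
            index[fixed_key(gram)].add(doc_id)
    return index
```

With this layout every index key has the same width regardless of whether the gram is Chinese, English, or numeric, which is the property the abstract asks for; a real system might instead hash the grams or use a byte-oriented encoding.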

Keywords

full-text retrieval; string matching; term indexing; term vector; 2-gram; string vector; similarity value

Parallel Abstract

It is difficult to retrieve text-based or bibliographic information with most traditional databases. Several full-text retrieval techniques have been proposed, but all of them deal with English text. We propose an N-gram indexing system for retrieving Chinese text. In text retrieval operations, we do not insist on a complete match between query and document terms before a particular document is retrieved. Instead, the retrieval of an item may depend on a sufficient degree of coincidence between the sets of identifiers attached to queries and documents, as produced by some approximate or partial matching method. Based on the 2-gram indexing system, we propose methods for calculating this similarity. Finally, we apply the technique to our library information system so that it can retrieve text-based information such as book titles and journal abstracts.
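As a rough illustration of partial matching on 2-grams, the sketch below scores a query against each document by the overlap of their 2-gram sets. The abstract does not state which similarity formula the authors use; the Dice coefficient, the helper names `char_bigrams`, `dice`, and `rank`, and the `threshold` value are assumptions chosen for this example.

```python
def char_bigrams(text):
    """Set of overlapping character 2-grams, ignoring whitespace and case."""
    s = "".join(text.split()).lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}

def dice(query, doc):
    """Dice coefficient between the 2-gram sets of query and doc (0.0 to 1.0)."""
    q, d = char_bigrams(query), char_bigrams(doc)
    if not q or not d:
        return 0.0
    return 2 * len(q & d) / (len(q) + len(d))

def rank(query, docs, threshold=0.3):
    """Return (score, doc) pairs at or above threshold, best match first."""
    scored = [(dice(query, d), d) for d in docs]
    return sorted((p for p in scored if p[0] >= threshold), reverse=True)
```

A document is retrieved when enough of its 2-grams coincide with those of the query, so a title can match even if it does not contain the query string in full.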

Parallel Keywords

full-text retrieval; term indexing; term vector; N-gram indexing; N-gram vector; hashing; computation of similarity
