Addressing the Merging Problem in Cross-Language Retrieval with Machine Learning Methods

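Multilingual retrieval allows a user to issue a query in one language and retrieve relevant documents in multiple languages. In general, multilingual retrieval first uses the query to find the relevant documents within each language's corpus, then applies a merging method to combine these per-language results into a final multilingual set of relevant documents. The main question addressed in this thesis is how to choose a merging method that achieves good performance. In this study, we use machine learning to build a cross-language merge model and use that model to adjust the merge score of each document. We first examine which factors may affect performance in the course of cross-language retrieval, considering three levels: the translation level, the document level, and more general features. At the translation level, many previous studies have shown that translation quality strongly affects cross-language retrieval performance. In addition, we assign every query term to a category, with the categories defined manually. We find that several categories have a comparatively large influence on retrieval, and that some degree of correlation exists between categories; for the influential categories, translation quality matters even more for cross-language retrieval. At the document level, we use the length of the document and of its title as indicators of the amount of information the document contains. From the features extracted at these levels, machine learning yields not only a cross-language merge model but also an indication of which features are most influential during learning. Experimental results show that the machine learning approach achieves better retrieval performance than traditional merging methods, and that translation quality, including that of organization names, event names, abstract nouns, and technical terms, has the greatest influence on cross-language retrieval.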

Keywords

Cross-language retrieval; result merging; machine learning

Parallel Abstract

Multilingual information retrieval (MLIR) aims to enable users to enter a query in one language and access relevant documents in various languages. Usually, MLIR is implemented by first running retrieval against each language collection to obtain a bilingual list of retrieved documents per collection. How to merge these bilingual lists is the main issue in this work. We use a machine learning approach, FRank, to build a merge model, and we merge the multiple bilingual lists using both the merge model score and the retrieval score. First, we identify factors that may influence the MLIR process at three levels: the general level, the translation level, and the document level. At the translation level, previous studies have shown that translation quality is crucial for cross-language information retrieval. In addition, we classify each query term into one of a set of manually pre-defined categories. In our experiments, some categories play a more important role in a query during retrieval; moreover, there are relationships between categories. The translation quality of those influential categories is crucial for MLIR. At the document level, we extract the document length and the document title length as measures of how informative a document is. Across these levels we extract 62 features in total; using these features, we not only train a merge model but also identify which features are effective for the MLIR merging process. In our experiments, our approach outperforms the traditional merging strategies, including raw-score merging, round-robin merging, normalized-by-top-K merging, logistic regression, and the 2-step re-indexing merging method. Furthermore, from the features that FRank selects as weak learners, we find that the translation quality of certain query term categories, the number of translatable query terms, and the degree of ambiguity in translation are effective features for MLIR merging.
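To make the baseline strategies named above concrete, the following is a minimal sketch of three of them: raw-score, round-robin, and normalized-by-top-K merging. It is not the thesis's implementation. The function names, the `(doc_id, score)` list representation, and the sample lists are assumptions introduced for illustration, and normalized-by-top-K is assumed here to divide each score by the mean score of the top K documents in its own list.

```python
from itertools import zip_longest

# Each result list is assumed to be ranked best-first as (doc_id, score) pairs.

def raw_score_merge(lists):
    """Pool all lists and sort by the original retrieval score."""
    merged = [pair for lst in lists for pair in lst]
    return sorted(merged, key=lambda p: p[1], reverse=True)

def round_robin_merge(lists):
    """Interleave the lists rank by rank, ignoring the scores."""
    merged = []
    for tier in zip_longest(*lists):
        merged.extend(pair for pair in tier if pair is not None)
    return merged

def top_k_normalized_merge(lists, k=1):
    """Divide each score by the mean of its list's top-k scores, then sort.

    Assumption: this is one common reading of normalized-by-top-K merging.
    """
    merged = []
    for lst in lists:
        top = lst[:k]
        if not top:
            continue
        norm = sum(score for _, score in top) / len(top)
        if norm == 0:
            continue  # skip lists whose top scores cannot be normalized
        merged.extend((doc, score / norm) for doc, score in lst)
    return sorted(merged, key=lambda p: p[1], reverse=True)

# Hypothetical per-language result lists on very different score scales.
zh = [("zh1", 10.0), ("zh2", 4.0)]
en = [("en1", 0.9), ("en2", 0.8), ("en3", 0.1)]

raw_ids = [d for d, _ in raw_score_merge([zh, en])]
rr_ids = [d for d, _ in round_robin_merge([zh, en])]
norm_ids = [d for d, _ in top_k_normalized_merge([zh, en], k=1)]
```

With these sample lists, raw-score merging places every `zh` document ahead of every `en` document simply because the two collections score on different scales. This incomparability of scores across collections is what normalization, and a learned merge model, try to correct.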

Parallel Keywords

Multilingual Information Retrieval; Data Fusion; Machine Learning

