A Machine Learning Study of Ranking Problems in Information Retrieval

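This thesis studies how machine learning techniques can improve the effectiveness of information retrieval. In the first part, we propose a new machine learning algorithm for the ranking problem in information retrieval, which we call Fidelity Rank (FRank). The method introduces a loss function called the Fidelity Loss Function, which has several properties that make ranking results more accurate, such as a slowly increasing loss and the ability of every document pair to reach its minimum loss. Experimental results confirm that FRank outperforms other methods on both conventional information retrieval problems and Web search. In the second part, we apply the FRank learning algorithm from the first part to the well-known merging problem in multilingual information retrieval. To the best of our knowledge, this is the first use of a machine learning algorithm for the merging problem in multilingual information retrieval. We propose a number of features that may affect merging performance; the experiments show that the learned merge model greatly improves the merged results and also helps identify the features that truly influence merging performance. Finally, in the third part, we try to extend existing machine learning algorithms with further desirable properties, such as diversity. For ambiguous queries, typical information retrieval users tend to prefer results that cover different topics. We therefore incorporate diversity into the learning process and propose a method called Two-step Ranking SVM. This method combines Support Vector Classification and Support Vector Regression to increase the diversity of the retrieved results while preserving their quality. The experimental results show that the proposed approach does maintain ranking quality while broadening the range of topics covered by the retrieved results.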

Keywords

Information Retrieval; Machine Learning

Parallel Abstract

Learning to rank is becoming important in many fields, especially in information retrieval. In this thesis, we first propose a novel learning-based ranking algorithm, Fidelity Rank (FRank), to learn an effective ranking function. FRank not only inherits the useful properties of the probabilistic ranking framework, but also has new properties that are helpful for ranking, including a slow-growing loss and the ability to reach zero loss for each document pair. Experimental results demonstrate that FRank outperforms other ranking algorithms on conventional IR problems as well as Web search. We then apply the FRank algorithm to improve merging quality in multilingual information retrieval (MLIR). To the best of our knowledge, this is the first attempt to use a learning-based ranking algorithm to construct a merge model for MLIR merging. The experimental results show that the merge model constructed by FRank can significantly improve merging quality. Beyond its effectiveness, the merge model also lets us identify key factors that influence the merging process; this information may give more insight into MLIR merging. Finally, we investigate how to extend learning-based ranking techniques with another desirable property: diversity. For ambiguous queries, if there is no further information about the user's intention, an IR system should provide a ranked list of documents that covers all possible interpretations. For this diversification problem, we propose a two-step Ranking SVM technique, in which support vector classification and support vector regression are used in turn to enhance diversity while maintaining ranking quality. According to the experimental results, the two-step learning technique not only improves ranking quality, but also broadens the coverage of the retrieved results.
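The two loss properties mentioned in the abstract (slow growth and a reachable zero for every document pair) can be illustrated with a small sketch. The abstract does not state the formula, so the code below assumes the form in which the fidelity loss is usually written in the FRank literature, F = 1 - (sqrt(P* P) + sqrt((1 - P*)(1 - P))), where P is a logistic function of the score difference between two documents and P* is the target pair probability. The function names are hypothetical, and the cross-entropy pair loss is included only for comparison.

```python
import math


def pair_probability(score_diff: float) -> float:
    """Logistic map from a score difference f(d_i) - f(d_j) to a probability.

    Assumed modelling choice; fine for moderate inputs, but math.exp
    overflows for extremely negative score differences.
    """
    return 1.0 / (1.0 + math.exp(-score_diff))


def fidelity_loss(score_diff: float, target_prob: float) -> float:
    """Assumed fidelity loss for one document pair; always within [0, 1]."""
    p = pair_probability(score_diff)
    overlap = math.sqrt(target_prob * p) + math.sqrt((1.0 - target_prob) * (1.0 - p))
    return 1.0 - overlap


def cross_entropy_loss(score_diff: float, target_prob: float) -> float:
    """Cross-entropy pair loss, shown only as a point of comparison."""
    return -target_prob * score_diff + math.log1p(math.exp(score_diff))


if __name__ == "__main__":
    # A tied pair (target 0.5) with equal scores: compare the two minima.
    print(fidelity_loss(0.0, 0.5), cross_entropy_loss(0.0, 0.5))
    # A badly mis-ordered pair: compare how large each loss becomes.
    print(fidelity_loss(-20.0, 1.0), cross_entropy_loss(-20.0, 1.0))
```

Under these assumptions, the fidelity loss stays bounded by 1 for a badly mis-ordered pair while the cross-entropy loss grows roughly linearly with the score difference, and only the fidelity loss reaches zero for a tied pair.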

Parallel Keywords

Learning to Rank; Information Retrieval; Machine Learning
