監督式學習之搜尋結果合併問題中訓練資料篩選方法

搜尋結果合併問題，是在於合併數個資訊檢索系統或搜尋引擎的結果，以達到更準確的相關排序。提升搜尋引擎的品質有很多方法，合併數種不同的搜尋引擎是其中一個研究方向，如何能截長補短，是各家研究的重點。有效地合併數個資料檢索系統的結果，在眾多研究當中已顯示對於增加排序的準確性有顯著功效。它已被證明，這可以提高檢索效率和精度超過該原來的數個資訊檢索系統。本文的目標是，選擇一個訓練數據的子集合，其中以此子集合為訓練數據，產生的融合模型，以此融合模型融合所有訓練數據，會得到最多的改善。我們已probF use算法[10]為例。提出兩種方法:貪婪演算法和皮匠演算法，貪婪的方法有兩種選擇:獨立選得訓練數據與考慮訓練數據之間的關係。皮匠演算法是一種數據融合問題的框架，他會選出數個訓練數據的子集，以這些子集產生融合模型，每一個融合模型，都針對訓練數據的部分做做最佳化，最後線性將這些融合模型合併起來成最終融合模型。經由訓練數據的篩選，我們提高了合併之後的搜尋結果，大量的實驗包含TREC – 3，4，5及NTCIR – 3，4顯示，經過訓練數據篩選的融合演算法的融合結果，顯著地較相同的融合演算法卻沒有經過數據篩選更好，因為選擇了良好的訓練數據產生適當的融合模型。我們提出兩個有用的訓練數據篩選方法，並且定義成正式的框架，任何融合演算法都可廣泛應用。

關鍵字

資訊檢索

並列摘要

Search-result merging is to merge several results from different search en- gines to get better performance. Several early studies have shown combining different information retrieval (IR) models can greatly improve the retrieval effectiveness and accuracy over any individual model can get. In machine learning and statistics applications, researchers often apply data fusion algorithm to ensemble the results from different models to combine their different abilities as well. One of the state of arts data fusion algorithms in recent years is probF use [10]. It considers the past information of each model that is then used to predict the confidence of the model. Although it has shown promising performance in many studies, it doesn’t take the diversity of query into consideration together but the model only. In our analysis for the experiments performed on TREC-3,4,5 and NTCIR-3,4, we found that the performance of one model varies from one query to another. Inspired by the discovery, we assume that not all training examples are effective. We proposed two novel approaches, Greedy approach and Boosting approach, to select training data to optimize the improvement from data fusion algorithm. Greedy approach has two selection policies, dependent and independent, and both of them greedily select training examples. Dependent selection takes the concurrence of training examples into account; independent selection chooses every training example individually one after one. Boosting approach is a framework we design for data fusion problem, which emphasizes different training examples and generates a linear ensemble base on the weights of the different training data. Extensive experiments were performed on several data sets, including TREC-3,4,5 and NTCIR-3,4, the outcome was very promising. With either of our data selection methods, probF use algorithm clearly performs better than before. Our work can not only improved the effectiveness of existing fusion algorithm, but also reduce the training time consuming. One can apply the boosting framework we proposed to any data fusion algorithm on the fly as well.

並列關鍵字

information retrieval

參考文獻

[4] N. K. G. E. M. Voorhees and B. Johnson-Laird. The collection fusion problem. In The Third Text REtrieval Conference (TREC-3), pages 95–104, 1994.

[5] J. A. S. Edward A. Fox. Combination of multiple searches. In The 2nd Text REtrieval Conference (TREC-2), page 243–252. National Institute of Standards and Technology Special Publication 500-215, 1994.

[8] J. H. Lee. Analyses of multiple evidence combination. SIGIR Forum, 31(SI):267– 276.

[11] F. Mart’ınez-Santiago, L. A. Ure na-L’opez, and M. Mart’ın-Valdivia. A merging strategy proposal: The 2-step retrieval status value method. Inf. Retr., 9(1):71–93, 2006.

[13] Z. Nie, Y. Ma, S. Shi, J.-R. Wen, and W.-Y. Ma. Web object retrieval. In WWW ’07: Proceedings of the 16th international conference on World Wide Web, pages 81–90, New York, NY, USA, 2007. ACM.

國際替代計量

監督式學習之搜尋結果合併問題中訓練資料篩選方法

全文下載

主題瀏覽