二分排行(Bipartite Ranking) 是一種在機器學習中基礎的排行 (Ranking) 問題,其目的在從輸入資料中學習如何將相關的樣品正確的排在非相關樣品之前。對式 (Pair-wise) 解法是其中一種解決二分排行的主要方式,它將二分排行問題轉變成在樣品「對」的二元分類 (Binary Classification) 問題,以學習在一對樣品中何者該排在另一物之前的模型來解決二分排行問題。然而,這類方法通常不適用於大型的輸入資料,因為「對」的數目往往是輸入資料大小的平方,過量的「對」會造成計算資源不足而無法解決問題。另一方面,點式(Point-wise) 也是一種解決二分排行的常見方式,它以樣品「點」的二元分類問題來近似二分排行問題,在輸入資料中學習樣品「點」是否相關。因為「點」的數量往往遠小於「對」的數量,使得這類方法可以用於大型資料上,但是可能會得到較低的正確率。綜合以上討論,我們了解要正確且有效率的解決大型二分排行問題是一件困難的工作。因此,在這篇論文中,我們提出了結合二分排行與二元分類的架構 (Combined Ranking and Classification) 以正確得解決二分排行問題。這個架構利用了將「點」視為一種「虛對」的想法,融合了對式與點式的二分排行方法。除此之外,為了有效率的解決大型二分排行問題,我們在 CRC 的架構下設計了主動採樣(Active Sampling) 演算法。此方法設計的想法來自於機器學習中的主動學習 (Active Learning) 問題,這個採樣法讓我們在大量的樣品「對」中只利用少量的樣品「對」有效率的達到不錯的正確率。最後,在14 個現實大型資料集中,實驗結果顯示我們所提出的主動採樣對和點演算法搭配上線性支持向量機 (SVM) 可以有效率的解決大型二分排行問題,且通常達到比目前許多先進的二分排行演算法還要高的準確性。
Bipartite ranking is a fundamental ranking problem that learns to order relevant instances ahead of irrelevant ones. One major approach for bipartite ranking, called the pair-wise approach, tackles an equivalent binary classification problem of whether one instance out of a pair of instances should be ranked higher than the other. Nevertheless, the number of instance pairs constructed from the input data could be quadratic to the size of the input data, which makes pair-wise ranking generally infeasible on large-scale data sets. Another major approach for bipartite ranking, called the point-wise approach, directly solves a binary classification problem between relevant and irrelevant instance points. This approach is feasible for large-scale data sets, but the resulting ranking performance can be inferior. That is, it is difficult to conduct bipartite ranking accurately and efficiently at the same time. In this thesis, we propose a general Combined Ranking and Classification (CRC) framework to accurately conduct bipartite ranking. The framework unifies point-wise and pair-wise approaches and is simply based on the idea of treating each instance point as a pseudo-pair. Moreover, we develop a novel scheme within the framework to conduct bipartite ranking efficiently. The scheme, called Active Sampling, is inspired from the rich field of active learning and can reach a competitive ranking performance while focusing only on a small subset of the many pairs during training. Experiments on 14 real-word large-scale data sets demonstrate that the proposed algorithm of Active Sampling within CRC, when coupled with a linear Support Vector Machine, usually outperforms state-of-the-art point-wise and pair-wise ranking approaches in terms of both accuracy and efficiency.