This thesis proposes two methods for classification under imbalanced data. First, we propose a self-training method that uses a smaller parameter space than previous self-training approaches while improving performance. On the labeled training data, we tune a confidence threshold for each classifier individually; these thresholds identify high-confidence unlabeled examples, and the union of the selected examples is assigned pseudo labels and added to the labeled set for retraining. This reduces the time spent on parameter selection while matching the performance of more elaborate parameter searches. Second, we propose an efficient training method for imbalanced data: starting from fast down-sampling, a bootstrap-like procedure pushes the model toward the one obtained with up-sampling, and because each round uses less data, training is faster. We evaluate both methods on the extremely imbalanced KDD Cup 2008 data. The results show that our parameter selection for self-training performs slightly better than previous methods, and that the proposed efficient method is 1.3 times faster than using up-sampling directly while achieving nearly the same AUC.
We propose two methods to address classification problems on imbalanced data. First, we propose a self-training method with a smaller parameter space and better performance than previous self-training approaches. We tune a confidence threshold for each classifier on the labeled data, use the thresholds to identify high-confidence unlabeled examples, assign them pseudo labels, and add them to the training set for retraining. This scheme reduces the time spent selecting parameters while improving performance. Second, we propose an efficient training method for imbalanced data: we start from down-sampling and apply a bootstrap-like procedure so that the model approximates the model trained with up-sampling; using less training data per round reduces training time. We run experiments on the KDD Cup 2008 data. The results show that our threshold-based self-training performs better than previous methods, and that the approximated model matches the performance of up-sampling at only 0.75 times its training time.
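The first method — per-classifier confidence thresholds tuned on labeled data, with the union of confident selections pseudo-labeled — can be sketched roughly as follows. This is a minimal illustration, not the thesis's implementation: the toy classifiers, the candidate threshold grid, and the precision-based tuning rule are all assumptions made for the example.

```python
# Hedged sketch of threshold-based self-training: each classifier gets
# its own confidence threshold tuned on labeled data, and the
# high-confidence unlabeled picks are unioned for pseudo-labeling.
# All names and the tuning criterion here are illustrative assumptions.

def tune_threshold(scores, labels, candidates=(0.6, 0.7, 0.8, 0.9)):
    """Pick the candidate threshold whose selected set is most often
    correct (highest precision) on the labeled data."""
    best_t, best_prec = candidates[0], -1.0
    for t in candidates:
        picked = [(s, y) for s, y in zip(scores, labels) if s >= t]
        if not picked:
            continue
        prec = sum(1 for _, y in picked if y == 1) / len(picked)
        if prec > best_prec:
            best_t, best_prec = t, prec
    return best_t

def select_confident(classifiers, labeled, unlabeled):
    """Union the unlabeled indices each classifier is confident about;
    these would receive pseudo labels and join the training set."""
    xs, ys = zip(*labeled)
    chosen = set()
    for clf in classifiers:
        t = tune_threshold([clf(x) for x in xs], ys)
        chosen |= {i for i, x in enumerate(unlabeled) if clf(x) >= t}
    return sorted(chosen)

# Toy scoring functions standing in for trained classifiers.
clf_a = lambda x: min(1.0, x / 10.0)
clf_b = lambda x: min(1.0, x / 8.0)
labeled = [(9, 1), (8, 1), (2, 0), (1, 0)]   # (feature, label) pairs
unlabeled = [9.5, 5.0, 0.5]
print(select_confident([clf_a, clf_b], labeled, unlabeled))  # → [0, 1]
```

Tuning one scalar threshold per classifier on already-labeled data is what keeps the parameter space small: there is no joint grid search over all classifiers at once.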
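The second idea — repeated down-sampling in a bootstrap-like loop, converging toward the model up-sampling would give — can be illustrated with a deliberately tiny stand-in model. This is a sketch under assumptions: the "model" is just a midpoint threshold between class means, and the round count and data are invented for the example.

```python
import random

def fit_midpoint(pos, neg):
    """Tiny stand-in 'model': decision threshold halfway between the
    class means. (A placeholder for whatever classifier is trained.)"""
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2.0

def downsample_ensemble(pos, neg, rounds=20, seed=0):
    """Bootstrap-like loop: each round down-samples the majority class
    to the minority size and fits the cheap model on the small balanced
    set; averaging across rounds approaches the balanced solution that
    up-sampling would reach, but each fit sees far less data."""
    rng = random.Random(seed)
    thresholds = []
    for _ in range(rounds):
        sub = rng.sample(neg, len(pos))        # down-sampled majority
        thresholds.append(fit_midpoint(pos, sub))
    return sum(thresholds) / len(thresholds)

pos = [8.0, 9.0, 10.0]                         # minority class
neg = [float(i) for i in range(30)]            # majority class
balanced = fit_midpoint(pos, neg)              # up-sampling target here
approx = downsample_ensemble(pos, neg)         # close to `balanced`
```

The speedup in the thesis comes from the same trade: each round trains on a small down-sampled set instead of the large up-sampled one, and the iterations close the gap to the up-sampled model.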