This thesis proposes two methods for classification under imbalanced data. First, we propose a self-training method that uses a smaller parameter space than previous self-training approaches while improving performance. On the labeled training data, we tune a confidence threshold for each classifier individually; these thresholds identify high-confidence unlabeled examples, and the union of the selected examples is assigned pseudo labels and added to the labeled set for retraining. This reduces the time spent on parameter selection while matching the performance of more elaborate parameter searches. Second, we propose an efficient training method for imbalanced data: starting from fast down-sampling, a bootstrap-like procedure pushes the model toward the one obtained with up-sampling, and because each round uses less data, training is faster. We evaluate both methods on the extremely imbalanced KDD Cup 2008 data. The results show that our parameter selection for self-training performs slightly better than previous methods, and that the proposed efficient method is 1.3 times faster than using up-sampling directly while achieving nearly the same AUC.
We propose two methods to address classification problems on imbalanced data. First, we propose a self-training method with a smaller parameter space and better performance than previous self-training approaches. We tune a confidence threshold for each classifier on the labeled data, use the thresholds to identify high-confidence unlabeled examples, assign them pseudo labels, and add them to the training set for retraining. This scheme reduces the time spent selecting parameters while improving performance. Second, we propose an efficient training method for imbalanced data: we start from down-sampling and apply a bootstrap-like procedure so that the model approximates the model trained with up-sampling; using less training data per round reduces training time. We run experiments on the KDD Cup 2008 data. The results show that our threshold-based self-training performs better than previous methods, and that the approximated model matches the performance of up-sampling at only 0.75 times its training time.
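The first method — per-classifier confidence thresholds tuned on labeled data, with the union of confident selections pseudo-labeled — can be sketched roughly as follows. This is a minimal illustration, not the thesis's implementation: the toy classifiers, the candidate threshold grid, and the precision-based tuning rule are all assumptions made for the example.

```python
# Hedged sketch of threshold-based self-training: each classifier gets
# its own confidence threshold tuned on labeled data, and the
# high-confidence unlabeled picks are unioned for pseudo-labeling.
# All names and the tuning criterion here are illustrative assumptions.

def tune_threshold(scores, labels, candidates=(0.6, 0.7, 0.8, 0.9)):
    """Pick the candidate threshold whose selected set is most often
    correct (highest precision) on the labeled data."""
    best_t, best_prec = candidates[0], -1.0
    for t in candidates:
        picked = [(s, y) for s, y in zip(scores, labels) if s >= t]
        if not picked:
            continue
        prec = sum(1 for _, y in picked if y == 1) / len(picked)
        if prec > best_prec:
            best_t, best_prec = t, prec
    return best_t

def select_confident(classifiers, labeled, unlabeled):
    """Union the unlabeled indices each classifier is confident about;
    these would receive pseudo labels and join the training set."""
    xs, ys = zip(*labeled)
    chosen = set()
    for clf in classifiers:
        t = tune_threshold([clf(x) for x in xs], ys)
        chosen |= {i for i, x in enumerate(unlabeled) if clf(x) >= t}
    return sorted(chosen)

# Toy scoring functions standing in for trained classifiers.
clf_a = lambda x: min(1.0, x / 10.0)
clf_b = lambda x: min(1.0, x / 8.0)
labeled = [(9, 1), (8, 1), (2, 0), (1, 0)]   # (feature, label) pairs
unlabeled = [9.5, 5.0, 0.5]
print(select_confident([clf_a, clf_b], labeled, unlabeled))  # → [0, 1]
```

Tuning one scalar threshold per classifier on already-labeled data is what keeps the parameter space small: there is no joint grid search over all classifiers at once.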
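The second idea — repeated down-sampling in a bootstrap-like loop, converging toward the model up-sampling would give — can be illustrated with a deliberately tiny stand-in model. This is a sketch under assumptions: the "model" is just a midpoint threshold between class means, and the round count and data are invented for the example.

```python
import random

def fit_midpoint(pos, neg):
    """Tiny stand-in 'model': decision threshold halfway between the
    class means. (A placeholder for whatever classifier is trained.)"""
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2.0

def downsample_ensemble(pos, neg, rounds=20, seed=0):
    """Bootstrap-like loop: each round down-samples the majority class
    to the minority size and fits the cheap model on the small balanced
    set; averaging across rounds approaches the balanced solution that
    up-sampling would reach, but each fit sees far less data."""
    rng = random.Random(seed)
    thresholds = []
    for _ in range(rounds):
        sub = rng.sample(neg, len(pos))        # down-sampled majority
        thresholds.append(fit_midpoint(pos, sub))
    return sum(thresholds) / len(thresholds)

pos = [8.0, 9.0, 10.0]                         # minority class
neg = [float(i) for i in range(30)]            # majority class
balanced = fit_midpoint(pos, neg)              # up-sampling target here
approx = downsample_ensemble(pos, neg)         # close to `balanced`
```

The speedup in the thesis comes from the same trade: each round trains on a small down-sampled set instead of the large up-sampled one, and the iterations close the gap to the up-sampled model.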