Sample Selection Methods for Labeling Imbalanced Data and Learning Effective Classifiers

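In practice, building a classification system runs into two main difficulties: a complete dataset is hard to collect in a short period of time, and labeling every sample by hand is impractical. Active learning aims to label the useful samples and use them to obtain a highly accurate classifier, thereby reducing the manual effort required, while incremental learning efficiently updates a classifier with batches of data. This thesis therefore studies active learning and incremental learning in depth, focusing on labeling imbalanced datasets and on building classifiers efficiently. The main idea is to identify informative samples, label them, and use them to train the classifier. Our active learning methods are designed so that the decision to label a sample is not affected by an imbalanced data distribution, because they use only specific samples relevant to the target (rather than all observed data) to decide whether to label it. In addition, our incremental learning methods are designed to identify informative samples and use them to adjust the classifier efficiently; these samples may be ones the classifier misclassifies or ones whose prediction confidence is too low. In the incremental learning setting, too little batch data has accumulated in the initial stage, so classifier accuracy is low. To address this, we build a classifier tailored to each target sample, using only the data relevant to that target, to improve prediction accuracy. In our experiments, our methods and the baselines are run on synthetic datasets, UCI datasets, and a dataset collected at National Chung Cheng University. The experimental results and theoretical analysis show that our methods effectively handle the labeling and classifier-updating problems encountered in practice.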

Keywords

Incremental Learning; Imbalanced Data Classification; Imbalanced Datasets; Active Learning; Sample Selection

Parallel Abstract

When building a classification system, two practical issues need careful attention. First, it is difficult to collect a complete dataset in a short period of time. Second, it is expensive to label the collected data by hand. In this thesis, we study active learning, which aims to label informative samples, and incremental learning, which updates a classifier using sequentially collected datasets. We concentrate on designing approaches that label imbalanced datasets and learn classifiers efficiently. Our main idea is to select informative samples for labeling data or for adjusting classifiers. Our active learning approaches aim to query unlabeled samples without being affected by class imbalance: they use a specified subset of labeled samples to determine whether an unlabeled sample should be queried. Moreover, our incremental learning approaches select informative samples, which may be misclassified or classified with low confidence, to adjust the classifier efficiently. We also consider the case in which the sequentially collected dataset is still insufficient. In this case, we select the labeled samples relevant to the target sample and use them to generate a classifier specific to that target. In our experiments, the approaches are evaluated on synthetic datasets and on real-world datasets from the UCI repository and the campus of National Chung Cheng University. The experimental results and theoretical analysis show that our approaches handle the practical issues of labeling data and adjusting classifiers effectively.
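The selection rule for incremental learning mentioned above (keep samples that are misclassified or classified with low confidence) can be illustrated with a minimal sketch. This is not the thesis's algorithm: the function name `select_informative`, the threshold value 0.6, and the toy batch are assumptions made only for illustration.

```python
from typing import List, Sequence


def select_informative(
    probabilities: Sequence[Sequence[float]],
    labels: Sequence[int],
    confidence_threshold: float = 0.6,  # assumed value, not from the thesis
) -> List[int]:
    """Return indices of samples that are misclassified or low-confidence."""
    selected = []
    for index, (class_probs, true_label) in enumerate(zip(probabilities, labels)):
        # The predicted class is the one with the highest probability.
        predicted_label = max(range(len(class_probs)), key=class_probs.__getitem__)
        confidence = class_probs[predicted_label]
        if predicted_label != true_label or confidence < confidence_threshold:
            selected.append(index)
    return selected


# Hypothetical batch of class-probability outputs from the current classifier.
batch_probabilities = [
    [0.95, 0.05],  # confident and correct: skipped
    [0.55, 0.45],  # correct but below the threshold: kept
    [0.20, 0.80],  # predicted class 1, true label 0: kept
    [0.10, 0.90],  # confident and correct: skipped
]
batch_labels = [0, 0, 0, 1]

informative = select_informative(batch_probabilities, batch_labels)
print(informative)
```

Only the selected samples would then be passed to the classifier update step, which is what keeps each incremental adjustment cheap in this sketch.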

Parallel Keywords

Sample Selection; Incremental Learning; Imbalanced Data Classification; Imbalanced Datasets; Active Learning

