
標籤不均衡資料與學習有效分類器之樣本挑選法

Sample Selection on Labeling Imbalanced Datasets and Learning Efficient Classifiers

Advisor: 李新林

Abstract


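In practical applications, building a classification system runs into two main difficulties: first, a complete dataset is hard to collect in a short time; second, it is hard to label all of the data manually. Active learning aims to label useful data and use it to obtain a highly accurate classifier, thereby reducing the manual labeling effort, while incremental learning can efficiently update a classifier with batches of data. This thesis therefore studies active learning and incremental learning in depth, focusing on labeling imbalanced datasets and on building classifiers efficiently. The main idea is to identify informative data, then label it and use it to train the classifier. Our active learning method is designed so that the decision to label a sample is not affected by the imbalanced data distribution, because it uses only specific data relevant to the target (rather than all observed data) to decide whether to label. In addition, our incremental learning method is designed to identify informative data and use it to correct the classifier efficiently; such data may be samples that the classifier misclassifies or samples whose prediction confidence is too low. Under this incremental learning setting, classifier accuracy is low in the initial stage because too little batch data has accumulated. To address this, we build a tailored classifier for each target sample using only the data relevant to that target, which improves prediction accuracy. In our experiments, our methods and the compared methods are run on synthetic datasets, UCI datasets, and a National Chung Cheng University dataset. The experimental results and theoretical analysis show that our methods can effectively handle the labeling and classifier-updating problems encountered in practice.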

Parallel Abstract


When building a classification system, two practical issues need careful consideration. First, it is difficult to collect a complete dataset in a short period of time. Second, it is expensive to label the collected data by hand. In this thesis, we study open issues in active learning, which aims to label informative samples, and in incremental learning, which builds the classifier from sequentially collected datasets. We concentrate on designing approaches that label imbalanced datasets and learn efficient classifiers. Our main idea is to select informative samples for labeling or for adjusting classifiers. Our active learning approaches aim to query unlabeled samples without being affected by the class imbalance problem: they use a specified subset of labeled samples to determine whether an unlabeled sample should be queried. The objective of our incremental learning approaches is to select informative samples that efficiently adjust the classifier; these samples may be misclassified or classified with low confidence. We also consider the case in which the sequentially collected dataset is still insufficient. In this case, we select the labeled samples that are relevant to the target sample and use them to generate a classifier specific to that sample. In our experiments, the approaches are evaluated on synthetic datasets and on real-world datasets from the UCI repository and the campus of National Chung Cheng University. The experimental results and theoretical analysis show that our approaches effectively handle the practical issues of labeling data and adjusting classifiers.
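
The selection rule mentioned for incremental learning (keep samples that are misclassified or classified with low confidence) can be illustrated with a minimal sketch. This is not the thesis's algorithm: the function name `select_informative`, the confidence cut-off `threshold=0.6`, and the toy batch are all assumptions made for illustration.

```python
def select_informative(probs, labels, threshold=0.6):
    """Return indices of samples that are misclassified or low-confidence.

    probs:     per-sample lists of class probabilities from some classifier
    labels:    true class index for each sample
    threshold: hypothetical confidence cut-off (an assumption, not from the thesis)
    """
    selected = []
    for i, (p, y) in enumerate(zip(probs, labels)):
        # Predicted class is the one with the highest probability.
        predicted = max(range(len(p)), key=p.__getitem__)
        confidence = p[predicted]
        # Keep the sample if the prediction is wrong or not confident enough.
        if predicted != y or confidence < threshold:
            selected.append(i)
    return selected


# Toy batch (made-up numbers): two classes, four samples.
batch_probs = [[0.9, 0.1], [0.55, 0.45], [0.2, 0.8], [0.3, 0.7]]
batch_labels = [0, 0, 0, 1]
informative = select_informative(batch_probs, batch_labels)
```

In this toy batch, the second sample is selected because its confidence falls below the cut-off and the third because it is misclassified; only such samples would then be passed to whatever classifier update step is in use.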
