

Improving the Accuracy of Classification for Minority Class in an Unbalanced Dataset



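In e-commerce settings, classification techniques help in understanding and predicting customer behavior. For classification techniques, the distribution of the training data is often one of the main factors affecting classification accuracy. In many real-world datasets, however, the classes of the target attribute are unevenly distributed: most records belong to the majority class, and only a few belong to the minority class. In this situation, a classifier tends to predict the majority class for most of the records it is asked to classify, so its predictive ability for the minority class is very poor. Therefore, when the target attribute has an imbalanced class distribution, selecting a suitable (that is, balanced) training dataset is very important. This paper proposes a cluster-based under-sampling method that selects representative majority-class records for the training dataset, in order to improve classification accuracy for the minority class in imbalanced datasets. Experimental results show that the proposed method outperforms earlier approaches.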


In an electronic commerce environment, classification techniques can help us understand and predict customer behavior. The training data is one of the most important factors in classification accuracy. However, data in real-world applications often have an imbalanced class distribution; that is, most of the data belong to the majority class and only a few belong to the minority class. In this case, if all the data are used as training data, the classifier tends to predict that most incoming data belong to the majority class. Hence, it is important to select suitable training data for classification under an imbalanced class distribution. In this paper, we propose cluster-based under-sampling approaches that select representative data as training data, improving classification accuracy for the minority class under an imbalanced class distribution. The experimental results show that our cluster-based under-sampling approaches outperform the under-sampling techniques of previous studies.
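
The abstract does not spell out the selection procedure, so the following is only a minimal sketch of one generic form of cluster-based under-sampling, not the authors' algorithm. It assumes numeric feature vectors, clusters the majority class with a small hand-written k-means, and draws majority records from each cluster in proportion to cluster size until the total roughly matches the minority class. The names `kmeans` and `cluster_under_sample`, the choice `k=3`, and the synthetic data are all hypothetical.

```python
import random


def kmeans(points, k, iters=20, seed=0):
    """Tiny k-means on lists of floats; returns a cluster index per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center (squared Euclidean distance).
        assign = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Move each center to the mean of its members; keep it if the cluster is empty.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def cluster_under_sample(majority, minority, k=3, seed=0):
    """Keep all minority rows; draw majority rows per cluster, proportional to cluster size.

    Assumption: the target is roughly len(minority) majority rows in total.
    """
    rng = random.Random(seed)
    assign = kmeans(majority, k, seed=seed)
    target = len(minority)
    picked = []
    for c in range(k):
        members = [p for p, a in zip(majority, assign) if a == c]
        quota = round(target * len(members) / len(majority))
        picked.extend(rng.sample(members, min(quota, len(members))))
    return picked, minority


# Synthetic, purely illustrative data: 300 majority rows in three blobs, 30 minority rows.
data_rng = random.Random(1)
majority = [[data_rng.gauss(cx, 0.5), data_rng.gauss(cy, 0.5)]
            for cx, cy in [(0, 0), (5, 5), (0, 5)] for _ in range(100)]
minority = [[data_rng.gauss(2.5, 0.5), data_rng.gauss(2.5, 0.5)] for _ in range(30)]

picked, kept_minority = cluster_under_sample(majority, minority, k=3)
print(len(majority), "->", len(picked), "majority rows;", len(kept_minority), "minority rows")
```

Because each cluster contributes a share of the sample, the reduced majority set covers every region of the majority class, which plain random under-sampling does not guarantee. Per-cluster rounding means the number of selected majority rows can differ from the minority count by a few records.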




