  • Degree thesis


Developing Data Mining Models for Class Imbalance Problems

Advisor: 陳隆昇


In classification problems, class imbalance biases the training of a classifier and leads to low predictive accuracy on minority class examples. The problem arises from imbalanced data, in which one class has far more examples than the others, so the class distribution is skewed. Yet the minority class is usually the one of greater interest: rare diseases in medical diagnosis data, failures in inspection data, fraud in credit card screening data, and so on. When inducing knowledge from an imbalanced data set, traditional data mining algorithms achieve high classification accuracy on the majority class at the cost of an unacceptable error rate on the minority class, so they are not suitable for class-imbalanced data. To tackle the class imbalance problem, this study aims to (1) find a robust classifier among candidates including Decision Tree (DT), Logistic Regression (LR), Mahalanobis Distance (MD), and Support Vector Machines (SVM); and (2) propose two novel methods, MD-SVM (a two-phase learning scheme combining Mahalanobis distance and support vector machines) and SWAI (SOM Weights As Input). Experimental results indicate that the proposed MD-SVM and SWAI methods perform better at identifying minority class examples than traditional techniques such as random under-sampling of the majority class, misclassification cost adjusting, and cluster-based sampling.
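The gap between overall accuracy and minority class accuracy described above can be illustrated with a minimal sketch. It is not the thesis's MD-SVM or SWAI method. The toy data (990 majority and 10 minority examples), the helper names `accuracy`, `recall`, and `random_under_sample`, and the balancing rule (keep as many examples per class as the rarest class has) are all assumptions made for illustration. The sketch shows a trivial predictor that always outputs the majority class, followed by a simple form of random under-sampling, one of the traditional baselines the abstract names.

```python
import random
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of all examples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive):
    """Fraction of examples of class `positive` that were predicted as `positive`."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    total = sum(1 for t in y_true if t == positive)
    return hits / total if total else 0.0


def random_under_sample(X, y, seed=0):
    """Randomly drop examples so every class keeps as many as the rarest class.

    Illustrative baseline only; the thesis may define under-sampling differently.
    """
    rng = random.Random(seed)
    by_class = {}
    for features, label in zip(X, y):
        by_class.setdefault(label, []).append(features)
    n_keep = min(len(rows) for rows in by_class.values())
    X_bal, y_bal = [], []
    for label, rows in by_class.items():
        for features in rng.sample(rows, n_keep):
            X_bal.append(features)
            y_bal.append(label)
    return X_bal, y_bal


# Hypothetical toy data: label 0 is the majority class, label 1 the minority class.
y = [0] * 990 + [1] * 10
X = [[float(i)] for i in range(len(y))]

# A "classifier" that ignores its input and always predicts the majority class.
majority_label = Counter(y).most_common(1)[0][0]
y_pred = [majority_label] * len(y)

acc = accuracy(y, y_pred)
minority_recall = recall(y, y_pred, positive=1)
print(f"accuracy={acc:.2f}, minority recall={minority_recall:.2f}")

# Balance the classes before training by discarding majority examples at random.
X_bal, y_bal = random_under_sample(X, y)
balanced_counts = Counter(y_bal)
print(dict(balanced_counts))
```

On this toy data the constant predictor is right on 99% of the examples while detecting none of the minority examples, which is why accuracy alone is a poor yardstick here. Under-sampling restores the balance, but it does so by discarding most of the majority examples.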




