A Study on Developing Methods for Solving Class Imbalance Problems

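The class imbalance problem has received considerable attention in machine learning. It arises when one class in the training examples far outnumbers another. When mining imbalanced data, traditional machine learning methods achieve relatively high accuracy on majority-class examples but a high error rate when detecting the minority class. To address this problem, this study proposes two new methods: modified cluster based sampling (MCBS) and the BPN based voting scheme (BPS), a voting mechanism built on back-propagation neural networks. Seven class-imbalanced data sets and three real-world cases of blog sentiment classification are used, with 4-fold cross-validation, to verify the effectiveness of the proposed methods. MCBS aims to remedy the shortcomings of traditional cluster based sampling. BPS strengthens the traditional voting scheme by training a BPN to obtain the optimal weight for each voting classifier. The proposed methods are also applied to text classification, mainly because text data is high-dimensional and has small sample sizes. Experimental results show that, compared with traditional methods for handling imbalanced data, such as random over-sampling of the minority class, cluster based sampling, the self-organizing map network weight method, two-stage learning, and one-class learning, the proposed methods not only detect the minority class better but also deliver more stable classification performance.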
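The gap between overall accuracy and minority-class detection described above can be illustrated with a small sketch. It is not taken from the thesis: the toy labels, the helper names (`stratified_folds`, `minority_recall`), and the always-predict-the-majority baseline are assumptions chosen only to show why accuracy alone is misleading under 4-fold cross-validation on imbalanced data.

```python
from collections import Counter


def stratified_folds(labels, k=4):
    """Split indices into k folds, dealing each class out round-robin."""
    folds = [[] for _ in range(k)]
    seen = Counter()
    for i, y in enumerate(labels):
        folds[seen[y] % k].append(i)
        seen[y] += 1
    return folds


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def minority_recall(y_true, y_pred, minority):
    """Fraction of true minority examples that were predicted as minority."""
    positives = [p for t, p in zip(y_true, y_pred) if t == minority]
    if not positives:
        return 0.0
    return sum(p == minority for p in positives) / len(positives)


# Hypothetical toy data: 92 majority examples (0) and 8 minority examples (1).
labels = [0] * 92 + [1] * 8
folds = stratified_folds(labels, k=4)

results = []
for fold in folds:
    held_out = set(fold)
    train_y = [y for i, y in enumerate(labels) if i not in held_out]
    test_y = [labels[i] for i in fold]
    # Baseline "classifier": always predict the most common training label.
    majority = Counter(train_y).most_common(1)[0][0]
    pred = [majority] * len(test_y)
    results.append((accuracy(test_y, pred),
                    minority_recall(test_y, pred, minority=1)))

for acc, rec in results:
    print(f"accuracy={acc:.2f}  minority recall={rec:.2f}")
```

In every fold this baseline scores 0.92 accuracy while detecting none of the minority examples, which is why minority-class detection needs to be reported alongside accuracy when judging methods for imbalanced data.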

Keywords

Decision tree; Data mining; Back-propagation neural network; Classification; Class imbalance problems

Parallel Abstract

Class imbalance problems have attracted much attention in the field of machine learning. The problem arises when, among the training examples, one class far outnumbers the others. When learning from such imbalanced data, traditional machine learning algorithms achieve relatively high accuracy on majority-class examples but an unacceptable error rate on minority-class instances, which are usually the important ones. To solve this problem, this study proposes two novel methods, called "Modified cluster based sampling" (MCBS) and "BPN based voting scheme" (BPS). Seven data sets from the UCI data bank and three real cases of bloggers' sentiment classification are used to verify the effectiveness of the proposed methods. In addition, four-fold cross-validation experiments were carried out to obtain high-quality solutions. MCBS addresses the shortcomings of the traditional cluster based sampling method. BPS enhances the traditional voting scheme by using a BPN network to obtain the optimal vote weights. The proposed methodologies are also applied to the classification of textual sentiment data, which usually suffers from high dimensionality and small sample sizes. Experimental results indicate that, compared with conventional methods for handling imbalanced data, such as under-sampling, cluster based sampling, the self-organizing map network weights method, the two-stage learning strategy, and one-class learning, the proposed methods not only improve the detection of minority examples but also deliver stable classification performance.
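The idea of learning vote weights instead of counting every classifier equally can be sketched as follows. This is not the BPS architecture from the thesis, whose details are not given in the abstract: as a simplifying assumption, a single sigmoid unit trained by gradient descent stands in for the BPN, and the votes, targets, and function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def learn_vote_weights(votes, targets, epochs=200, lr=0.5):
    """Fit one weight per base classifier (plus a bias) by gradient descent.

    votes:   list of rows, each row holding the 0/1 outputs of the base
             classifiers for one example.
    targets: list of true 0/1 labels.
    """
    w = [0.0] * len(votes[0])
    b = 0.0
    for _ in range(epochs):
        for x, t in zip(votes, targets):
            y = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = y - t  # gradient of the cross-entropy loss w.r.t. the pre-activation
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b


def weighted_vote(x, w, b):
    """Combine base-classifier votes using the learned weights."""
    return 1 if sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= 0.5 else 0


def majority_vote(x):
    """Plain voting: every base classifier counts equally."""
    return 1 if 2 * sum(x) > len(x) else 0


# Hypothetical votes from three base classifiers; in this toy set the first
# classifier happens to be right on every example and the other two are noisy.
votes = [[1, 0, 0], [1, 1, 0], [0, 1, 1], [0, 0, 0],
         [1, 0, 1], [0, 1, 0], [1, 0, 0], [0, 0, 1]]
targets = [1, 1, 0, 0, 1, 0, 1, 0]

w, b = learn_vote_weights(votes, targets)
print("plain majority on [1, 0, 0]:", majority_vote([1, 0, 0]))
print("weighted vote on [1, 0, 0]: ", weighted_vote([1, 0, 0], w, b))
```

On this toy set, plain majority voting rejects the example `[1, 0, 0]` because two of the three classifiers vote 0, whereas the learned weights let the reliable classifier's vote decide. A network with hidden layers, as the name BPN suggests, could also capture interactions between classifiers that a single unit cannot.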

Parallel Keywords

Decision tree; Data mining; Back-propagation neural network; Class imbalance problems; Classification

