An Approach to Improving the Classification Accuracy of Minority-Class Data in Imbalanced Datasets

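In an electronic commerce environment, classification techniques help us understand and predict customer behavior. For classification techniques, the distribution of the training data is often one of the key factors affecting classification accuracy. In many real-world applications, however, the class distribution of the target attribute is imbalanced: most of the data belong to the majority class, while only a small portion belong to the minority class. In this situation, a classifier tends to predict the majority class for most incoming data, so its predictive ability on minority-class data is very poor. Therefore, when the target attribute has an imbalanced class distribution, selecting a suitable (that is, balanced) training dataset is very important. This paper proposes a cluster-based under-sampling approach that selects representative majority-class data for the training dataset in order to improve the classification accuracy of minority-class data in imbalanced datasets. Experimental results show that the proposed approach outperforms those of previous studies.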

Keywords

Classification; Clustering; Under-sampling; Imbalanced Dataset

Parallel Abstract

In an electronic commerce environment, classification techniques can help us understand and predict customer behavior. One of the most important factors affecting classification accuracy is the training data. However, data in real-world applications often have an imbalanced class distribution; that is, most of the data belong to the majority class and few belong to the minority class. In this case, if all the data are used as training data, the classifier tends to predict that most incoming data belong to the majority class. Hence, it is important to select suitable training data for classification under an imbalanced class distribution. In this paper, we propose cluster-based under-sampling approaches that select representative data as training data to improve the classification accuracy for the minority class under an imbalanced class distribution. The experimental results show that our cluster-based under-sampling approaches outperform the under-sampling techniques of previous studies.

Parallel Keywords

Classification; Clustering; Under-sampling; Imbalanced Dataset
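
The abstract names cluster-based under-sampling but does not spell out the selection rule, so the following is only a minimal sketch under stated assumptions, not the paper's actual procedure. It assumes the majority class is clustered with a plain k-means, that each cluster contributes a quota proportional to its size, and that the samples nearest each centroid count as "representative". The function names `kmeans` and `cluster_based_undersample`, the parameter `k`, and the synthetic data are all hypothetical.

```python
import random
from math import dist


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means on tuples; returns (clusters, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its members;
        # an empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    return clusters, centroids


def cluster_based_undersample(majority, minority, k=3, seed=0):
    """Assumed rule: keep about len(minority) majority samples, taking from
    each cluster a share proportional to its size, nearest-to-centroid first."""
    clusters, centroids = kmeans(majority, k, seed=seed)
    target = len(minority)
    selected = []
    for members, center in zip(clusters, centroids):
        quota = round(target * len(members) / len(majority))
        ranked = sorted(members, key=lambda p: dist(p, center))
        selected.extend(ranked[:quota])
    return selected


# Synthetic 2-D data (hypothetical): 90 majority samples, 10 minority samples.
rng = random.Random(42)
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(90)]
minority = [(rng.gauss(3, 0.5), rng.gauss(3, 0.5)) for _ in range(10)]

balanced_majority = cluster_based_undersample(majority, minority, k=3)
# Label 0 = majority class, label 1 = minority class.
training_set = [(p, 0) for p in balanced_majority] + [(p, 1) for p in minority]
```

Because each cluster's quota is rounded, the number of retained majority samples can differ from the minority count by a sample or two. Other selection rules (for example, random draws within each cluster) would fit the same skeleton by changing only how `ranked` is built.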

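Cited By

俞允晨 (2017). Variable Selection for High-Dimensional Imbalanced Data Algorithms [Master's thesis, Tamkang University]. Airiti Library. https://doi.org/10.6846/TKU.2017.00509

吳思葦 (2015). A Study of Applying Data Mining to Predictive Models of Potential Customers in the Banking Industry [Master's thesis, Tamkang University]. Airiti Library. https://doi.org/10.6846/TKU.2015.00034

陳祝美 (2011). A Preliminary Study of Early-Warning Models for Shell Companies Across Different Industries [Master's thesis, Yuan Ze University]. Airiti Library. https://www.airitilibrary.com/Article/Detail?DocID=U0009-2801201414590900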
