A Study of Classification Methods for Imbalanced Data

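Every field faces its own difficulties in data processing, and imbalanced data is among the more troublesome. Existing research either undersamples the majority class or oversamples the minority class, but if handled carelessly, undersampling tends to discard important information carried by the samples, and oversampling tends to make the classifier overfit. Many studies also improve or optimize the classifier itself; however, classification results depend largely on the quality of the data, so improving the classifier alone brings no significant gain.

This study combines SMOTE (Synthetic Minority Oversampling Technique) with NearMiss undersampling to address data imbalance, and compares the combination against oversampling and SMOTE by building a decision tree classification model for each. Experiments show that the NMS (NearMiss-2 SMOTE) sampling method was the best sampling method on all four datasets tested, and that it also achieved the highest classification accuracy on minority-class samples among the sampling methods compared.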

Keywords

SMOTE (Synthetic Minority Oversampling Technique); oversampling; undersampling; NearMiss; decision tree

Parallel Abstract

Every field encounters its own problems in data processing, and imbalanced data is one of the more difficult. Current research offers undersampling of the majority class and oversampling of the minority class, but when either is handled improperly, undersampling can lose important information carried by the samples, and oversampling can cause the classifier to overfit. Many studies also improve and optimize the classifier, but the quality of the data has the greater impact on classification results, so improving the classifier alone does not help significantly.

This study combines SMOTE (Synthetic Minority Oversampling Technique) and NearMiss undersampling to solve the problem of data imbalance, and compares the combination with the oversampling method and the SMOTE method by building a decision tree classification model for each. The experiments show that the NMS (NearMiss-2 SMOTE) sampling method is the best sampling method on all four datasets, and that its classification accuracy on minority-class samples is also the highest among the sampling methods compared.
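The abstract does not spell out how NearMiss-2 and SMOTE are chained, so the following is only a minimal sketch of one plausible reading: shrink the majority class with the NearMiss-2 rule (keep the majority samples whose mean distance to their k farthest minority samples is smallest), then grow the minority class with SMOTE-style interpolation until both classes reach the same size. The function names (`nearmiss2`, `smote`, `nms_resample`), the `target` size, `k=3`, and the toy data are all assumptions made for illustration, not the study's implementation. The sketch needs at least two minority samples.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearmiss2(majority, minority, n_keep, k=3):
    """Keep the n_keep majority samples whose mean distance to their
    k farthest minority samples is smallest (the NearMiss-2 rule)."""
    def score(m):
        farthest = sorted((dist(m, p) for p in minority), reverse=True)[:k]
        return sum(farthest) / len(farthest)
    return sorted(majority, key=score)[:n_keep]


def smote(minority, n_new, k=3, rng=None):
    """Create n_new synthetic minority samples, each a random point on the
    segment between a minority sample and one of its k nearest neighbours."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: dist(base, p))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


def nms_resample(majority, minority, target, k=3, seed=0):
    """Assumed combination: undersample the majority class to `target`
    with NearMiss-2, then oversample the minority class to `target`."""
    rng = random.Random(seed)
    kept = nearmiss2(majority, minority, min(target, len(majority)), k)
    extra = smote(minority, max(0, target - len(minority)), k, rng)
    return kept, list(minority) + extra


# Toy two-dimensional data: 40 majority points, 8 minority points.
rng = random.Random(42)
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(40)]
minority = [(rng.gauss(3, 0.5), rng.gauss(3, 0.5)) for _ in range(8)]

kept, grown = nms_resample(majority, minority, target=20)
print(len(kept), len(grown))
```

The balanced `kept` and `grown` sets would then be passed to a decision tree learner, which this sketch leaves out. In practice a maintained library implementation of both samplers is likely preferable to hand-written code.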