
An Application of Machine Learning Classification Algorithms in Prediction of Binary Imbalanced Data

Abstract


When applied to imbalanced data, machine learning algorithms misclassify the minority class at a high rate, yet in practice it is the minority class that matters most to identify correctly. Accordingly, this paper investigates machine learning algorithms that improve the predictive ability for the minority class in binary imbalanced data. Through an empirical comparison, it analyzes the classification performance of SMOTE-family resampling algorithms (SMOTE, Borderline-SMOTE, and SVM-SMOTE) combined with ensemble learning classifiers (Random Forest and XGBoost) on binary imbalanced data. The empirical analysis uses three binary imbalanced datasets, with the area under the Precision-Recall curve (AP) and the area under the ROC curve (AUC) as measures of classification performance. The results show that, among the three resampling methods, SVM-SMOTE clearly improves the predictive ability for the minority class, and that of the two classifiers, Random Forest outperforms XGBoost. Overall, the proposed hybrid model combining Random Forest with SVM-SMOTE resampling performs best and can be used to improve the classification of the minority class in binary imbalanced data.
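
For illustration, the following is a minimal sketch, not the authors' code, of the best-performing hybrid model described above: SVM-SMOTE oversampling combined with a Random Forest classifier, scored by AP and AUC. It assumes the Python packages scikit-learn and imbalanced-learn; the synthetic 95:5 dataset and the hyperparameter values are placeholders standing in for the paper's three datasets and settings.

# Hybrid model sketch: SVM-SMOTE oversampling + Random Forest classifier.
# Assumptions: scikit-learn and imbalanced-learn are installed; the synthetic
# 95:5 dataset and hyperparameters below are illustrative, not the paper's.
from imblearn.over_sampling import SVMSMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary data with a 95:5 class imbalance.
X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# The imbalanced-learn Pipeline applies SVM-SMOTE only during fit,
# so the test set keeps its original imbalanced class distribution.
model = Pipeline([
    ("resample", SVMSMOTE(random_state=42)),
    ("rf", RandomForestClassifier(n_estimators=500, random_state=42)),
])
model.fit(X_train, y_train)

# Evaluate the minority (positive) class with AP and ROC AUC.
proba = model.predict_proba(X_test)[:, 1]
print("AP :", round(average_precision_score(y_test, proba), 3))
print("AUC:", round(roc_auc_score(y_test, proba), 3))

Resampling inside the pipeline, rather than on the full dataset, keeps the synthetic minority samples out of the test set, so AP and AUC are computed on the original class distribution.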

Parallel Abstract


Predicting the minority class in imbalanced data is an important challenge for machine learning algorithms. This study explores the classification performance on binary imbalanced data of Synthetic Minority Oversampling Technique (SMOTE) resampling algorithms combined with ensemble learning techniques. Three resampling methods, namely SMOTE, Borderline-SMOTE, and Support Vector Machine SMOTE (SVM-SMOTE), are integrated with ensemble learning techniques and compared for their classification performance through an empirical analysis of three binary imbalanced datasets. Two ensemble learning techniques, Random Forest (RF) and eXtreme Gradient Boosting (XGBoost), are selected as the classification models, and Average Precision (AP) and the Area under the ROC Curve (AUC) are used to evaluate their classification performance. The results show that the SVM-SMOTE method improves the predictive ability for the minority class and that RF outperforms XGBoost in classifying binary imbalanced data. In summary, the hybrid model that combines SVM-SMOTE resampling with the RF classifier performs best for predicting binary imbalanced data and can be used to improve classification accuracy for the minority class. The hybrid model is therefore recommended for dealing with the class imbalance problem.
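
As a companion to the abstract, the sketch below outlines the comparison design it describes: the three SMOTE-family resamplers crossed with the two ensemble classifiers, each scored by AP and AUC. This is a sketch under stated assumptions, not the study's actual code: it additionally assumes the xgboost package, and the synthetic dataset and hyperparameters are again placeholders rather than the paper's data or settings.

# Comparison-grid sketch: {SMOTE, Borderline-SMOTE, SVM-SMOTE} x {RF, XGBoost},
# each scored by AP and ROC AUC. Assumptions: scikit-learn, imbalanced-learn,
# and xgboost are installed; synthetic data stands in for the paper's datasets.
from imblearn.over_sampling import SMOTE, BorderlineSMOTE, SVMSMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Same kind of synthetic 95:5 split as in the earlier sketch.
X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

resamplers = {"SMOTE": SMOTE(random_state=42),
              "Borderline-SMOTE": BorderlineSMOTE(random_state=42),
              "SVM-SMOTE": SVMSMOTE(random_state=42)}
classifiers = {"RF": RandomForestClassifier(n_estimators=500, random_state=42),
               "XGBoost": XGBClassifier(n_estimators=500, eval_metric="logloss",
                                        random_state=42)}

# Fit every resampler/classifier combination and report both metrics.
for r_name, sampler in resamplers.items():
    for c_name, clf in classifiers.items():
        pipe = Pipeline([("resample", sampler), ("clf", clf)])
        pipe.fit(X_train, y_train)
        proba = pipe.predict_proba(X_test)[:, 1]
        print(f"{r_name} + {c_name}: "
              f"AP={average_precision_score(y_test, proba):.3f}, "
              f"AUC={roc_auc_score(y_test, proba):.3f}")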
