
On Development of Advanced Data Augmentation Technique for Imbalanced Data Analytics

Advisor: 藍俊宏
The full text will be open for download on 2026/08/01.

Abstract


In instance classification tasks, data imbalance is a common and thorny problem across industries. When machine learning models are applied to such data, they fail to learn the target of interest, typically the scarce defective samples. Three remedies are commonly used: model learning that penalizes misclassified samples, data resampling, and generating synthetic minority-class samples. Each has shortcomings: resampling easily leads to overfitting or deletes important sample information; penalized learning on misclassified samples yields limited improvement on extremely imbalanced data; and synthetic samples may end up resembling majority-class samples.

Given these limitations, this study proposes a generation method based on the Principal Component-based Mahalanobis Distance (PCMD), which builds a private space for each minority-class sample before generating new minority samples. The standardized data are first reduced with PCA; then, with each minority sample as a center, the Mahalanobis distance from every sample to that center is computed, and a chi-squared test filters the data to update the distribution. Finally, the shortest (or second-shortest) distance to a majority-class sample serves as a constraint when generating new samples. Through this process, which accounts for the majority class, centers on each minority sample, and updates its distribution, synthetic samples similar to the minority class are generated. Five classification models are then applied: Logistic Regression (LR), Random Forest (RF), Support Vector Machine (SVM), eXtreme Gradient Boosting (XGBoost), and Light Gradient Boosting Machine (LGBM), and the predictive performance is benchmarked against existing synthetic-data methods: SMOTE, ADASYN, VAE, and GAN. The results show that the proposed PCMD method outperforms the existing generation methods on every model; the LR model attains the highest recall, while the XGBoost model is the most stable across datasets.

Keywords: imbalanced data; principal component analysis; Mahalanobis distance; data augmentation; data resampling; machine learning; yield analytics
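The PCA-then-Mahalanobis filtering step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: the function name `pcmd_filter`, the choice of `n_components=2`, the pooled covariance estimate, and the 95% chi-squared level are all assumptions made for the example.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def pcmd_filter(X, y, minority_label=1, n_components=2, alpha=0.95):
    """For each minority sample (taken as a center), keep only the samples
    whose squared Mahalanobis distance to that center falls below the
    chi-squared cutoff, yielding that sample's 'private space'."""
    Xs = StandardScaler().fit_transform(X)          # standardize
    Z = PCA(n_components=n_components).fit_transform(Xs)  # reduce with PCA
    minority = Z[y == minority_label]
    # Pooled covariance of the reduced data, shared by every center.
    cov_inv = np.linalg.inv(np.cov(Z, rowvar=False))
    cutoff = chi2.ppf(alpha, df=n_components)       # chi-squared filter
    spaces = []
    for center in minority:
        diff = Z - center
        # Squared Mahalanobis distance of every sample to this center.
        d2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)
        spaces.append(Z[d2 <= cutoff])
    return spaces
```

Each returned array is the filtered neighborhood around one minority sample; new samples would then be drawn inside it subject to the majority-distance constraint the abstract describes.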

Parallel Abstract (English)


Imbalanced data is an increasingly common and critical problem in various industries, as it prevents machine learning models from learning what they should focus on. Three approaches are commonly used to address classification on imbalanced data: penalizing misclassified samples during learning, resampling the data, and generating synthetic samples. However, these methods are not flawless: when data are extremely imbalanced, penalization has little effect; resampling easily leads to overfitting or the loss of key samples; and synthetic samples may fall too close to the majority ones.

In this thesis, a data-synthesizing method called Principal Component-based Mahalanobis Distance (PCMD) is proposed to generate minority data. First, dimension reduction is performed on normalized data using PCA. Each minority sample is then treated as a center, and the Mahalanobis distances to the remaining samples are computed. A chi-squared test filters out the samples that lie too far from the center, updating the private space of the minority sample, within which synthetic samples are finally generated. Five classification models are tested on the synthesized dataset: Logistic Regression, Random Forest, Support Vector Machine, eXtreme Gradient Boosting, and Light Gradient Boosting Machine. The proposed method is also benchmarked against other data augmentation methods, namely SMOTE, ADASYN, VAE, and GAN. The results show that PCMD obtains better results than the conventional augmentation methods on all models; the LR model achieves the highest recall, while the XGBoost model performs most stably.

Keywords: imbalanced data; principal component analysis; Mahalanobis distance; data augmentation; resampling; machine learning; yield analytics
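The benchmarking protocol above, oversampling the minority class and then scoring a classifier by minority-class recall, can be sketched with a simplified SMOTE-style interpolator. Everything here is illustrative: `smote_like` is a hypothetical helper (real benchmarks would use the imbalanced-learn library's SMOTE and ADASYN), and the dataset is synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors

def smote_like(X_min, n_new, k=5, rng=None):
    """Simplified SMOTE: each new sample is interpolated between a random
    minority point and one of its k nearest minority neighbours."""
    rng = rng or np.random.default_rng(0)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    new = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        j = idx[i][rng.integers(1, k + 1)]   # skip self at position 0
        lam = rng.random()
        new.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(new)

# Synthetic 90/10 imbalanced dataset stands in for real yield data.
X, y = make_classification(n_samples=600, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

X_min = X_tr[y_tr == 1]
n_new = (y_tr == 0).sum() - len(X_min)       # oversample to a 1:1 ratio
X_bal = np.vstack([X_tr, smote_like(X_min, n_new)])
y_bal = np.concatenate([y_tr, np.ones(n_new, dtype=int)])

clf = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
recall = recall_score(y_te, clf.predict(X_te))  # minority-class recall
```

Swapping in other classifiers (RF, SVM, XGBoost, LGBM) or other samplers at the marked lines reproduces the comparison grid the abstract describes.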

