  • Thesis

改善資料不平衡分類問題之研究-以鋁壓鑄件為例

A Study on Improving the Classification of Imbalanced Data - Case Study of Aluminum Die Casting

Advisor: 周永燦

Abstract


Data classification is one of the important research topics in machine learning. In theory, existing classifiers are built on the assumption of balanced class distributions, yet imbalanced data are very common in practice: credit-card defaults and bankruptcy in finance, insurance claims, potential-customer prediction, rare diseases in medicine, and defective products in industrial quality management. In all of these examples the minority class is the one that matters most in practice. Under class imbalance, however, a classifier that maximizes overall accuracy tends to predict samples as the majority class, sacrificing its ability to recognize the minority class. This study therefore investigates methods for handling imbalanced data and for improving minority-class classification.

The analysis is based on actual production data from a die-casting plant in Taiwan's automotive industry. Because of the high precision of the process, defective parts are extremely rare, so the data suffer from severe class imbalance: a model that predicts whether a casting is defective can achieve high accuracy while failing to detect the defective parts. This thesis applies three resampling algorithms derived from SMOTE (Synthetic Minority Oversampling Technique), namely ADASYN (Adaptive Synthetic), Borderline-SMOTE, and SMOTETomek, to address the imbalance, and constructs three prediction models, Artificial Neural Networks (ANNs), Support Vector Machine (SVM), and Random Forest, to validate the resampling results. Finally, Recall and F-Measure from the confusion matrix, together with the Area Under the Receiver Operating Characteristic Curve (AUC-ROC), are used to compare the combinations of resampling algorithm and model and to identify the best-fitting prediction model.

The results show that all three resampling methods effectively resolve the imbalance problem for this classification task and improve the models' ability to classify the minority class. Among them, SMOTETomek performs best, and the combination of SMOTETomek and SVM yields the best result, raising model prediction accuracy to 99%.

Parallel Abstract (English)


Data classification is one of the important research topics in machine learning. In theory, existing classifiers assume balanced data sets, but imbalanced data is a very common problem in practice, arising in credit-card defaults and bankruptcy in finance, insurance claims, potential-customer forecasting, rare diseases in medicine, defective products in industrial quality management, and so on. In these examples the minority class is the one that matters most in practice. In the case of imbalanced data, however, a classifier that aims to maximize classification accuracy tends to predict samples as the majority class, sacrificing the ability to classify the minority class. This study therefore explores ways to deal with imbalanced data and to improve the classification of the minority class.

We use actual production data from a die-casting plant in Taiwan's automobile industry as the basis for analysis. Owing to the high precision of the process, defective products are very rare, so the data suffer from a serious imbalance problem: the prediction model for judging whether castings have defects achieves high accuracy yet can hardly predict defective products. This thesis applies three resampling algorithms based on SMOTE (Synthetic Minority Oversampling Technique), namely ADASYN (Adaptive Synthetic), Borderline-SMOTE, and SMOTETomek, to deal with the imbalance, and constructs three prediction models, Artificial Neural Networks (ANNs), Support Vector Machine (SVM), and Random Forest (RF), to verify the resampling results.

Finally, Recall and F-Measure from the confusion matrix, together with the Area Under the Receiver Operating Characteristic Curve (AUC-ROC), are used to evaluate the combinations of the three resampling algorithms and the three models and to find the best-fitting prediction model. The results show that all three resampling methods effectively solve the imbalance problem for this classification task and improve the models' ability to classify the minority class. Among them, SMOTETomek has the best ability, and the configuration of SMOTETomek with SVM gives the best result, with model prediction accuracy as high as 99%.

References


Al-Badarneh, I., Habib, M., Aljarah, I., & Faris, H. (2020). Neuro-evolutionary models for imbalanced classification problems. Journal of King Saud University - Computer and Information Sciences.
Bach, M., Werner, A., Żywiec, J., & Pluskiewicz, W. (2017). The study of under- and over-sampling methods' utility in analysis of highly imbalanced data on osteoporosis. Information Sciences, 384, 174-190.
Batista, G. E., & Monard, M. C. (2003). An analysis of four missing data treatment methods for supervised learning. Applied Artificial Intelligence, 17(5-6), 519-533.
Batista, G. E., Prati, R. C., & Monard, M. C. (2004). A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explorations Newsletter, 6(1), 20-29.
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32.
