
發展資料不平衡與類別變數限制下的生產良率分類模型

On the Development of Production Yield Classification Model Under Imbalanced Data and Categorical Variable Constraints

Advisor: 藍俊宏
The full text will be available for download on 2026/07/13.

Abstract


Fault detection and prediction systems are a key analytical component of many advanced manufacturing processes, and the usual approach is to classify or cluster the data in order to judge whether a product or work-in-process is in a normal or abnormal state. Yield classification models built with machine learning algorithms traditionally achieve good results. However, class imbalance is a common characteristic of real-world data: in high-tech manufacturing, for example, defective products typically appear at a rate of only one in a thousand, or even one in a million. Algorithms that optimize overall classification accuracy therefore tend to predict every sample as good, reaching a very high accuracy without actually learning the differences between the classes, while ignoring the extremely high cost of misclassifying defective products; such models have no practical value.

Recent literature on this problem mostly resorts to data augmentation or model parameter tuning. Some studies first analyze data characteristics, such as the imbalance ratio, density, overlap between classes, or the presence of sub-groups of different sizes within the same class, and then augment the data accordingly; however, these augmentation methods are all built on numeric variables. Building on that foundation, this thesis turns to non-numeric variables and asks how data augmentation should be performed when, for example, all features are binary. We use the Hamming distance to measure the similarity between binary feature vectors and propose a novel oversampling method driven by the interaction between the minority class and the majority data: after reducing the influence of noise and class overlap, new samples are generated between minority instances, or placed so as to avoid confusion with the majority class.

Finally, by combining undersampling with control over the balance ratio of the training set, this study runs experiments on multiple combinations of training sets produced by different augmentation methods and common classification models. The results show that models trained on oversampled training sets perform better; for extremely imbalanced data consisting solely of categorical variables, the proposed method also reveals that changing the training set has a markedly stronger effect on the final metrics than switching between models.
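To make the idea concrete, the following is a minimal sketch (Python with NumPy) of how pairwise Hamming distances over binary features can drive a SMOTE-like oversampling step that creates a new minority sample between a minority seed and one of its nearest minority neighbours. The neighbour count k and the random bit-mixing rule are illustrative assumptions, not the exact procedure proposed in the thesis.

```python
import numpy as np

def hamming_matrix(A, B):
    """Pairwise Hamming distances between rows of two 0/1 matrices."""
    # (a != b) summed over features = number of differing bits
    return (A[:, None, :] != B[None, :, :]).sum(axis=2)

def oversample_binary_minority(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic binary minority samples (illustrative only).

    Each new sample mixes the bits of a random minority seed with one of
    its k nearest minority neighbours under the Hamming distance.
    """
    rng = np.random.default_rng(rng)
    k = min(k, len(X_min) - 1)
    D = hamming_matrix(X_min, X_min)
    np.fill_diagonal(D, D.max() + 1)        # exclude self as a neighbour
    nbrs = np.argsort(D, axis=1)[:, :k]     # k nearest minority neighbours
    new_rows = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))        # random minority seed
        j = nbrs[i, rng.integers(k)]        # one of its nearest neighbours
        mask = rng.integers(0, 2, X_min.shape[1]).astype(bool)
        new_rows.append(np.where(mask, X_min[i], X_min[j]))
    return np.array(new_rows)
```

In the actual method, the choice of seeds and neighbours would additionally account for noise points and for overlap with the majority class, as described above.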

Abstract (English)


The classification system for fault detection and prediction is an important analysis tool in advanced manufacturing processes; classification or clustering is commonly used to tell whether the state of a product is anomalous or not. Yield classification models built with machine learning usually achieve good results, but identifying the minority class becomes difficult when the training data are imbalanced. Class imbalance, and the disproportionately high cost of misclassification that comes with it, is a common problem in high-tech manufacturing: defective products typically occur with a probability of one in a thousand or even one in a million, so an algorithm that optimizes overall accuracy tends to label all products as good in order to reach a high accuracy rate. Such classifiers do not learn the difference between classes and ignore the extremely high cost of misclassification.

Recent studies tackle the imbalanced-data problem mostly by data augmentation or model parameter optimization. Some start by analyzing data characteristics, such as the imbalance ratio, distribution density, and overlap between classes, and then augment the minority data; nevertheless, these augmentation methods are based purely on numeric variables. In this thesis, we develop a novel data augmentation approach for discrete variables, in particular binary ones. The Hamming distance is employed to measure the similarity among binary feature vectors, and a new oversampling method based on the interaction between the minority and the majority is proposed; new minority data are generated after taking the noise in the data distribution and possible confusion with the majority class into account.

Finally, by combining conventional undersampling methods and controlling the balance ratio of the training data, this thesis conducts a variety of experiments in which the proposed oversampling algorithm generates minority data that are then used to train machine learning models. The results show that with the proposed oversampling method, model performance is consistently better, and the gains from modifying the training set outweigh those obtained from model optimization.
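As a companion illustration of the "undersampling plus balance-ratio control" step, the sketch below randomly removes majority samples until a chosen minority-to-majority ratio is reached. The target ratio and the purely random selection are assumptions for illustration, not the experimental settings used in the thesis.

```python
import numpy as np

def undersample_to_ratio(X, y, target_ratio=0.5, minority_label=1, rng=None):
    """Randomly drop majority samples until
    n_minority / n_majority >= target_ratio (a sketch, not the thesis setup)."""
    rng = np.random.default_rng(rng)
    min_idx = np.flatnonzero(y == minority_label)
    maj_idx = np.flatnonzero(y != minority_label)
    # number of majority samples to keep so the ratio constraint holds
    n_keep = min(len(maj_idx), int(np.ceil(len(min_idx) / target_ratio)))
    keep = rng.choice(maj_idx, size=n_keep, replace=False)
    idx = np.concatenate([min_idx, keep])
    rng.shuffle(idx)
    return X[idx], y[idx]
```

A rebalanced training set could then be assembled by concatenating the retained majority samples with the original and synthetic minority samples before fitting any standard classifier.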

