
Applications of Ensemble Learning Bagging Model on Imbalanced Dataset

Advisor: 陳開煇

Abstract


When an enterprise performs classification on its data, the target feature of the dataset can usually be divided into several classes, and these classes may differ greatly in size. Well-known examples are credit card fraud detection and the diagnosis of diabetes from clinical data. In such datasets, one class is typically much more numerous than the others, and as a result the prediction model performs much better on the majority class than on the minority class. To improve on this, we experiment with ensemble learning, with the goal of raising the accuracy of the resulting model. In this thesis we focus on bagging models and their extensions: we study how to tune the parameters of the decision trees used as weak learners, and we examine the effect of adopting various sampling strategies. The goal is to identify the strategies one should adopt when using bagging models so as to predict minority classes with better accuracy.
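The setup the abstract describes, bagging shallow decision trees on a skewed dataset and judging the result by minority-class performance, can be sketched with scikit-learn. This is a minimal illustration, not the thesis's actual experiments: the dataset is synthetic and every parameter value (tree depth, number of estimators, sampling fraction, class ratio) is an assumed placeholder.

```python
# Illustrative sketch: bagging depth-limited decision trees on a
# synthetic imbalanced dataset with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data with roughly a 95:5 class ratio.
X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# Weak learner: a shallow decision tree; class_weight="balanced"
# reweights the minority class to counter the skewed frequencies.
tree = DecisionTreeClassifier(max_depth=5, class_weight="balanced")

# Bagging trains each tree on a bootstrap sample; max_samples is one
# of the sampling knobs one can vary.  (The first argument is named
# `estimator` in scikit-learn >= 1.2, `base_estimator` before that,
# so it is passed positionally here.)
clf = BaggingClassifier(tree, n_estimators=50,
                        max_samples=0.8, random_state=0)
clf.fit(X_train, y_train)

# Balanced accuracy averages recall over the classes, so it is not
# dominated by the majority class the way plain accuracy is.
print(balanced_accuracy_score(y_test, clf.predict(X_test)))
```

Reporting balanced accuracy (or per-class precision/recall) rather than plain accuracy matters here: a model that always predicts the majority class would already score about 95% plain accuracy on this data while being useless on the minority class.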
