This study investigates how different data-preprocessing sampling strategies, and in particular the tuning of their parameters, affect the performance and efficiency of Gradient Boosting Algorithms on imbalanced datasets. We briefly introduce two gradient boosting methods, the traditional Gradient Boosting and XGBoost, and adopt XGBoost as the main analysis tool of this study. Two datasets of different sizes are used in the experiments: a smaller diabetes dataset and a larger credit card fraud dataset. In the preprocessing stage, we review three common treatments for imbalanced data, namely Down-sampling the Majority Class, Up-sampling the Minority Class, and the Synthetic Minority Over-sampling Technique (SMOTE), and focus on Down-sampling the Majority Class as our sampling method. Through hyperparameter tuning with RandomizedSearchCV, we adjust the relevant parameters and evaluate their impact on model performance and computational efficiency.
The purpose of this thesis is to study the effect of different ways of down-sampling data on classification outcomes when Gradient Boosting Algorithms are applied to imbalanced datasets. The main issue we focus on is how changing various parameters during performance tuning affects accuracy and efficiency. We consider two gradient boosting algorithms, the traditional Gradient Boosting and XGBoost, with XGBoost as our main focus. Two datasets of different sizes are used: a smaller diabetes dataset and a much larger credit card fraud dataset. Three methods are commonly used to handle imbalanced datasets: Down-sampling the Majority Class, Up-sampling the Minority Class, and the Synthetic Minority Over-sampling Technique (SMOTE). We focus on Down-sampling the Majority Class as our sampling technique. By evaluating various parameter settings through RandomizedSearchCV, we estimate how they affect model performance and accuracy.
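The workflow described above can be sketched as follows. This is a minimal illustration, not the thesis code: the synthetic dataset and the parameter ranges are assumptions, and scikit-learn's GradientBoostingClassifier (the "traditional" gradient boosting mentioned above) stands in for XGBoost to keep the example dependency-light; an XGBClassifier could be dropped in with the same search interface.

```python
# Illustrative sketch (assumed data and parameter ranges, NOT the thesis code):
# down-sample the majority class, then tune a gradient boosting model
# with RandomizedSearchCV.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic imbalanced data standing in for the diabetes / credit card sets
# (about 5% minority class).
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

# Down-sampling the Majority Class: keep every minority example and a
# random subset of majority examples of the same size.
rng = np.random.default_rng(42)
minority = np.flatnonzero(y_train == 1)
majority = np.flatnonzero(y_train == 0)
keep = rng.choice(majority, size=len(minority), replace=False)
idx = np.concatenate([minority, keep])
X_bal, y_bal = X_train[idx], y_train[idx]

# Randomized search over a few boosting hyperparameters (ranges assumed).
param_dist = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.1, 0.3],
    "subsample": [0.6, 0.8, 1.0],
}
search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_dist, n_iter=5, cv=3, scoring="f1", random_state=42)
search.fit(X_bal, y_bal)

print(search.best_params_)
print(search.score(X_test, y_test))  # F1 on the untouched test split
```

Note that only the training split is down-sampled; the test split keeps its original class ratio, so the reported score reflects performance on realistically imbalanced data.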