This study investigates how different data-preprocessing sampling strategies, and in particular the tuning of their parameters, affect the performance and efficiency of Gradient Boosting Algorithms on imbalanced datasets. We briefly introduce two gradient boosting methods, the traditional Gradient Boosting and XGBoost, and adopt XGBoost as the main analysis tool of this study. Two datasets of different sizes are used in the experiments: a smaller diabetes dataset and a larger credit card fraud dataset. In the preprocessing stage, we review three common treatments for imbalanced data, namely Down-sampling the Majority Class, Up-sampling the Minority Class, and the Synthetic Minority Over-sampling Technique (SMOTE), and focus on Down-sampling the Majority Class as our sampling method. Through hyperparameter tuning with RandomizedSearchCV, we adjust the relevant parameters and evaluate their impact on model performance and computational efficiency.
The purpose of this thesis is to study the effect of different ways of down-sampling data on classification outcomes when Gradient Boosting Algorithms are applied to imbalanced datasets. The main issue we focus on is how changing various parameters during performance tuning affects accuracy and efficiency. We consider two gradient boosting algorithms, the traditional Gradient Boosting and XGBoost, with XGBoost as our main focus. Two datasets of different sizes are used: a smaller diabetes dataset and a much larger credit card fraud dataset. Three methods are commonly used to handle imbalanced datasets: Down-sampling the Majority Class, Up-sampling the Minority Class, and the Synthetic Minority Over-sampling Technique (SMOTE). We focus on Down-sampling the Majority Class as our sampling technique. By evaluating various parameter settings through RandomizedSearchCV, we estimate how they affect model performance and accuracy.
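The workflow described above can be sketched as follows. This is a minimal illustration, not the thesis code: the synthetic dataset and the parameter ranges are assumptions, and scikit-learn's GradientBoostingClassifier (the "traditional" gradient boosting mentioned above) stands in for XGBoost to keep the example dependency-light; an XGBClassifier could be dropped in with the same search interface.

```python
# Illustrative sketch (assumed data and parameter ranges, NOT the thesis code):
# down-sample the majority class, then tune a gradient boosting model
# with RandomizedSearchCV.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic imbalanced data standing in for the diabetes / credit card sets
# (about 5% minority class).
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

# Down-sampling the Majority Class: keep every minority example and a
# random subset of majority examples of the same size.
rng = np.random.default_rng(42)
minority = np.flatnonzero(y_train == 1)
majority = np.flatnonzero(y_train == 0)
keep = rng.choice(majority, size=len(minority), replace=False)
idx = np.concatenate([minority, keep])
X_bal, y_bal = X_train[idx], y_train[idx]

# Randomized search over a few boosting hyperparameters (ranges assumed).
param_dist = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.1, 0.3],
    "subsample": [0.6, 0.8, 1.0],
}
search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_dist, n_iter=5, cv=3, scoring="f1", random_state=42)
search.fit(X_bal, y_bal)

print(search.best_params_)
print(search.score(X_test, y_test))  # F1 on the untouched test split
```

Note that only the training split is down-sampled; the test split keeps its original class ratio, so the reported score reflects performance on realistically imbalanced data.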