In recent years, as information has become more abundant and storage technology has advanced, the datasets we work with have grown ever larger, and many of them are imbalanced. Performance tuning a classifier on such a large imbalanced dataset usually takes many hours, or even days, to complete, wasting a great deal of time. For this reason, this thesis uses the XGBoost model to investigate performance tuning on large imbalanced datasets: instead of tuning on the whole dataset, we tune on a sample drawn at a certain ratio, and, as a comparison, we also tune on all the data. The optimal parameters obtained from each of the two approaches are then used to train a model on all the data, and we examine whether the accuracies of the two resulting models differ significantly.
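To make the comparison concrete, the sketch below illustrates one way such an experiment could be set up in Python with xgboost and scikit-learn. It is only a minimal illustration under assumed settings: the synthetic imbalanced dataset, the 10% sampling ratio, and the small parameter grid are placeholders, not the data or search space actually used in this work.

# Minimal sketch of the subsample-vs-full-data tuning comparison.
# All concrete settings here (dataset, ratio, grid) are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from xgboost import XGBClassifier

# A synthetic imbalanced dataset stands in for a real large dataset.
X, y = make_classification(n_samples=50_000, n_features=20,
                           weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

param_grid = {"max_depth": [3, 6], "learning_rate": [0.1, 0.3]}

def tune(X_part, y_part):
    # Grid-search XGBoost hyperparameters on the given (sub)sample.
    search = GridSearchCV(XGBClassifier(eval_metric="logloss"),
                          param_grid, scoring="accuracy", cv=3)
    search.fit(X_part, y_part)
    return search.best_params_

# Approach 1: tune on a stratified 10% subsample of the training data.
X_sub, _, y_sub, _ = train_test_split(
    X_train, y_train, train_size=0.1, stratify=y_train, random_state=0)
params_sub = tune(X_sub, y_sub)

# Approach 2: tune on the full training data.
params_full = tune(X_train, y_train)

# Train final models on ALL training data with each set of best parameters
# and compare their test accuracy.
for name, params in [("subsample-tuned", params_sub),
                     ("full-data-tuned", params_full)]:
    model = XGBClassifier(eval_metric="logloss", **params)
    model.fit(X_train, y_train)
    print(name, params, "test accuracy:", model.score(X_test, y_test))

In this sketch, the time saved comes from running the expensive hyperparameter search on only a fraction of the training data; the final models are still trained on all the data, so the comparison isolates the effect of where the tuning was performed.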