A Comparison of CatBoost and Other Gradient Boosting Methods

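In recent years, artificial intelligence has become increasingly widespread, and many industries are trying to use computers to replace work that has traditionally required a large amount of human labor. One such task is the manual analysis of massive datasets. If machine learning can reach an adequate level of performance within a reasonable amount of time, it will bring sweeping changes across industries. Most enterprise data, however, is imbalanced, as is the credit card fraud dataset we test. Traditional machine learning methods tend to perform poorly on imbalanced data: they are hard to train and fall short of the expected results. Ensemble learning offers several methods that can outperform traditional methods on imbalanced data, such as CatBoost, XGBoost, and Gradient Boosting. This thesis focuses on introducing the features of these methods and comparing their results. We test the three models on two different datasets and discuss the relative strengths and weaknesses of CatBoost and the other gradient boosting methods.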

Keywords

Machine learning; ensemble learning

Parallel Abstract

In recent years, artificial intelligence has become more and more popular. Many industries are trying to use computers to carry out labor-intensive jobs that are traditionally done by humans, and data analysis is one of them. If a machine learning algorithm can reach reasonable accuracy within a reasonable amount of time, it will reshape these industries. Most industrial datasets, however, are imbalanced, for example the credit card fraud dataset. Traditional machine learning models do not perform well on such large imbalanced datasets: they are hard to train and therefore fall short of what industry needs. Fortunately, several ensemble learning models can do better than traditional models, such as CatBoost, XGBoost, and Gradient Boosting. In this thesis, we focus on these three models and compare their results. We perform the comparison on two different datasets and focus especially on comparing CatBoost with the other gradient boosting models.