Baseball was the first sport in the United States to turn professional, and after decades of change and development, Major League Baseball (MLB) is now regarded as the highest level of professional baseball in the world. In this highly competitive environment, player salaries are a factor that cannot be ignored and a concern shared by labor, management, and fans. For players, on-field performance is the most important basis for determining salary. However, the basic statistics traditionally used to evaluate performance explain only part of a player's ability and cannot objectively reflect a player's value.

After MLB's Statcast tracking system went into use in 2015, a new generation of data changed the traditional standards for measuring players. It gave rise to advanced statistics that exclude factors outside a player's control and so quantify each ability more fairly. However, most prior literature on player salaries relies on traditional econometric models and examines salary determinants using only basic statistics. Few studies incorporate advanced statistics, and fewer still apply machine learning to predict salaries.

To fill this research gap, this study uses player age, years of service, and both basic and advanced statistics from the eight seasons between 2015-2016 and 2022-2023. Features were selected with an embedded method: less important features were successively removed according to feature importance, and the optimal feature subset was chosen through experiments. Models that predict a player's salary for the following season were then built with three supervised machine learning algorithms: Random Forest, XGBoost, and CatBoost. The study aims to identify which batter and pitcher features have greater influence on salary prediction, and how to use those influential features to build a salary prediction model with better dynamic performance.

The experimental results show that, of the 22 batter features and 20 pitcher features used in this study, 13 and 9 respectively appear in the optimal feature subsets of all three algorithms, indicating that they improve the batter and pitcher salary prediction models. In addition, some features that showed only low Pearson correlation had markedly higher feature importance during modeling. This highlights an advantage of machine learning for salary prediction: it can uncover hidden patterns and nonlinear relationships in the data and so provide more accurate predictions. In terms of model performance, all three algorithms reached a reasonable level of predictive accuracy for both batters and pitchers, with no significant difference in overall performance, which indicates that machine learning predicts player salaries consistently and effectively. The authors hope the models can serve as a reference for future salary negotiations between teams and players.
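The feature-selection procedure described in the abstract (rank features by model importance, remove the least important, and keep the subset that scores best) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the data are synthetic, the feature names are hypothetical stand-ins for player statistics, cross-validated RMSE is used as the selection metric, and only scikit-learn's `RandomForestRegressor` is shown. The study's actual data, metric, and hyperparameters are not given in the abstract.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score


def backward_eliminate(X, y, names, make_model, cv=3):
    """Drop the least important feature at each step and return the
    subset with the lowest cross-validated RMSE, plus the full history."""
    remaining = list(range(X.shape[1]))  # column indices still in play
    history = []                         # (feature names, RMSE) per step
    while remaining:
        X_sub = X[:, remaining]
        scores = cross_val_score(
            make_model(), X_sub, y, cv=cv,
            scoring="neg_root_mean_squared_error",
        )
        history.append(([names[i] for i in remaining], -scores.mean()))
        if len(remaining) == 1:
            break
        # Embedded importance: taken from the fitted model itself.
        model = make_model().fit(X_sub, y)
        weakest = int(np.argmin(model.feature_importances_))
        remaining.pop(weakest)  # position in X_sub matches position in remaining
    best_names, best_rmse = min(history, key=lambda step: step[1])
    return best_names, best_rmse, history


# Synthetic stand-in data (assumption): two informative features, one of
# them nonlinear, and four pure-noise features.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = 3 * X[:, 0] + 2 * X[:, 1] ** 2 + rng.normal(scale=0.5, size=300)
names = ["signal_linear", "signal_squared",
         "noise_a", "noise_b", "noise_c", "noise_d"]


def make_model():
    # Hypothetical settings, not the study's hyperparameters.
    return RandomForestRegressor(n_estimators=100, random_state=0)


best_features, best_rmse, history = backward_eliminate(X, y, names, make_model)
print("best subset:", best_features, "RMSE:", round(best_rmse, 3))
```

In this toy setup `signal_squared` has a Pearson correlation with the target close to zero, yet a tree-based model can still rank it as important, which is the kind of nonlinear effect the abstract refers to. Other regressors that expose `feature_importances_` could be passed through `make_model` in the same way.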