Application of Machine Learning and Ensemble Learning Methods to Housing Price Prediction

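In recent years artificial intelligence has surged in popularity, and industries of all kinds have begun using machine learning to solve their own problems; it has contributed in fields including real estate, semiconductor manufacturing, finance, commerce, and marketing. In real estate, housing prices are an issue that both the government and the public take very seriously. This study therefore took a housing dataset documented in the literature, preprocessed the data, and then used eight machine learning models to predict the housing prices recorded in the dataset. Specifically, the study first built five housing price prediction models, based on three common machine learning regression methods (Ridge Regression, Lasso Regression, and Elastic Net) and two gradient boosting frameworks (LightGBM and XGBoost). It then combined these five models using the Voting and Stacking methods of ensemble learning, forming a Voting ensemble prediction model (the sixth model) and a Stacking ensemble prediction model (the seventh model). Finally, it used the Blending method of ensemble learning to combine the sixth and seventh models into the final Blending ensemble prediction model (the eighth model). Comparing results on the test set, the study found that among the five non-ensemble models, Lasso Regression had the best predictive performance; of the two gradient boosting frameworks, XGBoost performed better than LightGBM, but gradient boosting frameworks were not well suited to housing price prediction on a small dataset. Among the three ensemble models, the Voting ensemble did outperform all five non-ensemble models and showed no obvious overfitting, whereas the Stacking ensemble outperformed only LightGBM and Ridge Regression, showing that ensemble learning is not a panacea for improving predictive performance. Of the eight models, the Blending ensemble performed best: it combines the Voting and Stacking ensembles using the optimal blending weight, so its predictive performance is, as expected, the highest.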

Keywords

Machine learning; ensemble learning; data preprocessing; housing price prediction

Parallel Abstract

In recent years, the rise of artificial intelligence has prompted many industries to adopt machine learning to address their own challenges, with notable contributions in real estate, semiconductor manufacturing, finance, commerce, and marketing. In the real estate industry, housing prices are a topic of great concern for both the government and the public. This study therefore took a housing dataset documented in the literature, preprocessed the data, and applied eight machine learning models to predict the housing prices recorded in the dataset. Specifically, the study first developed five prediction models based on three commonly used machine learning regression methods (Ridge Regression, Lasso Regression, and Elastic Net) and two gradient boosting frameworks (LightGBM and XGBoost). These five models were then combined using two ensemble learning techniques, Voting and Stacking, yielding a Voting ensemble prediction model (the study's sixth model) and a Stacking ensemble prediction model (the study's seventh model). Finally, the Blending method from ensemble learning was used to combine the sixth and seventh models into the final Blending ensemble prediction model (the study's eighth model). A comparison of results on the test set showed that, among the five non-ensemble models, the Lasso Regression model had the best predictive performance. Although XGBoost outperformed LightGBM, gradient boosting frameworks were not well suited to housing price prediction on a small dataset. Among the three ensemble models, the Voting ensemble model outperformed all five non-ensemble models and showed no significant overfitting. The Stacking ensemble model, however, outperformed only LightGBM and Ridge Regression, indicating that ensemble learning is not a panacea for improving predictive performance. The Blending ensemble model was the best performer among the eight models, because it combines the Voting and Stacking models using the optimal blending weight.
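The combination steps described in the abstract can be illustrated with a minimal sketch. It assumes that Voting is an unweighted average of the base models' predictions, that Blending is a two-way weighted average of the Voting and Stacking predictions, and that the "optimal" weight is the one minimising RMSE on a hold-out set; the abstract does not state which metric or search procedure the study used, so these are assumptions. The function names and the toy numbers are hypothetical and are not taken from the study.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def voting_average(predictions):
    """Unweighted regression voting: per-sample mean over the base models."""
    return [sum(column) / len(column) for column in zip(*predictions)]


def best_blend_weight(y_true, pred_a, pred_b):
    """Weight w in [0, 1] that minimises the RMSE of w*pred_a + (1 - w)*pred_b.

    The residual is (t - b) - w*(a - b), so the squared error is a convex
    quadratic in w; its unconstrained minimiser is clipped to [0, 1].
    """
    num = sum((t - b) * (a - b) for t, a, b in zip(y_true, pred_a, pred_b))
    den = sum((a - b) ** 2 for a, b in zip(pred_a, pred_b))
    if den == 0:  # identical predictions: every weight gives the same blend
        return 0.5
    return min(1.0, max(0.0, num / den))


def blend(pred_a, pred_b, w):
    """Weighted average of two prediction lists, with weight w on pred_a."""
    return [w * a + (1 - w) * b for a, b in zip(pred_a, pred_b)]


# Hypothetical hold-out prices and predictions (not data from the study).
y_holdout = [200.0, 150.0, 320.0, 275.0]
voting_pred = [210.0, 145.0, 310.0, 280.0]
stacking_pred = [190.0, 160.0, 335.0, 265.0]

w = best_blend_weight(y_holdout, voting_pred, stacking_pred)
blended = blend(voting_pred, stacking_pred, w)
print(f"weight on Voting: {w:.3f}")
print(f"RMSE voting={rmse(y_holdout, voting_pred):.2f} "
      f"stacking={rmse(y_holdout, stacking_pred):.2f} "
      f"blend={rmse(y_holdout, blended):.2f}")
```

Because the weight is restricted to [0, 1] and both endpoints are admissible, the blend can never score worse than either input on the data used to choose the weight. That guarantee does not carry over to unseen data, so in practice the weight would normally be chosen on a validation split rather than on the test set.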

Parallel Keywords

Machine Learning; Ensemble Learning; Data Preprocessing; Housing Price Prediction
