Model Selection in Data Mining

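With the aid of computers, enterprises and organizations hold massive databases, yet the information those databases contain is not necessarily sufficient. We argue that database value-adding methods can increase the information in a database without altering the original data structure, thereby expanding the available information. This study concludes that, when a regression model is the main workflow, regression-based imputation brings the value-added database closer to the original data, and reducing the data volume by systematic sampling yields better results than simple random sampling. When the C5.0 decision tree is the main workflow, using a neural network algorithm as the main imputation method brings the imputed data closer to the original data. The empirical analysis shows that, across different modeling approaches, database value-adding techniques do increase the amount of information and bring the value-added virtual database closer to the original data structure.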

Keywords

Data mining; model selection; sampling; probability of default; C5.0 decision tree

Parallel Abstract

This research focuses on the integrity of the database. We adopt database value-adding methods, including imputation, sampling, and model evaluation, to enlarge the information a database contains while leaving its data structure unmodified. The purpose of this paper is to compare the structures of value-added databases. From the results of applying different value-adding methods to model building, we derive the following conclusions. First, the database structure is closer to the original data when regression analysis is the main imputation method. Second, systematic sampling performs better than simple random sampling when further sampling is used to reduce the amount of data. Third, in the C5.0 decision tree workflow with a neural network algorithm as the main imputation method, the enlarged data is also closer to the original data. The experimental results show that the best value-adding method is not consistent across models. Compared with a non-imputed database, a value-added database carries more information and is closer to the original data structure.
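
To make the two data-reduction schemes compared above concrete, the following is a minimal sketch of simple random sampling and systematic sampling. It is an illustration only, not the procedure used in this study: the function names, the fixed seed, and the choice of the interval as the integer quotient of the population size and the sample size are all assumptions made for this example.

```python
import random


def simple_random_sample(records, n, seed=0):
    """Draw n records without replacement; every subset of size n is equally likely."""
    rng = random.Random(seed)  # seed is an assumption, fixed only for reproducibility
    return rng.sample(records, n)


def systematic_sample(records, n, seed=0):
    """Pick a random start within the first interval, then take every k-th record."""
    if not 0 < n <= len(records):
        raise ValueError("n must be between 1 and len(records)")
    k = len(records) // n          # sampling interval (assumed: integer quotient)
    rng = random.Random(seed)
    start = rng.randrange(k)       # random offset in [0, k)
    return [records[start + i * k] for i in range(n)]


if __name__ == "__main__":
    population = list(range(100))  # hypothetical record identifiers
    print(simple_random_sample(population, 10))
    print(systematic_sample(population, 10))
```

Systematic sampling spreads the selected records evenly over the stored order of the database, whereas simple random sampling can leave stretches of that order unrepresented. Whether this difference explains the result reported above is not established by the sketch.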