A Study of Data Mining for Classification Problems: Insights from Genetic Programming

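With the rapid development of storage technology, databases, and data warehouses, enterprises widely use these technologies to extract useful information. To extract the useful knowledge hidden in databases and data warehouses effectively, data mining plays an important role in the overall process of knowledge discovery in databases (KDD). Data mining is the core function of the knowledge discovery process and can be regarded as an iterative and interactive process of extracting valid, important, and interesting information and knowledge from a large amount of data. Its tasks can be divided mainly into classification analysis, regression analysis, deviation detection, clustering, association rule analysis, and sequence analysis. The main purpose of this dissertation is to examine and improve the classification techniques used in data mining. In view of the shortcomings of conventional classification models, this dissertation proposes three revised models. These models use genetic programming (GP) to combine the advantages of function-based and induction-based classification techniques. The first model applies GP alone to classification problems. The second is IF-THEN rule genetic programming (IF-THEN GP), which follows the "divide and conquer" principle. Furthermore, to combine the advantages of function-based and induction-based methods effectively, this dissertation proposes two-stage genetic programming (2SGP) for classification problems. Two benchmark credit-scoring data sets are used to test the accuracy of the three proposed models and to compare it with that of conventional classification methods. The empirical results show that all three proposed models clearly outperform the conventional classification models. This dissertation may therefore serve as a basis and reference for related academic research, and may help practitioners carry out more accurate data mining, with the aim of improving the quality of research and decision-making in both academia and practice.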

Keywords

Knowledge discovery; data mining; classification models; genetic programming; credit scoring

Parallel Abstract

With the rapid development of storage technology, databases and data warehouses are widely employed by enterprises to extract useful information for supply chain management (SCM), enterprise resource planning (ERP), and customer relationship management (CRM). To extract the useful knowledge hidden in databases and data warehouses effectively, data mining plays a key role in the process of knowledge discovery in databases (KDD). Data mining can be considered the core of KDD: an iterative and interactive process that extracts valid, nontrivial, and interesting information and knowledge from large amounts of data. The tasks of data mining can be divided into classification, regression, deviation detection, clustering, association rule analysis, and sequential pattern analysis. This dissertation focuses on the problem of data classification. Three models are developed to address the shortcomings of conventional classification models. They use genetic programming (GP) to incorporate the advantages of the discriminant-based and the induction-based methods. The first model employs GP directly to build a classification model. We employ GP because it can automatically and heuristically determine adequate discriminant functions and the valid attributes simultaneously. In addition, unlike artificial neural networks (ANNs), which are only suited for large data sets, GP can perform well even on small data sets. The second model, called IF-THEN ruled genetic programming (IF-THEN GP), is based on the principle of "divide and conquer." A cut threshold is set, and the data that remain indiscernible under it are retrained with GP to form a second discriminant function; further discriminant functions are obtained in the same way.
To combine the advantages of the discriminant-based and the induction-based methods, the third model we propose is two-stage genetic programming (2SGP). 2SGP integrates the function-based and the induction-based methods into a hybrid model. First, IF-THEN rules are derived using GP. Next, the reduced data are fed into GP again to form a discriminant function that provides forecasting capability. We used two credit-scoring data sets to test the effectiveness of the proposed models and to compare them with conventional methods, including the multi-layer perceptron (MLP), classification and regression tree (CART), C4.5, rough sets, and logistic regression (LR). On the basis of the numerical results, we conclude that the proposed methods outperform the other models and should be more suitable for real-life classification problems.
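As a rough illustration of the divide-and-conquer idea behind IF-THEN GP, the Python sketch below chains discriminant functions: a sample whose score lies within the cut threshold of zero is treated as indiscernible and passed to the next function. The name cascade_classify, the functions f1 and f2, the thresholds, the feature layout, and the sign-based labelling are all hypothetical stand-ins. In the dissertation the discriminant functions are evolved by GP, a step this sketch does not implement.

```python
from typing import Callable, List, Sequence, Tuple

# A discriminant function maps a feature vector to a real-valued score.
Discriminant = Callable[[Sequence[float]], float]


def cascade_classify(
    x: Sequence[float],
    stages: List[Tuple[Discriminant, float]],
) -> Tuple[int, int]:
    """Return (label, stage_index) for sample x.

    Each stage is a pair (f, cut). If |f(x)| > cut, the sample is treated
    as discernible and labelled by the sign of f(x). Otherwise it is
    passed on to the next stage. The last stage always decides by sign.
    (Assumption: a symmetric cut around zero; the source does not say
    how the cut is defined.)
    """
    for i, (f, cut) in enumerate(stages):
        score = f(x)
        is_last = i == len(stages) - 1
        if abs(score) > cut or is_last:
            return (1 if score > 0 else 0), i
    raise ValueError("stages must not be empty")


# Hand-written stand-ins for GP-evolved discriminant functions (hypothetical).
def f1(x: Sequence[float]) -> float:
    return x[0] - x[1]


def f2(x: Sequence[float]) -> float:
    return 0.5 * x[0] - x[1] + 0.2 * x[2]


# Stage 1 only decides when its score is clearly away from zero;
# stage 2 handles whatever stage 1 left undecided.
stages = [(f1, 1.0), (f2, 0.0)]

# Three made-up applicants with three numeric attributes each.
applicants = [(5.0, 1.0, 0.0), (1.0, 4.0, 0.0), (2.0, 1.5, 3.0)]
results = [cascade_classify(a, stages) for a in applicants]
print(results)
```

The first two applicants are decided by the first function, while the third falls inside the cut and is decided by the second. Returning the stage index alongside the label makes it easy to see how much of the data each function actually handles.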

Parallel Keywords

Classification models; genetic programming; artificial neural networks (ANNs); decision tree; rough sets; logistic regression
