  • Thesis


Grid computing based meta-evolutionary mining approach as classification response model

Advisor: 陳大正


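Data mining is a widely used set of methods and tools for discovering new knowledge in databases. In recent years it has been applied across many domains and has solved problems that are difficult for human experts, which has made it an important and ongoing research topic. This study therefore proposes a grid-computing-based evolutionary algorithm as a data mining classification response model. Built on a grid computing infrastructure, it addresses two kinds of data. For data sets with large and complex attribute dimensions, where the emphasis is on identifying the key attributes, it builds an optimal attribute-set model. For data whose rules are themselves meaningful and valuable, it builds an optimized rule model expressed as If-Then rules.
Traditionally, statistical models and related techniques such as logistic regression and multiple regression have been widely used, but real-world problems are often highly nonlinear, which makes it difficult to develop a statistical model that accounts for all the independent variables. In recent years, nonlinear and complex machine learning methods such as artificial neural networks and support vector machines have replaced the traditional statistical methods. However, when these methods are applied to data with high attribute dimensionality, the many unimportant attribute variables often interfere with the mining and lower the classification accuracy. For data whose rules are meaningful and valuable, these methods can achieve high classification accuracy but cannot express what they have learned as explicit rules, so the goal of knowledge discovery is not met.
Using grid computing as the underlying infrastructure, this study builds a systematic approach suited to mining large volumes of data that effectively reduces the time cost of the analysis. For data sets with large and complex attribute dimensions, where the emphasis is on identifying the key attributes, it proposes a discriminant analysis method that combines a genetic algorithm with a median-based vector distance calculation to find the optimal attribute set. For data whose rules are meaningful and valuable, it proposes a rule mining model that combines a genetic algorithm with binary particle swarm optimization and builds a rule base of optimized rules expressed in If-Then form. The experimental results confirm that the proposed system architecture substantially reduces the time the data mining modules need for analysis, and that the proposed methods perform as well as or better than results reported in the literature and obtained with commercial software. In particular, data mining techniques make it possible to build an effective predictive decision model or a classification model similar to an expert system.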


Data mining refers to the methodologies and tools for efficiently discovering new knowledge in databases. This study proposes a hybrid meta-evolutionary data mining approach as a classification response model, built on a grid computing infrastructure, for establishing the best attribute set. Many early studies handled such problems with conventional statistical methods and related techniques, including logistic regression and multiple regression. Because real-world problems are highly nonlinear in nature, however, it is hard to develop a comprehensive model that takes all the independent variables into account using these statistical approaches. Recently, nonlinear and complex machine learning approaches such as neural networks and support vector machines have been shown to be more reliable than the conventional statistical approaches. Although the usefulness of these methods has been reported in the literature, the main obstacle lies in building and using the model, because its classification rules are hard to express explicitly. To enhance mining efficiency, the proposed mining approach is built on the grid computing infrastructure. A discriminant analysis based on the vector distance of median method serves as the evaluation function of a genetic algorithm (GA), which focuses on finding the key attributes of the data set in order to establish the best attribute set for constructing a classification response model with the highest accuracy. Furthermore, to generate classification rules, an additional approach that hybridizes the GA with binary particle swarm optimization is applied on the grid computing infrastructure to extract an If-Then rule set model.
We show experimentally that the proposed mining approach based on the grid computing infrastructure works effectively and efficiently, and that the results of the proposed methods are as good as or better than those reported in the literature and obtained with commercial software. In particular, the proposed approach can be developed into a computer model for prediction or classification problems, similar to an expert system.
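
The attribute-selection step described in the abstract can be illustrated with a rough sketch. This is not the thesis's implementation. The function names (`class_medians`, `accuracy`, `fitness`, `ga_select`) are hypothetical, and the nearest-class-median rule with squared Euclidean distance is only one assumed reading of "discriminant analysis based on vector distance of median". The elitist selection, one-point crossover, bit-flip mutation, parameter values, and the tie-break that prefers fewer attributes are likewise assumptions. Grid distribution is omitted; since each fitness evaluation is independent, those calls are the natural unit to spread across grid nodes.

```python
import random
from statistics import median


def class_medians(rows, labels, mask):
    """Per-class median vector over the attributes switched on in mask."""
    cols = [j for j, bit in enumerate(mask) if bit]
    meds = {}
    for c in set(labels):
        members = [r for r, y in zip(rows, labels) if y == c]
        meds[c] = [median(r[j] for r in members) for j in cols]
    return cols, meds


def accuracy(rows, labels, mask):
    """Fraction of rows whose nearest class median is their own class."""
    if not any(mask):
        return 0.0
    cols, meds = class_medians(rows, labels, mask)
    hits = 0
    for r, y in zip(rows, labels):
        pred = min(
            meds,
            key=lambda c: sum((r[j] - m) ** 2 for j, m in zip(cols, meds[c])),
        )
        hits += pred == y
    return hits / len(rows)


def fitness(rows, labels, mask):
    """Assumed objective: accuracy first, then fewer attributes as a tie-break."""
    return (accuracy(rows, labels, mask), -sum(mask))


def ga_select(rows, labels, pop_size=20, generations=30, p_mut=0.1, seed=0):
    """Evolve a 0/1 attribute mask; needs at least two attributes."""
    rng = random.Random(seed)
    n = len(rows[0])
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    pop[0] = [1] * n  # assumption: seed with the full attribute set as a baseline

    def score(mask):
        return fitness(rows, labels, mask)

    for _ in range(generations):
        pop.sort(key=score, reverse=True)
        elite = pop[: pop_size // 2]  # the best half survives unchanged
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            cut = rng.randrange(1, n)  # one-point crossover
            child = a[:cut] + b[cut:]
            # bit-flip mutation with probability p_mut per attribute
            children.append([bit ^ (rng.random() < p_mut) for bit in child])
        pop = elite + children
    return max(pop, key=score)


# Toy data: attribute 0 separates the classes, attributes 1 and 2 are noise.
rows = [[float(i), (i * 7 % 3) / 10, (i * 5 % 4) / 10] for i in range(5)]
rows += [[10.0 + i, (i * 7 % 3) / 10, (i * 5 % 4) / 10] for i in range(5)]
labels = [0] * 5 + [1] * 5
best = ga_select(rows, labels)
print(best, accuracy(rows, labels, best))
```

Because the best half of each generation is carried over unchanged and the initial population contains the full attribute set, the returned mask never scores worse than using every attribute under this assumed objective.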




