Improving the Feature Selection and Data Mining Process Using a Hybrid Recursive Statistical Method

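The main purpose of this study is to improve the feature selection step that precedes data mining. Previous data mining studies have usually performed feature selection directly inside the chosen mining method, but the selection rule may fail to remove all irrelevant features and factors from the data. Redundant or irrelevant factors that remain can slow down computation and, worse, can bias the final analysis and lead to incorrect predictions or results. We therefore separate the factor analysis step from the mining algorithm and examine whether it can be improved, so that the dimensionality of the data is greatly reduced before data mining without compromising the independence and representativeness of the data.

For feature analysis, this study uses principal component analysis, a multivariate statistical method, to reduce the features and dimensions of each database. The key aim of this analysis is for the reduced data to be representative, independent, and parsimonious. The data reduced by principal component analysis are then mined with logistic regression analysis, and the results are compared with those of other algorithms to show whether the overall analysis process is representative.

This study uses representative databases from the UCI databases as examples. Principal component analysis is combined with hierarchical logistic regression, the two-step cluster method, and dummy data for data mining, and the resulting classification of data features is then assessed for accuracy to confirm whether the process brings a significant improvement. We expect this process to yield classification rules that raise classification accuracy and to provide a research method for feature analysis and data mining based on statistical methods.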
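The core of the process described above, reducing the data with principal component analysis and then classifying the reduced data with logistic regression, can be illustrated with a minimal sketch. It is not the study's implementation. It assumes synthetic stand-in data instead of the UCI databases, uses plain rather than hierarchical logistic regression, and omits the two-step cluster method and the dummy data. All function names (`make_data`, `fit_pca`, `transform`, `fit_logistic`, `predict`) and all parameter values are hypothetical choices made for illustration.

```python
import math
import random


def make_data(n, rng):
    """Synthetic stand-in data (assumption): two correlated informative
    features plus two pure-noise features, with alternating class labels."""
    X, y = [], []
    for i in range(n):
        label = i % 2
        s = 2.0 if label else -2.0
        X.append([s + rng.gauss(0, 0.5), s + rng.gauss(0, 0.5),
                  rng.gauss(0, 0.5), rng.gauss(0, 0.5)])
        y.append(label)
    return X, y


def fit_pca(X, k, rng, iters=200):
    """Return (column means, top-k principal directions) for the rows of X."""
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    Xc = [[row[j] - means[j] for j in range(d)] for row in X]
    # Sample covariance matrix of the centered data.
    C = [[sum(r[i] * r[j] for r in Xc) / (n - 1) for j in range(d)]
         for i in range(d)]
    comps = []
    for _ in range(k):
        # Power iteration approximates the dominant eigenvector of C.
        v = [rng.gauss(0, 1) for _ in range(d)]
        for _ in range(iters):
            w = [sum(C[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        lam = sum(v[i] * C[i][j] * v[j] for i in range(d) for j in range(d))
        comps.append(v)
        # Deflation removes the direction just found before the next pass.
        C = [[C[i][j] - lam * v[i] * v[j] for j in range(d)] for i in range(d)]
    return means, comps


def transform(X, means, comps):
    """Center the rows of X with the training means and project them
    onto the principal directions."""
    d = len(means)
    return [[sum((row[j] - means[j]) * c[j] for j in range(d)) for c in comps]
            for row in X]


def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(Z, y, lr=0.1, epochs=300):
    """Plain logistic regression fitted by batch gradient descent."""
    n, k = len(Z), len(Z[0])
    w, b = [0.0] * k, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * k, 0.0
        for z, t in zip(Z, y):
            err = sigmoid(sum(wi * zi for wi, zi in zip(w, z)) + b) - t
            for j in range(k):
                gw[j] += err * z[j]
            gb += err
        w = [wi - lr * g / n for wi, g in zip(w, gw)]
        b -= lr * gb / n
    return w, b


def predict(z, w, b):
    return 1 if sum(wi * zi for wi, zi in zip(w, z)) + b > 0 else 0


rng = random.Random(0)
X_train, y_train = make_data(200, rng)
X_test, y_test = make_data(100, rng)

# Fit PCA on the training data only, then reduce both sets from 4 to 2 dimensions.
means, comps = fit_pca(X_train, k=2, rng=rng)
Z_train = transform(X_train, means, comps)
Z_test = transform(X_test, means, comps)

# Classify the reduced data and measure accuracy on the held-out set.
w, b = fit_logistic(Z_train, y_train)
accuracy = sum(predict(z, w, b) == t for z, t in zip(Z_test, y_test)) / len(y_test)
print(f"{len(X_train[0])} features -> {len(comps)} components, "
      f"test accuracy = {accuracy:.2f}")
```

Fitting the principal components on the training rows only and reusing the same means and directions for the held-out rows keeps the accuracy check from being influenced by the test data, which matters when the purpose is to judge whether the reduction step preserves the information needed for classification.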

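Keywords

feature analysis; data classification; principal component analysis; two-step cluster method; hierarchical logistic regression; dummy data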

Parallel Abstract

This research focuses on improving the feature selection process that precedes the use of data mining techniques to analyze a database. Past studies have typically relied on the data mining algorithm itself to perform feature selection, but such algorithms sometimes fail to remove all unnecessary features or attributes. The remaining unnecessary features or attributes may slow the algorithm down and distort the data mining result, leading to incorrect predictions or decision rules. To address this problem, we propose a new feature selection process based on a statistical method. The goal of our method is to remove the unnecessary features or attributes before data mining while keeping the independence and representativeness of the original data.

For feature selection and classification, this research uses principal component analysis to reduce the features and attributes in benchmark databases. The most important target of this analysis is for the reduced data to retain independence, representativeness, and simplicity. After reducing the data with principal component analysis, we use the two-step cluster method, hierarchical logistic regression, and dummy data to carry out and improve data mining, with the expectation of increasing the accuracy of the result and reducing the experiment time.

This study uses the UCI databases as experimental examples and benchmark problems. By combining statistical methods, we set up a new process for data mining and data classification. We hope that this study can offer new ideas for combining data mining and feature selection with statistical methods.

Parallel Keywords

feature selection; classification; principal component analysis; two-step cluster method; hierarchical logistic regression; dummy data
