
Detailed Record

Author (Chinese): 曾韋傑
Author (English): Tseng, Wei-Jie
Thesis title (Chinese): 利用混合式遞迴統計法則改善特徵選取與資料探勘之流程
Thesis title (English): Hybrid Recursive Statistical Methods to Improve Feature Selection and the Data Mining Process
Advisor (Chinese): 葉維彰
Advisor (English): Yeh, Wei-Chang
Committee (Chinese): 葉維彰, 溫于平, 林妙聰
Committee (English): Yeh, Wei-Chang; Wen, Yu-Ping; Lin, Miau-Tsung
Degree: Master's
University: National Tsing Hua University (國立清華大學)
Department: Department of Industrial Engineering and Engineering Management
Student ID: 9834521
Year of publication (ROC era): 100 (2011)
Academic year of graduation: 99
Language: English
Number of pages: 48
Keywords (Chinese): 特徵分析、資料分類、主成份分析、兩階段分群法、階層式羅吉斯回歸分析法、虛擬資料
Keywords (English): feature selection, classification, principal component analysis, two-step cluster method, hierarchical logistic regression, dummy data
Record statistics:
  • Recommendations: 0
  • Views: 145
  • Rating: *****
  • Downloads: 9
  • Bookmarks: 0
The main purpose of this research is to improve the feature selection process carried out before data mining. Past data mining studies usually performed feature selection directly inside the chosen algorithm, but the selection rule may fail to remove the irrelevant features and factors from the data completely. If unnecessary features are not removed, these redundant or irrelevant factors can slow down the computation and, worse, bias the final analysis, producing incorrect predictions or results. We therefore separate the feature analysis step from the algorithm and examine whether a better approach exists, one that greatly reduces the dimensionality of the data before data mining without harming the data's independence and representativeness.
  
In the feature analysis step, this research applies principal component analysis, a multivariate statistical method, to reduce the number of features and dimensions of the database. The most important purpose of this analysis is for the reduced data to meet the goals stated above: representativeness, independence, and parsimony. The data reduced by principal component analysis are then mined with logistic regression and compared against other algorithms, which shows precisely whether the overall analysis process is representative.
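To make the reduction step concrete, here is a minimal pure-Python sketch of principal component analysis on a hypothetical two-feature dataset (the data values are illustrative, not from the thesis). It projects the centered points onto the first principal component and reports the fraction of variance retained:

```python
# Illustrative PCA on a 2-feature dataset; real studies would use a
# statistics package, this only shows the mechanics.
import math

data = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0),
        (2.3, 2.7), (2.0, 1.6), (1.0, 1.1), (1.5, 1.6), (1.1, 0.9)]

n = len(data)
mean_x = sum(x for x, _ in data) / n
mean_y = sum(y for _, y in data) / n
centered = [(x - mean_x, y - mean_y) for x, y in data]

# Sample covariance matrix [[sxx, sxy], [sxy, syy]]
sxx = sum(x * x for x, _ in centered) / (n - 1)
syy = sum(y * y for _, y in centered) / (n - 1)
sxy = sum(x * y for x, y in centered) / (n - 1)

# Largest eigenvalue of the symmetric 2x2 covariance matrix
tr, det = sxx + syy, sxx * syy - sxy * sxy
lam1 = tr / 2 + math.sqrt(tr * tr / 4 - det)

# Its eigenvector (the first principal component), normalized
vx, vy = sxy, lam1 - sxx
norm = math.hypot(vx, vy)
pc1 = (vx / norm, vy / norm)

# Project each centered point onto PC1: the 1-D reduced representation
scores = [x * pc1[0] + y * pc1[1] for x, y in centered]
explained = lam1 / tr  # fraction of total variance kept
```

Because the two features are highly correlated here, a single component keeps most of the variance, which is exactly the "parsimony without losing representativeness" that the thesis pursues.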
  
This research uses representative databases from the UCI repository as examples. Data mining is performed with principal component analysis combined with hierarchical logistic regression, the two-step cluster method, and dummy data. Finally, an accuracy analysis of the resulting feature classification confirms whether the process yields a significant improvement. We hope this process identifies classification rules that raise classification accuracy and provides a statistical methodology for feature analysis and data mining.
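The two-step idea, coarsely pre-clustering the data and then merging the pre-clusters hierarchically, can be sketched as below. This is a deliberately simplified stand-in for the SPSS TwoStep procedure (which builds a CF-tree rather than fixed-width bins); the function name, bin width, and sample points are invented for illustration:

```python
# Hypothetical two-step clustering sketch for 1-D data:
# step 1 pre-clusters points into fixed-width bins,
# step 2 agglomeratively merges the closest bin centroids until k remain.
def two_step_cluster(points, bin_width=1.0, k=2):
    # Step 1: pre-cluster into bins keyed by floor(x / bin_width)
    bins = {}
    for x in points:
        bins.setdefault(int(x // bin_width), []).append(x)
    clusters = [list(v) for v in bins.values()]

    def centroid(c):
        return sum(c) / len(c)

    # Step 2: hierarchical agglomeration on the pre-clusters
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = abs(centroid(clusters[i]) - centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters.pop(j)
    return clusters

groups = two_step_cluster([0.2, 0.4, 0.5, 5.1, 5.3, 9.9], k=2)
```

The pre-clustering pass is what makes the method scale: the expensive pairwise hierarchical stage only sees a small number of summaries instead of every record.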
  
Keywords: feature analysis, data classification, principal component analysis, two-step cluster method, hierarchical logistic regression, dummy data
This research focuses on improving the feature selection step that precedes data mining. Past studies usually embedded feature selection inside the chosen data mining algorithm, but such algorithms may fail to remove irrelevant features and attributes completely. These redundant or irrelevant features can slow the algorithm down and bias the mining result, leading to incorrect predictions or decision rules. To address this problem, we propose a new feature selection process based on statistical methods. Its goal is to remove unnecessary features and attributes completely before data mining begins, while preserving the independence and representativeness of the original data.
For feature selection and classification, this research applies principal component analysis to reduce the number of features and attributes in benchmark databases. The key objective is for the reduced data to retain independence, representativeness, and simplicity. After reducing the data with principal component analysis, we apply the two-step cluster method, hierarchical logistic regression, and dummy data in the data mining step, aiming to increase the accuracy of the results and to reduce the experiment time.
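As an illustration of the dummy-data step, a categorical attribute can be expanded into 0/1 indicator columns before fitting a logistic regression. The helper below is a hypothetical sketch (function and category names are invented); it drops one reference level to avoid the perfect collinearity known as the dummy-variable trap:

```python
# Hypothetical dummy-data (indicator) encoding for one categorical column.
def dummy_encode(values, drop_first=True):
    """Map a categorical column to 0/1 indicator columns.

    drop_first=True omits one reference level, avoiding the perfect
    collinearity ("dummy-variable trap") that breaks regression fits.
    """
    levels = sorted(set(values))
    keep = levels[1:] if drop_first else levels
    header = [f"is_{lvl}" for lvl in keep]
    rows = [[1 if v == lvl else 0 for lvl in keep] for v in values]
    return header, rows

header, rows = dummy_encode(["red", "green", "blue", "green"])
# "blue" becomes the dropped reference level
```

Each retained level then enters the logistic regression as its own binary predictor, with the dropped level serving as the baseline.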
This study uses UCI databases as the experimental examples and benchmark problems. By combining these statistical methods, we establish a new process for data mining and data classification. We hope this study offers new ideas for combining data mining and feature selection with statistical methods.
Key words: feature selection, classification, principal component analysis, two-step cluster method, hierarchical logistic regression, dummy data
Table of contents
中文摘要 i
Abstract ii
Table of Contents iii
List of Figures iv
List of Tables v
Chapter 1 Introduction 1
1.1 Background and motivation 1
1.2 Contribution 4
1.3 Overview of This Thesis 4
Chapter 2 Literature Review 5
2.1 Data Mining 5
2.2 Feature Selection 7
Chapter 3 Methodology 13
3.1 Principal Component Analysis 13
3.2 Two-Step Clustering Method 16
3.3 Hierarchical Logistic Regression 18
3.4 Dummy Data 22
Chapter 4 Experiment Results 26
4.1 Experiment Data 26
4.2 Experiment Result 27
4.2.1 Binary Class Data 27
4.2.2 Multi-Class Data 31
Chapter 5 Conclusion and Future Research 44
References 46