
Detailed Record

Author (Chinese): 張智鈞
Author (English): Chang, Chih-Chun
Title (Chinese): 基於相關性的類別特徵選擇方法之評估
Title (English): An Evaluation of Correlation-Based Categorical Feature Selection Methods
Advisor (Chinese): 周珮婷
Advisor (English): Chou, Pei-Ting
Committee members (Chinese): 周珮婷
梁穎誼
張育瑋
Committee members (English): Chou, Pei-Ting
Leong, Yin-Yee
Chang, Yu-Wei
Degree: Master's
Institution: National Chengchi University
Department: Department of Statistics
Year of publication: 2022
Academic year of graduation: 110
Language: Chinese
Number of pages: 54
Keywords (Chinese): 變數篩選; 維度縮減; 變數相關性; 過濾法; 類別型資料
Keywords (English): Feature selection; Dimension reduction; Variable association; Filter method; Entropy; Categorical datasets
DOI URL: http://doi.org/10.6814/NCCU202200500
Abstract (Chinese):
With the rapid development of machine learning, the importance of feature selection is self-evident: selecting variables appropriately can improve the predictive performance of statistical models, reduce computational cost, and help analysts better understand what the data convey. Feature selection methods fall into three main families: filter, wrapper, and embedded methods. This study applies the filter approach using indices that measure the association between variables, such as the Pearson product-moment correlation coefficient, conditional entropy, cross entropy, relative entropy, Goodman and Kruskal's τ, and Cramér's V, and examines the predictive performance of each dataset under each index, comparing it with the performance on the original dataset. Ten datasets are used in the experiments, two simulated and eight real, most of them categorical.
On the simulated data, this study finds that when the variables are categorical, conditional entropy selects important variables better than the other indices. On the real data, some datasets still predict well after filter-based feature selection, while others do not; the poor performance is likely related to explanatory variables with too many categories, too few observations, class imbalance, and improper discretization of continuous variables. This study suggests that the problems of too many categories, too few observations, and class imbalance can be addressed by appropriately merging categories, and that continuous variables can be discretized according to the distribution of the original data.
Future research should focus on how to set thresholds for selecting variables in categorical data, and on whether the filter method can be combined with the wrapper and embedded methods into new algorithms, so as to select important variables more precisely and improve the efficiency of data analysis.
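The association indices named in the abstract all reduce, for categorical data, to simple contingency-table computations. As an independent illustration (a minimal NumPy sketch, not code from the thesis), two of them, conditional entropy H(Y|X) and Cramér's V, can be computed as:

```python
import numpy as np

def conditional_entropy(x, y):
    """H(Y | X) in bits for two categorical sequences.

    A lower value means x is more informative about y.
    """
    x, y = np.asarray(x), np.asarray(y)
    n = len(x)
    h = 0.0
    for xv in np.unique(x):
        mask = x == xv
        p_x = mask.sum() / n
        # distribution of y within the slice where x == xv (only observed levels,
        # so no zero probabilities enter the log)
        _, counts = np.unique(y[mask], return_counts=True)
        p_y_given_x = counts / counts.sum()
        h -= p_x * np.sum(p_y_given_x * np.log2(p_y_given_x))
    return h

def cramers_v(x, y):
    """Cramér's V from the chi-squared statistic of the x-y contingency table.

    Assumes both variables take at least two distinct values.
    """
    x, y = np.asarray(x), np.asarray(y)
    xs, ys = np.unique(x), np.unique(y)
    table = np.array([[np.logical_and(x == a, y == b).sum() for b in ys]
                      for a in xs], dtype=float)
    n = table.sum()
    expected = table.sum(axis=1, keepdims=True) * table.sum(axis=0, keepdims=True) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    return np.sqrt(chi2 / (n * (min(table.shape) - 1)))
```

A lower H(Y|X) and a higher Cramér's V both indicate that X carries more information about Y, which is exactly the ranking signal a filter method needs.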

Abstract (English):
With the rapid development of machine learning, the importance of feature selection is self-evident. Appropriate feature selection can improve the predictive accuracy of statistical models, reduce computational costs, and help analysts better comprehend the data. Feature selection methods are mainly divided into filter, wrapper, and embedded methods; this study emphasizes the filter method. We implemented the filter method using several indices that measure the association between variables, such as the Pearson correlation coefficient, entropy-based measures, Goodman and Kruskal's τ, and Cramér's V, and compared the predictive performance of the dimensionally reduced datasets under each index with that of the original datasets. We used ten datasets to conduct the experiments, including two simulated and eight real datasets, most of which are categorical.
Among the simulated datasets, we found that conditional entropy selected important variables better than the other indices when the variables were categorical. Among the real datasets, some still performed well after filter-based selection while others did not. We speculate that the poor performance was associated with too many categories in the explanatory variables, too few observations, class imbalance, and improper discretization of continuous variables. We believe these problems can be addressed by appropriately merging categories and by discretizing continuous variables according to the distribution of the original datasets.
Future work should focus on how to set appropriate thresholds when filtering variables in categorical datasets, and on whether new algorithms can be created by integrating the filter, wrapper, and embedded methods, so as to enhance feature selection and improve the efficiency of categorical data analysis.
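The filter strategy described above scores each candidate feature against the target and keeps only the strongly associated ones, independently of any downstream model. A hypothetical sketch (not the thesis's implementation; the function names and the threshold value are assumptions), here using Goodman and Kruskal's τ as the scoring index:

```python
import numpy as np

def goodman_kruskal_tau(x, y):
    """Goodman and Kruskal's tau: the proportional reduction in the error of
    predicting y once x is known (0 = no association, 1 = perfect).
    Assumes y takes at least two distinct values.
    """
    x, y = np.asarray(x), np.asarray(y)
    p_y = np.array([(y == b).mean() for b in np.unique(y)])
    err_y = 1.0 - np.sum(p_y ** 2)  # error of proportional prediction from y alone
    err_y_given_x = 0.0
    for a in np.unique(x):
        mask = x == a
        p_yx = np.array([(y[mask] == b).mean() for b in np.unique(y[mask])])
        err_y_given_x += mask.mean() * (1.0 - np.sum(p_yx ** 2))
    return (err_y - err_y_given_x) / err_y

def filter_select(features, target, score_fn, threshold):
    """Filter method: score each feature against the target and keep those whose
    association meets the threshold, ranked from strongest association down."""
    scores = {name: score_fn(col, target) for name, col in features.items()}
    kept = [name for name, s in sorted(scores.items(), key=lambda kv: -kv[1])
            if s >= threshold]
    return kept, scores
```

For example, with a perfectly associated feature and an independent one, `filter_select({"informative": ["a", "a", "b", "b"], "noise": ["a", "b", "a", "b"]}, [0, 0, 1, 1], goodman_kruskal_tau, threshold=0.5)` keeps only `"informative"`. How to choose the threshold for categorical data is precisely the open question the abstract raises.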

Chapter 1 Introduction 1
Section 1 The Current State of Feature Selection 1
Section 2 Research Motivation and Objectives 2
Chapter 2 Literature Review 4
Chapter 3 Research Methods and Data 6
Section 1 Indices Used 6
Section 2 Algorithms Used 11
Section 3 Description of the Research Data 12
Chapter 4 Research Procedure and Discussion of Results 17
Section 1 Experimental Process and Recorded Results 17
Section 2 Discussion of Experimental Results 29
Chapter 5 Conclusions and Suggestions 34
References 36
Appendix 39
Akoglu, H. (2018). User's guide to correlation coefficients. Turkish Journal of Emergency Medicine, 18(3), 91-93.
Altman, N. S. (1992). An introduction to kernel and nearest-neighbor nonparametric regression. The American Statistician, 46(3), 175-185.
Belhumeur, P. N., Hespanha, J. P., & Kriegman, D. J. (1997). Eigenfaces vs. fisherfaces: Recognition using class specific linear projection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7), 711-720.
Beh, E. J., & Davy, P. J. (1998). Theory & Methods: Partitioning Pearson’s Chi‐Squared Statistic for a Completely Ordered Three‐Way Contingency Table. Australian & New Zealand Journal of Statistics, 40(4), 465-477.
Boltz, S., Debreuve, E., & Barlaud, M. (2007). kNN-based high-dimensional Kullback-Leibler distance for tracking. In Eighth International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS'07) (pp. 16-16). IEEE.
Boltz, S., Debreuve, E., & Barlaud, M. (2009). High-dimensional statistical measure for region-of-interest tracking. IEEE Transactions on Image Processing, 18(6), 1266-1283.
Bommert, A., Sun, X., Bischl, B., Rahnenführer, J., & Lang, M. (2020). Benchmark for filter methods for feature selection in high-dimensional classification data. Computational Statistics & Data Analysis, 143, 106839.
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32.
Chandrashekar, G., & Sahin, F. (2014). A survey on feature selection methods. Computers & Electrical Engineering, 40(1), 16-28.
Cover, T. M., & Thomas, J. A. (1991). Entropy, relative entropy and mutual information. Elements of Information Theory, 2(1), 12-13.
Cortez, P., & Silva, A. M. G. (2008). Using data mining to predict secondary school student performance. In Proceedings of the 5th Future Business Technology Conference (FUBUTEC 2008) (pp. 5-12).
Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273-297.
Cox, D. R. (1958). The regression analysis of binary sequences. Journal of the Royal Statistical Society: Series B (Methodological), 20(2), 215-232.
Cramér, H. (1946). Mathematical Methods of Statistics. Princeton, NJ: Princeton University Press.
D'Ambra, L., & Lauro, N. (1989). Non symmetrical analysis of three-way contingency tables. In Multiway data analysis (pp. 301-315).
D’Ambra, L., Beh, E. J., & Lombardo, R. (2005). Decomposing Goodman-Kruskal tau for Ordinal Categorical Variables. International Statistical Institute, 55th.
Goodman, L. A., & Kruskal, W. H. (1954). Measures of association for cross classifications. Journal of the American Statistical Association, 49, 732–769.
Gruosso, T., Mieulet, V., Cardon, M., Bourachot, B., Kieffer, Y., Devun, F., ... & Mechta-Grigoriou, F. (2016). Chronic oxidative stress promotes H2AX protein degradation and enhances chemosensitivity in breast cancer patients. EMBO Molecular Medicine, 8(5), 527-549.
Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3(Mar), 1157-1182.
Guyon, I., Gunn, S., Nikravesh, M., & Zadeh, L. A. (Eds.). (2008). Feature extraction: foundations and applications (Vol. 207). Springer.
Hull, J. J. (1994). A database for handwritten text recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5), 550-554.
Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. The Annals of Mathematical Statistics, 22(1), 79-86.
Kurgan, L. A., Cios, K. J., Tadeusiewicz, R., Ogiela, M., & Goodenday, L. S. (2001). Knowledge discovery approach to automated cardiac SPECT diagnosis. Artificial Intelligence in Medicine, 23(2), 149-169.
Masoudi-Sobhanzadeh, Y., Motieghader, H., & Masoudi-Nejad, A. (2019). FeatureSelect: a software for feature selection based on machine learning approaches. BMC Bioinformatics, 20(1), 1-17.
National Development Council (2020). 2018 Mobile Phone Users' Digital Opportunity Survey (AE080006) [data file]. Available from Survey Research Data Archive, Academia Sinica. doi:10.6141/TW-SRDA-AE080006-1
Pearson, K. (1895). VII. Note on regression and inheritance in the case of two parents. Proceedings of the Royal Society of London, 58(347-352), 240-242.
Remeseiro, B., & Bolon-Canedo, V. (2019). A review of feature selection methods in medical applications. Computers in Biology and Medicine, 112, 103375.
Rodriguez-Galiano, V. F., Luque-Espinar, J. A., Chica-Olmo, M., & Mendes, M. P. (2018). Feature selection approaches for predictive modelling of groundwater nitrate pollution: An evaluation of filters, embedded and wrapper methods. Science of the Total Environment, 624, 661-672.
Shannon, C. E. (1948). A mathematical theory of communication. The Bell System Technical Journal, 27(3), 379-423.
Sun, Y., Lu, C., & Li, X. (2018). The cross-entropy based multi-filter ensemble method for gene selection. Genes, 9(5), 258.
Wah, Y. B., Ibrahim, N., Hamid, H. A., Abdul-Rahman, S., & Fong, S. (2018). Feature Selection Methods: Case of Filter and Wrapper Approaches for Maximising Classification Accuracy. Pertanika Journal of Science & Technology, 26(1).
Wang, J., Xu, J., Zhao, C., Peng, Y., & Wang, H. (2019). An ensemble feature selection method for high-dimensional data based on sort aggregation. Systems Science & Control Engineering, 7(2), 32-39.
Yang, Y., & Pedersen, J. O. (1997). A comparative study on feature selection in text categorization. In Proceedings of the 14th International Conference on Machine Learning (ICML '97) (pp. 412-420).
Yöntem, M. K., Adem, K., İlhan, T., & Kılıçarslan, S. (2019). Divorce prediction using correlation-based feature selection and artificial neural networks. Nevşehir Hacı Bektaş Veli Üniversitesi SBE Dergisi, 9(1), 259-273.
(Full text available after 2027-06-13)
Electronic full text
 
 
 
 