
An Evaluation of Correlation-Based Categorical Feature Selection Methods

Advisor: 周珮婷
This thesis will be open for download on 2027/06/13.

Abstract


With the rapid development of machine learning, the importance of feature selection is self-evident: choosing variables appropriately can improve the predictive performance of statistical models, reduce computational cost, and help analysts better understand the meaning contained in the data. Feature selection methods fall into three main families: filter, wrapper, and embedded methods. This study applies filter-based feature selection using indices that measure the association between variables, such as the Pearson product-moment correlation coefficient, conditional entropy, cross entropy, relative entropy, Goodman and Kruskal's τ, and Cramér's V, and examines the predictive performance of each dataset under the different indices, comparing it with the performance on the original dataset. Ten datasets were used in the experiments, including two simulated and eight real datasets, most of which are categorical. On the simulated data, we found that when the variables are categorical, conditional entropy selects important variables better than the other indices. On the real data, some datasets still predicted well after filter-based selection, while others did not; we conjecture that the poor performance is related to explanatory variables with too many categories, too few observations, class imbalance, and improper discretization of continuous variables. The problems of too many categories, too few observations, and class imbalance can be addressed by merging categories appropriately, and continuous variables can be discretized according to the distribution of the original data. Future work should focus on how to set the variable-selection threshold for categorical data, and on whether filter methods can be combined with wrapper and embedded methods into new algorithms that select important variables more precisely and improve the efficiency of data analysis.
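The filter step described above scores each candidate feature against the response with an association index and keeps only the features that exceed a chosen threshold. A minimal pure-Python sketch of two of the indices mentioned, Cramér's V and conditional entropy, on a toy categorical dataset (the data and variable names are illustrative, not taken from the thesis):

```python
import math
from collections import Counter

def cramers_v(x, y):
    """Cramér's V between two categorical sequences, via the chi-squared statistic."""
    n = len(x)
    xs, ys = sorted(set(x)), sorted(set(y))
    joint, px, py = Counter(zip(x, y)), Counter(x), Counter(y)
    chi2 = 0.0
    for a in xs:
        for b in ys:
            expected = px[a] * py[b] / n
            chi2 += (joint[(a, b)] - expected) ** 2 / expected
    k = min(len(xs), len(ys)) - 1
    return math.sqrt(chi2 / (n * k)) if k > 0 else 0.0

def conditional_entropy(y, x):
    """H(Y | X): remaining uncertainty in the label Y after observing feature X."""
    n = len(x)
    joint, px = Counter(zip(x, y)), Counter(x)
    h = 0.0
    for (a, b), c in joint.items():
        h -= (c / n) * math.log2(c / px[a])
    return h

# Toy data: f1 determines the label exactly, f2 is noise.
label = ["pos", "pos", "neg", "neg", "pos", "neg"]
f1    = ["a",   "a",   "b",   "b",   "a",   "b"]
f2    = ["x",   "y",   "x",   "y",   "y",   "x"]

# Filter step: rank features by each index; f1 scores V = 1.0 and H(Y|X) = 0.0,
# so it would pass any reasonable threshold, while f2 would be dropped.
for name, feat in [("f1", f1), ("f2", f2)]:
    print(name, round(cramers_v(feat, label), 3), round(conditional_entropy(label, feat), 3))
```

A higher Cramér's V (closer to 1) or a lower conditional entropy (closer to 0) marks a feature as more informative about the label.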

Parallel Abstract (English)


With the vigorous development of machine learning, the importance of feature selection is self-evident. Appropriate selection of features can optimize the accuracy of statistical models, reduce computational costs, and help analysts better comprehend the data. Feature selection is mainly divided into filter, wrapper, and embedded methods, and we put emphasis on the filter method. This study implemented the filter method by utilizing several indices that measure the association between variables, such as the Pearson correlation coefficient, entropy, Goodman and Kruskal's τ, and Cramér's V, and compared the performance of the dimensionally reduced datasets under each index with that of the original datasets. We used ten datasets to conduct experiments, including two simulated and eight real datasets, most of which are categorical. Among the simulated datasets, we found that conditional entropy selected important categorical variables better than the other indices. Among the real datasets, some still performed well after selection while the others did not. We speculate that the poor performance was associated with excessive categories, a lack of observations, imbalanced data, and improper discretization of continuous variables. We believe these problems can be mitigated by judiciously merging categories and by discretizing continuous variables according to the distribution of the original data. Future studies should focus on how to set the threshold when filtering variables in categorical datasets, and on whether new algorithms can be created by integrating filter, wrapper, and embedded methods to enhance feature-selection performance and improve the efficiency of categorical data analysis.
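One of the remedies proposed above is to discretize continuous variables according to the distribution of the original data rather than with a fixed equal-width grid. A minimal sketch of equal-frequency (quantile) binning, where the cut points follow the empirical distribution so each bin receives roughly the same number of observations (the function and data are illustrative, not the thesis's implementation):

```python
def quantile_bins(values, n_bins):
    """Assign each value a bin label 0..n_bins-1 using empirical quantile cut points."""
    ordered = sorted(values)
    n = len(ordered)
    # Cut points at the 1/n_bins, 2/n_bins, ... empirical quantiles.
    cuts = [ordered[min(n - 1, (i * n) // n_bins)] for i in range(1, n_bins)]
    labels = []
    for v in values:
        b = 0
        for c in cuts:
            if v > c:
                b += 1  # value lies above this cut point, so it belongs to a later bin
        labels.append(b)
    return labels

# A skewed continuous variable: quantile cuts adapt to where the data actually lie.
ages = [18, 22, 25, 31, 38, 45, 52, 67]
print(quantile_bins(ages, 4))
```

In practice one would reach for a library routine such as `pandas.qcut` for the same effect; the point is that distribution-aware cut points avoid the near-empty bins that equal-width splits produce on skewed data.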

