  • Conference paper
  • OpenAccess

A Low Dimensional Categorical Data Transform Based on Feature Combination

Abstract


When transforming categorical data into numerical data, current encoders have the drawback of generating high-dimensional output. Reducing the output dimension unavoidably causes a loss of information, and the amount of information lost is considered to be positively correlated with the number of dimensions discarded. This work develops an efficient approach that extracts and preserves more information from the dataset. The numerical output of the proposed approach delivers higher accuracy and requires less computation time because of its limited number of dimensions. The first technique in this approach, termed feature combination (FC), combines a few columns into a single column of combinations. The second technique, pre-selection, selects important columns according to the information gain metric before FC is executed. The proposed method was evaluated on categorical data from the UCI and CTU datasets. The experimental results show that the features transformed by the proposed method have between 1 and 4 dimensions, depending on the number of labels in each dataset. Moreover, the accuracy on all datasets with the proposed method is almost 2 percent higher than with OneHotEncoder. Although the improvement in accuracy is not remarkable, the number of feature dimensions is at least 20 times lower than that of OneHotEncoder.
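The two techniques named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy data, the choice of `k`, and the use of tuples to represent combinations are all assumptions made for illustration, and the abstract does not say how the combined column is subsequently mapped to numerical values, so the sketch stops at the combined column.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(column, labels):
    """Label entropy minus the expected label entropy after splitting on column."""
    groups = defaultdict(list)
    for value, label in zip(column, labels):
        groups[value].append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def preselect(rows, labels, k):
    """Pre-selection: indices of the k columns with the highest information gain."""
    n_cols = len(rows[0])
    gains = [information_gain([r[i] for r in rows], labels) for i in range(n_cols)]
    return sorted(range(n_cols), key=lambda i: gains[i], reverse=True)[:k]


def combine(rows, cols):
    """Feature combination: merge the chosen columns into one column of tuples."""
    return [tuple(r[i] for i in cols) for r in rows]


# Toy data invented for illustration: three categorical columns, binary label.
rows = [
    ("red", "s", "x"),
    ("red", "l", "y"),
    ("blue", "s", "x"),
    ("blue", "l", "y"),
]
labels = ["a", "a", "b", "b"]

selected = preselect(rows, labels, k=2)  # hypothetical choice of k
combined = combine(rows, selected)
```

In this toy example the first column alone determines the label, so it has the highest information gain and is always selected. Each distinct tuple in `combined` is one category of the single merged column, which is why the merged representation stays low-dimensional compared with one column per category value.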
