Developing Supervised and Unsupervised Encoding Frameworks for Binary Variables to Improve Model Predictive Performance

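Artificial intelligence, machine learning, and deep learning have been widely applied across industries in recent years. Driven by advances in image recognition and natural language processing, these applications span manufacturing, finance, marketing, and medical image recognition, and practitioners keep looking for ways to apply machine learning and deep learning to real problems in order to improve the efficiency and accuracy of everyday work. However, the performance of a machine learning model does not depend solely on modeling technique and hyperparameter tuning; data preprocessing and encoding also have a profound effect. For example, when handling a categorical variable that contains strings, a single categorical variable is usually encoded into several numerical features: one-hot encoding converts the string-valued levels into binary numerical features that the model can read as input. When the variable has many levels, however, one-hot encoding produces many binary features, which dilutes the information in the original categorical feature and leads to the curse of dimensionality. In addition, the encoded binary features are not necessarily directly related to the classifier or regressor. Worse still, the distribution of binary features often conflicts with the assumptions of many machine learning algorithms. In view of these challenges, this study proposes novel supervised and unsupervised encoding methods that aggregate binary features into a small number of integer-valued variables. The encoded integer variables can be fed to the model at low cost, and because the aggregation is based on the degree of association among the original binary features, model training becomes more efficient and final performance also improves. The proposed aggregation encoding groups the binary features by examining quantities such as the correlation coefficients between them and their principal component weights, sorts the features within each group according to their properties, and then encodes them as integer values. Overall, the method aims to reduce dimensionality and speed up processing while maintaining model accuracy and the interpretability of the variables.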
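The grouping-and-encoding idea can be sketched as follows. This is a minimal illustration and not the study's implementation: the 0.5 correlation threshold, the connected-component grouping rule, the sparsest-column-first bit order, and the function names `group_by_correlation` and `consolidate` are all assumptions made for the example. The abstract names correlation coefficients and principal component weights as grouping criteria but does not specify the details.

```python
from itertools import combinations
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def group_by_correlation(columns, threshold=0.5):
    """Group column indices whose |correlation| reaches the (assumed) threshold.

    Columns linked directly or through a chain of links end up in one group
    (connected components via a small union-find).
    """
    n = len(columns)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(n), 2):
        if abs(pearson(columns[i], columns[j])) >= threshold:
            parent[find(j)] = find(i)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


def consolidate(columns, group):
    """Merge one group of binary columns into a single integer column.

    Assumed ordering: the column with the fewest ones becomes the least
    significant bit. Each row is then read as a binary number.
    """
    ordered = sorted(group, key=lambda c: sum(columns[c]))
    n_rows = len(columns[0])
    return [
        sum(columns[c][r] << bit for bit, c in enumerate(ordered))
        for r in range(n_rows)
    ]
```

With four binary columns, calling `group_by_correlation` and then `consolidate` on each group replaces every group with one integer column, so a group of k binary features becomes a single feature with values in the range 0 to 2^k - 1.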

Keywords

categorical variable; one-hot encoding; supervised/unsupervised encoding; binary feature ordering

Parallel Abstract

AI techniques have recently been widely applied to tasks such as image recognition and natural language processing. Practitioners in fields such as manufacturing, finance, marketing, and radiology are eager to adopt AI methods to improve day-to-day efficiency and effectiveness. However, the performance of an AI method depends not only on modeling skill and hyperparameter tuning but also on data preprocessing and encoding. When handling categorical variables, one-hot encoding is commonly used to convert strings into binary features, which then serve as the input for model training and testing. If a categorical variable has many levels, one-hot encoding creates a correspondingly large number of features, and the curse of dimensionality becomes a major concern. Furthermore, one-hot features are derived from the levels of the categorical variable and are not guaranteed to be related to the classification or regression task. In addition, binary feature values often violate the assumptions of machine learning algorithms. In this research, we develop unsupervised and supervised encoding methods to address these issues in modeling categorical variables. In the unsupervised encoding scheme, we compare feature properties, such as column sparsity, PCA weight, and feature importance, to consolidate related features into a single semi-continuous feature via binary encoding. In the supervised encoding method, we propose an optimization scheme that links the performance improvement of the classifier or regressor to the order in which the binary features are consolidated. Feeding the consolidated features to the model is expected to reduce the number of binary features significantly and to improve classification and regression accuracy.
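To make the supervised idea concrete, the sketch below searches over the order in which a group of binary features is packed into one integer and keeps the order that scores best against the target. It is an illustrative stand-in under stated assumptions, not the optimization scheme proposed in this research: the exhaustive search over permutations, the use of absolute Pearson correlation as a cheap proxy for classifier or regressor performance, and the names `encode`, `best_order`, and `abs_corr` are all hypothetical.

```python
from itertools import permutations, product
from statistics import mean


def abs_corr(x, y):
    """Absolute Pearson correlation; 0.0 if either sequence is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return abs(cov) / (vx * vy) ** 0.5


def encode(rows, order):
    """Pack each row of bits into an integer.

    order[k] is the column placed at bit k, with k = 0 the least significant bit.
    """
    return [sum(row[col] << bit for bit, col in enumerate(order)) for row in rows]


def best_order(rows, target, score=abs_corr):
    """Try every bit order and return the one whose encoding scores highest.

    The score function is an assumed proxy; a real pipeline might plug in
    cross-validated model performance instead.
    """
    n_cols = len(rows[0])
    return max(
        permutations(range(n_cols)),
        key=lambda order: score(encode(rows, order), target),
    )


# Toy data: all 8 combinations of 3 binary features, with a target that
# weights column 1 most, then column 2, then column 0.
rows = list(product([0, 1], repeat=3))
target = [4 * r[1] + 2 * r[2] + r[0] for r in rows]
order = best_order(rows, target)
```

Exhaustive search evaluates k! orderings for a group of k features, so it is practical only for small groups; larger groups would need a heuristic or a proper optimizer.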

Parallel Keywords

categorical variable; one-hot encoding; supervised/unsupervised encoding; binary feature sorting
