With the spread of the Internet, mobile Internet access, and social media platforms (such as Facebook and Instagram), data is being generated at an unprecedented rate. However, large data volumes, heterogeneous formats, and an excessive number of dimensions (variables) work against machine learning: too many variables keep a model from finding the expected patterns, and the resulting heavy computation and long training times also cause trained models to perform worse than expected. Feature processing is therefore usually carried out first, as a preprocessing step, in machine learning projects. This paper analyzes and compares existing feature processing techniques. These include feature extraction methods that construct new features from the original ones, such as principal component analysis (PCA) and linear discriminant analysis (LDA), and feature selection methods that retain the information in the original data while filtering it, such as filter and wrapper methods. The aim is to use feature processing effectively to achieve high-performance learning algorithms. The feature processing methods analyzed and organized in this paper give a clearer understanding of the feature processing workflow and provide users with clear parameter settings and modes of operation, further improving the usability of the data.
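To make the distinction between the two feature selection families concrete, the following is a minimal sketch in plain Python, not the implementation used in this paper. All function names are hypothetical. The filter criterion (absolute Pearson correlation with the label) and the wrapper evaluator (leave-one-out accuracy of a 1-nearest-neighbour classifier with greedy forward selection) are assumptions chosen only for illustration; other scoring criteria and models are equally valid.

```python
from statistics import mean, pstdev


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    sx, sy = pstdev(xs), pstdev(ys)
    if sx == 0 or sy == 0:
        return 0.0
    mx, my = mean(xs), mean(ys)
    cov = mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])
    return cov / (sx * sy)


def filter_select(rows, target, k):
    """Filter method: score each column independently of any model, keep the top k."""
    scores = []
    for j in range(len(rows[0])):
        column = [row[j] for row in rows]
        scores.append((abs(pearson(column, target)), j))
    top = sorted(scores, key=lambda s: (-s[0], s[1]))[:k]
    return sorted(j for _, j in top)


def loo_accuracy(rows, target, features):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier on the given columns."""
    correct = 0
    for i, row in enumerate(rows):
        best_dist, best_label = None, None
        for j, other in enumerate(rows):
            if i == j:
                continue
            dist = sum((row[f] - other[f]) ** 2 for f in features)
            if best_dist is None or dist < best_dist:
                best_dist, best_label = dist, target[j]
        correct += best_label == target[i]
    return correct / len(rows)


def wrapper_select(rows, target, k):
    """Wrapper method: greedy forward selection driven by the model's own accuracy."""
    chosen = []
    remaining = list(range(len(rows[0])))
    while remaining and len(chosen) < k:
        best = max(remaining, key=lambda f: loo_accuracy(rows, target, chosen + [f]))
        chosen.append(best)
        remaining.remove(best)
    return sorted(chosen)


# Toy data (made up for this sketch): column 0 tracks the label,
# column 1 is unrelated noise, column 2 is constant.
rows = [
    [1.0, 5.0, 3.0],
    [1.2, 1.0, 3.0],
    [0.9, 4.0, 3.0],
    [3.1, 1.5, 3.0],
    [2.9, 4.5, 3.0],
    [3.0, 5.2, 3.0],
]
target = [0, 0, 0, 1, 1, 1]

print("filter keeps columns:", filter_select(rows, target, 1))
print("wrapper keeps columns:", wrapper_select(rows, target, 1))
```

Both approaches return indices of original columns, so the retained features keep their original meaning. Feature extraction methods such as PCA and LDA instead produce new features as combinations of the original columns. The filter version scores each column once, while the wrapper version retrains and evaluates the model for every candidate subset, which is why wrapper methods are generally more expensive.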