A Study of Data Preprocessing: Genetic Algorithms as an Example

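Feature selection and instance selection are two important data preprocessing techniques in data mining. Given a dataset, feature selection removes irrelevant or redundant features, while instance selection eliminates duplicate and erroneous data. Genetic algorithms in particular have been the most widely used algorithms for these preprocessing tasks. In past work, however, the two methods have usually been studied separately, so it remains unclear how performing feature selection and instance selection together differs, in efficiency and in results, from performing either one alone. This study therefore uses genetic algorithms to perform feature selection and instance selection, and examines how the order in which the two preprocessing methods are applied affects classification performance on datasets from different domains. The experimental results come from the performance of classifiers (e.g., support vector machines and k-nearest neighbor) on four large and four small datasets from different domains. The eight datasets differ in both feature dimensionality and number of samples, so that the approach can be applied not only to datasets from different domains but also to datasets that differ widely from one another. Beyond identifying different preprocessing schemes, this study also analyzes the characteristics of the datasets. Considering both accuracy and time efficiency, it examines which preprocessing method suits datasets with which characteristics, with the aim of finding regular patterns and guidelines that allow datasets from different domains to achieve better classification performance or better experimental time efficiency.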

Keywords

data mining; feature selection; genetic algorithms; instance selection

Parallel Abstract

Feature selection and instance selection are two important data preprocessing steps in data mining: the former aims to remove irrelevant and/or redundant features from a given dataset, and the latter to discard faulty data. In particular, genetic algorithms have been widely used for these tasks in related studies. However, these two data preprocessing tasks are generally considered separately in the literature. It is not known how performance differs between performing both feature and instance selection and performing feature selection or instance selection individually. Therefore, the aim of this paper is to perform feature selection and instance selection based on genetic algorithms, applied in different orders, and to examine the classification performance over datasets from different domains. Experimental results based on four small-scale and four large-scale datasets containing various numbers of features and data samples show that performing both feature and instance selection usually makes the classifiers (i.e., support vector machines and k-nearest neighbor) perform slightly worse than performing feature selection or instance selection individually. However, while there is no significant difference in classification accuracy between these data preprocessing methods, the combination of feature and instance selection greatly reduces the computational effort of training the classifiers compared with feature selection or instance selection individually. Considering both classification effectiveness and efficiency, performing both feature and instance selection is the best choice for data preprocessing in data mining.
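As a rough illustration of the approach described in the abstract, the following minimal sketch encodes a feature mask and an instance mask in one binary chromosome and evolves it with a simple genetic algorithm, using 1-nearest-neighbor validation accuracy as the fitness. This is not the thesis's implementation. The function names, the toy data, the single-chromosome encoding, and all parameter values (`pop_size`, `generations`, `p_mut`) are assumptions made for illustration only.

```python
import random

def one_nn_accuracy(train, test, feat_mask):
    """Accuracy of a 1-nearest-neighbor classifier restricted to the kept features."""
    idx = [j for j, keep in enumerate(feat_mask) if keep]
    if not idx or not train:
        return 0.0  # nothing selected: treat as the worst possible fitness
    correct = 0
    for x, y in test:
        nearest = min(train, key=lambda t: sum((t[0][j] - x[j]) ** 2 for j in idx))
        correct += nearest[1] == y
    return correct / len(test)

def fitness(chrom, train, valid, n_feat):
    """First n_feat genes select features; the remaining genes select training rows."""
    feat_mask, inst_mask = chrom[:n_feat], chrom[n_feat:]
    kept = [row for row, keep in zip(train, inst_mask) if keep]
    return one_nn_accuracy(kept, valid, feat_mask)

def genetic_select(train, valid, n_feat, pop_size=20, generations=30,
                   p_mut=0.05, seed=0):
    """Evolve a joint feature/instance mask (hypothetical parameter defaults)."""
    rng = random.Random(seed)
    length = n_feat + len(train)
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]

    def score(chrom):
        return fitness(chrom, train, valid, n_feat)

    for _ in range(generations):
        pop.sort(key=score, reverse=True)
        elite = pop[: pop_size // 2]           # keep the better half unchanged
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            cut = rng.randrange(1, length)     # one-point crossover
            child = a[:cut] + b[cut:]
            child = [1 - g if rng.random() < p_mut else g for g in child]  # bit-flip mutation
            children.append(child)
        pop = elite + children
    best = max(pop, key=score)
    return best[:n_feat], best[n_feat:], score(best)

def make_toy(n, rng):
    """Toy data (assumption): feature 0 tracks the label, features 1 and 2 are noise."""
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        rows.append(([label + rng.gauss(0, 0.3), rng.gauss(0, 1), rng.gauss(0, 1)], label))
    return rows

data_rng = random.Random(42)
train_set, valid_set = make_toy(40, data_rng), make_toy(30, data_rng)
feat_mask, inst_mask, acc = genetic_select(train_set, valid_set, n_feat=3)
print("features kept:", feat_mask)
print("instances kept:", sum(inst_mask), "of", len(train_set))
print("validation accuracy:", acc)
```

To study the order of the two steps, as the thesis does, one could instead run the same loop twice: once over a feature-only chromosome and once over an instance-only chromosome, in either order. In practice a separate held-out test set would be needed, since the validation accuracy used as fitness is optimistically biased.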

Parallel Keywords

data mining; feature selection; instance selection; genetic algorithms

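Cited By

秦聖昌 (2015). A Study of Support Vector Machines for Breast Cancer Prediction [Master's thesis, National Central University]. Airiti Library. https://www.airitilibrary.com/Article/Detail?DocID=U0031-0412201512094103

范睿昀 (2015). A Study of Applying Data Mining Techniques to Resource Allocation Prediction: The Case of a Support Unit at a Computer Contract Manufacturer [Master's thesis, National Central University]. Airiti Library. https://www.airitilibrary.com/Article/Detail?DocID=U0031-0412201512044755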
