
Support for Genetic Algorithms and Support Vector Machines in Cloud Computing

A Feature Selection and Classification Tool Using Cloud Computing Architecture

Advisor: 賴飛羆
Co-advisor: 薛智文 (Chin-Wen Hsueh)

Abstract


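As information technology advances, instruments can capture large amounts of data that were previously unavailable. These data, however, are often too large and complex: each type of data has many classes and features, and without organization and analysis people cannot use them effectively. People usually want to find the best answer or solution in the data, but doing so often requires analyzing every record individually, which consumes too much time and effort. Feature selection and classification methods are therefore used to extract the needed data effectively, and near-optimal feature selection methods are used to find good answers while reducing data processing time, in the hope that the extracted data can then be used to classify and analyze unknown data. Even though feature selection reduces time and computation, the work often still takes a very long time to complete. People therefore want to use network transmission to distribute part of the computation to other machines, hoping that parallel processing will reduce the host's computing time and load. In this thesis, we build an easy-to-use platform that lets users process data with many classes. We adopt the genetic algorithm and the Fisher score for feature selection and use a support vector machine as the classifier. Cloud technology distributes the computation to other machines, reducing the host's load and the overall computing time. We evaluate performance on a sample mRNA dataset for cancer that has 14 classes and 16,063 features. Because mRNA data are hard to obtain, few samples are available, which makes high accuracy difficult to achieve; with the methods above we obtained 87% accuracy. In terms of time cost, the system distributes the work across ten computers, reducing a run that originally took 22.057 days to only 2.334 days. This study therefore improves both the accuracy and the running time of the classifier.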
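The Fisher score named above can be illustrated with a short sketch. It uses one common definition (between-class scatter divided by within-class scatter, computed per feature); the abstract does not give the thesis's exact formulation, so the function names `fisher_score` and `top_k_features` and the toy data are assumptions made purely for illustration.

```python
import numpy as np


def fisher_score(X, y):
    """Per-feature Fisher score: between-class scatter over within-class scatter."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for label in np.unique(y):
        Xc = X[y == label]
        n_c = Xc.shape[0]
        between += n_c * (Xc.mean(axis=0) - overall_mean) ** 2
        within += n_c * Xc.var(axis=0)
    # The small constant guards against division by zero for constant features.
    return between / (within + 1e-12)


def top_k_features(X, y, k):
    """Indices of the k highest-scoring features, best first."""
    scores = fisher_score(X, y)
    return np.argsort(scores)[::-1][:k]


# Hypothetical toy data: 3 classes, 5 features, only feature 2 depends on the class.
rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 20)
X = rng.normal(size=(60, 5))
X[:, 2] += 5.0 * y
selected = top_k_features(X, y, k=2)
print(selected)
```

On a dataset with thousands of features, a filter like this is cheap because each feature is scored independently, which is one plausible reason to pair it with a costlier search such as a genetic algorithm.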

Parallel Abstract


As information technology advances, newly developed instruments make available large amounts of previously unknown information and data. However, those data are often too large and complex, and each type of data has many categories and characteristics. People cannot use those data effectively without classification and analysis. People also tend to look for the optimal answer or solution in the information, but finding it requires analyzing each record individually, which takes too much time and effort. Therefore, feature selection and classification methods are used to capture the needed information effectively, and near-optimal feature selection methods are used to find better solutions while reducing computing time and cost. People also hope that this information can be used to classify unknown data. However, even when feature selection and classification reduce the computing cost and time, completing the work still takes a great deal of time. Therefore, people want to use network transmission to spread some of the processing to other devices and servers, reducing computation cost and time through parallel processing. In this thesis, we build a user-friendly platform for multi-class data classification. The system adopts the Genetic Algorithm (GA) and the Fisher score for feature selection and a Support Vector Machine (SVM) for classification. It also applies cloud computing to reduce the computation load and the overall computing time by spreading jobs to other devices. We use an mRNA cancer dataset, which has 14 categories and 16,063 features, to evaluate system performance. Because mRNA data are difficult to obtain, it is hard to achieve high accuracy. With the above methods, the system reaches 87% accuracy and, by distributing the work across ten computers, reduces the overall computing time from 22.057 days to 2.334 days, roughly a 9.45-fold speedup.
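The idea of a GA searching over feature subsets while candidate evaluations run in parallel can be sketched as follows. This is a minimal illustration, not the thesis's implementation: a local thread pool stands in for the cloud workers, a nearest-centroid classifier stands in for the SVM, and every name and parameter (`run_ga`, `centroid_accuracy`, population size, mutation rate, the toy data) is an assumption.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor


def centroid_accuracy(X_tr, y_tr, X_va, y_va):
    """Nearest-centroid accuracy: a lightweight stand-in for the SVM (assumption)."""
    labels = np.unique(y_tr)
    centroids = np.stack([X_tr[y_tr == c].mean(axis=0) for c in labels])
    # Squared Euclidean distance from every validation sample to every centroid.
    dists = ((X_va[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return float((labels[dists.argmin(axis=1)] == y_va).mean())


def run_ga(X_tr, y_tr, X_va, y_va, pop_size=12, generations=8, workers=4, seed=0):
    """Search boolean feature masks; fitness calls are mapped over a worker pool."""
    rng = np.random.default_rng(seed)
    n_feat = X_tr.shape[1]
    pop = rng.random((pop_size, n_feat)) < 0.5  # each row is one feature mask

    def fitness(mask):
        if not mask.any():  # an empty feature set cannot classify anything
            return 0.0
        return centroid_accuracy(X_tr[:, mask], y_tr, X_va[:, mask], y_va)

    best_mask, best_fit = pop[0].copy(), -1.0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(generations):
            # Each candidate is scored independently, so this step parallelizes.
            fits = np.array(list(pool.map(fitness, pop)))
            top = int(fits.argmax())
            if fits[top] > best_fit:
                best_fit, best_mask = float(fits[top]), pop[top].copy()
            # Binary tournament selection.
            a = rng.integers(pop_size, size=pop_size)
            b = rng.integers(pop_size, size=pop_size)
            parents = pop[np.where(fits[a] >= fits[b], a, b)]
            # Uniform crossover with a shuffled partner, then bit-flip mutation.
            partners = parents[rng.permutation(pop_size)]
            take_first = rng.random((pop_size, n_feat)) < 0.5
            children = np.where(take_first, parents, partners)
            pop = children ^ (rng.random((pop_size, n_feat)) < 0.02)
    return best_mask, best_fit


# Hypothetical toy data: 3 classes, 10 features, only the first 3 carry class signal.
rng = np.random.default_rng(1)
y = np.tile([0, 1, 2], 40)
X = rng.normal(size=(120, 10))
X[:, :3] += 5.0 * y[:, None]
best_mask, best_fit = run_ga(X[:80], y[:80], X[80:], y[80:])
print(best_mask.astype(int), round(best_fit, 3))
```

Because the fitness of each candidate mask depends only on that mask, the evaluation step is the natural unit to hand to separate machines; in a real deployment the pool would be replaced by whatever job dispatch mechanism the platform provides.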

