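The microarray is an important tool for genetic analysis today, and it can help distinguish among many types of cancer. We carried out a classification task on cancer microarray data, using feature selection methods from information science together with the support vector machine, a machine learning method, to reduce the data and to make predictions. We applied these two tools to three cancer microarray data sets: leukemia, lung cancer, and prostate cancer. The feature selection methods we used fall into two categories: Euclidean distance feature selection, which is a distance measure, and Pearson correlation coefficient feature selection, which is a dependence measure. We trained the support vector machine with different numbers of features and with three different kernel functions. Our results show that distance-based feature selection is well suited to the support vector machine classifier, and that the linear kernel is the better kernel for the three problems we studied. Across the three data sets, reducing the original 7129 or more features to only 15 to 100 features still gave equal or better prediction results.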
The microarray is an important tool in gene analysis research. It can help identify genes that might cause various cancers. In this thesis, we use feature selection methods and the support vector machine (SVM) to search for disease-causing genes in microarray data from three different cancers. The feature selection methods are based on Euclidean distance (ED) and the Pearson correlation coefficient (PCC). We selected three widely referenced microarray data sets for classification: the AML & ALL data set, the lung cancer data set, and the prostate data set. We investigated how the prediction results change when the SVM is trained with different numbers of features and different kinds of kernels. The results show that the linear kernel is the best-performing kernel for these problems. In addition, equal or higher accuracy can be achieved with only 15 to 100 features selected from the 7129 or more features of the original data sets.
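As an illustration of the kind of feature ranking described above, the following is a minimal sketch rather than the procedure used in the thesis. It assumes that each gene is scored against a binary class-label vector, either by the absolute Pearson correlation or by the Euclidean distance between the min-max-scaled expression values and the labels, and that the top k genes are kept. The function names, the scaling step, and the toy data are all hypothetical.

```python
import math


def pearson_score(values, labels):
    """Absolute Pearson correlation between one gene's values and the 0/1 labels."""
    n = len(values)
    mean_v = sum(values) / n
    mean_l = sum(labels) / n
    cov = sum((v - mean_v) * (l - mean_l) for v, l in zip(values, labels))
    var_v = sum((v - mean_v) ** 2 for v in values)
    var_l = sum((l - mean_l) ** 2 for l in labels)
    if var_v == 0 or var_l == 0:
        return 0.0  # a constant gene carries no class information
    return abs(cov / math.sqrt(var_v * var_l))


def euclidean_score(values, labels):
    """Negative Euclidean distance to the label vector (higher means closer).

    Assumption: values are min-max scaled to [0, 1] first, and the gene may
    match either the labels or their complement (over- or under-expression).
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return float("-inf")  # a constant gene is ranked last
    scaled = [(v - lo) / (hi - lo) for v in values]
    direct = math.sqrt(sum((s - l) ** 2 for s, l in zip(scaled, labels)))
    flipped = math.sqrt(sum((s - (1 - l)) ** 2 for s, l in zip(scaled, labels)))
    return -min(direct, flipped)


def select_top_k(samples, labels, k, score_fn):
    """Return the column indices of the k highest-scoring genes."""
    n_genes = len(samples[0])
    scored = []
    for g in range(n_genes):
        column = [row[g] for row in samples]
        scored.append((score_fn(column, labels), g))
    # Sort by descending score; ties are broken by the lower gene index.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [g for _, g in scored[:k]]


# Toy data (made up): 6 samples (rows) x 4 genes (columns), two classes.
samples = [
    [1, 3, 2, 9],
    [1, 1, 2, 8],
    [1, 4, 2, 9],
    [5, 1, 2, 3],
    [5, 5, 2, 8],
    [5, 2, 2, 2],
]
labels = [0, 0, 0, 1, 1, 1]

top_pcc = select_top_k(samples, labels, 2, pearson_score)
top_ed = select_top_k(samples, labels, 2, euclidean_score)
print(top_pcc, top_ed)
```

In a full pipeline, the columns returned by `select_top_k` would be used to build a reduced data matrix, which could then be passed to an SVM implementation, for example scikit-learn's `SVC(kernel="linear")`, to compare kernels and feature counts.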