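Human experts hope to use microarray data to determine whether a patient has cancer, or to identify genes associated with cancer. However, microarray data contain a very large number of features (genes); for example, humans have more than twenty thousand known genes. This not only makes it hard for human experts to uncover the underlying rules that determine cancer, but also poses a serious challenge for machine learning tools. We therefore need a method that ranks these genes by importance so that the decisive genes can be selected. Human experts can then spend less effort studying the selected genes, and the accuracy of machine learning tools in cancer classification can be improved as well.

In this thesis, we investigate how feature selection methods affect cancer classification with gene microarray data, focusing on radial basis function neural networks. Our experiments show that a radial basis function neural network that needs no parameter tuning reaches an accuracy in cancer classification close to that of a support vector machine with optimized parameters, and is far faster than the support vector machine, which requires that optimization. When feature selection is applied, the radial basis function neural network gains more accuracy than the support vector machine does.

While studying feature selection methods, we also found that noisy genes affect the radial basis function neural network more than they affect the support vector machine. We therefore propose a new feature selection method, quick radial basis function neural network recursive feature elimination. Our experiments show that this method improves cancer classification accuracy about as much as support vector machine recursive feature elimination does. We also found in the biological literature that genes selected by the proposed method, such as Bcl-xl in lymphoma and CXCL10 in prostate cancer, are indeed related to cancer, and that these genes are hard to select with statistical feature selection methods or with support vector machine recursive feature elimination. We further discuss why different feature selection methods select different genes. We hope that this research offers another possibility for cancer research.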
Human experts hope to use microarray data to determine whether a patient has cancer and to identify genes associated with cancer. However, microarray data contain many features (genes); humans, for example, have more than twenty thousand genes. Discovering patterns in such data is difficult for humans and is also a problem for machine learning methods. We therefore need to rank the genes in microarray data by importance in order to select the informative ones. Such a ranking could help human experts study which genes lead to cancer, and could also help machine learning methods achieve higher accuracy in cancer classification.

In this thesis, we study the impact of feature selection methods on cancer classifiers trained on DNA microarray data sets, with particular attention to the radial basis function network (RBF network). Our experiments show that an RBF network achieves accuracy similar to that of an optimized support vector machine (SVM) in much less computing time. With feature selection, the RBF network improves more in cancer classification accuracy than the SVM does.

During our research on feature selection, we observed that noisy genes affect the RBF network more than they affect the SVM. We therefore propose a feature selection method, QuickRBF-RFE. QuickRBF can rank the importance of genes by itself, and a subset of discriminative genes can then be selected with the recursive feature elimination algorithm. Our experimental results show that QuickRBF-RFE performs similarly to SVM-RFE in cancer classification. Moreover, some of the top genes identified by QuickRBF-RFE, such as Bcl-xl in lymphoma and CXCL10 in prostate cancer, are reported in the biological literature to be associated with cancer, and these genes are difficult to identify with statistical feature selection methods or with SVM-RFE. We also discuss why different feature selection methods select different genes for cancer classification. We hope our research can open a new direction in cancer research.
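
To make the recursive feature elimination idea mentioned above concrete, the following is a minimal Python sketch of the generic elimination loop. It is not the QuickRBF-RFE implementation: the abstract does not state how QuickRBF scores genes, so the ranking criterion here is a stand-in (squared weights of a ridge-regularized linear least-squares classifier, similar in spirit to the weight-based criterion commonly used in SVM-RFE). The names `ridge_weight_scores`, `recursive_feature_elimination`, `alpha`, and `drop_fraction`, as well as the synthetic data, are all assumptions introduced for illustration.

```python
import numpy as np


def ridge_weight_scores(X, y, alpha=1.0):
    """Stand-in ranking criterion (an assumption, NOT QuickRBF's own):
    squared weights of a ridge-regularized least-squares linear classifier.
    The dual form is used so the linear system is n_samples x n_samples,
    which suits microarray data with far more genes than samples."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    n_samples = X.shape[0]
    dual = np.linalg.solve(Xc @ Xc.T + alpha * np.eye(n_samples), yc)
    weights = Xc.T @ dual
    return weights ** 2


def recursive_feature_elimination(X, y, n_keep, score_fn=ridge_weight_scores,
                                  drop_fraction=0.5):
    """Repeatedly re-score the remaining features and drop the weakest ones.
    Returns (survivors, ranking); ranking lists all features, best first."""
    remaining = np.arange(X.shape[1])
    eliminated = []  # feature indices in order of removal (worst first)
    while remaining.size > n_keep:
        scores = score_fn(X[:, remaining], y)
        n_drop = max(1, int(remaining.size * drop_fraction))
        n_drop = min(n_drop, remaining.size - n_keep)  # never go below n_keep
        drop = np.argsort(scores)[:n_drop]  # positions of the lowest scores
        eliminated.extend(remaining[drop].tolist())
        remaining = np.delete(remaining, drop)
    final_scores = score_fn(X[:, remaining], y)
    survivors = remaining[np.argsort(-final_scores)]
    ranking = survivors.tolist() + eliminated[::-1]
    return survivors, ranking


# Synthetic stand-in for a microarray: 200 samples, 30 "genes",
# of which only genes 0 and 1 carry class information.
rng = np.random.default_rng(0)
y = np.where(np.arange(200) % 2 == 0, 1.0, -1.0)
X = rng.normal(size=(200, 30))
X[:, 0] += 3.0 * y
X[:, 1] -= 3.0 * y

survivors, ranking = recursive_feature_elimination(X, y, n_keep=2)
print("selected genes:", sorted(survivors.tolist()))
```

Passing a different `score_fn` swaps the ranking model without touching the elimination loop, which is the sense in which RFE can wrap an SVM, an RBF network, or any other model that can score its input features.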