Support Vector Machines: Coding for Classification Problems and a Regression Approach to Gene Selection

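This thesis has two parts. The first part focuses on using coding to find a low-dimensional linear discriminant feature subspace and examines the equivalence between different codes. Coding converts class labels into multiresponses; these multiresponses are regressed on kernelized data, and the regression coefficients are then used to obtain the low-dimensional linear discriminant subspace. This subspace can be combined with any linear classification method, which keeps the computation simple and fast. This part also proves that the multiresponses produced by any code generate the same low-dimensional linear discriminant subspace, so a given linear classification method yields the same classification result whichever code is used. Results on real data show that, compared with LIBSVM, the proposed classification method achieves similar accuracy but requires less classification time. In the second part, the thesis proposes a gene selection method based on support vector regression. Existing gene selection methods based on microarray data treat every microarray chip alike. However, the chips may come from patients in different disease states, so their association with the disease is not necessarily the same. The chips should therefore be given different weights that represent their association with the disease, and these weights can be estimated by support vector regression. The sum of the weighted expressions is then used to decide which genes are significant. We analyze leukemia and colon cancer data and compare the resulting accuracy with that of other gene selection methods. The results show that the proposed gene selection method can identify significant genes.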
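The pipeline of the first part (code the class labels, regress on kernelized data, take the coefficient subspace, classify linearly) can be illustrated with a minimal NumPy sketch. It is not the thesis's implementation: ridge least squares stands in for the support vector fit, and the toy data, the Gaussian kernel width `gamma`, the penalty `lam`, the two codebooks, and the nearest-centroid classifier are all assumptions chosen for illustration.

```python
import numpy as np


def gaussian_kernel(X, Z, gamma=0.5):
    """Gaussian kernel matrix with entries exp(-gamma * ||x - z||^2)."""
    sq = (X**2).sum(1)[:, None] + (Z**2).sum(1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-gamma * np.maximum(sq, 0.0))


def subspace_basis(K, Y, lam=1e-2):
    """Ridge-regress coded responses Y on kernelized data K (both centered)
    and return an orthonormal basis of the coefficient column space."""
    Kc = K - K.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    B = np.linalg.solve(Kc.T @ Kc + lam * np.eye(K.shape[1]), Kc.T @ Yc)
    U, s, _ = np.linalg.svd(B, full_matrices=False)
    rank = int((s > 1e-8 * s[0]).sum())  # numerical rank of the coefficients
    return U[:, :rank]


# Hypothetical toy data: three well-separated Gaussian blobs in the plane.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
labels = np.repeat(np.arange(3), 30)
X = centers[labels] + 0.5 * rng.standard_normal((90, 2))
K = gaussian_kernel(X, X)

# Two different codes for the same labels (both are illustrative choices).
Y_onehot = np.eye(3)[labels]                               # indicator code
codebook = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
Y_coded = codebook[labels]                                 # 2-dimensional code

U1 = subspace_basis(K, Y_onehot)
U2 = subspace_basis(K, Y_coded)

# Equal orthogonal projectors mean both codes span the same subspace.
same_subspace = np.allclose(U1 @ U1.T, U2 @ U2.T, atol=1e-6)

# Any linear classifier can now run in the subspace; nearest centroid is used here.
Z = (K - K.mean(axis=0)) @ U1
centroids = np.stack([Z[labels == j].mean(axis=0) for j in range(3)])
pred = np.argmin(((Z[:, None, :] - centroids[None, :, :]) ** 2).sum(-1), axis=1)
train_accuracy = (pred == labels).mean()
```

In this least-squares setting the two centered response matrices differ only by a linear map, so their coefficient matrices share a column space, which is what `same_subspace` checks. With three classes the subspace has dimension two, so the final classifier works on two features per observation.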

Keywords

coding; gene selection; kernelization; linear discriminant subspace; microarray data; support vector machine; support vector regression

Parallel Abstract

This thesis contains two major themes: multiclass support vector machines, and support vector regression for gene selection. In the first part, we propose a regression approach to multiclass support vector classification. We bring several existing coding schemes into support vector classification by coding the class labels into multivariate responses. Regressing these multivariate responses on kernelized input data extracts a low-dimensional feature subspace for discrimination. We unify the coding schemes by showing that they are equivalent in the sense that they lead to the same low-dimensional discriminant feature subspace. Classification is then carried out in this subspace using a linear discriminant algorithm, which can be any reasonable choice. Combining the regression-based subspace extraction with a user-specified linear algorithm yields a simple yet powerful toolkit for multiclass support vector classification. Issues of encoding and decoding and the notion of equivalence of codes are discussed. Experimental results on prediction ability and CPU time show that our approach is a competent alternative for the multiclass support vector machine problem. In the second part, we propose a support vector regression approach to gene selection and use the selected genes for disease classification. Current gene selection methods based on microarray data give every subject equal weight with respect to the disease of interest. However, tissues collected from different patients can come from different disease stages and may differ in their strength of association with the disease. To reflect this, the proposed method accounts for subject variation by assigning different weights to subjects. The weights are calculated via support vector regression. Significant genes are then selected based on the cumulative sum of weighted expressions. The proposed gene selection procedure is illustrated and evaluated using acute leukemia and colon cancer data. Its results and performance are compared with those of four other approaches in terms of classification accuracy.
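The subject-weighting idea of the second part can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the thesis's procedure: a least-squares (ridge) dual fit stands in for support vector regression, the expression matrix is simulated, and the names `strength`, `lam`, and `n_informative` are hypothetical. The dual coefficients `alpha` play the role of per-subject weights, and each gene's score is the absolute value of its weighted sum of expressions across subjects.

```python
import numpy as np

# Simulated microarray-like data (hypothetical): rows are subjects, columns are genes.
rng = np.random.default_rng(1)
n_subjects, n_genes, n_informative = 40, 200, 5
y = np.repeat([-1.0, 1.0], n_subjects // 2)          # disease status per subject
X = rng.standard_normal((n_subjects, n_genes))

# Assumption: the first few genes track disease status, with a strength
# that varies from subject to subject (e.g. different disease stages).
strength = rng.uniform(0.5, 1.5, size=n_subjects)
X[:, :n_informative] += 2.0 * (y * strength)[:, None]

# Center genes and response so that no intercept term is needed.
Xc = X - X.mean(axis=0)
yc = y - y.mean()

# Dual ridge fit with a linear kernel: one coefficient (weight) per subject.
lam = 1.0
alpha = np.linalg.solve(Xc @ Xc.T + lam * np.eye(n_subjects), yc)

# Score each gene by the absolute sum of its subject-weighted expressions.
scores = np.abs(alpha @ Xc)

# Keep the highest-scoring genes as candidate significant genes.
selected = np.sort(np.argsort(scores)[::-1][:n_informative])
```

With a linear kernel, the weighted sum `alpha @ Xc` coincides with the primal ridge coefficient vector, so the gene score can be read either as a regression coefficient or as a sum of weighted expressions over subjects. The selected genes would then feed a classifier, and accuracy would be compared across selection methods, as the abstract describes.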

Parallel Keywords

coding; gene selection; kernel; linear discriminant subspace; machine learning; microarray data analysis; support vector machine; support vector regression
