
應用混合式數據引力分類演算法於基因表現資料分類問題之研究

A hybrid data gravitation based classification algorithm applied to gene expression data

Advisor: 葉維彰

Abstract


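Classification holds an important place in machine learning and data mining research. The basic idea is to use a machine learning algorithm to build a classifier from training data and then use it to predict the class of test data. Some classification algorithms are now widely applied to everyday problems such as spam detection, handwriting recognition, and bioinformatics. In recent years, applications of machine learning to bioinformatics have drawn particular attention. Gene expression data play an indispensable role in diagnosing and preventing disease, and can even lead to the discovery of treatments, so their importance keeps growing. However, the characteristics of gene expression data make classification very challenging, mainly because the number of samples is limited and the number of features is very large. In the past few years, researchers have proposed the data gravitation based classification algorithm, which draws on Newton's law of universal gravitation. The algorithm includes a mechanism for training feature weights, so that a search for the best feature weights can raise classification accuracy. In this study, we design a classification model for gene expression data based on the data gravitation based classification algorithm. We first use analysis of variance as a feature filter to remove redundant genes. The remaining genes are then used to train our data gravitation classification model, and an improved Simplified Swarm Optimization algorithm is used to optimize the feature weights. Finally, we compare and discuss the proposed algorithm against methods from the literature; the results show that the proposed algorithm handles the classification of gene expression data effectively.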
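The filtering step described in this abstract can be sketched as follows. This is a minimal illustration, not the thesis implementation: it assumes the filter ranks genes by the one-way ANOVA F statistic and keeps the top m genes. The abstract does not state the selection rule or threshold, and the names `anova_f` and `select_top_genes` are hypothetical.

```python
from statistics import mean


def anova_f(values, labels):
    """One-way ANOVA F statistic for one gene across class labels."""
    groups = {}
    for v, c in zip(values, labels):
        groups.setdefault(c, []).append(v)
    n, k = len(values), len(groups)
    if k < 2 or n <= k:
        raise ValueError("need at least 2 classes and more samples than classes")
    grand = mean(values)
    ssb = ssw = 0.0  # between-class and within-class sums of squares
    for g in groups.values():
        mu = mean(g)
        ssb += len(g) * (mu - grand) ** 2
        ssw += sum((v - mu) ** 2 for v in g)
    if ssw == 0:
        # No within-class variation: perfectly separating or constant gene.
        return float("inf") if ssb > 0 else 0.0
    return (ssb / (k - 1)) / (ssw / (n - k))


def select_top_genes(X, y, m):
    """Return the column indices of the m genes with the largest F.

    X is a list of samples (rows); each row lists one value per gene.
    Keeping a fixed number m is an assumption of this sketch.
    """
    scores = [anova_f([row[j] for row in X], y) for j in range(len(X[0]))]
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:m])
```

A gene whose class means differ strongly relative to its within-class spread gets a large F and survives the filter; a gene with identical class means gets F = 0 and is dropped first.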

Parallel Abstract


One important application of gene expression profiling in medicine is clinical decision support: diagnosing disease and predicting clinical outcomes in response to treatment. Disease prediction and diagnosis have become popular topics in machine learning, and the classification of gene expression data has attracted considerable research interest in recent years. The main challenges are the small number of samples and the high dimensionality of each sample. The data gravitation based classification (DGC) model is a relatively recent classification algorithm that performs well on many classification problems. One property of DGC is particularly useful for gene expression data: its feature weighting procedure, which measures the importance of each feature by assigning it a weight. In this study, we design a classifier, k-DGC, based on the basic DGC model, for gene selection and classification of gene expression data. We first use ANOVA as a filter to quickly reduce the number of genes. We then apply the proposed k-DGC model, which builds on the concept of K-Nearest Neighbor (KNN), and use the improved Simplified Swarm Optimization algorithm (iSSO) to optimize the feature weights. Leave-one-out cross-validation (LOOCV) is used to evaluate the k-DGC model. We compare k-DGC with methods from previous research on ten gene expression datasets from GEMS. The experimental results show that our method is effective for gene expression data classification problems.
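To make the combination of feature weights, nearest neighbors, and LOOCV concrete, here is a rough sketch. The abstract does not give the k-DGC formulas, so everything specific here is an assumption: each training sample is treated as a unit-mass particle, gravitation follows an inverse-square law on a feature-weighted distance, and only the k nearest particles contribute. The function names are hypothetical, and the iSSO weight search is not shown; the weight vector `w` is simply passed in.

```python
def weighted_sq_dist(a, b, w):
    """Squared distance with one non-negative weight per feature."""
    return sum(wi * (ai - bi) ** 2 for ai, bi, wi in zip(a, b, w))


def predict(train_X, train_y, x, w, k=3, eps=1e-12):
    """Assign x to the class exerting the largest total gravitation.

    Assumptions of this sketch: unit-mass particles, and only the
    k nearest training samples pull on x.
    """
    dists = sorted(
        ((weighted_sq_dist(t, x, w), c) for t, c in zip(train_X, train_y)),
        key=lambda pair: pair[0],
    )
    pull = {}
    for d2, c in dists[:k]:
        # A unit mass at squared distance d2 pulls with force 1 / d2;
        # eps avoids division by zero for coincident points.
        pull[c] = pull.get(c, 0.0) + 1.0 / (d2 + eps)
    return max(pull, key=pull.get)


def loocv_accuracy(X, y, w, k=3):
    """Leave-one-out cross-validation accuracy for a weight vector w."""
    hits = 0
    for i in range(len(X)):
        # Hold out sample i and classify it from all the others.
        rest_X = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        hits += predict(rest_X, rest_y, X[i], w, k) == y[i]
    return hits / len(X)


# Toy data: feature 0 separates the classes, feature 1 is noise.
X = [[0.0, 9.0], [0.2, 1.0], [0.1, 5.0], [1.0, 8.8], [1.1, 1.2], [0.9, 5.1]]
y = [0, 0, 0, 1, 1, 1]
good = loocv_accuracy(X, y, [1.0, 0.0])  # weight on the informative feature
bad = loocv_accuracy(X, y, [0.0, 1.0])   # weight on the noisy feature
```

In this toy setting, a weight optimizer would use `loocv_accuracy` as its fitness function: weight vectors that emphasize informative genes score higher than those that emphasize noisy ones.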

