On Feature Selection of High Dimensional Data-Application on Classifying Proteomic Spectra Data

Abstract


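The tumor markers used in routine health examinations have low sensitivity and specificity and cannot detect small tumors, so tumors usually cannot be diagnosed early. The data in this study are serum protein mass spectra obtained with protein chips and surface-enhanced laser desorption/ionization time-of-flight mass spectrometry (SELDI). The serum samples come from healthy individuals and from three groups of prostate cancer patients at different stages. The aim of the study is to select protein features that help distinguish the different stages of prostate cancer. Using repeated random subsampling cross-validation and the Support Vector Machine, all protein features are first ranked by the mean p-value of the t test, the mean p-value of the Kruskal-Wallis test, or the mean misclassification rate; forward selection is then used to find the feature variables of the model with the minimum misclassification rate. To make the model more parsimonious, the study also considers the classification results obtained from features that are further extracted using the correlation coefficient and the coefficient of determination. In the comparison of methods, feature selection based on the minimum p-value of the Kruskal-Wallis test gives the better classification performance, and among the auxiliary extraction methods the maximum-correlation method reduces the number of features most effectively while maintaining classification performance.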
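The ranking step described above (ordering features by their mean Kruskal-Wallis p-value over repeated random subsamples) can be illustrated with a minimal, dependency-free sketch. It is not the thesis's implementation. The function names, the number of repeats, the 80% subsample fraction, the stratified sampling, and the synthetic data are all assumptions made for illustration. The p-value uses the usual chi-square approximation, written in closed form for exactly four groups (3 degrees of freedom); in practice a library routine such as `scipy.stats.kruskal` would normally be used instead.

```python
import math
import random


def midranks(values):
    """Return 1-based ranks (ties get their average rank) and the tie-group sizes."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    tie_sizes = []
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        tie_sizes.append(j - i + 1)
        i = j + 1
    return ranks, tie_sizes


def kruskal_p_four_groups(groups):
    """Kruskal-Wallis p-value for exactly four groups (chi-square approximation, df = 3)."""
    if len(groups) != 4:
        raise ValueError("this sketch only handles four groups")
    pooled = [x for g in groups for x in g]
    n = len(pooled)
    ranks, ties = midranks(pooled)
    h, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        h += rank_sum * rank_sum / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * h - 3 * (n + 1)
    correction = 1 - sum(t ** 3 - t for t in ties) / (n ** 3 - n)
    if correction == 0:
        return 1.0  # every value is identical: no evidence of a difference
    h = max(h / correction, 0.0)
    # Closed-form survival function of the chi-square distribution with 3 df.
    return math.erfc(math.sqrt(h / 2)) + math.sqrt(2 * h / math.pi) * math.exp(-h / 2)


def rank_by_mean_p(X, y, n_repeats=20, train_frac=0.8, seed=0):
    """Rank feature indices by mean p-value over repeated stratified subsamples.

    X is a list of samples (each a list of feature values); y holds four class labels.
    Assumes every class has at least two samples.
    """
    rng = random.Random(seed)
    n_features = len(X[0])
    by_class = {c: [i for i, lab in enumerate(y) if lab == c] for c in sorted(set(y))}
    totals = [0.0] * n_features
    for _ in range(n_repeats):
        picked = {c: rng.sample(idx, max(2, int(train_frac * len(idx))))
                  for c, idx in by_class.items()}
        for f in range(n_features):
            groups = [[X[i][f] for i in picked[c]] for c in by_class]
            totals[f] += kruskal_p_four_groups(groups)
    mean_p = [t / n_repeats for t in totals]
    order = sorted(range(n_features), key=lambda f: mean_p[f])
    return order, mean_p


# Synthetic stand-in data (not the SELDI spectra): only feature 2 depends on the class.
rng = random.Random(1)
y = [c for c in range(4) for _ in range(15)]
X = [[rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(2.0 * c, 1), rng.gauss(0, 1)] for c in y]
order, mean_p = rank_by_mean_p(X, y)
print(order, mean_p)
```

Averaging the p-value over many subsamples, rather than testing once on the full data, is meant to make the ranking less sensitive to any single split of the samples.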

Parallel Abstract


The tumor markers used in regular health evaluations are often low in sensitivity and specificity, so they cannot detect small tumors in time. This research aims to develop a classification tool for the early diagnosis of tumors by studying proteomic mass spectra of prostate cancer at different stages. The data studied are Surface-Enhanced Laser Desorption/Ionization Time-of-Flight Mass Spectrometry (SELDI-TOF-MS) spectra generated from 327 serum samples. Of the 327 serum samples, 81 are from unaffected healthy men (HM), 78 are from patients diagnosed with benign prostatic hyperplasia (BPH), 84 are from patients with organ-confined PCA (T1/T2), and 84 are from patients with non-organ-confined PCA (T3/T4). The goal of this research is to select features (peaks) of the mass spectra that are useful for classifying different stages of prostate cancer via repeated random subsampling cross-validation. Two approaches combined with SVM are proposed in this study: the forward minimum-p-value method (derived from the t test or the Kruskal-Wallis test) and the forward minimum-classification-error method. In addition, the maximum-correlation method and the maximum-R² method are considered for further feature selection. In comparison, the forward minimum-p-value method derived from the Kruskal-Wallis test often outperforms the other methods in terms of classification rate. Moreover, the maximum-correlation method not only reduces the number of features effectively but also preserves the classification rate.
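The forward step can be sketched in the same spirit: features are added one at a time in ranked order, each candidate model is scored by repeated random subsampling validation, and the model with the lowest mean error is kept. This is one plausible reading of the procedure, not the thesis's code. The function names, split fraction, repeat count, and synthetic data are assumptions. To stay dependency-free, the sketch swaps in a nearest-centroid classifier where the study uses an SVM; any function with the same signature (for example a wrapper around an SVM library) could be passed as `error_fn`.

```python
import random


def nearest_centroid_error(X_train, y_train, X_test, y_test):
    """Stand-in classifier (NOT the study's SVM): nearest-centroid misclassification rate."""
    sums, counts = {}, {}
    for row, lab in zip(X_train, y_train):
        acc = sums.setdefault(lab, [0.0] * len(row))
        for j, v in enumerate(row):
            acc[j] += v
        counts[lab] = counts.get(lab, 0) + 1
    centroids = {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}
    wrong = 0
    for row, lab in zip(X_test, y_test):
        pred = min(centroids,
                   key=lambda c: sum((a - b) ** 2 for a, b in zip(row, centroids[c])))
        wrong += pred != lab
    return wrong / len(y_test)


def mean_cv_error(X, y, features, error_fn, n_repeats=20, test_frac=0.3, seed=0):
    """Mean test error over repeated random train/test splits, using only `features`."""
    rng = random.Random(seed)  # same seed per call, so every model sees the same splits
    idx = list(range(len(y)))
    n_test = max(1, int(test_frac * len(idx)))

    def cols(rows):
        return [[X[i][f] for f in features] for i in rows]

    total = 0.0
    for _ in range(n_repeats):
        rng.shuffle(idx)
        test, train = idx[:n_test], idx[n_test:]
        total += error_fn(cols(train), [y[i] for i in train],
                          cols(test), [y[i] for i in test])
    return total / n_repeats


def forward_select(X, y, ranked, error_fn, max_features=None):
    """Add features in ranked order and keep the prefix with the lowest mean error."""
    best_k, best_err, errors = 0, float("inf"), []
    for k in range(1, (max_features or len(ranked)) + 1):
        err = mean_cv_error(X, y, ranked[:k], error_fn)
        errors.append(err)
        if err < best_err:
            best_k, best_err = k, err
    return ranked[:best_k], best_err, errors


# Synthetic stand-in data (not the SELDI spectra): features 0 and 1 together
# separate the four classes; features 2 and 3 are pure noise.
rng = random.Random(2)
y = [c for c in range(4) for _ in range(20)]
X = [[rng.gauss(4.0 * (c // 2), 1), rng.gauss(4.0 * (c % 2), 1),
      rng.gauss(0, 1), rng.gauss(0, 1)] for c in y]
ranked = [0, 1, 2, 3]  # hypothetical ranking, e.g. from a mean p-value ordering
selected, best_err, errors = forward_select(X, y, ranked, nearest_centroid_error)
print(selected, round(best_err, 3), [round(e, 3) for e in errors])
```

Reusing the same random splits for every candidate model keeps the comparison between model sizes paired, so differences in mean error reflect the added feature rather than a different draw of test samples.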

