  • Thesis

Feature Selection on Cross-laboratory Prostate Cancer RNA-sequencing Data

Advisor: Kun-Mao Chao (趙坤茂)

Abstract


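Over the past few years, RNA-sequencing has become an indispensable tool in transcriptomics research. Because RNA-sequencing experiments are very costly, researchers often lack enough samples to carry out more sophisticated studies of significant differential gene expression. Samples produced by different laboratories differ considerably because of differences in laboratory environments, so few studies integrate data from multiple laboratories into a larger dataset. This study investigates feature selection on cross-laboratory data. The experiments use four prostate cancer datasets from different laboratories and apply rank-based normalization to reduce the differences between laboratories. We first combine three datasets into a training set and use the remaining dataset as the testing set. We then use the Random Forest algorithm to identify genes with significant differential expression in the training set, and use a support vector machine to build a classification model from the training set restricted to those genes. This model is then used to predict the classes of the testing set, and the classification accuracy is compared before and after rank-based normalization. The results show that rank-based normalization effectively improves classification accuracy on the testing set, and that rank-based normalization combined with Random Forest also outperforms Cuffdiff. Beyond the differences in normalization and feature selection algorithms, the sequencing machine is also an important factor affecting the results. Newer machines provide more stable and accurate data, leading to higher classification accuracy.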
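The rank-based normalization mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the exact variant used, so the choices here are assumptions: ranks are computed within each sample, ties receive their average rank, and ranks are divided by the number of genes. The function name `rank_normalize` is hypothetical.

```python
def rank_normalize(sample):
    """Replace each expression value in one sample with its average rank
    divided by the number of genes (assumed variant, not the thesis's code)."""
    n = len(sample)
    # Gene indices sorted by ascending expression value.
    order = sorted(range(n), key=lambda i: sample[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of tied values starting at sorted position i.
        while j + 1 < n and sample[order[j + 1]] == sample[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return [r / n for r in ranks]


# Two samples with the same gene ordering but very different scales,
# as might come from two laboratories, map to identical values.
lab_a = rank_normalize([10.0, 200.0, 10.0, 5.0])
lab_b = rank_normalize([1.0, 20.0, 1.0, 0.5])
```

Because only the ordering of genes within a sample is kept, any scale difference between laboratories that preserves that ordering disappears after the transform.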

Parallel Abstract


Over the past few years, RNA-sequencing has become a revolutionary tool for transcriptomics analysis. Because RNA-sequencing experiments are expensive, researchers often lack enough samples to conduct a comprehensive differential gene expression analysis. Few studies integrate cross-laboratory datasets into one large dataset, because differing experimental procedures introduce laboratory-specific bias. In this study, we investigate cross-laboratory feature selection. We consider four prostate cancer RNA-seq datasets from different laboratories or platforms and use rank-based normalization to reduce the bias among them. In our experiments, we combine three datasets into a training set and treat the remaining dataset as the testing set. Random Forest is applied to select differentially expressed genes from the training set. We then train a support vector machine classifier on the training set restricted to those genes, and use the model to predict the classes of the testing set restricted to the same gene list. Predictions are evaluated by balanced accuracy, which is not skewed by class imbalance. The results show that rank-based normalization improves the performance of cross-laboratory feature selection, and that Random Forest combined with rank-based normalization outperforms a well-known tool, Cuffdiff. In addition, we discuss the influence of the sequencing platform: the sequencing machine is also an important factor affecting the performance of feature selection on cross-laboratory RNA-seq datasets.
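The evaluation design described above (train on three datasets, test on the fourth, score by balanced accuracy) can be sketched as follows. This is an illustration under stated assumptions, not the thesis's implementation: the dataset names and sample values are placeholders, rotating the held-out dataset through all four is an assumption, and the Random Forest and support vector machine steps are omitted so that the block stays dependency-free.

```python
def leave_one_lab_out(datasets):
    """Yield (held_out_name, train_samples, test_samples) per dataset.
    `datasets` maps a (hypothetical) laboratory name to its samples."""
    for held_out in datasets:
        train = [
            sample
            for name, samples in datasets.items()
            if name != held_out
            for sample in samples
        ]
        yield held_out, train, list(datasets[held_out])


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        hits = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


# Placeholder data: four laboratories with integer sample IDs.
splits = list(leave_one_lab_out({"A": [1, 2], "B": [3], "C": [4, 5], "D": [6]}))

# A classifier that always predicts "tumor" on an imbalanced test set.
y_true = ["tumor"] * 8 + ["normal"] * 2
y_pred = ["tumor"] * 10
plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
balanced = balanced_accuracy(y_true, y_pred)
```

In the toy example, plain accuracy rewards the always-"tumor" classifier for the majority class, while balanced accuracy falls to chance level, which is why it suits test sets with unequal class sizes.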
