Analyzing Gene Expression Sequencing Data with a Sparse Negative Binomial Linear Discriminant Classifier

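In recent years, with the rise of next-generation sequencing, RNA sequencing (RNA-seq) has gradually replaced DNA microarrays as the mainstream method for profiling gene expression because of its higher accuracy. If a patient's characteristics can be classified effectively from RNA-seq data, the information available for medical diagnosis improves accordingly. Most existing statistical methods, however, assume a continuous distribution, typically the normal distribution. Because the two technologies measure expression differently, RNA-seq and DNA microarray expression values do not share the same properties. RNA-seq measurements are non-negative integers and are usually modeled with a Poisson or negative binomial distribution, whereas microarray measurements are continuous and are generally modeled with a normal distribution. Analysis methods built on the Poisson or negative binomial distribution are therefore a need that cannot be ignored. Witten (2011) replaced the normal assumption in linear discriminant analysis with a Poisson assumption. Under the Poisson distribution, however, the population variance must equal the population mean, which does not reflect the biological characteristics of RNA-seq data well. Dong (2016) extended the method of Witten (2011) by replacing the Poisson assumption with a negative binomial one, which makes the variance assumption more flexible. In sequencing data the number of variables is usually far larger than the number of samples, so a variable selection mechanism becomes especially important, yet the algorithm of Dong (2016) cannot perform variable selection. We believe that an analysis method that is based on the negative binomial assumption and also selects variables can improve classification results. In this thesis we propose a negative binomial linear discriminant analysis for classifying RNA-seq data, with parameters estimated through a generalized linear model. The classifier is derived from Bayes' theorem and the negative binomial distribution. Simulation results show that combining the negative binomial assumption with the variable selection mechanism improves classification. We also analyze a real data set to show how the method behaves in practice. Based on these comparisons, we conclude that the proposed method is effective for classifying RNA-seq data.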

Keywords

negative binomial distribution; linear discriminant; gene expression sequencing

Parallel Abstract

In recent years, RNA sequencing (RNA-seq) has become a powerful technology for characterizing the gene expression profiles of organisms, owing to the capabilities of next-generation sequencing and its better accuracy compared to microarrays. Classification of gene expression profiles is a promising approach to diagnosis and prognostic prediction for patients. Most of the statistical methods developed for microarray data assume continuous measurements, typically a normal distribution. Since RNA-seq produces count data rather than the continuous measurements of microarrays, it is necessary to develop methods suited to the specific properties of RNA-seq data. Witten (2011) proposed a Poisson linear discriminant analysis for RNA-seq data. The Poisson assumption forces the variance to equal the mean, which may not be appropriate for real medical samples. Dong (2016) proposed a negative binomial linear discriminant analysis to address this problem. However, in sequencing data the number of features is usually large relative to the number of samples, and the algorithm of Dong (2016) cannot achieve sparsity. We believe that a linear discriminant analysis based on the negative binomial assumption with a variable selection mechanism can improve classification performance. In this thesis, we propose a negative binomial linear discriminant analysis under the generalized linear model framework for RNA-seq data. The classifier is constructed according to the Bayes rule by fitting a negative binomial model. Simulation results show that the model assumption and the feature selection mechanism in our method improve classifier performance. We also demonstrate the advantages of our method by analyzing a real RNA-seq data set. Based on these comparisons, our proposed classifier can serve as an effective tool for RNA-seq data classification.
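As a rough illustration of the idea described above, the following sketch applies the Bayes rule with a negative binomial likelihood to a vector of gene counts. It is not the thesis's estimator: the function names `nb_log_pmf` and `classify`, the parameterization (mean `mu`, dispersion `phi`, variance `mu + phi * mu**2`), the independence of genes, and the toy parameter values are all assumptions made for the example. Parameter estimation through a generalized linear model and the sparsity mechanism are omitted.

```python
import math


def nb_log_pmf(x, mu, phi):
    """Log-probability of count x under a negative binomial distribution.

    Assumed parameterization: mean mu, dispersion phi, so that
    variance = mu + phi * mu**2 (phi -> 0 approaches the Poisson case).
    """
    r = 1.0 / phi  # "size" parameter of the negative binomial
    return (
        math.lgamma(x + r) - math.lgamma(r) - math.lgamma(x + 1)
        + r * math.log(r / (r + mu))
        + x * math.log(mu / (r + mu))
    )


def classify(counts, class_means, dispersions, priors):
    """Return the index of the class with the largest log posterior.

    Genes are treated as independent given the class (an assumption of
    this sketch), so the log-likelihood is a sum over genes.
    """
    scores = []
    for means, prior in zip(class_means, priors):
        score = math.log(prior)
        for x, mu, phi in zip(counts, means, dispersions):
            score += nb_log_pmf(x, mu, phi)
        scores.append(score)
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical toy parameters: two classes, three genes.
class_means = [[5.0, 50.0, 20.0], [40.0, 8.0, 20.0]]
dispersions = [0.2, 0.2, 0.2]
priors = [0.5, 0.5]

label = classify([4, 61, 18], class_means, dispersions, priors)
```

In this toy setup the third gene has the same mean in both classes, so it contributes equally to both scores and never changes the decision. A variable selection mechanism of the kind the abstract argues for would aim to drop such uninformative genes.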

Parallel Keywords

RNA-seq; generalized linear model; linear discriminant analysis

References

Cortes, Corinna, and Vladimir Vapnik. "Support-vector networks." Machine learning 20.3 (1995): 273-297.

Dong, Kai, et al. "NBLDA: negative binomial linear discriminant analysis for RNA-Seq data." BMC bioinformatics 17.1 (2016): 369.

Dillies, Marie-Agnès, et al. "A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis." Briefings in bioinformatics 14.6 (2013): 671-683.

Dudoit, Sandrine, Jane Fridlyand, and Terence P. Speed. "Comparison of discrimination methods for the classification of tumors using gene expression data." Journal of the American statistical association 97.457 (2002): 77-87.

Hardcastle, Thomas J., and Krystyna A. Kelly. "baySeq: empirical Bayesian methods for identifying differential expression in sequence count data." BMC bioinformatics 11.1 (2010): 422.
