
A Study Evaluating the Performance of Gene-Set Analysis Methods on Real Gene Data under Non-Normal Scenarios

Statistical Evaluation for Methods of Gene-set Analysis with Multivariate Non-normal Scenarios

Advisor: 蕭朱杏

Abstract


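With advances in technology, a growing number of statistical methods can help researchers identify disease-causing biological pathways or genes, so evaluating these methods and choosing among them efficiently for follow-up research has become a key step. In past studies, researchers have mostly used multivariate (log-)normal distributions to evaluate these statistical methods, yet whether this approach is appropriate remains highly debatable. The first part of this thesis therefore focuses on mRNA gene expression data: we collect datasets for various diseases and the related biological pathway information from public websites, and select four normality tests (Mardia's test, Henze-Zirkler's test, Royston's test, and the one-sample energy test) to examine whether these normalized datasets satisfy the normality assumption. The results of this part show that normalized real gene data are very likely not to follow a multivariate normal distribution. The second part of the thesis therefore selects five commonly used gene-set analysis methods (Hotelling's T², Global test, GlobalANCOVA, Energy test (N-statistic), and GSEA (Category)) and examines their performance under non-normal scenarios. To do so, we use the multivariate t distribution and mixtures of multivariate normal distributions to design a series of multivariate non-normal scenarios, and judge the methods by their robustness under these scenarios. The experimental results show that although most of the statistical methods perform poorly under non-normal scenarios, Hotelling's T² still performs well and stably in certain specific cases of the different scenarios; these results cannot be obtained from traditional multivariate normal simulations. In summary, to obtain more reliable and more accurate evaluation information, we suggest that future researchers include some non-normal scenarios and other fine-grained settings at the simulation stage, and use overall robustness across scenarios to evaluate these methods. Finally, this thesis uses radar plots to consolidate the information above, giving researchers a clearer visual way to understand the performance of these methods.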

Parallel Abstract


As technology improves, more and more statistical methods for gene-set analysis (GSA) are being developed to find pathogenic pathways and genes, so choosing a suitable method for further analysis has become a critical step. In recent years, many simulation studies have used the multivariate normal or multivariate lognormal distribution to evaluate the performance of these GSA methods. However, the normality assumption for gene expression data is questionable. The first part of our study therefore focuses on the normality of mRNA gene expression data. We first collect the gene expression data for each cancer subtype, together with the corresponding pathway information, from public websites. We then apply four normality tests (Mardia's test, Henze-Zirkler's test, Royston's test, and the one-sample energy test) to these real data; the results show that the normalized gene expression data are very likely not multivariate normally distributed. In the second part of our study, we therefore consider five GSA methods (Hotelling's T², Global test, GlobalANCOVA, Energy test, and GSEA (Category)) in several multivariate non-normal scenarios (including the multivariate t distribution and mixtures of multivariate normal distributions) to compare the performance and robustness of these methods. Our experiments indicate that although most of these GSA methods perform very poorly under the non-normal scenarios, Hotelling's T² still performs consistently well across different scenarios under certain specific settings. These results cannot be obtained from traditional multivariate normal simulations. In summary, to obtain more reliable and accurate information, we suggest that researchers add non-normal scenarios and other settings to their simulation studies and then use robustness to evaluate these methods. Finally, these results are presented as radar plots to visualize all the experimental outcomes more clearly.
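To make one of the non-normal scenarios concrete, the following is a minimal sketch of a single simulated comparison: two groups are drawn from a multivariate t distribution and compared with a two-sample Hotelling's T² statistic. Everything specific here is an assumption for illustration and not taken from the thesis: the identity scale matrix, the sample sizes, the degrees of freedom, the mean shift, and the helper names (`rmvt`, `hotelling_t2`, `permutation_pvalue`). The p-value comes from a label-permutation scheme rather than the usual F reference distribution, which keeps the sketch free of third-party libraries.

```python
import math
import random


def rmvt(n, p, df, shift, rng):
    """Draw n points from a p-variate t distribution with identity scale matrix.

    Each point is shift + z * sqrt(df / w), with z standard normal and
    w chi-square(df), which is Gamma(shape=df/2, scale=2).
    """
    rows = []
    for _ in range(n):
        w = rng.gammavariate(df / 2.0, 2.0)
        scale = math.sqrt(df / w)
        rows.append([shift[j] + scale * rng.gauss(0.0, 1.0) for j in range(p)])
    return rows


def mean_vector(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def scatter(rows, mean):
    """Sum of outer products of the centered rows (a p x p matrix)."""
    p = len(mean)
    s = [[0.0] * p for _ in range(p)]
    for r in rows:
        d = [r[j] - mean[j] for j in range(p)]
        for a in range(p):
            for b in range(p):
                s[a][b] += d[a] * d[b]
    return s


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def hotelling_t2(x, y):
    """Two-sample Hotelling's T^2 with a pooled covariance estimate."""
    n1, n2 = len(x), len(y)
    mx, my = mean_vector(x), mean_vector(y)
    sx, sy = scatter(x, mx), scatter(y, my)
    p = len(mx)
    pooled = [[(sx[a][b] + sy[a][b]) / (n1 + n2 - 2) for b in range(p)]
              for a in range(p)]
    d = [mx[j] - my[j] for j in range(p)]
    w = solve(pooled, d)  # pooled^{-1} d without forming the inverse
    return (n1 * n2) / (n1 + n2) * sum(d[j] * w[j] for j in range(p))


def permutation_pvalue(x, y, n_perm, rng):
    """Share of label permutations whose T^2 is at least the observed one."""
    observed = hotelling_t2(x, y)
    combined = x + y  # new list; x and y themselves are not reordered
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(combined)
        if hotelling_t2(combined[:len(x)], combined[len(x):]) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Illustrative settings only (assumed, not the thesis's design).
rng = random.Random(2024)
p, df = 5, 3
x = rmvt(20, p, df, [0.0] * p, rng)
y = rmvt(20, p, df, [1.5] * p, rng)  # shifted location: a simulated effect
t2 = hotelling_t2(x, y)
pval = permutation_pvalue(x, y, 199, rng)
print(f"T^2 = {t2:.2f}, permutation p-value = {pval:.3f}")
```

Repeating such a comparison many times with no shift between the groups gives an empirical type I error rate, and repeating it with a shift gives empirical power. Comparing those rates across scenarios is one simple way to express the kind of robustness the abstract refers to.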

