
以距離統計量為基準的基因選取方法及基因富集分析

Methods based on distance statistics for detection of differentially expressed genes and gene set enrichment analysis

Advisor: 蔡政安

Abstract


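The first part of this thesis addresses the detection of individual differentially expressed genes. Statistical methods such as the t-test and SAM treat each gene as independent and test it separately; because they ignore the correlation between genes, the test results may be biased. A new statistic called the OR value was recently proposed to address this. Its advantages are that it requires neither model assumptions nor parameter estimation, and that it uses the Euclidean distance to account for the relationship between the gene under test and all other genes, as well as the overall dispersion of the data. In this thesis, gene expression data are simulated from multivariate normal, multivariate t, and mixture distributions. The OR value is then used to test whether each gene is differentially expressed, and the results are compared with the t-test and with methods that do not use the OR value. The weighted quantile difference method based on the OR value performs well in all settings. Its power is high and its false discovery rate is lower in particular under the multivariate t distribution with highly correlated genes, and under the mixture of two multivariate normal distributions when the difference between their shifts is greater than 0. The second part concerns gene set analysis under the self-contained hypothesis. The tests adapt the statistics from the first part, which quantify differences in gene expression, to gene set analysis, and they are compared with commonly used gene set analysis methods. Only under the multivariate t distribution do the distance-based methods (the sum of quantile differences, the sum of weighted quantile differences, and the energy test) show clearly higher power than the other methods; in the other settings no method has a particular advantage. The third part applies individual-gene and gene set analysis to a real dataset from a group of breast cancer patients and compares the results with those of other methods. In summary, the OR value is worth considering for the analysis of individual differentially expressed genes, whereas gene set analysis may still require a more robust statistic.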
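The abstract judges per-gene methods by power and false discovery rate. The sketch below shows one common way such a simulation could be scored: adjust per-gene p-values with the Benjamini-Hochberg step-up procedure, call genes at a chosen level, and compare the calls with the known truth. The abstract does not say which multiple-testing procedure or threshold was used, so the choice of Benjamini-Hochberg, the function names, and the toy numbers are all assumptions made for illustration.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def power_and_fdr(called, truly_de):
    """Empirical power and false discovery proportion for one simulated dataset.

    called   -- indices of genes declared differentially expressed
    truly_de -- indices of genes simulated as differentially expressed
    """
    called, truly_de = set(called), set(truly_de)
    true_positives = len(called & truly_de)
    power = true_positives / len(truly_de) if truly_de else 0.0
    fdr = (len(called) - true_positives) / len(called) if called else 0.0
    return power, fdr


if __name__ == "__main__":
    # Hypothetical p-values for six genes; genes 0 and 1 are truly DE.
    pvals = [0.001, 0.004, 0.03, 0.2, 0.5, 0.9]
    adj = benjamini_hochberg(pvals)
    calls = [i for i, q in enumerate(adj) if q <= 0.05]
    print(calls, power_and_fdr(calls, truly_de=[0, 1]))
```

Averaging the two returned values over many simulated datasets gives the kind of power and false discovery rate estimates that the comparison between methods relies on.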

Parallel Abstract


The first part of this thesis studies the detection of differentially expressed genes. Statistical methods such as the t-test and SAM treat each gene as independent and test separately whether it is differentially expressed. Because they ignore the correlation between genes, the test results may be biased. A novel statistic called the OR value was recently proposed for identifying differentially expressed genes. It requires no model assumptions and no parameter estimation, and it uses the Euclidean distance to account for both the relationship between genes and the dispersion of the data. In this thesis, gene expression data are simulated from multivariate normal, multivariate t, and mixture distributions. The OR value is then used to test whether each gene is differentially expressed, and the results are compared with the commonly used t-test and with non-OR methods. The weighted quantile difference method based on the OR value performs well in all cases, especially under the multivariate t distribution with a high correlation coefficient and under the mixture distribution with a shift greater than 0. The second part addresses gene set analysis (GSA) under the self-contained hypothesis. The statistics from the first part are adapted for GSA and compared with commonly used gene set analysis methods. Only under the multivariate t distribution do the distance-based methods, such as the sum of quantile differences, the sum of weighted quantile differences, and the energy test, perform better than the other methods; under the other conditions no method clearly outperforms the rest. Finally, the OR-based method and competing methods are applied to a large-scale dataset from a group of breast cancer patients for both differentially expressed gene analysis and gene set analysis.
In summary, the OR value is worth considering for differentially expressed gene analysis, but a more robust statistic may be needed to extend the analysis to the gene-set level.
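Of the distance-based gene set methods named in the abstract, the energy test is the one with a widely used textbook definition. The following is a minimal sketch, assuming the usual two-sample energy statistic built from pairwise Euclidean distances and a null distribution obtained by permuting sample labels, which matches a self-contained hypothesis. The thesis's exact variant is not given in the abstract; the function names, the number of permutations, and the toy data are illustrative assumptions.

```python
import math
import random


def mean_pairwise_distance(a, b):
    """Average Euclidean distance over all pairs (x in a, y in b)."""
    return sum(math.dist(x, y) for x in a for y in b) / (len(a) * len(b))


def energy_statistic(group1, group2):
    """Two-sample energy statistic; each sample is one expression vector
    restricted to the genes of the set being tested."""
    n, m = len(group1), len(group2)
    e = (2.0 * mean_pairwise_distance(group1, group2)
         - mean_pairwise_distance(group1, group1)
         - mean_pairwise_distance(group2, group2))
    return n * m / (n + m) * e


def energy_permutation_test(group1, group2, n_perm=199, seed=0):
    """Permutation p-value: shuffle sample labels, keep group sizes fixed."""
    rng = random.Random(seed)
    observed = energy_statistic(group1, group2)
    pooled = list(group1) + list(group2)
    n = len(group1)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if energy_statistic(pooled[:n], pooled[n:]) >= observed:
            exceed += 1
    # Add one to numerator and denominator so the p-value is never zero.
    return observed, (exceed + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Toy gene set with 3 genes: 6 control and 6 case samples,
    # the case samples shifted by 1.5 in every gene.
    rng = random.Random(1)
    control = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(6)]
    case = [[rng.gauss(1.5, 1.0) for _ in range(3)] for _ in range(6)]
    print(energy_permutation_test(control, case))
```

Because the statistic depends only on distances between whole samples, it makes no distributional assumption about the genes in the set, which is consistent with the abstract's observation that distance-based methods do comparatively well under the heavy-tailed multivariate t setting.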

