
Use Type I Multivariate Zero-Inflated Poisson Model and Microbial Metagenomics Data to Group Subjects

利用第I型零膨脹卜瓦松模型和微生物基因組數據對受試者分群

Abstract


Next-generation sequencing, also known as high-throughput sequencing, has been under development for many years. Compared with traditional technologies, it can sequence hundreds of thousands or even millions of nucleic acid molecules at a time, which helps researchers align and analyze genome sequences. Many studies have confirmed that numerous human traits and diseases are closely related to specific genes, so effectively identifying specific pathogenic genes is a primary goal of research. Many microbiota analyses take a statistical approach. For example, Holmes et al. (2012) assumed that the microbial community follows the Dirichlet-multinomial distribution. However, microbial community data are discrete and sparse. This study therefore takes both properties into account and builds a type I multivariate zero-inflated Poisson model suited to the data. Because some microorganisms exhibit co-occurring zeros, we first place the microbial genes whose zeros are highly correlated into the same group, and then cluster the subjects once the number of groups has been determined. We then generate type I multivariate zero-inflated Poisson data by statistical simulation and analyze them with the proposed method, K-means, and hierarchical clustering, comparing the results. Principal component analysis is used to reduce the dimensionality of the high-dimensional genetic data so that the clustering can be visualized. The study finds that the proposed method remains relatively effective even when the groups overlap more heavily. Moreover, on real data the proposed clustering method achieves the same accuracy as the approach based on the Dirichlet-multinomial assumption. In addition, the correlation between two genes is negative under the Dirichlet-multinomial distribution but positive under the type I multivariate zero-inflated Poisson distribution. For data in which co-occurring zeros induce positive correlation between genes, the proposed method offers an alternative for grouping subjects.
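The contrast in correlation sign described above can be illustrated with a small simulation. The sketch below is not the study's implementation. It assumes the commonly used formulation of the type I multivariate zero-inflated Poisson distribution, in which a single shared indicator zeroes out the whole vector with probability phi and the components are otherwise independent Poisson counts. The function names, the parameter values (lambdas, phi, alphas, total), and the sample size are hypothetical choices made only for illustration.

```python
import math
import random


def poisson(rng, lam):
    """Draw one Poisson(lam) variate with Knuth's multiplication method (fine for small lam)."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def sample_type1_mzip(rng, n, lambdas, phi):
    """Assumed type I multivariate ZIP: with probability phi the whole vector
    is zero; otherwise each component is an independent Poisson draw."""
    rows = []
    for _ in range(n):
        if rng.random() < phi:
            rows.append([0] * len(lambdas))  # co-occurring zeros
        else:
            rows.append([poisson(rng, lam) for lam in lambdas])
    return rows


def sample_dirichlet_multinomial(rng, n, alphas, total):
    """Draw proportions from Dirichlet(alphas) via normalized gamma variates,
    then allocate `total` reads across taxa with those proportions."""
    rows = []
    for _ in range(n):
        gammas = [rng.gammavariate(a, 1.0) for a in alphas]
        s = sum(gammas)
        probs = [g / s for g in gammas]
        counts = [0] * len(alphas)
        for idx in rng.choices(range(len(alphas)), weights=probs, k=total):
            counts[idx] += 1
        rows.append(counts)
    return rows


def pearson(xs, ys):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


# Hypothetical parameters, chosen only to make the effect visible.
rng = random.Random(0)
mzip = sample_type1_mzip(rng, n=2000, lambdas=[3.0, 5.0, 2.0], phi=0.4)
dm = sample_dirichlet_multinomial(rng, n=2000, alphas=[2.0, 3.0, 1.5], total=50)

corr_mzip = pearson([r[0] for r in mzip], [r[1] for r in mzip])
corr_dm = pearson([r[0] for r in dm], [r[1] for r in dm])
print(f"type I MZIP correlation (components 1, 2): {corr_mzip:.3f}")
print(f"Dirichlet-multinomial correlation (components 1, 2): {corr_dm:.3f}")
```

Under these assumptions the shared zero indicator makes the two counts rise and fall together, so the first correlation should come out positive. In the Dirichlet-multinomial sample every row sums to the same total, so a larger count for one component leaves less for the others and the second correlation should come out negative.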

Parallel Abstract


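Next-generation sequencing, also called high-throughput sequencing, has been under development for many years. Compared with traditional techniques, it can sequence hundreds of thousands or even millions of nucleic acid molecules at a time, which in turn lets researchers align and analyze genome sequences. Many studies have confirmed that numerous human traits and diseases are closely tied to specific genes, so effectively identifying specific pathogenic genes is a main goal of bioinformatics research. Some microbiota analyses take a statistical approach; for example, Holmes et al. (2012) assumed that the microbial community follows a Dirichlet-multinomial distribution. However, most microbial community data are discrete and sparse, so this study takes both properties into account and builds a type I multivariate zero-inflated Poisson (ZIP) model suited to the data. Because some microorganisms tend to be zero at the same time, we first place the microbial genes whose zeros are highly correlated into the same group, and cluster the subjects once the number of groups has been decided. We then generate type I multivariate zero-inflated Poisson data by statistical simulation and analyze them with the clustering model proposed in this study. Finally, we compare the results with the common K-means algorithm and hierarchical clustering algorithm, and use principal component analysis (PCA) to reduce the dimensionality of the high-dimensional genetic data so that the clustering can be visualized. The study finds that the proposed method remains relatively effective even as the overlap between groups increases; on real data, the proposed clustering method achieves the same accuracy as the approach based on the Dirichlet-multinomial assumption. In addition, the correlation between two microbial genes is negative under the Dirichlet-multinomial distribution but positive under the type I multivariate zero-inflated Poisson distribution. For data in which co-occurring zeros among microorganisms induce positive correlation between genes, the proposed method offers an alternative for clustering subjects.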
