Developing an Epistasis Detection Algorithm Based on Machine Learning with Gene Information

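A genome-wide association study (GWAS) is a common method for detecting associations between genetic variants and phenotypes. However, GWAS has limited ability to detect associations between phenotypes and interactions among genetic variants, known as epistasis. We believe that developing an effective and efficient GWAS method for detecting epistasis will help unravel the pathogenic mechanisms of complex diseases such as Alzheimer's disease. This study therefore develops an algorithm, GenEpi, that uses machine learning to detect associations between phenotypes and interactions among variants. Because the gene is the smallest functional unit in biology, the core idea of GenEpi is to partition the genome into blocks by gene regions and to extract features in two stages, in an attempt to address both the excessive computational complexity of genome-wide epistasis detection and the reduced statistical reliability caused by multiple testing. The two stages of GenEpi are within-gene epistasis detection and cross-gene epistasis detection. In both stages, we use two-element combinatorial encoding to generate features that represent epistasis, and apply L1-regularized regression with stability selection to select features and build models. We applied GenEpi to an Alzheimer's disease dataset to predict whether a sample is an Alzheimer's disease patient, or how fast the disease progresses, in order to validate the algorithm and, we hope, to reveal more potential pathogenic factors of Alzheimer's disease. On both simulated data and real Alzheimer's disease data, the results show that GenEpi outperforms existing algorithms such as FastEpistasis, BOOST, and ReliefF in both prediction accuracy and computation time. GenEpi should therefore benefit other genome-wide association analyses, especially studies of complex diseases, and is expected to give biomedical researchers more useful reference information for experimental design. Availability: GenEpi is an open-source Python package licensed for use by non-commercial academic users. The source code is available on the PyPI package index and on GitHub (https://github.com/Chester75321/GenEpi).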
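To make the idea of two-element combinatorial encoding more concrete, the following is a minimal sketch and not GenEpi's actual code. It assumes genotypes are coded as minor-allele counts (0, 1 or 2), that each SNP is expanded into three indicator features, and that an interaction feature is the product of two indicators from different SNPs. The function names and the SNP ids are hypothetical.

```python
from itertools import combinations


def one_hot_genotypes(genotypes):
    """Expand each SNP into three indicator features, one per genotype.

    genotypes: dict mapping a SNP id to its minor-allele count (0, 1 or 2).
    """
    return {
        f"{snp}_{g}": int(count == g)
        for snp, count in genotypes.items()
        for g in (0, 1, 2)
    }


def two_element_encoding(genotypes):
    """Return single-SNP indicators plus pairwise products across SNPs."""
    singles = one_hot_genotypes(genotypes)
    features = dict(singles)
    for a, b in combinations(sorted(singles), 2):
        # Skip pairs from the same SNP: a sample carries exactly one
        # genotype per SNP, so such a product would always be 0.
        if a.rsplit("_", 1)[0] == b.rsplit("_", 1)[0]:
            continue
        features[f"{a}*{b}"] = singles[a] * singles[b]
    return features


# Hypothetical sample with two SNPs (ids are placeholders).
sample = {"rs1": 2, "rs2": 0}
encoded = two_element_encoding(sample)
active = sorted(name for name, value in encoded.items() if value)
print(len(encoded), active)
```

For two SNPs this yields 6 single-genotype indicators and 9 cross-SNP products. Restricting the pairs to SNPs inside one gene keeps the number of candidate features small, which is the motivation the abstract gives for working gene by gene.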

Keywords

GenEpi; machine learning; epistasis; genome-wide association study; Alzheimer's disease

Parallel Abstract

Genome-wide association studies (GWAS) provide a powerful means to identify associations between genetic variants and phenotypes. However, GWAS techniques for detecting epistasis, the interactions between genetic variants associated with phenotypes, are still limited. We believe that developing an efficient and effective GWAS method to detect epistasis will be key to uncovering complex pathogenic mechanisms, which is especially important for complex diseases such as Alzheimer’s disease (AD). To this end, this study presents GenEpi, a computational package that uncovers epistasis associated with phenotypes using a machine learning approach, and illustrates its application to predicting the diagnosis and progression of AD. The key concept of GenEpi is a two-stage feature extraction process based on gene structures. Since a gene is the minimal physical and functional unit of heredity, GenEpi treats each gene as a unit for retrieving genetic variants as features. GenEpi adopts two-element combinatorial encoding when producing features and constructs prediction models by L1-regularized regression with stability selection. In the first stage, features are generated by combinatorial encoding and then selected by L1-regularized regression with stability selection to detect epistasis within a single gene. In the second stage, the features selected for each gene are pooled together to identify cross-gene epistasis, again using L1-regularized regression with stability selection. This study compared GenEpi with several commonly used algorithms for detecting epistasis, including FastEpistasis, BOOST, and ReliefF. On simulated data, GenEpi outperformed the other methods in ranking the true epistasis first. On real data, the epistasis selected by GenEpi had the best predictive power for two major phenotypes in the AD dataset: diagnosis of AD and disease progression.
For diagnosis, the proposed model for predicting AD contains three clinical features (Age, Gender and Education) and 14 genetic features, comprising 24 SNPs from 12 genes, including the well-known causal gene APOE. The 2-fold cross-validation (CV) and leave-one-out CV (LOO CV) accuracies of this model are 0.83 and 0.81, respectively. For predicting progression, the proposed model contains eight clinical features (Age, Gender, Education, Cognitively Normal (CN), Early Mild Cognitive Impairment (MCI), Late MCI, AD, and MMSE at baseline) and four genetic features, comprising seven SNPs from six genes; all four genetic features are cross-gene epistasis with significant p-values (< 10^-11). The average Pearson and Spearman correlations for 2-fold CV and LOO CV are 0.52 and 0.53, respectively. The results on AD demonstrate the capability of GenEpi to find disease-related variants and epistasis that show both biological meaning and predictive power. The released package can be generalized to facilitate studies of many other complex diseases. Availability: GenEpi is an open-source Python package, available free of charge to non-commercial users. The package has been published on the Python Package Index and on GitHub (https://github.com/Chester75321/GenEpi).
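The feature-selection step named in the abstract, L1-regularized regression with stability selection, can be sketched as follows. This is an illustration under stated assumptions, not GenEpi's implementation: it uses a hand-written proximal-gradient (ISTA) solver for L1-regularized logistic regression in place of whatever solver the package uses, a plain half-sample subsampling variant of stability selection, and a hypothetical toy dataset whose label is driven by the interaction of two binary features. All function names and parameter values are assumptions.

```python
import math
import random
from itertools import product


def soft_threshold(value, amount):
    """Proximal operator of the L1 penalty: shrink value toward zero."""
    if value > amount:
        return value - amount
    if value < -amount:
        return value + amount
    return 0.0


def fit_l1_logistic(rows, labels, penalty, step=0.5, iterations=200):
    """Fit L1-regularized logistic regression by proximal gradient descent."""
    n = len(rows)
    weights = [0.0] * len(rows[0])
    bias = 0.0  # the intercept is not penalized
    for _ in range(iterations):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for row, label in zip(rows, labels):
            score = bias + sum(w * x for w, x in zip(weights, row))
            error = 1.0 / (1.0 + math.exp(-score)) - label
            grad_b += error
            for j, x in enumerate(row):
                grad_w[j] += error * x
        bias -= step * grad_b / n
        weights = [soft_threshold(w - step * g / n, step * penalty)
                   for w, g in zip(weights, grad_w)]
    return weights


def stability_selection(rows, labels, penalty, rounds=20, threshold=0.6, seed=0):
    """Refit on random half-samples; keep features selected in most rounds."""
    rng = random.Random(seed)
    counts = [0] * len(rows[0])
    indices = list(range(len(rows)))
    for _ in range(rounds):
        subset = rng.sample(indices, len(indices) // 2)
        weights = fit_l1_logistic([rows[i] for i in subset],
                                  [labels[i] for i in subset], penalty)
        for j, w in enumerate(weights):
            if w != 0.0:
                counts[j] += 1
    frequencies = [c / rounds for c in counts]
    selected = [j for j, f in enumerate(frequencies) if f >= threshold]
    return selected, frequencies


# Hypothetical toy data: columns are a, b, a*b, c, d; the label equals a*b,
# so column 2 plays the role of a true epistasis feature.
rows, labels = [], []
for a, b, c, d in product((0, 1), repeat=4):
    for _ in range(10):
        rows.append([a, b, a * b, c, d])
        labels.append(a * b)

selected, frequencies = stability_selection(rows, labels, penalty=0.02)
print(selected, frequencies)
```

The selection frequency of each feature across subsamples is what makes the result more robust than a single L1 fit: a feature that is picked only in a few resampled fits falls below the threshold and is discarded.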

Parallel Keywords

GenEpi; Machine Learning; Epistasis; GWAS; Alzheimer’s disease

