
基於分群集成技術的非平衡學習應用於預測非編碼區變異的致病性

Clustering Ensemble Based Imbalanced Learning for Predicting Pathogenic Non-coding Variants

Advisor: 陳倩瑜

Abstract


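As next-generation sequencing and whole-genome sequencing become more widespread, tens of millions of genetic variants have been discovered across human genomes, most of them located in non-coding regions. Variants in non-coding regions may alter gene regulatory mechanisms and thereby cause disease. In practice, however, only a very small fraction of variants actually affect gene function and lead to disease, so identifying disease-associated variants among such a large number of candidates is a major challenge. In recent years, many machine learning methods have been applied to predicting pathogenic variants in the human genome. As the number of non-pathogenic variants grows, however, the imbalance between positive and negative (pathogenic/non-pathogenic) samples in the dataset becomes more severe, and classifier precision and recall drop markedly. To improve classifier performance on imbalanced datasets, this study develops CE-SMURF, a machine learning framework based on a Clustering Ensemble (CE) sampling technique and the Hyper-ensemble method. The framework addresses the poor performance of general machine learning algorithms on imbalanced datasets and is applied to predicting pathogenic variants in non-coding regions.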
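The effect of imbalance on precision can be illustrated with simple arithmetic. The sketch below is not taken from the thesis: the function name `precision` and the rates used (a true positive rate of 0.9 and a false positive rate of 0.01) are assumptions chosen only for illustration. It holds the classifier's per-class behaviour fixed and varies only the number of negative samples, so it shows the precision effect alone; the drop in recall reported above depends on how the classifier is trained and is not modelled here.

```python
def precision(n_pos, n_neg, tpr, fpr):
    """Precision of a classifier with fixed per-class rates.

    tpr: fraction of positives predicted positive (assumed value).
    fpr: fraction of negatives predicted positive (assumed value).
    """
    true_pos = tpr * n_pos
    false_pos = fpr * n_neg
    return true_pos / (true_pos + false_pos)


# Hypothetical counts: 100 pathogenic variants against a growing
# number of non-pathogenic ones.
for n_neg in (100, 10_000, 1_000_000):
    print(n_neg, precision(100, n_neg, tpr=0.9, fpr=0.01))
```

Even with identical per-class rates, the false positives contributed by a much larger negative class come to dominate the predicted positives.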

Parallel Abstract


With the help of next-generation sequencing (NGS) and whole-genome sequencing (WGS), many variants have been found in the non-coding regions of the human genome, but only a small minority of them are confirmed to be pathogenic. Finding pathogenic variants among such a large number of non-coding variants is a challenge. A method called HyperSMURF was previously proposed to tackle this problem by using both under-sampling and over-sampling techniques to balance the data. In reproducing the analytic results of HyperSMURF, we observed that this approach might generate minority-class samples that do not help training, or discard majority-class samples that might benefit training. This study therefore presents a machine learning framework, CE-SMURF. A Clustering Ensemble (CE) based method is used to find the central samples of the majority class and the boundary samples of the minority class, and a resampling technique is then applied to balance the class ratio. Moreover, to improve learning performance, we used an ensemble method to build multiple models and computed the final score of each variant by averaging the probabilities predicted by the individual models. We found that CE-SMURF can significantly improve performance in predicting pathogenic non-coding variants.
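The final step described above, building several models on balanced data and averaging their predicted probabilities, can be sketched as follows. This is a minimal illustration under stated assumptions, not the CE-SMURF implementation: the clustering-ensemble selection of central and boundary samples is not reproduced, the majority class is simply split at random into partitions, and the base learner is a toy nearest-centroid scorer. All function names here are hypothetical.

```python
import math
import random


def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]


def dist(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def fit_centroid_model(pos, neg):
    """Toy base learner (an assumption): store one centroid per class."""
    return centroid(pos), centroid(neg)


def predict_proba(model, x):
    """Map relative distance to the two centroids into a (0, 1) score."""
    c_pos, c_neg = model
    return 1.0 / (1.0 + math.exp(dist(x, c_pos) - dist(x, c_neg)))


def train_ensemble(pos, neg, n_parts, seed=0):
    """Split the majority class into n_parts random partitions and train
    one model per partition, each paired with all minority samples."""
    rng = random.Random(seed)
    shuffled = neg[:]
    rng.shuffle(shuffled)
    parts = [shuffled[i::n_parts] for i in range(n_parts)]
    return [fit_centroid_model(pos, part) for part in parts]


def ensemble_score(models, x):
    """Final score: the mean of the per-model probabilities."""
    return sum(predict_proba(m, x) for m in models) / len(models)


# Synthetic, hypothetical data: 5 positives near (3, 3), 50 negatives near (0, 0).
rng = random.Random(1)
pos = [[3 + rng.gauss(0, 0.5), 3 + rng.gauss(0, 0.5)] for _ in range(5)]
neg = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(50)]
models = train_ensemble(pos, neg, n_parts=5)
print(ensemble_score(models, [3.0, 3.0]), ensemble_score(models, [0.0, 0.0]))
```

Each model here sees 5 positives and 10 negatives rather than 5 and 50, which narrows the class ratio per model; averaging then combines the views from all partitions.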

