DNA 甲基化插補方法之比較研究

A comparative study on DNA methylation imputation methods

Advisor: 林菀俞
Full text will be available for download from 2024/09/01.

Abstract


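DNA methylation is an important biomarker in epigenetics, and many studies have shown that it is strongly associated with human biological functions and conditions such as aging, cancer, allergy, and diabetes. However, methylation data collected with methylation analysis chips may contain many missing values, which makes downstream analysis more difficult, so methylation studies must use imputation methods to obtain replacement values for the missing data. This study used three classes of imputation methods: unit imputation, K-nearest neighbors (KNN) imputation, and multiple imputation by chained equations (MICE). It considered many parameter combinations for these methods and incorporated the degree of correlation between sites into them. For the real-data analysis, this study used genome-wide methylation chip data from 2091 participants of the Taiwan Biobank, generated simulated datasets under different missingness mechanisms, and identified the most suitable of the three classes of imputation methods for each simulated dataset. The results show that, under the missing completely at random and missing at random mechanisms with a small missing rate, KNN imputation that accounts for the correlation between methylation sites gives the best imputed values when the root mean square error (RMSE) is the final evaluation metric, whereas MICE gives the best imputed values when the mean absolute error (MAE) is the final evaluation metric. For datasets with a larger missing rate, this study recommends KNN imputation, which gives the best imputation results regardless of the evaluation metric. The results indicate that the evaluation metric, the missing rate, and the missingness mechanism all affect imputation quality, and that imputation methods that incorporate the correlation between methylation sites can significantly reduce imputation error.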
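
The evaluation setup described above (hide known values, impute them, then score the imputed values with RMSE and MAE) can be sketched roughly as follows. This is a minimal illustration and not the thesis's code: the function names, the list-of-lists layout (rows are participants, columns are CpG sites), and the per-site mean baseline are all assumptions made for the example, and only the missing completely at random mechanism is simulated.

```python
import math
import random


def mask_mcar(matrix, rate, seed=0):
    """Hide entries completely at random (hypothetical helper).

    Returns a masked copy (hidden entries become None) and the list of
    (row, column) positions that were hidden.
    """
    rng = random.Random(seed)
    masked = [row[:] for row in matrix]
    positions = []
    for i, row in enumerate(matrix):
        for j in range(len(row)):
            if rng.random() < rate:
                masked[i][j] = None
                positions.append((i, j))
    return masked, positions


def impute_site_mean(matrix):
    """Baseline (assumed): fill each gap with the mean of its site (column)."""
    filled = [row[:] for row in matrix]
    for j in range(len(matrix[0])):
        observed = [row[j] for row in matrix if row[j] is not None]
        if not observed:
            continue  # no observed value at this site; leave the gaps as None
        mean = sum(observed) / len(observed)
        for row in filled:
            if row[j] is None:
                row[j] = mean
    return filled


def rmse(truth, imputed, positions):
    """Root mean square error over the hidden positions only."""
    if not positions:
        raise ValueError("no masked positions to score")
    total = sum((truth[i][j] - imputed[i][j]) ** 2 for i, j in positions)
    return math.sqrt(total / len(positions))


def mae(truth, imputed, positions):
    """Mean absolute error over the hidden positions only."""
    if not positions:
        raise ValueError("no masked positions to score")
    total = sum(abs(truth[i][j] - imputed[i][j]) for i, j in positions)
    return total / len(positions)


if __name__ == "__main__":
    # Toy methylation beta values, invented for illustration.
    beta = [[0.10, 0.80, 0.45], [0.15, 0.75, 0.50], [0.20, 0.85, 0.55]]
    masked, hidden = mask_mcar(beta, rate=0.3, seed=1)
    if hidden:
        filled = impute_site_mean(masked)
        print("RMSE:", rmse(beta, filled, hidden))
        print("MAE:", mae(beta, filled, hidden))
```

Because RMSE squares each error before averaging, it penalizes a few large misses more heavily than MAE does, which is one reason the two metrics can rank imputation methods differently.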

Parallel Abstract


DNA methylation is a key biomarker in epigenetics. Previous studies have shown that DNA methylation is closely related to human biological functions, both physiological mechanisms (such as embryonic development and aging) and pathological conditions (such as cancer, asthma, and diabetes). However, methylation data collected with methylation bead chips may contain many missing values, which complicates subsequent analysis. Investigators therefore usually need imputation methods to generate replacement values for the missing data. This study used three types of imputation methods: unit imputation, K-nearest neighbors imputation (KNN imputation), and multiple imputation by chained equations (MICE). Several parameter combinations were considered for each method, and the relationship between cytosine-phosphate-guanine dinucleotides (CpGs) was incorporated to improve the methods. In the real-data analysis, 2091 participants of the Taiwan Biobank were studied. We applied the methods to datasets simulated under various missingness mechanisms and compared them across these datasets to identify the most suitable of the three imputation methods. The results showed that, when the root mean square error (RMSE) is the final evaluation metric, KNN imputation that accounts for the correlation between CpG sites yields the most accurate imputed values regardless of the missingness mechanism. When the mean absolute error (MAE) is the final evaluation metric, MICE yields the most accurate imputed values if the data are missing at random or missing completely at random and the missing rate is low. For datasets with high missing rates, however, KNN imputation yields the best imputation results.
The results also showed that the evaluation metric, the missing rate, and the missingness mechanism all influence the quality of imputation, and that imputation methods that account for the relationship between methylation sites can significantly reduce imputation error.
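
The abstract does not say how the correlation between CpG sites enters the KNN imputation, so the sketch below shows only one plausible reading: for each missing value, take the k sites most positively correlated with the target site and average their values for the same participant. The function names, the choice of Pearson correlation, the unweighted average, and the site-mean fallback are assumptions made for illustration, not the method used in the thesis.

```python
import math


def pearson(x, y):
    """Pearson correlation over positions where both values are observed.

    Returns 0.0 when fewer than three pairs overlap or either side is constant.
    """
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    if len(pairs) < 3:
        return 0.0
    mean_x = sum(a for a, _ in pairs) / len(pairs)
    mean_y = sum(b for _, b in pairs) / len(pairs)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def knn_impute_by_site_correlation(matrix, k=2):
    """Fill gaps using the k most positively correlated sites (assumed scheme).

    Rows are participants and columns are CpG sites; missing entries are None.
    The input is not modified; a filled copy is returned.
    """
    n_sites = len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_sites)]
    filled = [row[:] for row in matrix]
    for j in range(n_sites):
        # Correlation of every other site with site j, highest first.
        scores = {s: pearson(columns[j], columns[s])
                  for s in range(n_sites) if s != j}
        ranked = sorted(scores, key=scores.get, reverse=True)
        for i, row in enumerate(matrix):
            if row[j] is not None:
                continue
            # Values this participant has at the best-correlated sites.
            neighbours = [row[s] for s in ranked if row[s] is not None][:k]
            if neighbours:
                filled[i][j] = sum(neighbours) / len(neighbours)
            else:
                # Fallback: mean of the observed values at site j, if any.
                observed = [v for v in columns[j] if v is not None]
                if observed:
                    filled[i][j] = sum(observed) / len(observed)
    return filled
```

Averaging raw neighbour values only makes sense when the neighbouring sites sit on a similar scale to the target site; a fuller implementation would likely weight or rescale the neighbours, which this sketch omits for brevity.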

