
利用連鎖失衡加權K最近鄰法於基因型資料填補之研究

Genotype imputation using LD-based Weighted K Nearest Neighbor

Advisor: 蔡政安

Abstract


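In recent years, the rapid development of high-throughput sequencing has driven fast growth in all types of genomic data. Detection of single-nucleotide polymorphisms (SNPs) in particular has become much simpler than with earlier methods, through platforms such as Affymetrix GeneChip, Illumina BeadChip, and NGS-based SNP calling. SNPs are single-point variations in the genome. Compared with earlier biological markers such as simple sequence repeat (SSR) markers, they have two main advantages: they are easy to digitize, and they are available in very large numbers. As a result, SNPs have become central to association studies and genome-wide association studies (GWAS). SNP data are not flawless, however, because calls are frequently missing after detection. In human SNP analysis, for example, general-purpose high-density SNP microarrays have been reported to have missing rates and error rates of roughly 0.05% to 1%, and that figure comes from the thoroughly studied human genome. The problem is worse for non-model species and for species that receive less research funding: in the data analyzed in this thesis, the missing rate after SNP calling was 14.5%.

Incomplete marker data of this kind affects the results of association analysis. In GWAS in particular, the more complete the SNP data, the more clearly associated regions can be identified. With missing values, the results become too sparse and, more seriously, important associated regions may be lost. Association analyses usually begin by filtering out biomarkers or samples with too many missing values, but doing so can substantially affect the test results. This motivates imputation, which aims to make the data as complete as possible.

This thesis proposes an imputation method, LDKNN (linkage disequilibrium-based K-nearest neighbor), a new method that builds on the KNN (K-nearest neighbor) algorithm and adds linkage disequilibrium (LD) information. We apply it to real rice data with SNPs detected by genotyping by sequencing (GBS), covering natural germplasm and recombinant inbred lines, to examine whether results differ before and after imputation. We also compare LDKNN with KNN, SVM, and Beagle4 in simulation experiments to assess the strengths and weaknesses of the proposed method relative to the others.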

Parallel Abstract


High-throughput sequencing technologies have become an efficient and robust strategy for single nucleotide polymorphism (SNP) discovery and genome-wide association studies. However, conventional high-throughput genotyping techniques often produce a certain proportion of missing calls. It has long been recognized that failing to account for these missing data can dramatically reduce the power to detect SNPs. A variety of methods have been developed to impute the missing genotypes. Methods based on K-nearest neighbors (KNN) and weighted K-nearest neighbors (wtKNN) have received some attention because they consider similarities in haplotype structure. More recently, a number of powerful methods based on hidden Markov models (HMMs) have become popular for SNP imputation. However, these methods are time-consuming or mostly suited to imputing small marker sets, and they cannot exploit the indirect association structure of tightly linked SNPs. In this study, we propose a novel but computationally simple imputation method based on weighted K-nearest neighbors (wtKNN) that takes linkage disequilibrium (LD) into account. We demonstrate the performance of our method in imputing missing SNPs using both genotyping-by-sequencing (GBS) data and simulation studies. In addition, we compare the accuracy and performance of our method with competing imputation methods.
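The abstract does not specify how LD enters the neighbor search, so the following is only a minimal sketch of one plausible reading, not the thesis's actual algorithm. It assumes genotypes coded as 0/1/2 with `None` for a missing call, uses the squared correlation of genotype codes as a stand-in for LD, weights each SNP's contribution to the sample distance by that value, and lets the K nearest samples vote with inverse-distance weights. The function names (`r_squared`, `impute_call`, `impute_all`), the parameter `k`, and the toy data are all hypothetical.

```python
from collections import defaultdict


def r_squared(col_a, col_b):
    """Squared Pearson correlation of two genotype columns (an assumed LD proxy).

    Only samples with both calls present are used; returns 0.0 if undefined.
    """
    pairs = [(a, b) for a, b in zip(col_a, col_b) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        return 0.0
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs)
    var_b = sum((b - mean_b) ** 2 for _, b in pairs)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov * cov / (var_a * var_b)


def impute_call(genotypes, sample, snp, k=3):
    """Impute one missing call from the k nearest samples under an LD-weighted distance."""
    columns = list(zip(*genotypes))
    # Weight of every other SNP = its r^2 with the target SNP (assumption).
    weights = [0.0 if j == snp else r_squared(columns[snp], columns[j])
               for j in range(len(columns))]
    target = genotypes[sample]
    candidates = []
    for i, row in enumerate(genotypes):
        if i == sample or row[snp] is None:
            continue  # a neighbor must have an observed call at the target SNP
        num = den = 0.0
        for j, w in enumerate(weights):
            if w > 0 and target[j] is not None and row[j] is not None:
                num += w * abs(target[j] - row[j])
                den += w
        if den > 0:
            candidates.append((num / den, row[snp]))
    if not candidates:
        return None  # nothing informative; leave the call missing
    candidates.sort(key=lambda c: c[0])
    votes = defaultdict(float)
    for dist, call in candidates[:k]:
        votes[call] += 1.0 / (dist + 1e-6)  # inverse-distance vote
    return max(votes, key=votes.get)


def impute_all(genotypes, k=3):
    """Return a copy of the matrix with each missing call imputed from the original data."""
    return [[impute_call(genotypes, i, j, k) if g is None else g
             for j, g in enumerate(row)]
            for i, row in enumerate(genotypes)]


# Toy data: rows are samples, columns are SNPs. SNP 1 tracks SNPs 0 and 2;
# SNP 3 is uncorrelated with SNP 1 and therefore gets zero weight.
example = [
    [0, 0, 0, 2],
    [0, 0, 0, 0],
    [2, 2, 2, 0],
    [2, 2, 2, 2],
    [0, None, 0, 2],
    [2, None, 2, 0],
]
filled = impute_all(example)
```

In this toy matrix the uncorrelated fourth SNP is ignored when ranking neighbors, so the two missing calls are filled from samples that match at the linked SNPs. A real implementation would likely restrict the weights to a window of nearby markers and choose `k` by cross-validation; neither detail is given in the abstract.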

