Simultaneous Haplotype Assembly and Structural Variation Detection Using Next-Generation Sequencing Platforms

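Most species in the biosphere have diploid genomes composed of a pair of haplotypes. However, the assemblers currently available for next-generation sequencing (NGS) platforms reconstruct only a single sequence, and that sequence is a mosaic mixing information from both haplotypes. Moreover, the sequence differences between the two haplotypes include single nucleotide polymorphisms (SNPs) as well as large-scale structural variations (SVs). Reconstructing both haplotype sequences of a diploid genome from NGS data therefore remains a difficult task. In this thesis we design and implement a new framework, named HapSVAssembler, that assembles the two haplotype sequences of a diploid genome from short paired-end reads. HapSVAssembler first combines several assembly algorithms to reconstruct a single reference sequence, called the reference genome. It then aligns the paired-end reads to the reference genome to locate the coordinates of heterozygous SNPs and heterozygous SVs. Finally, it analyzes the paired-end reads that span two or more heterozygous SNPs or heterozygous SVs in order to separate and reconstruct two complete haplotype sequences. For the haplotype assembly step, we define a new optimization problem and design a genetic algorithm (GA) to solve it. Results from a variety of simulation experiments show that HapSVAssembler achieves better accuracy and completeness than previous methods. In addition, HapSVAssembler can help analyze linkage disequilibrium among different types of genetic variation.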

Keywords

Diploid genome; genetic algorithm; haplotype assembly

Parallel Abstract

The genomes of most species in the biosphere are diploid, composed of two haplotypes. However, existing short-read assemblers for next-generation sequencing (NGS) platforms reconstruct only one consensus sequence, which is a mosaic of the two haplotypes. In addition, the differences between the two haplotypes range from single nucleotide polymorphisms (SNPs) to large-scale structural variations (SVs). De novo haplotype assembly of a diploid genome using NGS platforms is therefore still a challenging task. In this thesis, we design and implement a new framework called HapSVAssembler for de novo assembly of a diploid genome using short paired-end reads. HapSVAssembler uses a hybrid assembly approach to build a consensus sequence, identifies heterozygous SNP and SV loci, and simultaneously reconstructs the SNP/SV haplotypes from reads spanning two or more SNPs/SVs. A new optimization problem is formulated and solved with a genetic algorithm (GA). The experimental results indicate that the assembly accuracy and continuity of HapSVAssembler are much higher than those of previous methods. Because it can assemble haplotypes containing multiple types of genomic variation, HapSVAssembler is well suited to studying linkage disequilibrium across different variations.
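To make the haplotype reconstruction step more concrete, the following is a minimal sketch of phasing heterozygous loci with a genetic algorithm. It is not the thesis's formulation: the abstract does not state the new optimization problem, so the sketch substitutes the widely used minimum error correction (MEC) objective as a stand-in. The names mec_cost and phase_with_ga, the encoding of each SNP or SV locus as a binary allele (0 for the consensus allele, 1 for the alternative), and all parameter values are assumptions made for illustration only.

```python
import random

def mec_cost(hap, reads):
    """MEC-style cost (an assumed stand-in objective, not the thesis's own).

    hap   : list of 0/1 alleles, one per heterozygous SNP/SV locus.
    reads : list of dicts {locus_index: observed_allele}, one per read pair
            spanning two or more heterozygous loci.
    Each read is charged the smaller of its mismatch counts against hap
    and against the complementary haplotype.
    """
    total = 0
    for read in reads:
        mismatches = sum(1 for i, allele in read.items() if hap[i] != allele)
        total += min(mismatches, len(read) - mismatches)
    return total

def phase_with_ga(reads, n_loci, pop_size=40, generations=200,
                  mutation_rate=0.05, seed=0):
    """Search for a low-cost haplotype with a simple elitist GA.

    Returns the best haplotype found and its complement.
    Requires n_loci >= 2. All parameter defaults are illustrative.
    """
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_loci)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda h: mec_cost(h, reads))
        if mec_cost(pop[0], reads) == 0:
            break  # every read is already consistent with one haplotype
        elite = pop[: pop_size // 2]  # keep the better half unchanged
        children = []
        while len(elite) + len(children) < pop_size:
            p1, p2 = rng.sample(elite, 2)
            cut = rng.randrange(1, n_loci)  # one-point crossover
            child = p1[:cut] + p2[cut:]
            # flip each allele independently with probability mutation_rate
            child = [1 - a if rng.random() < mutation_rate else a for a in child]
            children.append(child)
        pop = elite + children
    best = min(pop, key=lambda h: mec_cost(h, reads))
    return best, [1 - a for a in best]

# Toy input: five read pairs over four heterozygous loci.
reads = [{0: 0, 1: 1}, {1: 1, 2: 1}, {2: 1, 3: 0}, {0: 1, 1: 0}, {2: 0, 3: 1}]
h1, h2 = phase_with_ga(reads, n_loci=4)
```

Because the two haplotypes of a diploid genome are complementary at heterozygous loci, only one of them needs to be searched for; the other follows by flipping every allele. A production implementation would also need to handle sequencing errors, loci not connected by any read, and the differing evidence models for SNPs and SVs, none of which this sketch attempts.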

Parallel Keywords

Genetic algorithm; Haplotype assembly; Diploid genome
