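Most species in the biosphere have a diploid genome composed of a pair of haplotypes. However, current assemblers for next-generation sequencing platforms can reconstruct only a single sequence, and this sequence is a mosaic containing information from both haplotypes. Moreover, the sequence differences between the two haplotypes include single nucleotide polymorphisms (SNPs) and large-scale structural variations (SVs). Reconstructing both haplotype sequences of a diploid genome with next-generation platforms therefore remains a formidable task. In this thesis, we design and implement a new framework, named HapSVAssembler, that assembles the two haplotype sequences of a diploid genome from short paired-end reads. HapSVAssembler first combines several assembly algorithms to reconstruct a reference sequence, called the reference genome. By aligning the paired-end reads to the reference genome, it then locates the coordinates of heterozygous SNPs and heterozygous SVs. Finally, it analyzes the paired-end reads that span two or more heterozygous SNPs or SVs to separate and reconstruct two complete haplotype sequences. For the haplotype assembly step, we define a new optimization problem and design a genetic algorithm (GA) to solve it. Results from various simulation experiments show that HapSVAssembler achieves better accuracy and completeness than previous methods. In addition, HapSVAssembler can help analyze linkage disequilibrium among different genetic variations.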
The genomes of most species in the biosphere are diploid, composed of two haplotypes. However, existing short-read assemblers for next-generation sequencing (NGS) platforms reconstruct only a single consensus sequence, which is a mosaic of the two haplotypes. In addition, the differences between the two haplotypes range from single nucleotide polymorphisms (SNPs) to large-scale structural variations (SVs). Therefore, de novo haplotype assembly of a diploid genome using NGS platforms remains a challenging task. In this thesis, we design and implement a new framework called HapSVAssembler for de novo assembly of a diploid genome using short paired-end reads. HapSVAssembler uses a hybrid assembly approach to build a consensus sequence, identifies heterozygous SNP and SV loci, and simultaneously reconstructs the SNP/SV haplotypes from reads spanning two or more SNPs/SVs. A new optimization problem is formulated and solved by a genetic algorithm (GA). The experimental results indicate that the assembly accuracy and continuity of HapSVAssembler are much higher than those of previous methods. With its ability to assemble haplotypes containing multiple types of genomic variation, HapSVAssembler is very useful for studying linkage disequilibrium across different variations.
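The final step, separating two haplotypes from reads that span two or more heterozygous loci by means of a genetic algorithm, can be illustrated with a small sketch. The abstract does not state the optimization problem that HapSVAssembler actually solves, so the sketch below substitutes the classical minimum error correction (MEC) objective as a stand-in. It also assumes that each heterozygous SNP or SV has already been reduced to a binary allele (0 or 1). The names `mec_cost` and `phase_with_ga`, and all parameter values, are hypothetical and are not taken from the thesis.

```python
import random
from typing import Dict, List, Tuple

# A fragment is one paired-end read reduced to the heterozygous loci it spans:
# locus index -> observed allele (0 or 1). SVs are treated like SNPs here (assumption).
Fragment = Dict[int, int]


def mec_cost(hap: List[int], fragments: List[Fragment]) -> int:
    """MEC stand-in objective: each fragment is charged its mismatches against
    the closer of the candidate haplotype and its bitwise complement."""
    total = 0
    for frag in fragments:
        mismatches = sum(1 for locus, allele in frag.items() if hap[locus] != allele)
        total += min(mismatches, len(frag) - mismatches)
    return total


def phase_with_ga(
    fragments: List[Fragment],
    n_loci: int,
    pop_size: int = 40,
    generations: int = 60,
    mutation_rate: float = 0.05,
    seed: int = 0,
) -> Tuple[List[int], List[int], int]:
    """Toy genetic algorithm: elitist selection, one-point crossover, bit-flip mutation.
    Requires n_loci >= 2. Returns (haplotype, complement, cost)."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_loci)] for _ in range(pop_size)]
    for _ in range(generations):
        # Keep the better half unchanged, so the best cost never gets worse.
        population.sort(key=lambda h: mec_cost(h, fragments))
        elite = population[: pop_size // 2]
        children = []
        while len(elite) + len(children) < pop_size:
            parent_a, parent_b = rng.sample(elite, 2)
            cut = rng.randrange(1, n_loci)
            child = parent_a[:cut] + parent_b[cut:]
            child = [bit ^ 1 if rng.random() < mutation_rate else bit for bit in child]
            children.append(child)
        population = elite + children
    best = min(population, key=lambda h: mec_cost(h, fragments))
    complement = [bit ^ 1 for bit in best]
    return best, complement, mec_cost(best, fragments)


# Six heterozygous loci; each read pair links two of them without errors.
example_fragments = [
    {0: 0, 1: 1}, {1: 0, 2: 0}, {2: 1, 3: 0},
    {3: 1, 4: 0}, {4: 1, 5: 0}, {0: 1, 2: 0},
]
hap_a, hap_b, cost = phase_with_ga(example_fragments, n_loci=6)
print(hap_a, hap_b, cost)
```

Because the two haplotypes of a diploid genome are complementary at heterozygous loci, the sketch encodes only one of them per individual and derives the other by flipping every bit. A fragment is then scored against whichever haplotype it fits better, which is why reads spanning at least two loci are the ones that carry phasing information.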