Next-generation sequencing (NGS) technologies are widely used to sequence and assemble the genomes of previously unstudied species. In practice, however, most assembled genomes remain highly fragmented, because genomes are complex and NGS reads are relatively short. In this thesis, we design and implement CEPS (Contig Extension using Paired-end Sequencing), a software tool that uses paired-end sequencing to further extend assembled contigs and thereby improve de novo assembly. CEPS quickly identifies paired-end reads that overhang the boundaries of contigs, determines whether a repeat sequence is involved, and extends the contigs accordingly. By exploiting paired-end information, CEPS extends contigs across extremely low- and high-coverage regions, where most current assemblers produce fragmented assemblies. CEPS is multi-threaded, and we tested it and compared it with existing assemblers on a variety of simulated data sets. The experimental results show that CEPS produces more contiguous and more complete genomes, with a larger N50 and assembled genome size, and with an assembly accuracy close to 100%. Notably, CEPS can integrate multiple paired-end or mate-pair libraries of different lengths to further improve genome assembly.
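For readers unfamiliar with the N50 statistic cited above, the following minimal Python sketch shows the standard definition: N50 is the largest length L such that contigs of length at least L together cover at least half of the total assembled length. The function name `n50` and the example contig lengths are illustrative assumptions, not part of CEPS or its evaluation data.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly, given its contig lengths (hypothetical helper)."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig length is required")
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest contig down until half of the assembly is covered.
    for length in lengths:
        running += length
        if running >= half_total:
            return length


# Illustrative, made-up lengths: joining the 30 and 20 contigs into one
# contig of length 50 keeps the total length at 150 but raises N50.
before_extension = [50, 40, 30, 20, 10]
after_extension = [50, 50, 40, 10]
print(n50(before_extension), n50(after_extension))
```

As the toy example suggests, joining or extending contigs raises N50 even when the total assembled length is unchanged, which is why N50 is usually read together with genome size and accuracy, as in the results summarized above.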