Mate-pair scaffolding has been used since the early days of genome sequencing to improve the completeness of genome assemblies. Although mate-pair sequencing is now affordable, its power and accuracy are limited by unstable read quality and frequent contamination. Third-generation sequencing, which generates long reads, can effectively improve assembly completeness and scaffolding accuracy. However, the error rate and cost of this technology are still high. This thesis therefore aims to convert relatively low-cost mate-pair reads into long reads using computational approaches, so that scaffolding can benefit from the advantages of both mate-pair reads and long reads. We test our method on several real datasets and validate the accuracy of the converted long reads. In addition, we compare the scaffolding results obtained with the original mate-pair reads, the converted long reads, and a mixture of both.