A Study on Solving the Scaffolding Problem Based on the Maximum-Matching Model

Scaffolding is an important step in DNA sequencing: its purpose is to determine the order and orientation of the contigs in a target draft genome. Accurate scaffolding helps us later obtain a more complete genome sequence. Our laboratory previously developed CSAR, a rearrangement-based scaffolding tool that scaffolds a target draft genome against a complete or incomplete reference genome. However, the main limitation of CSAR is that the conserved sequence markers shared by the target and reference genomes must not be duplicated. In fact, duplicate sequence markers are very common in the genomes of species. In this thesis, therefore, we use the concept of the maximum-matching breakpoint distance (MBD) to define an MBD-based scaffolding problem, whose goal is to determine scaffolds of the target and reference genomes such that the maximum-matching breakpoint distance between the two scaffolds is minimized. In addition, we use integer linear programming (ILP) to design an exact algorithm for the MBD-based scaffolding problem. Finally, our experiments on simulated and real data show that our MBD-based scaffolding algorithm is more accurate when it takes duplicate markers into account than when it does not. On the other hand, the EBD-based scaffolding algorithm outperforms our MBD-based scaffolding algorithm on simulated data, whereas our algorithm outperforms the EBD-based one on real data. Moreover, our MBD-based scaffolding algorithm is slightly more accurate than CSAR, but CSAR runs much faster.

Keywords

algorithm; genome assembly; maximum-matching model; integer linear programming; bioinformatics; next-generation sequencing

Parallel Abstract

Scaffolding is an important step in the process of DNA sequencing. Its purpose is to determine the order and orientation of the contigs of a draft genome. Accurate scaffolding helps in obtaining a more complete genome sequence in subsequent steps. Our laboratory previously developed CSAR, a rearrangement-based scaffolding tool that can scaffold a target draft genome based on a complete or incomplete reference genome. However, the main limitation of CSAR is that the conserved sequence markers between the target and reference genomes must be singletons, that is, no marker may be duplicated. In fact, duplicate sequence markers are very common in the genomes of species. In this thesis, therefore, we use the concept of the maximum-matching breakpoint distance (MBD) to define an MBD-based scaffolding problem, which is to determine the scaffolds of the target and reference genomes such that the maximum-matching breakpoint distance between the resulting scaffolds is minimized. In addition, we use integer linear programming (ILP) to design an exact algorithm for the MBD-based scaffolding problem. Finally, our experimental results on simulated and real datasets show that our MBD-based scaffolding algorithm is more accurate when it considers duplicate markers than when it does not. On the other hand, the EBD-based scaffolding algorithm is more accurate than our MBD-based scaffolding algorithm on simulated datasets, but our MBD-based scaffolding algorithm outperforms the EBD-based scaffolding algorithm on real datasets. Moreover, our MBD-based scaffolding algorithm is slightly more accurate than CSAR, but CSAR is much faster in terms of running time.
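To make the distance in the abstract concrete, the following is a minimal sketch of a maximum-matching breakpoint distance between two fixed, linear, signed marker orders. It is an illustration under stated assumptions, not the thesis's method: it uses brute-force enumeration instead of ILP, it does not choose contig orders or orientations (so it does not solve the scaffolding problem itself), and it uses a simplified adjacency definition that ignores chromosome ends. The function names `breakpoints` and `mbd` are hypothetical.

```python
from itertools import combinations, permutations, product


def breakpoints(g, h):
    """Count adjacencies of g that are not conserved in h.

    g and h are signed orders over the same duplicate-free markers,
    e.g. [1, -2, 3]. An adjacency (x, y) of g is conserved if h contains
    x, y or -y, -x consecutively. Chromosome ends are ignored (assumption).
    """
    adj_h = set()
    for x, y in zip(h, h[1:]):
        adj_h.add((x, y))
        adj_h.add((-y, -x))
    return sum((x, y) not in adj_h for x, y in zip(g, g[1:]))


def mbd(g, h):
    """Brute-force maximum-matching breakpoint distance (tiny inputs only).

    Markers with the same absolute value form a family. In every family,
    min(#copies in g, #copies in h) copies are matched one-to-one, the
    unmatched copies are deleted, matched pairs get fresh unique labels,
    and the smallest breakpoint count over all matchings is returned.
    """
    families = sorted({abs(x) for x in g} & {abs(x) for x in h})
    options = []
    for f in families:
        pos_g = [i for i, x in enumerate(g) if abs(x) == f]
        pos_h = [j for j, x in enumerate(h) if abs(x) == f]
        k = min(len(pos_g), len(pos_h))
        # Every way to pair k copies in g with k copies in h.
        options.append([tuple(zip(c, p))
                        for c in combinations(pos_g, k)
                        for p in permutations(pos_h, k)])
    best = None
    for choice in product(*options):
        pairs = [pair for family in choice for pair in family]
        label_g = {i: n + 1 for n, (i, _) in enumerate(pairs)}
        label_h = {j: n + 1 for n, (_, j) in enumerate(pairs)}
        # Relabel matched copies, keeping each copy's original sign.
        g2 = [label_g[i] * (1 if g[i] > 0 else -1) for i in sorted(label_g)]
        h2 = [label_h[j] * (1 if h[j] > 0 else -1) for j in sorted(label_h)]
        d = breakpoints(g2, h2)
        best = d if best is None else min(best, d)
    return best


if __name__ == "__main__":
    # Marker 1 is duplicated; the two possible matchings give 3 and 1
    # breakpoints, so the distance is the smaller value.
    print(mbd([1, 2, 1, 3], [2, 1, 3, 1]))
```

The enumeration grows factorially with the number of copies per family, which is one reason an exact approach for realistic inputs would need something like the ILP formulation the abstract mentions.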

Parallel Keywords

algorithm; scaffolding problem; maximum-matching model; integer linear programming; bioinformatics; next-generation sequencing
