
根據最多配對模式解決Scaffolding問題之研究

The Study of Solving Scaffolding Problem Based on Maximum-matching Model

Advisor: 盧錦隆
The full text will be available for download on 2024/08/28.

Abstract


Scaffolding is an important step in DNA sequencing: it determines the order and orientation of the contigs in a target draft genome. Accurate scaffolding helps in subsequently obtaining a more complete genome sequence. Our laboratory previously developed CSAR, a rearrangement-based scaffolding tool that scaffolds a target draft genome against a complete or incomplete reference genome. CSAR's main limitation, however, is that the conserved sequence markers between the target and reference genomes must be non-duplicated. In practice, duplicate sequence markers are very common in the genomes of species. In this thesis we therefore use the notion of maximum-matching breakpoint distance (MBD) to define an MBD-based scaffolding problem, whose goal is to determine scaffolds of the target and reference genomes such that the maximum-matching breakpoint distance between the two scaffolds is minimized. We also design an exact algorithm for this problem using integer linear programming (ILP). Finally, our experiments on simulated and real data show that our MBD-based scaffolding algorithm is more accurate when it takes duplicate markers into account than when it does not. The EBD-based scaffolding algorithm outperforms our MBD-based scaffolding algorithm on simulated data, but on real data our MBD-based scaffolding algorithm outperforms the EBD-based one. In addition, our MBD-based scaffolding algorithm is slightly more accurate than CSAR, but CSAR runs far faster.

Parallel Abstract


Scaffolding is an important step in the process of DNA sequencing. Its purpose is to determine the order and orientation of the contigs of a draft genome. Accurate scaffolding helps in obtaining a more complete genome sequence in subsequent steps. Our laboratory previously developed CSAR, a rearrangement-based scaffolding tool that can scaffold a target draft genome based on a complete or incomplete reference genome. However, the main limitation of CSAR is that the conserved sequence markers between the target and reference genomes must be singletons, that is, each marker may occur only once. In fact, duplicate sequence markers are very common in the genomes of species. In this thesis, therefore, we use the concept of maximum-matching breakpoint distance (MBD) to define an MBD-based scaffolding problem, which is to determine scaffolds of the target and reference genomes such that the maximum-matching breakpoint distance between the resulting scaffolds is minimized. In addition, we use integer linear programming (ILP) to design an exact algorithm for the MBD-based scaffolding problem. Finally, our experimental results on simulated and real datasets show that our MBD-based scaffolding algorithm is more accurate when duplicate markers are considered than when they are not. The EBD-based scaffolding algorithm is more accurate than our MBD-based scaffolding algorithm on simulated datasets, but our MBD-based scaffolding algorithm outperforms the EBD-based scaffolding algorithm on real datasets. Moreover, our MBD-based scaffolding algorithm is slightly more accurate than CSAR, but CSAR is much faster than our MBD-based scaffolding algorithm in terms of running time.
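To make the objective concrete, the following is a minimal brute-force sketch of a maximum-matching breakpoint distance and of scaffolding against it. It is not the thesis's ILP formulation, which the abstract does not spell out. It assumes the definition commonly used in the literature: occurrences of each marker family are paired one-to-one between the two genomes, as many as possible, unmatched occurrences are dropped, and the matching that yields the fewest breakpoints is kept. Genomes are assumed to be linear lists of nonzero signed integers, chromosome ends are ignored, and the reference is assumed to be a single complete sequence. The function names `breakpoints`, `mbd` and `scaffold` are hypothetical.

```python
from collections import defaultdict
from itertools import permutations, product


def breakpoints(g1, g2):
    """Count adjacencies of g1 that are not conserved in g2 (signed, linear, ends ignored)."""
    conserved = set()
    for a, b in zip(g2, g2[1:]):
        conserved.add((a, b))
        conserved.add((-b, -a))  # the same adjacency read on the reverse strand
    return sum((a, b) not in conserved for a, b in zip(g1, g1[1:]))


def positions_by_family(genome):
    """Map each marker family (absolute value) to the positions where it occurs."""
    pos = defaultdict(list)
    for i, marker in enumerate(genome):
        pos[abs(marker)].append(i)
    return pos


def family_matchings(p1, p2):
    """All maximum one-to-one pairings between two lists of positions."""
    if len(p1) <= len(p2):
        return [list(zip(p1, perm)) for perm in permutations(p2, len(p1))]
    return [list(zip(perm, p2)) for perm in permutations(p1, len(p2))]


def mbd(g1, g2):
    """Brute-force minimum breakpoint count over all maximum matchings (exponential time)."""
    pos1, pos2 = positions_by_family(g1), positions_by_family(g2)
    shared = sorted(set(pos1) & set(pos2))
    choices = [family_matchings(pos1[f], pos2[f]) for f in shared]
    best = None
    for combo in product(*choices):
        label1, label2, next_id = {}, {}, 1
        for pairs in combo:
            for i, j in pairs:  # give every matched pair a fresh, unique label
                label1[i] = label2[j] = next_id
                next_id += 1
        # Keep matched markers only, in genome order, with their original signs.
        h1 = [label1[i] * (1 if g1[i] > 0 else -1) for i in sorted(label1)]
        h2 = [label2[j] * (1 if g2[j] > 0 else -1) for j in sorted(label2)]
        d = breakpoints(h1, h2)
        best = d if best is None else min(best, d)
    return best


def scaffold(contigs, reference):
    """Try every contig order and orientation; return (distance, order, flips) of a best layout."""
    best = None
    for order in permutations(range(len(contigs))):
        for flips in product((False, True), repeat=len(contigs)):
            layout = []
            for idx, flip in zip(order, flips):
                contig = contigs[idx]
                layout.extend([-m for m in reversed(contig)] if flip else contig)
            d = mbd(layout, reference)
            if best is None or d < best[0]:
                best = (d, order, flips)
    return best
```

For example, `scaffold([[3, 1], [1, 2]], [1, 2, 3, 1])` searches all orders and orientations of two contigs that share the duplicated marker 1. The enumeration grows factorially with the number of contigs and duplicates, which is one reason an exact method would rely on an ILP solver rather than exhaustive search.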

