利用基因體重組決定序列片段的次序與方向

DNA定序技術的持續進步使得大量的基因體草圖能以較低的成本被快速地定序出來。然而這些基因體草圖通常僅是一些被部分定序成未組裝的序列片段 (又稱 contigs) 的集合，這些 contigs彼此之間的位置和方向在原本的基因體上仍是未知的。因為這樣的不完整性，這些 contigs 便無法直接地被目前現存的演算法用以研究基因體上的結構變異、演化關係的重建以及其它生物學上的應用。為了得到基因體草圖上更完整的序列，其 contigs 需要被準確地定序和定向成更長且包含空隙的序列，稱為 scaffolds，如此在其 contigs 之間的空隙便可以被後續的空隙閉合程序給正確地填補。Scaffolding 方法的其中一種為所謂參考式的 scaffolding，其是根據參考基因體來定序和定向基因體草圖中的 contigs。本論文中，我們首先把參考單一完整基因體的 scaffolding 定義為一個在考慮反轉以及區塊互換距離之下的單邊區塊 (或 contig) 排序問題。我們利用代數學中的排列群設計了一個有效率的演算法能在 O(δn) 的時間內來解決單邊區塊排序問題，其中 n 是基因 (或基因片段) 個數，而 δ 是所使用到的反轉和區塊互換的次數。此外，我們已經將這演算法開發出一個稱為 CAR 的 scaffolding 工具，它可以根據一個完整的參考基因體來有效率且更加準確地 scaffold 目標基因體草圖中的 contigs。在模擬和實際的資料上，我們的實驗結果顯示 CAR 在敏感度、精準度以及基因體覆蓋率的評估標準上確實比許多其它類似為參考單一完整基因體的 scaffolding 工具有更好的表現。另一方面，如果目標基因體與參考基因體之間有著重組事件或者是它們之間的演化關係較為遙遠，則參考單一完整基因體的 scaffolding 工具可能會產生錯誤的 scaffolds。這可能意謂著單一個參考基因體並不足以產生基因體草圖正確的 scaffolds。因此，我們設計出一個啟發式演算法來進一步改良參考單一完整基因體的 scaffolding 工具 CAR 成為另一個新的工具 Multi-CAR，它可以利用多個完整的參考基因體對於目標基因體草圖中的 contigs 進行更準確的 scaffolding。我們的實驗結果在使用多個完整參考基因體的實際資料上顯示，Multi-CAR 在敏感度、精準度、基因體覆蓋率、scaffold 個數以及 scaffold N50 的數值上皆勝過其它兩個參考多個基因體的 scaffolding 工具 Ragout 和 MeDuSa。在實際的使用上，完整的參考基因體並非總能被取得用來 scaffold 一個基因體草圖。因此，我們繼續將 CAR 以及 Multi-CAR 改進成新的單參考式及多參考式的 scaffolding 工具 CSAR 以及 Multi-CSAR，其各別可以使用單一個非完整的以及多個非完整的基因體來作為參考，用以 scaffold 目標基因體草圖。最後實際資料的實驗結果顯示 CSAR 以及 Multi-CSAR 在多個準確度的評估標準上，各別都表現得比其它類似的 scaffolding 工具來得好。此外，為了方便使用以及視覺化地驗證 scaffolding 的結果，我們開發出了 CSAR 的網頁服務版本 CSAR-web，它提供了使用者一個操作簡易的介面來執行 CSAR 並將 scaffolding 的結果以圖形化的方式來呈現。

關鍵字

演算法；生物資訊；基因體；序列片段；支架；定序與定向

並列摘要

Continuing advances in DNA sequencing allow an increasing number of draft genomes to be produced rapidly in a decreasing cost. However, these draft genomes usually are just partially sequenced as collections of unassembled contigs whose relative positions and orientations along the genome being sequenced are still unknown. Due to the incompleteness, these contigs cannot be used directly by currently existing algorithms for studying their genome structural variation, phylogeny reconstruction, and other biological applications. To obtain a more complete sequence of a draft genome, its contigs are needed to be accurately ordered and oriented into larger gap-containing sequences, called scaffolds, so that the gaps between scaffolded contigs can be correctly filled in the subsequent gap-closing process. One of scaffolding approaches is the so-called reference-based scaffolding, which is to scaffold the contigs of a draft genome based on reference genomes. In this thesis, we first formulate the single complete reference-based scaffolding as one-sided block (or contig) ordering problem under weighted reversal and block-interchange distance. By using permutation groups in algebra, we design an efficient algorithm to solve this one-sided block ordering problem in O(δn) time, where n is the number of genetic markers (or genes) and δ is the number of used reversals and block-interchanges. In addition, we have developed this algorithm into a scaffolding tool called CAR that can efficiently and more accurately scaffold the contigs of a target draft genome based on a complete reference genome. Our experimental results on simulated and real datasets have also shown that CAR indeed performs better than many other similar single reference-based scaffolding tools in terms of sensitivity, precision and genome coverage. On the other hand, single reference-based scaffolding tools may produce erroneous scaffolds if there are rearrangements between the target and reference genomes or their phylogenetic relationship is distant. This may suggest that a single reference genome may not be sufficient to produce correct scaffolds of a draft genome. Therefore, we design a heuristic method to further revise our single reference-based scaffolding tool CAR into a new one called Multi-CAR that can utilize multiple complete genomes as references to more accurately scaffold the contigs of a draft genome. Our experimental results on real datasets with multiple complete reference genomes have shown that Multi-CAR outperforms other two multiple reference-based scaffolding tools Ragout and MeDuSa in terms of sensitivity, precision, genome coverage, scaffold number and scaffold N50 size. In practical usage, complete reference genomes are not always available for a draft genome to be scaffolded. Therefore, we continue to improve CAR and Mutli-CAR into new single and multiple reference-based scaffolding tools CSAR and Multi-CSAR that can respectively take single and multiple incomplete genomes as references to scaffold a target draft genome. Our experimental results on real datasets have finally shown that CSAR and Multi-CSAR respectively outperforms other similar single and multiple reference-based scaffolding tools in terms of many accuracy metrics. In addition, for the convenient usage and the visual validation of scaffolding results, we have developed a web server version of CSAR called CSAR-web that provides the users with an easy-to-operate interface to run CSAR and outputs its scaffolding result in a graphical mode.

並列關鍵字

Algorithm ； Bioinformatics ； Genome ； Contig ； Scaffold ； Ordering and orientation

參考文獻

[1] E. R. Mardis, The impact of next-generation sequencing technology on genetics, Trends in Genetics, 24 (2008) 133–141.

Google Scholar

[2] J. Shendure and H. Ji, Next-generation DNA sequencing, Nature Biotechnology, 26 (2008) 1135–1145.

Google Scholar

[3] M. L. Metzker, Sequencing technologies - the next generation, Nature Reviews Genetics, 11 (2010) 31–46.

Google Scholar

[4] M. Pop, Genome assembly reborn: recent computational challenges, Briefings in Bioinformatics, 10 (2009) 354–366.

Google Scholar

[5] M. Hunt, C. Newbold, M. Berriman and T. D. Otto, A comprehensive evaluation of assembly scaffolding tools, Genome Biology, 15 (2014) R42.

Google Scholar

國際替代計量

利用基因體重組決定序列片段的次序與方向

全文下載

主題瀏覽