Comparison of Simulation Software for Next-Generation Sequencing Data

Comparison of Next Generation Sequencing Simulators

Abstract


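High-throughput data from next-generation sequencing (NGS) technologies have in recent years been widely used in basic research on genes and transcriptomes, and have accelerated crop genetic domestication and marker-assisted breeding. Because these technologies produce enormous volumes of sequence data, appropriately simulating NGS results before the actual sequence analysis makes it possible to estimate the required coverage depth, whether sequence assembly is necessary, and the workflow for subsequent molecular marker development, so that a more accurate and efficient development strategy can be drawn up and later validation costs reduced. Many NGS data simulators are currently available. This study used the Escherichia coli genomic DNA sequence and the DNA sequence of rice chromosome 5 as source data, and evaluated five simulators (ART, FlowSim, MetaSim, SimSeq, and wgsim) on simulated Roche/454 and Illumina sequencing data through sequence assembly and sequence alignment. MetaSim and wgsim required the shortest computation time for simulating Roche/454 and Illumina data, respectively, making them the most efficient tools. For smaller genomes such as E. coli, all simulators performed similarly to real data in assembly and alignment; ART's simulation of the longer Roche/454 reads was closer to real data, and SimSeq's simulated Illumina data were closest to real data in maximum N50 length and coverage rate. For larger genomes such as rice, most simulators gave assembly results more optimistic than the real ones; among them, the N50, contig length, and total assembled length (k-mer = 37-43) from ART-simulated data were closer to the assembly of real data. By evaluating simulation time, assembly results, and alignment results, this study enables an objective comparison of simulators for the Roche/454 and Illumina platforms. The results can serve as a reference for crop genome resequencing, genome sequencing of orphan crops, and transcriptome research, strengthening R&D capacity under limited resources.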

Parallel Abstract


High-throughput next-generation sequencing technologies (NGST) have been widely adopted in genomic and transcriptomic research. NGST has also accelerated crop domestication and marker-assisted selection in plant breeding. When the budget is limited, it is often necessary to use simulations to estimate the minimal coverage that yields a sufficient amount of sequencing data and to develop efficient strategies for discovering molecular markers. Five simulators, ART, FlowSim, MetaSim, SimSeq, and wgsim, have been proposed to mimic the data generated by the Roche/454 and Illumina NGST platforms. Using E. coli whole genomes and rice (Oryza sativa) chromosome 5 as references, we simulated sequencing results with each of the five simulators. The simulators were compared on running time, genome assembly results, and sequence alignment results. MetaSim and wgsim had the shortest running times when simulating Roche/454 and Illumina data, respectively. For E. coli, all simulators yielded genome assembly and sequence alignment results similar to those from real sequencing data. Among them, ART and SimSeq performed best in simulating Roche/454 and Illumina data, respectively. When simulating rice sequencing data, most simulators yielded more mappable reads and higher coverage rates than the real data. ART was the most comparable with the real data. In conclusion, this study proposes ways to evaluate simulated Roche/454 and Illumina sequencing data, which can be consulted for genome resequencing, de novo sequencing, and transcriptomic studies under a limited budget.
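The abstract refers to two quantities that are easy to state in code: the number of reads needed to reach a target coverage, and the N50 of an assembly. The following is a minimal sketch, not the authors' evaluation pipeline. The function names `reads_for_coverage` and `n50` are hypothetical, coverage is assumed to follow the usual approximation (number of reads × read length / genome size), and N50 is assumed to be the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the total assembled length.

```python
import math


def reads_for_coverage(genome_size, read_length, target_coverage):
    """Return the smallest read count whose total bases reach the target coverage.

    Assumes coverage = reads * read_length / genome_size (illustrative approximation).
    """
    return math.ceil(target_coverage * genome_size / read_length)


def n50(contig_lengths):
    """Return the N50 of an assembly given its contig lengths.

    Contigs are sorted from longest to shortest; N50 is the length of the
    contig at which the running total first reaches half the assembly length.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running_total = 0
    for length in lengths:
        running_total += length
        if running_total >= half_total:
            return length
    raise ValueError("contig_lengths must not be empty")


if __name__ == "__main__":
    # Hypothetical inputs: a 4.6 Mb genome, 100 bp reads, 30x target coverage.
    print(reads_for_coverage(4_600_000, 100, 30))
    # Hypothetical contig lengths from an assembly.
    print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```

Computing these values for assemblies of simulated and real reads gives one simple way to check whether a simulator's output is more optimistic than real data, in the sense described above.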