An Error Correction Algorithm for Next-Generation Sequencing Based on the MapReduce Big Data Framework

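The rapid development of next-generation sequencing (NGS) technology has led to explosive growth of ultra-large-scale data and a variety of computational problems in de novo genome assembly. Greater sequencing depth and increasingly long reads imply more sequencing errors and raise the likelihood of misassembly. Beyond imposing a very high disk I/O load, the huge data volume slows computation and makes execution time hard to predict accurately. To speed up the time-consuming assembly process without compromising assembly quality, and to address problems related to error correction, we focus on using cloud computing to improve the algorithm design, architecture design, and implementation of de novo genome assembly for NGS data. Errors in sequencing data lower assembly quality and produce fragmented contigs. We therefore propose a cloud-computing-based error correction algorithm that draws on the error correction design of ALLPATHS-LG so that it corrects errors conservatively and avoids false corrections. To shorten execution time by reducing the heavy disk I/O load, we propose a message control strategy called the read-message diagram, which represents the intermediate data structures that must be generated from reads during computation. We also develop several control schemes to shrink the volume of intermediate data and thereby reduce the number of disk I/O operations. We have implemented the proposed error correction algorithm on the MapReduce cloud computing framework and evaluated its performance with state-of-the-art tools. The proposed message control strategy also succeeds in reducing the intermediate data volume and speeding up execution. As a result, we not only significantly reduce the time required by the assembly pipeline but also improve assembly quality. This dissertation proposes and implements algorithms and architectural designs that speed up de novo genome assembly and improve its quality. These results are a valuable reference for bioinformatics fields that can build applications on NGS big data, such as transcriptomics, metagenomics, pharmacogenomics, and precision medicine.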

Keywords

de novo genome assembly; MapReduce; big data; next-generation sequencing; error correction

Parallel Abstract

The rapid advancement of next-generation sequencing (NGS) technology has produced explosive growth in ultra-large-scale data and in the associated computational problems, particularly in de novo genome assembly. Greater sequencing depths and increasingly long reads introduce more errors, which raise the probability of misassembly. The huge amounts of data cause severe disk I/O overhead and lead to long, hard-to-predict execution times. To speed up the time-consuming assembly process without affecting its quality, and to address problems pertaining to error correction, we focus on improving the algorithm design, architecture design, and implementation of NGS de novo genome assembly based on cloud computing. Errors in sequencing data result in fragmented contigs and thus in assemblies of poor quality. We therefore propose an error correction algorithm based on cloud computing. The algorithm emulates the design of the error correction algorithm of ALLPATHS-LG and corrects errors conservatively to avoid false decisions. To reduce execution time by cutting the massive disk I/O overhead, we introduce a message control strategy, the read-message (RM) diagram, which represents the structure of the intermediate data generated along with each read. We then develop several schemes that trim off portions of the RM diagram to shrink the intermediate data and thereby reduce the number of disk I/O operations. We have implemented the proposed algorithms on the MapReduce cloud computing framework and evaluated them using state-of-the-art tools. The RM method reduces the intermediate data size and speeds up execution. The proposed algorithms improve not only the execution time of the pipeline but also the quality of the assembly. This dissertation presents algorithms and architectural designs that reduce execution time and improve the quality of de novo genome assembly. These studies are valuable for the further development of NGS big data applications in bioinformatics, including transcriptomics, metagenomics, pharmacogenomics, and precision medicine.

Parallel Keywords

big data; de novo genome assembly; error correction; next-generation sequencing; MapReduce

