透過您的圖書館登入
IP:3.21.248.119
  • 學位論文

基於運用映射歸納框架建立次世代定序資料之後綴陣列與最長共同前綴陣列

Constructing Suffix Array and Longest-Common-Prefix Array for Next-Generation-Sequencing Data Using MapReduce Framework

指導教授 : 李德財
若您是本文的作者,可授權文章由華藝線上圖書館中協助推廣。

並列摘要


Next-generation sequencing (NGS) data is rapidly growing and represents a source of varieties of new knowledge in science. State-of-the-art sequencers, such as HiSeq 2500, can generate up to 1 trillion base-pairs of sequencing data in 6 days, with good quality at low cost. In genome sequencing projects today, the NGS data size often ranges from tens of billions base-pairs to several hundreds of billions base-pairs. It is time-consuming to process such a big set of NGS data, especially for applications based on sequence alignment, e.g., de novo genome assembly and correction of sequencing errors. In literature, suffix array, longest common prefix (LCP) array and Burrows-Wheeler Transform (BWT) have been proved to be efficient indexes to speed up manifold sequence alignment tasks. For example, the all-pairs suffix-prefix matching problem, i.e., finding overlaps of reads to form the overlap graph for sequence assembly, can be solved in linear time by reading these arrays. However, constructing those arrays for NGS data remains challenging due to the huge amount of storage required to hold the suffix array. MapReduce is a promising alternative to tackle the NGS challenge, but the existing MapReduce method of suffix array construction, i.e., RPGI proposed by Menon et al [1] can only deal with input strings of size no greater than 4G base pairs and does not give LCPs in its output. In the study, we developed a MapReduce algorithm to construct suffix and BWT arrays, as well as LCP array, for NGS data based on the framework of RPGI. In addition, the proposed method supports inputs with more than 4G base-pairs and is developed into new software. To evaluate its performance, we compare the time it takes to process subsets of the giant grouper NGS data set of size 125Gbp.

參考文獻


[9] Chang, Y. J., Chen, C. C., Chen, C. L., & Ho, J. M. (2012). A de novo next generation genomic sequence assembler based on string graph and MapReduce cloud computing framework. BMC genomics, 13(Suppl 7), S28.
[4] Abouelhoda, M. I., Kurtz, S., & Ohlebusch, E. (2004). Replacing suffix trees with enhanced suffix arrays. Journal of Discrete Algorithms, 2(1), 53-86.
[5] Manber, U., & Myers, G. (1993). Suffix arrays: a new method for on-line string searches. siam Journal on Computing, 22(5), 935-948.
[8] Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms for next-generation sequencing data. Genomics, 95(6), 315-327.
[1] Menon, R. K., Bhat, G. P., & Schatz, M. C. (2011, June). Rapid parallel genome indexing with MapReduce. In Proceedings of the second international workshop on MapReduce and its applications (pp. 51-58). ACM.

延伸閱讀