基於運用映射歸納框架建立次世代定序資料之後綴陣列與最長共同前綴陣列

Next-generation sequencing (NGS) data is rapidly growing and represents a source of varieties of new knowledge in science. State-of-the-art sequencers, such as HiSeq 2500, can generate up to 1 trillion base-pairs of sequencing data in 6 days, with good quality at low cost. In genome sequencing projects today, the NGS data size often ranges from tens of billions base-pairs to several hundreds of billions base-pairs. It is time-consuming to process such a big set of NGS data, especially for applications based on sequence alignment, e.g., de novo genome assembly and correction of sequencing errors. In literature, suffix array, longest common prefix (LCP) array and Burrows-Wheeler Transform (BWT) have been proved to be efficient indexes to speed up manifold sequence alignment tasks. For example, the all-pairs suffix-prefix matching problem, i.e., finding overlaps of reads to form the overlap graph for sequence assembly, can be solved in linear time by reading these arrays. However, constructing those arrays for NGS data remains challenging due to the huge amount of storage required to hold the suffix array. MapReduce is a promising alternative to tackle the NGS challenge, but the existing MapReduce method of suffix array construction, i.e., RPGI proposed by Menon et al [1] can only deal with input strings of size no greater than 4G base pairs and does not give LCPs in its output. In the study, we developed a MapReduce algorithm to construct suffix and BWT arrays, as well as LCP array, for NGS data based on the framework of RPGI. In addition, the proposed method supports inputs with more than 4G base-pairs and is developed into new software. To evaluate its performance, we compare the time it takes to process subsets of the giant grouper NGS data set of size 125Gbp.

並列關鍵字

NGS data analysis ； suffix array ； LCP array ； BWT array ； de novo genome assembly

參考文獻

[9] Chang, Y. J., Chen, C. C., Chen, C. L., & Ho, J. M. (2012). A de novo next generation genomic sequence assembler based on string graph and MapReduce cloud computing framework. BMC genomics, 13(Suppl 7), S28.

[4] Abouelhoda, M. I., Kurtz, S., & Ohlebusch, E. (2004). Replacing suffix trees with enhanced suffix arrays. Journal of Discrete Algorithms, 2(1), 53-86.

[5] Manber, U., & Myers, G. (1993). Suffix arrays: a new method for on-line string searches. siam Journal on Computing, 22(5), 935-948.

[8] Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms for next-generation sequencing data. Genomics, 95(6), 315-327.

[1] Menon, R. K., Bhat, G. P., & Schatz, M. C. (2011, June). Rapid parallel genome indexing with MapReduce. In Proceedings of the second international workshop on MapReduce and its applications (pp. 51-58). ACM.

Google Scholar

國際替代計量

基於運用映射歸納框架建立次世代定序資料之後綴陣列與最長共同前綴陣列

主題瀏覽