透過您的圖書館登入
IP:3.135.217.228
  • 學位論文

以不重疊的近似重覆區段來壓縮去氧核醣核酸序列

DNAC: A Compression Algorithm for DNA Sequences by Non-overlapping Approximate Repeats

指導教授 : 李瑞庭

摘要


去氧核醣核酸序列內有許多未知的重覆性結構需要進一步去分析。在本篇論文中,我們利用這重覆的特性來達到壓縮序列的目的。因為去氧核醣核酸序列的字母種類只有四種,以致傳統的文字壓縮演算法並不適用。並且去氧核醣核酸序列的長度可達上千萬個鹼基對,使得壓縮演算法的時間和空間複雜度必須低到可接受的程度。 我們所提出的無失真壓縮演算法,是以搜尋最佳的不重疊近似重覆區段來壓縮序列。共可分為四個步驟:第一步以後置樹定位完全重覆區段。第二步用動態規劃來延伸為近似重覆區段以節省更多空間。第三步是提取出不重疊的重覆區段。最後以費氏序列來編碼所找到的重覆區段。根據實驗結果,我們提出的方法擁有較佳的壓縮率,同時所花費的計算時間也更短。

並列摘要


The repetitive structure of genomic DNA includes many secrets to be explored. In this thesis, we apply this repetitive structure to compress the DNA sequences. The existing compression tools are not suitable because its alphabet size is not large enough. In addition, the time and space complexity of DNA compression algorithm must be as low as possible because there can be millions of bases in a DNA sequence. We present a lossless compression algorithm, DNA Compression (DNAC), for genetic sequences by searching for non-overlapping approximate repeats. DNAC has 4 computation phases. In the first phase, it builds a suffix tree to locate exact repeats. In the second phase, all the exact repeats are extended into approximate repeats by dynamic programming in order to save more bits. In the third phase, we extract the optimal non-overlapping repeats from overlapping ones. In the last phase, we use the Fibonacci series to encode the repeats in a self-delimited way. According to our experimental results, DNAC not only achieves better compression ratio but also runs in almost linear time.

參考文獻


[1] Apostolico,A and Fraenkel,A.S. (1987). Robust transmission of unbounded strings using Fibonacci representations. IEEE Trans. Inform. Theory, vol. 33, no. 2, pp. 238-245.
[2] Chen,X., Kwong,S., and Li,M. (2001). A compression algorithm for DNA sequences. IEEE engineering in medicine and Biology, pp. 61-66.
[3] Chen,X., Li,M., Ma,B., and Tromp,J. (2002). DNACompress: fast and effective DNA sequence compression. Bioinformatics, vol. 18, no. 12, pp. 1696-1698.
[5] Donald,A., Adjeroh, Zhang.Y, Mukherjee,A., Powell,M., Bell,T. (2002). DNA Sequence Compression Using the Burrows-Wheeler Transform. IEEE Computer Society Bioinformatics Conference, pp. 303-313.
[6] Grumbach,S. and Tahi,F. (1994). A new challenge for compression algorithms: Genetic sequences. J. Inform. Process. Manage, vol. 30, no.6, pp. 875-866.

延伸閱讀