計算DNA序列間編輯距離及排比的快速演算法

近年來，生物科技成為世界各國發展的重要產業，其中DNA序列的分析在生物學上的應用也日趨重要。但是，DNA序列通常都非常的長，而且不同物種的序列也不斷地被攫取出來，如此龐大的資料量，使得利用資訊科學來幫助生物學家分析DNA序列的生物資訊學(Bioinformatics) 在近年來一直蓬勃發展。當生物學家攫取出一段新的DNA序列時，該如何分析它的功能呢？通常，生物學家會將這個陌生的DNA序列作為搜索序列，將它和一個包含許多已經研究透徹的DNA序列之資料庫作一個搜索的動作，找出序列組成和這個陌生序列較相似的一些序列。這樣做的原因，是因為DNA序列的組成方式相似度越高，通常代表功能越相近，或是來自親緣關係較相近的物種，所以我們可以藉由找到與陌生DNA序列相似度高的序列，進而推測陌生DNA序列的功能及親緣關係。而比較兩個DNA序列之間的相似度，可以透過找出兩序列間的編輯距離(edit distance)以及排比(alignment)來當作指標。編輯距離及排比可以用動態規劃的方法精準地求得，但是使用動態規劃的方法非常耗時，而且記憶體需求極大。因此，許多快速演算法應運而生。許多快速演算法著重於速度，但只能提出近似的結果。這篇論文中會先介紹一些有名的快速演算法，接著介紹我們提出的演算法。這篇論文可以分為兩個部份。第一部份著重於大範圍的資料庫搜索，其目的為找出資料庫中和所要搜索的序列相似度較高的序列，但結果並不十分精確。我們的方法是先將DNA序列轉換到數值域，接著利用離散相關函數的特性來找出兩個DNA序列之間較佳的對應位置，進而找出兩序列間相似度較高的區域，再對這段區域做更詳細的比對。第二部份為本篇論文重點。傳統上，使用動態規劃法來計算兩序列間的編輯距離以及排比的演算法能保證所找出來的結果是完全精準的，但使用動態規劃法除了非常的慢，記憶體的需求也非常的大。我們的快速演算法是以動態規劃法為雛型，減少許多不必要的運算，並且只使用非常少量的記憶體，便能求出精準的結果。值得注意的是，第二部份的演算法可用於對許多快速演算法所找出來的粗略結果作更進一步的分析。

關鍵字

資料庫搜尋；動態規劃；生物資訊；序列排比；編輯距離

並列摘要

A newly discovered DNA sequence is usually input as a query sequence to search the database, which contains many known sequences, for some sequences similar to the query sequence. The reason is that high similarity between two DNA sequences is thought that they are homology and they have similar functions. The edit distance and alignment between two DNA sequences are two elementary methods showing the similarity between two sequences. The dynamic programming method is mathematically optimal for these problems; however, it costs too much time and too much memory such that using a common PC for analyzing DNA sequences is impossible. Therefore, some fast heuristics are developed. This thesis can be divided into two parts. In the first part, we concentrate on the fast database search heuristics. Although the fast heuristics runs rapidly, the outcomes are usually not accurate. We propose the UDCR algorithm, which first maps the DNA sequences to the numeric domain and then employ the characteristics of the discrete correlation function to show the “better-aligned locations” between two sequences. According to the better-aligned locations, we then use the dynamic programming method for computing the edit distance (or similarity score) and the accompanying alignment between two sequences. In the second part, we propose a fast algorithm which aligns two sequences and computes the accurate edit distance. The proposed algorithm is based on the dynamic programming method accompanying with some rules to ignore unnecessary computations. Besides, the memory requirement is substantially reduced. It is noteworthy that the proposed algorithm can be used to further analyze the outcomes of the fast database search heuristics.

並列關鍵字

Database search ； Dynamic programming ； Bioinformatics ； Sequence alignment ； Edit distance

參考文獻

[1] J. D Watson, F. H. C. Crick, "Molecular structure of nucleic acids: A structure for deoxyribose nucleic acid. Nature, 171:737-738, 1953.

[2] Saenger, Wolfram. Principles of Nucleic Acid Structure. New York: Springer-Verlag, 1984.

[4] D. J. Lipman, W. R. Pearson, “Rapid and sensitive protein similarity searches,” Science 227, 1435-1441, 1985.

[5] D. J. Lipman, W. R. Pearson, “Improved tools for biological sequence comparison,” Proc. Natl. Acad. Sci. USA. 85, 2444-2448, 1988.

[6] J. P. Dumas, J. Ninio, “Efficient Algorithms for Folding and Comparison,” Nucl. Acids Res. 10, 197-206, 1982.

國際替代計量

計算DNA序列間編輯距離及排比的快速演算法

全文下載

主題瀏覽