快速整體與半整體比對演算法與基因序列分析

近年來，隨著生物科學的發達，越來越多的生物基因序列資料被建立成資料庫，其中基因序列的分析在生物科學上的應用也日趨重要。為了解決許多生物問題，生物學家利用生物資訊（bioinformatics）以及計算生物（computational biology）的分析工具來幫助分析龐大的資料庫，也使得生物資訊以及計算生物的發展更加刻不容緩。在諸多的生物資訊及計算生物的分析工具中，序列比對(sequence alignment)是一個基本且相當重要的分析工具，它可以用來比較及分析兩條或多條基因序列之間的相似程度。相似程度通常由編輯距離(edit distance)或是相似分數(similarity score)及各自的序列排比(sequence alignment)當作指標，相似度高的基因序列彼此之間會有相似的結構及功能，或是意味著它們是來自於親緣關係較相近的物種。因此，生物學家一旦發現一段未知功能的基因序列時，會利用序列比對工具來搜尋比對已研究透徹的基因序列資料庫，試圖找尋出相似的基因序列，藉此推測其生物功能。在生物資訊以及計算生物中，動態規劃(dynamic programming)是一個最常見且普遍的演算法，它除了可以精確地求得編輯距離或是相似分數，還可以很容易地找出相對應的序列排比。但是動態規劃演算法相當耗時，而且運算時需要相當大的記憶體。為了解決這些問題，許多演算法應運而生，這些演算法除了能夠精確地達到相同的結果，還能更加地有效率或是更能節省運算時的記憶體需求，此篇論文中也會提出我們的快速演算法。本篇論文主要分為兩個部分。第一個部分著重於介紹動態規畫及其應用，還有一些已經存在的快速演算法。第二個部分為本篇論文的重點，我們發展出一種快速演算法，它比對的結果可以跟動態規畫一樣精確，但是卻可以改進動態規畫演算法耗時間以及空間的缺點。我們的快速演算法會與動態規畫以及一些已知的快速演算法做比較，在程式運算量的表示圖以及程式運行的時間，皆可以明顯地看出我們的演算法在時間以及空間上都更加地有效率。

關鍵字

整體；半整體；演算法；序列分析；序列比對；基因序列

並列摘要

In recent years, with the development of biological science, there are more and more biological DNA sequences that are added into the database. Therefore, the analysis of DNA sequences and its application in the biological science are becoming increasingly important. In order to solve many biological problems, the analytical tools in bioinformatics and computational biology help biologists to analyze huge databases, and the developments of bioinformatics and computational biology become more urgent. The Sequence alignment, which can be used to compare the similarity between two or more DNA sequences, is an elementary and important analysis tool in bioinformatics and computational biology. The Edit distance and the similarity score are used as indicators of aligning DNA sequences. The DNA sequences with high similarity mean that they are homology and they have similar functions or structure. When biologists discovered a DNA sequence of unknown function, they will try to search the database, which contains many known DNA sequences, for some similar DNA sequences to speculate the biological function of the new one. The dynamic programming method is one of the most common and popular algorithms in bioinformatics and computational biology. It can obtain not only the edit distance or the similarity score between DNA sequences but also the corresponding sequence alignment. However, the dynamic programming method is very time-consuming and requires considerable memory to analyze DNA sequences. To solve these problems, many algorithms have emerged. These algorithms accurately achieve the same results, and they are more efficient in time and memory in addition. We propose our fast algorithm and its simulation results in this thesis. This thesis is divided into two parts. In the first part, we focus on introducing the dynamic programming method, some existing fast algorithms, and their applications. In the second part of this thesis, we develop a fast algorithm, which can be as accurate as the dynamic programming method, but it can reduce the waste of computation time and memory space. The proposed algorithm compare with the dynamic programming method and some of the known fast algorithms, and the results are presented in this part.

並列關鍵字

global alignment ； semiglobal alignment ； algorithm ； sequences analysis ； sequences alignment ； DNA sequence

參考文獻

[1] J. D Watson and F. H. C. Crick, “Molecular structure of nucleic acids: A structure for deoxyribose nucleic acid,” Nature, vol. 171, pp. 737-738, 1953.

[2] Saenger and Wolfram, Principles of Nucleic Acid Structure, New York: Springer-Verlag, 1984.

DNA sequence assembly

[4] F. Sanger, “Determination of nucleotide sequences in DNA,” Nobel lecture 8, December 1980.

DSP for Bioinformatics

國際替代計量

快速整體與半整體比對演算法與基因序列分析

全文下載

主題瀏覽