基因資料庫中相似與不相似型樣搜尋技術之研究

在本研究中，我們將討論從生物 DNA 序列資料庫中搜尋相似型樣(similar pattern)與不相似型樣(dissimilar pattern)的問題，並為這些問題設計快速、有效的搜尋演算法。隨著儲存在資料庫中的資料量以及使用者查詢次數的快速增加，可將不可能是答案之型樣快速、及早刪除的過濾演算法在解決相似型樣搜尋的問題上日益受到重視且益發重要。然而，現有以 gram 為基礎的過濾演算法皆沒有考慮 gram 在型樣上排列的順序，因而可能造成演算法過高的誤判率，導致過濾效果不彰的結果。在本研究中，我們將搜尋滿足指定誤差量(mismatch)與涵蓋率(coverage)條件之相似型樣的問題，轉換為有範圍限制之最長遞增子序列(longest increasing subsequence)搜尋問題，並提出稱之為 IDCF 的快速過濾演算法。由我們所得的實驗數據可看出， IDCF 演算法因將 gram 在型樣上出現的順序列入考慮，明顯的降低了相似型樣之候選資料數量，有效提升了搜尋效能。不相似型樣可用做分辨資料庫中不同序列的特徵(signature)。為了達到快速搜尋不相似型樣的目的，我們設計了稱之為 IMUS 與 USD 的兩個特徵型樣搜尋演算法。其中， IMUS 演算法可用以處理可全數被載入內部記憶體進行處理的小型生物序列資料，而 USD 演算法則使用 IMUS 演算法為處理核心，是一個專為處理資料量過大、無法全數被載入內部記憶體的大型資料庫而設計的演算方法。根據我們的實驗結果可發現， IMUS 與 USD 演算法在處理過程中所需的字元比對次數很明顯的少於現有之特徵型樣搜尋演算法。在一台普通的個人電腦上， IMUS 與 USD 演算法可在一天內完成從資料量為156MB的人類11號染色體序列中搜尋出不相似特徵型樣的工作，而在人類Y染色體序列上搜尋特徵型樣的工作則僅需35秒即可完成。特徵型樣搜尋演算法通常需要設定包括型樣長度 l 與誤差容忍度 d 在內的參數，這些參數值的設定將影響搜尋所得之結果。然而，如何為搜尋設定適當的參數值之建議與準則卻很少被提及，尤其是在處理不熟悉的資料庫時，此一問題更加嚴重。在大多數的情況下，生物學家只能先依據過去的經驗甚至是猜測的方式來設定型樣搜尋演算法之參數值，如果因此而得到的結果無法令人滿意，就嘗試其他的參數組合，並重新進行搜尋，上述嘗試錯誤的過程一再重複，直到找出令人滿意的結果為止。對於指定的搜尋條件(l, d)，我們將所有長度小於等於 l 且誤差容忍度大於等於 d 的特徵型樣稱之為該搜尋條件下的隱含特徵型樣(implicit signature)。如果搜尋演算法可以快速、有效的將所有滿足使用者需求的隱含特徵型樣全數找出，當能改善重複地嘗試錯誤之狀況，對生物學家亦將有所幫助，但現有的特徵型樣搜尋演算法在設計時卻未將搜尋隱含特徵型樣的需求考慮在內，因而搜尋效果並不理想。在本研究中，我們提出兩個特徵型樣搜尋演算法： Consecutive Multiple Discovery (CMD) 以及 Parallel and Incremental Signature Discovery (PISD)演算法。其中， PISD 演算法是專為在指定的搜尋條件下找出特徵型樣而設計的快速演算法它採用了漸進式搜尋(incremental discovery)的概念，以既有、已知的結果做為候選資料，並由候選資料中尋找新的結果，而不需針對資料庫的全部內容進行搜尋，另外，此演算法引入了平行運算的技術，以加快特徵型樣之搜尋；而 CMD 則是專為搜尋隱含特徵型樣而設計的演算法，它採用 PISD 演算法做為處理核心，以便在各個特定的搜尋條件下，以漸進的搜尋方式，快速找出所有的隱含特徵型樣。我們所提出的漸進式演算法確實可快速、有效地從資料庫中搜尋出隱含特徵型樣，由實驗數據可知，當使用8個處理器時， CMD 演算法可較傳統循序搜尋演算法節省超過97%的執行時間。

關鍵字

相似型樣搜尋；不相似型樣搜尋；隱含特徵型樣搜尋

並列摘要

In this study, the problems of discovering similar and dissimilar patterns in sequence databases are discussed, and the efficient algorithms for the discovery are designed. With exponentially increasing database size and number of queries, the filtration approach, which filters out impossible patterns to accelerate similar pattern discovery, becomes more and more important in bioinformatics. However, the order of the gramsin sequences does not be considered in most of the known gram-based filtration approaches in literature so that higher false-positives would be conducted. In this study, the task of extracting similar patterns under a certain coverage level and error tolerance is transformed to a longest increasing subsequence problem with range constraints, and an efficient algorithm, Incremental Decreasing Cover Filtering (IDCF) algorithm, is designed for the filtration. Experimental results show that the IDCF algorithm significantly reduces the number of candidates for similar pattern discovery. Dissimilar patterns can be used as unique signatures to distinguish a sequence from the other sequences in a database. To achieve efficient unique signatures discovery in genomic databases, two efficient algorithms, IMUS and USD, are designed in this study. The IMUS algorithm is designed for handling a sequence database which can be loaded into internal memory, and the USD algorithm uses the IMUS algorithm as a kernel routine to handle large-scale databases. The results of our experiments present that the amount of character comparisons used in the IMUS and USD algorithms are significantly less than that of the existing discovery algorithms. On a regular PC platform, the IMUS and USD algorithms discover unique signatures from a human chromosome 11 EST database, 156M bases, within one day, and takes 35 seconds to discover signatures from a human chromosome Y EST database. The signature discovery algorithms require to set some input factors, such as signature length and mismatch tolerance , which affect the discovery results. However, suggestions about how to select proper factor values are rare, especially when an unfamiliar DNA database is used. In most cases, biologists typically select factor values based on experience, or even by guessing. If the discovered result is unsatisfactory, biologists change the input factors of the algorithm to obtain a new result. This process is repeated until a proper result is obtained. Implicit signatures under the discovery condition ( ) are defined as the signatures of length with mismatch tolerance . A discovery algorithm that could discover all implicit signatures, such that those that meet the requirements concerning the results, would be more helpful than one that depends on trial and error. However, existing discovery algorithms do not address the need to discover all implicit signatures. Two discovery algorithms, consecutive multiple discovery (CMD) algorithm and parallel and incremental signature discovery (PISD) algorithm, are proposed in this study. The PISD algorithm is designed for efficiently discovering signatures under a certain discovery condition. The algorithm finds new results by using previously discovered results as candidates, rather than by using the whole database. The PISD algorithm further increases discovery efficiency by applying parallel computing. The CMD algorithm is designed to discover implicit signatures efficiently. It uses the PISD algorithm as a kernel routine to discover implicit signatures efficiently under every feasible discovery condition. The presented CMD algorithm has up to 97% less execution time than typical sequential discovery algorithms in the discovery of implicit signatures in experiments, when eight processing cores are used.

並列關鍵字

similar pattern discovery ； dissimilar pattern discovery ； implicit signature discovery

參考文獻

expressed sequence tags and human genome project. Science, 252:1651–1656, 1991.

[2] M. D. Adams, M. B. Soares, A. R. Kerlavage, C. Fields, and J. C. Venter. Rapid cDNA

sequencing (expressed sequence tags) from a directionally cloned human infant brain cDNA

library. Nat. Genet., 4:373–380, 1993.

Efficient Filtration of String Proximity Search of Biological Sequences. Technical Report

國際替代計量

基因資料庫中相似與不相似型樣搜尋技術之研究

主題瀏覽