有效率的生物序列查詢的過濾方法

DNA序列查詢與非siRNA目標序列查詢，都是將查詢序列與生物序列資料庫中所有序列做比對，將與查詢序列差異在一定範圍內的資料庫序列找出，兩者都是生物資訊中序列相似性查詢的重要工具。由於序列資料庫中的序列都可能多達數百萬個鹼基對，若以序列並列分析的方式來直接掃描資料庫一一做比對，則其所產生的時間複雜度將非常大。因此，部分學者提出了查詢過濾演算法，試圖在查詢時的第一個步驟，就過濾掉資料庫中大部分差異較大的序列，只留下一小部分的候選序列。然而，這類演算法皆存在若干問題，包括候選序列個數過多、偽陰性(false negative)比例過高等等。在此兩種查詢中，DNA序列比對的查詢序列比較長，且大都使用編輯距離(edit distance)當作查詢的依據；非siRNA目標序列查詢，其查詢序列較短，通常大約有19~25鹼基對，且大都是使用漢明距離 (Hamming distance)當作查詢的依據。所以在本篇論文中，我們針對此兩種查詢分別提出有效率的過濾演算法: 對DNA序列比對我們提出一個名為「以轉換法為基礎之資料庫查詢過濾的演算法」(我們稱之為TDF)，此方法先以小波轉換將資料庫每條子序列轉成一個特徵向量，並建立索引，然後利用查詢序列的特徵向量與此索引過濾掉不可能符合查詢條件的序列，找出候選子序列。對非siRNA目標序列查詢，我們提出一個名為「以共通前置序列為基礎之資料庫查詢的過濾演算法」(我們稱之為CPF)，此方法先將資料庫每條子序列排序，將有共通前置序列集中在一起，並為此共通前置序列建立索引，然後利用查詢序列的前置序列與此索引過濾掉不可能符合查詢條件的序列，找出候選子序列。最後，對於每條候選子序列，再去計算該子序列與查詢序列之間的真正距離。實驗結果顯示，此兩種查詢過濾演算法均能有效過濾掉大部分的資料序列，同時保證沒有偽陰性。實驗結果顯示TDF優於QUASAR與YM，CPF優於YM與SoS方法。

關鍵字

序列查詢；生物序列資料庫；過濾方法； siRNA

並列摘要

In this dissertation, we propose two filtration methods for two sequence similarity queries of biological databases: DNA sequence similarity query and siRNA off-target query. Both queries are used to retrieve similar sequences from biological sequence databases. The DNA sequence similarity query is used to retrieve all sequences in a database so that the edit distance between the query and a sequence retrieved is less than a user-specified threshold, where the length of such a query is often long. It is mostly used to retrieve highly similar sequences. A small interfering RNA (siRNA), also called silencing RNA, is used to knock a gene down by an artificial mechanism. Although an siRNA is designed to silence a specific gene, many researchers have shown that the genes highly similar to the siRNA are also silenced or their expressions are depressed. An siRNA off-target query can be used to find those highly similar genes in a database, where the Hamming distance between the query and the sequence retrieved is less than a user-specified threshold. Its query length is usually short (normally 19~25 base pairs). For both queries, a filtration method can be used to screen out most unqualified data sequences from the database and leave only a small number of candidate sequences for sequence comparisons. For the DNA sequence similarity retrieval query, we propose a method called Transformation-based Database Filtration method (TDF). In the TDF method, the data sequences are first divided into several blocks, each of which is transformed into a feature vector by Haar wavelet transform and stored in an index file. Then, we search the index and extract those candidate blocks whose edit distance to the feature vector of the query sequence is less than a user-specified threshold. For the siRNA off-target query, we propose a method called Common Prefix Filtration method (CPF). In the CPF method, the data sequences are first sorted and stored in an index tree according to their common prefixes. Then, we search the index tree and filter out those sequences whose Hamming distance to the query sequence is greater than a user-specified threshold. We extract those possible candidate sequences and pass them to the verification stage. The experiment results show that our both filtration methods can filter out most unqualified sequences and guarantee no false negatives. The experimental results show that the TDF method outperforms the QUASAR and YM methods while the CPF method outperforms the YM and SoS methods.

並列關鍵字

sequence query ； biological sequence database ； filtration method ； siRNA

參考文獻

[1] Aghili, A., Agrawal, D., Abbadi, A. E., 2003. Filtration of string proximity search via transformation. In Proceedings of the Third IEEE International Symposium on Bioinformatics and Bioengineering, 149-157.

[4] Altschul S. F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.J., 1997. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Research 25, 3389-3402.

[5] Aluru. S., 2006. Handbook of Computational Molecular Biology. Chapman & Hall., New York, U.S.A.

[6] Amarzguioui, M., Holen, T., Babaie, E., Prydz, H., 2003. Tolerance for mutations and chemical modifications in a siRNA. Nucleic Acids Research 31, 589-595.

[9] Baeza-Yates,R., Perleberg,C., 1992. Fast and practical approximate string matching. Combinational Pattern Matching, 185-192.

國際替代計量

有效率的生物序列查詢的過濾方法

全文下載

主題瀏覽