生物字串資料庫上高效能索引技術之研究

隨著生物和電腦科技的進步，將這個相連的領域帶入到一個新的紀元。生物實驗室的發現多半在實質上可以幫助電腦科學家去增進他們的計算模式，相反的，電腦計算預測的結果可以方便我們將接下來需要驗證的範圍變小。對於這種知識發現的過程而言，當生物字串的數目以極快的速度成長，計算模式的效率是十分的重要。人類基因和它相關正在進行的計畫，對於計算的需求是十分大的。我們的研究主要也是針對這個論點。我們的研究主要是專注在改進字串對準的搜尋效率，具體而言，我們提出了一個高效能的索引方法，來充分發揮迄今最好的索引結構設計技術，儘管基於Smith-Waterman (SW) 的alignment演算法有眾所周知之優異的預測準確度，然而SW的方法卻很難直接運用在日益遽增的資料量上。OASIS*的設計是將SW的計算嵌入在suffix tree的索引結構中，使得OASIS在搜尋比對上可以很快，然而在空間上所需付的代價卻是十分可觀。為了修正這些缺失，我們發展一個在空間上有效率且簡單、更容易更新的索引結構，來快速的找出潛在與query string相符的字串，以利後續昂貴的SW驗證工作。事實上，我們的方法只需要付出大約佔整體10%的索引空間的代價，就可以避免許多無謂的驗證計算，這種先導的過濾架構具有一般性，它可以當作各種昂貴SW矩陣的安全閥（下限距離函數），同時其運用suffix tree的內部設計也繼承了良好的擴張性，無懼於資料之日益增長，我們利用完整的實驗結果來證實了所提出方法的特色。 *“An Online and Accurate Technique for Local-alignment Searches on Biological Sequences”

關鍵字

染色體比對

並列摘要

With the advances of Bioinformatics and computer technologies, the knowledge development in this joint field comes to a new era. More often than not, the laboratory discoveries of biologists substantially help refine the computing models used by computer scientists. In turn, the predictive results from computing facilitate confining the scope of the subsequent verification. As the amount of biological strings grows at an increasing speed, the efficiency of computing model is vital to this knowledge discovery process. Human Genome project and its ongoing work places a high demand on computing. This thesis aims at this issue. Our research focuses on improving the search efficiency of semi-global alignment problem. In particular, we propose a scalable solution to leverage the state-of-the-art index designs to date. Despite the wide acceptance in prediction accuracy, Smith-Waterman (SW) –based alignment algorithms are too expensive to scale. “An Online and Accurate Technique for Local-alignment Searches on Biological Sequences”(OASIS) scheme embeds SW to better search scalability using a suffix tree index. However, its space overhead is considerable. To address these drawbacks, we develop a space-efficient, easy-to-update index design to rapidly identify the potential matches to a query string before subsequent costly SW verification. As a result, most of futile computations can be avoided in our approach using only about 10% index space overhead. Such preliminary filtration framework is generic in nature. It serves as a safe lower bound to the distance functions for a variety of SW cost matrices. Yet, the usage of suffix trees in its design offers favorable scalability for the problem. Intensive experimental results substantiate these advantages of our technique.

並列關鍵字

DNA match

參考文獻

[Alt86] S. Altschul and B.W. Erickson, Optimal sequence alignment using affine gap costs. Journal of Molecular Biology, 48(5-6):603-616, 1986.

[Alt96] S. Altschul and W. Gish, Local Alignment Statistics. Methods Enzymol, 266:460-480, 1996.

[Alt97] S. Altschul, T.L. Madden, A.A. Schaffer, J. Zhang, Z. Zhang, W. Miller and D.J. Lipman, Gapped BLAST and PSI-BLAST: A New Generation of Protein Database Search Programs. Nucleic Acids Research, 25(17):3389-3402, 1997.

[Bae92] R.A. Baeza-Yates and C.H. Perleberg, Fast and practical approximate string matching. In Combinatorial Pattern Matching, 185-192, 1992.

[Bae99] R.A. Baeza-Yates and G. Nabarro, Faster Approximate string matching. Algorithmica, 23(2):127-158, 1999.

國際替代計量

生物字串資料庫上高效能索引技術之研究

主題瀏覽