生物字串資料庫上高效能索引技術之研究__國立清華大學博碩士論文全文影像系統

帳號：guest(18.222.230.126) 離開系統

字體大小：

詳目顯示

第 1 筆 / 共 1 筆

/1頁

以作者查詢圖書館館藏

、以作者查詢臺灣博碩士論文系統

、以作者查詢全國書目

論文基本資料
摘要
外文摘要
論文目次
參考文獻
電子全文

作者(中文):	林健紋
作者(外文):	Chien-Wen Lin
論文名稱(中文):	生物字串資料庫上高效能索引技術之研究
論文名稱(外文):	Scalable Index Techniques for Biological String Databases
指導教授(中文):	陳朝欽
指導教授(外文):	Chaur-Chin Chen
學位類別:	碩士
校院名稱:	國立清華大學
系所名稱:	資訊工程學系
學號:	934377
出版年(民國):	95
畢業學年度:	94
語文別:	英文、中文
論文頁數:	32
中文關鍵詞:	染色體比對
外文關鍵詞:	DNA match
相關次數:	推薦:0 點閱:146 評分: 下載:10 收藏:0

隨著生物和電腦科技的進步，將這個相連的領域帶入到一個新的紀元。生物實驗室的發現多半在實質上可以幫助電腦科學家去增進他們的計算模式，相反的，電腦計算預測的結果可以方便我們將接下來需要驗證的範圍變小。對於這種知識發現的過程而言，當生物字串的數目以極快的速度成長，計算模式的效率是十分的重要。人類基因和它相關正在進行的計畫，對於計算的需求是十分大的。我們的研究主要也是針對這個論點。
我們的研究主要是專注在改進字串對準的搜尋效率，具體而言，我們提出了一個高效能的索引方法，來充分發揮迄今最好的索引結構設計技術，儘管基於Smith-Waterman (SW) 的alignment演算法有眾所周知之優異的預測準確度，然而SW的方法卻很難直接運用在日益遽增的資料量上。OASIS*的設計是將SW的計算嵌入在suffix tree的索引結構中，使得OASIS在搜尋比對上可以很快，然而在空間上所需付的代價卻是十分可觀。為了修正這些缺失，我們發展一個在空間上有效率且簡單、更容易更新的索引結構，來快速的找出潛在與query string相符的字串，以利後續昂貴的SW驗證工作。事實上，我們的方法只需要付出大約佔整體10%的索引空間的代價，就可以避免許多無謂的驗證計算，這種先導的過濾架構具有一般性，它可以當作各種昂貴SW矩陣的安全閥（下限距離函數），同時其運用suffix tree的內部設計也繼承了良好的擴張性，無懼於資料之日益增長，我們利用完整的實驗結果來證實了所提出方法的特色。
*“An Online and Accurate Technique for Local-alignment Searches on Biological Sequences”

With the advances of Bioinformatics and computer technologies, the knowledge development in this joint field comes to a new era. More often than not, the laboratory discoveries of biologists substantially help refine the computing models used by computer scientists. In turn, the predictive results from computing facilitate confining the scope of the subsequent verification. As the amount of biological strings grows at an increasing speed, the efficiency of computing model is vital to this knowledge discovery process. Human Genome project and its ongoing work places a high demand on computing. This thesis aims at this issue.
Our research focuses on improving the search efficiency of semi-global alignment problem. In particular, we propose a scalable solution to leverage the state-of-the-art index designs to date. Despite the wide acceptance in prediction accuracy, Smith-Waterman (SW) –based alignment algorithms are too expensive to scale. “An Online and Accurate Technique for Local-alignment Searches on Biological Sequences”(OASIS) scheme embeds SW to better search scalability using a suffix tree index. However, its space overhead is considerable. To address these drawbacks, we develop a space-efficient, easy-to-update index design to rapidly identify the potential matches to a query string before subsequent costly SW verification. As a result, most of futile computations can be avoided in our approach using only about 10% index space overhead. Such preliminary filtration framework is generic in nature. It serves as a safe lower bound to the distance functions for a variety of SW cost matrices. Yet, the usage of suffix trees in its design offers favorable scalability for the problem. Intensive experimental results substantiate these advantages of our technique.

CHAPTER 1 INTRODUCTION 1
CHAPTER 2 RELATED WORK 6
2.1. MRS 8
2.2. RMRS 9
CHAPTER 3 SITBSD (SCALABLE INDEX TECHNIQUES FOR BIOLOGICAL STRING DATABASES) 12
3.1. SUFFIX TREE INDEX 12
3.2. OASIS 14
3.3. DV 16
3.4. ALGORITHM 20
CHAPTER 4 EXPERIMENTAL EVALUATION 24
CHAPTER 5 CONCLUDING REMARKS 30
REFERENCES 31

[Alt86] S. Altschul and B.W. Erickson, Optimal sequence alignment using affine gap costs. Journal of Molecular Biology, 48(5-6):603-616, 1986.
[Alt90] S. Altschul, W. Gish, W. Miller, E.W. Myers, and D.J. Lipma, Basic Local Alignment Search Tool. Journal of Molecular Biology, 215(3):403-410, 1990.
[Alt96] S. Altschul and W. Gish, Local Alignment Statistics. Methods Enzymol, 266:460-480, 1996.
[Alt97] S. Altschul, T.L. Madden, A.A. Schaffer, J. Zhang, Z. Zhang, W. Miller and D.J. Lipman, Gapped BLAST and PSI-BLAST: A New Generation of Protein Database Search Programs. Nucleic Acids Research, 25(17):3389-3402, 1997.
[Bae92] R.A. Baeza-Yates and C.H. Perleberg, Fast and practical approximate string matching. In Combinatorial Pattern Matching, 185-192, 1992.
[Bae97] R.A. Baeza-Yates and G. Navarro, A practical index for text retrieval allowing errors. In Conferencia Latinoamericana de Informática volume1, 273-282, 1997.
[Bae99] R.A. Baeza-Yates and G. Nabarro, Faster Approximate string matching. Algorithmica, 23(2):127-158, 1999.
[Ben00] D.A. Benson, I. Karsch-Mizrachi, D.J. Lipman, J. Ostell, B.A. Rapp, and D.L. Wheeler, Genbank. Nucleic Acids Research, 26(1):1–7, 2000.
[Dec02] A. Delcher, A. Phillippy, J. Carlton, and S.L. Salzberg, Fast Algorithms for Large-scale Genome Alignment and Comparison. Nucleic Acids Research, 30(11):2478-2483, 2002.
[Del99] A. Delcher, S. Kasif, R.D. Fleischmann, J. Peterson, O. White, and S.L. Salzberg, Alignment of Whole Genomes. Nucleic Acids Research, 27(11):2369-2376, 1999.
[Dur98] R. Durbin, S. Eddy, A. Krogh, and G. Mitchison, Biological sequence analysis. Cambridge University press, 1 edition, 12-34 1998.
[Fer01] P. Ferragina and G. Manzini, An Experimental Study of an Opportunistic Index. In Proc. of ACM-SIAM Symposium on Discrete Algorithms, 269-278, 2001.
[Gus97] D. Gusfield, Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University press, 1 edition, 1997.
[Hun03] E. Hunt, The Suffix Sequoia Index for Approximate String Matching. Department of Computing Science, University of Glasgow, Glasgow, UK, TR 2003-135, 2003.
[Kah01] T. Kahveci and A.K. Singh, An Efficient Index Structure for String Databases. Very Large Data Bases, 351-360, 2001.
[Kur01] S. Kurtz, J. V. Choudhuri, E. Ohlebusch, C. Schleiermacher, J. Stoye, and R. Giegerich, REPuter: The Manifold Applications of Repeat Analysis on a Genomic Scale. Nucleic Acids Research, 29(22):4633-4642, 2001.
[Man93] U. Manber and E. Myers, Suffix Arrays: A New Method for On-line String Searches. SIAM Journal on Computing, 22(5):935-948, 1993.
[Mee03] C. Meek, J.M. Patel, and S. Kasetty, OASIS: An Online and Accurate Technique for Local-alignment Searches on Biological Sequences. Very Large Data Bases, 910–921, 2003.
[Mye86] E. Myers, An O(ND) difference algorithm and its variations. Algorithmica, 251-266, 1986.
[Mye94] E. Myers, A sublinear algorithm for approximate keyword matching. Algorithmica, 345-374, 1994.
[Nas01] H. Nash and D. Blair, Comparing Algorithm for Largescale Sequence Analysis. Bioinformatic and Bioengineering, 89-96, 2001.
[Pea88] W.R. Pearson and D.J. Lipman, Improved Tools for Biological Sequence Comparison. Proceedings of the National academy of Sciences,85(8):2444-2448, 1988.
[Sah03] S.C. Sahinalp, M. Tasan, J. Macker, Z.M. Ozsoyoglu, , Distance Based Indexing for String Proximity Search. International Conference on Data Engineering, 125-136, 2003.
[She05] S.Sheu, A. Chang, and W. Hang, Fast Similarity Search in String Database. Advanced Information Networking and Applications, 617-622, 2005.
[Shp96] E.G. Shpaer, M. Robinson, D. Yee, J.D. Candlin, R. Mines, T. Hunkapiller., Sensitivity and Selectivity in Protein Similarity Searches: A Comparison of Smith-Waterman in Hardware to BLAST and FASTA. Genomics, 38:179-191, 1996.
[Smi81] T. Smith and M. Waterman, Identification of Common Molecular Subsequences. Journal of Molecular Biology, 147:195-197, 1981.
[Tat04] S. Tata, R.A. Hankins, and J.M. Patel, Practical Suffix Tree Construction. Very Large Data Bases, 36-47, 2004.
[Uli99] S. Uliel, A. Fliess, A. Amir, and R. Unger, A simple algorithm for detecting circular permutations in proteins. Bioinformatics, 15(11):930-936, 1999.
[Wu92] S. Wu, and U. Manber, Fast text searching allowing errors. Communications of the ACM, 35(10):83-91, 1992.

封面
摘要
誌謝
目錄
第一章
第二章
第三章
第四章
第五章
參考文獻

推文
推薦
評分
引用網址
轉寄

top

詳目顯示

相關論文