This thesis studies the dictionary matching problem as well as several related problems: the approximate dictionary matching problem, the circular dictionary matching problem, and the text indexing with wildcards problem. Given a set $\mathcal{D}$ of $d$ patterns of total length $n$, the goal of the dictionary matching problem is to build a data structure for $\mathcal{D}$ such that, for any query text $T$, the data structure can quickly report which patterns of $\mathcal{D}$ occur in $T$. This problem can be solved with the well-known Aho-Corasick automaton; however, the space used by the Aho-Corasick automaton is far from the minimum required. In 2008, Hon et al. designed a compressed-space data structure for dictionary matching that requires only $nH_k(\mathcal{D})+o(n\log\sigma)$ bits and answers a query text $T$ in $O(|T|(\log^{\epsilon} n+\log d)+occ)$ time, where $\epsilon>0$. Later, in 2010, Belazzougui designed another space-efficient data structure that requires only $n\log\sigma+O(n)$ bits and answers a query text $T$ in optimal time. In this thesis, we design a data structure that requires only $nH_k(\mathcal{D})+O(n)$ bits while still answering a query text $T$ in optimal time; our approach applies the XBW compression scheme to further reduce the space of Belazzougui's data structure.

For the approximate dictionary matching problem, we consider the case in which one error is allowed: if a substring of $T$ is within edit distance one of some pattern $P$ in $\mathcal{D}$, our data structure must report that $P$ occurs in $T$. The best-known data structures for this problem are the one designed by Cole et al. in 2004, which requires $O(n+d\log d)$ words and answers a query text $T$ in $O(|T|\log{d}\log\log{d}+occ)$ time, and the one designed by Ferragina et al. in 1999, which requires $O(n^{1+\epsilon})$ words and answers a query text $T$ in $O(|T|\log\log n+occ)$ time. However, no compressed-space data structure has been designed for this problem so far; in this thesis we propose the first compressed-space data structure for approximate dictionary matching with one error allowed.

A circular pattern is a pattern whose every cyclic shift is also a valid pattern; such patterns are receiving increasing attention in bioinformatics and computational geometry. In this thesis we design a succinct data structure for the circular dictionary matching problem. Our data structure requires $n\log\sigma(1+o(1))+O(n)+O(d\log n)$ bits and is built using the Burrows-Wheeler Transform.

Sometimes a long text $T$ contains wildcard symbols. The goal of the text indexing with wildcards problem is to build a data structure for such a text $T$ so that, for any query pattern $P$, we can quickly report whether $P$ occurs in $T$. This problem has applications to chromosome sequences containing single-nucleotide polymorphisms (SNPs), since an SNP can be modeled by a wildcard. Recently, Tam et al. in 2009 and Thachuk in 2011 each proposed succinct-space data structures for the text indexing with wildcards problem. In this thesis we show how the dictionary matching problem can help us solve the text indexing with wildcards problem, and we propose the first compressed-space data structure for it, requiring only $nH_h+o(n\log\sigma)+O(d\log n)$ bits.
This thesis studies the dictionary matching problem and its related variations, the {\it approximate} dictionary matching problem and the {\it circular} dictionary matching problem, as well as an application, the {\it text indexing with wildcards} problem. Given a set $D$ of $d$ patterns of total length $n$, the dictionary matching problem is to index $D$ such that, for any query text $T$, we can locate the occurrences of any pattern within $T$ efficiently. This problem can be solved in optimal $O(|T|+occ)$ time by the classical Aho-Corasick automaton, where $occ$ denotes the number of occurrences. The space requirement, however, is $O(n)$ words, which is far from optimal (i.e., succinct or compressed space). When $D$ contains a total of $n$ characters drawn from an alphabet set $\Sigma$ of size~$\sigma$, Hon et al.~(2008) gave an $nH_k(D)+o(n\log\sigma)$-bit index which supports a query in $O(|T|(\log^{\epsilon} n+\log d)+occ)$ time, where $\epsilon>0$ and $H_k(D)$ denotes the $k$th-order empirical entropy of $D$. Recently, Belazzougui~(2010) proposed an elegant scheme, which takes $n\log\sigma+O(n)$ bits of index space and supports a query in optimal time. In this thesis, we provide connections between Belazzougui's index and the XBW compression of Ferragina and Manzini (2005), and show that Belazzougui's index can be slightly modified to be stored in $nH_k(D)+O(n)$ bits while the query time remains optimal; this improves the compressed index by Hon et al.~(2008) in both space and time.

For the {\it approximate} dictionary matching problem, we consider the one-error case rather than the general $k$-error case, where $k$ is a constant larger than~0. In the one-error case, we consider a substring $T[i..j]$ an occurrence of $P$ whenever the edit distance between $T[i..j]$ and $P$ is at most one. For this problem, the best known indexes are by Cole et al. (2004), which requires $O(n+d\log{d})$ words of space and reports all occurrences in $O(|T|\log{d}\log\log{d}+occ)$ time, and by Ferragina et al. (1999), which requires $O(n^{1+\epsilon})$ words of space and reports all occurrences in $O(|T|\log\log n+occ)$ time. Although there have been successes in compressing the dictionary matching index while keeping the query time optimal (as described above), a compressed index for the approximate dictionary matching problem is still open. In this thesis, we propose the first such index, which requires an optimal $nH_k+O(n)+o(n\log\sigma)$-bit index space and supports a query in $O(\sigma|T|\log^3{n}\log\log{n}+occ)$ time.

Circular patterns are those patterns whose cyclic shifts are also valid patterns. These patterns arise naturally in bioinformatics and computational geometry. In this thesis, we consider succinct indexing schemes for a set of $d$ circular patterns of total length~$n$, with each character drawn from an alphabet $\Sigma$ of size $\sigma$. Our succinct index, which needs $n\log\sigma(1+o(1))+O(n)+O(d\log n)$ bits, is based on the popular Burrows-Wheeler transform (BWT) applied to circular patterns, and supports both dictionary matching and pattern matching queries efficiently.

Sometimes the text string $T$ may contain wildcard characters. Suppose $T=T_1\phi^{k_1}T_2\phi^{k_2}\cdots\phi^{k_d}T_{d+1}$ has total length $n$, where the characters of each $T_i$ are drawn from an alphabet $\Sigma$ and $\phi$ denotes a wildcard symbol. The text indexing with wildcards problem is to index $T$ such that, given a query pattern $P$, we can locate the occurrences of $P$ in $T$ efficiently.
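
To make the query semantics of this last problem concrete, the following Python sketch checks a pattern against a wildcard text by a naive scan. It is only an illustration of the problem definition, under the assumption that a single character '?' plays the role of $\phi$; it is not the compressed index developed in this thesis, and all names in it are illustrative.

\begin{verbatim}
# Naive reference for the "text indexing with wildcards" query semantics.
# A wildcard character in the text matches any pattern character.  This is
# an O(|T| * |P|) scan, shown only to make the problem definition concrete;
# the thesis builds a compressed index that answers such queries efficiently.

def wildcard_occurrences(text, pattern, wildcard="?"):
    """Return every position i such that pattern matches text[i:i+len(pattern)]."""
    m = len(pattern)
    positions = []
    for i in range(len(text) - m + 1):
        window = text[i:i + m]
        if all(t == wildcard or t == p for t, p in zip(window, pattern)):
            positions.append(i)
    return positions

if __name__ == "__main__":
    # A tiny text T = T1 phi T2 with one wildcard run of length k1 = 1.
    T = "ACG?TACGT"
    print(wildcard_occurrences(T, "GTT"))   # [2]   (the '?' matches 'T')
    print(wildcard_occurrences(T, "ACGT"))  # [0, 5]
\end{verbatim}
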
The wildcard indexing problem has applications in indexing genomic sequences that contain single-nucleotide polymorphisms (SNPs), since an SNP can be modeled as a wildcard. Recently, Tam et al. (2009) and Thachuk (2011) proposed succinct indexes for this problem. In this thesis, we show how a dictionary matching index can be applied to solve this problem, and we present the first compressed index for it, which takes only $nH_h+o(n\log\sigma)+O(d\log n)$ bits of space, where $H_h$ is the $h$th-order empirical entropy of $T$~($h=o(\log_{\sigma} n)$).
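
As a concrete, uncompressed baseline for the dictionary matching problem that the results above compress, the following Python sketch implements a plain Aho-Corasick automaton with per-node dictionaries: it reports all occurrences in time proportional to $|T|$ plus the number of occurrences, but stores the automaton in $\Theta(n)$ words, which is precisely the space the indexes in this thesis reduce. This is a minimal illustrative sketch, not the construction of the thesis, and all identifiers are illustrative.

\begin{verbatim}
from collections import deque

def build_aho_corasick(patterns):
    # goto[u]: outgoing trie/automaton edges of state u (char -> state).
    # out[u]:  indices of patterns that end at state u.
    # fail[u]: failure link of state u (longest proper suffix state).
    goto, out, fail = [{}], [[]], [0]
    for idx, pat in enumerate(patterns):
        u = 0
        for c in pat:
            if c not in goto[u]:
                goto.append({})
                out.append([])
                fail.append(0)
                goto[u][c] = len(goto) - 1
            u = goto[u][c]
        out[u].append(idx)
    # Breadth-first traversal sets failure links and merges output sets.
    queue = deque(goto[0].values())
    while queue:
        u = queue.popleft()
        for c, v in goto[u].items():
            queue.append(v)
            f = fail[u]
            while f and c not in goto[f]:
                f = fail[f]
            fail[v] = goto[f].get(c, 0)
            out[v] += out[fail[v]]
    return goto, out, fail

def dictionary_match(text, patterns):
    """Yield (end_position, pattern_index) for every occurrence in text."""
    goto, out, fail = build_aho_corasick(patterns)
    u = 0
    for i, c in enumerate(text):
        while u and c not in goto[u]:
            u = fail[u]
        u = goto[u].get(c, 0)
        for idx in out[u]:
            yield (i, idx)

if __name__ == "__main__":
    D = ["he", "she", "his", "hers"]
    print(list(dictionary_match("ushers", D)))
    # [(3, 1), (3, 0), (5, 3)] -- "she" and "he" end at 3, "hers" at 5
\end{verbatim}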