文件特徵化問題的改進演算法

在本論文中，我們研究了由 Amir et al. 所提出來的文件特徵化索引問題 (text fingerprinting indexing problem) 。令 S 為一個由有限且有序的字母表 Sigma 所形成的字串。對任何 S 的子字串 S' ，所有在 S' 中出現的字元所形成的集合稱為它的特徵 (fingerprint) 。文件特徵化索引問題需要對 S 先建立一個資料結構，使得之後給定任何的集合C subseteq Sigma ，我們可以很有效率的回答以下兩種查詢 (query) ：(1) 回答 C 是不是代表了 S 的某一個子字串的特徵；(2) 找出所有特徵為 C 的最大子字串。目前為止最好的結果可以在 (|Sigma|) 以及 (|Sigma| + K) 分別回答這兩種查詢，其中 K 代表了最大子字串的個數。在本論文中，我對這個問題提出了兩個改進演算法。第一個演算法可以分別在 O(min{|C| log n, |Sigma|}) 和 O(min{|C| log n, |Sigma|} + K) 的時間內完成兩種不同的查詢。第二個改進演算法用了一個更省空間的方法，並且可以分別在 O(|C| log (|Sigma|/|C|)) 以及 O(|C| log (|Sigma|/|C|) + K) 的時間回答兩種查詢。我們的兩種改進演算法都解決了 Amir et al. 所提出來的open problem。

關鍵字

演算法；字串比對；特徵化；文字索引

並列摘要

In this dissertation, we study the text fingerprinting indexing problem, which was introduced by Amir et al. Let S be a string over a finite, ordered alphabet Sigma. For any substring S' of S, the set of distinct characters contained in S' is called its fingerprint. The text fingerprinting indexing problem consists of constructing a data structure for the string S in advance, so that on given any input set C subseteq Sigma of characters, we can answer the following queries efficiently: (1) determine if C represents a fingerprint of some substrings in S; (2) find all maximal substrings of S whose fingerprint is equal to C. The best results known so far solved these two queries in (|Sigma|) and (|Sigma| + K) time, respectively, where K is the number of maximal substrings. In this dissertation, we propose two improved algorithms for the text fingerprinting indexing problem. The first algorithm solves the two queries in O(min{|C| log n, |Sigma|}) and O(min{|C| log n, |Sigma|} + K) time, respectively. The second algorithm solves them in O(|C| log (|Sigma|/|C|)) and O(|C| log (|Sigma|/|C|) + K) time, respectively, by using a new data structure with less storage than the existing solutions. Both results answer an open problem proposed by Amir et al.

並列關鍵字

algorithms ； string matching ； fingerprinting ； text indexing

參考文獻

[1] Abrahamson, K., "Generalized string matching," SIAM Journal on Computing, vol. 16, no. 6, 1039–1051, 1987.

[2] Aho, A. V. and Corasick, M. J., "Efficient string matching: an aid to bibliographic search," Communications of the ACM, vol. 18, no. 6, 333–340, 1975.

[3] Amir, A., Aumann, Y., Benson, G., Levy, A., Lipsky, O., Porat, E., Skiena, S., and Vishne, U., "Pattern matching with address errors: Rearrangement distances," Journal of Computer and System Sciences, doi: 10.1016/j.jcss.2009.03.001, 2009.

[6] Amir, A., Butman, A., Crochemore, M., Landau, G. M., and Schaps, M., "Two-dimensional pattern matching with rotations," Theoretical Computer Science, vol. 314, no. 1–2, 173–187, 2004.

[7] Amir, A., Butman, A., and Lewenstein, M., "Real scaled matching," Information Processing Letters, vol. 70, 185–190, 1999.

國際替代計量

文件特徵化問題的改進演算法

全文下載

主題瀏覽