附加限制條件的最長共同子序列問題之演算法設計

本篇論文探討數個最長共同子序列的變異問題，是由分子生物學和序列比對的實際應用與理論興趣發展而來。在論文的第一部份，我們研究四個附加條件限制的最長共同子序列問題，目的是在兩條序列之共同子序列中，求得包含或排除一條附加限制字串為子序列或子字串之最長序列。我們研究這些問題的最佳化特性，並針對每個問題提出一個動態規劃演算法。理論分析顯示，我們提出的演算法之時間複雜度與兩條序列及一條附加限制字串的長度乘積成正比。此外，我們也考慮任意多條附加限制字串的情況。為了使序列的相似度衡量更有彈性，在論文的第二部份，我們研究一個混合附加條件限制的問題，目的是在兩條序列之共同子序列中，求得包含一條附加限制字串為子序列且排除另一條附加限制字串為子序列之最長序列。我們提出一個動態規劃演算法來解決這個問題，演算法所需的時間與兩條序列及兩條附加限制字串的長度乘積成正比。另外，我們提出一個針對兩條序列配對位置做計算的快速演算法。在論文最後一個部份，我們在最長共同子序列問題與一個附加條件限制的最長共同子序列問題上考慮一個被廣泛使用的資料壓縮技術，稱為區段長度編碼法。為了解決以區段長度編碼的序列之最長共同子序列問題，我們研究序列以區段分割動態規劃矩陣的特性，並利用簡化矩陣內部份計算來求得最長共同子序列的長度。最後，我們設計兩個演算法，在兩條以區段長度編碼的序列之共同子序列中，求得包含一條以區段長度編碼的附加限制字串為子序列之最長序列。

關鍵字

動態規劃；最長共同子序列；附加限制條件之最長共同子序列；區段長度編碼

並列摘要

This dissertation studies several variants of the longest common subsequence (abbreviated LCS) problem. These variants arise from some applications and theoretical interests in molecular biology and sequence comparison. In the First part of this dissertation, we study four constrained LCS (abbreviated CLCS) problems, each of which is to find a longest sequence that is a common subsequence of two sequences and either includes or excludes a constrained pattern as a subsequence or substring. We investigate the optimality principles of these problems and then derive a dynamic programming algorithm for each problem. The theoretical analyses show that the time complexity of each algorithm is proportional to the product of the lengths of the given sequences and constrained pattern. We also consider the case where the number of constrained patterns in each problem is arbitrary. To make the similarity measurement of sequences more flexible, in the second part of this dissertation, we study the problem of finding a longest sequence that is a common subsequence of two sequences and not merely includes a constrained pattern as a subsequence but excludes the other constrained pattern as a subsequence. We give a dynamic programming algorithm whose time complexity is proportional to the product of the lengths of the given sequences and constrained patterns. We also present a fast algorithm which restricts the computation on the positions of matches between the sequences. In the last part of this dissertation, we consider a common used data compression scheme called run-length encoding (abbreviated RLE) on the input sequences of the LCS problem and one of the CLCS problems. To solve the LCS problem of two RLE sequences, we investigate the properties of the partition, induced by the runs of two sequences, in the dynamic programming matrix for the LCS problem and exploit the sequences for computing the length of an LCS by utilizing the simplicity of some positions. Finally, we devise two algorithms for the problem of finding a longest sequence that is a common subsequence of two RLE sequences and includes a constrained RLE pattern as a subsequence.

並列關鍵字

Dynamic Programming ； Longest Common Subsequence ； Constrained Longest Common Subsequence ； Run-Length Encoding

參考文獻

[1] A.V. Aho, D.S. Hirschberg, and J.D. Ullman. Bounds on the complexity of the longest common subsequence problem. Journal of ACM, 23:1-12, 1976.

[3] S.F. Altschul, T.L. Madden, A.A. SchaRer, J. Zhang, Z. Zhang, W. Miller, and D.J. Lipman. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research, 25(17):3389-3402, 1997.

[4] C.E.R. Alves, N. Cáceres, and S.W. Song. An all-substrings common subsequence algorithm. Discrete Applied Mathematics, 156(7):1025-1035, 2008.

[5] A. Amir and G. Benson. Efficient two-dimensional compressed matching. In Proceeding of the 2nd IEEE Data Compression Conference (DCC'92), pp. 279-288, 1992.

[6] A. Amir, G, Benson, and M. Farach. An alphabet independent approach to two dimensional pattern matching. SIAM Journal on Computing, 23(2):313-323, 1994.

國際替代計量

附加限制條件的最長共同子序列問題之演算法設計

主題瀏覽