Computing the Similarity of Two Sequences with Nested Arc Annotations
- Theoretical Computer Science, 2003
Cited by 22 (3 self)
We present exact algorithms for the NP-complete Longest Common Subsequence problem for sequences with nested arc annotations, a problem occurring in structure comparison of RNA. Given two sequences of length at most n and nested arc structure, one of our algorithms determines (if existent) in $O(3.31^{k_1+k_2} n)$ time an arc-preserving subsequence of both sequences, which can be obtained by deleting (together with corresponding arcs) $k_1$ letters from the first and $k_2$ letters from the second sequence. A second algorithm shows that (in case of a four letter alphabet) we can find a length l arc-annotated subsequence in O(12 n) time. This means that the problem is fixed-parameter tractable when parameterized by the number of deletions as well as when parameterized by the subsequence length. Our findings complement known approximation results which give a quadratic time factor-2-approximation for the general and polynomial time approximation schemes for restricted versions of the problem. In addition, we obtain further fixed-parameter tractability results for these restricted versions.
Bounding the Expected Length of Longest Common Subsequences and Forests
- Proc. of WSP'96, 1999
Cited by 21 (0 self)
We present two techniques to find lower and upper bounds for the expected length of longest common subsequences and forests of two random sequences of the same length, over a fixed size, uniformly distributed alphabet. We emphasize the power of the methods used, which are Markov chains and Kolmogorov complexity. As a corollary, we obtain some new lower and upper bounds for the problems mentioned.
1 Introduction
The longest common subsequence (LCS) of two strings is one of the main problems in combinatorial pattern matching. The LCS problem is related to DNA or protein alignments, file comparison, speech recognition, etc. We say that x is a subsequence of u if we can obtain x by deleting zero or more characters of u. The LCS of two strings u and v of length n is defined as the longest subsequence x common to u and v. For example, the LCS of "longest" and "large" is "lge". An open problem related to the LCS is its expected length for two random strings of length n over a uniformly distrib...
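The subsequence definition in this abstract can be illustrated with the textbook dynamic program for two strings. This is a minimal sketch and not a method from the paper, which concerns bounds on expected length; the function name `lcs` is a hypothetical choice.

```python
def lcs(u: str, v: str) -> str:
    """Return one longest common subsequence of u and v (quadratic-time DP)."""
    m, n = len(u), len(v)
    # table[i][j] holds the LCS length of the suffixes u[i:] and v[j:].
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if u[i] == v[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    # Walk the table from the top-left corner to recover one optimal subsequence.
    out = []
    i = j = 0
    while i < m and j < n:
        if u[i] == v[j]:
            out.append(u[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return "".join(out)


print(lcs("longest", "large"))  # prints "lge", the example from the abstract
```

Deleting "o", "n", "s", "t" from "longest" and "a", "r" from "large" leaves "lge" in both cases, which matches the definition by deletion.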
Towards Optimally Solving the Longest Common Subsequence Problem for Sequences with Nested Arc Annotations in Linear Time
- In Proc. of the 13th Symposium on Combinatorial Pattern Matching (CPM02), volume 2373 of LNCS, 2002
Cited by 12 (1 self)
We present exact algorithms for the NP-complete Longest Common Subsequence problem for sequences with nested arc annotations, a problem occurring in structure comparison of RNA. Given two sequences of length at most n and nested arc structure, our algorithm determines (if existent) in time $O(3.31^{k_1+k_2} n)$ an arc-preserving subsequence of both sequences, which can be obtained by deleting (together with corresponding arcs) $k_1$ letters from the first and $k_2$ letters from the second sequence. Thus, the problem is fixed-parameter tractable when parameterized by the number of deletions. This complements known approximation results which give a quadratic time factor-2-approximation for the general and polynomial time approximation schemes for restricted versions of the problem. In addition, we obtain further fixed-parameter tractability results for these restricted versions.
Experimenting an Approximation Algorithm for the LCS
- Discrete Applied Mathematics, 1998
Cited by 10 (0 self)
The problem of finding the longest common subsequence (lcs) of a given set of sequences over an alphabet $\Sigma$ occurs in many interesting contexts, such as data compression and molecular biology, in order to measure the "similarity degree" among biological sequences. Since the problem is NP-complete in its decision version (i.e. does there exist a lcs of length at least k, for a given k?) even over fixed alphabet, polynomial algorithms which give approximate solutions have been proposed. Among them, Long Run (LR) is the only one with guaranteed constant performance ratio.
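The abstract names Long Run but does not describe it. The sketch below assumes the usual description of LR: output the longest common subsequence that consists of a single repeated letter. The function name `long_run` and the tie-breaking by sorted letter order are hypothetical choices made for illustration.

```python
from collections import Counter


def long_run(sequences):
    """Return a^k with k maximal such that a^k is a subsequence of every input."""
    if not sequences:
        return ""
    counts = [Counter(s) for s in sequences]
    best_symbol, best_len = "", 0
    # Only letters of the first sequence can occur in all sequences.
    for symbol in sorted(set(sequences[0])):
        # a^k is a common subsequence exactly when every sequence has >= k copies of a.
        run = min(c[symbol] for c in counts)
        if run > best_len:
            best_symbol, best_len = symbol, run
    return best_symbol * best_len


print(long_run(["abcabca", "aabbaa", "baaac"]))  # prints "aaa"
```

Under this assumption the constant ratio follows from a pigeonhole argument: some letter occurs at least L/|Σ| times in an lcs of length L, so the returned run is at most a factor |Σ| shorter than the optimum.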
A coarse-grained multicomputer algorithm for the longest repeated suffix ending at each point in a word
- Proc. 2003 Internat. Conf. on Computational Science and its Application (ICCSA’03), 2003
Cited by 8 (1 self)
The paper presents a Coarse-Grained Multicomputer (CGM) algorithm that solves the problem of detecting repetitions. This algorithm can be implemented in the CGM model with P processors in $O(N^2/P)$ time and $O(P)$ communication steps. It is the first CGM algorithm for this problem. We also present experimental results showing that the CGM algorithm is very efficient.
New Algorithms for the Longest Common Subsequence Problem
- 1994
Cited by 8 (0 self)
Given two sequences $A = a_1 a_2 \ldots a_m$ and $B = b_1 b_2 \ldots b_n$, $m \le n$, over some alphabet $\Sigma$, a common subsequence $C = c_1 c_2 \ldots c_l$ of A and B is a sequence that can be obtained from both A and B by deleting zero or more (not necessarily adjacent) symbols. Finding a common subsequence of maximal length is called the Longest Common Subsequence (LCS) Problem. Two new algorithms based on the well-known paradigm of computing minimal matches are presented. One runs in time $O(ns + \min\{ds, pm\})$ and the other runs in time $O(ns + \min\{p(n-p), pm\})$ where $s = |\Sigma|$ is the alphabet size, p is the length of a longest common subsequence and d is the number of minimal matches. The ns term is charged by a standard preprocessing phase. When $m \ll n$ both algorithms are fast in situations when a LCS is expected to be short as well as in situations when a LCS is expected to be long. Further they show a much smaller degeneration in intermediate situations, especially the second al...
Multi-column substring matching for database schema translation
- In VLDB, 2006
Cited by 7 (0 self)
We describe a method for discovering complex schema translations involving substrings from multiple database columns. The method does not require a training set of instances linked across databases and it is capable of dealing with both fixed- and variable-length field columns. We propose an iterative algorithm that deduces the correct sequence of concatenations of column substrings in order to translate from one database to another. We introduce the algorithm along with examples on common database data values and examine its performance on real-world and synthetic datasets.
Speeding-up Hirschberg and Hunt-Szymanski LCS Algorithms
- 2003
Cited by 6 (0 self)
Two algorithms are presented that solve the problem of recovering the longest common subsequence of two strings. The first algorithm is an improvement of Hirschberg’s divide-and-conquer algorithm. The second algorithm is an improvement of the Hunt-Szymanski algorithm based on an efficient computation of all dominant match points. These two algorithms use bit-vector operations and are shown to work very efficiently in practice.
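To give a feel for what bit-vector operations buy in LCS computation, here is a minimal sketch of a well-known bit-parallel recurrence for the LCS length. It is not either of the two algorithms in this paper: it computes only the length, it does not recover the subsequence, and the function name `llcs_bitparallel` is hypothetical. Python integers stand in for machine-word bit vectors.

```python
def llcs_bitparallel(a: str, b: str) -> int:
    """Length of an LCS of a and b, processing one column of the DP per character of b."""
    m = len(a)
    if m == 0:
        return 0
    mask = (1 << m) - 1  # keeps the vector at exactly m bits
    # match[c] has bit i set exactly when a[i] == c.
    match = {}
    for i, c in enumerate(a):
        match[c] = match.get(c, 0) | (1 << i)
    v = mask  # all ones: no matches recorded yet
    for c in b:
        u = v & match.get(c, 0)
        # One addition and one subtraction update the whole column at once.
        v = ((v + u) | (v - u)) & mask
    # Each zero bit among the low m bits of v contributes one to the LCS length.
    return m - bin(v).count("1")


print(llcs_bitparallel("longest", "large"))  # prints 3
```

Each character of `b` costs a constant number of word operations per machine word of `a`, which is where the practical speed-up over the cell-by-cell table comes from.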
Common Subsequences and Supersequences and Their Expected Length
- 1995
Cited by 6 (0 self)
Let $f(n, k, l)$ be the expected length of a longest common subsequence of l sequences of length n over an alphabet of size k. It is known that there are constants $\gamma_k^{(l)}$ such that $f(n, k, l) \to \gamma_k^{(l)} n$; we show that $\gamma_k^{(l)} = \Theta(k^{1/l-1})$. Bounds for the corresponding constants for the expected length of a shortest common supersequence are also presented.
1 Introduction and preliminaries
To find the expected length of a longest common subsequence of two sequences is a standard problem studied in the literature [7, 8]. In this paper we shall concentrate on the expected length of a longest common subsequence of several sequences. We show that this expected length for l sequences of length n over an alphabet of size k is $\Theta(n / k^{1-1/l})$ for $n \gg k \gg l$. We also consider a dual case, the expected length of a shortest common supersequence. Let $\Sigma = \{0, 1, \ldots, k-1\}$ be a fixed alphabet of size k. Let $\Sigma^*$ be the set of all strings over \...
Large deviation based upper bounds for the lcs-problem
- Submitted, 2004
Cited by 6 (5 self)
Let $X := (X_1, \ldots, X_n)$ and $Y := (Y_1, \ldots, Y_n)$ be two finite sequences. Let $L_n$ designate the length of the longest sequence which occurs as a subsequence of X as well as of Y. We analyze and apply a large deviation and Monte Carlo simulation based method for the computation of improved upper bounds on the Chvátal-Sankoff constant $\gamma$, which is defined by the limit $\gamma = \lim_{n \to \infty} E[L_n]/n$ when X and Y are random sequences with i.i.d. entries. Our theoretical results show that this method converges to the exact value of $\gamma$ when a control parameter m converges to infinity. We also give upper bounds on the complexity for numerically computing $\gamma$ to any given precision via this method. Our numerical experiments confirm the theory and allow us to give new upper bounds that are correct to two digits. AMS Classification: primary 05A16, 62F10; secondary 92E10.
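The quantity $E[L_n]/n$ in the definition of $\gamma$ can be approximated for a fixed n by plain simulation. This is a minimal sketch of that naive averaging only, assuming uniform i.i.d. letters; it is not the paper's large deviation method and it yields no rigorous bound. The names `lcs_length` and `estimate_ratio` and all default parameters are hypothetical.

```python
import random


def lcs_length(x, y):
    """LCS length of two sequences using a rolling row of the standard DP table."""
    prev = [0] * (len(y) + 1)
    for xi in x:
        cur = [0]
        for j, yj in enumerate(y, start=1):
            cur.append(prev[j - 1] + 1 if xi == yj else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def estimate_ratio(n, alphabet_size=2, trials=20, seed=0):
    """Average L_n / n over independent pairs of uniform random sequences."""
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    total = 0
    for _ in range(trials):
        x = [rng.randrange(alphabet_size) for _ in range(n)]
        y = [rng.randrange(alphabet_size) for _ in range(n)]
        total += lcs_length(x, y)
    return total / (trials * n)


print(estimate_ratio(200))
```

Such an average estimates $E[L_n]/n$ for one finite n, so it only approximates the limit, and turning simulation output into provable upper bounds is the harder task the abstract addresses.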