Results 1-10 of 25
Linear Approximation of Shortest Superstrings
1991
Cited by 65 (4 self)
We consider the following problem: given a collection of strings s_1, ..., s_m, find the shortest string s such that each s_i appears as a substring (a consecutive block) of s. Although this problem is known to be NP-hard, a simple greedy procedure appears to do quite well and is routinely used in DNA sequencing and data compression practice, namely: repeatedly merge the pair of distinct strings with maximum overlap until only one string remains. Let n denote the length of the optimal superstring. A common conjecture states that the above greedy procedure produces a superstring of length O(n) (in fact, 2n), yet the only previous nontrivial bound known for any polynomial-time algorithm is a recent O(n log n) result. We show that the greedy algorithm does in fact achieve a constant factor approximation, proving an upper bound of 4n. Furthermore, we present a simple modified version of the greedy algorithm that we show produces a superstring of length at most 3n. We also show the sup...
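The greedy procedure described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `overlap` and `greedy_superstring` are hypothetical, ties between equally good pairs are broken by scan order (an assumption), and the quadratic pair search is chosen for clarity rather than speed.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(strings: list[str]) -> str:
    """Repeatedly merge the pair of distinct strings with maximum overlap."""
    # Assumption: duplicates and strings contained in another input are
    # dropped first, since any superstring of the rest already covers them.
    pool = list(dict.fromkeys(strings))
    pool = [s for s in pool if not any(s != t and s in t for t in pool)]

    while len(pool) > 1:
        best_k, best_i, best_j = -1, 0, 1
        for i, a in enumerate(pool):
            for j, b in enumerate(pool):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:  # first pair found wins ties
                        best_k, best_i, best_j = k, i, j
        # Merge the chosen pair: keep `a` whole, append the unmatched tail of `b`.
        merged = pool[best_i] + pool[best_j][best_k:]
        pool = [s for idx, s in enumerate(pool) if idx not in (best_i, best_j)]
        pool.append(merged)

    return pool[0] if pool else ""


if __name__ == "__main__":
    print(greedy_superstring(["abc", "bcd", "cde"]))
```

On the inputs cabab, baba, ababc the sketch first merges cabab with ababc (overlap 4) and ends with a superstring of length 10, while cabababc of length 8 also contains all three strings; this is the kind of gap between greedy and optimal that the bounds above quantify.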
Approximation Algorithms for Asymmetric TSP by Decomposing Directed Regular Multigraphs
Journal of the ACM, 2003
Cited by 37 (0 self)
A directed multigraph is said to be d-regular if the indegree and outdegree of every vertex is exactly d. By Hall's theorem one can represent such a multigraph as a combination of at most n^2 cycle covers, each taken with an appropriate multiplicity. We prove that if the d-regular multigraph does not contain more than ⌊d/2⌋ copies of any 2-cycle then we can find a similar decomposition into n^2 pairs of cycle covers where each 2-cycle occurs in at most one component of each pair. Our proof is constructive and gives a polynomial algorithm to find such a decomposition. Since our applications only need one such pair of cycle covers whose weight is at least the average weight of all pairs, we also give an alternative, simpler algorithm to extract a single such pair. This combinatorial theorem then comes in handy in rounding a fractional solution of an LP relaxation of the maximum Traveling Salesman Problem (TSP). The first stage of the rounding procedure obtains two cycle covers that do not share a 2-cycle with weight at least twice the weight of the optimal solution. Then we show how to extract a tour from the two cycle covers, whose weight is at least 2/3 of the weight of the longest tour. This improves upon the previous 5/8 approximation with a simpler algorithm. Utilizing a reduction from maximum TSP to the shortest superstring problem we obtain a 2.5-approximation algorithm for the latter problem which is again much simpler than the previous one. For minimum asymmetric TSP the same technique gives two cycle covers, not sharing a 2-cycle, with weight at most twice the weight of the optimum. Assuming triangle
Combinatorial algorithms for DNA sequence assembly
Algorithmica, 1993
Cited by 33 (3 self)
The trend towards very large DNA sequencing projects, such as those being undertaken as part of the human genome initiative, necessitates the development of efficient and precise algorithms for assembling a long DNA sequence from the fragments obtained by shotgun sequencing or other methods. The sequence reconstruction problem that we take as our formulation of DNA sequence assembly is a variation of the shortest common superstring problem, complicated by the presence of sequencing errors and reverse complements of fragments. Since the simpler superstring problem is NP-hard, any efficient reconstruction procedure must resort to heuristics. In this paper, however, a four-phase approach based on rigorous design criteria is presented, and has been found to be very accurate in practice. Our method is robust in the sense that it can accommodate high sequencing error rates and list a series of alternate solutions in the event that several appear equally good. Moreover, it uses a limited form ...
Toward Simplifying and Accurately Formulating Fragment Assembly
Journal of Computational Biology, 1995
Cited by 30 (1 self)
The fragment assembly problem is that of reconstructing a DNA sequence from a collection of randomly sampled fragments. Traditionally the objective of this problem has been to produce the shortest string that contains all the fragments as substrings, but in the case of repetitive target sequences this objective produces answers that are overcompressed. In this paper, the problem is reformulated as one of finding a maximum-likelihood reconstruction with respect to the 2-sided Kolmogorov-Smirnov statistic, and it is argued that this is a better formulation of the problem. Next, the fragment assembly problem is recast in graph-theoretic terms as one of finding a non-cyclic subgraph with certain properties, and the objectives of being shortest or maximally likely are also recast in this framework. Finally, a series of graph reduction transformations is given that dramatically reduces the size of the graph to be explored in practical instances of the problem. This reduction is ...
Rotation of Periodic Strings and Short Superstrings
1996
Cited by 22 (0 self)
This paper presents two simple approximation algorithms for the shortest superstring problem, with approximation ratios 2 2/3 (≈ 2.67) and 2 25/42 (≈ 2.596), improving the best previously published 2 3/4 approximation. The framework of our improved algorithms is similar to that of previous algorithms in the sense that they construct a superstring by computing some optimal cycle covers on the distance graph of the given strings, and then break and merge the cycles to finally obtain a Hamiltonian path, but we make use of new bounds on the overlap between two strings. We prove that for each periodic semi-infinite string α = a_1 a_2 ··· of period q, there exists an integer k, such that for any (finite) string s of period p which is inequivalent to α, the overlap between s and the rotation α[k] = a_k a_{k+1} ··· is at most p + (1/2)q. Moreover, if p ≤ q, then the overlap between s and α[k] is not larger than (2/3)(p + q). In the previous shortes...
Expected Length of Longest Common Subsequences
Cited by 17 (2 self)
Contents
1 Introduction
2 Notation and preliminaries
  2.1 Notation and basic definitions
  2.2 Longest common subsequences
  2.3 Computing longest common subsequences
  2.4 Expected length of longest common subsequences
3 Lower bounds
  3.1 Css machines
  3.2 Analysis of css machines
  3.3 Design of css machines
  3.4 Labeled css machines
4 Upper bounds
  4.1 Collations
  4.2 Previous upper bounds
  4.3 Simple upper bound (binary alphabet)
  4.4 Simple upper bound (alphabet size 3)
  4.5 Upper bounds for binary alphabet
A 2 2/3-Approximation Algorithm for the Shortest Superstring Problem
In Proc. 7th Symp. on Combinatorial Pattern Matching, Lecture Notes in Computer Science, 1996
Cited by 12 (0 self)
Given a collection of strings S = {s_1, ..., s_n} over an alphabet Σ, a superstring α of S is a string containing each s_i as a substring; that is, for each i, 1 ≤ i ≤ n, α contains a block of |s_i| consecutive characters that match s_i exactly. The shortest superstring problem is the problem of finding a superstring α of minimum length. The shortest superstring problem has applications in both data compression and computational biology. It was shown by Blum et al. [3] to be MAX SNP-hard. The first O(1)-approximation algorithm also appeared in [3], which returns a superstring no more than 3 times the length of an optimal solution. Prior to the algorithm described in this paper, there were several published results that improved on the approximation ratio; of these, the best is our algorithm ShortString, a 2 3/4-approximation [1]. We present our new algorithm, G-ShortString, which achieves a ratio of 2 2/3. Our approach builds on the work in [1], in which we identifi...
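To make the definition above concrete, the following sketch finds a minimum-length superstring by exhaustive search over orderings. It is not ShortString or G-ShortString, whose details are not given here; the function names are hypothetical, and the approach is only usable for a handful of short strings because it tries every permutation.

```python
from itertools import permutations


def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def is_superstring(alpha: str, strings: list[str]) -> bool:
    """True if every s_i occurs in `alpha` as a consecutive block."""
    return all(s in alpha for s in strings)


def shortest_superstring_bruteforce(strings: list[str]) -> str:
    """Exact answer for tiny inputs: try every ordering, merging with
    maximum overlap, and keep the shortest result."""
    # Strings contained in another input cannot change the optimum.
    pool = list(dict.fromkeys(strings))
    pool = [s for s in pool if not any(s != t and s in t for t in pool)]

    best = None
    for order in permutations(pool):
        candidate = ""
        for s in order:
            candidate += s[overlap(candidate, s):]
        if best is None or len(candidate) < len(best):
            best = candidate
    return best if best is not None else ""


if __name__ == "__main__":
    S = ["cabab", "baba", "ababc"]
    alpha = shortest_superstring_bruteforce(S)
    print(alpha, len(alpha), is_superstring(alpha, S))
```

The factorial running time is the reason approximation algorithms with provable ratios are of interest for this problem.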
Parallel and Sequential Approximations of Shortest Superstrings
In Proceedings of Fourth Scandinavian Workshop on Algorithm Theory, 1994
Cited by 12 (1 self)
Superstrings have many applications in data compression and genetics. However, the decision version of the shortest superstring problem is NP-complete. In this paper we examine the complexity of approximating a shortest superstring. There are two basic measures of the approximations: the compression ratio and the approximation ratio. The well-known and practical approximation algorithm is the sequential algorithm GREEDY. It approximates the shortest superstring with the compression ratio of 1/2 and with the approximation ratio of 4. Our main results are: (1) An NC algorithm which achieves the compression ratio of 1/(4+ε). (2) The proof that the algorithm GREEDY is not parallelizable: the computation of its output is P-complete. (3) An improved sequential algorithm: the approximation ratio is reduced to 2.83. Previously it was reduced by Teng and Yao from 3 to 2.89. (4) The design of an RNC algorithm with constant approximation ratio and an NC algorithm with logarithmic approximation ratio.
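The two measures named in this abstract can be illustrated with a small calculation. The sketch assumes the usual definitions: compression is the total input length minus the superstring length, the compression ratio compares the compression achieved with the optimal compression, and the approximation ratio compares the two superstring lengths. The function names and the example lengths are hypothetical.

```python
def compression_ratio(total_input_len: int, produced_len: int, optimal_len: int) -> float:
    """Compression achieved divided by the best possible compression."""
    achieved = total_input_len - produced_len
    best = total_input_len - optimal_len
    if best == 0:
        # Assumption: when no compression is possible at all, treat any
        # valid output as achieving the full (empty) compression.
        return 1.0
    return achieved / best


def approximation_ratio(produced_len: int, optimal_len: int) -> float:
    """Length of the produced superstring divided by the optimal length."""
    return produced_len / optimal_len


if __name__ == "__main__":
    # Hypothetical instance: inputs total 14 characters, a heuristic returns
    # a superstring of length 10, and the optimum has length 8.
    print(compression_ratio(14, 10, 8))
    print(approximation_ratio(10, 8))
```

The two measures are not interchangeable: a guarantee on the compression ratio is a lower bound (larger is better), a guarantee on the approximation ratio is an upper bound (smaller is better), and a bound on one does not by itself give the same bound on the other.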
Coevolving Solutions to the Shortest Common Superstring Problem
2004
Cited by 4 (0 self)
The Shortest Common Superstring (SCS) problem, known to be NP-Complete, seeks the shortest string that contains all strings from a given set. In this paper we compare four approaches for finding solutions to the SCS problem: a standard genetic algorithm, a novel cooperative-coevolutionary algorithm, a benchmark greedy algorithm, and a parallel coevolutionary-greedy approach. We show the coevolutionary approach produces the best results, and discuss directions for future research.
DNA Sequencing and String Learning
Cited by 3 (3 self)
In laboratories, the majority of large-scale DNA sequencing is done following the shotgun strategy, which is to randomly sequence a large number of relatively short fragments and then heuristically find a shortest common superstring of the fragments [26]. We study mathematical frameworks, under plausible assumptions, suitable for massive automated DNA sequencing and for analyzing DNA sequencing algorithms. We model the DNA sequencing problem as learning a string from its randomly drawn substrings. Under certain restrictions, this may be viewed as string learning in Valiant's distribution-free learning model, and in this case we give an efficient learning algorithm and a quantitative bound on how many examples suffice. One major obstacle to our approach turns out to be a well-known open question on how to approximate a shortest common superstring of a set of strings, raised by a number of authors in the last ten years [9, 29, 30]. We give the first provably good algorithm, which approximates a shortest superstring of length n by a superstring of length O(n log n). The algorithm works equally well even in the presence of negative examples, i.e., when merging of some strings is prohibited.

