Linear Approximation of Shortest Superstrings
, 1991
Cited by 65 (4 self)
We consider the following problem: given a collection of strings $s_1, \ldots, s_m$, find the shortest string $s$ such that each $s_i$ appears as a substring (a consecutive block) of $s$. Although this problem is known to be NP-hard, a simple greedy procedure appears to do quite well and is routinely used in DNA sequencing and data compression practice, namely: repeatedly merge the pair of distinct strings with maximum overlap until only one string remains. Let n denote the length of the optimal superstring. A common conjecture states that the above greedy procedure produces a superstring of length O(n) (in fact, 2n), yet the only previous nontrivial bound known for any polynomial-time algorithm is a recent O(n log n) result. We show that the greedy algorithm does in fact achieve a constant factor approximation, proving an upper bound of 4n. Furthermore, we present a simple modified version of the greedy algorithm that we show produces a superstring of length at most 3n. We also show the sup...
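The greedy procedure described in this abstract can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names `overlap` and `greedy_superstring` are hypothetical, ties between equally good pairs are broken by taking the first pair found, and duplicates and strings contained in other strings are discarded up front (a common preprocessing assumption).

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(strings):
    """Repeatedly merge the pair of distinct strings with maximum overlap."""
    # Assumption: drop duplicates and strings already contained in another one.
    items = list(dict.fromkeys(strings))
    items = [s for s in items if not any(s != t and s in t for t in items)]

    while len(items) > 1:
        best = (-1, 0, 1)  # (overlap length, index of left, index of right)
        for i, a in enumerate(items):
            for j, b in enumerate(items):
                if i != j:
                    k = overlap(a, b)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        merged = items[i] + items[j][k:]
        # Replace the two merged strings by their merge.
        items = [s for idx, s in enumerate(items) if idx not in (i, j)]
        items.append(merged)

    return items[0] if items else ""


print(greedy_superstring(["abc", "bcd", "cde"]))  # prints abcde
```

The quadratic pair search is kept for readability; it makes no attempt at the efficiency a practical implementation would need.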
Combinatorial algorithms for DNA sequence assembly
- Algorithmica
, 1993
Cited by 33 (3 self)
The trend towards very large DNA sequencing projects, such as those being undertaken as part of the human genome initiative, necessitates the development of efficient and precise algorithms for assembling a long DNA sequence from the fragments obtained by shotgun sequencing or other methods. The sequence reconstruction problem that we take as our formulation of DNA sequence assembly is a variation of the shortest common superstring problem, complicated by the presence of sequencing errors and reverse complements of fragments. Since the simpler superstring problem is NP-hard, any efficient reconstruction procedure must resort to heuristics. In this paper, however, a four-phase approach based on rigorous design criteria is presented, and has been found to be very accurate in practice. Our method is robust in the sense that it can accommodate high sequencing error rates and list a series of alternate solutions in the event that several appear equally good. Moreover, it uses a limited form ...
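To illustrate why reverse complements complicate the superstring formulation, here is a small Python sketch that compares a fragment against another fragment in both orientations. It is an assumption-laden illustration only: the helper names are hypothetical, overlaps are exact (so sequencing errors, which the paper handles, are ignored), and the alphabet is restricted to A, C, G, T.

```python
# Translation table for complementing bases (assumes only A, C, G, T occur).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(fragment: str) -> str:
    """Complement every base, then reverse the fragment."""
    return fragment.translate(_COMPLEMENT)[::-1]


def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def best_oriented_overlap(a: str, b: str):
    """Return (overlap length, orientation of b) for the better orientation."""
    forward = overlap(a, b)
    reverse = overlap(a, reverse_complement(b))
    if forward >= reverse:
        return forward, "forward"
    return reverse, "reverse"


print(reverse_complement("ATGC"))                # prints GCAT
print(best_oriented_overlap("AACCG", "TTCGG"))   # prints (3, 'reverse')
```

In the second call the fragments share nothing in the forward orientation, but the reverse complement of `TTCGG` is `CCGAA`, which overlaps `AACCG` by three bases.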
Toward Simplifying and Accurately Formulating Fragment Assembly
- JOURNAL OF COMPUTATIONAL BIOLOGY
, 1995
Cited by 30 (1 self)
The fragment assembly problem is that of reconstructing a DNA sequence from a collection of randomly sampled fragments. Traditionally the objective of this problem has been to produce the shortest string that contains all the fragments as substrings, but in the case of repetitive target sequences this objective produces answers that are overcompressed. In this paper, the problem is reformulated as one of finding a maximum-likelihood reconstruction with respect to the 2-sided Kolmogorov-Smirnov statistic, and it is argued that this is a better formulation of the problem. Next the fragment assembly problem is recast in graph-theoretic terms as one of finding a non-cyclic subgraph with certain properties and the objectives of being shortest or maximally-likely are also recast in this framework. Finally, a series of graph reduction transformations are given that dramatically reduce the size of the graph to be explored in practical instances of the problem. This reduction is ...
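For readers unfamiliar with the two-sided Kolmogorov-Smirnov statistic mentioned in this abstract, the following Python sketch computes it for a set of fragment start positions. The comparison against a uniform distribution over the reconstructed sequence is an assumption made here for illustration (it matches the idea of randomly sampled fragments, but the abstract does not spell out the reference distribution), and the function name is hypothetical.

```python
def ks_statistic_uniform(starts, target_length):
    """Two-sided KS statistic of start positions against Uniform(0, target_length).

    Computes D = max over x of |F_n(x) - F(x)|, where F_n is the empirical
    distribution of the starts and F is the uniform distribution (assumed).
    """
    xs = sorted(pos / target_length for pos in starts)
    n = len(xs)
    if n == 0:
        raise ValueError("need at least one start position")
    # At each sorted sample, the empirical CDF jumps from i/n to (i+1)/n,
    # so the largest deviation is attained just before or at a sample point.
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))


# Evenly spread starts deviate little from uniform ...
print(ks_statistic_uniform([25, 50, 75], 100))   # prints 0.25
# ... while starts piled at one position deviate maximally.
print(ks_statistic_uniform([0, 0, 0, 0], 100))   # prints 1.0
```

Under this reading, an overcompressed reconstruction of a repetitive target would pile fragment starts onto too short a sequence and push the statistic up.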
Reconstructing Strings from Substrings
- Journal of Computational Biology
, 1993
Cited by 29 (2 self)
In this paper, we consider a variety of problems with application to sequencing by hybridization. First, we develop a theory of interactive sequencing by hybridization, based on ...
Rotation of Periodic Strings and Short Superstrings
, 1996
Cited by 22 (0 self)
This paper presents two simple approximation algorithms for the shortest superstring problem, with approximation ratios $2\frac{2}{3}$ ($\approx 2.67$) and $2\frac{25}{42}$ ($\approx 2.596$), improving the best previously published $2\frac{3}{4}$ approximation. The framework of our improved algorithms is similar to that of previous algorithms in the sense that they construct a superstring by computing some optimal cycle covers on the distance graph of the given strings, and then break and merge the cycles to finally obtain a Hamiltonian path, but we make use of new bounds on the overlap between two strings. We prove that for each periodic semi-infinite string $\alpha = a_1 a_2 \cdots$ of period $q$, there exists an integer $k$, such that for any (finite) string $s$ of period $p$ which is inequivalent to $\alpha$, the overlap between $s$ and the rotation $\alpha[k] = a_k a_{k+1} \cdots$ is at most $p + \frac{1}{2} q$. Moreover, if $p \le q$, then the overlap between $s$ and $\alpha[k]$ is not larger than $\frac{2}{3}(p+q)$. In the previous shortes...
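The bounds in this abstract are stated in terms of the period of a string and the overlap between two strings. A minimal Python sketch of those quantities for finite strings follows, assuming the standard definitions: the period is the smallest positive shift under which the string matches itself, and `rotate` models a rotation of one period of a periodic string (0-based here, whereas the abstract indexes from 1). The helper names are hypothetical and this is not the paper's algorithm.

```python
def period(s: str) -> int:
    """Smallest p >= 1 with s[i] == s[i + p] for every valid i (0 if s is empty)."""
    for p in range(1, len(s) + 1):
        if all(s[i] == s[i + p] for i in range(len(s) - p)):
            return p
    return 0


def rotate(base: str, k: int) -> str:
    """Rotation of one period `base` that starts at 0-based position k."""
    k %= len(base)
    return base[k:] + base[:k]


def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for m in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:m]):
            return m
    return 0


print(period("abcabcab"))          # prints 3
print(rotate("abc", 1))            # prints bca
print(overlap("abcab", "cabca"))   # prints 3
```

Choosing which rotation of a cycle's periodic string to open is exactly the degree of freedom the abstract's integer $k$ refers to; the sketch only exposes the quantities involved, not how $k$ is selected.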
On the Learning of Rule Uncertainties and their Integration into Probabilistic Knowledge Bases
, 1993
Cited by 7 (4 self)
We present a natural and realistic knowledge acquisition and processing scenario. In the first phase a domain expert identifies deduction rules that he thinks are good indicators of whether a specific target concept is likely to occur. In a second knowledge acquisition phase, a learning algorithm automatically adjusts, corrects and optimizes the deterministic rule hypothesis given by the domain expert by selecting an appropriate subset of the rule hypothesis and by attaching uncertainties to them. Then, in the running phase of the knowledge base we can arbitrarily combine the learned uncertainties of the rules with uncertain factual information. Formally, we introduce the natural class of disjunctive probabilistic concepts and prove that this class is efficiently distribution-free learnable. The distribution-free learning model of probabilistic concepts was introduced by Kearns and Schapire and generalizes Valiant's probably approximately correct learning model. We show how to simulate...
Greedy Algorithms For The Shortest Common Superstring That Are Asymptotically Optimal
, 1997
Cited by 5 (4 self)
There has recently been a resurgence of interest in the shortest common superstring problem due to its important applications in molecular biology (e.g., recombination of DNA) and data compression. The problem is NP-hard, but it has been known for some time that greedy algorithms work well for this problem. More precisely, it was proved in a recent sequence of papers that in the worst case a greedy algorithm produces a superstring that is at most $\beta$ times ($2 \le \beta \le 4$) worse than optimal. We analyze the problem in a probabilistic framework, and consider the optimal total overlap $O_n^{\mathrm{opt}}$ and the overlap $O_n^{\mathrm{gr}}$ produced by various greedy algorithms. These turn out to be asymptotically equivalent. We show that with high probability $\lim_{n \to \infty} \frac{O_n^{\mathrm{opt}}}{n \log n} = \lim_{n \to \infty} \frac{O_n^{\mathrm{gr}}}{n \log n} = \frac{1}{H}$, where $n$ is the number of original strings, and $H$ is the entropy of the underlying alphabet. Our results hold under a condition that the lengths of all strings are not too short.
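As a worked instance of this limit, write $O_n^{\mathrm{opt}}$ and $O_n^{\mathrm{gr}}$ for the optimal and greedy total overlaps, and assume (an illustrative assumption, not a case singled out in the abstract) that symbols are drawn independently and uniformly from a four-letter alphabet such as {A, C, G, T}. Then:

```latex
H = -\sum_{a \in \{A, C, G, T\}} \tfrac{1}{4} \log \tfrac{1}{4} = \log 4,
\qquad
O_n^{\mathrm{opt}} \sim O_n^{\mathrm{gr}} \sim \frac{n \log n}{H}
  = \frac{n \log n}{\log 4} = n \log_4 n .
```

For $n = 1024 = 4^5$ strings this gives a first-order estimate of about $1024 \times 5 = 5120$ overlapping symbols in total, provided the strings are long enough for the result to apply.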
Shortest Consistent Superstrings Computable In Polynomial Time
- Comput. Sci
, 1995
Cited by 3 (1 self)
The shortest consistent superstring problem is, given a set of positive strings and a set of negative strings, finding a shortest string including every positive string and no negative string as a substring. This problem is NP-hard and arises in DNA sequencing by hybridization. It is also an extension of the well-known shortest common superstring problem which corresponds to the case when the set of negative strings is empty. In this paper we show that a shortest consistent superstring can be found in polynomial time if (i) a longest common non-superstring for the set of negative strings exists or (ii) the number of positive strings is bounded and every symbol of the alphabet appears at the end of some negative string. In the case (i) a longest consistent superstring can also be found in polynomial time. 1. Introduction Jiang and Li [JL92, JL93A] were perhaps the first to pose the shortest consistent superstring problem: Given a set P of positive strings and a set N of negat...
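The definition of a consistent superstring can be made concrete with a short Python sketch. The checker follows the definition directly; the search is a deliberately naive exhaustive enumeration, exponential in the length bound, and is not the polynomial-time method of the paper. All function names and the `max_len` cutoff are hypothetical choices made for this illustration.

```python
from itertools import product


def is_consistent(s, positives, negatives):
    """True if s contains every positive string and no negative string."""
    return all(p in s for p in positives) and not any(n in s for n in negatives)


def shortest_consistent_superstring(positives, negatives, alphabet, max_len):
    """Brute force: try all strings over `alphabet` by increasing length.

    Returns the first consistent string found, or None if none exists
    with length at most max_len (an arbitrary cutoff for this sketch).
    """
    symbols = sorted(set(alphabet))
    for length in range(max_len + 1):
        for letters in product(symbols, repeat=length):
            candidate = "".join(letters)
            if is_consistent(candidate, positives, negatives):
                return candidate
    return None


# With no negative strings this is the ordinary shortest common superstring.
print(shortest_consistent_superstring(["ab", "bc"], [], "abc", 5))  # prints abc
# Forbidding "abc" forces a longer answer of length 4.
answer = shortest_consistent_superstring(["ab", "bc"], ["abc"], "abc", 5)
print(len(answer))  # prints 4
```

The second call shows how a single negative string can rule out the ordinary shortest common superstring and lengthen the optimum.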
Sharpening Occam's Razor
- Inf. Process. Lett
, 2003
Cited by 3 (0 self)
this paper) then better "i# Occam style" characterizations of polynomial time learnability/predictability can be given. They rely on Schapire's result that "weak learnability" equals "strong learnability" in polynomial time [13] exploited in [9]. For a recent survey of the important related "boosting" technique see [14]
Improved Inapproximability Results for the Shortest Superstring and Related Problems
We develop a new method for proving explicit approximation lower bounds for the Shortest Superstring problem, the Maximum Compression problem, the Maximum Asymmetric TSP problem, the (1,2)–ATSP problem and the (1,2)–TSP problem, improving on the best known approximation lower bounds for those problems.

