Results 1-10 of 10
Linear Approximation of Shortest Superstrings
1991
Cited by 76 (5 self)
We consider the following problem: given a collection of strings $s_1, \ldots, s_m$, find the shortest string $s$ such that each $s_i$ appears as a substring (a consecutive block) of $s$. Although this problem is known to be NP-hard, a simple greedy procedure appears to do quite well and is routinely used in DNA sequencing and data compression practice, namely: repeatedly merge the pair of distinct strings with maximum overlap until only one string remains. Let $n$ denote the length of the optimal superstring. A common conjecture states that the above greedy procedure produces a superstring of length $O(n)$ (in fact, $2n$), yet the only previous nontrivial bound known for any polynomial-time algorithm is a recent $O(n \log n)$ result. We show that the greedy algorithm does in fact achieve a constant factor approximation, proving an upper bound of $4n$. Furthermore, we present a simple modified version of the greedy algorithm that we show produces a superstring of length at most $3n$. We also show the sup...
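The greedy procedure described in this abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `overlap` and `greedy_superstring` are assumptions, ties between equally good pairs are broken by scan order, and the quadratic pair search is chosen for readability rather than speed.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(strings):
    """Repeatedly merge the pair of distinct strings with maximum overlap."""
    # Assumption: duplicates and strings contained in another input are
    # dropped first, since any superstring of the rest already covers them.
    items = list(dict.fromkeys(strings))
    items = [s for s in items if not any(s != t and s in t for t in items)]

    while len(items) > 1:
        best_k, best_i, best_j = -1, 0, 1
        for i, a in enumerate(items):
            for j, b in enumerate(items):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        # Merge the best pair: keep items[best_i] whole, append the
        # non-overlapping tail of items[best_j].
        merged = items[best_i] + items[best_j][best_k:]
        items = [s for idx, s in enumerate(items) if idx not in (best_i, best_j)]
        items.append(merged)

    return items[0] if items else ""
```

For example, `greedy_superstring(["abc", "bcd", "cde"])` first merges "abc" and "bcd" (overlap 2) and then appends the tail of "cde". The result is a superstring of every input, but in general it is not guaranteed to be a shortest one.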
Combinatorial algorithms for DNA sequence assembly
Algorithmica, 1993
Cited by 42 (3 self)
The trend towards very large DNA sequencing projects, such as those being undertaken as part of the human genome initiative, necessitates the development of efficient and precise algorithms for assembling a long DNA sequence from the fragments obtained by shotgun sequencing or other methods. The sequence reconstruction problem that we take as our formulation of DNA sequence assembly is a variation of the shortest common superstring problem, complicated by the presence of sequencing errors and reverse complements of fragments. Since the simpler superstring problem is NP-hard, any efficient reconstruction procedure must resort to heuristics. In this paper, however, a four-phase approach based on rigorous design criteria is presented, and has been found to be very accurate in practice. Our method is robust in the sense that it can accommodate high sequencing error rates and list a series of alternate solutions in the event that several appear equally good. Moreover, it uses a limited form ...
Toward Simplifying and Accurately Formulating Fragment Assembly
Journal of Computational Biology, 1995
Cited by 37 (1 self)
The fragment assembly problem is that of reconstructing a DNA sequence from a collection of randomly sampled fragments. Traditionally the objective of this problem has been to produce the shortest string that contains all the fragments as substrings, but in the case of repetitive target sequences this objective produces answers that are overcompressed. In this paper, the problem is reformulated as one of finding a maximum-likelihood reconstruction with respect to the two-sided Kolmogorov-Smirnov statistic, and it is argued that this is a better formulation of the problem. Next, the fragment assembly problem is recast in graph-theoretic terms as one of finding a noncyclic subgraph with certain properties, and the objectives of being shortest or maximally likely are also recast in this framework. Finally, a series of graph reduction transformations is given that dramatically reduces the size of the graph to be explored in practical instances of the problem. This reduction is ...
Reconstructing Strings from Substrings
Journal of Computational Biology, 1993
Cited by 30 (2 self)
In this paper, we consider a variety of problems with application to sequencing by hybridization. First, we develop a theory of interactive sequencing by hybridization, based on ...
Rotation of Periodic Strings and Short Superstrings
1996
Cited by 26 (0 self)
This paper presents two simple approximation algorithms for the shortest superstring problem, with approximation ratios $2\frac{2}{3}$ ($\approx 2.67$) and $2\frac{25}{42}$ ($\approx 2.596$), improving the best previously published $2\frac{3}{4}$ approximation. The framework of our improved algorithms is similar to that of previous algorithms in the sense that they construct a superstring by computing some optimal cycle covers on the distance graph of the given strings, and then break and merge the cycles to finally obtain a Hamiltonian path, but we make use of new bounds on the overlap between two strings. We prove that for each periodic semi-infinite string $\alpha = a_1 a_2 \cdots$ of period $q$, there exists an integer $k$ such that for any (finite) string $s$ of period $p$ which is inequivalent to $\alpha$, the overlap between $s$ and the rotation $\alpha[k] = a_k a_{k+1} \cdots$ is at most $p + \frac{1}{2} q$. Moreover, if $p \le q$, then the overlap between $s$ and $\alpha[k]$ is not larger than $\frac{2}{3}(p + q)$. In the previous shortes...
On the Learning of Rule Uncertainties and their Integration into Probabilistic Knowledge Bases
1993
Cited by 8 (4 self)
We present a natural and realistic knowledge acquisition and processing scenario. In the first phase a domain expert identifies deduction rules that he thinks are good indicators of whether a specific target concept is likely to occur. In a second knowledge acquisition phase, a learning algorithm automatically adjusts, corrects and optimizes the deterministic rule hypothesis given by the domain expert by selecting an appropriate subset of the rule hypothesis and by attaching uncertainties to them. Then, in the running phase of the knowledge base, we can arbitrarily combine the learned uncertainties of the rules with uncertain factual information. Formally, we introduce the natural class of disjunctive probabilistic concepts and prove that this class is efficiently distribution-free learnable. The distribution-free learning model of probabilistic concepts was introduced by Kearns and Schapire and generalizes Valiant's probably approximately correct learning model. We show how to simulate...
Greedy Algorithms For The Shortest Common Superstring That Are Asymptotically Optimal
1997
Cited by 5 (4 self)
There has recently been a resurgence of interest in the shortest common superstring problem due to its important applications in molecular biology (e.g., recombination of DNA) and data compression. The problem is NP-hard, but it has been known for some time that greedy algorithms work well for this problem. More precisely, it was proved in a recent sequence of papers that in the worst case a greedy algorithm produces a superstring that is at most $\beta$ times ($2 \le \beta \le 4$) worse than optimal. We analyze the problem in a probabilistic framework, and consider the optimal total overlap $O_n^{\mathrm{opt}}$ and the overlap $O_n^{\mathrm{gr}}$ produced by various greedy algorithms. These turn out to be asymptotically equivalent. We show that with high probability $\lim_{n \to \infty} \frac{O_n^{\mathrm{opt}}}{n \log n} = \lim_{n \to \infty} \frac{O_n^{\mathrm{gr}}}{n \log n} = \frac{1}{H}$, where $n$ is the number of original strings and $H$ is the entropy of the underlying alphabet. Our results hold under a condition that the lengths of all strings are not too short.
Shortest Consistent Superstrings Computable In Polynomial Time
Comput. Sci, 1995
Cited by 3 (1 self)
The shortest consistent superstring problem is, given a set of positive strings and a set of negative strings, finding a shortest string including every positive string and no negative string as a substring. This problem is NP-hard and arises in DNA sequencing by hybridization. It is also an extension of the well-known shortest common superstring problem, which corresponds to the case when the set of negative strings is empty. In this paper we show that a shortest consistent superstring can be found in polynomial time if (i) a longest common non-superstring for the set of negative strings exists or (ii) the number of positive strings is bounded and every symbol of the alphabet appears at the end of some negative string. In the case (i) a longest consistent superstring can also be found in polynomial time.
1. Introduction
Jiang and Li [JL92, JL93A] were perhaps the first to pose the shortest consistent superstring problem: Given a set P of positive strings and a set N of negat...
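To make the definition of a consistent superstring concrete, here is a small Python sketch. It is only an illustration of the problem statement: `is_consistent_superstring` and `shortest_consistent_superstring` are hypothetical names, and the search is plain exponential brute force over a caller-supplied alphabet and length bound, not the polynomial-time method the paper describes.

```python
from itertools import product


def is_consistent_superstring(s, positives, negatives):
    """True if `s` contains every positive string and no negative string."""
    return all(p in s for p in positives) and not any(n in s for n in negatives)


def shortest_consistent_superstring(positives, negatives, alphabet, max_len):
    """Brute-force search by increasing length (assumed bound: max_len).

    Returns the first consistent superstring found at the smallest length,
    or None if none exists within the bound.
    """
    for length in range(max_len + 1):
        for chars in product(alphabet, repeat=length):
            candidate = "".join(chars)
            if is_consistent_superstring(candidate, positives, negatives):
                return candidate
    return None
```

With positives `["ab", "ba"]` and negatives `["aba"]` over the alphabet `"ab"`, the candidate "aba" is rejected because it contains a negative string, and the search settles on "bab" instead.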
Sharpening Occam's Razor
Inf. Process. Lett, 2003
Cited by 3 (0 self)
this paper) then better "iff Occam style" characterizations of polynomial time learnability/predictability can be given. They rely on Schapire's result that "weak learnability" equals "strong learnability" in polynomial time [13] exploited in [9]. For a recent survey of the important related "boosting" technique see [14]
Improved Inapproximability Results for the Shortest Superstring and Related Problems
Cited by 1 (1 self)
We develop a new method for proving explicit approximation lower bounds for the Shortest Superstring problem, the Maximum Compression problem, the Maximum Asymmetric TSP problem, the (1,2)-ATSP problem and the (1,2)-TSP problem, improving on the best known approximation lower bounds for those problems.