Results 1 - 10 of 33
A Guided Tour to Approximate String Matching
 ACM COMPUTING SURVEYS
, 1999
Cited by 553 (38 self)
We survey the current techniques to cope with the problem of string matching allowing errors. This is becoming a more and more relevant issue for many fast growing areas such as information retrieval and computational biology. We focus on online searching and mostly on edit distance, explaining the problem and its relevance, its statistical behavior, its history and current developments, and the central ideas of the algorithms and their complexities. We present a number of experiments to compare the performance of the different algorithms and show which are the best choices according to each case. We conclude with some future work directions and open problems.
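As an illustration of the online edit-distance search the abstract refers to, here is a minimal Python sketch of the classic column-wise dynamic programming approach. The function name `approx_find` and its return convention (1-based end positions in the text) are assumptions made for this example, not taken from the survey.

```python
def approx_find(pattern: str, text: str, k: int) -> list[int]:
    """Return 1-based end positions in `text` where some substring
    matches `pattern` with at most k edits (insert, delete, substitute)."""
    m = len(pattern)
    # col[i] = min edits to match pattern[:i] against a suffix of the text read so far
    col = list(range(m + 1))
    hits = []
    for j, c in enumerate(text, start=1):
        prev_diag = col[0]  # value of col[i-1] from the previous text position
        # col[0] stays 0 so that a match may start at any text position
        for i in range(1, m + 1):
            old = col[i]
            cost = 0 if pattern[i - 1] == c else 1
            col[i] = min(prev_diag + cost,  # match or substitution
                         old + 1,           # extra text character
                         col[i - 1] + 1)    # missing pattern character
            prev_diag = old
        if col[m] <= k:
            hits.append(j)
    return hits


if __name__ == "__main__":
    print(approx_find("survey", "surgery", 2))
```

The sketch runs in O(mn) time and O(m) space for a pattern of length m and a text of length n; it is meant only to make the problem statement concrete, not to represent the faster algorithms the survey compares.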
Linear Approximation of Shortest Superstrings
, 1991
Cited by 76 (5 self)
We consider the following problem: given a collection of strings $s_1, \ldots, s_m$, find the shortest string $s$ such that each $s_i$ appears as a substring (a consecutive block) of $s$. Although this problem is known to be NP-hard, a simple greedy procedure appears to do quite well and is routinely used in DNA sequencing and data compression practice, namely: repeatedly merge the pair of distinct strings with maximum overlap until only one string remains. Let $n$ denote the length of the optimal superstring. A common conjecture states that the above greedy procedure produces a superstring of length $O(n)$ (in fact, $2n$), yet the only previous nontrivial bound known for any polynomial-time algorithm is a recent $O(n \log n)$ result. We show that the greedy algorithm does in fact achieve a constant factor approximation, proving an upper bound of $4n$. Furthermore, we present a simple modified version of the greedy algorithm that we show produces a superstring of length at most $3n$. We also show the sup...
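The greedy procedure described in this abstract (repeatedly merge the pair of distinct strings with maximum overlap) can be sketched in a few lines of Python. The helper names `overlap` and `greedy_superstring`, the tie-breaking rule (first pair found), and the preprocessing step that drops strings contained in others are assumptions of this sketch, not details given in the abstract.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(strings: list[str]) -> str:
    # Assumed preprocessing: drop duplicates and strings contained in another string.
    items = list(dict.fromkeys(strings))
    items = [s for s in items if not any(s != t and s in t for t in items)]
    while len(items) > 1:
        best = (-1, 0, 1)  # (overlap length, index of left string, index of right string)
        for i, a in enumerate(items):
            for j, b in enumerate(items):
                if i != j:
                    k = overlap(a, b)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        merged = items[i] + items[j][k:]  # glue the pair along its overlap
        items = [s for idx, s in enumerate(items) if idx not in (i, j)]
        items.append(merged)
    return items[0] if items else ""


if __name__ == "__main__":
    print(greedy_superstring(["CATG", "ATGC", "TGCA"]))
```

This naive version recomputes all pairwise overlaps in every round, which is fine for illustration but far slower than practical implementations.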
Approximation Algorithms for Asymmetric TSP by Decomposing Directed Regular Multigraphs
, 2006
Cited by 65 (2 self)
A directed multigraph is said to be d-regular if the indegree and outdegree of every vertex is exactly d. By Hall's theorem one can represent such a multigraph as a combination of at most $n^2$ cycle covers, each taken with an appropriate multiplicity. We prove that if the d-regular multigraph does not contain more than $\lfloor d/2 \rfloor$ copies of any 2-cycle then we can find a similar decomposition into $n^2$ pairs of cycle covers where each 2-cycle occurs in at most one component of each pair. Our proof is constructive and gives a polynomial algorithm to find such a decomposition. Since our applications only need one such pair of cycle covers whose weight is at least the average weight of all pairs, we also give an alternative, simpler algorithm to extract a single such pair. This combinatorial theorem then comes in handy in rounding a fractional solution of an LP relaxation of the maximum Traveling Salesman Problem (TSP). The first stage of the rounding procedure obtains two cycle covers that do not share a 2-cycle, with weight at least twice the weight of the optimal solution. Then we show how to extract a tour from the two cycle covers, whose weight is at least 2/3 of the weight of the longest tour. This improves upon the previous 5/8 approximation with a simpler algorithm. Utilizing a reduction from maximum TSP to the shortest superstring problem we obtain a 2.5-approximation algorithm for the latter problem, which is again much simpler than the previous one. For minimum asymmetric TSP the same technique gives two cycle covers, not sharing a 2-cycle, with weight at most twice the weight of the optimum. Assuming triangle inequality, we then show how to obtain from this pair of cycle covers a tour whose weight is at most $0.842 \log_2 n$ larger than optimal. This improves upon a previous approximation algorithm with approximation guarantee of $0.999 \log_2 n$. Other applications of the rounding procedure are approximation algorithms for maximum 3-cycle cover (factor 2/3, previously 3/5) and maximum
Toward Simplifying and Accurately Formulating Fragment Assembly
 JOURNAL OF COMPUTATIONAL BIOLOGY
, 1995
Cited by 54 (1 self)
The fragment assembly problem is that of reconstructing a DNA sequence from a collection of randomly sampled fragments. Traditionally the objective of this problem has been to produce the shortest string that contains all the fragments as substrings, but in the case of repetitive target sequences this objective produces answers that are overcompressed. In this paper, the problem is reformulated as one of finding a maximum-likelihood reconstruction with respect to the two-sided Kolmogorov-Smirnov statistic, and it is argued that this is a better formulation of the problem. Next the fragment assembly problem is recast in graph-theoretic terms as one of finding a noncyclic subgraph with certain properties, and the objectives of being shortest or maximally likely are also recast in this framework. Finally, a series of graph reduction transformations are given that dramatically reduce the size of the graph to be explored in practical instances of the problem. This reduction is ...
Comparing De Novo Genome Assembly: The Long and Short of It
 PLoS ONE
, 2011
Cited by 34 (1 self)
Recent advances in DNA sequencing technology and their focal role in Genome Wide Association Studies (GWAS) have rekindled a growing interest in the whole-genome sequence assembly (WGSA) problem, thereby inundating the field with a plethora of new formalizations, algorithms, heuristics and implementations. And yet, scant attention has been paid to comparative assessments of these assemblers' quality and accuracy. No commonly accepted and standardized method for comparison exists yet. Even worse, widely used metrics to compare the assembled sequences emphasize only size, poorly capturing the contig quality and accuracy. This paper addresses these concerns: it highlights common anomalies in assembly accuracy through a rigorous study of several assemblers, compared under both standard metrics (N50, coverage, contig sizes, etc.) as well as a more comprehensive metric (Feature-Response Curves, FRC) that is introduced here; FRC transparently captures the trade-offs between contigs' quality and their sizes. For this purpose, most of the publicly available major sequence assemblers – both for low-coverage long (Sanger) and high-coverage short (Illumina) read technologies – are compared. These assemblers are applied to microbial (Escherichia coli, Brucella, Wolbachia, Staphylococcus, Helicobacter) and partial human genome sequences (Chr. Y), using sequence reads of various read lengths, coverages, accuracies, and with and without mate-pairs. It is hoped that, based on these evaluations, computational biologists will identify innovative sequence assembly paradigms, bioinformaticists will determine promising
Rotation of Periodic Strings and Short Superstrings
, 1996
Cited by 25 (0 self)
This paper presents two simple approximation algorithms for the shortest superstring problem, with approximation ratios $2\frac{2}{3}$ ($\approx 2.67$) and $2\frac{25}{42}$ ($\approx 2.596$), improving the best previously published $2\frac{3}{4}$ approximation. The framework of our improved algorithms is similar to that of previous algorithms in the sense that they construct a superstring by computing some optimal cycle covers on the distance graph of the given strings, and then break and merge the cycles to finally obtain a Hamiltonian path, but we make use of new bounds on the overlap between two strings. We prove that for each periodic semi-infinite string $\alpha = a_1 a_2 \cdots$ of period $q$, there exists an integer $k$, such that for any (finite) string $s$ of period $p$ which is inequivalent to $\alpha$, the overlap between $s$ and the rotation $\alpha[k] = a_k a_{k+1} \cdots$ is at most $p + \frac{1}{2} q$. Moreover, if $p \le q$, then the overlap between $s$ and $\alpha[k]$ is not larger than $\frac{2}{3}(p+q)$. In the previous shortes...
A 2 2/3-Approximation Algorithm for the Shortest Superstring Problem
 DIMACS WORKSHOP ON SEQUENCING AND MAPPING
, 1995
Cited by 15 (0 self)
Given a collection of strings $S = \{s_1, \ldots, s_n\}$ over an alphabet, a superstring of $S$ is a string containing each $s_i$ as a substring; that is, for each $i$, $1 \le i \le n$, the superstring contains a block of $|s_i|$ consecutive characters that match $s_i$ exactly. The shortest superstring problem is the problem of finding a superstring of minimum length. The shortest superstring problem has applications in both data compression and computational biology. In data compression, the problem is a part of a general model of string compression proposed by Gallant, Maier and Storer (JCSS '80). Much of the recent interest in the problem is due to its application to DNA sequence assembly. The problem has been shown to be NP-hard; in fact, it was shown by Blum et al. (JACM '94) to be MAX SNP-hard. The first O(1)-approximation was also due to Blum et al., who gave an algorithm that always returns a superstring no more than 3 times the length of an optimal solution. Several researchers have published results that improve on the approximation ratio; of these, the best previous result is our algorithm ShortString, which achieves a 2 3
Parallel and Sequential Approximations of Shortest Superstrings
 In Proceedings of Fourth Scandinavian Workshop on Algorithm Theory
, 1994
Cited by 13 (1 self)
Superstrings have many applications in data compression and genetics. However, the decision version of the shortest superstring problem is NP-complete. In this paper we examine the complexity of approximating a shortest superstring. There are two basic measures of the approximations: the compression ratio and the approximation ratio. The well known and practical approximation algorithm is the sequential algorithm GREEDY. It approximates the shortest superstring with the compression ratio of 1/2 and with the approximation ratio of 4. Our main results are: (1) An NC algorithm which achieves the compression ratio of $1/(4+\epsilon)$. (2) The proof that the algorithm GREEDY is not parallelizable: the computation of its output is P-complete. (3) An improved sequential algorithm: the approximation ratio is reduced to 2.83. Previously it was reduced by Teng and Yao from 3 to 2.89. (4) The design of an RNC algorithm with constant approximation ratio and an NC algorithm with logarithmic approximation ratio.
Aggregation of composite solutions: strategies, models, examples
 Electronic preprint
An 8/13-approximation algorithm for the asymmetric maximum TSP
 In Proc. 13th Ann. ACM-SIAM Symp. on Discrete Algorithms (SODA)
, 2002
Cited by 7 (2 self)
We present a polynomial time approximation algorithm for the asymmetric maximum traveling salesperson problem that achieves performance ratio 8