Toward Simplifying and Accurately Formulating Fragment Assembly
 JOURNAL OF COMPUTATIONAL BIOLOGY
, 1995
Abstract

Cited by 63 (1 self)
The fragment assembly problem is that of reconstructing a DNA sequence from a collection of randomly sampled fragments. Traditionally the objective of this problem has been to produce the shortest string that contains all the fragments as substrings, but in the case of repetitive target sequences this objective produces answers that are overcompressed. In this paper, the problem is reformulated as one of finding a maximum-likelihood reconstruction with respect to the 2-sided Kolmogorov-Smirnov statistic, and it is argued that this is a better formulation of the problem. Next the fragment assembly problem is recast in graph-theoretic terms as one of finding a noncyclic subgraph with certain properties and the objectives of being shortest or maximally-likely are also recast in this framework. Finally, a series of graph reduction transformations are given that dramatically reduce the size of the graph to be explored in practical instances of the problem. This reduction is ...
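As a rough illustration of the statistic named in this abstract, the sketch below computes the two-sided Kolmogorov-Smirnov distance between a set of fragment start positions and a uniform distribution over the target. The uniform sampling model, the function name `ks_statistic_uniform`, and its parameters are assumptions made for this example; the abstract does not spell out how the paper applies the statistic.

```python
def ks_statistic_uniform(positions, length):
    """Two-sided KS distance between the empirical CDF of `positions`
    and the uniform CDF on [0, length] (assumed sampling model)."""
    xs = sorted(positions)
    n = len(xs)
    if n == 0:
        raise ValueError("need at least one position")
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = x / length  # uniform CDF evaluated at x
        # The empirical CDF jumps from (i - 1) / n to i / n at x,
        # so check the gap on both sides of the jump.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d
```

A small value suggests that the start positions implied by a candidate layout look evenly spread; a layout that collapses repeats would tend to pile fragments up and raise the distance.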
Combinatorial algorithms for DNA sequence assembly
 Algorithmica
, 1993
Abstract

Cited by 62 (3 self)
The trend towards very large DNA sequencing projects, such as those being undertaken as part of the human genome initiative, necessitates the development of efficient and precise algorithms for assembling a long DNA sequence from the fragments obtained by shotgun sequencing or other methods. The sequence reconstruction problem that we take as our formulation of DNA sequence assembly is a variation of the shortest common superstring problem, complicated by the presence of sequencing errors and reverse complements of fragments. Since the simpler superstring problem is NP-hard, any efficient reconstruction procedure must resort to heuristics. In this paper, however, a four-phase approach based on rigorous design criteria is presented, and has been found to be very accurate in practice. Our method is robust in the sense that it can accommodate high sequencing error rates and list a series of alternate solutions in the event that several appear equally good. Moreover it uses a limited form ...
A Branch-and-Cut Algorithm for Multiple Sequence Alignment
 IN PROC. OF THE 1ST ANN. INTERN. CONF. ON COMP. MOLEC. BIO. (RECOMB 97)
, 1997
Abstract

Cited by 35 (7 self)
Multiple sequence alignment is an important problem in computational biology. We study the Maximum Trace formulation introduced by Kececioglu [Kec91]. We first phrase the problem in terms of forbidden subgraphs, which enables us to express Maximum Trace as an integer linear-programming problem, and then solve the integer linear program using methods from polyhedral combinatorics. The trace polytope is the convex hull of all feasible solutions to the Maximum Trace problem; for the case of two sequences, we give a complete characterization of this polytope. This yields a polynomial-time algorithm for a general version of pairwise sequence alignment that, perhaps surprisingly, does not use dynamic programming; this yields, for instance, a non-dynamic-programming algorithm for sequence comparison under the 0-1 metric, which gives another answer to a long-open question in the area of string algorithms [PW93]. For the multiple-sequence case, we derive several classes of facet-defining inequali...
Maximum Likelihood Genome Assembly
, 2009
Abstract

Cited by 24 (0 self)
Whole genome shotgun assembly is the process of taking many short sequenced segments (reads) and reconstructing the genome from which they originated. We demonstrate how the technique of bidirected network flow can be used to explicitly model the double-stranded nature of DNA for genome assembly. By combining an algorithm for the Chinese Postman Problem on bidirected graphs with the construction of a bidirected de Bruijn graph, we are able to find the shortest double-stranded DNA sequence that contains a given set of k-long DNA molecules. This is the first exact polynomial time algorithm for the assembly of a double-stranded genome. Furthermore, we propose a maximum likelihood framework for assembling the genome that is the most likely source of the reads, in lieu of the standard maximum parsimony approach (which finds the shortest genome subject to some constraints). In this setting, we give a bidirected network flow-based algorithm that, by taking advantage of high coverage, accurately estimates the copy counts of repeats in a genome. Our second algorithm combines these predicted copy counts with mate-pair data in order to assemble the reads into contigs. We run our algorithms on simulated read data from Escherichia coli and predict copy counts with extremely high accuracy, while assembling long contigs.
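For readers unfamiliar with the graph this abstract builds on, here is a minimal sketch of an ordinary (single-stranded) de Bruijn graph over the k-mers of a read set, plus a reverse-complement helper that hints at why double-strandedness needs extra care. It is a simplification: the bidirected construction and the flow algorithms of the paper are not reproduced, and all function names are hypothetical.

```python
from collections import defaultdict

# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the sequence read from the opposite strand."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical(kmer):
    """Pick one representative for a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


def de_bruijn_edges(reads, k):
    """Map each (k-1)-mer to the (k-1)-mers that follow it in some read.

    Each distinct k-mer contributes exactly one edge, from its prefix
    of length k-1 to its suffix of length k-1.
    """
    graph = defaultdict(list)
    seen = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer not in seen:
                seen.add(kmer)
                graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)
```

In a double-stranded setting a read and its reverse complement describe the same molecule, which is what `canonical` is meant to suggest; a bidirected graph encodes that identification in the edges themselves rather than by picking a representative.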
A polyhedral approach to sequence alignment problems
 DISCRETE APPL. MATH
, 2000
Abstract

Cited by 24 (2 self)
We study two new problems in sequence alignment both from a practical and a theoretical view, using tools from combinatorial optimization to develop branch-and-cut algorithms. The Generalized Maximum Trace formulation captures several forms of multiple sequence alignment problems in a common framework, among them the original formulation of Maximum Trace. The RNA Sequence Alignment Problem captures the comparison of RNA molecules on the basis of their primary sequence and their secondary structure. Both problems have a characterization in terms of graphs which we reformulate in terms of integer linear programming. We then study the polytopes (or convex hulls of all feasible solutions) associated with the integer linear program for both problems. For each polytope we derive several classes of facet-defining inequalities and show that for some of these classes the corresponding separation problem can be solved in polynomial time. This leads to a polynomial time algorithm for pairwise sequence alignment that is not based on dynamic programming. Moreover, for multiple sequences the branch-and-cut algorithms for both sequence alignment problems are able to solve to optimality instances that are beyond the range of present dynamic programming approaches.
An Exact Solution for the Segment-to-Segment Multiple Sequence Alignment Problem
Abstract

Cited by 24 (10 self)
In molecular biology sequence alignment is a crucial tool in studying structure and function of molecules as well as evolution of species. In the segment-to-segment variation of the multiple alignment problem the input can be seen as a set of runs of non-gapped matches (diagonals) between pairs of sequences. Given a weight function that assigns a weight score to every possible diagonal, the goal is to choose a consistent set of diagonals of maximum weight. We show that the segment-to-segment multiple alignment problem is equivalent to a novel formulation of the Maximum Weight Trace (MWT) problem. Solving the generalized MWT (GMWT) problem to optimality therefore improves upon the previous greedy strategies that are used for solving the segment-to-segment multiple sequence alignment problem. We show that the GMWT can be stated in terms of an integer linear program and then solve the integer linear program using methods from polyhedral combinatorics. This leads to a branch-and...
Computability of models for sequence assembly
 In WABI
, 2007
Abstract

Cited by 23 (2 self)
Graph-theoretic models have come to the forefront as some of the most powerful and practical methods for sequence assembly. Simultaneously, the computational hardness of the underlying graph algorithms has remained open. Here we present two theoretical results about the complexity of these models for sequence assembly. In the first part, we show sequence assembly to be NP-hard under two different models: string graphs and de Bruijn graphs. Together with an earlier result on the NP-hardness of overlap graphs, this demonstrates that all of the popular graph-theoretic sequence assembly paradigms are NP-hard. In our second result, we give the first, to our knowledge, optimal polynomial time algorithm for genome assembly that explicitly models the double-strandedness of DNA. We solve the Chinese Postman Problem on bidirected graphs using bidirected flow techniques and show how to use it to find the shortest double-stranded DNA sequence which contains a given set of k-long words. This algorithm has applications to sequencing by hybridization and short read assembly.
A 2 2/3-Approximation Algorithm for the Shortest Superstring Problem
 DIMACS WORKSHOP ON SEQUENCING AND MAPPING
, 1995
Abstract

Cited by 15 (0 self)
Given a collection of strings $S = \{s_1, \ldots, s_n\}$ over an alphabet, a superstring of $S$ is a string containing each $s_i$ as a substring; that is, for each $i$, $1 \le i \le n$, the superstring contains a block of $|s_i|$ consecutive characters that match $s_i$ exactly. The shortest superstring problem is the problem of finding a superstring of minimum length. The shortest superstring problem has applications in both data compression and computational biology. In data compression, the problem is a part of a general model of string compression proposed by Gallant, Maier and Storer (JCSS '80). Much of the recent interest in the problem is due to its application to DNA sequence assembly. The problem has been shown to be NP-hard; in fact, it was shown by Blum et al. (JACM '94) to be MAX SNP-hard. The first $O(1)$-approximation was also due to Blum et al., who gave an algorithm that always returns a superstring no more than 3 times the length of an optimal solution. Several researchers have published results that improve on the approximation ratio; of these, the best previous result is our algorithm ShortString, which achieves a 2 3
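To make the superstring definition concrete, the sketch below implements the well-known greedy merge heuristic: repeatedly join the two strings with the largest suffix-prefix overlap. This is not the ShortString algorithm or the approximation algorithm of this entry, and it carries none of their guarantees; the function names are hypothetical.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(strings):
    """Build a (not necessarily shortest) superstring by greedy merging."""
    # Drop duplicates, then drop strings contained in another string.
    uniq = list(dict.fromkeys(strings))
    frags = [s for s in uniq if not any(s != t and s in t for t in uniq)]
    while len(frags) > 1:
        best = (-1, 0, 1)  # (overlap length, index of left, index of right)
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        merged = frags[i] + frags[j][k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (i, j)]
        frags.append(merged)
    return frags[0] if frags else ""
```

For example, `greedy_superstring(["ACGT", "GTAC", "TACG"])` returns a string that contains all three inputs as substrings. The quadratic pair scan is kept for clarity rather than speed.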
Ab Initio Whole Genome Shotgun Assembly With Mated Short Reads
Abstract

Cited by 12 (0 self)
Next Generation Sequencing (NGS) technologies are capable of reading millions of short DNA sequences both quickly and cheaply. While these technologies are already being used for resequencing individuals once a reference genome exists, it has not been shown whether it is possible to use them for ab initio genome assembly. In this paper, we give a novel network flow-based algorithm that, by taking advantage of the high coverage provided by NGS, accurately estimates the copy counts of repeats in a genome. We also give a second algorithm that combines the predicted copy counts with mate-pair data in order to assemble the reads into contigs. We run our algorithms on simulated read data from E. coli and predict copy counts with extremely high accuracy, while assembling long contigs.
Parameterized Complexity Analysis in Computational Biology
 Comput. Appl. Biosci
, 1995
Abstract

Cited by 11 (3 self)
Many computational problems in biology involve parameters for which a small range of values covers important applications. We argue that for many problems in this setting, parameterized computational complexity rather than NP-completeness is the appropriate tool for studying apparent intractability. At issue in the theory of parameterized complexity is whether a problem can be solved in time $O(n^{\alpha})$ for each fixed parameter value, where $\alpha$ is a constant independent of the parameter. In addition to surveying this complexity framework, we describe a new result for the Longest Common Subsequence problem. In particular, we show that the problem is hard for W[t] for all t when parameterized by the number of strings and the size of the alphabet. Lower bounds on the complexity of this basic combinatorial problem imply lower bounds on more general sequence alignment and consensus discovery problems. We also describe a number of open problems pertaining to the parameterized complexity of pro...
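The hardness result above concerns many strings at once. For contrast, the two-string case has a textbook dynamic program that runs in time proportional to the product of the two lengths; the sketch below shows that baseline only and is not taken from the paper. The function name is hypothetical.

```python
def lcs_length(a, b):
    """Length of a longest common subsequence of two strings."""
    # prev[j] holds the LCS length of the processed prefix of a and b[:j].
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0]
        for j, bj in enumerate(b, start=1):
            if ch == bj:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]
```

Extending the same table to k strings multiplies the table size by one string length per added string, which is the kind of growth in the parameter that the parameterized analysis examines.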