Combinatorial algorithms for DNA sequence assembly
- Algorithmica, 1993
Cited by 33 (3 self)
The trend towards very large DNA sequencing projects, such as those being undertaken as part of the human genome initiative, necessitates the development of efficient and precise algorithms for assembling a long DNA sequence from the fragments obtained by shotgun sequencing or other methods. The sequence reconstruction problem that we take as our formulation of DNA sequence assembly is a variation of the shortest common superstring problem, complicated by the presence of sequencing errors and reverse complements of fragments. Since the simpler superstring problem is NP-hard, any efficient reconstruction procedure must resort to heuristics. In this paper, however, a four-phase approach based on rigorous design criteria is presented, and has been found to be very accurate in practice. Our method is robust in the sense that it can accommodate high sequencing error rates and list a series of alternate solutions in the event that several appear equally good. Moreover, it uses a limited form ...
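As a rough illustration of why reverse complements complicate the superstring view of assembly, the following Python sketch computes exact suffix-prefix overlaps between two fragments in both orientations. It is not the four-phase method of the paper: it assumes error-free fragments over the alphabet ACGT, and the function names (`reverse_complement`, `overlap`, `best_overlap`) are hypothetical, chosen only for this example.

```python
# Illustrative only: exact overlaps, no sequencing errors (an assumption
# that the paper's formulation explicitly does NOT make).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(fragment: str) -> str:
    """Return the reverse complement of a DNA fragment over ACGT."""
    return fragment.translate(COMPLEMENT)[::-1]


def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def best_overlap(a: str, b: str) -> tuple[int, str]:
    """Try b as given and reverse-complemented; keep the longer overlap."""
    forward = overlap(a, b)
    reverse = overlap(a, reverse_complement(b))
    if forward >= reverse:
        return forward, "forward"
    return reverse, "reverse"
```

For example, `best_overlap("ACGTAC", "CCGTA")` finds only a one-character overlap in the forward orientation but a three-character overlap once the second fragment is reverse-complemented, which is the kind of ambiguity an assembler has to resolve for every pair of fragments.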
Toward Simplifying and Accurately Formulating Fragment Assembly
- Journal of Computational Biology, 1995
Cited by 30 (1 self)
The fragment assembly problem is that of reconstructing a DNA sequence from a collection of randomly sampled fragments. Traditionally the objective of this problem has been to produce the shortest string that contains all the fragments as substrings, but in the case of repetitive target sequences this objective produces answers that are overcompressed. In this paper, the problem is reformulated as one of finding a maximum-likelihood reconstruction with respect to the 2-sided Kolmogorov-Smirnov statistic, and it is argued that this is a better formulation of the problem. Next the fragment assembly problem is recast in graph-theoretic terms as one of finding a non-cyclic subgraph with certain properties and the objectives of being shortest or maximally-likely are also recast in this framework. Finally, a series of graph reduction transformations are given that dramatically reduce the size of the graph to be explored in practical instances of the problem. This reduction is ...
A Branch-and-Cut Algorithm for Multiple Sequence Alignment
- In Proc. of the 1st Ann. Intern. Conf. on Comp. Molec. Bio. (RECOMB 97), 1997
Cited by 24 (5 self)
Multiple sequence alignment is an important problem in computational biology. We study the Maximum Trace formulation introduced by Kececioglu [Kec91]. We first phrase the problem in terms of forbidden subgraphs, which enables us to express Maximum Trace as an integer linear-programming problem, and then solve the integer linear program using methods from polyhedral combinatorics. The trace polytope is the convex hull of all feasible solutions to the Maximum Trace problem; for the case of two sequences, we give a complete characterization of this polytope. This yields a polynomial-time algorithm for a general version of pairwise sequence alignment that, perhaps surprisingly, does not use dynamic programming; this yields, for instance, a non-dynamic-programming algorithm for sequence comparison under the 0-1 metric, which gives another answer to a long-open question in the area of string algorithms [PW93]. For the multiple-sequence case, we derive several classes of facet-defining inequali...
A polyhedral approach to sequence alignment problems
- Discrete Appl. Math., 2000
Cited by 15 (1 self)
We study two new problems in sequence alignment from both a practical and a theoretical view, using tools from combinatorial optimization to develop branch-and-cut algorithms. The Generalized Maximum Trace formulation captures several forms of multiple sequence alignment problems in a common framework, among them the original formulation of Maximum Trace. The RNA Sequence Alignment Problem captures the comparison of RNA molecules on the basis of their primary sequence and their secondary structure. Both problems have a characterization in terms of graphs which we reformulate in terms of integer linear programming. We then study the polytopes (or convex hulls of all feasible solutions) associated with the integer linear program for both problems. For each polytope we derive several classes of facet-defining inequalities and show that for some of these classes the corresponding separation problem can be solved in polynomial time. This leads to a polynomial-time algorithm for pairwise sequence alignment that is not based on dynamic programming. Moreover, for multiple sequences the branch-and-cut algorithms for both sequence alignment problems are able to solve to optimality instances that are beyond the range of present dynamic programming approaches.
An Exact Solution for the Segment-to-Segment Multiple Sequence Alignment Problem
Cited by 13 (6 self)
In molecular biology, sequence alignment is a crucial tool in studying the structure and function of molecules as well as the evolution of species. In the segment-to-segment variation of the multiple alignment problem, the input can be seen as a set of runs of non-gapped matches (diagonals) between pairs of sequences. Given a weight function that assigns a weight score to every possible diagonal, the goal is to choose a consistent set of diagonals of maximum weight. We show that the segment-to-segment multiple alignment problem is equivalent to a novel formulation of the Maximum Weight Trace (MWT) problem. Solving the generalized MWT (GMWT) problem to optimality therefore improves upon the previous greedy strategies that are used for solving the segment-to-segment multiple sequence alignment problem. We show that the GMWT can be stated in terms of an integer linear program and then solve the integer linear program using methods from polyhedral combinatorics. This leads to a branch-and...
A 2 2/3-Approximation Algorithm for the Shortest Superstring Problem
- In Proc. 7th Symp. on Combinatorial Pattern Matching, Lecture Notes in Computer Science, 1996
Cited by 12 (0 self)
Given a collection of strings $S = \{s_1, \ldots, s_n\}$ over an alphabet $\Sigma$, a superstring $\alpha$ of $S$ is a string containing each $s_i$ as a substring; that is, for each $i$, $1 \le i \le n$, $\alpha$ contains a block of $|s_i|$ consecutive characters that match $s_i$ exactly. The shortest superstring problem is the problem of finding a superstring $\alpha$ of minimum length. The shortest superstring problem has applications in both data compression and computational biology. It was shown by Blum et al. [3] to be MAX SNP-hard. The first O(1)-approximation algorithm also appeared in [3], which returns a superstring no more than 3 times the length of an optimal solution. Prior to the algorithm described in this paper, there were several published results that improved on the approximation ratio; of these, the best is our algorithm ShortString, a $2\frac{3}{4}$-approximation [1]. We present our new algorithm, G-ShortString, which achieves a ratio of $2\frac{2}{3}$. Our approach builds on the work in [1], in which we identifi...
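To make the superstring definition concrete, here is a minimal Python sketch of the classical greedy merge heuristic: repeatedly merge the two strings with the largest suffix-prefix overlap. This is not ShortString or G-ShortString from the abstract, and it carries none of their approximation guarantees; the function names are hypothetical and the quadratic pair search is written for readability rather than speed.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(strings: list[str]) -> str:
    """Build a superstring by greedily merging maximum-overlap pairs."""
    # Drop duplicates, then drop strings already contained in another one.
    items = list(dict.fromkeys(strings))
    items = [s for s in items if not any(s != t and s in t for t in items)]

    while len(items) > 1:
        best_k, best_i, best_j = -1, 0, 1
        for i, a in enumerate(items):
            for j, b in enumerate(items):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        # Merge the chosen pair, sharing the overlapping characters once.
        merged = items[best_i] + items[best_j][best_k:]
        items = [s for idx, s in enumerate(items)
                 if idx not in (best_i, best_j)]
        items.append(merged)

    return items[0] if items else ""
```

Calling `greedy_superstring(["CATG", "ATGC", "TGCA"])` yields a string that contains all three inputs as substrings; the result is always a valid superstring, though in general not a shortest one.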
Parameterized Complexity Analysis in Computational Biology
- Comput. Appl. Biosci., 1995
Cited by 9 (4 self)
Many computational problems in biology involve parameters for which a small range of values covers important applications. We argue that for many problems in this setting, parameterized computational complexity rather than NP-completeness is the appropriate tool for studying apparent intractability. At issue in the theory of parameterized complexity is whether a problem can be solved in time $O(n^{\alpha})$ for each fixed parameter value, where $\alpha$ is a constant independent of the parameter. In addition to surveying this complexity framework, we describe a new result for the Longest Common Subsequence problem. In particular, we show that the problem is hard for W[t] for all t when parameterized by the number of strings and the size of the alphabet. Lower bounds on the complexity of this basic combinatorial problem imply lower bounds on more general sequence alignment and consensus discovery problems. We also describe a number of open problems pertaining to the parameterized complexity of pro...
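For orientation, the two-string case of Longest Common Subsequence is solvable by the textbook dynamic program sketched below in Python, in time proportional to the product of the two lengths. The hardness result in the abstract concerns the general problem parameterized by the number of strings, where extending this table to k strings gives one dimension per string and the cost grows exponentially in k. The function name `lcs` is an assumption for this example and nothing here is taken from the paper.

```python
def lcs(a: str, b: str) -> str:
    """Return one longest common subsequence of a and b."""
    m, n = len(a), len(b)
    # table[i][j] holds the LCS length of a[:i] and b[:j].
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])

    # Walk back from the bottom-right corner to recover the subsequence.
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))
```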
Ab Initio Whole Genome Shotgun Assembly With Mated Short Reads
Cited by 6 (0 self)
Next Generation Sequencing (NGS) technologies are capable of reading millions of short DNA sequences both quickly and cheaply. While these technologies are already being used for resequencing individuals once a reference genome exists, it has not been shown whether it is possible to use them for ab initio genome assembly. In this paper, we give a novel network flow-based algorithm that, by taking advantage of the high coverage provided by NGS, accurately estimates the copy counts of repeats in a genome. We also give a second algorithm that combines the predicted copy counts with mate-pair data in order to assemble the reads into contigs. We run our algorithms on simulated read data from E. coli and predict copy counts with extremely high accuracy, while assembling long contigs.
Multiple sequence alignment
- Protein Structure Prediction: Methods and Protocols, 2000
Cited by 6 (1 self)
Multiple sequence alignment is a central problem in Bioinformatics and a challenging one for optimisation algorithms. An established integer programming approach is to apply branch-and-cut to a graph-theoretical model. The models are exponentially large but are represented intensionally, and violated constraints can be located in polynomial time. This report describes a new integer program formulation that generates polynomial-sized models small enough to be passed to generic solvers. It is a hybrid formulation relating the sparse alignment graph with a compact encoding of the alignment matrix via channelling constraints. Alignments obtained with a pseudo-Boolean local search algorithm are competitive with those of state-of-the-art algorithms. Execution times are much longer, but in future work we aim to develop a more efficient specialised algorithm.
A SAT-Based Approach to Multiple Sequence Alignment
- Poster, Ninth International Conference on Principles and Practice of Constraint Programming, 2003
Cited by 5 (3 self)
Multiple sequence alignment is a central problem in Bioinformatics. A known integer programming approach is to apply branch-and-cut to exponentially large graph-theoretic models. This paper describes a new integer program formulation that generates models small enough to be passed to generic solvers. The formulation is a hybrid relating the sparse alignment graph with a compact encoding of the alignment matrix via channelling constraints. Alignments obtained with a SAT-based local search algorithm are competitive with those of state-of-the-art algorithms, though execution times are much longer.

