Improving the Practical Space and Time Efficiency of the Shortest-Paths Approach to Sum-of-Pairs Multiple Sequence Alignment
1996
Cited by 54 (4 self)
The MSA program, written and distributed in 1989, is one of the few existing programs that attempts to find optimal alignments of multiple protein or DNA sequences. The MSA program implements a branch-and-bound technique together with a variant of Dijkstra's shortest paths algorithm to prune the basic dynamic programming graph. We have made substantial improvements in the time and space usage of MSA. The improvements make feasible a variety of problem instances that were not feasible previously. On some runs we achieve an order of magnitude reduction in space usage and a significant multiplicative factor speedup in running time. To explain how these improvements work, we give a much more detailed description of MSA than has been previously available. In practice, MSA rarely produces a provably optimal alignment and we explain why.
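As a rough illustration of the sum-of-pairs objective named in the title above, the sketch below scores an already-built multiple alignment by adding up the costs of every pair of rows. The function names and the cost values (`mismatch=1`, `gap=2`) are assumptions chosen for the example and are not taken from the MSA program, which additionally prunes the search for the alignment that minimizes this kind of cost.

```python
from itertools import combinations


def pair_cost(a, b, mismatch=1, gap=2):
    """Cost of one aligned column for two rows (illustrative values only)."""
    if a == "-" and b == "-":
        return 0          # a gap facing a gap costs nothing in this sketch
    if a == "-" or b == "-":
        return gap        # a residue facing a gap
    return 0 if a == b else mismatch


def sum_of_pairs_cost(rows, mismatch=1, gap=2):
    """Sum the pairwise column costs over every pair of alignment rows."""
    if len({len(r) for r in rows}) > 1:
        raise ValueError("all alignment rows must have the same length")
    total = 0
    for r1, r2 in combinations(rows, 2):
        total += sum(pair_cost(a, b, mismatch, gap) for a, b in zip(r1, r2))
    return total


if __name__ == "__main__":
    print(sum_of_pairs_cost(["AC-T", "ACGT", "A-GT"]))
```

Because the objective decomposes over pairs of rows, the optimal pairwise alignment costs give a lower bound on the cost of any multiple alignment, which is what makes bounding strategies for this problem possible.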
Dynamic Programming Alignment Accuracy
J. Comput. Biol., 1998
Cited by 45 (4 self)
Algorithms for generating alignments of biological sequences have inherent statistical limitations when it comes to the accuracy of the alignments they produce. Using simulations, we measure the accuracy of the standard global dynamic programming method and show that it can be reasonably well modelled by an "edge wander" approximation to the distribution of the optimal scoring path around the correct path in the vicinity of a gap. We also give a table from which accuracy values can be predicted for commonly used scoring schemes and sequence divergences (the PAM and BLOSUM series). Finally we describe how to calculate the expected accuracy of a given alignment, and show how this can be used to construct an optimal accuracy alignment algorithm which generates significantly more accurate alignments than standard dynamic programming methods in simulated experiments. 2 Introduction Alignments of biological sequences generated by computational algorithms are routinely used as a basis for ...
Assessing The Performance Of Fold Recognition Methods By Means Of A Comprehensive Benchmark.
Pac. Symp. Biocomput., 1996
Cited by 33 (0 self)
this paper addresses. Our goal is to devise a benchmark that can aid in assessing the performance of a fold-recognition method in an objective, unbiased and thorough way. The benchmark is independent of the representation of the proteins, the compatibility definition, the search algorithm, and the ranking and significance estimation procedures used in the method being evaluated. Thus, it allows a systematic comparison of different methods. Benchmarks are routinely used to assess performance of sequence-sequence alignment (e.g.
Rapid Assessment of Extremal Statistics for Gapped Local Alignment
Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology, 1999
Cited by 28 (8 self)
The statistical significance of gapped local alignments is characterized by analyzing the extremal statistics of the scores obtained from the alignment of random amino acid sequences. By identifying a complete set of linked clusters, "islands," we devise a method which accurately predicts the extremal score statistics by using only one to a few pairwise alignments. The success of our method relies crucially on the link between the statistics of island scores and extremal score statistics. This link is motivated by heuristic arguments, and firmly established by extensive numerical simulations for a variety of scoring parameter settings and sequence lengths. Our approach is several orders of magnitude faster than the widely used shuffling method, since island counting is trivially incorporated into the basic Smith-Waterman alignment algorithm with minimal computational cost, and all islands are counted in a single alignment. The availability of a rapid and accurate si...
A polyhedral approach to sequence alignment problems
Discrete Appl. Math., 2000
Cited by 15 (1 self)
We study two new problems in sequence alignment both from a practical and a theoretical view, using tools from combinatorial optimization to develop branch-and-cut algorithms. The Generalized Maximum Trace formulation captures several forms of multiple sequence alignment problems in a common framework, among them the original formulation of Maximum Trace. The RNA Sequence Alignment Problem captures the comparison of RNA molecules on the basis of their primary sequence and their secondary structure. Both problems have a characterization in terms of graphs which we reformulate in terms of integer linear programming. We then study the polytopes (or convex hulls of all feasible solutions) associated with the integer linear program for both problems. For each polytope we derive several classes of facet-defining inequalities and show that for some of these classes the corresponding separation problem can be solved in polynomial time. This leads to a polynomial time algorithm for pairwise sequence alignment that is not based on dynamic programming. Moreover, for multiple sequences the branch-and-cut algorithms for both sequence alignment problems are able to solve to optimality instances that are beyond the range of present dynamic programming approaches.
The Practical Use of the A* Algorithm for Exact Multiple Sequence Alignment
Journal of Computational Biology, 1997
Cited by 15 (3 self)
Multiple alignment is an important problem in computational biology. It is well known that it can be solved exactly by a dynamic programming algorithm which in turn can be interpreted as a shortest path computation in a directed acyclic graph. The A* algorithm (or goal directed unidirectional search) is a technique that speeds up the computation of a shortest path by transforming the edge lengths without losing the optimality of the shortest path. We implemented the A* algorithm in a computer program similar to MSA [GKS95] and FMA [SI97b]. We incorporated in this program new bounding strategies for both lower and upper bounds and show that the A* algorithm, together with our improvements, can speed up computations considerably. Additionally we show that the A* algorithm together with a standard bounding technique is superior to the well known Carrillo-Lipman bounding since it excludes more nodes from consideration. 1 Introduction One of the most prominent problems in computational mo...
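The shortest-path view described in this abstract can be sketched for the simplest case of two sequences. The code below is a minimal illustration, not the program from the paper: the function name, the cost values, and the heuristic (the gap cost times the remaining length difference) are assumptions made for the example, and the paper's own lower and upper bounding strategies are not reproduced here.

```python
import heapq


def astar_alignment_cost(s, t, mismatch=1, gap=2):
    """A* over the pairwise alignment lattice; returns (cost, nodes expanded).

    Node (i, j) means s[:i] and t[:j] are already aligned. The heuristic is
    an assumed, simple lower bound: the remaining suffixes differ in length
    by d characters, so at least d gaps are still required.
    """
    n, m = len(s), len(t)

    def h(i, j):
        return gap * abs((n - i) - (m - j))

    best = {(0, 0): 0}                    # cheapest known cost to each node
    frontier = [(h(0, 0), 0, 0, 0)]       # entries are (f = g + h, g, i, j)
    expanded = 0
    while frontier:
        _, g, i, j = heapq.heappop(frontier)
        if g > best.get((i, j), float("inf")):
            continue                      # stale entry, a cheaper path was found
        expanded += 1
        if (i, j) == (n, m):
            return g, expanded
        moves = []
        if i < n and j < m:               # align s[i] with t[j]
            moves.append((i + 1, j + 1, 0 if s[i] == t[j] else mismatch))
        if i < n:                         # s[i] against a gap
            moves.append((i + 1, j, gap))
        if j < m:                         # t[j] against a gap
            moves.append((i, j + 1, gap))
        for ni, nj, cost in moves:
            ng = g + cost
            if ng < best.get((ni, nj), float("inf")):
                best[(ni, nj)] = ng
                heapq.heappush(frontier, (ng + h(ni, nj), ng, ni, nj))
    raise RuntimeError("goal node not reached")


if __name__ == "__main__":
    print(astar_alignment_cost("ACGT", "AGT"))
```

Because the heuristic never overestimates the remaining cost, the first time the goal node is popped its cost is optimal, and lattice nodes whose estimated total exceeds that cost are never expanded. A tighter lower bound excludes more nodes.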
An Exact Solution for the Segment-to-Segment Multiple Sequence Alignment Problem
Cited by 13 (6 self)
In molecular biology sequence alignment is a crucial tool in studying structure and function of molecules as well as evolution of species. In the segment-to-segment variation of the multiple alignment problem the input can be seen as a set of runs of non-gapped matches (diagonals) between pairs of sequences. Given a weight function that assigns a weight score to every possible diagonal, the goal is to choose a consistent set of diagonals of maximum weight. We show that the segment-to-segment multiple alignment problem is equivalent to a novel formulation of the Maximum Weight Trace (MWT) problem. Solving the generalized MWT (GMWT) problem to optimality therefore improves upon the previous greedy strategies that are used for solving the segment-to-segment multiple sequence alignment problem. We show that the GMWT can be stated in terms of an integer linear program and then solve the integer linear program using methods from polyhedral combinatorics. This leads to a branch-and...
Parameterization studies for the SAM and HMMER methods of hidden Markov model generation
In: ISMB-96, 1996
Cited by 10 (1 self)
Multiple sequence alignment of distantly related viral proteins remains a challenge to all currently available alignment methods. The hidden Markov model approach offers a new, flexible method for the generation of multiple sequence alignments. The results of studies attempting to infer appropriate parameter constraints for the generation of de novo HMMs for globin, kinase, aspartic acid protease, and ribonuclease H sequences by both the SAM and HMMER methods are described.
Factored A* search for models over sequences and trees
In Proceedings of the 17th International Joint Conference on Artificial Intelligence, 2003
Cited by 9 (2 self)
We investigate the calculation of A* bounds for sequence and tree models which are the explicit intersection of a set of simpler models or can be bounded by such an intersection. We provide a natural viewpoint which unifies various instances of factored A* models for trees and sequences, some previously known and others novel, including multiple sequence alignment, weighted finite-state transducer composition, and lexicalized statistical parsing. The specific case of parsing with a product of syntactic (PCFG) and semantic (lexical dependency) components is then considered in detail. We show that this factorization gives a modular lexicalized parser which is simpler than comparably
An Eulerian path approach to global multiple alignment for DNA sequences
J. Comput. Biol., 2003
Cited by 9 (1 self)
With the rapid increase in the dataset of genome sequences, the multiple sequence alignment problem is increasingly important and frequently involves the alignment of a large number of sequences. Many heuristic algorithms have been proposed to improve the speed of computation and the quality of alignment. We introduce a novel approach that is fundamentally different from all currently available methods. Our motivation comes from the Eulerian method for fragment assembly in DNA sequencing that transforms all DNA fragments into a de Bruijn graph and then reduces sequence assembly to a Eulerian path problem. The paper focuses on global multiple alignment of DNA sequences, where entire sequences are aligned into one configuration. Our main result is an algorithm with almost linear computational speed with respect to the total size (number of letters) of sequences to be aligned. Five hundred simulated sequences (averaging 500 bases per sequence and as low as 70% pairwise identity) have been aligned within three minutes on a personal computer, and the quality of alignment is satisfactory. As a result, accurate and simultaneous alignment of thousands of long sequences within a reasonable amount of time becomes possible. Data from an Arabidopsis sequencing project is used to demonstrate the performance. Key words: multiple sequence alignment, de Bruijn graph, Eulerian path.
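The de Bruijn graph mentioned in this abstract can be sketched as follows. This is only the graph-construction step under assumed conventions (nodes are (k-1)-mers, each k-mer contributes one edge, and edge counts record how often a k-mer occurs across the input). The function name and the choice of `k` are hypothetical, and the paper's alignment procedure built on top of the graph is not shown.

```python
from collections import defaultdict


def de_bruijn_graph(sequences, k):
    """Build a de Bruijn graph as {prefix: {suffix: count}} from k-mers.

    Each k-mer adds an edge from its first k-1 letters to its last k-1
    letters; counts accumulate over all input sequences.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    graph = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            graph[kmer[:-1]][kmer[1:]] += 1
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {u: dict(vs) for u, vs in graph.items()}


if __name__ == "__main__":
    for node, edges in de_bruijn_graph(["ACGT", "ACGA"], 3).items():
        print(node, edges)
```

Edges with high counts correspond to k-mers shared by many sequences, which is why similar sequences collapse onto common paths in such a graph. Construction touches each letter a constant number of times for fixed `k`, so this step is linear in the total input size.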

