Dynamic Programming Alignment Accuracy
 J. Comput. Biol
, 1998
Cited by 80 (5 self)
Algorithms for generating alignments of biological sequences have inherent statistical limitations when it comes to the accuracy of the alignments they produce. Using simulations, we measure the accuracy of the standard global dynamic programming method and show that it can be reasonably well modelled by an "edge wander" approximation to the distribution of the optimal scoring path around the correct path in the vicinity of a gap. We also give a table from which accuracy values can be predicted for commonly used scoring schemes and sequence divergences (the PAM and BLOSUM series). Finally, we describe how to calculate the expected accuracy of a given alignment, and show how this can be used to construct an optimal accuracy alignment algorithm which generates significantly more accurate alignments than standard dynamic programming methods in simulated experiments.
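For orientation, the "standard global dynamic programming method" whose accuracy this abstract measures can be sketched as a Needleman-Wunsch score computation. This is a minimal illustration, not the authors' code: the function name, the linear gap penalty, and the match/mismatch values are assumptions chosen for brevity, whereas the paper works with PAM and BLOSUM scoring schemes.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score of strings a and b.

    Assumed scoring (illustrative only): fixed match/mismatch scores
    and a linear gap penalty. Uses two rows of the DP table.
    """
    # Row 0: aligning the empty prefix of a against prefixes of b.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]  # column 0: prefix of a against the empty string
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + s,   # align a[i-1] with b[j-1]
                           prev[j] + gap,     # gap in b
                           cur[j - 1] + gap)) # gap in a
        prev = cur
    return prev[-1]
```

With these assumed parameters, aligning a sequence against itself scores one point per residue, and every unavoidable gap costs two.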
Improving the Practical Space and Time Efficiency of the Shortest-Paths Approach to Sum-of-Pairs Multiple Sequence Alignment
, 1996
Cited by 76 (4 self)
The MSA program, written and distributed in 1989, is one of the few existing programs that attempts to find optimal alignments of multiple protein or DNA sequences. The MSA program implements a branch-and-bound technique together with a variant of Dijkstra's shortest paths algorithm to prune the basic dynamic programming graph. We have made substantial improvements in the time and space usage of MSA. The improvements make feasible a variety of problem instances that were not feasible previously. On some runs we achieve an order of magnitude reduction in space usage and a significant multiplicative factor speedup in running time. To explain how these improvements work, we give a much more detailed description of MSA than has been previously available. In practice, MSA rarely produces a provably optimal alignment and we explain why.
Phylogenomic inference of protein molecular function: advances and challenges
 Bioinformatics
, 2004
Cited by 76 (3 self)
Motivation: Protein families evolve a multiplicity of functions through gene duplication, speciation and other processes. As a number of studies have shown, standard methods of protein function prediction produce systematic errors on these data. Phylogenomic analysis (combining phylogenetic tree construction, integration of experimental data and differentiation of orthologs and paralogs) has been proposed to address these errors and improve the accuracy of functional classification. The explicit integration of structure prediction and analysis in this framework, which we call structural phylogenomics, provides additional insights into protein superfamily evolution. Results: Results of protein functional classification using phylogenomic analysis show fewer expected false positives overall than when pairwise methods of functional classification are employed. We present an overview of the motivations and fundamental principles of phylogenomic analysis, new methods developed for the key tasks, benchmark datasets for these tasks (when available) and suggest procedures to increase accuracy. We also discuss some of the methods used in the Celera Genomics high-throughput phylogenomic classification of the human genome. Availability: Software tools from the Berkeley Phylogenomics Group are available at
Assessing The Performance Of Fold Recognition Methods By Means Of A Comprehensive Benchmark.
 Pac. Symp. Biocomput
, 1996
Cited by 51 (0 self)
... this paper addresses. Our goal is to devise a benchmark that can aid in assessing the performance of a fold-recognition method in an objective, unbiased and thorough way. The benchmark is independent of the representation of the proteins, the compatibility definition, the search algorithm, and the ranking and significance estimation procedures used in the method being evaluated. Thus, it allows a systematic comparison of different methods. Benchmarks are routinely used to assess performance of sequence-sequence alignment (e.g.
Rapid Assessment of Extremal Statistics for Gapped Local Alignment
 Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology
, 1999
Cited by 36 (8 self)
The statistical significance of gapped local alignments is characterized by analyzing the extremal statistics of the scores obtained from the alignment of random amino acid sequences. By identifying a complete set of linked clusters, "islands," we devise a method which accurately predicts the extremal score statistics by using only one to a few pairwise alignments. The success of our method relies crucially on the link between the statistics of island scores and extremal score statistics. This link is motivated by heuristic arguments, and firmly established by extensive numerical simulations for a variety of scoring parameter settings and sequence lengths. Our approach is several orders of magnitude faster than the widely used shuffling method, since island counting is trivially incorporated into the basic Smith-Waterman alignment algorithm with minimal computational cost, and all islands are counted in a single alignment. The availability of a rapid and accurate si...
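The idea of counting islands inside a single Smith-Waterman pass can be illustrated roughly as follows. This sketch rests on one possible reading of the abstract and is not the authors' implementation: it treats an island as the set of positive-scoring cells whose best path traces back to the same starting cell, records the peak score of each island, and assumes a linear gap penalty, illustrative match/mismatch scores, and diagonal-first tie-breaking. The name `island_peaks` is hypothetical.

```python
def island_peaks(a, b, match=1, mismatch=-1, gap=-2):
    """Smith-Waterman fill that also tracks, per cell, the cell where its
    local path started. Returns {origin_cell: peak score of that island}.

    Assumptions (not from the source): linear gap penalty, simple
    match/mismatch scores, ties resolved in favour of the diagonal move.
    """
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]          # local scores
    origin = [[None] * (m + 1) for _ in range(n + 1)]  # island start cells
    peaks = {}
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            candidates = [
                (H[i - 1][j - 1] + s, origin[i - 1][j - 1]),  # diagonal
                (H[i - 1][j] + gap, origin[i - 1][j]),        # gap in b
                (H[i][j - 1] + gap, origin[i][j - 1]),        # gap in a
            ]
            best, src = max(candidates, key=lambda c: c[0])
            if best <= 0:
                continue  # score resets to zero; cell belongs to no island
            H[i][j] = best
            # A positive score reached from a zero cell starts a new island.
            origin[i][j] = src if src is not None else (i, j)
            key = origin[i][j]
            peaks[key] = max(peaks.get(key, 0), best)
    return peaks
```

The extra bookkeeping is one origin lookup and one dictionary update per cell, which is consistent with the abstract's remark that island counting adds little cost to the basic fill. How the resulting peak scores are turned into extremal statistics is described in the paper and is not attempted here.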
A polyhedral approach to sequence alignment problems
 Discrete Appl. Math
, 2000
Cited by 24 (2 self)
We study two new problems in sequence alignment both from a practical and a theoretical view, using tools from combinatorial optimization to develop branch-and-cut algorithms. The Generalized Maximum Trace formulation captures several forms of multiple sequence alignment problems in a common framework, among them the original formulation of Maximum Trace. The RNA Sequence Alignment Problem captures the comparison of RNA molecules on the basis of their primary sequence and their secondary structure. Both problems have a characterization in terms of graphs which we reformulate in terms of integer linear programming. We then study the polytopes (or convex hulls of all feasible solutions) associated with the integer linear program for both problems. For each polytope we derive several classes of facet-defining inequalities and show that for some of these classes the corresponding separation problem can be solved in polynomial time. This leads to a polynomial time algorithm for pairwise sequence alignment that is not based on dynamic programming. Moreover, for multiple sequences the branch-and-cut algorithms for both sequence alignment problems are able to solve to optimality instances that are beyond the range of present dynamic programming approaches.
An Exact Solution for the Segment-to-Segment Multiple Sequence Alignment Problem
Cited by 24 (10 self)
In molecular biology, sequence alignment is a crucial tool in studying structure and function of molecules as well as evolution of species. In the segment-to-segment variation of the multiple alignment problem the input can be seen as a set of runs of non-gapped matches (diagonals) between pairs of sequences. Given a weight function that assigns a weight score to every possible diagonal, the goal is to choose a consistent set of diagonals of maximum weight. We show that the segment-to-segment multiple alignment problem is equivalent to a novel formulation of the Maximum Weight Trace (MWT) problem. Solving the generalized MWT (GMWT) problem to optimality therefore improves upon the previous greedy strategies that are used for solving the segment-to-segment multiple sequence alignment problem. We show that the GMWT can be stated in terms of an integer linear program and then solve the integer linear program using methods from polyhedral combinatorics. This leads to a branch-and...
The Practical Use of the A* Algorithm for Exact Multiple Sequence Alignment
 Journal of Computational Biology
, 1997
Cited by 23 (5 self)
Multiple alignment is an important problem in computational biology. It is well known that it can be solved exactly by a dynamic programming algorithm which in turn can be interpreted as a shortest path computation in a directed acyclic graph. The A* algorithm (or goal-directed unidirectional search) is a technique that speeds up the computation of a shortest path by transforming the edge lengths without losing the optimality of the shortest path. We implemented the A* algorithm in a computer program similar to MSA [GKS95] and FMA [SI97b]. We incorporated in this program new bounding strategies for both lower and upper bounds and show that the A* algorithm, together with our improvements, can speed up computations considerably. Additionally, we show that the A* algorithm together with a standard bounding technique is superior to the well known Carrillo-Lipman bounding since it excludes more nodes from consideration.
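To make the shortest-path view concrete, here is a small A* search over the alignment grid of two sequences. It is a deliberately reduced sketch and not the program described in the abstract: it handles only the pairwise case with unit edit costs, and the lower bound `h` (the difference in remaining lengths, since each unit of difference forces at least one gap) is an assumption chosen for simplicity, whereas the paper develops stronger bounds for multiple sequences. Running A* with a consistent bound is equivalent to running Dijkstra's algorithm on edge lengths transformed to c(u,v) - h(u) + h(v).

```python
import heapq


def astar_edit_distance(a, b):
    """A* on the pairwise alignment grid with unit mismatch and gap costs.

    Returns (optimal cost, number of expanded nodes). The heuristic is a
    simple admissible and consistent bound, assumed here for illustration.
    """
    n, m = len(a), len(b)

    def h(i, j):
        # Each unit of remaining length difference needs at least one gap.
        return abs((n - i) - (m - j))

    best = {(0, 0): 0}
    heap = [(h(0, 0), 0, (0, 0))]  # entries are (f = g + h, g, node)
    expanded = 0
    while heap:
        _, g, (i, j) = heapq.heappop(heap)
        if g > best.get((i, j), float("inf")):
            continue  # stale queue entry
        expanded += 1
        if (i, j) == (n, m):
            return g, expanded
        moves = []
        if i < n and j < m:
            moves.append((i + 1, j + 1, 0 if a[i] == b[j] else 1))
        if i < n:
            moves.append((i + 1, j, 1))  # gap in b
        if j < m:
            moves.append((i, j + 1, 1))  # gap in a
        for ni, nj, cost in moves:
            ng = g + cost
            if ng < best.get((ni, nj), float("inf")):
                best[(ni, nj)] = ng
                heapq.heappush(heap, (ng + h(ni, nj), ng, (ni, nj)))
    raise ValueError("goal unreachable")  # cannot happen on a full grid
```

The second return value gives a rough feel for how many grid nodes the bound excludes compared with filling the whole table.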
Parameterization studies for the SAM and HMMER methods of hidden Markov model generation
 Proc Int Conf Intell Syst Mol Biol 4
, 1996
Multiple Sequence Comparison: A Peptide Matching Approach
, 1995
Cited by 14 (0 self)
We present in this paper a peptide matching approach to the multiple comparison of a set of protein sequences. This approach consists of looking for all the words that are common to q of these sequences, where q is a parameter. The comparison between words is done by using as reference an object called a model. In the case of proteins, a model is a product of subsets of the alphabet Σ of the amino acids. These subsets belong to a cover of Σ, that is, their union covers all of Σ. A word is said to be an instance of a model if it belongs to the model. Further flexibility is introduced by allowing for up to e errors in the comparison between a word and a model. These errors may concern gaps or substitutions not allowed by the cover. In this case, a word is said to be an occurrence of a model if the Levenshtein distance between it and an instance of the model is less than or equal to e. This corresponds to what we call a Set-Levenshtein distance between the o...
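The instance and occurrence definitions above can be illustrated with a short dynamic program. This is a sketch under stated assumptions rather than the authors' algorithm: a model is represented as a list of non-empty Python sets (one subset of the alphabet per position), every error costs 1, and the function names and the example subsets are hypothetical. The minimum Levenshtein distance from a word to any instance of the model is obtained by charging 0 for a substitution whenever the word's letter lies in the model's subset at that position.

```python
def set_levenshtein(word, model):
    """Minimum Levenshtein distance between `word` and any instance of
    `model`, where `model` is a list of non-empty sets of letters.

    Unit costs for gaps and for letters outside the subset are assumed.
    """
    m = len(model)
    prev = list(range(m + 1))  # empty word against model prefixes
    for i in range(1, len(word) + 1):
        cur = [i] + [0] * m
        for j in range(1, m + 1):
            sub = 0 if word[i - 1] in model[j - 1] else 1
            cur[j] = min(prev[j - 1] + sub,  # letter matched to position j
                         prev[j] + 1,        # extra letter in the word
                         cur[j - 1] + 1)     # model position left unmatched
        prev = cur
    return prev[m]


def is_occurrence(word, model, e):
    """True if `word` is within e errors of some instance of `model`."""
    return set_levenshtein(word, model) <= e


# Hypothetical model: three positions drawn from illustrative subsets.
EXAMPLE_MODEL = [set("ILV"), set("DE"), set("KR")]
```

With this example model, "IDK" is an instance (distance 0), while "IAK" and "IK" are occurrences only when e is at least 1.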