A greedy algorithm for aligning DNA sequences
J. Comput. Biol., 2000
Cited by 133 (6 self)
For aligning DNA sequences that differ only by sequencing errors, or by equivalent errors from other sources, a greedy algorithm can be much faster than traditional dynamic programming approaches and yet produce an alignment that is guaranteed to be theoretically optimal. We introduce a new greedy alignment algorithm with particularly good performance and show that it computes the same alignment as does a certain dynamic programming algorithm, while executing over 10 times faster on appropriate data. An implementation of this algorithm is currently used in a program that assembles the UniGene database at the National Center for Biotechnology Information.
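The greedy idea can be illustrated with a much simpler relative of the algorithm described above. The sketch below is not the paper's method: it assumes unit costs (each substitution, insertion, or deletion costs 1) and computes only the edit distance, using the classical "furthest-reaching point per diagonal" strategy. The function name `greedy_edit_distance` is hypothetical. The work grows with the number of differences rather than with the full dynamic programming table, which is why this style of algorithm suits nearly identical sequences.

```python
def greedy_edit_distance(a, b):
    """Unit-cost edit distance via greedy extension along diagonals.

    Illustrative sketch only; it does not implement the scoring scheme
    of the paper summarized above.
    """
    m, n = len(a), len(b)

    def extend(i, k):
        # Slide along diagonal k (where j = i + k) while characters match.
        while i < m and i + k < n and a[i] == b[i + k]:
            i += 1
        return i

    # prev[k] = furthest row i reached on diagonal k with d - 1 differences.
    prev = {0: extend(0, 0)}
    if m == n and prev[0] == m:
        return 0
    for d in range(1, m + n + 1):
        cur = {}
        for k in range(-d, d + 1):
            best = -1
            # (source diagonal, row step): substitution, deletion, insertion.
            for src, step in ((k, 1), (k + 1, 1), (k - 1, 0)):
                if src in prev:
                    i = prev[src] + step
                    if 0 <= i <= m and 0 <= i + k <= n:
                        best = max(best, i)
            if best < 0:
                continue
            cur[k] = extend(best, k)
            if k == n - m and cur[k] == m:
                return d
        prev = cur
    raise AssertionError("unreachable: distance never exceeds m + n")
```

For two sequences that differ in only a few positions, the outer loop stops after a few rounds, and each round touches only a handful of diagonals.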
Optimal alignments in linear space
CABIOS, 1988
Cited by 130 (3 self)
Space, not time, is often the limiting factor when computing optimal sequence alignments, and a number of recent papers in the biology literature have proposed space-saving strategies. However, a 1975 computer science paper by Hirschberg presented a method that is superior to the newer proposals, both in theory and in practice. The goal of this note is to give Hirschberg’s idea the visibility it deserves by developing a linear-space version of Gotoh’s algorithm, which accommodates affine gap penalties. A portable C-software package implementing this algorithm is available on the BIONET free of charge.
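Hirschberg's divide-and-conquer idea can be sketched in a few lines. The version below is a simplification under stated assumptions: it uses a linear gap penalty rather than the affine penalties of Gotoh's algorithm, and the scores `MATCH`, `MISMATCH`, and `GAP` are hypothetical values chosen for illustration. Only two rows of the dynamic programming table are held at any time; the split point in the second sequence is found by combining a forward pass over the first half with a backward pass over the second half.

```python
MATCH, MISMATCH, GAP = 1, -1, -2  # hypothetical scores, not from the paper


def sub(x, y):
    return MATCH if x == y else MISMATCH


def last_row(a, b):
    """Scores of aligning all of a with each prefix of b, in O(len(b)) space."""
    prev = [j * GAP for j in range(len(b) + 1)]
    for x in a:
        cur = [prev[0] + GAP]
        for j, y in enumerate(b, 1):
            cur.append(max(prev[j - 1] + sub(x, y), prev[j] + GAP, cur[j - 1] + GAP))
        prev = cur
    return prev


def hirschberg(a, b):
    """Return an optimal global alignment of a and b as two gapped strings."""
    if not a:
        return "-" * len(b), b
    if not b:
        return a, "-" * len(a)
    if len(a) == 1:
        # Either pair the single character with its best partner in b,
        # or leave it unaligned, whichever scores higher.
        j = max(range(len(b)), key=lambda t: sub(a, b[t]))
        if sub(a, b[j]) >= 2 * GAP:
            return "-" * j + a + "-" * (len(b) - j - 1), b
        return a + "-" * len(b), "-" + b
    mid = len(a) // 2
    n = len(b)
    left = last_row(a[:mid], b)               # forward pass, first half of a
    right = last_row(a[mid:][::-1], b[::-1])  # backward pass, second half of a
    cut = max(range(n + 1), key=lambda j: left[j] + right[n - j])
    top = hirschberg(a[:mid], b[:cut])
    bottom = hirschberg(a[mid:], b[cut:])
    return top[0] + bottom[0], top[1] + bottom[1]


def score(x, y):
    """Score of a gapped alignment under the same hypothetical scheme."""
    return sum(GAP if "-" in (p, q) else sub(p, q) for p, q in zip(x, y))
```

Calling `hirschberg("AGTACGCA", "TATGC")` returns two equal-length strings whose score, as computed by `score`, matches the optimal value in the last cell of `last_row`. Extending the sketch to affine gaps requires tracking gap-open states across the split, which is the part the paper addresses.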
A Robust Model for Finding Optimal Evolutionary Trees
1993
Cited by 71 (12 self)
Constructing evolutionary trees for species sets is a fundamental problem in computational biology. One of the standard models assumes the ability to compute distances between every pair of species and seeks to find an edge-weighted tree T in which the distance $d^T_{ij}$ in the tree between the leaves of T corresponding to the species i and j exactly equals the observed distance, $d_{ij}$. When such a tree exists, this is expressed in the biological literature by saying that the distance function or matrix is additive, and trees can be constructed from additive distance matrices in $O(n^2)$ time. Real distance data is hardly ever additive, and we therefore need ways of modeling the problem of finding the best-fit tree as an optimization problem. In this paper we present several natural and realistic ways of modeling the inaccuracies in the distance data. In one model we assume that we have upper and lower bounds for the distances between pairs of species and try to find an additive distanc...
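Additivity can be tested directly with the four-point condition: for every four indices, the two largest of the three pairwise sums $d_{ij} + d_{kl}$, $d_{ik} + d_{jl}$, $d_{il} + d_{jk}$ must be equal. The sketch below is a brute-force $O(n^4)$ check, not the $O(n^2)$ tree construction mentioned in the abstract. It assumes a symmetric matrix with a zero diagonal; the function name `is_additive` and the tolerance parameter are hypothetical. Allowing repeated indices in the quadruples also covers the triangle inequality.

```python
from itertools import combinations_with_replacement


def is_additive(d, tol=1e-9):
    """Four-point condition check on a symmetric, zero-diagonal matrix d."""
    n = len(d)
    for i, j, k, l in combinations_with_replacement(range(n), 4):
        sums = sorted((d[i][j] + d[k][l], d[i][k] + d[j][l], d[i][l] + d[j][k]))
        # The two largest sums must coincide (up to the tolerance).
        if sums[2] - sums[1] > tol:
            return False
    return True


# Distances in a star tree whose four leaves hang on edges of length 1, 2, 3, 4.
edges = [1, 2, 3, 4]
star = [[0 if i == j else edges[i] + edges[j] for j in range(4)] for i in range(4)]
```

Here `is_additive(star)` holds because the matrix comes from a tree; changing a single pair of entries, as measurement noise would, typically breaks the condition, which is the situation the best-fit models in the abstract are meant to handle.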
A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices
2002
Cited by 46 (3 self)
The classical algorithm for computing the similarity between two sequences [36, 39] uses a dynamic programming matrix, and compares two strings of size n in $O(n^2)$ time. We address the challenge of computing the similarity of two strings in sub-quadratic time, for metrics which use a scoring matrix of unrestricted weights. Our algorithm applies to both local and global alignment computations. The speed-up is achieved by dividing the dynamic programming matrix into variable sized blocks, as induced by Lempel-Ziv parsing of both strings, and utilizing the inherent periodic nature of both strings. This leads to an $O(n^2 / \log n)$ algorithm for an input of constant alphabet size. For most texts, the time complexity is actually $O(hn^2 / \log n)$, where $h \le 1$ is the entropy of the text.
Institut Gaspard-Monge, Universite de Marne-la-Vallee, Cite Descartes, Champs-sur-Marne, 77454 Marne-la-Vallee Cedex 2, France, email: mac@univ-mlv.fr.
Department of Computer Science, Haifa University, Haifa 31905, Israel, phone: (972-4) 824-0103, FAX: (972-4) 824-9331; Department of Computer and Information Science, Polytechnic University, Six MetroTech Center, Brooklyn, NY 11201-3840; email: landau@poly.edu; partially supported by NSF grant CCR-0104307, by NATO Science Programme grant PST.CLG.977017, by the Israel Science Foundation (grants 173/98 and 282/01), by the FIRST Foundation of the Israel Academy of Science and Humanities, and by IBM Faculty Partnership Award.
Department of Computer Science, Haifa University, Haifa 31905, Israel; On Education Leave from the IBM T.J.W. Research Center; email: michal@cs.haifa.il; partially supported by the Israel Science Foundation (grants 173/98 and 282/01), and by the FIRST Foundation of the Israel Academy of Science ...
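The block boundaries mentioned above come from a Lempel-Ziv parsing of each string. A minimal sketch of one common variant, LZ78-style factorization, is given below; treating it as the exact variant used in the paper is an assumption, and the function name `lz78_parse` is hypothetical. Each phrase is the shortest prefix of the remaining text that has not yet appeared as a phrase, so every phrase extends an earlier phrase by one character. That prefix relationship is what allows work done for an earlier block to be reused for a later one.

```python
def lz78_parse(s):
    """Split s into LZ78-style phrases (illustrative sketch)."""
    phrases = []
    seen = set()
    cur = ""
    for ch in s:
        cur += ch
        if cur not in seen:
            # Shortest new phrase: an earlier phrase plus one character.
            seen.add(cur)
            phrases.append(cur)
            cur = ""
    if cur:
        # The trailing phrase may repeat an earlier one.
        phrases.append(cur)
    return phrases
```

For example, `lz78_parse("aababc")` yields the phrases `a`, `ab`, `abc`. Repetitive texts produce fewer, longer phrases, and therefore fewer blocks in the dynamic programming matrix.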
Accurate formula for p-values of gapped local sequence and profile alignments
J. Mol. Biol., 2000
Cited by 31 (1 self)
A simple general approximation for the distribution of gapped local alignment scores is presented, suitable for assessing significance of comparisons between two protein sequences or a sequence and a profile. The approximation takes account of the scoring scheme (i.e., gap penalty and substitution matrix or profile), sequence composition and length. Use of this formula means it is unnecessary to fit an extreme-value distribution to simulations or to the results of data-bank searches. The method is based on the theoretical ideas introduced by Mott & Tribe (1999). Extensive simulation studies show that score thresholds produced by the method are accurate to within ±5%, 95% of the time. We also investigate factors which affect the accuracy of alignment statistics, and show that any method based on asymptotic theory is limited because asymptotic behaviour is not strictly achieved for many real protein sequences, due to extreme composition effects. Consequently it may not be practicable to find a general formula that is significantly more accurate until the sub-asymptotic behaviour of alignments is better understood.
Recent developments in linear-space alignment methods: A survey
J. Comput. Biol., 1994
Cited by 19 (4 self)
A dynamic-programming strategy for sequence alignment first proposed in 1975 by Dan Hirschberg can be adapted to yield a number of extremely space-efficient algorithms. Specifically, these algorithms align two sequences using only "linear space", i.e., an amount of computer memory that is proportional to the sum of the lengths of the two sequences being aligned. This paper begins by reviewing the basic idea, as it applies to the global (i.e., end-to-end) alignment of two DNA or protein sequences. Three of our recent extensions of the technique are then outlined. The first extension computes an optimal alignment subject to the constraint that each position, i, of the first sequence must be aligned somewhere between positions L[i] and U[i] of the second sequence, for given values of L and U. The second finds all aligned position pairs (i.e., potential columns of the alignment) that occur in an alignment whose score exceeds a given threshold. The third treats the case where each of the two sequences is allowed to be an alignment (e.g., a sequence of aligned pairs), using a sensitive scoring scheme. We also describe two linear-space methods for computing k best local (i.e., involving only a part of each sequence) alignments, where k ≥ 1. One is a linear-space version of the algorithm of Waterman and Eggert (1987), and the other is based on the strategy proposed by Wilbur and Lipman (1983). Finally, we describe programs that implement various combinations of these techniques to provide a multi-sequence alignment method that is especially suited to handling a few very long sequences. The utility of these programs is illustrated by analysis of the locus control region of the β-like globin gene cluster of several mammals.
Speeding up Dynamic Programming
In Proc. 29th Symp. Foundations of Computer Science, 1988
Cited by 19 (0 self)
In this paper we consider the problem of computing two similar recurrences: the one-dimensional case
On the Common Substring Alignment Problem
Cited by 19 (1 self)
The Common Substring Alignment Problem is defined as follows: Given a set of one or more strings $S_1, S_2, \ldots, S_c$ and a target string $T$, $Y$ is a common substring of all strings $S_i$, that is, $S_i = B_i Y F_i$. The goal is to compute the similarity of all strings $S_i$ with $T$, without computing the part of $Y$ again and again. Using the classical dynamic programming tables, each appearance of $Y$ in a source string would require the computation of all the values in a dynamic programming table of size $O(n\ell)$, where $\ell$ is the size of $Y$. Here we describe an algorithm which is composed of an encoding stage and an alignment stage. During the first stage, a data structure is constructed which encodes the comparison of $Y$ with $T$. Then, during the alignment stage, for each comparison of a source $S_i$ with $T$, the pre-compiled data structure is used to speed up the part of $Y$. We show how to reduce the $O(n\ell)$ alignment work, for each appearance of the common substring $Y$ in a source string, to $O(n)$, at the cost of $O(n\ell)$ encoding work, which is executed only once.
Automata-Theoretic Models of Mutation and Alignment
In International Conference on Intelligent Systems in Molecular Biology, 1995
Cited by 13 (3 self)
Finite-state automata called transducers, which have both input and output, can be used to model simple mechanisms of biological mutation. We present a methodology whereby numerically weighted versions of such specifications can be mechanically adapted to create string edit machines that are essentially equivalent to recurrence relations of the sort that characterize dynamic programming alignment algorithms. Based on this, we have developed a visual programming system for designing new alignment algorithms in a rapid-prototyping fashion.
1 Introduction
Finite-state automata have an important place in computer science, often representing simple models of computation as the recognition or generation of strings of symbols. A wide variety of such automata have been intensively studied, including weighted automata which have numbers associated with transitions between states, and transducers which have both input and output. Allison and co-workers [2] have proposed the use of finite-stat...
Linear-Space Algorithms that Build Local Alignments from Fragments
Algorithmica, 1995
Cited by 9 (6 self)
This paper presents practical algorithms for building an alignment of two long sequences from a collection of "alignment fragments," such as all occurrences of identical 5-tuples in each of two DNA sequences. We first combine a time-efficient algorithm developed by Galil and coworkers with a space-saving approach of Hirschberg to obtain a local alignment algorithm that uses O((M + N + F log N) log M) time and O(M + N) space to align sequences of lengths M and N from a pool of F alignment fragments. Ideas of Huang and Miller are then employed to develop a time- and space-efficient algorithm that computes n best nonintersecting alignments for any n > 1. An example illustrates the utility of these methods.

