A fast bit-vector algorithm for approximate string matching based on dynamic programming
- J. ACM
, 1999
Cited by 185 (1 self)
The approximate string matching problem is to find all locations at which a query of length m matches a substring of a text of length n with k-or-fewer differences. Simple and practical bit-vector algorithms have been designed for this problem, most notably the one used in agrep. These algorithms compute a bit representation of the current state-set of the k-difference automaton for the query, and asymptotically run in either O(nmk/w) or O(nm log σ/w) time, where w is the word size of the machine (e.g., 32 or 64 in practice) and σ is the size of the pattern alphabet. Here we present an algorithm of comparable simplicity that requires only O(nm/w) time by virtue of computing a bit representation of the relocatable dynamic programming matrix for the problem. Thus, the algorithm's performance is independent of k, and it is found to be more efficient than the previous results for many choices of k and small m. Moreover, because the algorithm is not dependent on k, it can be used to rapidly compute blocks of the dynamic programming matrix as in the 4-Russians algorithm of Wu et al. [1996]. This gives rise to an O(kn/w) expected-time algorithm for the case where m may be arbitrarily large. In practice this new algorithm, which computes a region of the dynamic programming (d.p.) matrix w entries at a time using the basic algorithm as a subroutine, is significantly faster than our previous 4-Russians algorithm, which computes the same region 4 or 5 entries at a time using table lookup. This performance improvement yields a code that is either superior to or competitive with all existing algorithms except for some filtration algorithms that are superior when k/m is sufficiently small.
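To make the bit-vector idea in the abstract above more concrete, here is a minimal Python sketch. It follows the column-update recurrences as they are commonly presented for this family of algorithms, and it should be read as an illustration under assumptions rather than as the paper's code. The function names (`bitvector_end_scores`, `dp_end_scores`, `find_matches`) are hypothetical. Python integers are unbounded, so a single machine word is simulated by masking to m bits; the block-based extension for patterns longer than the word size is not shown.

```python
def bitvector_end_scores(pattern, text):
    """For each text position j, return the smallest number of differences
    between `pattern` and any substring of `text` ending at j.
    One d.p. column is encoded as two bit-vectors of vertical deltas."""
    m = len(pattern)
    if m == 0:
        return [0] * len(text)
    mask = (1 << m) - 1          # simulates a machine word of m bits
    high = 1 << (m - 1)          # bit for the last pattern row
    peq = {}                     # per-character match masks
    for i, ch in enumerate(pattern):
        peq[ch] = peq.get(ch, 0) | (1 << i)
    pv, mv = mask, 0             # vertical +1 / -1 deltas of the first column
    score = m
    scores = []
    for ch in text:
        eq = peq.get(ch, 0)
        xv = eq | mv
        xh = (((eq & pv) + pv) ^ pv) | eq
        ph = mv | (~(xh | pv) & mask)    # horizontal +1 deltas
        mh = pv & xh                     # horizontal -1 deltas
        if ph & high:
            score += 1
        elif mh & high:
            score -= 1
        ph = (ph << 1) & mask            # top row is all zeros (search variant)
        mh = (mh << 1) & mask
        pv = mh | (~(xv | ph) & mask)
        mv = ph & xv
        scores.append(score)
    return scores


def dp_end_scores(pattern, text):
    """Reference column-by-column d.p. used only to cross-check the sketch."""
    m = len(pattern)
    col = list(range(m + 1))
    out = []
    for ch in text:
        diag = col[0]
        col[0] = 0                       # a match may start anywhere in the text
        for i in range(1, m + 1):
            old = col[i]
            col[i] = min(diag + (pattern[i - 1] != ch), old + 1, col[i - 1] + 1)
            diag = old
        out.append(col[m])
    return out


def find_matches(pattern, text, k):
    """Return (end_index, differences) pairs with at most k differences."""
    return [(j, s) for j, s in enumerate(bitvector_end_scores(pattern, text)) if s <= k]
```

For example, `find_matches("abc", "xabcx", 1)` returns `[(2, 1), (3, 0), (4, 1)]`. Each text character costs a constant number of word operations regardless of k, which is the property the abstract emphasizes.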
PipMaker - A Web Server for Aligning Two Genomic DNA Sequences
, 2000
Cited by 101 (5 self)
In this paper we describe an automated server for generating alignments and pips. A pip shows the position in one sequence of each aligning gap-free segment and plots its percent identity. As a complementary display, we also provide a plot of the position of each aligning segment in both species. We refer to these as dot plots, even though matches shown in conventional dot plots need not be contained within a statistically significant alignment and those in our plots are. Both displays allow rich annotation to be plotted along the appropriate axes to aid in correlating aligning segments with functional or structural features of the sequence. We provide examples of the application of PipMaker for finding exons and candidate regulatory elements in mammalian, nematode, and bacterial sequences. The server is able to compare a completed sequence from one species with an incomplete sequence from a second.
A subquadratic sequence alignment algorithm for unrestricted scoring matrices
- SIAM J. Comput
Cited by 76 (5 self)
The classical algorithm for computing the similarity between two sequences ... Our algorithm applies to both local and global alignment computations. The speed-up is achieved by dividing the dynamic programming matrix into variable sized blocks, as induced by Lempel-Ziv parsing of both strings, and utilizing the inherent periodic nature of both strings. This leads to an O(n^2/log n) algorithm for an input of constant alphabet size. For most texts, the time complexity is actually O(hn^2/log n), where h ≤ 1 is the entropy of the text.
Progressive Multiple Alignment with Constraints
- J. Computational Biology
, 1996
Cited by 27 (0 self)
A progressive alignment algorithm produces a multi-alignment of a set of sequences by repeatedly aligning pairs of sequences and/or previously generated alignments. We describe a method for guaranteeing that the alignment generated by a progressive alignment strategy satisfies a user-specified collection of constraints about where certain sequence positions should appear relative to others. Given a collection of C constraints over K sequences whose total length is N, our algorithm takes O(K(N^2 + KC)) time. An alignment of the β-like globin gene clusters of several mammals illustrates the practicality of the method. Key words: Multiple-sequence alignment, constrained alignment, dynamic programming. 1 Introduction. It is straightforward to extend the dynamic programming alignment algorithm (Needleman and Wunsch 1970) to the simultaneous alignment of K > 2 sequences. However, the O(2^K N^K) execution time for sequences of length N makes it impractical to align more than three seque...
Improved Gapped Alignment in BLAST
Cited by 27 (4 self)
Homology search is a key tool for understanding the role, structure, and biochemical function of genomic sequences. The most popular technique for rapid homology search is BLAST, which has been in widespread use within universities, research centres, and commercial enterprises since the early 1990s. In this paper, we propose a new step in the BLAST algorithm to reduce the computational cost of searching with negligible effect on accuracy. This new step — semi-gapped alignment — compromises between the efficiency of ungapped alignment and the accuracy of gapped alignment, allowing BLAST to accurately filter sequences with lower computational cost. In addition, we propose an heuristic — restricted insertion alignment — that avoids unlikely evolutionary paths with the aim of reducing gapped alignment cost with negligible effect on accuracy. Together, after including an optimisation of the local alignment recursion, our two techniques more than double the speed of the gapped alignment stages in BLAST. We conclude that our techniques are an important improvement to the BLAST algorithm. Source code for the alignment algorithms is available for download at
Memory-efficient A* heuristics for multiple sequence alignment
- In National Conference on Artificial Intelligence (AAAI-02)
, 2002
Cited by 18 (2 self)
The time and space needs of an A* search are strongly influenced by the quality of the heuristic evaluation function. Usually there is a trade-off since better heuristics may require more time and/or space to evaluate. Multiple sequence alignment is an important application for single-agent search. The traditional heuristic uses multiple pairwise alignments that require relatively little space. Three-way alignments produce better heuristics, but they are not used in practice due to the large space requirements. This paper presents a memory-efficient way to represent three-way heuristics as an octree. The required portions of the octree are computed on demand. The octree-supported three-way heuristics result in such a substantial reduction to the size of the A* open list that they offset the additional space and time requirements for the three-way alignments. The resulting multiple sequence alignments are both faster and use less memory than using A* with traditional pairwise heuristics.
Sequence alignment using FastLSA
- In Proceedings of the 2000 International Conference on Mathematics and Engineering Techniques in Medicine and Biological Sciences (METMBS 2000)
, 2000
Cited by 17 (4 self)
For two strings of length m and n (m ≤ n), optimal sequence alignment (as a function of the alignment scoring function) takes time and space proportional to mn to compute. The time actually consists of two parts: computing the score of the best alignment (calculating (m+1)(n+1) values), and then extracting the alignment (by reading the computed values). The space requirement is usually prohibitive. Hirschberg's algorithm reduces the space needs to roughly 2m, but doubles the cost of computing and extracting the alignment. This paper introduces the FastLSA algorithm that is adaptive to the amount of space available. At one extreme, it uses linear space, while at the other it uses quadratic space. Based on the memory resources available, the algorithm saves the maximum amount of information to achieve the lowest extraction cost. The algorithm is shown to be analytically and experimentally superior to Hirschberg's algorithm.
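The split between computing the score and extracting the alignment can be illustrated with a short Python sketch of the score-only pass, which needs just two rows of the matrix. This is neither Hirschberg's algorithm nor FastLSA; it only shows why the score alone fits in linear space. The function name `alignment_score` and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score of a and b using two rows of the d.p. matrix.
    The scoring scheme is a hypothetical, symmetric example."""
    if len(a) > len(b):
        a, b = b, a                      # keep the stored row as short as possible
    # Row 0: aligning a prefix of `a` against the empty string costs only gaps.
    prev = [i * gap for i in range(len(a) + 1)]
    for ch_b in b:
        curr = [prev[0] + gap]
        for i, ch_a in enumerate(a, start=1):
            diag = prev[i - 1] + (match if ch_a == ch_b else mismatch)
            curr.append(max(diag, prev[i] + gap, curr[i - 1] + gap))
        prev = curr                      # the older row is discarded
    return prev[-1]
```

Every one of the (m+1)(n+1) values is still computed, but only about 2m of them are held at any time. Recovering the alignment itself needs either the full matrix or recomputation, which is the trade-off the abstract describes.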
Improved pairwise alignments of proteins in the Twilight Zone using local structure predictions
- Bioinformatics
, 2006
"... doi:10.1093/bioinformatics/bti828 ..."