Results 1-10 of 32
A fast bit-vector algorithm for approximate string matching based on dynamic programming
 J. ACM
, 1999
Cited by 179 (2 self)
Abstract. The approximate string matching problem is to find all locations at which a query of length m matches a substring of a text of length n with k or fewer differences. Simple and practical bit-vector algorithms have been designed for this problem, most notably the one used in agrep. These algorithms compute a bit representation of the current state-set of the k-difference automaton for the query, and asymptotically run in either O(nmk/w) or O(nm log σ/w) time, where w is the word size of the machine (e.g., 32 or 64 in practice) and σ is the size of the pattern alphabet. Here we present an algorithm of comparable simplicity that requires only O(nm/w) time by virtue of computing a bit representation of the relocatable dynamic programming matrix for the problem. Thus, the algorithm’s performance is independent of k, and it is found to be more efficient than the previous results for many choices of k and small m. Moreover, because the algorithm is not dependent on k, it can be used to rapidly compute blocks of the dynamic programming matrix as in the 4-Russians algorithm of Wu et al. [1996]. This gives rise to an O(kn/w) expected-time algorithm for the case where m may be arbitrarily large. In practice this new algorithm, that computes a region of the dynamic programming (d.p.) matrix w entries at a time using the basic algorithm as a subroutine, is significantly faster than our previous 4-Russians algorithm, that computes the same region 4 or 5 entries at a time using table lookup. This performance improvement yields a code that is either superior or competitive with all existing algorithms except for some filtration algorithms that are superior when k/m is sufficiently small.
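The idea in the abstract above, encoding a column of the dynamic programming matrix as bit-vectors of vertical differences, can be sketched in a few lines. This is a minimal illustration and not the paper's code: the function name `bitvector_search` is hypothetical, Python's unbounded integers stand in for a machine word (so the pattern is not limited to w characters here), and no block-based extension is attempted.

```python
def bitvector_search(pattern, text, k):
    """Return (end_index, differences) for every text position where the
    pattern matches a substring ending there with at most k differences."""
    m = len(pattern)
    if m == 0:
        return []
    mask = (1 << m) - 1          # keeps every vector m bits wide
    high = 1 << (m - 1)          # bit for the last pattern row
    peq = {}                     # peq[c]: bit i set where pattern[i] == c
    for i, c in enumerate(pattern):
        peq[c] = peq.get(c, 0) | (1 << i)

    pv, mv, score = mask, 0, m   # vertical deltas are all +1 in column 0
    hits = []
    for j, c in enumerate(text):
        eq = peq.get(c, 0)
        xv = eq | mv
        xh = ((((eq & pv) + pv) ^ pv) | eq) & mask
        ph = mv | (~(xh | pv) & mask)   # rows whose horizontal delta is +1
        mh = pv & xh                    # rows whose horizontal delta is -1
        if ph & high:
            score += 1
        elif mh & high:
            score -= 1
        ph = (ph << 1) & mask
        mh = (mh << 1) & mask
        pv = mh | (~(xv | ph) & mask)   # new vertical +1 deltas
        mv = ph & xv                    # new vertical -1 deltas
        if score <= k:
            hits.append((j, score))
    return hits
```

For example, `bitvector_search("abc", "xxabcxx", 0)` reports a single exact occurrence ending at index 4. Each text character costs a constant number of word operations when m fits in one word, which is where the O(nm/w) bound comes from.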
Flexible sequence similarity searching with the FASTA3 program package
 Methods Mol. Biol
, 2000
Cited by 108 (3 self)
Since the publication of the first rapid method for comparing biological sequences 15 years ago (1), DNA and protein sequence comparison have become routine steps in biochemical characterization, from newly cloned proteins to entire genomes. As the DNA and protein sequence databases become more complete, a sequence similarity search is more likely to reveal
Indexing and Retrieval for Genomic Databases
 IEEE Transactions on Knowledge and Data Engineering
, 2002
Cited by 57 (6 self)
Genomic sequence databases are widely used by molecular biologists for homology searching. Amino-acid and nucleotide databases are increasing in size exponentially, and mean sequence lengths are also increasing. In searching such databases, it is desirable to use heuristics to perform computationally intensive local alignments on selected sequences only and to reduce the costs of the alignments that are attempted. We present an index-based approach for both selecting sequences that display broad similarity to a query and for fast local alignment. We show experimentally that the indexed approach results in significant savings in computationally intensive local alignments, and that index-based searching is as accurate as existing exhaustive search schemes.
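The abstract above does not describe the index structure, so the following is only a generic sketch of index-based candidate selection: an inverted index from fixed-length substrings (k-mers) to sequence identifiers, used to rank database sequences by how many distinct query k-mers they share. The names `build_index` and `rank_candidates`, and the choice of k, are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def build_index(sequences, k):
    """Map each k-mer to the set of sequence ids that contain it."""
    index = defaultdict(set)
    for sid, seq in enumerate(sequences):
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(sid)
    return index

def rank_candidates(index, query, k, top=3):
    """Rank sequences by the number of distinct query k-mers they share."""
    votes = Counter()
    for kmer in {query[i:i + k] for i in range(len(query) - k + 1)}:
        for sid in index.get(kmer, ()):
            votes[sid] += 1
    return votes.most_common(top)

database = ["ACGTACGT", "TTTTTTTT", "ACGTTTTT"]
index = build_index(database, k=4)
candidates = rank_candidates(index, "ACGTAC", k=4)
```

Only the top-ranked candidates would then be passed to the expensive local alignment step, which is the saving the abstract refers to.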
Recent developments in linear-space alignment methods: A survey
 J. Comput. Biol
, 1994
Cited by 29 (5 self)
A dynamic-programming strategy for sequence alignment first proposed in 1975 by Dan Hirschberg can be adapted to yield a number of extremely space-efficient algorithms. Specifically, these algorithms align two sequences using only "linear space", i.e., an amount of computer memory that is proportional to the sum of the lengths of the two sequences being aligned. This paper begins by reviewing the basic idea, as it applies to the global (i.e., end-to-end) alignment of two DNA or protein sequences. Three of our recent extensions of the technique are then outlined. The first extension computes an optimal alignment subject to the constraint that each position, i, of the first sequence must be aligned somewhere between positions L[i] and U[i] of the second sequence, for given values of L and U. The second finds all aligned position pairs (i.e., potential columns of the alignment) that occur in an alignment whose score exceeds a given threshold. The third treats the case where each of the two sequences is allowed to be an alignment (e.g., a sequence of aligned pairs), using a sensitive scoring scheme. We also describe two linear-space methods for computing k best local (i.e., involving only a part of each sequence) alignments, where k ≥ 1. One is a linear-space version of the algorithm of Waterman and Eggert (1987), and the other is based on the strategy proposed by Wilbur and Lipman (1983). Finally, we describe programs that implement various combinations of these techniques to provide a multi-sequence alignment method that is especially suited to handling a few very long sequences. The utility of these programs is illustrated by analysis of the locus control region of the β-like globin gene cluster of several mammals.
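The basic divide-and-conquer idea reviewed in the abstract above can be sketched for global alignment as follows. This is an illustration only: the scores `MATCH`, `MISMATCH`, and `GAP` are arbitrary assumptions with linear gap costs, the function names are hypothetical, and none of the paper's extensions (position constraints, suboptimal pairs, k best local alignments) is attempted.

```python
MATCH, MISMATCH, GAP = 1, -1, -1  # illustrative scores, not from the paper

def last_row(a, b):
    """Scores of aligning all of a against every prefix of b, in O(len(b)) space."""
    prev = [j * GAP for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * GAP] + [0] * len(b)
        for j in range(1, len(b) + 1):
            sub = MATCH if a[i - 1] == b[j - 1] else MISMATCH
            cur[j] = max(prev[j - 1] + sub, prev[j] + GAP, cur[j - 1] + GAP)
        prev = cur
    return prev

def hirschberg(a, b):
    """Return an optimal global alignment of a and b as two gapped strings."""
    if not a:
        return "-" * len(b), b
    if not b:
        return a, "-" * len(a)
    if len(a) == 1:
        j = max(b.find(a), 0)            # prefer a matching position in b
        sub = MATCH if b[j] == a else MISMATCH
        if sub >= 2 * GAP:               # pairing beats gapping both characters
            return "-" * j + a + "-" * (len(b) - j - 1), b
        return a + "-" * len(b), "-" + b
    mid = len(a) // 2
    left = last_row(a[:mid], b)                  # forward pass on the top half
    right = last_row(a[mid:][::-1], b[::-1])     # reverse pass on the bottom half
    split = max(range(len(b) + 1), key=lambda j: left[j] + right[len(b) - j])
    top_a, top_b = hirschberg(a[:mid], b[:split])
    bot_a, bot_b = hirschberg(a[mid:], b[split:])
    return top_a + bot_a, top_b + bot_b
```

Each level of recursion keeps only two score rows, so working memory stays proportional to the sequence lengths, while the total time remains within a constant factor of the quadratic algorithm.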
Linear-Space Algorithms that Build Local Alignments from Fragments
 Algorithmica
, 1995
Cited by 13 (7 self)
Abstract. This paper presents practical algorithms for building an alignment of two long sequences from a collection of "alignment fragments," such as all occurrences of identical 5-tuples in each of two DNA sequences. We first combine a time-efficient algorithm developed by Galil and coworkers with a space-saving approach of Hirschberg to obtain a local alignment algorithm that uses O((M + N + F log N) log M) time and O(M + N) space to align sequences of lengths M and N from a pool of F alignment fragments. Ideas of Huang and Miller are then employed to develop a time- and space-efficient algorithm that computes n best nonintersecting alignments for any n > 1. An example illustrates the utility of these methods.
Near Optimal Multiple Alignment Within a Band In Polynomial Time
 In Proc. of 32nd ACM STOC
, 2000
Cited by 11 (0 self)
Multiple sequence alignment is one of the most important problems in computational biology. Because of its notorious difficulties, aligning sequences within a constant band is a popular practice in bioinformatics with good results [17; 13; 14; 15; 1; 3; 6; 20; 18]. However, the problem is still NP-hard for multiple sequences. In this paper, we present polynomial time approximation schemes (PTAS) for multiple sequence alignment within a constant band, under standard models of SP alignment and consensus (star) alignment. The algorithms work for very general score schemes. In order to prove our main results, we also present a PTAS for SP alignment and a PTAS for consensus alignment, allowing only constant number of insertion and deletion gaps (of arbitrary length) per sequence on the average.
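To make the notion of a band concrete, here is a minimal pairwise sketch: an edit-distance computation that only fills matrix cells within a fixed distance of the main diagonal. It is not the paper's PTAS and handles two sequences only; the function name `banded_edit_distance` and the unit-cost scoring are assumptions chosen for illustration.

```python
def banded_edit_distance(a, b, band):
    """Cheapest edit script whose alignment path stays within `band`
    cells of the main diagonal, or None if no such path exists."""
    n, m = len(a), len(b)
    if abs(n - m) > band:
        return None                      # the end cell lies outside the band
    inf = float("inf")
    prev = {j: j for j in range(min(m, band) + 1)}   # row 0 inside the band
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - band), min(m, i + band) + 1):
            if j == 0:
                cur[j] = i
                continue
            substitute = prev.get(j - 1, inf) + (a[i - 1] != b[j - 1])
            delete = prev.get(j, inf) + 1
            insert = cur.get(j - 1, inf) + 1
            cur[j] = min(substitute, delete, insert)
        prev = cur
    return prev[m]
```

Only about (2 · band + 1) cells are filled per row, so for a constant band the pairwise cost is linear in the sequence length. When the band is at least the true edit distance, the result equals the unrestricted edit distance.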
Stochastic models of sequence evolution including insertion-deletion events
, 2008
Cited by 10 (3 self)
Abstract. Comparison of sequences that have descended from a common ancestor based on an explicit stochastic model of substitutions, insertions and deletions has risen to prominence in the last decade. Making statements about the positions of insertions-deletions (abbr. indels) is central in sequence and genome analysis and is called alignment. This statistical approach is harder, conceptually and computationally, than competing approaches based on choosing an alignment according to some optimality criteria. But it has major practical advantages in terms of testing evolutionary hypotheses and parameter estimation. Basic dynamic approaches can allow the analysis of up to 4-5 sequences. MCMC techniques can bring this to about 10-15 sequences. Beyond this, different or heuristic approaches must be used. Besides the computational challenges, increasing realism in the underlying models is presently being addressed. A recent development that has been especially fruitful is combining statistical alignment with the problem of sequence annotation, making statements about the function of each nucleotide/amino acid. So far gene finding, protein secondary structure prediction and regulatory signal detection have been tackled within this framework. Much progress can be reported, but clearly major challenges remain if this approach is to be central in the analyses of large incoming sequence data sets.
Parameterized Complexity and Biopolymer Sequence Comparison
, 2007
Cited by 4 (0 self)
The paper surveys parameterized algorithms and complexities for computational tasks on biopolymer sequences, including the problems of longest common subsequence, shortest common supersequence, pairwise sequence alignment, multiple sequencing alignment, structure–sequence alignment and structure–structure alignment. Algorithm techniques, built on the structural-unit level as well as on the residue level, are discussed.
Comparing Compressed Sequences for Faster Nucleotide BLAST Searches
 IEEE/ACM Transactions on Computational Biology and Bioinformatics
Cited by 3 (0 self)
Abstract. Molecular biologists, geneticists, and other life scientists use the blast homology search package as their first step for discovery of information about unknown or poorly annotated genomic sequences. There are two main variants of blast: blastp for searching protein collections and blastn for nucleotide collections. Surprisingly, blastn has had very little attention; for example, the algorithms it uses do not follow those described in the 1997 blast paper (Altschul, Madden, Schaffer, Zhang, Zhang, Miller & Lipman 1997) and no exact description has been published. It is important that blastn is state-of-the-art: nucleotide collections such as GenBank dwarf the protein collections in size, they double in size almost yearly, and take many minutes to search on modern general-purpose workstations. This paper proposes significant improvements to the blastn algorithms. Each of our schemes is based on compressed byte-packed formats that allow queries and collection sequences to be compared four bases at a time, permitting very fast query evaluation using lookup tables and numeric comparisons. Our most significant innovations are two new, fast gapped alignment schemes that allow accurate sequence alignment without decompression of the collection sequences. Overall, our innovations more than double the speed of blastn with no effect on accuracy and have been integrated into our new version of blast that is freely available for download from
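The phrase "compared four bases at a time" in the abstract above can be illustrated with a small sketch: pack each base into two bits, four bases per byte, then count matching positions with one XOR and one table lookup per byte. The 2-bit code, the function names `pack` and `count_matches`, and the restriction to lengths that are multiples of four are assumptions for illustration; this is not the storage format or the alignment scheme used by blastn or by the paper.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}   # assumed 2-bit encoding

def pack(seq):
    """Pack a DNA string into bytes, four bases per byte (first base in the high bits)."""
    if len(seq) % 4:
        raise ValueError("this sketch only handles lengths that are multiples of 4")
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for base in seq[i:i + 4]:
            byte = (byte << 2) | CODE[base]
        out.append(byte)
    return bytes(out)

# MATCHES[x] is the number of zero 2-bit fields in x; after XOR-ing two
# packed bytes, a zero field means the two bases at that position agree.
MATCHES = [sum(((x >> shift) & 3) == 0 for shift in (0, 2, 4, 6))
           for x in range(256)]

def count_matches(packed_a, packed_b):
    """Count identical positions between two equally long packed sequences."""
    return sum(MATCHES[x ^ y] for x, y in zip(packed_a, packed_b))
```

For example, `count_matches(pack("ACGTACGT"), pack("ACGTTCGA"))` inspects eight bases with two table lookups and finds six identical positions.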