Results 1 - 10 of 62
A fast bit-vector algorithm for approximate string matching based on dynamic programming
 J. ACM
, 1999
Cited by 185 (1 self)
The approximate string matching problem is to find all locations at which a query of length m matches a substring of a text of length n with k or fewer differences. Simple and practical bit-vector algorithms have been designed for this problem, most notably the one used in agrep. These algorithms compute a bit representation of the current state-set of the k-difference automaton for the query, and asymptotically run in either O(nmk/w) or O(nm log σ/w) time, where w is the word size of the machine (e.g., 32 or 64 in practice) and σ is the size of the pattern alphabet. Here we present an algorithm of comparable simplicity that requires only O(nm/w) time by virtue of computing a bit representation of the relocatable dynamic programming matrix for the problem. Thus, the algorithm’s performance is independent of k, and it is found to be more efficient than the previous results for many choices of k and small m. Moreover, because the algorithm is not dependent on k, it can be used to rapidly compute blocks of the dynamic programming matrix as in the 4-Russians algorithm of Wu et al. [1996]. This gives rise to an O(kn/w) expected-time algorithm for the case where m may be arbitrarily large. In practice this new algorithm, which computes a region of the dynamic programming (d.p.) matrix w entries at a time using the basic algorithm as a subroutine, is significantly faster than our previous 4-Russians algorithm, which computes the same region 4 or 5 entries at a time using table lookup. This performance improvement yields a code that is either superior to or competitive with all existing algorithms except for some filtration algorithms that are superior when k/m is sufficiently small.
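To make the idea in this abstract concrete, the following is a minimal Python sketch of a bit-parallel column update in the commonly cited formulation of this kind of algorithm. It is an illustration, not the paper's code: the function name `myers_search` and all variable names are assumptions, and Python's unbounded integers stand in for a machine word, so the sketch covers the "pattern fits in one word" case only in spirit.

```python
def myers_search(pattern, text, k):
    """Return end positions j in text where pattern matches with <= k differences.

    Sketch only: one Python int plays the role of the w-bit machine word.
    """
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must be non-empty")
    mask = (1 << m) - 1          # keeps every value within m bits
    high = 1 << (m - 1)          # bit for the last pattern row

    # peq[c] has bit i set when pattern[i] == c
    peq = {}
    for i, c in enumerate(pattern):
        peq[c] = peq.get(c, 0) | (1 << i)

    pv, mv = mask, 0             # vertical deltas: +1 everywhere, -1 nowhere
    score = m                    # value of the bottom cell of the current column
    hits = []
    for j, c in enumerate(text):
        eq = peq.get(c, 0)
        xv = eq | mv
        xh = ((((eq & pv) + pv) ^ pv) | eq) & mask
        ph = mv | (~(xh | pv) & mask)   # horizontal delta +1
        mh = pv & xh                    # horizontal delta -1
        if ph & high:
            score += 1
        elif mh & high:
            score -= 1
        # Shift in 0: the top row of the search matrix is all zeros.
        ph = (ph << 1) & mask
        mh = (mh << 1) & mask
        pv = mh | (~(xv | ph) & mask)
        mv = ph & xv
        if score <= k:
            hits.append(j)
    return hits
```

Each text character costs a constant number of word operations regardless of k, which is the property the abstract highlights. For example, `myers_search("abc", "xxabcxx", 0)` reports the single end position 4.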
PatternHunter II: Highly Sensitive and Fast Homology Search
, 2003
Cited by 120 (12 self)
Extending the single optimized spaced seed of PatternHunter [20] to multiple ones, PatternHunter II simultaneously remedies the lack of sensitivity of Blastn and the lack of speed of Smith-Waterman for homology search. At Blastn speed, PatternHunter II approaches Smith-Waterman sensitivity, bringing homology search technology back to a full circle.
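As a rough illustration of what a spaced seed hit is, and of how several seeds can be combined, here is a small Python sketch. The seed strings used with it are toy values chosen for readability, not the optimized seeds of PatternHunter or PatternHunter II, and the function names are hypothetical.

```python
from collections import defaultdict


def spaced_key(seq, start, seed):
    """Characters of seq under the '1' positions of seed, starting at start."""
    return "".join(seq[start + p] for p, bit in enumerate(seed) if bit == "1")


def seed_hits(a, b, seed):
    """All (i, j) where a[i:] and b[j:] agree at every '1' position of seed."""
    span = len(seed)
    index = defaultdict(list)
    for i in range(len(a) - span + 1):
        index[spaced_key(a, i, seed)].append(i)
    hits = []
    for j in range(len(b) - span + 1):
        for i in index.get(spaced_key(b, j, seed), ()):
            hits.append((i, j))
    return sorted(hits)


def multi_seed_hits(a, b, seeds):
    """Union of hits over several seeds: a hit to any one seed counts."""
    found = set()
    for seed in seeds:
        found.update(seed_hits(a, b, seed))
    return sorted(found)
```

With `a = "ACGTA"` and `b = "ACTTA"`, the contiguous seed `"1111"` finds nothing because of the mismatch at the third base, while the spaced seed `"1101"` skips that position and reports the hit `(0, 0)`.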
Designing seeds for similarity search in genomic DNA
 Journal of Computer and System Sciences
, 2003
Cited by 103 (4 self)
Large-scale comparisons of genomic DNA are of fundamental importance in annotating functional elements in genomes. To perform large comparisons efficiently, BLAST [3, 2] and other widely used tools use seeded alignment, which compares only sequences that can be shown to share a common pattern or “seed” of matching bases. The literature suggests that the choice of seed substantially affects the sensitivity of seeded alignment, but designing and evaluating seeds is computationally challenging. This work addresses problems arising in seed design. We give the fastest known algorithm for evaluating the sensitivity of a seed in a Markov model of ungapped alignments, as well as theoretical results on which seeds are good choices. We also describe Mandala, a software tool for seed design, and show that it can be used to improve the sensitivity of alignment in practice.
Efficient Large-Scale Sequence Comparison by Locality-Sensitive Hashing
 Bioinformatics
, 2001
Cited by 91 (6 self)
Motivation: Comparison of multi-megabase genomic DNA sequences is a popular technique for finding and annotating conserved genome features. Performing such comparisons entails finding many short local alignments between sequences up to tens of megabases in length. To process such long sequences efficiently, existing algorithms find alignments by expanding around short runs of matching bases with no substitutions or other differences. Unfortunately, exact matches that are short enough to occur often in significant alignments also occur frequently by chance in the background sequence. Thus, these algorithms must trade off between efficiency and sensitivity to features without long exact matches. Results: We introduce a new algorithm, LSH-ALL-PAIRS, to find ungapped local alignments in genomic sequence with up to a specified fraction of substitutions. The length and substitution rate of these alignments can be chosen so that they appear frequently in significant similarities yet still remain rare in the background sequence. The algorithm finds ungapped alignments efficiently using a randomized search technique, locality-sensitive hashing. We have found LSH-ALL-PAIRS to be both efficient and sensitive for finding local similarities with as little as 63% identity in mammalian genomic sequences up to tens of megabases in length. Availability: Contact the author at the address below. Contact: jbuhler@cs.washington.edu Supplementary Information: the sequences and local alignment data described in this work are available at http://bio.cs.washington.edu/jbuhlerbioinformatics2001/. Keywords: local alignment, genome annotation, locality-sensitive hashing
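The randomized search described in this abstract can be illustrated with a small Python sketch of locality-sensitive hashing for Hamming distance: each trial samples a few positions of a length-d window, buckets all windows by the bases at those positions, and verifies colliding pairs. This is a simplified sketch under assumed parameter names (`d`, `k`, `trials`, `max_subs`); it is not the published implementation and omits its refinements.

```python
import random
from collections import defaultdict


def hamming(seq, i, j, d):
    """Number of substitutions between seq[i:i+d] and seq[j:j+d]."""
    return sum(1 for p in range(d) if seq[i + p] != seq[j + p])


def lsh_candidate_pairs(seq, d, k, trials, seed=0):
    """Pairs of window starts that collide under at least one random projection."""
    rng = random.Random(seed)
    pairs = set()
    for _ in range(trials):
        positions = sorted(rng.sample(range(d), k))   # k sampled offsets
        buckets = defaultdict(list)
        for i in range(len(seq) - d + 1):
            key = "".join(seq[i + p] for p in positions)
            buckets[key].append(i)
        for starts in buckets.values():
            for x in range(len(starts)):
                for y in range(x + 1, len(starts)):
                    pairs.add((starts[x], starts[y]))
    return pairs


def similar_pairs(seq, d, max_subs, k, trials, seed=0):
    """Candidate pairs that pass exact verification (<= max_subs substitutions)."""
    candidates = lsh_candidate_pairs(seq, d, k, trials, seed)
    return sorted((i, j) for i, j in candidates
                  if hamming(seq, i, j, d) <= max_subs)
```

Two windows that differ in few positions agree on all k sampled positions with fairly high probability, while unrelated windows rarely do, so repeating the projection several times finds most similar pairs. Identical windows always collide; pairs with substitutions are found only with some probability that grows with `trials`.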
q-gram based database searching using a suffix array (QUASAR)
 Proceedings of the third annual international conference on Computational molecular biology (Recomb 99)
, 1999
Cited by 82 (7 self)
With the increasing amount of DNA sequence information deposited in public databases, searching for similarity to a query sequence has become a basic operation in molecular biology. But even today’s fast algorithms reach their limits when applied to all-versus-all comparisons of large databases. Here we present a new database searching algorithm called QUASAR (Q-gram Alignment based on Suffix ARrays) which was designed to quickly detect sequences with strong similarity to the query in a context where many searches are conducted on one database. Our algorithm applies a modification of q-tuple filtering implemented on top of a suffix array. Two versions were developed, one for a RAM-resident suffix array and one for access to the suffix array on disk. We compared our implementation with BLAST and found that our approach is an order of magnitude faster. It is, however, restricted to the search for strongly similar DNA sequences as is typically required, e.g., in the context of clustering expressed sequence tags (ESTs).
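A toy Python sketch of q-gram filtering on top of a suffix array may help picture the approach. It assumes the standard q-gram lemma (two strings of length w within edit distance k share at least w + 1 - (k + 1)q q-grams) as the filter threshold, builds the suffix array naively, and counts hits in fixed, non-overlapping blocks. The function names and the block scheme are assumptions for illustration; matches that straddle a block boundary are not handled here.

```python
from bisect import bisect_left, bisect_right


def build_index(text, q):
    """Naive suffix array plus the q-character prefix of each sorted suffix."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    grams = [text[i:i + q] for i in sa]   # sorted because the suffixes are
    return sa, grams


def qgram_block_counts(query, text, q, block_size):
    """For each text block, count occurrences of the query's q-grams in it."""
    sa, grams = build_index(text, q)
    counts = [0] * ((len(text) + block_size - 1) // block_size)
    for s in range(len(query) - q + 1):
        g = query[s:s + q]
        lo, hi = bisect_left(grams, g), bisect_right(grams, g)
        for pos in sa[lo:hi]:             # every text position where g occurs
            counts[pos // block_size] += 1
    return counts


def candidate_blocks(query, text, q, k, block_size):
    """Blocks whose q-gram count reaches the q-gram lemma threshold."""
    threshold = len(query) + 1 - (k + 1) * q
    counts = qgram_block_counts(query, text, q, block_size)
    return [b for b, c in enumerate(counts) if c >= threshold]
```

Only the blocks returned by `candidate_blocks` would then need a full alignment, which is where the speedup of a filter comes from. For `text = "AAAAAAAA" + "CGTACGTT" + "GGGGGGGG"`, `query = "CGTACGTT"`, `q = 3`, `k = 0` and `block_size = 8`, only block 1 survives.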
Designing multiple simultaneous seeds for DNA similarity search
 in: Proc. of RECOMB’04, ACM Press, 76 – 85
, 2004
Cited by 75 (5 self)
The challenge of similarity search in massive DNA sequence databases has inspired major changes in BLAST-style alignment tools, which accelerate search by inspecting only pairs of sequences sharing a common short “seed,” or pattern of matching residues. Some of these changes raise the possibility of improving search performance by probing sequence pairs with several distinct seeds, any one of which is sufficient for a seed match. However, designing a set of seeds to maximize their combined sensitivity to biologically meaningful sequence alignments is computationally difficult, even given recent advances in designing single seeds. This work describes algorithmic improvements to seed design that address the problem of designing a set of n seeds to be used simultaneously. We give a new local search method to optimize the sensitivity of seed sets. The method relies on efficient incremental computation of the probability that an alignment contains a match to a seed π, given that it has already failed to match any of the seeds in a set. We demonstrate experimentally that multi-seed designs, even with relatively few seeds, can be significantly more sensitive than even optimized single-seed designs. Key words: DNA sequence comparison, sequence alignment, database search, seed design,
Improved hit criteria for DNA local alignment
, 2004
Cited by 55 (12 self)
The hit criterion is a key component of heuristic local alignment algorithms. It specifies a class of patterns assumed to witness a potential similarity, and this choice is decisive for the selectivity and sensitivity of the whole method. In this paper, we propose two ways to improve the hit criterion. First, we define the group criterion combining the advantages of the single-seed and double-seed approaches used in existing algorithms. Second, we introduce transition-constrained seeds that extend spaced seeds by the possibility of distinguishing transition and transversion mismatches. We provide analytical data as well as experimental results, obtained with our YASS software, supporting both improvements.
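The notion of a transition-constrained seed can be pictured with a short Python sketch. The three-symbol seed alphabet used here is an assumption made for illustration: `#` requires a match, `@` accepts a match or a transition (A/G or C/T), and `-` accepts anything. The function names are hypothetical and this is not YASS code.

```python
# Purine pair and pyrimidine pair: substitutions within a pair are transitions.
TRANSITIONS = {frozenset("AG"), frozenset("CT")}


def position_ok(symbol, x, y):
    """Does the base pair (x, y) satisfy one seed symbol?"""
    if symbol == "#":
        return x == y
    if symbol == "@":
        return x == y or frozenset((x, y)) in TRANSITIONS
    if symbol == "-":
        return True
    raise ValueError("unknown seed symbol: " + repr(symbol))


def seed_hit(seed, a, b, i, j):
    """True when a[i:] and b[j:] satisfy every seed position.

    The caller must ensure both windows of length len(seed) fit in a and b.
    """
    return all(position_ok(s, a[i + p], b[j + p]) for p, s in enumerate(seed))
```

With the seed `#@-#`, the windows `ACGT` and `ATAT` form a hit because the C/T difference under `@` is a transition, while `ACGT` and `AGAT` do not because C/G is a transversion.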
Multiseed lossless filtration
 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS (TCBB)
, 2005
Cited by 54 (13 self)
We study a method of seed-based lossless filtration for approximate string matching and related bioinformatics applications. The method is based on a simultaneous use of several spaced seeds rather than a single seed as studied by Burkhardt and Kärkkäinen [1]. We present algorithms to compute several important parameters of seed families, study their combinatorial properties, and describe several techniques to construct efficient families. We also report a large-scale application of the proposed technique to the problem of oligonucleotide selection for an EST sequence database.
Estimating seed sensitivity on homogeneous alignments
 in Proceedings of the IEEE 4th Symposium on Bioinformatics and Bioengineering (BIBE2004)
, 2004
Cited by 33 (9 self)
We address the problem of estimating the sensitivity of seed-based similarity search algorithms. In contrast to approaches based on Markov models [18, 6, 3, 4, 10], we study the estimation based on homogeneous alignments. We describe an algorithm for counting and random generation of those alignments and an algorithm for exact computation of the sensitivity for a broad class of seed strategies. We provide experimental results demonstrating a bias introduced by ignoring the homogeneousness condition.