Results 1-10 of 75
A unifying framework for seed sensitivity and its application to subset seeds
 JOURNAL OF BIOINFORMATICS AND COMPUTATIONAL BIOLOGY (JBCB)
, 2006
Cited by 57 (22 self)
We propose a general approach to compute the seed sensitivity that can be applied to different definitions of seeds. It treats separately three components of the seed sensitivity problem (a set of target alignments, an associated probability distribution, and a seed model) that are specified by distinct finite automata. The approach is then applied to a new concept of subset seeds, for which we propose an efficient automaton construction. Experimental results confirm that sensitive subset seeds can be efficiently designed using our approach, and can then be used in similarity search, producing better results than ordinary spaced seeds.
Improved hit criteria for DNA local alignment
, 2004
Cited by 55 (12 self)
The hit criterion is a key component of heuristic local alignment algorithms. It specifies a class of patterns assumed to witness a potential similarity, and this choice is decisive for the selectivity and sensitivity of the whole method. In this paper, we propose two ways to improve the hit criterion. First, we define the group criterion combining the advantages of the single-seed and double-seed approaches used in existing algorithms. Second, we introduce transition-constrained seeds that extend spaced seeds by the possibility of distinguishing transition and transversion mismatches. We provide analytical data as well as experimental results, obtained with our YASS software, supporting both improvements.
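The idea of a transition-constrained seed can be illustrated with a minimal sketch. It assumes a three-symbol seed alphabet that is not given in the abstract: `#` requires a match, `@` accepts a match or a transition (A<->G, C<->T), and `-` accepts anything. The function names are hypothetical and this is not the YASS implementation.

```python
# Purine/purine and pyrimidine/pyrimidine substitutions are transitions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def seed_matches_at(seed: str, a: str, b: str, start: int) -> bool:
    """Check one seed placement on an ungapped alignment of a and b.
    Assumed symbols: '#' = match, '@' = match or transition, '-' = any."""
    for i, symbol in enumerate(seed):
        x, y = a[start + i], b[start + i]
        if symbol == "#" and x != y:
            return False
        if symbol == "@" and x != y and (x, y) not in TRANSITIONS:
            return False
    return True

def seed_hit_positions(seed: str, a: str, b: str) -> list[int]:
    """Return every offset at which the seed hits the alignment."""
    n = min(len(a), len(b))
    return [s for s in range(n - len(seed) + 1) if seed_matches_at(seed, a, b, s)]
```

With `a = "ACGTAC"` and `b = "ATGTAC"`, the C/T difference at position 1 is a transition, so the seed `#@#` gains a hit at offset 0 that the all-match seed `###` misses.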
Multiseed lossless filtration
 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS (TCBB)
, 2005
Cited by 54 (13 self)
We study a method of seed-based lossless filtration for approximate string matching and related bioinformatics applications. The method is based on a simultaneous use of several spaced seeds rather than a single seed as studied by Burkhardt and Kärkkäinen [1]. We present algorithms to compute several important parameters of seed families, study their combinatorial properties, and describe several techniques to construct efficient families. We also report a large-scale application of the proposed technique to the problem of oligonucleotide selection for an EST sequence database.
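A brute-force sketch of what "lossless" means for a seed family is given here. It assumes the usual formulation (not spelled out in the abstract) that a family is lossless for parameters (m, k) if every ungapped alignment of length m with k mismatches is hit by at least one seed. The `#` (match required) and `-` (don't care) symbols and the function name are illustrative assumptions, and exhaustive enumeration stands in for the paper's algorithms.

```python
from itertools import combinations

def family_is_lossless(seeds: list[str], m: int, k: int) -> bool:
    """True if every placement of k mismatches in a length-m alignment
    leaves at least one seed occurrence whose '#' positions all match.
    Checking exactly k mismatches suffices: fewer mismatches can only
    add hits."""
    for mismatches in combinations(range(m), k):
        bad = set(mismatches)
        hit = any(
            all(s != "#" or (start + i) not in bad for i, s in enumerate(seed))
            for seed in seeds
            for start in range(m - len(seed) + 1)
        )
        if not hit:
            return False
    return True
```

For m = 5 and k = 2, neither `##` nor `#-#` is lossless alone, but the family of both is, which is the kind of gain from using several seeds simultaneously that the abstract describes.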
Optimizing Multiple Spaced Seeds for Homology Search
 In: Proceedings of the 15th Symposium on Combinatorial Pattern Matching. Volume 3109 of Lecture Notes in Computer Science
, 2004
Cited by 54 (6 self)
Optimized spaced seeds improve sensitivity and specificity in local homology search [1]. Recently, several authors [2-4] have shown that multiple seeds can have better sensitivity and specificity than single seeds. We describe a linear programming-based algorithm to optimize a set of seeds. Our algorithm offers a performance guarantee: the sensitivity of a chosen seed set is at least 70% of what can be achieved, in most reasonable models of homologous sequences. Our method achieves performance comparable to that of a greedy algorithm, but our work gives this area a mathematical foundation.
Superiority and Complexity of the Spaced Seeds
 SODA
, 2006
Cited by 33 (6 self)
Optimal spaced seeds were introduced by the theoretical computer science community to bioinformatics to effectively increase homology search sensitivity. They are now serving thousands of homology search queries daily. While dozens of papers have been published on optimal spaced seeds since their invention, many fundamental questions still remain unanswered. In this paper, we settle several open questions in this area. Specifically, we prove that when the length of a non-uniformly spaced seed is bounded by an exponential function of the seed weight, the seed strictly outperforms the traditional consecutive seed in both (i) the average number of non-overlapping hits and (ii) the asymptotic hit probability. Then, we study the computation of the hit probability of a spaced seed, solving three more open questions: (iii) hit probability computation in a uniform homologous region is NP-hard and (iv) it admits a PTAS; (v) the asymptotic hit probability is computable in exponential time in seed length, independent of the homologous region length.
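The hit probability discussed in this abstract can be made concrete with a small brute-force sketch. It assumes a uniform region modelled as independent positions that match with probability p, and a seed written with `#` (match required) and `-` (don't care). These symbols and the function names are illustrative assumptions. The enumeration is exponential in the region length and is not the paper's method.

```python
from itertools import product

def seed_hits(seed: str, alignment: str) -> bool:
    """True if the seed hits the 0/1 alignment string at some offset.
    '1' marks a matching column; '#' in the seed requires a match."""
    for start in range(len(alignment) - len(seed) + 1):
        if all(s != "#" or alignment[start + i] == "1" for i, s in enumerate(seed)):
            return True
    return False

def hit_probability(seed: str, length: int, p: float) -> float:
    """Exact hit probability over all 2**length alignments, where each
    column is a match independently with probability p (assumed model)."""
    total = 0.0
    for bits in product("01", repeat=length):
        alignment = "".join(bits)
        if seed_hits(seed, alignment):
            matches = alignment.count("1")
            total += p ** matches * (1 - p) ** (length - matches)
    return total
```

For example, `hit_probability("#-#", 4, 0.5)` sums the probabilities of the 7 of 16 alignments in which columns (0, 2) or (1, 3) both match.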
PerM: Efficient mapping of short sequencing reads with periodic full sensitive spaced seeds
 BIOINFORMATICS
, 2009
Cited by 28 (1 self)
Motivation: The explosion of next generation sequencing data has spawned the design of new algorithms and software tools to provide efficient mapping for different read lengths and sequencing technologies. In particular, ABI’s sequencer (SOLiD system) poses a big computational challenge with its capacity to produce very large amounts of data, and its unique strategy of encoding sequence data into color signals. Results: We present the mapping software PerM (Periodic Seed Mapping), which uses periodic spaced seeds to significantly improve mapping efficiency for large reference genomes when compared to state-of-the-art programs. The data structure in PerM requires only 4.5 bytes per base to index the human genome, allowing entire genomes to be loaded into memory while multiple processors simultaneously map reads to the reference. Weight-maximized periodic seeds offer full sensitivity for up to three mismatches and high sensitivity for four and five mismatches while minimizing the number of random hits per query, significantly speeding up running time. Such sensitivity makes PerM a valuable mapping tool for SOLiD and Solexa reads.
Optimizing multiple seeds for protein homology search
 IEEE Transactions on Computational Biology and Bioinformatics
, 2005
Cited by 28 (1 self)
We present a framework for improving local protein alignment algorithms. Specifically, we discuss how to extend local protein aligners to use a collection of vector seeds or ungapped alignment seeds to reduce noise hits. We model picking a set of seed models as an integer programming problem and give algorithms to choose such a set of seeds. While the problem is NP-hard, and Quasi-NP-hard to approximate to within a logarithmic factor, it can be solved easily in practice. A good set of seeds we have chosen allows four to five times fewer false positive hits, while preserving essentially identical sensitivity as BLASTP. Index Terms: Bioinformatics database applications, similarity measures, biology and genetics.
Hardness of optimal spaced seed design
 PARK (EDS.), PROCEEDINGS OF THE 16TH ANNUAL SYMPOSIUM ON COMBINATORIAL PATTERN MATCHING (CPM’05)
, 2005
Cited by 22 (1 self)
Speeding up approximate pattern matching has been a line of research in stringology since the 1980s. Practically fast approaches belong to the class of filtration algorithms, in which text regions dissimilar to the pattern are first excluded, and the remaining regions are then compared to the pattern by dynamic programming. Among the conditions used to test similarity between the regions and the pattern, many require a minimum number of common substrings between them. When only substitutions are taken into account for measuring dissimilarity, counting spaced subwords instead of substrings improves the filtration efficiency. However, a preprocessing step is required to design one or more patterns, called spaced seeds (or gapped seeds), for the subwords, depending on the search parameters. Two distinct lines of research appear in the literature: one with probabilistic formulations of seed design problems, in which one wishes for instance to compute a seed with the highest probability to detect the desired similarities (lossy filtration), and a second line with combinatorial formulations, where the goal is to find a seed that detects all or a maximum number
Indel seeds for homology search
 BIOINFORMATICS
, 2006
Cited by 20 (2 self)
We are interested in detecting homologous genomic DNA sequences with the goal of locating approximate inverted, interspersed, and tandem repeats. Standard search techniques start by detecting small matching parts, called seeds, between a query sequence and database sequences. Contiguous seed models have existed for many years. Recently, spaced seeds were shown to be more sensitive than contiguous seeds without increasing the random hit rate. To determine the superiority of one seed model over another, a model of homologous sequence alignment must be chosen. Previous studies evaluating spaced and contiguous seeds have assumed that matches and mismatches occur within these alignments, but not insertions and deletions (indels). This is perhaps appropriate when searching for protein coding sequences (<5% of the human genome), but is inappropriate when looking for repeats in the majority of genomic sequence where indels are common. In this paper, we assume a model of homologous sequence alignment which includes indels and we describe a new seed model, called indel seeds, which explicitly allows indels. We present a waiting time formula for computing the sensitivity of an indel seed and show that indel seeds significantly outperform contiguous and spaced seeds when homologies include indels. We discuss the practical aspect of using indel seeds and finally we present results from a search for inverted repeats in the dog genome using both indel and spaced seeds.
A tutorial of recent developments in the seeding of local alignment
 JOURNAL OF BIOINFORMATICS AND COMPUTATIONAL
, 2004
Cited by 19 (1 self)
We review recent results on local alignment. We begin with a review of classical methods and early heuristic methods, and then focus on more recent work on the seeding of local alignment. We show that these techniques give a vast improvement in both sensitivity and specificity over previous methods, and can achieve sensitivity at the level of classical algorithms while requiring orders of magnitude less runtime.