Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology
Hardness of optimal spaced seed design
 PARK (EDS.), PROCEEDINGS OF THE 16TH ANNUAL SYMPOSIUM ON COMBINATORIAL PATTERN MATCHING (CPM’05)
, 2005
Abstract

Cited by 23 (1 self)
Speeding up approximate pattern matching has been a line of research in stringology since the 1980s. Practically fast approaches belong to the class of filtration algorithms, in which text regions dissimilar to the pattern are first excluded, and the remaining regions are then compared to the pattern by dynamic programming. Among the conditions used to test similarity between the regions and the pattern, many require a minimum number of common substrings between them. When only substitutions are taken into account for measuring dissimilarity, counting spaced subwords instead of substrings improves the filtration efficiency. However, a preprocessing step is required to design one or more patterns, called spaced seeds (or gapped seeds), for the subwords, depending on the search parameters. Two distinct lines of research appear in the literature: one with probabilistic formulations of seed design problems, in which one wishes, for instance, to compute a seed with the highest probability of detecting the desired similarities (lossy filtration), and a second with combinatorial formulations, where the goal is to find a seed that detects all or a maximum number
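The combinatorial (lossless) formulation described in this abstract can be illustrated with a minimal Python sketch. It assumes a seed written as a string of `1` (position must match) and `0` (don't care), and a substitution-only alignment of length `m` with at most `k` mismatches. The names `seed_hits` and `is_lossless` are hypothetical and not taken from the paper, and the exhaustive check is only practical for small `m` and `k`.

```python
from itertools import combinations

def seed_hits(seed, matches):
    """Return True if the seed fits at some offset of the alignment.

    seed    -- string over {'1', '0'}; '1' marks a position that must match
    matches -- sequence of booleans, True where text and pattern agree
    """
    required = [i for i, c in enumerate(seed) if c == "1"]
    last_offset = len(matches) - len(seed)
    return any(
        all(matches[offset + i] for i in required)
        for offset in range(last_offset + 1)
    )

def is_lossless(seed, m, k):
    """Check whether the seed detects every alignment of length m
    that contains at most k substitutions.

    Only alignments with exactly min(k, m) mismatches are enumerated:
    turning a mismatch back into a match can never destroy a hit, so
    these are the worst cases.
    """
    for mismatches in combinations(range(m), min(k, m)):
        matches = [True] * m
        for pos in mismatches:
            matches[pos] = False
        if not seed_hits(seed, matches):
            return False
    return True
```

For instance, `is_lossless("1101", 6, 1)` checks whether the spaced seed `1101` is guaranteed to hit any length-6 region containing a single substitution.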
Subset Seed Automaton
 in "12th International Conference on Implementation and Application of Automata (CIAA 07)", Lecture Notes in Computer Science
Abstract

Cited by 16 (10 self)
Abstract. We study the pattern matching automaton introduced in [1] for the purpose of seed-based similarity search. We show that our definition provides a compact automaton, much smaller than the one obtained by applying the Aho-Corasick construction. We study properties of this automaton and present an efficient implementation of the automaton construction. We also present some experimental results and show that this automaton can be successfully applied to more general situations. (inria-00170414, version 1, 7 Sep 2007)
Superiority of Spaced Seeds for Homology Search
 TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS (TCBB)
, 2007
Abstract

Cited by 6 (3 self)
In homology search, good spaced seeds have higher sensitivity for the same cost (weight). However, elucidating the mechanism that confers power to spaced seeds and characterizing optimal spaced seeds remain unsolved problems. This paper investigates these two important open questions by formally analyzing the average number of non-overlapping hits and the hit probability of a spaced seed in the Bernoulli sequence model. We prove that when the length of a non-uniformly spaced seed is bounded above by an exponential function of the seed weight, the seed strictly outperforms the traditional consecutive seed of the same weight in both (i) the average number of non-overlapping hits and (ii) the asymptotic hit probability. This clearly answers the first problem mentioned above in the Bernoulli sequence model. The theoretical study in this paper also gives a new solution to finding long optimal seeds.
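To make the notion of hit probability concrete, here is a small Python sketch under the Bernoulli model, assuming each of the `n` alignment positions is a match independently with probability `p`. It enumerates all alignments, so it is only usable for short regions, and it is not the analysis used in the paper. The function names are hypothetical.

```python
from itertools import product

def seed_hits(seed, alignment):
    """True if every '1' of the seed lies on a match (1) at some offset."""
    required = [i for i, c in enumerate(seed) if c == "1"]
    last_offset = len(alignment) - len(seed)
    return any(
        all(alignment[offset + i] for i in required)
        for offset in range(last_offset + 1)
    )

def hit_probability(seed, n, p):
    """Probability that the seed hits a random alignment of length n.

    Brute force: sums the probability of every 0/1 alignment that the
    seed hits, so the running time grows as 2**n.
    """
    total = 0.0
    for alignment in product((0, 1), repeat=n):
        if seed_hits(seed, alignment):
            ones = sum(alignment)
            total += p ** ones * (1 - p) ** (n - ones)
    return total
```

Calling `hit_probability("111", 16, 0.7)` and `hit_probability("1101", 16, 0.7)` lets one compare a consecutive seed and a spaced seed of the same weight on a short region.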
Reference-Based Alignment in Large Sequence Databases
Abstract

Cited by 4 (2 self)
This paper introduces a novel method, called Reference-Based String Alignment (RBSA), that speeds up retrieval of optimal subsequence matches in large databases of sequences under the edit distance and the Smith-Waterman similarity measure. RBSA operates under the assumption that the optimal match deviates by a relatively small amount from the query, an amount that does not exceed a prespecified fraction of the query length. RBSA has an exact version that guarantees no false dismissals and can handle large queries efficiently. An approximate version of RBSA is also described that achieves significant additional improvements over the exact version, with negligible losses in retrieval accuracy. RBSA filters candidate matches using precomputed alignment scores between the database sequence and a set of fixed-length reference sequences. At query time, the query sequence is partitioned into segments of length equal to that of the reference sequences. For each of those segments, the alignment scores between the segment and the reference sequences are used to efficiently identify a relatively small number of candidate subsequence matches. An alphabet collapsing technique is employed to improve the pruning power of the filter step. In our experimental evaluation, RBSA significantly outperforms state-of-the-art biological sequence alignment methods, such as q-grams, BLAST, and BWT.
Seed design framework for mapping SOLiD reads
Abstract
Abstract. The advent of high-throughput sequencing technologies constituted a major advance in genomic studies, offering new prospects in a wide range of applications. We propose a rigorous and flexible algorithmic solution to mapping SOLiD color-space reads to a reference genome. The solution relies on an advanced method of seed design that uses a faithful probabilistic model of read matches and, on the other hand, a novel seeding principle especially adapted to read mapping. Our method can handle both lossy and lossless frameworks and is able to distinguish, at the level of seed design, between SNPs and reading errors. We illustrate our approach by several seed designs and demonstrate their efficiency.
Spaced seed design on profile HMMs for precise HTS read-mapping: efficient sliding window product on the matrix semigroup
, 2012
Abstract
We propose a new method and its associated algorithm to efficiently compute seed sensitivity when High Throughput Sequencing reads are mapped along subparts of a known HMM alignment profile. This computation particularly makes sense with positioned spaced seeds. It relies on automata theory (previous work [KNR06]) combined with a matrix product problem. Interestingly, it brings to light an interval product problem considered more than twenty years ago in [AS87], but here with a sliding window aspect: we propose an efficient algorithm to compute this sliding window set of products using a linear number of unit products on the (associative, but non-commutative and non-invertible) matrix semigroup. This computational scheme is implemented in the ongoing 1.06 version of Iedera which is available at
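The sliding window product problem named in this abstract can be sketched in Python as follows. This is a minimal illustration using a standard block prefix/suffix decomposition, which may differ from the algorithm proposed in the paper. It assumes only an associative operation `op`, with no commutativity and no inverses, and uses roughly three applications of `op` per element. The names `sliding_products` and `mat_mul` are hypothetical.

```python
def mat_mul(a, b):
    """Product of two 2x2 matrices given as nested tuples."""
    return tuple(
        tuple(sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2))
        for i in range(2)
    )

def sliding_products(items, w, op):
    """Return op-products of every window of w consecutive items, in order.

    The sequence is cut into blocks of length w. For each position we store
    the product from the start of its block (prefix) and the product to the
    end of its block (suffix). Any window spans at most two blocks, so it is
    the suffix at its left end times the prefix at its right end.
    """
    n = len(items)
    if w <= 0 or w > n:
        return []
    prefix = list(items)
    suffix = list(items)
    for j in range(1, n):
        if j % w != 0:  # j is not the first element of its block
            prefix[j] = op(prefix[j - 1], items[j])
    for i in range(n - 2, -1, -1):
        if (i + 1) % w != 0:  # i is not the last element of its block
            suffix[i] = op(items[i], suffix[i + 1])
    result = []
    for i in range(n - w + 1):
        if i % w == 0:
            result.append(suffix[i])  # window coincides with a block
        else:
            result.append(op(suffix[i], prefix[i + w - 1]))
    return result
```

For example, `sliding_products(matrices, 3, mat_mul)` returns the ordered product of every three consecutive matrices in a list `matrices` of 2x2 tuples, without ever needing a matrix inverse.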
Efficient computation of spaced seeds
 BMC RESEARCH NOTES
, 2012
Abstract
Background: The most frequently used tools in bioinformatics are those searching for similarities, or local alignments, between biological sequences. Since the exact dynamic programming algorithm is quadratic, linear-time heuristics such as BLAST are used. Spaced seeds are much more sensitive than the consecutive seed of BLAST, and using several seeds represents the current state of the art in approximate search for biological sequences. The most important aspect is computing highly sensitive seeds. Since the problem seems hard, heuristic algorithms are used. The leading software in the common Bernoulli model is the SpEED program. Findings: SpEED uses a hill climbing method based on the overlap complexity heuristic. We propose a new algorithm for this heuristic that improves its speed by over one order of magnitude. We use the new implementation to compute improved seeds for several software programs. We also compute multiple seeds of the same weight as MegaBLAST that greatly improve its sensitivity. Conclusion: Multiple spaced seeds are being successfully used in bioinformatics software programs. Enabling researchers to compute high-quality seeds very quickly will help expand the range of their applications.
ACCURATE ALIGNMENT OF SEQUENCING READS FROM VARIOUS GENOMIC ORIGINS
, 2014
Abstract
Declaration: I hereby declare that this thesis is my original work and it has been written by me in its entirety. I have duly acknowledged all the sources of information which have been used in the thesis. This thesis has not been submitted for any degree in any university previously.
BOND: Basic OligoNucleotide Design
Abstract
Background: DNA microarrays have become ubiquitous in biological and medical research. The most difficult problem that needs to be solved is the design of DNA oligonucleotides that (i) are highly specific, that is, bind only to the intended target, (ii) cover the highest possible number of genes, that is, all genes that allow such unique regions, and (iii) are computed fast. None of the existing programs meets all these criteria. Results: We introduce a new approach with our software program BOND (Basic OligoNucleotide Design). According to Kane’s criteria for oligo design, BOND computes highly specific DNA oligonucleotides for all the genes that admit unique probes, while running orders of magnitude faster than the existing programs. The same approach also enables us to introduce an evaluation procedure that correctly measures the quality of the oligonucleotides. Extensive comparison is performed to support our claims. BOND is flexible, easy to use, requires no additional software, and is freely available for non-commercial use from