Results 1-10 of 20
Superiority and Complexity of the Spaced Seeds
SODA, 2006
Cited by 27 (6 self)
Optimal spaced seeds were introduced by the theoretical computer science community to bioinformatics to effectively increase homology search sensitivity. They are now serving thousands of homology search queries daily. While dozens of papers have been published on optimal spaced seeds since their invention, many fundamental questions still remain unanswered. In this paper, we settle several open questions in this area. Specifically, we prove that when the length of a non-uniformly spaced seed is bounded by an exponential function of the seed weight, the seed strictly outperforms the traditional consecutive seed in both (i) the average number of non-overlapping hits and (ii) the asymptotic hit probability. Then, we study the computation of the hit probability of a spaced seed, solving three more open questions: (iii) hit probability computation in a uniform homologous region is NP-hard and (iv) it admits a PTAS; (v) the asymptotic hit probability is computable in exponential time in seed length, independent of the homologous region length.
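To make the notion of hit probability concrete, here is a minimal brute-force sketch. It is not the method of the paper: it enumerates all 2^n match/mismatch regions, so it is only usable for small n. The function name `hit_probability`, the encoding of a seed as a string of `1` (required match) and `0` (don't care), and the model in which each position matches independently with probability p are assumptions made for illustration.

```python
from itertools import product


def hit_probability(seed: str, n: int, p: float) -> float:
    """Probability that `seed` hits a random region of length n.

    Assumed model: each region position is a match (1) with probability p,
    independently. The seed hits if, at some offset, every '1' position of
    the seed is aligned with a match.
    """
    required = [i for i, c in enumerate(seed) if c == "1"]
    span = len(seed)
    total = 0.0
    # Exhaustive enumeration: exponential in n, for illustration only.
    for region in product((0, 1), repeat=n):
        hit = any(
            all(region[start + i] for i in required)
            for start in range(n - span + 1)
        )
        if hit:
            matches = sum(region)
            total += p**matches * (1 - p) ** (n - matches)
    return total
```

For example, `hit_probability("111", 16, 0.7)` and `hit_probability("1101", 16, 0.7)` compare a consecutive seed and a spaced seed of the same weight on a short region.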
Seed optimization is no easier than optimal Golomb ruler design
In Proceedings of the 6th Asia Pacific Bioinformatics Conference (APBC), 2008
Cited by 9 (1 self)
The spaced seed is a filter method invented to efficiently identify the regions of interest in similarity searches. It is now well known that certain spaced seeds hit (detect) a randomly sampled similarity region with higher probability than others. Assume each position of the similarity region is an identity with probability p, independently. The seed optimization problem seeks the optimal seed achieving the highest hit probability for a given length and weight. Although the problem was previously shown not to be NP-hard, in practice it seems difficult to solve. The only algorithm known to compute the optimal seed is still exhaustive search in exponential time. In this article we give some insight into the hardness of the seed design problem by demonstrating the relation between the seed optimization problem and the optimal Golomb ruler design problem, which is a well-known difficult problem in combinatorial design.
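For readers unfamiliar with the term, a Golomb ruler is a set of integer marks whose pairwise differences are all distinct. The following minimal sketch only checks that property; it does not reproduce the reduction from the article. The function name `is_golomb_ruler` is hypothetical, and the marks are assumed to be distinct integers.

```python
from itertools import combinations


def is_golomb_ruler(marks) -> bool:
    """Return True if all pairwise differences between marks are distinct.

    Assumes `marks` contains distinct integers; duplicates are collapsed.
    """
    ordered = sorted(set(marks))
    diffs = [b - a for a, b in combinations(ordered, 2)]
    return len(diffs) == len(set(diffs))
```

For instance, the marks 0, 1, 4, 6 give the differences 1, 2, 3, 4, 5, 6, each exactly once, while 0, 1, 2 repeats the difference 1.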
Rapid Homology Search with Neighbor Seeds
2005
Cited by 8 (0 self)
Using a seed to rapidly "hit" possible homologies for further scrutiny is a common practice to speed up homology search in molecular sequences.
Evidence Combination in Hidden Markov Models for Gene Prediction
2005
Cited by 4 (2 self)
This thesis introduces new techniques for finding genes in genomic sequences. Genes are regions of a genome encoding proteins of an organism. Identification of genes in a genome is an important step in the annotation process after a new genome is sequenced. The prediction accuracy of gene finding can be greatly improved by using experimental evidence. This evidence includes homologies between the genome and databases of known proteins, or evolutionary conservation of genomic sequence in different species. We propose a flexible framework to incorporate several different sources of such evidence into a gene finder based on a hidden Markov model. Various sources of evidence are expressed as partial probabilistic statements about the annotation of positions in the sequence, and these are combined with the hidden Markov model to obtain the final gene prediction. The opportunity to
A survey of seeding for sequence alignment
2007
Cited by 3 (0 self)
We survey recent work in the seeding of alignments, particularly the follow-ups from the 2002 work of Ma, Tromp and Li that brought the concept of spaced seeds into the bioinformatics literature [25]. Our focus is on the extensions of this work to increasingly complicated models of alignments, coming up to the most recent efforts in this area.
Amino Acid Classification and Hash Seeds for Homology Search
BICOB, 2009
Cited by 2 (1 self)
Spaced seeds have been extensively studied in the homology search field. A spaced seed can be regarded as a very special type of hash function on k-mers, where two k-mers have the same hash value if and only if they are identical at the w (w < k) positions designated by the seed. Spaced seeds substantially increased the homology search sensitivity. It is then a natural question to ask whether there is a better hash function (called hash seed) that provides better sensitivity than the spaced seed. We study this question in the paper. We propose a strategy to classify amino acids, which leads to a better hash seed. Our results raise a new question about how to design the best hash seed.
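The view of a spaced seed as a hash function on k-mers can be sketched as follows. This illustrates only the plain spaced-seed hash described in the abstract, not the amino acid classification the paper proposes. The function name `spaced_seed_hash` and the encoding of the seed as a string of `1` (designated position) and `0` (ignored position) are assumptions.

```python
def spaced_seed_hash(kmer: str, seed: str) -> str:
    """Project a k-mer onto the w positions designated by the seed.

    Two k-mers receive the same value if and only if they agree at every
    position where the seed has a '1'. The projected string serves as the
    hash value here (an illustrative choice, not the paper's encoding).
    """
    if len(kmer) != len(seed):
        raise ValueError("k-mer and seed must have the same length k")
    return "".join(ch for ch, s in zip(kmer, seed) if s == "1")
```

With the seed `1101` (k = 4, w = 3), the k-mers `ACGT` and `ACTT` collide because they differ only at the ignored third position, while `TCGT` maps to a different value.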
MPSCAN: Fast Localisation of Multiple Reads in Genomes
WABI, 2009
Cited by 2 (0 self)
With Next Generation Sequencers, sequence-based transcriptomic or epigenomic assays yield millions of short sequence reads that need to be mapped back on a reference genome. The upcoming versions of these sequencers promise even higher sequencing capacities; this may turn the read mapping task into a bottleneck for which alternative pattern matching approaches must be explored. We present an algorithm and its implementation, called mpscan, which uses a sophisticated filtration scheme to match a set of patterns/reads exactly on a sequence. mpscan can search for millions of reads in a single pass through the genome without indexing its sequence. Moreover, we show that mpscan offers an optimal average time complexity, which is sublinear in the text length, meaning that it does not need to examine all sequence positions. Comparisons with BLAT-like tools and with six specialised read mapping programs (like Bowtie or ZOOM) demonstrate that mpscan is also the fastest algorithm in practice for exact matching. Our accuracy and scalability comparisons reveal that some tools are inappropriate for read mapping. Moreover, we provide evidence suggesting that exact matching may be a valuable solution in some read mapping applications. As most read mapping programs somehow rely on exact matching procedures to perform approximate pattern mapping, the filtration scheme we experimented with may prove useful in the design of future algorithms. The absence of a genome index gives mpscan its low memory requirement and a flexibility that let it run on a desktop computer and avoid time-consuming genome preprocessing.
Seed optimization for i.i.d. similarities is no easier than optimal Golomb ruler design
Information Processing Letters, 2009
Cited by 1 (0 self)
The spaced seed is a filtration method to efficiently identify the regions of interest in string similarity searches. It is important to find the optimal spaced seed that achieves the highest search sensitivity. For some simple distributions of the similarities, the seed optimization problem was proved to be not NP-hard. On the other hand, no polynomial time algorithm has been found despite extensive research in the literature. In this article we examine the hardness of the seed optimization problem by a polynomial time reduction from the optimal Golomb ruler design problem, which is a well-known difficult (but not NP-hard) problem in combinatorial design.
Department of Computer Science,
, 802
A gapped pattern is a sequence consisting of regular alphabet symbols and of joker symbols that match any alphabet symbol. The content of a gapped pattern is defined as the number of its non-joker symbols. A gapped motif is a gapped pattern that occurs repeatedly in a string or in a set of strings. The aim of this paper is to study the complexity of several gapped motif finding problems. The following three decision problems are shown to be NP-complete, even if the input alphabet is binary. (i) Given a string T and two integers c and q, decide whether or not there exists a gapped pattern with content c (or more) that occurs in T at q distinct positions (or more). (ii) Given a set of strings S and an integer c, decide whether or not there exists a gapped pattern with content c that occurs at least once in each string of S. (iii) Given m strings with the same length, and two integers c and q, decide whether or not there exists a gapped pattern with content c, matching at least q input strings. We also present a non-naive quadratic-time algorithm that solves the following optimization problem: given a string T and an integer q ≥ 0, compute a maximum-content gapped pattern Q such that q consecutive copies of Q occur in T. Key words: gapped pattern, motif discovery, string matching with don't care symbols, NP-complete, tandem motifs.
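The definitions of content and occurrence can be illustrated with a short sketch. It only verifies a candidate pattern, as one would when checking a certificate for problem (i); it does not search for a pattern and is not the algorithm from the paper. The joker character `*` and the names `content` and `occurrences` are assumptions chosen for illustration.

```python
JOKER = "*"  # assumed joker symbol; it matches any alphabet symbol


def content(pattern: str) -> int:
    """Number of non-joker symbols in the gapped pattern."""
    return sum(1 for c in pattern if c != JOKER)


def occurrences(pattern: str, text: str) -> list[int]:
    """Start positions in `text` where the gapped pattern occurs."""
    m = len(pattern)
    return [
        i
        for i in range(len(text) - m + 1)
        if all(pc == JOKER or pc == tc for pc, tc in zip(pattern, text[i:i + m]))
    ]
```

Given a candidate pattern P, a string T and integers c and q, checking `content(P) >= c and len(occurrences(P, T)) >= q` takes time polynomial in the input size; the hardness lies in finding such a pattern.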