Superiority and Complexity of the Spaced Seeds
SODA, 2006
Abstract

Cited by 27 (6 self)
Optimal spaced seeds were introduced to bioinformatics by the theoretical computer science community to effectively increase homology search sensitivity. They now serve thousands of homology search queries daily. While dozens of papers have been published on optimal spaced seeds since their invention, many fundamental questions remain unanswered. In this paper, we settle several open questions in this area. Specifically, we prove that when the length of a nonuniformly spaced seed is bounded by an exponential function of the seed weight, the seed strictly outperforms the traditional consecutive seed in both (i) the average number of nonoverlapping hits and (ii) the asymptotic hit probability. Then, we study the computation of the hit probability of a spaced seed, solving three more open questions: (iii) hit probability computation in a uniform homologous region is NP-hard and (iv) it admits a PTAS; (v) the asymptotic hit probability is computable in time exponential in the seed length, independent of the homologous region length.
Separating Real Motifs From Their Artifacts
2001
Abstract

Cited by 22 (3 self)
The typical output of many computational methods to identify binding sites is a long list of motifs containing some real motifs (those most likely to correspond to the actual binding sites) along with a large number of random variations of these. We present a statistical method to separate real motifs from their artifacts. This produces a short list of high-quality motifs that is sufficient to explain the overrepresentation of all motifs in the given sequences. Using synthetic data sets, we show that the output of our method is very accurate. On various sets of upstream sequences in S. cerevisiae, our program identifies several known binding sites, as well as a number of significant novel motifs. Contact: {blanchem,saurabh}@cs.washington.edu
Rare Events and Conditional Events on Random Strings
DMTCS, 2004
Abstract

Cited by 14 (3 self)
The aim of this paper is twofold. First, a single word is given. We study the tail distribution of the number of its occurrences. Sharp large deviation estimates are derived. Second, we assume that a given word is overrepresented. The conditional distribution of a second word is studied; formulae for the expectation and the variance are derived. In both cases, the formulae are precise and can be computed efficiently. These results have applications in computational biology, where a genome is viewed as a text.
Hidden Word Statistics
Abstract

Cited by 9 (3 self)
We consider the sequence comparison problem, also known as the "hidden" pattern problem, where one searches for a given subsequence in a text (rather than a string understood as a sequence of consecutive symbols). A characteristic parameter is...
Three Variations on Word Counting
 In GCB'00
Abstract

Cited by 6 (3 self)
Motivation: We address the problem of assessing the statistical significance of pattern occurrence frequency in biopolymer sequences. We consider three applications. First, several searching algorithms for regulatory sites make use of statistics on consensus words in a single-stranded DNA text. Second, we propose a new counting scheme to count consensus words on both strands in double-stranded DNA. Our third application is the counting of profiles, especially PROSITE regular expressions.
Minimal Markov chain embeddings of pattern problems
 University of California, San Diego
Abstract

Cited by 5 (1 self)
The Markov chain embedding technique is commonly used to study the distribution of statistics associated with regular patterns (i.e., sets of strings described by a regular expression) in random strings. In this extended abstract, we formalize the concept of Markov chain embedding for random strings produced by a possibly nonstationary Markov source. A notion of memory conveyed by the states of a deterministic finite automaton is introduced. This notion is used to characterize the smallest state-space size of a Markov chain required to specify the distribution of the count statistic of a given regular pattern. The research finds applications in problems associated with regular patterns in random strings that demand exponentially large state spaces.
Mastering seeds for genomic size nucleotide BLAST searches
Nucleic Acids Res, 2003
Superiority of Spaced Seeds for Homology Search
Transactions on Computational Biology and Bioinformatics (TCBB), 2007
Abstract

Cited by 4 (3 self)
In homology search, good spaced seeds have higher sensitivity for the same cost (weight). However, elucidating the mechanism that confers power to spaced seeds and characterizing optimal spaced seeds remain unsolved problems. This paper investigates these two important open questions by formally analyzing the average number of nonoverlapping hits and the hit probability of a spaced seed in the Bernoulli sequence model. We prove that when the length of a nonuniformly spaced seed is bounded above by an exponential function of the seed weight, the seed strictly outperforms the traditional consecutive seed of the same weight in both (i) the average number of nonoverlapping hits and (ii) the asymptotic hit probability. This answers the first question above in the Bernoulli sequence model. The theoretical study in this paper also gives a new approach to finding long optimal seeds.
Regexpcount, a Symbolic Package for Counting Problems on Regular Expressions and Words
2000
Abstract

Cited by 4 (2 self)
In previous work (Nicodème et al., 1999), we considered algorithms related to the statistics of word occurrences and regular expression occurrences in texts generated by Bernoulli or Markov sources. In this work, these algorithms are extended for two purposes: to determine the statistics of simultaneous counting of different motifs, and to compute the waiting time for the first match with a motif in a model which may be constrained. This extension also handles matches with errors. The package is fully implemented and gives access to high- and low-level commands. We also consider an example corresponding to a practical biological problem: getting the statistics for the number of matches of words of size 8 in a genome (a Markovian sequence), knowing that an (overrepresented, DNA-protecting) Chi pattern occurs a given number of times.