Superiority and Complexity of the Spaced Seeds
 SODA
, 2006
Cited by 27 (6 self)
Optimal spaced seeds were introduced to bioinformatics by the theoretical computer science community to increase homology search sensitivity. They now serve thousands of homology search queries daily. Although dozens of papers have been published on optimal spaced seeds since their invention, many fundamental questions remain unanswered. In this paper, we settle several open questions in this area. Specifically, we prove that when the length of a nonuniformly spaced seed is bounded by an exponential function of the seed weight, the seed strictly outperforms the traditional consecutive seed in both (i) the average number of nonoverlapping hits and (ii) the asymptotic hit probability. We then study the computation of the hit probability of a spaced seed, solving three more open questions: (iii) hit probability computation in a uniform homologous region is NP-hard, and (iv) it admits a PTAS; (v) the asymptotic hit probability is computable in time exponential in the seed length, independent of the homologous region length.
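To make the notion of hit probability concrete, here is a minimal sketch, not the paper's algorithm. It assumes a seed is written as a string of `1` (required match) and `*` (don't care), and that the homologous region is a sequence of `n` independent positions, each a match with probability `p`. The function name `hit_probability` and this encoding are illustrative assumptions. The dynamic program tracks the last `len(seed) - 1` positions, so its state space is exponential in the seed length and its running time is linear in `n`.

```python
def hit_probability(seed: str, n: int, p: float) -> float:
    """Probability that `seed` hits somewhere in a random region of length n.

    Assumed model: each region position is a match (1) with probability p,
    independently; the seed hits at an offset if every '1' position of the
    seed is aligned with a match.
    """
    L = len(seed)
    required = [i for i, c in enumerate(seed) if c == "1"]
    # Map: last (up to L-1) region bits -> probability mass with no hit so far.
    states = {(): 1.0}
    for _ in range(n):
        nxt = {}
        for suffix, mass in states.items():
            for bit, q in ((1, p), (0, 1.0 - p)):
                window = suffix + (bit,)
                if len(window) == L and all(window[i] for i in required):
                    continue  # a hit occurred: this mass leaves the no-hit set
                key = window[-(L - 1):] if L > 1 else ()
                nxt[key] = nxt.get(key, 0.0) + mass * q
        states = nxt
    # Whatever mass never produced a hit is the miss probability.
    return 1.0 - sum(states.values())
```

For example, `hit_probability("11", 2, 0.5)` evaluates the chance that both positions of a length-2 region match, and `hit_probability("1*1", 3, 0.5)` ignores the middle position.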
Separating Real Motifs From Their Artifacts
, 2001
Cited by 23 (3 self)
The typical output of many computational methods to identify binding sites is a long list of motifs containing some real motifs (those most likely to correspond to the actual binding sites) along with a large number of random variations of these. We present a statistical method to separate real motifs from their artifacts. This produces a short list of high quality motifs that is sufficient to explain the overrepresentation of all motifs in the given sequences. Using synthetic data sets, we show that the output of our method is very accurate. On various sets of upstream sequences in S. cerevisiae, our program identifies several known binding sites, as well as a number of significant novel motifs. Contact: {blanchem,saurabh}@cs.washington.edu
Rare Events and Conditional Events on Random Strings
 DMTCS
, 2004
Cited by 16 (3 self)
this paper is twofold. First, a single word is given. We study the tail distribution of the number of its occurrences. Sharp large deviation estimates are derived. Second, we assume that a given word is overrepresented. The conditional distribution of a second word is studied; formulae for the expectation and the variance are derived. In both cases, the formulae are precise and can be computed efficiently. These results have applications in computational biology, where a genome is viewed as a text.
Hidden Word Statistics
Cited by 9 (3 self)
We consider the sequence comparison problem, also known as the "hidden" pattern problem, where one searches for a given subsequence in a text (rather than a string understood as a sequence of consecutive symbols). A characteristic parameter is...
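The difference between a subsequence ("hidden" pattern) and a contiguous string can be illustrated with a short sketch. The abstract is truncated before it names its characteristic parameter, so the quantity below, the number of ways the pattern is embedded in the text as a subsequence, is an assumption chosen for illustration, as is the function name.

```python
def count_subsequence_occurrences(pattern: str, text: str) -> int:
    """Count the embeddings of `pattern` in `text` as a (non-contiguous) subsequence."""
    # ways[j] = number of ways to embed pattern[:j] in the text read so far.
    ways = [1] + [0] * len(pattern)
    for ch in text:
        # Go right to left so each text symbol extends an embedding at most once.
        for j in range(len(pattern), 0, -1):
            if pattern[j - 1] == ch:
                ways[j] += ways[j - 1]
    return ways[-1]
```

With this definition, `"ab"` occurs four times in `"aabb"` (either `a` paired with either `b`), although it occurs only once as a contiguous string.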
Three Variations on Word Counting
 In GCB'00
Cited by 6 (3 self)
Motivation: We address the problem of assessing statistical significance of pattern occurrence frequency in biopolymer sequences. We consider three applications. First, several searching algorithms for regulatory sites make use of statistics on consensus words in a single-stranded DNA text. Second, we propose a new counting scheme to count consensus words on both strands in double-stranded DNA. Our third application is the counting of profiles, especially PROSITE regular expressions.
Minimal Markov chain embeddings of pattern problems
 University of California, San Diego
Cited by 5 (1 self)
The Markov chain embedding technique is commonly used to study the distribution of statistics associated with regular patterns (i.e. sets of strings described by a regular expression) in random strings. In this extended abstract, we formalize the concept of Markov chain embedding for random strings produced by a possibly nonstationary Markov source. A notion of memory conveyed by the states of a deterministic finite automaton is introduced. This notion is used to characterize the smallest state-space size Markov chain required to specify the distribution of the count statistic of a given regular pattern. The research finds applications in problems associated with regular patterns in random strings that demand exponentially large state spaces.
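A minimal sketch of the embedding idea, restricted to a much simpler setting than the one the abstract treats: a single word rather than a general regular pattern, and an i.i.d. source rather than a nonstationary Markov source. The text is fed through a deterministic automaton for the word, and the pair (automaton state, occurrences so far) evolves as a Markov chain. The names `word_automaton` and `count_distribution` are hypothetical, and no attempt is made to minimize the state space.

```python
def word_automaton(word: str, alphabet):
    """delta[state][ch]: length of the longest prefix of `word` that is a
    suffix of the text read so far, after reading `ch` from `state`."""
    m = len(word)
    delta = [dict() for _ in range(m + 1)]
    for state in range(m + 1):
        for ch in alphabet:
            candidate = word[:state] + ch
            k = min(m, len(candidate))
            while k > 0 and candidate[-k:] != word[:k]:
                k -= 1
            delta[state][ch] = k
    return delta


def count_distribution(word: str, probs: dict, n: int) -> dict:
    """Distribution of the number of (possibly overlapping) occurrences of a
    non-empty `word` in a random string of length n with i.i.d. symbols."""
    m = len(word)
    delta = word_automaton(word, list(probs))
    # Embedded chain: (automaton state, occurrences so far) -> probability.
    dist = {(0, 0): 1.0}
    for _ in range(n):
        nxt = {}
        for (state, count), mass in dist.items():
            for ch, q in probs.items():
                s2 = delta[state][ch]
                key = (s2, count + (1 if s2 == m else 0))
                nxt[key] = nxt.get(key, 0.0) + mass * q
        dist = nxt
    # Marginalize out the automaton state.
    result = {}
    for (_, count), mass in dist.items():
        result[count] = result.get(count, 0.0) + mass
    return result
```

For the word `"aa"` over a fair two-letter alphabet and `n = 3`, the eight equally likely strings contain zero, one, or two occurrences, and the returned dictionary gives the probability of each count.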
Mastering seeds for genomic size nucleotide BLAST searches
 Nucleic Acids Res
, 2003
Property matching and weighted matching
 In CPM
, 2006
Cited by 4 (3 self)
In many pattern matching applications the text has some properties attached to various of its parts. Pattern Matching with Properties (Property Matching, for short) involves a string matching between the pattern and the text, and the requirement that the text part satisfies some property. Some immediate examples come from molecular biology, where it has long been a practice to consider special areas in the genome by their structure. It is straightforward to do sequential matching in a text with properties. However, indexing in a text with properties becomes difficult if we desire the time to be output dependent. We present an algorithm for indexing a text with properties in $O(n \log|\Sigma| + n \log\log n)$ time for preprocessing and $O(|P| \log|\Sigma| + tocc_\pi)$ per query, where $n$ is the length of the text, $P$ is the sought pattern, and $tocc_\pi$ is the number of occurrences of the pattern that satisfy some property $\pi$. As a practical use of Property Matching we show how to solve Weighted Matching problems using techniques from Property Matching. Weighted sequences have been recently introduced as a tool to handle a set of sequences that are not identical but have many local similarities. The weighted sequence is a "statistical image" of this set, where we are given the probability of every symbol's occurrence at every text location. Weighted matching problems are pattern matching problems where the given text is weighted. We present a reduction from Weighted Matching to Property Matching that allows off-the-shelf solutions to numerous weighted matching problems including indexing (which is nontrivial without this reduction). Assuming that one seeks the occurrence of pattern $P$ with probability $\epsilon$ in weighted text $T$ of length $n$, we reduce the problem to a property matching problem of pattern $P$ in text $T'$ of length $O(n (1/\epsilon)^2 \log(1/\epsilon))$.
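The sequential (non-indexed) case that the abstract calls straightforward can be sketched as follows. The abstract does not define how a property is represented, so this sketch assumes a property is a set of inclusive, 0-based intervals of the text, and that an occurrence satisfies it when the whole occurrence lies inside a single interval. The function name and this interval model are illustrative assumptions, and a plain scan stands in for the paper's indexing algorithm.

```python
def property_matches(text: str, pattern: str, intervals) -> list:
    """Start positions where `pattern` occurs in `text` entirely inside one
    of the given inclusive 0-based (start, end) intervals (assumed model)."""
    n, m = len(text), len(pattern)
    # end_at[s] = furthest end among intervals that start exactly at s.
    end_at = [-1] * n
    for s, f in intervals:
        end_at[s] = max(end_at[s], f)
    hits = []
    reach = -1  # furthest end among intervals starting at or before position i
    for i in range(n - m + 1):
        reach = max(reach, end_at[i])
        # Some interval starting at or before i must cover through i + m - 1.
        if reach >= i + m - 1 and text.startswith(pattern, i):
            hits.append(i)
    return hits
```

For instance, with text `"abcabcabc"`, pattern `"abc"`, and intervals `[(0, 4), (5, 8)]`, the occurrence at position 3 is rejected because it straddles the two intervals, while those at positions 0 and 6 are accepted.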