Results 1 - 10 of 37
Motif Statistics
, 1999
Cited by 36 (3 self)
We present a complete analysis of the statistics of the number of occurrences of a regular expression pattern in a random text. This covers "motifs" widely used in computational biology. Our approach is based on: (i) a constructive approach to classical results in theoretical computer science (automata and formal language theory), in particular, the rationality of generating functions of regular languages; (ii) analytic combinatorics that is used for deriving asymptotic properties from generating functions; (iii) computer algebra for determining generating functions explicitly, analysing generating functions and extracting coefficients efficiently. We provide constructions for overlapping or non-overlapping matches of a regular expression. A companion implementation produces multivariate generating functions for the statistics under study. A fast computation of Taylor coefficients of the generating functions then yields exact values of the moments with typical application to random t...
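The difference between overlapping and non-overlapping matches can be illustrated with a small Python sketch. This is not the generating-function construction described in the abstract; it only counts matches in one concrete text with the standard `re` module. The helper names `count_non_overlapping` and `count_overlapping` are hypothetical, and "overlapping" is taken here to mean the number of positions at which some match starts, which is an assumption.

```python
import re


def count_non_overlapping(pattern: str, text: str) -> int:
    """Count non-empty matches found by a left-to-right scan that never reuses characters."""
    return sum(1 for m in re.finditer(pattern, text) if m.end() > m.start())


def count_overlapping(pattern: str, text: str) -> int:
    """Count the positions at which a match of pattern starts (assumed meaning of 'overlapping')."""
    # A zero-width lookahead consumes no characters, so the scan advances one position at a time.
    lookahead = re.compile(f"(?=(?:{pattern}))")
    return sum(1 for _ in lookahead.finditer(text))
```

For the pattern `aa` in the text `aaaa`, the non-overlapping count is 2 and the overlapping count is 3.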
Monotony of Surprise and Large-Scale Quest for Unusual Words
- In Proceedings of the 6th Int’l Conference on Research in Computational Molecular Biology
, 2002
Cited by 29 (6 self)
The problem of characterizing and detecting recurrent sequence patterns such as substrings or motifs and related associations or rules is variously pursued in order to compress data, unveil structure, infer succinct descriptions, extract and classify features, etc. In Molecular Biology, exceptionally frequent or rare words in bio-sequences have been implicated in various facets of biological function and structure. The discovery, particularly on a massive scale, of such patterns poses interesting methodological and algorithmic problems, and often exposes scenarios in which tables and synopses grow faster and bigger than the raw sequences they are meant to encapsulate. In previous studies, the ability to succinctly compute, store, and display unusual substrings has been linked to a subtle interplay between the combinatorics of the subwords of a word and local monotonicities of some scores used to measure the departure from expectation.
On the Approximate Pattern Occurrences in a Text
- In IEEE Computer Society, editor, Compression and Complexity of SEQUENCES 1997
, 1997
Cited by 27 (13 self)
Consider a given pattern H and a random text T generated randomly according to the Bernoulli model. We study the frequency of approximate occurrences of the pattern in a random text when overlapping copies of the approximate pattern are counted separately. We provide exact and asymptotic formulae for mean, variance and probability of occurrence as well as asymptotic results including the central limit theorem and large deviations. Our approach is combinatorial: we first construct some language expressions that characterize pattern occurrences which are translated into generating functions, and finally we use analytical methods to extract asymptotic behaviors of the pattern frequency. Applications of these results include molecular biology, source coding, synchronization, wireless communications, approximate pattern matching, games, and stock market analysis. These findings are of particular interest to information theory (e.g., second-order properties of the relative frequency), and molecular biology problems (e.g., finding patterns with unexpectedly high or low frequencies, and gene recognition).
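As a rough illustration of the mean in the Bernoulli model, the sketch below computes the expected number of approximate occurrences by linearity of expectation. It assumes that "approximate" means Hamming distance at most d from H, which the abstract does not state, and the function names are hypothetical. It does not reproduce the paper's variance or limit results.

```python
from typing import Dict


def prob_within_hamming(pattern: str, probs: Dict[str, float], d: int) -> float:
    """Probability that a random window of len(pattern) symbols has at most d mismatches."""
    # dist[k] = probability of exactly k mismatches over the positions processed so far
    dist = [1.0]
    for ch in pattern:
        p_match = probs[ch]
        new = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            new[k] += mass * p_match            # this position matches
            new[k + 1] += mass * (1 - p_match)  # this position mismatches
        dist = new
    return sum(dist[: d + 1])


def expected_approx_occurrences(pattern: str, probs: Dict[str, float], n: int, d: int) -> float:
    """Expected count of windows within Hamming distance d, overlapping copies counted separately."""
    windows = max(n - len(pattern) + 1, 0)
    return windows * prob_within_hamming(pattern, probs, d)
```

With a uniform binary alphabet, the pattern `ab`, and a text of length 3, there are two windows, so the expected count is 0.5 for d = 0 and 1.5 for d = 1.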
Reliable Detection of Episodes in Event Sequences
- Knowledge and Information Systems
, 2004
Cited by 17 (3 self)
Suppose one wants to detect "bad" or "suspicious" subsequences in event sequences. Whether an observed pattern of activity (in the form of a particular subsequence) is significant and should be a cause for alarm depends on how likely it is to occur fortuitously. A long enough sequence of observed events will almost certainly contain any subsequence, and setting thresholds for alarm is an important issue in a monitoring system that seeks to avoid false alarms. Suppose a long sequence T of observed events contains a suspicious subsequence pattern S within it, where the suspicious subsequence S consists of m events and spans a window of size w within T. We address the fundamental problem: is a certain number of occurrences of a particular subsequence unlikely to be generated by randomness itself (i.e., indicative of suspicious activity)? If the probability of an occurrence generated by randomness is high and an automated monitoring system flags it as suspicious anyway, then such a system will suffer from generating too many false alarms. This paper quantifies the probability of such an S occurring in T within a window of size w, the number of distinct windows containing S as a subsequence, the expected number of such occurrences, and its variance, and establishes its limiting distribution, which allows one to set up an alarm threshold so that the probability of false alarms is very small. We report on experiments confirming the theory and showing that we can detect bad subsequences with a low false alarm rate.
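The probability that a single window of size w contains S as a subsequence can be sketched with a small dynamic program. This is a minimal illustration that assumes independent, identically distributed events (an assumption, not necessarily the paper's model) and considers one fixed window only; `prob_window_contains` is a hypothetical name. It says nothing about the dependence between overlapping windows, which the paper's variance and limiting distribution must account for.

```python
from typing import Dict


def prob_window_contains(episode: str, probs: Dict[str, float], w: int) -> float:
    """Probability that w i.i.d. events contain `episode` as a (not necessarily contiguous) subsequence."""
    m = len(episode)
    # state[j] = probability that exactly the first j symbols of the episode have been matched greedily
    state = [0.0] * (m + 1)
    state[0] = 1.0
    for _ in range(w):
        nxt = [0.0] * (m + 1)
        nxt[m] = state[m]  # once the whole episode is matched, it stays matched
        for j in range(m):
            p = probs[episode[j]]
            nxt[j + 1] += state[j] * p    # the next needed symbol arrives
            nxt[j] += state[j] * (1 - p)  # any other symbol leaves the progress unchanged
        state = nxt
    return state[m]
```

For S = `ab` over a uniform binary alphabet, the result is 0.25 for w = 2 and 0.5 for w = 3 (the four strings aab, aba, abb, bab out of eight).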
A Fast Algorithm For Finding Frequent Episodes In Event Streams
Cited by 12 (3 self)
Frequent episode discovery is a popular framework for mining data available as a long sequence of events. An episode is essentially a short ordered sequence of event types, and the frequency of an episode is some suitable measure of how often the episode occurs in the data sequence. Recently, we proposed a new frequency measure for episodes based on the notion of non-overlapped occurrences of episodes in the event sequence, and showed that such a definition, in addition to yielding computationally efficient algorithms, has some important theoretical properties in connecting frequent episode discovery with HMM learning. This paper presents some new algorithms for frequent episode discovery under this non-overlapped occurrences-based frequency definition. The algorithms presented here are better (by a factor of N, where N denotes the size of episodes being discovered) in terms of both time and space complexities when compared to existing methods for frequent episode discovery. We show through simulation experiments that our algorithms are very efficient. The new algorithms presented here have arguably the least possible orders of space and time complexities for the task of frequent episode discovery.
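To make the non-overlapped frequency concrete, here is a minimal sketch that counts non-overlapped occurrences of a single serial episode with one left-to-right pass. It is a simple greedy scan written for illustration, not the algorithms of the paper (which handle many candidate episodes at once), and `count_non_overlapped` is a hypothetical name.

```python
from typing import Sequence


def count_non_overlapped(events: Sequence[str], episode: Sequence[str]) -> int:
    """Count non-overlapped occurrences of a serial episode in an event sequence."""
    if not episode:
        raise ValueError("episode must contain at least one event type")
    count = 0
    pos = 0  # index of the next event type of the episode that we are waiting for
    for event in events:
        if event == episode[pos]:
            pos += 1
            if pos == len(episode):
                # An occurrence is complete; start looking for the next one from scratch
                count += 1
                pos = 0
    return count
```

For the episode A, B, C in the event sequence A A B C B A C B C, the scan completes one occurrence at the first C and a second at the last C, giving a count of 2.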
Rare Events and Conditional Events on Random Strings
- DMTCS
, 2004
Cited by 10 (2 self)
this paper is twofold. First, a single word is given. We study the tail distribution of the number of its occurrences. Sharp large deviation estimates are derived. Second, we assume that a given word is overrepresented. The conditional distribution of a second word is studied; formulae for the expectation and the variance are derived. In both cases, the formulae are precise and can be computed efficiently. These results have applications in computational biology, where a genome is viewed as a text
Analysis of the average depth in a suffix tree under a Markov model
- In International Conference on the Analysis of Algorithms
, 2005
Cited by 10 (4 self)
In this report, we prove that under a Markovian model of order one, the average depth of suffix trees of index n is asymptotically similar to the average depth of tries (a.k.a. digital trees) built on n independent strings. This leads to an asymptotic behavior of (log n)/h + C for the average of the depth of the suffix tree, where h is the entropy of the Markov model and C is a constant. Our proof compares the generating functions for the average depth in tries and in suffix trees; the difference between these generating functions is shown to be asymptotically small. We conclude by using the asymptotic behavior of the average depth in a trie under the Markov model found by Jacquet and Szpankowski ([4]).
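The leading term (log n)/h can be evaluated numerically once the entropy h of the Markov model is known. The sketch below computes the entropy rate of a first-order chain from its transition matrix and then the leading term; the constant C is not computed. The function names are hypothetical, and the power iteration assumes an irreducible, aperiodic chain so that it converges to the stationary distribution.

```python
import math
from typing import List


def stationary(P: List[List[float]], iters: int = 1000) -> List[float]:
    """Approximate the stationary distribution by power iteration (assumes an ergodic chain)."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


def entropy_rate(P: List[List[float]]) -> float:
    """h = -sum_i pi_i sum_j P_ij ln P_ij, in nats."""
    n = len(P)
    pi = stationary(P)
    return -sum(
        pi[i] * P[i][j] * math.log(P[i][j])
        for i in range(n)
        for j in range(n)
        if P[i][j] > 0
    )


def leading_depth_term(n: int, P: List[List[float]]) -> float:
    """Leading term (ln n)/h of the average depth; the additive constant C is omitted."""
    return math.log(n) / entropy_rate(P)
```

For the memoryless uniform binary source, written as a chain with all transition probabilities equal to 1/2, h = ln 2 and the leading term for n = 1024 is 10.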
Assessing Statistical Significance of Overrepresented Oligonucleotides
, 2001
Cited by 8 (2 self)
Assessing statistical significance of overrepresentation of exceptional words is becoming an important task in computational biology.
On the Number of Occurrences of a Symbol in Words of Regular Languages
, 2002
Cited by 7 (4 self)
We study the random variable $Y_n$ representing the number of occurrences of a symbol $a$ in a word of length $n$ chosen at random in a regular language $L \subseteq \{a, b\}^*$, where the random choice is defined via a nonnegative rational formal series $r$ of support $L$. Assuming that the transition matrix associated with $r$ is primitive, we obtain asymptotic estimates for the mean value and the variance of $Y_n$ and present a central limit theorem for its distribution. Under a further condition on such a matrix, we also derive an asymptotic approximation of the discrete Fourier transform of $Y_n$ that allows us to prove a local limit theorem for $Y_n$. Further consequences of our analysis concern the growth of the coefficients in rational formal series; in particular, it turns out that, for a wide class of regular languages $L$, the maximum number of words of length $n$ in $L$ having the same number of occurrences of a given symbol is of the order of growth $\lambda^n / \sqrt{n}$, for some constant $\lambda > 1$.
Error resilient LZ’77 data compression: Algorithms, analysis, and experiments. (Under submission
- IEEE Trans. Information Theory
, 2006
Cited by 6 (2 self)
We propose a joint source–channel coding algorithm capable of correcting some errors in the popular Lempel–Ziv’77 (LZ’77) scheme without introducing any measurable degradation in the compression performance. This can be achieved because the LZ’77 encoder does not completely eliminate the redundancy present in the input sequence. One source of redundancy can be observed when an LZ’77 phrase has multiple matches. In this case, LZ’77 can issue a pointer to any of those matches, and a particular choice carries some additional bits of information. We call a scheme with embedded redundant information the LZS’77 algorithm. We analyze the number of longest matches in such a scheme and prove that it follows the logarithmic series distribution with mean $1/h$ (plus some fluctuations), where $h$ is the source entropy. Thus, the distribution associated with the number of redundant bits is well concentrated around its mean, a highly
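The redundancy described above, an LZ’77 phrase with several equally long matches, can be shown with a brute-force sketch. It finds every earlier start position that gives the longest match for the phrase beginning at position i, and reports how many whole bits a choice among q candidates could carry. The figure floor(log2 q) is an assumption made for this illustration, the function names are hypothetical, and the quadratic search is not how a practical encoder works.

```python
from typing import List, Tuple


def longest_matches(text: str, i: int) -> Tuple[int, List[int]]:
    """Return (length, starts): the longest match for text[i:] and all earlier starts achieving it."""
    best_len = 0
    starts: List[int] = []
    for j in range(i):
        length = 0
        # The match may run past position i into the phrase itself, as in standard LZ'77
        while i + length < len(text) and text[j + length] == text[i + length]:
            length += 1
        if length > best_len:
            best_len, starts = length, [j]
        elif length == best_len and length > 0:
            starts.append(j)
    return best_len, starts


def redundant_bits(q: int) -> int:
    """Whole bits that a choice among q equally good pointers can encode (assumed floor(log2 q))."""
    return q.bit_length() - 1 if q > 0 else 0
```

In the text `ababab`, the phrase starting at position 4 has a longest match of length 2 at both positions 0 and 2, so q = 2 and the pointer choice can carry one extra bit.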

