Results 1 - 10 of 14
Motif Statistics
, 1999
Cited by 36 (3 self)
We present a complete analysis of the statistics of the number of occurrences of a regular expression pattern in a random text. This covers "motifs" widely used in computational biology. Our approach is based on: (i) a constructive approach to classical results in theoretical computer science (automata and formal language theory), in particular, the rationality of generating functions of regular languages; (ii) analytic combinatorics that is used for deriving asymptotic properties from generating functions; (iii) computer algebra for determining generating functions explicitly, analysing generating functions and extracting coefficients efficiently. We provide constructions for overlapping or non-overlapping matches of a regular expression. A companion implementation produces multivariate generating functions for the statistics under study. A fast computation of Taylor coefficients of the generating functions then yields exact values of the moments with typical application to random t...
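As a rough illustration of the automaton-based counting this abstract describes, the sketch below computes the exact distribution of the number of overlapping occurrences of a single word in a memoryless random text, by dynamic programming over a KMP-style automaton. It is not the companion implementation mentioned above: the restriction to one word instead of a general regular expression, and the names `build_automaton` and `occurrence_distribution`, are assumptions made for illustration.

```python
from fractions import Fraction


def build_automaton(pattern, alphabet):
    """KMP-style DFA. State k means: the longest suffix of the text read
    so far that is also a prefix of `pattern` has length k."""
    m = len(pattern)
    delta = [dict() for _ in range(m + 1)]
    for state in range(m + 1):
        for ch in alphabet:
            s = pattern[:state] + ch
            k = min(m, len(s))
            # Shrink k until the last k letters of s match a pattern prefix.
            while k > 0 and s[-k:] != pattern[:k]:
                k -= 1
            delta[state][ch] = k
    return delta


def occurrence_distribution(pattern, probs, n):
    """Return {k: P(exactly k overlapping occurrences)} for a random text
    of length n whose letters are drawn independently with law `probs`."""
    alphabet = sorted(probs)
    delta = build_automaton(pattern, alphabet)
    m = len(pattern)
    dist = {(0, 0): Fraction(1)}  # (automaton state, count) -> probability
    for _ in range(n):
        nxt = {}
        for (state, count), p in dist.items():
            for ch in alphabet:
                t = delta[state][ch]
                key = (t, count + (1 if t == m else 0))
                nxt[key] = nxt.get(key, 0) + p * probs[ch]
        dist = nxt
    result = {}
    for (_, count), p in dist.items():
        result[count] = result.get(count, 0) + p
    return result
```

With exact rational probabilities the moments come out exactly; for instance, the mean of the returned distribution for the word `aa` over a uniform binary alphabet and `n = 3` equals `(n - 1) / 4`.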
Euclidean algorithms are gaussian
- Journal of Number Theory, Volume 110, Issue
, 2006
Cited by 18 (8 self)
We obtain a Central Limit Theorem for a general class of additive parameters (costs, observables) associated to three standard Euclidean algorithms, with optimal speed of convergence. We also provide very precise asymptotic estimates and error terms for the mean and variance of such parameters. For costs that are lattice (including the number of steps), we go further and establish a Local Limit Theorem, with optimal speed of convergence. We view an algorithm as a dynamical system restricted to rational inputs, and combine tools imported from dynamics, such as transfer operators, with various other techniques: Dirichlet series, Perron’s formula, quasi-powers theorems, and the saddle-point method. Such dynamical analyses had previously been used to perform the average-case analysis of algorithms. For the present (dynamical) analysis in distribution, we require estimates on transfer operators when a parameter varies along vertical lines in the complex plane. To prove them, we adapt techniques introduced recently by Dolgopyat in the context of continuous-time dynamics [16].
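The simplest cost covered by this abstract, the number of division steps, can be explored numerically. The sketch below counts the steps of the standard Euclidean algorithm and reports the empirical mean and variance over all pairs 1 <= u <= v <= N. The choice of input set and the function names are assumptions for illustration; the classical leading term for the mean, (12 ln 2 / π²) ln N, is quoted only as a point of comparison and is not verified by the code.

```python
def euclid_steps(u, v):
    """Number of divisions performed by the standard Euclidean algorithm
    on the pair (u, v), with u <= v."""
    steps = 0
    while u != 0:
        u, v = v % u, u
        steps += 1
    return steps


def step_statistics(N):
    """Empirical mean and variance of the number of steps over all pairs
    (u, v) with 1 <= u <= v <= N."""
    counts = [euclid_steps(u, v)
              for v in range(1, N + 1)
              for u in range(1, v + 1)]
    mean = sum(counts) / len(counts)
    var = sum((c - mean) ** 2 for c in counts) / len(counts)
    return mean, var
```

Plotting a histogram of the step counts for increasing `N` is one way to observe the Gaussian shape the abstract refers to, although a simulation of this kind says nothing about the speed of convergence.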
Analysis of the average depth in a suffix tree under a Markov model
- In International Conference on the Analysis of Algorithms
, 2005
Cited by 10 (4 self)
In this report, we prove that under a Markovian model of order one, the average depth of suffix trees of index n is asymptotically similar to the average depth of tries (a.k.a. digital trees) built on n independent strings. This leads to an asymptotic behavior of (log n)/h + C for the average depth of the suffix tree, where h is the entropy of the Markov model and C is a constant. Our proof compares the generating functions for the average depth in tries and in suffix trees; the difference between these generating functions is shown to be asymptotically small. We conclude by using the asymptotic behavior of the average depth in a trie under the Markov model found by Jacquet and Szpankowski [4].
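To make the leading term (log n)/h concrete, the sketch below computes the entropy h of a first-order Markov chain from its transition matrix and evaluates (ln n)/h. It assumes an irreducible, aperiodic chain so that power iteration converges to the stationary distribution; the constant C is not computed, and all function names are hypothetical.

```python
import math


def stationary_distribution(P, iterations=1000):
    """Approximate the stationary distribution of the transition matrix P
    (a list of rows) by power iteration from the uniform distribution."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


def markov_entropy(P):
    """Entropy rate h = -sum_i pi_i sum_j P_ij ln P_ij, in nats."""
    pi = stationary_distribution(P)
    return -sum(pi[i] * p * math.log(p)
                for i, row in enumerate(P)
                for p in row if p > 0)


def leading_depth_term(n, P):
    """Leading term (ln n) / h of the average depth; the additive
    constant C from the abstract is deliberately left out."""
    return math.log(n) / markov_entropy(P)
```

For a uniform binary source the entropy is ln 2, so `leading_depth_term(1024, [[0.5, 0.5], [0.5, 0.5]])` is log2(1024) up to rounding.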
Pattern matching statistics on correlated sources
- In Proc. of LATIN’06 (2006)
, 1992
Cited by 5 (0 self)
In pattern matching algorithms, two characteristic parameters play an important rôle: the number of occurrences of a given pattern, and the number of positions where a pattern occurrence ends. Since many occurrences may end at the same position, these two parameters may differ significantly. Here, we consider a general framework where the text is produced by a probabilistic source, which can be built by a dynamical system. Such “dynamical sources” encompass the classical sources (memoryless sources and Markov chains) and may possess a high degree of correlation. We are mainly interested in two situations: (i) the pattern is a general word of a regular expression, and we study the number of occurrence positions; (ii) the pattern is a finite set of strings, and we study the number of occurrences. In both cases, we determine the mean and the variance of the parameter, and prove that its distribution is asymptotically Gaussian. In this way, we extend methods and results already obtained for classical sources (for instance in [9] and [6]) to this general “dynamical” framework. Our methods use various techniques: formal languages and generating functions, as in previous works. However, in this correlated model, it is not possible to use a direct transfer into generating functions, and we mainly deal with generating operators which generate... generating functions.
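The distinction between the two parameters in this abstract can be seen on a small example. The sketch below counts, for a finite set of strings, both the number of occurrences and the number of distinct positions at which some occurrence ends. The function name and the brute-force scan are assumptions for illustration and do not reflect the operator-based methods of the paper.

```python
def occurrence_counts(text, patterns):
    """Return (number of occurrences, number of distinct end positions)
    of the given finite set of non-empty patterns in text."""
    occurrences = 0
    end_positions = set()
    for pat in set(patterns):
        if not pat:
            continue  # the empty word is ignored in this sketch
        start = text.find(pat)
        while start != -1:
            occurrences += 1
            end_positions.add(start + len(pat))
            # Move one position forward so overlapping matches are found.
            start = text.find(pat, start + 1)
    return occurrences, len(end_positions)
```

In the text `abab` with the pattern set {`ab`, `b`}, every occurrence of `ab` ends where an occurrence of `b` ends, so there are four occurrences but only two end positions.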
Hidden Word Statistics
Cited by 4 (1 self)
We consider the sequence comparison problem, also known as "hidden" pattern problem, where one searches for a given subsequence in a text (rather than a string understood as a sequence of consecutive symbols). A characteristic parameter is...
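To illustrate what searching for a subsequence rather than a string means, the sketch below counts the number of ways a word appears in a text as a subsequence (not necessarily at consecutive positions), using a standard dynamic program. Treating this count as the quantity of interest is an assumption, since the abstract is cut off before naming its characteristic parameter; the function name is hypothetical.

```python
def count_hidden_occurrences(text, word):
    """Number of position tuples i1 < i2 < ... < im such that
    text[i1] text[i2] ... text[im] spells out word."""
    # ways[j] = number of ways to match the first j letters of word
    # inside the part of the text read so far.
    ways = [1] + [0] * len(word)
    for ch in text:
        # Go right to left so each text letter is used at most once
        # per partial match.
        for j in range(len(word), 0, -1):
            if word[j - 1] == ch:
                ways[j] += ways[j - 1]
    return ways[-1]
```

For the text `abab` and the word `ab`, the three hidden occurrences use the position pairs (1, 2), (1, 4) and (3, 4), whereas `ab` occurs only twice as a block of consecutive symbols.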
CONCENTRATION INEQUALITIES AND ESTIMATION OF CONDITIONAL PROBABILITIES
Cited by 3 (1 self)
We prove concentration inequalities inspired by [DP] to obtain estimators of conditional probabilities for weakly dependent sequences. This generalizes results from Csiszár ([Cs]). For Gibbs measures and dynamical systems, these results lead to estimators of the potential function and also to a test of the nullity of the asymptotic variance of the system. This paper deals with the problems of typicality and conditional typicality of “empirical probabilities” for stochastic processes and the estimation of potential functions for Gibbs measures and dynamical systems. The questions of typicality have been studied in [FKT] for independent sequences, and in [BRY, R] for Markov chains. In order to prove the consistency of estimators of transition probabilities for Markov chains of unknown order, results on typicality and conditional typicality for some Ψ-mixing processes were obtained in [CsS, Cs]. Unfortunately, many natural mixing processes do not satisfy this Ψ-mixing condition (see [DP]). We consider a class of mixing processes inspired by [DP]. For this class, we prove strong typicality and strong conditional typicality. In the particular case of Gibbs measures (or chains with complete connections) and for certain dynamical systems, we derive from the typicality results an estimation of the potential as well as a procedure to test the nullity of the asymptotic variance of the process. More formally, we consider a stochastic process $X_0, \ldots, X_n, \ldots$ taking values in a complete set $\Sigma$ and a sequence of countable partitions of $\Sigma$, $(\mathcal{P}_k)_{k \in \mathbb{N}}$, such that if $P \in \mathcal{P}_k$ then there exists a unique $\widetilde{P} \in \mathcal{P}_{k-1}$ such that, almost surely, $X_j \in P \Rightarrow X_{j-1} \in \widetilde{P}$. Our aim is to obtain empirical estimates on the probabilities $\mathbb{P}(X_j \in P)$, $P \in \mathcal{P}_k$, and the conditional probabilities:
Average Redundancy for Known Sources: Ubiquitous Trees in Source Coding
, 2008
Cited by 2 (0 self)
Analytic information theory aims at studying problems of information theory using analytic techniques of computer science and combinatorics. Following Hadamard’s precept, these problems are tackled by complex analysis methods such as generating functions, Mellin transform, Fourier series, saddle point method, analytic poissonization and depoissonization, and singularity analysis. This approach lies at the crossroads of computer science and information theory. In this survey we concentrate on one facet of information theory (i.e., source coding, better known as data compression), namely the redundancy rate problem. The redundancy rate problem determines by how much the actual code length exceeds the optimal code length. We further restrict our interest to the average redundancy for known sources, that is, when statistics of information sources are known. We present precise analyses of three types of lossless data compression schemes, namely fixed-to-variable (FV) length codes, variable-to-fixed (VF) length codes, and variable-to-variable (VV) length codes. In particular, we investigate average redundancy of Huffman, Tunstall, and Khodak codes. These codes have succinct representations as trees, either as coding or parsing trees, and we analyze here some of their parameters (e.g., the average path from the root to a leaf).
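For a known source, the average redundancy of a fixed-to-variable code can be computed directly. The sketch below builds Huffman codeword lengths for a single-symbol source with a binary heap and returns the average code length minus the entropy, in bits. Coding one symbol at a time (rather than blocks) and the function names are simplifying assumptions, not the survey's setting.

```python
import heapq
import math


def huffman_lengths(probs):
    """Codeword lengths of a binary Huffman code for the given
    probabilities (each leaf depth in the coding tree)."""
    if len(probs) == 1:
        return [1]
    # Heap entries: (probability, tie-breaker, symbols in the subtree).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    counter = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for i in s1 + s2:
            lengths[i] += 1  # merged symbols move one level deeper
        heapq.heappush(heap, (p1 + p2, counter, s1 + s2))
        counter += 1
    return lengths


def average_redundancy(probs):
    """Average Huffman code length minus the source entropy, in bits."""
    lengths = huffman_lengths(probs)
    avg_len = sum(p * l for p, l in zip(probs, lengths))
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    return avg_len - entropy
```

The redundancy vanishes for dyadic probabilities such as (1/2, 1/4, 1/4) and exceeds half a bit for a skewed binary source such as (0.9, 0.1), where each symbol still costs one full bit.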
Euclidean dynamics
- Discrete and Continuous Dynamical Systems
Cited by 1 (0 self)
We study a general class of Euclidean algorithms which compute the greatest common divisor (gcd), and we perform probabilistic analyses of their main parameters. We view an algorithm as a dynamical system restricted to rational inputs, and combine tools imported from dynamics, such as transfer operators, with various tools of analytic combinatorics: generating functions, Dirichlet series, Tauberian theorems, Perron’s formula and quasi-powers theorems. Such dynamical analyses can be used to perform the average-case analysis of algorithms, but also (dynamical) analysis in distribution.
Statistical properties of Markov dynamical sources: applications to information theory
- Discrete Math. Theor. Comput. Sci
Cited by 1 (1 self)
In (V1), the author studies statistical properties of words generated by dynamical sources. This is done using generalized Ruelle operators. The aim of this article is to generalize the notion of sources for which the results hold. First, we avoid the use of Grothendieck theory and Fredholm determinants; this allows dynamical sources that cannot be extended to a complex disk or that are not analytic. Second, we consider Markov sources: the language generated by the source over an alphabet M is not necessarily M*.
ON BUFFON MACHINES AND NUMBERS
, 906
Cited by 1 (0 self)
Buffon’s needle experiment is well-known: take a plane on which parallel lines at unit distance one from the next have been marked, throw a needle of unit length at random, and, finally, declare the experiment a success if the needle intersects one of the lines. Basic calculus implies that the probability of success is 2/π ≈ 0.63661, and the experiment can be regarded as an analog (i.e., continuous) device that stochastically “computes” 2/π. Generalizing the experiment and simplifying the computational framework, we ask ourselves which probability distributions can be produced perfectly from a discrete source of unbiased coin flips. We describe and analyse a few simple Buffon machines that can generate geometric, Poisson, and logarithmic-series distributions (these are in particular required to transform continuous Boltzmann samplers of classical combinatorial structures into purely discrete random generators). Say that a number is Buffon if it is the probability of success of a probabilistic experiment based on discrete coin flippings. We provide human-accessible Buffon machines, which require a dozen coin flips or less, on average, and produce experiments whose probabilities are expressible in terms of numbers such as π, exp(−1), log 2, √3, cos(1/4), ζ(5). More generally, we develop a collection of constructions based on simple probabilistic mechanisms that enable one to create Buffon experiments involving compositions of exponentials and logarithms, polylogarithms, direct and inverse trigonometric functions, algebraic and hypergeometric functions, as well as functions defined by integrals, such as the Gaussian error function.
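One elementary way to produce a prescribed probability exactly from unbiased coin flips is to compare a lazily generated uniform number with the binary expansion of the target, bit by bit. The sketch below does this for a rational target p. It is a standard textbook construction offered as an assumption-laden illustration, not necessarily one of the machines of the paper; the function name and the `flip` callback are hypothetical.

```python
import random
from fractions import Fraction


def bernoulli_from_flips(p, flip):
    """Return 1 with probability exactly p (a rational in [0, 1]),
    using only calls to flip(), which must return fair bits 0 or 1.

    The flips are read as the binary digits of a uniform number U,
    compared digit by digit with the binary expansion of p; the
    answer is 1 exactly when U < p."""
    p = Fraction(p)
    if p >= 1:
        return 1
    while p > 0:
        p *= 2
        p_bit = 1 if p >= 1 else 0  # next binary digit of the target
        p -= p_bit
        u_bit = flip()              # next binary digit of U
        if u_bit < p_bit:
            return 1                # U < p is now certain
        if u_bit > p_bit:
            return 0                # U > p is now certain
    return 0  # remaining digits of p are all zero, hence U >= p


rng = random.Random(0)
sample = [bernoulli_from_flips(Fraction(1, 3), lambda: rng.getrandbits(1))
          for _ in range(20000)]
```

Each comparison settles the outcome with probability 1/2, so the procedure consumes at most two flips on average, whatever the target.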

