Motif Statistics
, 1999
Cited by 48 (4 self)
We present a complete analysis of the statistics of number of occurrences of a regular expression pattern in a random text. This covers "motifs" widely used in computational biology. Our approach is based on: (i) a constructive approach to classical results in theoretical computer science (automata and formal language theory), in particular, the rationality of generating functions of regular languages; (ii) analytic combinatorics that is used for deriving asymptotic properties from generating functions; (iii) computer algebra for determining generating functions explicitly, analysing generating functions and extracting coefficients efficiently. We provide constructions for overlapping or nonoverlapping matches of a regular expression. A companion implementation produces multivariate generating functions for the statistics under study. A fast computation of Taylor coefficients of the generating functions then yields exact values of the moments with typical application to random t...
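The exact moments this abstract computes via generating functions can be cross-checked on tiny instances by brute-force enumeration. The sketch below (an illustration, not the paper's automaton construction) enumerates all texts of a given length over a uniform alphabet and computes the exact mean and variance of the number of overlapping occurrences of a string pattern:

```python
from itertools import product

def count_occurrences(text, pat):
    """Number of (possibly overlapping) occurrences of pat in text."""
    return sum(text[i:i + len(pat)] == pat
               for i in range(len(text) - len(pat) + 1))

def exact_moments(pat, n, alphabet="ab"):
    """Exact mean and variance of the occurrence count over all uniform
    random texts of length n, by full enumeration (feasible for small n)."""
    counts = [count_occurrences("".join(t), pat)
              for t in product(alphabet, repeat=n)]
    m = sum(counts) / len(counts)
    v = sum((c - m) ** 2 for c in counts) / len(counts)
    return m, v

# For pat = "aa" over a uniform binary alphabet, each of the n - 1 windows
# matches with probability 1/4, so by linearity the mean is (n - 1) / 4.
mean, var = exact_moments("aa", 10)
print(mean, var)
```

For real regular-expression motifs this enumeration is hopeless; the paper's point is that the automaton and its rational generating function deliver the same moments in polynomial time.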
Euclidean algorithms are Gaussian
, 2003
Cited by 22 (10 self)
Abstract. We prove a Central Limit Theorem for a general class of cost parameters associated to the three standard Euclidean algorithms, with optimal speed of convergence, and error terms for the mean and variance. For the most basic parameter of the algorithms, the number of steps, we go further and prove a Local Limit Theorem (LLT), with speed of convergence O((log N)^(−1/4+ε)). This extends and improves the LLT obtained by Hensley [27] in the case of the standard Euclidean algorithm. We use a “dynamical analysis” methodology, viewing an algorithm as a dynamical system (restricted to rational inputs), and combining tools imported from dynamics, such as the crucial transfer operators, with various other techniques: Dirichlet series, Perron’s formula, quasi-powers theorems, the saddle-point method. Dynamical analysis had previously been used to perform average-case analysis of algorithms. For the present (dynamical) analysis in distribution, we require precise estimates on the transfer operators when a parameter varies along vertical lines in the complex plane. Such estimates build on results obtained only recently by Dolgopyat in the context of continuous-time dynamics [20].
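The cost parameter at stake can be observed empirically. The sketch below (a simulation, not the paper's transfer-operator analysis) counts division steps of the standard Euclidean algorithm on random inputs; the classical average-case result puts the mean near (12 ln 2 / π²) ln N, and the theorem above says the fluctuations around it are asymptotically Gaussian:

```python
import math, random

def euclid_steps(u, v):
    """Number of division steps performed by the standard Euclidean algorithm."""
    steps = 0
    while v:
        u, v = v, u % v
        steps += 1
    return steps

random.seed(1)
N = 10 ** 6
sample = [euclid_steps(random.randrange(1, N), random.randrange(1, N))
          for _ in range(20000)]
mean = sum(sample) / len(sample)
var = sum((s - mean) ** 2 for s in sample) / len(sample)
# Classical average-case constant (Heilbronn, Dixon): (12 ln 2 / pi^2) ln N.
print(mean, (12 * math.log(2) / math.pi ** 2) * math.log(N), var)
```

Fibonacci pairs realize the worst case: `euclid_steps(13, 8)` takes 5 steps, far above the mean for inputs of that size.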
Analysis of the average depth in a suffix tree under a Markov model
 In International Conference on the Analysis of Algorithms
, 2005
Cited by 13 (4 self)
In this report, we prove that under a Markovian model of order one, the average depth of suffix trees of index n is asymptotically similar to the average depth of tries (a.k.a. digital trees) built on n independent strings. This leads to an asymptotic behavior of (log n)/h + C for the average depth of the suffix tree, where h is the entropy of the Markov model and C is a constant. Our proof compares the generating functions for the average depth in tries and in suffix trees; the difference between these generating functions is shown to be asymptotically small. We conclude by using the asymptotic behavior of the average depth in a trie under the Markov model found by Jacquet and Szpankowski [4].
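The trie side of this comparison is easy to simulate. The sketch below (an illustration under a memoryless source, simpler than the paper's Markov model) uses the fact that a string's leaf depth in a trie is one more than its longest common prefix with any other string, and compares the average depth to (log n)/h:

```python
import math, random

def lcp(a, b):
    """Length of the longest common prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def average_trie_depth(strings):
    """Average leaf depth in a trie built on distinct strings: each string
    sits at depth 1 + (longest common prefix with any other string)."""
    return sum(1 + max(lcp(s, t) for j, t in enumerate(strings) if j != i)
               for i, s in enumerate(strings)) / len(strings)

random.seed(0)
p = 0.3                                             # P('0') for the source
h = -p * math.log(p) - (1 - p) * math.log(1 - p)    # entropy, in nats
n = 500
strings = ["".join("0" if random.random() < p else "1" for _ in range(64))
           for _ in range(n)]
d = average_trie_depth(strings)
print(d, math.log(n) / h)        # the two values differ by a bounded constant
```

The paper's contribution is showing that suffix trees, whose strings are strongly dependent suffixes of one text, exhibit the same leading term.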
Hidden Word Statistics
Cited by 9 (3 self)
We consider the sequence comparison problem, also known as "hidden" pattern problem, where one searches for a given subsequence in a text (rather than a string understood as a sequence of consecutive symbols). A characteristic parameter is...
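The characteristic parameter this abstract alludes to is the number of hidden occurrences, which a short dynamic program computes exactly. This sketch is a standard subsequence-counting DP, offered as an illustration of the parameter rather than as the paper's analysis:

```python
def hidden_occurrences(text, w):
    """Number of ways w occurs in text as a subsequence (a 'hidden' pattern).
    dp[j] = number of ways to match the first j letters of w so far."""
    dp = [1] + [0] * len(w)
    for c in text:
        # Traverse backwards so each text position extends each partial
        # match at most once.
        for j in range(len(w) - 1, -1, -1):
            if w[j] == c:
                dp[j + 1] += dp[j]
    return dp[len(w)]

# "ab" hides in "abab" three times: positions (1,2), (1,4), (3,4).
print(hidden_occurrences("abab", "ab"))
```

Over a random text of length n, the count for a hidden word of length m aggregates C(n, m) position tuples, which is what makes its distribution delicate to analyze.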
Digital Trees and Memoryless Sources: from Arithmetics to Analysis
 21st International Meeting on Probabilistic, Combinatorial, and Asymptotic Methods in the Analysis of Algorithms (AofA’10), Discrete Math. Theor. Comput. Sci. Proc
, 2010
Cited by 7 (1 self)
Digital trees, also known as “tries”, are fundamental to a number of algorithmic schemes, including radix-based searching and sorting, lossless text compression, dynamic hashing algorithms, communication protocols of the tree or stack type, distributed leader election, and so on. This extended abstract develops the asymptotic form of expectations of the main parameters of interest, such as tree size and path length. The analysis is conducted under the simplest of all probabilistic models; namely, the memoryless source, under which the letters composing data items are drawn independently from a fixed (finite) probability distribution. The precise asymptotic structure of the parameters’ expectations is shown to depend on fine singular properties in the complex plane of a ubiquitous Dirichlet series. Consequences include the characterization of a broad range of asymptotic regimes for error terms associated with trie parameters, as well as a classification that depends on specific arithmetic properties, especially irrationality measures, of the sources under consideration.
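The two parameters named here (tree size and path length) can be measured directly on simulated tries. The sketch below is an illustrative measurement under an unbiased memoryless source, not the paper's singularity analysis; it uses the observation that internal nodes correspond to prefixes shared by at least two strings:

```python
import random
from collections import Counter

def trie_stats(strings):
    """For distinct strings: internal nodes of the trie are the prefixes
    shared by at least two strings; each string's leaf depth is the length
    of its shortest prefix shared with no other string."""
    pref = Counter(s[:k] for s in strings for k in range(len(s) + 1))
    size = sum(1 for c in pref.values() if c >= 2)
    path = sum(min(k for k in range(len(s) + 1) if pref[s[:k]] == 1)
               for s in strings)
    return size, path

# Tiny check: trie on {0011, 0101, 1100} has internal nodes {"", "0"}
# and leaf depths 2, 2, 1.
print(trie_stats(["0011", "0101", "1100"]))

random.seed(4)
n = 1000
strings = ["".join(random.choice("01") for _ in range(60)) for _ in range(n)]
size, path = trie_stats(strings)
# For the unbiased binary source, E[size]/n is close to 1/ln 2 ~ 1.44 and
# path/n is close to log2(n), both with the small oscillations the
# abstract's Dirichlet-series analysis makes precise.
print(size / n, path / n)
```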
Pattern matching statistics on correlated sources
 In Proc. of LATIN’06
, 2006
Cited by 6 (1 self)
Abstract. In pattern matching algorithms, two characteristic parameters play an important rôle: the number of occurrences of a given pattern, and the number of positions where a pattern occurrence ends. Since many occurrences may end at the same position, these two parameters can differ significantly. Here, we consider a general framework where the text is produced by a probabilistic source built from a dynamical system. Such “dynamical sources” encompass the classical sources (memoryless sources and Markov chains) and may possess a high degree of correlation. We are mainly interested in two situations: (i) the pattern is a general word of a regular expression, and we study the number of occurrence positions; (ii) the pattern is a finite set of strings, and we study the number of occurrences. In both cases, we determine the mean and the variance of the parameter and prove that its distribution is asymptotically Gaussian. In this way, we extend methods and results already obtained for classical sources [for instance in [9] and [6]] to this general “dynamical” framework. Our methods use various techniques: formal languages and generating functions, as in previous works. However, in this correlated model a direct transfer into generating functions is not possible, and we mainly deal with generating operators which generate... generating functions.
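The distinction between the two parameters is concrete. The sketch below (a toy counter, not the paper's dynamical-source analysis) counts both total occurrences of a finite set of strings and distinct end positions, which differ as soon as several patterns end at the same place:

```python
def occurrences_and_positions(text, patterns):
    """Total occurrences of a finite set of strings in text, and the number
    of distinct positions where some occurrence ends."""
    occ = 0
    ends = set()
    for p in patterns:
        for i in range(len(text) - len(p) + 1):
            if text[i:i + len(p)] == p:
                occ += 1
                ends.add(i + len(p) - 1)
    return occ, len(ends)

# "ab" and "b" end at exactly the same two positions of "abab",
# so 4 occurrences collapse onto 2 occurrence positions.
occ, pos = occurrences_and_positions("abab", ["ab", "b"])
print(occ, pos)
```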
CONCENTRATION INEQUALITIES AND ESTIMATION OF CONDITIONAL PROBABILITIES
Cited by 6 (1 self)
Abstract. We prove concentration inequalities inspired by [DP] to obtain estimators of conditional probabilities for weakly dependent sequences. This generalizes results of Csiszár [Cs]. For Gibbs measures and dynamical systems, these results lead to estimators of the potential function and to a test of the nullity of the asymptotic variance of the system. This paper deals with the problems of typicality and conditional typicality of “empirical probabilities” for stochastic processes, and with the estimation of potential functions for Gibbs measures and dynamical systems. Questions of typicality have been studied in [FKT] for independent sequences and in [BRY, R] for Markov chains. In order to prove the consistency of estimators of the transition probabilities of Markov chains of unknown order, results on typicality and conditional typicality for some Ψ-mixing processes were obtained in [CsS, Cs]. Unfortunately, many natural mixing processes do not satisfy this Ψ-mixing condition (see [DP]). We consider a class of mixing processes inspired by [DP]. For this class, we prove strong typicality and strong conditional typicality. In the particular case of Gibbs measures (or chains with complete connections) and for certain dynamical systems, we derive from the typicality results an estimator of the potential as well as a procedure to test the nullity of the asymptotic variance of the process. More formally, we consider a stochastic process X0, ..., Xn, ... taking values in a complete set Σ, and a sequence of countable partitions (Pk), k ∈ N, of Σ such that if P ∈ Pk then there exists a unique P′ ∈ Pk−1 such that, almost surely, Xj ∈ P ⇒ Xj−1 ∈ P′. Our aim is to obtain empirical estimates of the probabilities P(Xj ∈ P), P ∈ Pk, and of the conditional probabilities ...
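The "empirical probabilities" in question are block-counting estimators. The sketch below (a minimal illustration on a two-state Markov chain with made-up transition probabilities, far simpler than the weakly dependent processes the paper treats) estimates conditional probabilities from counts of (k+1)-blocks and k-blocks:

```python
import random
from collections import Counter

def empirical_conditionals(seq, k):
    """Empirical estimates of P(next symbol = b | previous k symbols = w),
    computed from counts of (k+1)-blocks and their k-symbol contexts."""
    blocks = Counter(tuple(seq[i:i + k + 1]) for i in range(len(seq) - k))
    ctx = Counter(tuple(seq[i:i + k]) for i in range(len(seq) - k))
    return {w: blocks[w] / ctx[w[:-1]] for w in blocks}

random.seed(2)
# Illustrative two-state chain: P(1|0) = 0.2, P(0|1) = 0.4.
x, seq = 0, []
for _ in range(200000):
    seq.append(x)
    x = (x ^ 1) if random.random() < (0.2 if x == 0 else 0.4) else x

est = empirical_conditionals(seq, 1)
print(est[(0, 1)], est[(1, 0)])   # close to the true 0.2 and 0.4
```

The paper's concentration inequalities quantify how fast such estimates converge when the process is only weakly dependent rather than Markov of known order.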
ON BUFFON MACHINES AND NUMBERS
Cited by 5 (0 self)
Abstract. Buffon’s needle experiment is well known: take a plane on which parallel lines at unit distance from one another have been marked, throw a needle of unit length at random, and declare the experiment a success if the needle intersects one of the lines. Basic calculus implies that the probability of success is 2/π ≈ 0.63661, so the experiment can be regarded as an analog (i.e., continuous) device that stochastically “computes” 2/π. Generalizing the experiment and simplifying the computational framework, we ask which probability distributions can be produced perfectly from a discrete source of unbiased coin flips. We describe and analyse a few simple Buffon machines that generate geometric, Poisson, and logarithmic-series distributions (these are in particular required to transform continuous Boltzmann samplers of classical combinatorial structures into purely discrete random generators). Say that a number is Buffon if it is the probability of success of a probabilistic experiment based on discrete coin flips. We provide human-accessible Buffon machines, which require a dozen coin flips or less on average, and produce experiments whose probabilities are expressible in terms of numbers such as π, exp(−1), log 2, √3, cos(1/4), ζ(5). More generally, we develop a collection of constructions based on simple probabilistic mechanisms that enable one to create Buffon experiments involving compositions of exponentials and logarithms, polylogarithms, direct and inverse trigonometric functions, algebraic and hypergeometric functions, as well as functions defined by integrals, such as the Gaussian error function.
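A minimal machine in this spirit, offered as an illustration rather than one of the paper's specific constructions: flipping an unbiased coin until heads yields a geometric(1/2) variable, and declaring success when two independent such variables coincide gives a pure coin-flip experiment whose success probability is the non-dyadic number 1/3:

```python
import random

def geometric(flip):
    """Count tails before the first head: a geometric(1/2) variable,
    built from unbiased coin flips only."""
    k = 0
    while not flip():
        k += 1
    return k

def buffon_one_third(flip):
    """Succeed iff two independent geometric(1/2) variables coincide;
    the success probability is sum over k of 4^-(k+1) = 1/3."""
    return geometric(flip) == geometric(flip)

random.seed(3)
flip = lambda: random.getrandbits(1) == 1
trials = 100000
freq = sum(buffon_one_third(flip) for _ in range(trials)) / trials
print(freq)   # close to 1/3
```

The machines of the paper compose mechanisms of exactly this discrete kind to reach constants such as exp(−1) and log 2.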
Euclidean dynamics
 Discrete and Continuous Dynamical Systems
Cited by 2 (1 self)
Abstract. We study a general class of Euclidean algorithms which compute the greatest common divisor [gcd], and we perform probabilistic analyses of their main parameters. We view an algorithm as a dynamical system restricted to rational inputs, and combine tools imported from dynamics, such as transfer operators, with various tools of analytic combinatorics: generating functions, Dirichlet series, Tauberian theorems, Perron’s formula and quasi-powers theorems. Such dynamical analyses can be used to perform the average-case analysis of algorithms, but also (dynamical) analysis in distribution.
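The "algorithm as dynamical system" viewpoint is concrete for the standard algorithm: running Euclid on (p, q) is iterating the Gauss map T(x) = 1/x − ⌊1/x⌋ on the rational q/p, with the quotients as the continued-fraction digits. A small sketch of this correspondence (using exact rational arithmetic to avoid floating-point drift):

```python
from fractions import Fraction

def euclid_quotients(p, q):
    """Quotient sequence produced by the Euclidean algorithm on (p, q)."""
    qs = []
    while q:
        qs.append(p // q)
        p, q = q, p % q
    return qs

def gauss_digits(x, nmax=100):
    """Continued-fraction digits of a rational x in (0, 1], computed by
    iterating the Gauss map T(x) = 1/x - floor(1/x)."""
    digits = []
    for _ in range(nmax):
        if x == 0:
            break
        inv = 1 / x
        digits.append(inv.numerator // inv.denominator)   # floor(1/x)
        x = inv - digits[-1]                              # apply T
    return digits

# Running Euclid on (13, 8) is iterating the Gauss map on 8/13:
print(euclid_quotients(13, 8), gauss_digits(Fraction(8, 13)))
```

Transfer operators average observables over the branches of T, which is how the dynamical toolbox reaches statistics of the algorithm.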
Average Redundancy for Known Sources: Ubiquitous Trees in Source Coding
, 2008
Cited by 2 (0 self)
Analytic information theory aims at studying problems of information theory using analytic techniques of computer science and combinatorics. Following Hadamard’s precept, these problems are tackled by complex-analysis methods such as generating functions, the Mellin transform, Fourier series, the saddle-point method, analytic Poissonization and de-Poissonization, and singularity analysis. This approach lies at the crossroads of computer science and information theory. In this survey we concentrate on one facet of information theory, namely source coding, better known as data compression, and specifically the redundancy rate problem: by how much does the actual code length exceed the optimal code length? We further restrict our interest to the average redundancy for known sources, that is, when the statistics of the information source are known. We present precise analyses of three types of lossless data compression schemes, namely fixed-to-variable (FV) length codes, variable-to-fixed (VF) length codes, and variable-to-variable (VV) length codes. In particular, we investigate the average redundancy of Huffman, Tunstall, and Khodak codes. These codes have succinct representations as trees, either coding or parsing trees, and we analyze some of their parameters (e.g., the average path from the root to a leaf).
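For the FV case, the redundancy of a Huffman code is directly computable: average codeword length minus source entropy. A small sketch (an illustration on a made-up four-symbol source, not the survey's asymptotic analysis):

```python
import heapq, math

def huffman_lengths(probs):
    """Codeword lengths of a binary Huffman code for the given probabilities."""
    heap = [(p, i, [i]) for i, p in enumerate(probs)]   # (weight, tiebreak, leaves)
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    while len(heap) > 1:
        p1, _, a = heapq.heappop(heap)
        p2, t, b = heapq.heappop(heap)
        for leaf in a + b:        # every leaf under the merged node gets deeper
            lengths[leaf] += 1
        heapq.heappush(heap, (p1 + p2, t, a + b))
    return lengths

probs = [0.4, 0.3, 0.2, 0.1]      # illustrative known source
L = huffman_lengths(probs)
avg = sum(p, l := None) if False else sum(p * l for p, l in zip(probs, L))
entropy = -sum(p * math.log2(p) for p in probs)
redundancy = avg - entropy
print(L, redundancy)              # redundancy is small but strictly positive
```

The survey's results describe precisely how this excess behaves (including its oscillations) as the block length grows, for Huffman as well as Tunstall and Khodak codes.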