Results 1–10 of 28
Loglog Counting of Large Cardinalities
In ESA, 2003
Abstract

Cited by 85 (3 self)
Using an auxiliary memory smaller than the size of this abstract, the LogLog algorithm makes it possible to estimate in a single pass and within a few percent the number of different words in the whole of Shakespeare's works. In general the LogLog algorithm makes use of m "small bytes" of auxiliary memory in order to estimate in a single pass the number of distinct elements (the "cardinality") in a file, and it does so with an accuracy that is of the order of 1/√m. The "small bytes" to be used in order to count cardinalities up to Nmax comprise about log log Nmax bits, so that cardinalities well in the range of billions can be determined using only one or two kilobytes of memory. The basic version of the LogLog algorithm is validated by a complete analysis. An optimized version, super-LogLog, is also engineered and tested on real-life data. The algorithm parallelizes optimally.
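The mechanism behind these "small bytes" can be sketched in a few lines. The version below is an illustrative simplification, not the paper's exact estimator: each item is hashed, the low-order k bits select one of m = 2^k registers, each register keeps the maximum rank (position of the lowest 1-bit) seen in the remaining hash bits, and the estimate is a constant times m·2^(mean register value). The choice of SHA-1 as the hash and the asymptotic constant 0.39701 are assumptions of this sketch.

```python
import hashlib

def loglog_estimate(items, k=10):
    """Illustrative LogLog sketch: estimate the number of distinct items."""
    m = 1 << k                  # number of registers ("small bytes")
    M = [0] * m
    for item in items:
        h = int(hashlib.sha1(str(item).encode()).hexdigest(), 16)
        j = h & (m - 1)         # register index from the low-order k bits
        w = h >> k              # remaining hash bits
        # rank = 1-based position of the lowest set bit (geometric(1/2))
        rank = (w & -w).bit_length() if w else 1
        M[j] = max(M[j], rank)
    alpha = 0.39701             # asymptotic bias-correction constant
    return alpha * m * 2 ** (sum(M) / m)
```

Each register holds only a number of order log log Nmax, which is where the algorithm's name and its tiny memory footprint come from.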
HyperLogLog: The analysis of a near-optimal cardinality estimation algorithm
In AofA ’07: Proceedings of the 2007 International Conference on Analysis of Algorithms, 2007
Abstract

Cited by 46 (1 self)
This extended abstract describes and analyses a near-optimal probabilistic algorithm, HYPERLOGLOG, dedicated to estimating the number of distinct elements (the cardinality) of very large data ensembles. Using an auxiliary memory of m units (typically, “short bytes”), HYPERLOGLOG performs a single pass over the data and produces an estimate of the cardinality such that the relative accuracy (the standard error) is typically about 1.04/√m. This improves on the best previously known cardinality estimator, LOGLOG, whose accuracy can be matched by consuming only 64% of the original memory. For instance, the new algorithm makes it possible to estimate cardinalities well beyond 10^9 with a typical accuracy of 2% while using a memory of only 1.5 kilobytes. The algorithm parallelizes optimally and adapts to the sliding window model.
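HyperLogLog's key change relative to LogLog can also be sketched compactly (again a simplification, without the paper's small- and large-range corrections): the arithmetic mean over registers is replaced by a harmonic mean of the values 2^M[j], which dampens outliers and yields the quoted 1.04/√m standard error. The α_m formula below is the commonly used approximation for m ≥ 128; the SHA-1 hash is an assumption of this sketch.

```python
import hashlib

def hyperloglog_estimate(items, k=10):
    """Illustrative HyperLogLog sketch: harmonic-mean cardinality estimate."""
    m = 1 << k
    M = [0] * m
    for item in items:
        h = int(hashlib.sha1(str(item).encode()).hexdigest(), 16)
        j = h & (m - 1)         # register index
        w = h >> k
        rank = (w & -w).bit_length() if w else 1
        M[j] = max(M[j], rank)
    alpha = 0.7213 / (1 + 1.079 / m)   # bias correction for m >= 128
    # harmonic mean of the values 2^M[j], scaled by alpha * m
    return alpha * m * m / sum(2.0 ** -x for x in M)
```

The registers themselves are identical to LogLog's; only the way they are averaged changes, which is why the same memory buys better accuracy.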
Singularity Analysis, Hadamard Products, and Tree Recurrences
2003
Abstract

Cited by 39 (9 self)
We present a toolbox for extracting asymptotic information on the coefficients of combinatorial generating functions. This toolbox notably includes a treatment of the effect of Hadamard products on singularities in the context of the complex Tauberian technique known as singularity analysis. As a consequence, it becomes possible to unify the analysis of a number of divide-and-conquer algorithms, or equivalently random tree models, including several classical methods for sorting, searching, and dynamically managing equivalence relations.
Practical Suffix Tree Construction
In Proc. 13th International Conference on Very Large Data Bases, 2004
Abstract

Cited by 36 (2 self)
Large string datasets are common in a number of emerging text and biological database applications.
Digital Trees and Memoryless Sources: from Arithmetics to Analysis
In 21st International Meeting on Probabilistic, Combinatorial, and Asymptotic Methods in the Analysis of Algorithms (AofA’10), Discrete Math. Theor. Comput. Sci. Proc., 2010
Abstract

Cited by 17 (1 self)
Digital trees, also known as “tries”, are fundamental to a number of algorithmic schemes, including radix-based searching and sorting, lossless text compression, dynamic hashing algorithms, communication protocols of the tree or stack type, distributed leader election, and so on. This extended abstract develops the asymptotic form of expectations of the main parameters of interest, such as tree size and path length. The analysis is conducted under the simplest of all probabilistic models; namely, the memoryless source, under which letters that data items are comprised of are drawn independently from a fixed (finite) probability distribution. The precise asymptotic structure of the parameters’ expectations is shown to depend on fine singular properties in the complex plane of a ubiquitous Dirichlet series. Consequences include the characterization of a broad range of asymptotic regimes for error terms associated with trie parameters, as well as a classification that depends on specific arithmetic properties, especially irrationality measures, of the sources under consideration.
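As a concrete reference point for the parameters analysed here, a minimal character trie makes "tree size" directly computable. This is a plain dictionary-based sketch for illustration, not a production implementation:

```python
class TrieNode:
    """One node of a trie over an arbitrary finite alphabet."""
    def __init__(self):
        self.children = {}      # letter -> child TrieNode
        self.word_end = False   # marks that a stored string ends here

def insert(root, word):
    """Insert a string letter by letter, creating nodes as needed."""
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.word_end = True

def size(node):
    """Tree size: total number of nodes, one of the parameters analysed."""
    return 1 + sum(size(c) for c in node.children.values())
```

Under the memoryless-source model of the abstract, the inserted strings would be drawn with independent letters from a fixed distribution, and size(root) is the random variable whose expectation the analysis characterizes.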
Practical methods for constructing suffix trees
2005
Abstract

Cited by 16 (1 self)
Sequence datasets are ubiquitous in modern life-science applications, and querying sequences is a common and critical operation in many of these applications. The suffix tree is a versatile data structure that can be used to evaluate a wide variety of queries on sequence datasets, including evaluating exact and approximate string matches, and finding repeat patterns. However, methods for constructing suffix trees are often very time-consuming, especially for suffix trees that are large and do not fit in the available main memory. Even when the suffix tree fits in memory, it turns out that the processor cache behavior of theoretically optimal suffix tree construction methods is poor, resulting in poor performance. Currently, there are a large number of algorithms for constructing suffix trees, but the practical tradeoffs in using these algorithms for different scenarios are not well understood.
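As a baseline for the query workload these structures support, here is the naive approach on the closely related suffix array: O(n² log n) construction, nowhere near the practical algorithms the paper studies, but enough to illustrate the "find all occurrences" operation. The sentinel character in the search is an assumption of this sketch (it presumes the text avoids U+FFFF):

```python
import bisect

def suffix_array(s):
    """Naive suffix array: suffix start positions in lexicographic order."""
    return sorted(range(len(s)), key=lambda i: s[i:])

def find_occurrences(s, sa, pattern):
    """All start positions of pattern in s, via binary search on suffixes."""
    suffixes = [s[i:] for i in sa]       # materialized for clarity only
    lo = bisect.bisect_left(suffixes, pattern)
    # every suffix starting with pattern sorts below pattern + U+FFFF
    hi = bisect.bisect_right(suffixes, pattern + "\uffff")
    return sorted(sa[lo:hi])
```

A suffix tree answers the same query in time proportional to the pattern length; the point of the paper is how to build such indexes efficiently at scale.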
Analytic Combinatorics: Symbolic Combinatorics
2002
Abstract

Cited by 15 (0 self)
This booklet develops in nearly 200 pages the basics of combinatorial enumeration through an approach that revolves around generating functions. The major objects of interest here are words, trees, graphs, and permutations, which surface recurrently in all areas of discrete mathematics. The text presents the core of the theory with chapters on unlabelled enumeration and ordinary generating functions, labelled enumeration and exponential generating functions, and finally multivariate enumeration and generating functions. It is largely oriented towards applications of combinatorial enumeration to random discrete structures and discrete mathematics models, as they appear in various branches of science, like statistical physics, computational biology, probability theory, and, last but not least, computer science and the analysis of algorithms.
Limit theorems for patterns in phylogenetic trees
Journal of Mathematical Biology, 2010
Abstract

Cited by 14 (8 self)
Studying the shape of phylogenetic trees under different random models is an important issue in evolutionary biology. In this paper, we propose a general framework for deriving detailed statistical results for patterns in phylogenetic trees under the Yule-Harding model and the uniform model, two of the most fundamental random models considered in phylogenetics. Our framework will unify several recent studies which were mainly concerned with the mean value and the variance. Moreover, refined statistical results such as central limit theorems, Berry-Esseen bounds, local limit theorems, etc. are obtainable with our approach as well. A key contribution of the current study is that our results are applicable to the whole range of possible sizes of the pattern.
The Number of Symbol Comparisons in QuickSort and QuickSelect
Abstract

Cited by 11 (4 self)
We revisit the classical QuickSort and QuickSelect algorithms, under a complexity model that fully takes into account the elementary comparisons between symbols composing the records to be processed. Our probabilistic models belong to a broad category of information sources that encompasses memoryless (i.e., independent-symbols) and Markov sources, as well as many unbounded-correlation sources. We establish that, under our conditions, the average-case complexity of QuickSort is O(n log² n) [rather than O(n log n), classically], whereas that of QuickSelect remains O(n). Explicit expressions for the implied constants are provided by our combinatorial–analytic methods.
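The symbol-comparison cost model is easy to make concrete: comparing two strings costs one symbol comparison per position examined, i.e. the length of their common prefix plus one. The sketch below instruments a plain list-based QuickSort with this cost; it illustrates the cost model only, not the paper's analysis:

```python
def quicksort_symbol_cost(strings):
    """Sort strings with QuickSort, counting total symbol comparisons."""
    total = 0

    def cmp(a, b):
        nonlocal total
        i = 0
        while i < len(a) and i < len(b) and a[i] == b[i]:
            i += 1
        total += i + 1                 # symbols examined, mismatch included
        return (a > b) - (a < b)

    def qs(arr):
        if len(arr) <= 1:
            return arr
        pivot, left, right = arr[0], [], []
        for x in arr[1:]:
            (left if cmp(x, pivot) < 0 else right).append(x)
        return qs(left) + [pivot] + qs(right)

    return qs(strings), total
```

On inputs with long shared prefixes, each of the Θ(n log n) string comparisons costs Θ(log n) symbols on average under the sources considered, which is the intuition behind the O(n log² n) total.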
On Buffon Machines and Numbers
2009
Abstract

Cited by 10 (0 self)
Buffon’s needle experiment is well-known: take a plane on which parallel lines at unit distance one from the next have been marked, throw a needle of unit length at random, and, finally, declare the experiment a success if the needle intersects one of the lines. Basic calculus implies that the probability of success is 2/π ≈ 0.63661, and the experiment can be regarded as an analog (i.e., continuous) device that stochastically “computes” 2/π. Generalizing the experiment and simplifying the computational framework, we ask ourselves which probability distributions can be produced perfectly, from a discrete source of unbiased coin flips. We describe and analyse a few simple Buffon machines that can generate geometric, Poisson, and logarithmic-series distributions (these are in particular required to transform continuous Boltzmann samplers of classical combinatorial structures into purely discrete random generators). Say that a number is Buffon if it is the probability of success of a probabilistic experiment based on discrete coin flippings. We provide human-accessible Buffon machines, which require a dozen coin flips or less, on average, and produce experiments whose probabilities are expressible in terms of numbers such as π, exp(−1), log 2, √3, cos(1/4), ζ(5). More generally, we develop a collection of constructions based on simple probabilistic mechanisms that enable one to create Buffon experiments involving compositions of exponentials and logarithms, polylogarithms, direct and inverse trigonometric functions, algebraic and hypergeometric functions, as well as functions defined by integrals, such as the Gaussian error function.
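The simplest of the Buffon machines mentioned above, the geometric generator, needs only a stream of fair coin flips. The sketch below is illustrative; the paper’s machines for π, exp(−1), log 2 and the rest are built from such discrete primitives in more elaborate ways:

```python
import random

def geometric(flip=lambda: random.getrandbits(1)):
    """Geometric(1/2) variate on 0, 1, 2, ...: count tails (0) before
    the first head (1), using only unbiased coin flips."""
    k = 0
    while flip() == 0:
        k += 1
    return k
```

Note that the machine consumes two flips on average and uses no real-number arithmetic at all, which is exactly the "purely discrete" computational framework the abstract advocates.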