## Loglog Counting of Large Cardinalities (2003)

Venue: ESA

Citations: 74 (3 self)

### BibTeX

@INPROCEEDINGS{Durand03loglogcounting,

author = {Marianne Durand and Philippe Flajolet},

title = {Loglog Counting of Large Cardinalities},

booktitle = {In ESA},

year = {2003},

pages = {605--617}

}

### Abstract

Using an auxiliary memory smaller than the size of this abstract, the LogLog algorithm makes it possible to estimate in a single pass and within a few percent the number of different words in the whole of Shakespeare's works. In general the LogLog algorithm makes use of m "small bytes" of auxiliary memory in order to estimate in a single pass the number of distinct elements (the "cardinality") in a file, and it does so with an accuracy that is of the order of $1/\sqrt{m}$. The "small bytes" to be used in order to count cardinalities up to $N_{\max}$ comprise about $\log\log N_{\max}$ bits, so that cardinalities well in the range of billions can be determined using only one or two kilobytes of memory. The basic version of the LogLog algorithm is validated by a complete analysis. An optimized version, super-LogLog, is also engineered and tested on real-life data. The algorithm parallelizes optimally.
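The basic estimator summarized in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the SHA-1 based 64-bit hash, the function names, and the default `k = 10` (so m = 1024 registers) are assumptions made for the example, and the constant 0.39701 is the asymptotic bias correction used in place of the exact m-dependent value. The super-LogLog refinements (truncation and register-size restriction) are not included.

```python
import hashlib

# Asymptotic bias-correction constant for LogLog; the exact value depends
# slightly on m, which this sketch ignores (assumption for illustration).
ALPHA_INF = 0.39701


def _hash64(item: str) -> int:
    """Return a 64-bit hash of item. SHA-1 is an illustrative choice only."""
    digest = hashlib.sha1(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def _rank(w: int, width: int) -> int:
    """1-based position of the leftmost 1-bit in a width-bit word (width + 1 if w == 0)."""
    return width - w.bit_length() + 1


def loglog_estimate(items, k: int = 10) -> float:
    """Estimate the number of distinct items in one pass using m = 2**k small registers."""
    m = 1 << k
    width = 64 - k
    registers = [0] * m
    for item in items:
        h = _hash64(item)
        j = h >> width                # first k bits select the register
        w = h & ((1 << width) - 1)    # remaining bits feed the rank statistic
        r = _rank(w, width)
        if r > registers[j]:
            registers[j] = r          # keep the maximum rank seen per register
    mean = sum(registers) / m
    return ALPHA_INF * m * 2.0 ** mean
```

Because each register only keeps a maximum, repeated items and the order of the input do not affect the result, and register arrays built on separate parts of the data can be merged by taking component-wise maxima. With m = 1024 the relative error would be expected to be a few percent.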

### Citations

2203 | The art of computer programming
- Knuth
- 1981
Citation Context ...U into sufficiently long binary strings, in such a way that bits composing the hashed value closely resemble random uniform independent bits. This pragmatic attitude¹ is justified by Knuth who writes in [10]: "It is theoretically impossible to define a hash function that creates random data from non-random data in actual files. But in practice it is not difficult to produce a pretty good imitation of random da...

1874 | Randomized Algorithms
- Motwani, Raghavan
- 1995
Citation Context ...ve since it is totally independent of the order and the replication structure of the multiset M. (Footnote 1: The more theoretically inclined reader may prefer to draw h at random from a family of universal hash functions; see, e.g., the general discussion in [12] and the specific [1].) In fact, in probabilistic terms, the quantity R is precisely distributed in the same way...

700 | The space complexity of approximating the frequency moments
- Alon, Matias, et al.
- 1999
Citation Context ...uracy is about $1.20/\sqrt{m}$, which is about 50% less efficient than Probabilistic Counting. An insightful complexity-theoretic discussion of approximate counting is provided by Alon, Matias, and Szegedy in [1]. The authors discuss a class of "frequency-moments" statistics which includes ours (as their $F_0$ statistics). Our LogLog Algorithm has principles that evoke some of those found in the intersection o...

338 | Probabilistic Counting Algorithms for Data Base Applications
- Flajolet, Martin
- 1985
Citation Context ...ions ("stationarity") regarding the data input to the algorithm. (We recommend the thorough engineering discussion of [3].) Closer to us is the Probabilistic Counting algorithm of Flajolet and Martin [7]. This uses a certain observable that has excellent statistical properties but is relatively costly to maintain in terms of storage. Indeed, Probabilistic Counting estimates cardinalities with an erro...

168 | Mellin transforms and asymptotics: Harmonic sums
- Flajolet, Gourdon, et al.
- 1995
Citation Context ...s sums of the form $\sum \lambda f(\mu x)$ have a transform that factorizes as $\left(\sum \lambda \mu^{-s}\right) \cdot f^{\star}(s)$. The conjunction of both properties then renders possible the analysis of fairly intricate combinatorial sums: see [6] for an extensive survey and Szpankowski's book [14] for many applications to the analysis of algorithms. (Property (i) results from the Mellin inversion formula and the residue theorem; Property (ii)...

93 | A linear-time probabilistic counting algorithm for database applications
- Whang, Vander-Zanden, et al.
- 1990
Citation Context ...0 H H 0 3), then the effect of hashing collisions must be compensated for. This is achieved by inverting the function that gives the expected value of the number of collisions in a hash table (see [3, 15] for an analogous discussion). The estimator is then to be changed into $-2^{H} \log\left(1 - \frac{\tilde{\alpha}_m\, m}{2^{H}}\, 2^{\frac{1}{m}\sum^{\star} M^{(j)}}\right)$. (No detectable degradation of performance results from the last modification of the est...

58 | Counting large numbers of events in small registers
- Morris
- 1978
Citation Context ...One has $\log_2 \log_2 2^{17} \doteq 4.09$. Adopt $\omega = 0.91$, so that each register has a size of $\ell = 5$ bits, i.e., a value less than 32. (Footnote 4: A counting algorithm exhibiting a log-log feature in a different context is Morris's Approximate Counting [11] analysed in [4].) Applying the upper bound of (8) shows that an $\ell$-restriction will have little incidence on the result: the ...

52 | Analytical depoissonization and its applications
- Jacquet, Szpankowski
- 1998
Citation Context ... 10 of Szpankowski's book [14]. We choose the method called "analytic depoissonization" by Jacquet and Szpankowski, whose underlying engine is the saddle point method applied to Cauchy integrals; see [9, 14]. In essence, the values of an exponential generating function at large arguments are closely related to the asymptotic form of its coefficients provided the generating function decays fast enough away f...

42 | Approximate counting: a detailed analysis
- Flajolet
- 1985
Citation Context ...$2^{17} \doteq 4.09$. Adopt $\omega = 0.91$, so that each register has a size of $\ell = 5$ bits, i.e., a value less than 32. (Footnote 4: A counting algorithm exhibiting a log-log feature in a different context is Morris's Approximate Counting [11] analysed in [4].) Applying the upper bound of (8) shows that an $\ell$-restriction will have little incidence on the result: the probability of a...

38 | Combinatorics of geometrically distributed random variables: new q-tangent and q-secant numbers
- Prodinger
Citation Context ...stic terms, the quantity R is precisely distributed in the same way as 1 plus the maximum of n independent geometric variables of parameter 1/2. This is an extensively researched subject; see, e.g., [13]. It turns out that R estimates $\log_2 n$ with an additive bias of 1.33 and a standard deviation of 1.87. Thus, in a sense, the observed value of R estimates "logarithmically" n within 1.87 binary orde...

26 | Average-Case Analysis of Algorithms on Sequences
- Szpankowski
- 2001
Citation Context ...actorizes as $\left(\sum \lambda \mu^{-s}\right) \cdot f^{\star}(s)$. The conjunction of both properties then renders possible the analysis of fairly intricate combinatorial sums: see [6] for an extensive survey and Szpankowski's book [14] for many applications to the analysis of algorithms. (Property (i) results from the Mellin inversion formula and the residue theorem; Property (ii) reflects the action of Mellin transforms on rescaled...

22 | Aqua: System and techniques for approximate query answering
- Gibbons, Poosala, et al.
- 1998
Citation Context ...the goal is to provide an approximate response in time that is orders-of-magnitude less than what computing an exact answer would require: see the description of the Aqua Project by Gibbons et al. in [8]. The analysis of traffic in routers, as already mentioned, benefits greatly of cardinality estimators; this is lucidly exposed by Estan et al. in [2, 3]. Certain types of attacks (e.g., "denial of service...

14 | On adaptive sampling
- Flajolet
- 1990
Citation Context ...ith selectivity p 1, store exactly and without duplicates the data items filtered and return as estimate 1/p times the corresponding cardinality. Wegner's Adaptive Sampling (described and analysed in [5]) is an elegant way to maintain dynamically varying values of p. For m "words" of memory (where here "word" refers to the space needed by a data item), the accuracy is about $1.20/\sqrt{m}$, which is about ...

6 | New directions in traffic measurement and accounting
- Estan, Varghese
- 2002
Citation Context ...e description of the Aqua Project by Gibbons et al. in [8]. The analysis of traffic in routers, as already mentioned, benefits greatly of cardinality estimators; this is lucidly exposed by Estan et al. in [2, 3]. Certain types of attacks (e.g., "denial of service" and "port scans") are betrayed by alarmingly high counts of certain characteristic events at the level of routers. In such situations, there is us...

1 | Bitmap algorithms for counting active flows on high speed links
- Estan, Varghese, et al.
- 2003
Citation Context ...raffic in routers. In such contexts, the data may be either too large to fit at once in core memory or even too massive to be stored, being a huge continuous flow of data packets. For instance, Estan et al. [3] report traces of packet headers, produced at a rate of 0.5GB per hour of compressed data (!), which were collected while trying to trace a "worm" (Code Red, August 1 to 12, 2001), and on which it was...