## HyperLogLog: The analysis of a near-optimal cardinality estimation algorithm (2007)

Venue: | AofA ’07: Proceedings of the 2007 International Conference on Analysis of Algorithms |

Citations: | 18 - 1 self |

### BibTeX

@INPROCEEDINGS{Flajolet07hyperloglog:the,
  author = {Philippe Flajolet and Éric Fusy and Olivier Gandouet and Frédéric Meunier},
  title = {HyperLogLog: The analysis of a near-optimal cardinality estimation algorithm},
  booktitle = {AofA ’07: Proceedings of the 2007 International Conference on Analysis of Algorithms},
  year = {2007}
}

### Abstract

This extended abstract describes and analyses a near-optimal probabilistic algorithm, HYPERLOGLOG, dedicated to estimating the number of distinct elements (the cardinality) of very large data ensembles. Using an auxiliary memory of m units (typically, “short bytes”), HYPERLOGLOG performs a single pass over the data and produces an estimate of the cardinality such that the relative accuracy (the standard error) is typically about 1.04/√m. This improves on the best previously known cardinality estimator, LOGLOG, whose accuracy can be matched by consuming only 64% of the original memory. For instance, the new algorithm makes it possible to estimate cardinalities well beyond 10⁹ with a typical accuracy of 2% while using a memory of only 1.5 kilobytes. The algorithm parallelizes optimally and adapts to the sliding window model.
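The single-pass scheme summarized in the abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' reference implementation: the class and function names are hypothetical, the hash is 64 bits taken from SHA-1 (chosen here for convenience), and the bias constant 0.7213/(1 + 1.079/m) together with the switch to linear counting below 2.5m follow the practical variant described in the full paper rather than anything stated in the abstract. The `merge` method reflects the remark that the algorithm parallelizes: two sketches built on separate streams combine by taking register-wise maxima.

```python
import hashlib
import math

HASH_BITS = 64  # assumption: use the first 64 bits of a SHA-1 digest


def rho(w: int, width: int) -> int:
    """1-based position of the leftmost 1-bit of w, seen as a width-bit word."""
    return width - w.bit_length() + 1  # evaluates to width + 1 when w == 0


class HyperLogLog:
    """Hypothetical minimal sketch with m = 2**b registers."""

    def __init__(self, b: int = 11) -> None:
        if not 7 <= b <= 16:
            raise ValueError("b must be in [7, 16] for the alpha formula used here")
        self.b = b
        self.m = 1 << b
        self.registers = [0] * self.m
        # Bias-correction constant, valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item: str) -> None:
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        tail_bits = HASH_BITS - self.b
        j = h >> tail_bits                 # first b bits select the register
        w = h & ((1 << tail_bits) - 1)     # remaining bits feed rho
        self.registers[j] = max(self.registers[j], rho(w, tail_bits))

    def merge(self, other: "HyperLogLog") -> None:
        """Union of two sketches: register-wise maximum."""
        if other.m != self.m:
            raise ValueError("sketches must use the same number of registers")
        self.registers = [max(x, y) for x, y in zip(self.registers, other.registers)]

    def estimate(self) -> float:
        # Normalized harmonic mean of the values 2**register.
        harmonic = sum(2.0 ** -r for r in self.registers)
        raw = self.alpha * self.m * self.m / harmonic
        empty = self.registers.count(0)
        if raw <= 2.5 * self.m and empty > 0:
            # Small-range correction: linear counting on the empty registers.
            return self.m * math.log(self.m / empty)
        return raw


if __name__ == "__main__":
    sketch = HyperLogLog(b=11)
    for i in range(100_000):
        sketch.add(f"user-{i}")
    print(round(sketch.estimate()))
```

With b = 11 the sketch holds m = 2048 registers, for which the abstract's formula gives a standard error of about 1.04/√2048 ≈ 2.3%. Because a 64-bit hash is assumed, no correction for very large cardinalities is included.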

### Citations

2197 | The art of computer programming
- Knuth
- 1973
Citation Context ...he hashed values closely resemble a uniform model of randomness, namely, bits of hashed values are assumed to be independent and to have each probability 1/2 of occurring; practical methods are known [20], which vindicate this assumption, based on cyclic redundancy codes (CRC), modular arithmetic, or a simplified cryptographic use of boolean algebra (e.g., SHA-1). The best known cardinality estimators... |

702 | The space complexity of approximating the frequency moments
- Alon, Matias, et al.
- 1999
Citation Context ...hed values h(M) of the input multiset M, then inferring a plausible estimate of the unknown cardinality n. Define an observable of a multiset S ≡ h(M) of {0,1}^∞ strings (or, equivalently, of real [0,1] numbers) to be a function that only depends on the set underlying S, that is, a quantity independent of replications. Then two broad categories of cardinality observables have been studied. Bit-pat... |

341 | On the resemblance and containment of documents
- Broder
- 1997
Citation Context ...dinalities are measured (see the lucid exposition by Estan and Varghese in [11]). Other applications of cardinality estimators include data mining of massive data sets of sorts: natural language texts [4, 5], biological data [17, 18], very large structured databases, or the internet graph, where the authors of [22] report computational gains by a factor of 500+ attained by probabilistic cardinality esti... |

340 | Probabilistic counting algorithms for data base applications
- Flajolet, Martin
- 1985
Citation Context ...ting n approximately. (In many practical applications, a tolerance of a few percent on the result is acceptable.) A whole range of algorithms have been developed that only require a sublinear memory [2, 6, 10, 11, 15, 16], or, at worst a linear memory, but with a small implied constant [24]. All known efficient cardinality estimators rely on randomization, which is ensured by the use of hash functions. The elements to... |

228 | Maintaining stream statistics over sliding windows (extended abstract) - Datar, Gionis, et al. - 2002 |

169 | Mellin transforms and asymptotics: Harmonic sums
- Flajolet, Gourdon, et al.
- 1995
Citation Context ... The integral representation. The asymptotic analysis of the Poisson expectation departs from the usual paradigm of analysis exemplified by [10, 14, 23] because of the coupling introduced by the harmonic mean, namely, the factor $\left(\sum 2^{-k_j}\right)^{-1}$. This is remedied by a use of the simple identity $\frac{1}{a} = \int_0^\infty e^{-at}\,dt$ (7), which then leads to a crucial sepa... |

144 | Counting distinct elements in a data stream - Bar-Yossef, Jayram, et al. - 2002 |

114 | Identifying and Filtering Near-Duplicate Documents
- Broder
Citation Context ...dinalities are measured (see the lucid exposition by Estan and Varghese in [11]). Other applications of cardinality estimators include data mining of massive data sets of sorts: natural language texts [4, 5], biological data [17, 18], very large structured databases, or the internet graph, where the authors of [22] report computational gains by a factor of 500+ attained by probabilistic cardinality esti... |

105 | Asymptotic Methods in Analysis
- de Bruijn
- 1981
Citation Context ... $J_s(m) = \int_0^\infty u^s f(u)^m \, du$, $\alpha_m = \frac{1}{m J_0(m)}$, $\beta_m = \sqrt{m}\,\sqrt{\frac{J_1(m)}{J_0(m)^2} - 1}$. The integrals $J_s(m)$ are routinely amenable to the Laplace method [8]: $J_0(m) = \frac{2\log 2}{m}\left(1 + \frac{1}{m}(3\log 2 - 1) + O(m^{-2})\right)$, $J_1(m) = \frac{(2\log 2)^2}{m^2}\left(1 + \frac{3}{m}(3\log 2 - 1) + O(m^{-2})\right)$ (28). Thus the bias correction $\alpha_m$ and the variance constant $\beta_m$ satisfy $\alpha_m \sim \frac{1}{2\log 2} \doteq$ 0.7... |

98 | Bitmap algorithms for counting active flows on high-speed links - Estan, Varghese, et al. - 2006 |

92 | A linear-time probabilistic counting algorithm for database applications
- Whang, Vander-Zanden, et al.
- 1990
Citation Context ...the result is acceptable.) A whole range of algorithms have been developed that only require a sublinear memory [2, 6, 10, 11, 15, 16], or, at worst a linear memory, but with a small implied constant [24]. All known efficient cardinality estimators rely on randomization, which is ensured by the use of hash functions. The elements to be counted belonging to a certain data domain D, we assume given a ha... |

74 | Loglog counting of large cardinalities
- Durand, Flajolet
Citation Context ...ting n approximately. (In many practical applications, a tolerance of a few percent on the result is acceptable.) A whole range of algorithms have been developed that only require a sublinear memory [2, 6, 10, 11, 15, 16], or, at worst a linear memory, but with a small implied constant [24]. All known efficient cardinality estimators rely on randomization, which is ensured by the use of hash functions. The elements to... |

72 | Random Allocations
- Kolchin, Sevastyanov, et al.
- 1978
Citation Context ... that is, n must be close to m log(m/V). (The quality of this estimate can be precisely analysed, since exact and asymptotic forms are known for the mean, variance, and distribution of V; see, e.g., [21].) Here the bins are the ones associated to the m “submultisets”, and one knows that a bin j is empty from the fact that its corresponding register M[j] has preserved its initial value 0. This correc... |

53 | Analytical de-Poissonization and its applications
- Jacquet, Szpankowski
- 1998
Citation Context ...el (Proposition 2) applies to the fixed size model, up to negligible error terms. To this aim, we appeal to a technique known as “analytic depoissonization”, pioneered by Jacquet and Szpankowski (see [19] and [23, p. 456]) and based on the saddle point method. To wit: 10 P. Flajolet, É. Fusy, O. Gandouet, F. Meunier Theorem (Analytic depoissonization). Let $f(z) = e^{-z} \sum f_k z^k / k!$ be the Poisson generat... |

27 | Using rank propagation and probabilistic counting for link-based spam detection
- Becchetti, Castillo, et al.
- 2006
Citation Context ...ever growing number of applications in networking and traffic monitoring, such as the detection of worm propagation, of network attacks (e.g., by Denial of Service), and of link-based spam on the web [3]. For instance, a data stream over a network consists of a sequence of packets, each packet having a header, which contains a pair (source–destination) of addresses, followed by a body of specific dat... |

25 | Average-case analysis of algorithms on sequences
- Szpankowski
- 2001
Citation Context ... The integral representation. The asymptotic analysis of the Poisson expectation departs from the usual paradigm of analysis exemplified by [10, 14, 23] because of the coupling introduced by the harmonic mean, namely, the factor $\left(\sum 2^{-k_j}\right)^{-1}$. This is remedied by a use of the simple identity $\frac{1}{a} = \int_0^\infty e^{-at}\,dt$ (7), which then leads to a crucial sepa... |

24 | Order statistics and estimating cardinalities of massive data sets
- Giroire
Citation Context ...ting n approximately. (In many practical applications, a tolerance of a few percent on the result is acceptable.) A whole range of algorithms have been developed that only require a sublinear memory [2, 6, 10, 11, 15, 16], or, at worst a linear memory, but with a small implied constant [24]. All known efficient cardinality estimators rely on randomization, which is ensured by the use of hash functions. The elements to... |

14 | On adaptative sampling
- Flajolet
- 1990
Citation Context ...s m increases. The benefit of this approach [ m−1 m

| Algorithm | Cost (units) | Accuracy |
|---|---|---|
| Hit Counting [24] | 1/10 N bits | ≈2% |
| Adaptive Sampling [12] | m words (≈32 bits) | 1.20/√m |
| Probabilistic Counting [15] | m words (≤32 bits) | 0.78/√m |
| MINCOUNT [2, 6, 16, 18] | m words (≤32 bits) | 1.00/√m |
| LOGLOG [10] | m bytes (≤5 bits) | 1.30/√m |
| HYPERLOGLOG | m bytes ... | |

10 | Counting by coin tossings
- Flajolet
Citation Context ...tions per element of the multiset M (as opposed to a quantity proportional to m), while only one hash function is now needed. The performances of several algorithms are compared in Figure 1; see also [13] for a review. HYPERLOGLOG, described in detail in the next section, is based on the same observable as LOGLOG, namely the largest ρ value obtained, where ρ(x) is the position of the leftmost 1-bit in... |

5 | Combinatoire analytique et algorithmique des ensembles de données - Durand - 2004 |

3 | Efficient estimation of the cardinality of large data sets
- Chassaing, Gérin
- 2006

2 | Directions to use probabilistic algorithms for cardinality for DNA analysis
- Giroire
Citation Context ...(see the lucid exposition by Estan and Varghese in [11]). Other applications of cardinality estimators include data mining of massive data sets of sorts: natural language texts [4, 5], biological data [17, 18], very large structured databases, or the internet graph, where the authors of [22] report computational gains by a factor of 500+ attained by probabilistic cardinality estimators. Clearly, the cardi... |

2 | algorithmique et analyse combinatoire de grands ensembles
- Giroire
- 2006
Citation Context ...(see the lucid exposition by Estan and Varghese in [11]). Other applications of cardinality estimators include data mining of massive data sets of sorts: natural language texts [4, 5], biological data [17, 18], very large structured databases, or the internet graph, where the authors of [22] report computational gains by a factor of 500+ attained by probabilistic cardinality estimators. Clearly, the cardi... |

2 | Data mining on large graphs
- Palmer, Gibbons, et al.
Citation Context ...lity estimators include data mining of massive data sets of sorts: natural language texts [4, 5], biological data [17, 18], very large structured databases, or the internet graph, where the authors of [22] report computational gains by a factor of 500+ attained by probabilistic cardinality estimators. Clearly, the cardinality of a multiset can be exactly determined with a storage complexity essentiall... |