## An Optimal Algorithm for the Distinct Elements Problem



Citations: 28 (4 self)

### BibTeX

@MISC{Kane_anoptimal,

author = {Daniel M. Kane and Jelani Nelson and David P. Woodruff},

title = {An Optimal Algorithm for the Distinct Elements Problem},

year = {}

}


### Abstract

We give the first optimal algorithm for estimating the number of distinct elements in a data stream, closing a long line of theoretical research on this problem begun by Flajolet and Martin in their seminal paper in FOCS 1983. This problem has applications to query optimization, Internet routing, network topology, and data mining. For a stream of indices in $\{1, \ldots, n\}$, our algorithm computes a $(1 \pm \varepsilon)$-approximation using an optimal $O(\varepsilon^{-2} + \log(n))$ bits of space with 2/3 success probability, where $0 < \varepsilon < 1$ is given. This probability can be amplified by independent repetition. Furthermore, our algorithm processes each stream update in $O(1)$ worst-case time, and can report an estimate at any point midstream in $O(1)$ worst-case time, thus settling both the space and time complexities simultaneously.
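To make the problem statement and the "amplified by independent repetition" remark concrete, here is a minimal Python sketch. It is not the algorithm of this paper: it swaps in a much simpler k-minimum-values estimator, which needs roughly k hash values of memory rather than the optimal space bound quoted above. The names `hash01`, `KMVSketch`, and `estimate_distinct` are hypothetical, and SHA-256 merely stands in for the limited-independence hash families that a rigorous analysis would require.

```python
import hashlib
import heapq
from statistics import median


def hash01(item, seed):
    """Map (seed, item) to a pseudo-random float in the unit interval.

    Assumption: SHA-256 is used as a stand-in for a random hash function.
    """
    digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


class KMVSketch:
    """Keeps the k smallest distinct hash values seen so far."""

    def __init__(self, k, seed):
        self.k = k
        self.seed = seed
        self._heap = []        # negated values, so _heap[0] is the largest kept hash
        self._members = set()  # hash values currently stored in the heap

    def add(self, item):
        h = hash01(item, self.seed)
        if h in self._members:
            return  # repeated item (or hash collision): nothing to update
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # New value is smaller than the current k-th smallest: swap it in.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def estimate(self):
        if len(self._heap) < self.k:
            # Fewer than k distinct hashes seen: the count is exact.
            return float(len(self._heap))
        kth_smallest = -self._heap[0]
        return (self.k - 1) / kth_smallest


def estimate_distinct(stream, k=256, repetitions=5):
    """Median of independent sketches, to boost the success probability."""
    sketches = [KMVSketch(k, seed) for seed in range(repetitions)]
    for item in stream:
        for sketch in sketches:
            sketch.add(item)
    return median(sketch.estimate() for sketch in sketches)


if __name__ == "__main__":
    stream = [i % 5000 for i in range(20000)]  # 5000 distinct values, each repeated
    print(estimate_distinct(stream))
```

Taking the median over several independently seeded sketches is the standard way to turn a constant success probability, such as the 2/3 in the abstract, into a higher one: the median is wrong only if at least half of the repetitions fail at once. In this sketch the relative error shrinks roughly like $1/\sqrt{k}$, so `k` plays the role of $\varepsilon^{-2}$.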

### Citations

1870 | Randomized Algorithms
- Motwani, Raghavan
- 1995
Citation Context ...r the conditional expectation $\mathbf{E}[T_{r''}(t) \mid E' \wedge A']$ for $r'' > r' + 1$. Call these conditional expectations $E_r$. Then since $(1 - 1/n)/e \le (1 - 1/n)^n \le 1/e$ for all real $n \ge 1$ (see Proposition B.3 of [29]), we have that $E_r/K_{RE}$ lies in the interval

$$\left[\, 1 - e^{-|I_r(t)|/K_{RE}},\; 1 - e^{-|I_r(t)|/K_{RE}} \left(1 - \frac{1}{K_{RE}}\right)^{|I_r(t)|/K_{RE}} \right]$$

Thus for $r'' > r' + 1$, $E_{r''} \le \left(1 - e^{-7/24}\left(1 - \frac{1}{K_{RE}}\right)^{7/24}\right) K_{RE}$...

697 | The space complexity of approximating the frequency moments
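- Alon, Matias, et al.
- 1996
Citation Context ...elligence Laboratory. minilek@mit.edu. ‡ IBM Almaden Research Center, 650 Harry Road, San Jose, CA, USA. dpwoodru@us.ibm.com.

| Paper | Space | Update Time | Notes |
| --- | --- | --- | --- |
| [21] | $O(\log n)$ | - | Random oracle, constant $\varepsilon$ |
| [3] | $O(\log n)$ | $O(\log n)$ | constant $\varepsilon$ |
| [25] | $O(\varepsilon^{-2} \log n)$ | $O(\varepsilon^{-2})$ | |
| [5] | $O(\varepsilon^{-3} \log n)$ | $O(\varepsilon^{-3})$ | |
| [4] | $O(\varepsilon^{-2} \log n)$ | $O(\log(\varepsilon^{-1}))$ | Algorithm I in the paper |
| [4] | $O(\varepsilon^{-2} \log\log n + \mathrm{poly}(\log(\varepsilon^{-1} \log n)) \log n)$ | $\varepsilon^{-2}$ p... | |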

674 | Universal classes of hash functions
- Carter, Wegman
- 1979
Citation Context ...e lsb(0) = log(n). All our logarithms are base 2 unless stated otherwise. We also use $H_k(U, V)$ to denote some k-wise independent hash family of functions mapping $U$ into $V$. Using known constructions [11], a random $h \in H_k(U, V)$ can be represented in $O(k \log(|U| + |V|))$ bits when $|U|$, $|V|$ are powers of 2, and computed in the same amount of space. Also, henceforth, whenever we discuss picking an $h \in$ ...

482 | Access path selection in a relational database management system
- Selinger, Astrahan, et al.
- 1979
Citation Context ...s a fundamental problem in network traffic monitoring, query optimization, data mining, and several other database areas. For example, this statistic is useful for selecting a minimum-cost query plan [34], database design [19], OLAP [31, 35], data integration [10, 14], and data warehousing [1]. In network traffic monitoring, routers with limited memory track statistics such as distinct destination IPs...

339 | Probabilistic Counting Algorithms for Data Base Applications
- Flajolet, Martin
- 1985
Citation Context ... † MIT Computer Science and Artificial Intelligence Laboratory. minilek@mit.edu. ‡ IBM Almaden Research Center, 650 Harry Road, San Jose, CA, USA. dpwoodru@us.ibm.com.

| Paper | Space | Update Time | Notes |
| --- | --- | --- | --- |
| [21] | $O(\log n)$ | - | Random oracle, constant $\varepsilon$ |
| [3] | $O(\log n)$ | $O(\log n)$ | constant $\varepsilon$ |
| [25] | $O(\varepsilon^{-2} \log n)$ | $O(\varepsilon^{-2})$ | |
| [5] | $O(\varepsilon^{-3} \log n)$ | $O(\varepsilon^{-3})$ | |
| [4] | $O(\varepsilon^{-2} \log n)$ | $O(\log(\varepsilon^{-1}))$ | Algorithm I in the paper |
| [4] | $O(\varepsilon^{-2}$ log lo... | | |

145 | Surpassing the information theoretic bound with fusion trees
- Fredman, Willard
- 1993
Citation Context ...ction we discuss an implementation of our $F_0$ algorithm in Figure 3 with $O(1)$ update and reporting times. We first state a few theorems from previous works. Theorem 5 (Brodnik [8], Fredman and Willard [22]). The least and most significant bits of an integer fitting in a machine word can be computed in constant time. The next two theorems give hash families which have strong independence properties whil...

143 | Counting distinct elements in a data stream
- Bar-Yossef, Jayram, et al.
- 2002

129 | The Design of Dynamic Data Structures
- Overmars
- 1983

114 | Reductions in streaming algorithms, with an application to counting triangles in graphs
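- Bar-Yossef, Kumar, et al.
- 2002
Citation Context ... Center, 650 Harry Road, San Jose, CA, USA. dpwoodru@us.ibm.com.

| Paper | Space | Update Time | Notes |
| --- | --- | --- | --- |
| [21] | $O(\log n)$ | - | Random oracle, constant $\varepsilon$ |
| [3] | $O(\log n)$ | $O(\log n)$ | constant $\varepsilon$ |
| [25] | $O(\varepsilon^{-2} \log n)$ | $O(\varepsilon^{-2})$ | |
| [5] | $O(\varepsilon^{-3} \log n)$ | $O(\varepsilon^{-3})$ | |
| [4] | $O(\varepsilon^{-2} \log n)$ | $O(\log(\varepsilon^{-1}))$ | Algorithm I in the paper |
| [4] | $O(\varepsilon^{-2} \log\log n + \mathrm{poly}(\log(\varepsilon^{-1} \log n)) \log n)$ | $\varepsilon^{-2}\, \mathrm{poly}(\log(\varepsilon^{-1} \log n))$ | Algorithm II in the paper |
| [4] | $O(\varepsilon^{-2}$ (lo... | | |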

114 | Size-estimation framework with applications to transitive closure and reachability
- Cohen
- 1997
Citation Context ...be amplified by independent repetition. The problem of space-efficient $F_0$-estimation is well-studied, beginning with the work of Flajolet and Martin [21], and continuing with a long line of research [3, 4, 5, 6, 9, 12, 16, 18, 20, 24, 25, 27, 37]. In this work, we finally settle both the space- and time-complexities of $F_0$-estimation by giving an algorithm using $O(\varepsilon^{-2} + \log(n))$ bits of space, with worst-case update and reporting times $O(1)$. By...

99 | Bitmap Algorithms for Counting Active Flows on High Speed Links
- Estan, Varghese, et al.
- 2006

96 | Distinct sampling for highly-accurate answers to distinct values queries and event reports
- Gibbons

88 | Estimating simple functions on the union of data streams
- Gibbons, Tirthapura
- 2001
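Citation Context ....edu. ‡ IBM Almaden Research Center, 650 Harry Road, San Jose, CA, USA. dpwoodru@us.ibm.com.

| Paper | Space | Update Time | Notes |
| --- | --- | --- | --- |
| [21] | $O(\log n)$ | - | Random oracle, constant $\varepsilon$ |
| [3] | $O(\log n)$ | $O(\log n)$ | constant $\varepsilon$ |
| [25] | $O(\varepsilon^{-2} \log n)$ | $O(\varepsilon^{-2})$ | |
| [5] | $O(\varepsilon^{-3} \log n)$ | $O(\varepsilon^{-3})$ | |
| [4] | $O(\varepsilon^{-2} \log n)$ | $O(\log(\varepsilon^{-1}))$ | Algorithm I in the paper |
| [4] | $O(\varepsilon^{-2} \log\log n + \mathrm{poly}(\log(\varepsilon^{-1} \log n)) \log n)$ | $\varepsilon^{-2}\, \mathrm{poly}(\log(\varepsilon^{-1} \log n))$ | Algorithm II ... |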

76 | Physical Database Design for Relational Databases
- Finkelstein, Schkolnick, et al.
- 1988
Citation Context ...m in network traffic monitoring, query optimization, data mining, and several other database areas. For example, this statistic is useful for selecting a minimum-cost query plan [34], database design [19], OLAP [31, 35], data integration [10, 14], and data warehousing [1]. In network traffic monitoring, routers with limited memory track statistics such as distinct destination IPs, requested URLs, and ...

73 | Storage estimation for multidimensional aggregates in the presence of hierarchies
- Shukla, Deshpande, et al.
- 1996
Citation Context ...k traffic monitoring, query optimization, data mining, and several other database areas. For example, this statistic is useful for selecting a minimum-cost query plan [34], database design [19], OLAP [31, 35], data integration [10, 14], and data warehousing [1]. In network traffic monitoring, routers with limited memory track statistics such as distinct destination IPs, requested URLs, and source-destinat...

71 | Comparing data streams using hamming norms (how to zero in
- Cormode, Datar, et al.
Citation Context ...ar Yossef et al. [4], who provide algorithms with various tradeoffs (Algorithms I, II, and III in Figure 1). We also give a new algorithm for estimating $L_0$, also known as the Hamming norm of a vector [13], with optimal running times and near-optimal space. This problem is a generalization of $F_0$-estimation to the case when items can be removed from the stream. While $F_0$-estimation is useful for a single ...

59 | Optimal space lower bounds for all frequency moments
- Woodruff
- 2004

58 | Mining database structure; or, how to build a data quality browser
- Dasu, Johnson, et al.
- 2002
Citation Context ... optimization, data mining, and several other database areas. For example, this statistic is useful for selecting a minimum-cost query plan [34], database design [19], OLAP [31, 35], data integration [10, 14], and data warehousing [1]. In network traffic monitoring, routers with limited memory track statistics such as distinct destination IPs, requested URLs, and source-destination pairs on a link. Distin...

52 | The Aqua Approximate Query Answering System
- Acharya, Gibbons, et al.
- 1999
Citation Context ... several other database areas. For example, this statistic is useful for selecting a minimum-cost query plan [34], database design [19], OLAP [31, 35], data integration [10, 14], and data warehousing [1]. In network traffic monitoring, routers with limited memory track statistics such as distinct destination IPs, requested URLs, and source-destination pairs on a link. Distinct elements estimation is ...

42 | Balls and bins: A study in negative dependence. Random Struct
- Dubhashi, Ranjan
- 1998
Citation Context ... $E_{r''} \le \left(1 - e^{-7/24}\left(1 - \frac{1}{K_{RE}}\right)^{7/24}\right) K_{RE}$, and $E_{r'} \ge \left(1 - e^{-1/3}\right) K_{RE}$. A calculation shows that $E_{r''} < .99\, E_{r'}$ since $K_{RE} \ge 8$. By negative dependence in the balls and bins random process (see [15]), the Chernoff bound applies to $T_r(t)$ and thus $\Pr[\,|T_{r'}(t) - E_{r'}| \ge \epsilon E_{r'} \mid E \wedge A\,] \le 2e^{-\epsilon^2 E_{r'}/3}$ for any $\epsilon > 0$, and thus by taking $\epsilon$ a small enough constant, $\Pr[E'' \mid E \wedge A] \ge 1 - e^{-\Omega(K_{RE})}$. We a...

42 | Tight lower bounds for the distinct elements problem
- Indyk, Woodruff
- 2003

41 | Algorithms for dynamic geometric problems over data streams
- Indyk
- 2004
Citation Context ...r of streams to measure the number of unequal item counts. This makes it more flexible than $F_0$, and can be used in applications such as maintaining ad-hoc communication networks amongst cheap sensors [26]. It also has applications to data cleaning to find columns that are mostly similar [14]. Even if the rows in the two columns are in different orders, streaming algorithms for $L_0$ can quickly identify ...

37 | On synopses for distinct-value estimation under multiset operations
- Beyer, Haas, et al.
- 2007
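Citation Context (continuing the comparison table of Figure 1):

| Paper | Space | Update Time | Notes |
| --- | --- | --- | --- |
| ... | ... | ... | ... the paper |
| [4] | $O(\varepsilon^{-2} (\log(\varepsilon^{-1} \log n) + \log n))$ | $O(\varepsilon^{-2} (\log(\varepsilon^{-1} \log n)))$ | Algorithm III in the paper |
| [16] | $O(\varepsilon^{-2} \log\log n + \log n)$ | - | Random oracle, additive error |
| [18] | $O(\varepsilon^{-2} \log n)$ | - | Random oracle |
| [6] | $O(\varepsilon^{-2} \log n)$ | $O(\log(\varepsilon^{-1}))$ | |
| [20] | $O(\varepsilon^{-2} \log\log n + \log n)$ | - | Random oracle, additive error |
| This work | $O(\varepsilon^{-2} + \log n)$ | $O(1)$ | Optimal |

Figure 1: Comparison of our algorithm to previous algorithms on estim...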

19 | Detecting ddos attacks on isp networks
- Akella, Bharambe, et al.
- 2003
Citation Context ... statistics such as distinct destination IPs, requested URLs, and source-destination pairs on a link. Distinct elements estimation is also useful in detecting Denial of Service attacks and port scans [2, 18]. In such applications the data is too large to fit at once in main memory or too massive to be stored, being a continuous flow of data packets. This makes small-space algorithms necessary. Furthermor...

19 | Access path selection in a relational database management system
- Lorie, Price
- 1979

19 | Hyperloglog: The analysis of a near-optimal cardinality estimation algorithm. Conference on Analysis of Algorithms
- Flajolet, Fusy, et al.
- 2007
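Citation Context (continuing the comparison table of Figure 1):

| Paper | Space | Update Time | Notes |
| --- | --- | --- | --- |
| ... | ...log n) + log n)) | $O(\varepsilon^{-2} (\log(\varepsilon^{-1} \log n)))$ | Algorithm III in the paper |
| [16] | $O(\varepsilon^{-2} \log\log n + \log n)$ | - | Random oracle, additive error |
| [18] | $O(\varepsilon^{-2} \log n)$ | - | Random oracle |
| [6] | $O(\varepsilon^{-2} \log n)$ | $O(\log(\varepsilon^{-1}))$ | |
| [20] | $O(\varepsilon^{-2} \log\log n + \log n)$ | - | Random oracle, additive error |
| This work | $O(\varepsilon^{-2} + \log n)$ | $O(1)$ | Optimal |

Figure 1: Comparison of our algorithm to previous algorithms on estimating the number of distinct elem...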

18 | On the Exact Space Complexity of Sketching and Streaming Small Norms
- Kane, Nelson, et al.
- 2010
Citation Context ...$O(\log(1/\varepsilon))$ update time and required $O(\varepsilon^{-2} \log(n) \log(mM))$ space. Our update and reporting times are optimal, and the space is optimal up to the $\log(1/\varepsilon) + \log\log(mM)$ term due to known lower bounds [3, 28]. Furthermore, unlike with Ganguly’s algorithm, our algorithm does not require that $x_i \ge 0$ for each $i$ to operate correctly.

1.1 Overview of our algorithms

Our algorithms build upon several techniques ...

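16 | Loglog counting of large cardinalities (extended abstract
- Durand, Flajolet
- 2003
Citation Context (continuing the comparison table of Figure 1):

| Paper | Space | Update Time | Notes |
| --- | --- | --- | --- |
| ... | ... | ... | ...per |
| [4] | $O(\varepsilon^{-2} \log\log n + \mathrm{poly}(\log(\varepsilon^{-1} \log n)) \log n)$ | $\varepsilon^{-2}\, \mathrm{poly}(\log(\varepsilon^{-1} \log n))$ | Algorithm II in the paper |
| [4] | $O(\varepsilon^{-2} (\log(\varepsilon^{-1} \log n) + \log n))$ | $O(\varepsilon^{-2} (\log(\varepsilon^{-1} \log n)))$ | Algorithm III in the paper |
| [16] | $O(\varepsilon^{-2} \log\log n + \log n)$ | - | Random oracle, additive error |
| [18] | $O(\varepsilon^{-2} \log n)$ | - | Random oracle |
| [6] | $O(\varepsilon^{-2} \log n)$ | $O(\log(\varepsilon^{-1}))$ | |
| [20] | $O(\varepsilon^{-2} \log\log n + \log n)$ | - | Random oracle, additive error |
| This work | O... | | |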

16 | The connectivity and fault-tolerance of the Internet topology
- Palmer, Faloutsos, et al.
- 2001
Citation Context ...istinct queries made to a search engine, or distinct users clicking on a link or visiting a website. Distinct item estimation was also used in estimating connectivity properties of the Internet graph [33]. ∗ Harvard University, Department of Mathematics. dankane@math.harvard.edu. † MIT Computer Science and Artificial Intelligence Laboratory. minilek@mit.edu. ‡ IBM Almaden Research Center, 650 Harry Ro...

16 | Counting distinct items over update streams
- Ganguly
- 2007
Citation Context ...mation algorithm with $O(1)$ update and reporting times, using a total of $O(\varepsilon^{-2} \log(n)(\log(1/\varepsilon) + \log\log(mM)))$ bits of space, both of which improve upon the previously best known algorithm of Ganguly [23], which had $O(\log(1/\varepsilon))$ update time and required $O(\varepsilon^{-2} \log(n) \log(mM))$ space. Our update and reporting times are optimal, and the space is optimal up to the $\log(1/\varepsilon) + \log\log(mM)$ term due to known l...

15 | Computation of the least significant set bit
- Brodnik
- 1993
Citation Context ...unning time

In this subsection we discuss an implementation of our $F_0$ algorithm in Figure 3 with $O(1)$ update and reporting times. We first state a few theorems from previous works. Theorem 5 (Brodnik [8], Fredman and Willard [22]). The least and most significant bits of an integer fitting in a machine word can be computed in constant time. The next two theorems give hash families which have strong in...

10 | Uniform hashing in constant time and optimal space
- Pagh, Pagh
- 2008
Citation Context ... (see [30, Ch. 5]). Furthermore, we analyze our algorithm without assuming a truly random hash function, and show that a combination of fast k-wise independent hash functions [36] and uniform hashing [32] suffices to give sufficient concentration in all probabilistic events we consider. Our $L_0$-estimation algorithm also uses subsampling and a balls-and-bins approach, but needs a different subroutine for...

9 | A multi-round communication lower bound for gap hamming and some consequences
- Brody, Chakrabarti
- 2009

9 | Toward automated large-scale information integration and discovery
- Brown, Haas, et al.
- 2005

9 | Multi-Dimensional Clustering: a new data layout scheme
- Padmanabhan, Bhattacharjee, et al.

5 | Compact Dictionaries for Variable-Length Keys and Data
- Blandford, Blelloch
- 2008
Citation Context ... counters (when R increases), or to locate the starting position of a counter in a bitpacked array when reading and writing entries. For the former, we use a “variable-bitlength array” data structure [7], and for the latter we use an approach inspired by the technique of deamortization of global rebuilding (see [30, Ch. 5]). Furthermore, we analyze our algorithm without assuming a truly random hash f...

1 | Hyperloglog: The analysis of a near-optimal cardinality estimation algorithm
- Math, Sci
- 2007

1 | On universal classes of extremely random hash functions
- Siegel

1 | Periodicity in streams
- Ergün, Jowhari, Saglam
- 2010