## Why simple hash functions work: Exploiting the entropy in a data stream (2008)

### Other Repositories/Bibliography

Venue: Proceedings of the 19th Annual ACM-SIAM Symposium on Discrete Algorithms

Citations: 33 (6 self)

### BibTeX

@INPROCEEDINGS{Mitzenmacher08whysimple,
  author = {Michael Mitzenmacher and Salil Vadhan},
  title = {Why simple hash functions work: Exploiting the entropy in a data stream},
  booktitle = {Proceedings of the 19th Annual ACM-SIAM Symposium on Discrete Algorithms},
  year = {2008},
  pages = {746--755}
}

### Abstract

Hashing is fundamental to many algorithms and data structures widely used in practice. For theoretical analysis of hashing, there have been two main approaches. First, one can assume that the hash function is truly random, mapping each data item independently and uniformly to the range. This idealized model is unrealistic because a truly random hash function requires an exponential number of bits to describe. Alternatively, one can provide rigorous bounds on performance when explicit families of hash functions are used, such as 2-universal or O(1)-wise independent families. For such families, performance guarantees are often noticeably weaker than for ideal hashing. In practice, however, it is commonly observed that weak hash functions, including 2-universal hash functions, perform as predicted by the idealized analysis for truly random hash functions. In this paper, we try to explain this phenomenon. We demonstrate that the strong performance of universal hash functions in practice can arise naturally from a combination of the randomness of the hash function and the data. Specifically, following the large body of literature on random sources and randomness extraction, we model the data as coming from a “block source,” whereby
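To make the term "2-universal" in the abstract concrete, here is a minimal Python sketch of one classic such family, the Carter and Wegman construction h(x) = ((a·x + b) mod p) mod m. It is an illustration only and is not taken from the paper: the names `make_hash` and `collision_probability`, the choice of prime, and the assumption that keys are non-negative integers below `p` are all hypothetical.

```python
import random

# Assumption: keys are integers in [0, P). P is the Mersenne prime 2^31 - 1.
P = 2_147_483_647

def make_hash(m, rng, p=P):
    """Draw one function h(x) = ((a*x + b) mod p) mod m from the family."""
    a = rng.randrange(1, p)   # a is nonzero
    b = rng.randrange(0, p)
    return lambda x: ((a * x + b) % p) % m

def collision_probability(x, y, m, p):
    """Exact Pr over (a, b) that h(x) == h(y), by enumerating a small family.

    Only practical for a small prime p; for distinct x, y in [0, p)
    the result is at most 1/m, which is the 2-universal property.
    """
    hits = 0
    for a in range(1, p):
        for b in range(p):
            if ((a * x + b) % p) % m == ((a * y + b) % p) % m:
                hits += 1
    return hits / ((p - 1) * p)

if __name__ == "__main__":
    h = make_hash(64, random.Random(0))
    print([h(x) for x in range(5)])
    # Two fixed, distinct keys collide with probability at most 1/m = 0.1.
    print(collision_probability(3, 58, m=10, p=101))
```

The guarantee here covers only pairs of keys on worst-case data. The abstract's point is that when the data itself carries enough entropy, such a family can behave much like a truly random function.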

### Citations

1460 | Space/Time Trade-Offs in Hash Coding with Allowable Errors
- Bloom
- 1970
Citation Context ... the data items come from a block source with roughly (d + 2) log T bits of entropy per data item. For 4-wise independence, the entropy requirement is reduced to roughly (d + 1) log T. Bloom filters [Blo] are data structures for approximately storing sets in which membership tests can result in false positives with some bounded probability. We begin by showing that there is a constant gap in the false...

726 | A pseudorandom generator from any one-way function - Håstad, Impagliazzo, et al. - 1999 |

696 | The art of computer programming, volume 3: Sorting and searching
- Knuth
- 1974
Citation Context ... Hashing is at the core of many fundamental algorithms and data structures, including all varieties of hash tables [Knu], Bloom filters and their many variants [BM2], summary algorithms for data streams [Mut], and many others. Traditionally, applications of hashing are analyzed as if the hash function is a truly random...

668 | Universal classes of hash functions
- Carter, Wegman
- 1977
Citation Context ...ruly random function mapping {0, 1}^n to {0, 1}^m requires an exponential number of bits to describe. For this reason, a line of theoretical work, starting with the seminal paper of Carter and Wegman [CW] on universal hashing, has sought to provide rigorous bounds on performance when explicit families of hash functions are used, e.g. ones whose description and computational complexity are polynomial i...

378 | Data streams: algorithms and applications
- Muthukrishnan
- 2005
Citation Context ...shing is at the core of many fundamental algorithms and data structures, including all varieties of hash tables [Knu], Bloom filters and their many variants [BM2], summary algorithms for data streams [Mut], and many others. Traditionally, applications of hashing are analyzed as if the hash function is a truly random function (a.k.a. “random oracle”) mapping each data item independently and uniformly to...

345 | Network Applications of Bloom Filters: A Survey
- Broder, Mitzenmacher, et al.
- 2002
Citation Context ... Hashing is at the core of many fundamental algorithms and data structures, including all varieties of hash tables [Knu], Bloom filters and their many variants [BM2], summary algorithms for data streams [Mut], and many others. Traditionally, applications of hashing are analyzed as if the hash function is a truly random function (a.k.a. “random oracle”) mapping ea...

331 | New Hash Functions and Their Use in Authentication and Set Equality
- Wegman, Carter
- 1981
Citation Context ... the set of all functions mapping [N] to [M], i.e. the N random variables {H(x)}_{x∈[N]} are independent and uniformly distributed over [M]. For s ∈ N, H is s-wise independent (a.k.a. strongly s-universal [WC]) if for every sequence of distinct elements x_1, ..., x_s ∈ [N], the random variables H(x_1), ..., H(x_s) are independent and uniformly distributed over [M]. H is s-universal if for every sequence ...

249 | Balanced allocations
- Azar, Broder, et al.
- 1999
Citation Context ...M bits of (Rényi) entropy per item, where M is the size of the hash table. For 4-wise independent hashing, we only need roughly 3 log M bits of entropy per item. With the balanced allocation paradigm [ABKU], it is known that when T items are hashed to T buckets, with each item being sequentially placed in the least loaded of d choices (e.g. d = 2), the maximum load is log log T / log d + O(1) with high p...

230 | Randomness is linear in space
- Nisan, Zuckerman
- 1996
Citation Context ... ≤ Σ_{y,y′} Pr[Y = y] · Pr[Y = y′] · cp(X|Y = y)^{1/2} · cp(X|Y = y′)^{1/2} ≤ max_{y∈Supp(Y)} cp(X|Y = y). 4.2 Extracting Randomness. A randomness extractor [NZ] can be viewed as a family of hash functions with the property that for any random variable X with enough entropy, if we pick a random hash function h from the family, then h(X) is “close” to being un...

186 | Unbiased bits from sources of weak randomness and probabilistic communication complexity
- Chor, Goldreich
- 1988
Citation Context ... in {0, 1}^n is also very unrealistic (not to mention that it trivializes many applications). Here we propose that an intermediate model, previously studied in the literature on randomness extraction [CG], may also be an appropriate data model for hashing applications. Under the assumption that the data fits this model, we show that relatively weak hash functions achieve essentially the same performan...

183 | How to recycle random bits
- Impagliazzo, Zuckerman
- 1989
Citation Context ... that with high probability over h ← H, the random variable h(X) is close to uniform. The above formulation of the Leftover Hash Lemma, passing through collision probability, is attributed to Rackoff [IZ]. It relies on the fact that if the collision probability of a random variable is close to that of the uniform distribution, then the random variable is close to uniform in statistical difference. This fact ...

182 | Privacy amplification by public discussion
- Bennett, Brassard, et al.
- 1988
Citation Context ...lled ε-close if ∆(X, Y) ≤ ε. The classic Leftover Hash Lemma shows that universal hash functions are randomness extractors with respect to statistical difference. Lemma 4.4 (The Leftover Hash Lemma [BBR, ILL]) Let H : [N] → [M] be a random hash function from a 2-universal family H. For every random variable X taking values in [N] with cp(X) ≤ 1/K, we have cp(H, H(X)) ≤ (1/|H|)·(1/M + 1/K), and thus (H, H(X)...

162 | Handbook of Algorithms and Data Structures - Gonnet - 1984 |

157 | Deep packet inspection using parallel bloom filters - Dharmapurikar, Krishnamurthy, et al. |

144 | Recent developments in explicit constructions of extractors - Shaltiel |

144 | Smoothed analysis of algorithms: Why the simplex algorithm usually takes polynomial time
- Spielman, Teng
- 2001
Citation Context ...n worst-case and average-case analysis of algorithms for other kinds of problems. Examples include the semi-random graph model of Blum and Spencer [BS], and the smoothed analysis of Spielman and Teng [ST]. Interestingly, Blum and Spencer’s semi-random graph models are based on Santha and Vazirani’s model of semi-random sources [SV], which in turn were the precursor to the Chor–Goldreich model of block...

123 | Cuckoo hashing
- Pagh, Rodler
- 2004
Citation Context ... of distinct data items, we have Pr[MaxLoad_BA(x, H) > log log T / log d + c] ≤ 1/T^γ. There are other variations on this scheme, including the asymmetric version due to Vöcking [Vöc] and cuckoo hashing [PR]; we choose to study the original setting for simplicity. The asymmetric scheme has been recently studied under explicit functions [Woe], similar to those of [DW]. At this point, we know of no non-tri...

110 | Simulating BPP using a general weak random source - Zuckerman - 1996 |

99 | Generating quasi-random sequences from semi-random sources
- Santha, Vazirani
- 1986
Citation Context ... Blum and Spencer [BS], and the smoothed analysis of Spielman and Teng [ST]. Interestingly, Blum and Spencer’s semi-random graph models are based on Santha and Vazirani’s model of semi-random sources [SV], which in turn were the precursor to the Chor–Goldreich model of block sources [CG]. Chor and Goldreich suggest using block sources as an input model for communication complexity, but surprisingly it...

90 | Extracting randomness: A survey and new constructions - Nisan, Ta-Shma - 1999 |

81 | How Asymmetry Helps Load Balancing
- Vöcking
- 2003
Citation Context ... every sequence x ∈ [N]^T of distinct data items, we have Pr[MaxLoad_BA(x, H) > log log T / log d + c] ≤ 1/T^γ. There are other variations on this scheme, including the asymmetric version due to Vöcking [Vöc] and cuckoo hashing [PR]; we choose to study the original setting for simplicity. The asymmetric scheme has been recently studied under explicit functions [Woe], similar to those of [DW]. At this poin...

67 | Using multiple hash functions to improve ip lookups
- Broder, Mitzenmacher
- 2001
Citation Context ...int, we know of no non-trivial upper or lower bounds for the balanced allocation paradigm using families of hash functions with constant independence, although performance has been tested empirically [BM1]. Such bounds have been a long-standing open question in this area. 3.3 Bloom Filters. A Bloom filter [Blo] represents a set x = {x_1, ..., x_T} where each x_i ∈ [N] using an array of M bits and ℓ has...

61 | Tabulation based 4-universal hashing with applications to second moment estimation
- Thorup, Zhang
- 2004
Citation Context ...nction is “good” from the uniformity of the hashed values h(X_i). We can reduce the entropy required even further if we use 4-wise independent hash functions, which also have very fast implementations [TZ]. Applications. We illustrate our approach with several specific applications. Here we informally summarize the results; definitions and discussions appear in Sections 3 and 4. Footnote 1: Chor and Goldreich ca...

53 | Efficient hardware hashing functions for high performance computers - Ramakrishna, Fu, et al. - 1997 |

49 | Coloring random and semi-random k-colorable graphs
- Blum, Spencer
- 1995
Citation Context ...s works that have examined intermediate models between worst-case and average-case analysis of algorithms for other kinds of problems. Examples include the semi-random graph model of Blum and Spencer [BS], and the smoothed analysis of Spielman and Teng [ST]. Interestingly, Blum and Spencer’s semi-random graph models are based on Santha and Vazirani’s model of semi-random sources [SV], which in turn we...

42 | Practical performance of Bloom filters and parallel free-text searching
- Ramakrishna
- 1989
Citation Context ...actice, however, the performance of standard universal hashing seems to match what is predicted for ideal hashing. This phenomenon was experimentally observed long ago in the setting of Bloom filters [Ram2]; other reported examples include [BM1, DKSL, PR, Ram1, RFB]. Thus, it does not seem truly necessary to use the more complex hash functions for which this kind of performance can be proven. We view th...

34 | Randomness Extraction and Key Derivation Using the CBC, Cascade and HMAC Modes - Dodis, Gennaro, et al. - 2004 |

30 | Less hashing, same performance: Building a better Bloom filter
- Kirsch, Mitzenmacher
- 2008
Citation Context ...lters on worst-case data using O(1)-wise independence. But the following more mild reduction in randomness, using 2 truly random hash functions instead of ℓ, will be useful to use later. Theorem 3.7 ([KM]) Let H = (H_1, H_2) be a truly random hash function mapping [N] to [M/ℓ]^2, where M/ℓ is a prime integer. Define H′ = (H′_1, ..., H′_ℓ) : [N] → [M/ℓ]^ℓ by H′_i(w) = H_1(w) + (i − 1)H_2(w) mod M/...

26 | On universal classes of extremely random constant-time hash functions - Siegel |

21 | Almost random graphs with simple hash functions
- Dietzfelbinger, Woelfel
- 2003
Citation Context ...o to Vöcking [Vöc] and cuckoo hashing [PR]; we choose to study the original setting for simplicity. The asymmetric scheme has been recently studied under explicit functions [Woe], similar to those of [DW]. At this point, we know of no non-trivial upper or lower bounds for the balanced allocation paradigm using families of hash functions with constant independence, although performance has been tested ...

17 | Uniform hashing in constant time and linear space - Pagh, Pagh |

17 | The analysis of closed hashing under limited randomness (extended abstract) - Schmidt, Siegel - 1990 |

14 | Linear probing with constant independence
- Pagh, Pagh, et al.
- 2007
Citation Context ...versal or O(1)-wise independent), but the performance guarantees are noticeably weaker than for ideal hashing. (A motivating recent example is the analysis of linear probing under 5-wise independence [PPR].) In other cases, the performance guarantees are (essentially) optimal, but the hash functions are more complex and expensive (e.g. with a super-linear time or space requirement). For example, if at ...

9 | Hashing practice: analysis of hashing and universal hashing - Ramakrishna - 1988 |

4 | Asymmetric balanced allocation with simple hash functions
- Woelfel
- 2006
Citation Context ...ng the asymmetric version due to Vöcking [Vöc] and cuckoo hashing [PR]; we choose to study the original setting for simplicity. The asymmetric scheme has been recently studied under explicit functions [Woe], similar to those of [DW]. At this point, we know of no non-trivial upper or lower bounds for the balanced allocation paradigm using families of hash functions with constant independence, although pe...
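The [KM] citation context above quotes a construction in which ℓ Bloom filter hash values are derived from two base hashes as H_1(w) + (i − 1)H_2(w) modulo a prime block size M/ℓ. The following Python sketch illustrates that idea under stated assumptions: the class name `DoubleHashBloom`, its parameters, and the use of linear hashes modulo a large prime as stand-ins for H_1 and H_2 are all hypothetical. The quoted theorem assumes truly random base hashes, which this sketch does not provide.

```python
import random

class DoubleHashBloom:
    """Partitioned Bloom filter with ell blocks of q bits each.

    Assumptions (not from the paper): keys are integers in [0, p),
    q (playing the role of M/ell) is prime, and the two base hashes
    are simple linear hashes used purely for illustration.
    """

    def __init__(self, q, ell, seed=0, p=2_147_483_647):
        rng = random.Random(seed)
        self.q, self.ell, self.p = q, ell, p
        # Coefficients (a, b) for the two base hashes standing in for H_1, H_2.
        self.coeffs = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(2)]
        self.bits = [[False] * q for _ in range(ell)]

    def _base(self, j, w):
        a, b = self.coeffs[j]
        return ((a * w + b) % self.p) % self.q

    def _indices(self, w):
        h1, h2 = self._base(0, w), self._base(1, w)
        # i = 0..ell-1 here corresponds to (i - 1) for i = 1..ell in the quoted formula.
        return [(h1 + i * h2) % self.q for i in range(self.ell)]

    def add(self, w):
        for block, idx in zip(self.bits, self._indices(w)):
            block[idx] = True

    def __contains__(self, w):
        return all(block[idx] for block, idx in zip(self.bits, self._indices(w)))

if __name__ == "__main__":
    bf = DoubleHashBloom(q=1009, ell=4)
    for w in range(0, 2000, 10):
        bf.add(w)
    print(20 in bf)  # inserted keys are always reported as present
```

Membership tests never produce false negatives. The false-positive rate depends on q, ℓ, the number of inserted keys, and how well the stand-in hashes behave on the data, which is the question the paper studies.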