## String hashing for linear probing (2009)

Venue: In Proc. 20th SODA

Citations: 8 (3 self)

### BibTeX

@INPROCEEDINGS{Thorup09stringhashing,

author = {Mikkel Thorup},

title = {String hashing for linear probing},

booktitle = {In Proc. 20th SODA},

year = {2009},

pages = {655--664}

}

### Abstract

Linear probing is one of the most popular implementations of dynamic hash tables storing all keys in a single array. When we get a key, we first hash it to a location. Next we probe consecutive locations until the key or an empty location is found. At STOC’07, Pagh et al. presented data sets where the standard implementation of 2-universal hashing leads to an expected number of Ω(log n) probes. They also showed that with 5-universal hashing, the expected number of probes is constant. Unfortunately, we do not have 5-universal hashing for, say, variable length strings. When we want to do such complex hashing from a complex domain, the generic standard solution is that we first do collision-free hashing (w.h.p.) into a simpler intermediate domain, and second do the complicated hash function on this intermediate domain. Our contribution is that for an expected constant number of linear probes, it suffices that each key has O(1) expected collisions with the first hash function, as long as the second hash function is 5-universal. This means that the intermediate domain can be n times smaller, and such a smaller intermediate domain typically means that the overall hash function can be made simpler and at least twice as fast. The same doubling of hashing speed for O(1) expected probes follows for most domains bigger than 32-bit integers, e.g., 64-bit integers and fixed length strings. In addition, we study how the overhead from linear probing diminishes as the array gets larger, and what happens if strings are stored directly as intervals of the array. These cases were not considered by Pagh et al.
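The two-stage scheme in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and not taken from the paper: the class `LinearProbingSet`, the prime `P`, a Horner-style polynomial string hash as the first function, and a random degree-4 polynomial over Z_P as a stand-in for the 5-universal second function. For simplicity the intermediate domain is all of Z_P, so the sketch does not show the smaller intermediate domain the paper argues for.

```python
import random

P = (1 << 61) - 1  # Mersenne prime used as field size (an assumption of this sketch)


class LinearProbingSet:
    """Insert-only linear probing set stored in a single array."""

    def __init__(self, capacity, seed=0):
        rng = random.Random(seed)
        self.slots = [None] * capacity
        self.size = 0
        # First stage: polynomial string hash with one random multiplier.
        self.a = rng.randrange(1, P)
        # Second stage: random degree-4 polynomial over Z_P (5 random coefficients).
        self.coeffs = [rng.randrange(P) for _ in range(5)]

    def _h1(self, key):
        # Map a string to the intermediate domain Z_P; collisions are possible.
        acc = 0
        for byte in key.encode("utf-8"):
            acc = (acc * self.a + byte + 1) % P
        return acc

    def _h2(self, y):
        # Evaluate the degree-4 polynomial by Horner's rule, then reduce to a slot.
        acc = 0
        for c in self.coeffs:
            acc = (acc * y + c) % P
        return acc % len(self.slots)

    def _probe(self, key):
        # Scan consecutive slots until the key or an empty slot is found.
        i = self._h2(self._h1(key))
        probes = 1
        while self.slots[i] is not None and self.slots[i] != key:
            i = (i + 1) % len(self.slots)
            probes += 1
        return i, probes

    def insert(self, key):
        # Keep at least one empty slot so that probing always terminates.
        if self.size + 1 >= len(self.slots):
            raise RuntimeError("table full; this sketch does not resize")
        i, _ = self._probe(key)
        if self.slots[i] is None:
            self.slots[i] = key
            self.size += 1

    def __contains__(self, key):
        i, _ = self._probe(key)
        return self.slots[i] is not None
```

The table compares the stored keys themselves, so a collision in the first stage costs extra probes but never produces a wrong answer.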

### Citations

667 | Universal classes of hash functions
- Carter, Wegman
- 1979
Citation Context ... understanding the properties of linear probing, is based on the assumption that h is a truly random function, mapping all keys independently. In 1977, Carter and Wegman’s notion of universal hashing [3] initiated a new era in the design of hashing algorithms, where explicit and efficient ways of choosing provably good hash functions replaced the unrealistic assumption of complete randomness. They as...

329 | New hash functions and their use in authentication and set equality
- Wegman, Carter
- 1981
Citation Context ...ness. They asked to extend the analysis to linear probing. Carter and Wegman [3] defined universal hashing as having low collision probability, that is 1/t if we hash to a domain of size t. Later, in [20], they define k-universal hashing as a function mapping any k keys independently and uniformly at random. Note that 2-universal hashing is stronger than universal hashing in that the identity is unive...

128 | The Art of Computer Programming, Volume III: Sorting and Searching
- Knuth
- 1973
Citation Context ...y as intervals of the array that we probe, and thus avoid the pointers. Practice. The practical use of linear probing dates back at least to 1954 to an assembly program by Samuel, Amdahl, Boehme (cf. [10]). It is one of the simplest schemes to implement in dynamic settings where keys can be inserted and deleted. Several recent experimental studies [2, 7, 14] have found linear probing to be the fastest...

123 | Cuckoo hashing
- Pagh, Rodler
- 2004
Citation Context ... assembly program by Samuel, Amdahl, Boehme (cf. [10]). It is one of the simplest schemes to implement in dynamic settings where keys can be inserted and deleted. Several recent experimental studies [2, 7, 14] have found linear probing to be the fastest hash table organization for moderate load factors (30-70%). While linear probing is known to require more instructions than other open addressing methods, ...

111 | UMAC: Fast and secure message authentication
- Black, Halevi, et al.
- 1999
Citation Context ...domain. 5.3 Fixed length strings We now consider the case that the input domain is a string of 2r 32-bit characters. For h1 we will use a straightforward combination of a trick for fast signatures in [1] with the 2-universal hashing from [4]. The combination is folklore [13]. Thus, our input is $x = x_0 \cdots x_{2r-1}$, where $x_i$ is a 32-bit integer. The hash function is defined in terms of 2r random 64-bit integers $a_0, \ldots, a_{2r-1}$...

61 | Tabulation based 4-universal hashing with applications to second moment estimation
- Thorup, Zhang
- 2004
Citation Context ...stly letters and digits, and then they would be concentrated in two intervals. On the positive side, Pagh et al. [12] showed that with 5-universal hashing, the expected number of probes is O(1). From [19] we know that 5-universal hashing is fast for small domains like 32-bit integers. Main contribution. Our main contribution is that for an expected constant number of linear probes, we can salvage 2-uni...

59 | A reliable randomized algorithm for the closest-pair problem
- Dietzfelbinger, Hagerup, et al.
- 1997
Citation Context ... domain will require more than twice as many look-ups and twice as much space. 5.2 64-bit integers We only need universal hashing from 64-bit integers to ℓ-bit integers, so we can use the method from [6]: We pick a random odd 64-bit integer a, and compute $h_a(x) = (a \cdot x) \gg (64 - \ell)$. This method actually exploits that the overflow from $a \cdot x$ is discarded. The probability of collision between any two keys is at...
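The multiply-shift step quoted in this citation context can be sketched as follows. The function names `multiply_shift` and `random_odd_multiplier` and the constant `MASK64` are hypothetical names chosen for this sketch. Python integers do not overflow, so the sketch masks the product to 64 bits to imitate the discarded overflow of a 64-bit multiplication in C.

```python
import random

MASK64 = (1 << 64) - 1  # keeps the low 64 bits, imitating C's unsigned overflow


def random_odd_multiplier(rng=random):
    # Draw 64 random bits and force the lowest bit to 1 so the multiplier is odd.
    return rng.getrandbits(64) | 1


def multiply_shift(a, x, out_bits):
    """Hash a 64-bit integer x to out_bits bits using the odd multiplier a."""
    if not 1 <= out_bits <= 64:
        raise ValueError("out_bits must be between 1 and 64")
    # Multiply, discard the overflow, and keep the top out_bits bits of the result.
    return ((a * x) & MASK64) >> (64 - out_bits)
```

A caller would draw `a = random_odd_multiplier()` once per table and reuse it for every key, since the collision guarantee is over the random choice of the multiplier.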

53 | The C++ Programming Language – Special Edition
- Stroustrup
- 2007
Citation Context ...twice as slow. First we consider 64-bit integers, then fixed length strings, and finally variable length strings. We assume that the implementation is done in a programming language like C [8] or C++ [18] leading to efficient and portable code that can be inlined from many other programming languages. We note that this section in itself has no theoretical contribution. It is demonstrating how the theo...

39 | The C Programming Language, 2nd ed
- Kernighan, Ritchie
- 1988
Citation Context ...e at least twice as slow. First we consider 64-bit integers, then fixed length strings, and finally variable length strings. We assume that the implementation is done in a programming language like C [8] or C++ [18] leading to efficient and portable code that can be inlined from many other programming languages. We note that this section in itself has no theoretical contribution. It is demonstrating ...

37 | Universal hashing and k-wise independent random variables via integer arithmetic without primes
- Dietzfelbinger
- 1996
Citation Context ...w consider the case that the input domain is a string of 2r 32-bit characters. For h1 we will use a straightforward combination of a trick for fast signatures in [1] with the 2-universal hashing from [4]. The combination is folklore [13]. Thus, our input is $x = x_0 \cdots x_{2r-1}$, where $x_i$ is a 32-bit integer. The hash function is defined in terms of 2r random 64-bit integers $a_0, \ldots, a_{2r-1}$. The hash function i...

33 | Why Simple Hash Functions Work: Exploiting the Entropy in a Data Stream
- Mitzenmacher, Vadhan
- 2008
Citation Context ...t-case is one or two intervals, something that could very well appear in practice, possibly explaining the experienced unreliability from [7]. It is interesting to contrast this with the result from [11] that simple hashing works if the input has high entropy. The situation is similar to the classic one for non-randomized quick sort, where we get into quadratic running time if the input is already so...

27 | Polynomial hash functions are reliable (extended abstract)
- Dietzfelbinger, Gil, et al.
Citation Context ...a factor 2 in speed with our new smaller intermediate domain. 5.4 Variable length strings For the initial hashing h1 of variable length strings $x = x_0 x_1 \cdots x_v$, we can use the method from [5]. We view $x_0, \ldots, x_v$ as coefficients of a polynomial over $\mathbb{Z}_p$, assuming $x_0, \ldots, x_v \in [p]$. We pick a single random $a \in [p]$, and compute the hash function $h_a(x_0 \cdots x_v) = \sum_{i=0}^{v} x_i a^i$. If y is another ...
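The polynomial hash quoted in this citation context can be sketched as below. The function name `poly_hash` is hypothetical, and the sketch assumes that the characters are already integers in [p] and that all arithmetic is done modulo the prime p, as evaluation over Z_p implies. It uses Horner's rule, processing the highest coefficient first, so that the result equals the sum of x_i times a to the power i.

```python
def poly_hash(chars, a, p):
    """Evaluate sum(chars[i] * a**i) modulo the prime p using Horner's rule."""
    acc = 0
    # Start from the highest-degree coefficient x_v and work down to x_0.
    for x in reversed(chars):
        if not 0 <= x < p:
            raise ValueError("every character must lie in [p]")
        acc = (acc * a + x) % p
    return acc
```

For example, `poly_hash([1, 2, 3], 10, 1009)` evaluates 1 + 2·10 + 3·100 modulo 1009.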

17 | The analysis of closed hashing under limited randomness (extended abstract)
- Schmidt, Siegel
- 1990
Citation Context ... Often the uniformity does not have to be exact, but the independence is critical for the analysis. The first analysis of linear probing based on k-universal hashing was given by Siegel and Schmidt in [16, 17]. Specifically, they show that O(log n)-universal hashing is sufficient to achieve essentially the same performance as in the fully random case. Here n denotes the number of keys inserted in the hash t...

14 | Linear probing with constant independence
- Pagh, Pagh, et al.
- 2007
Citation Context ...nce as in the fully random case. Here n denotes the number of keys inserted in the hash table. However, we do not have any practical implementation of O(log n)-universal hashing. In 2007, Pagh et al. [12] studied the expected number of probes with worst-case data sets. They showed that with the standard implementation of 2-universal hashing, the expected number of linear probes could be Ω(log n). The w...

12 | Graph and hashing algorithms for modern architectures: Design and performance
- Black, Martel, et al.
- 1998
Citation Context ... assembly program by Samuel, Amdahl, Boehme (cf. [10]). It is one of the simplest schemes to implement in dynamic settings where keys can be inserted and deleted. Several recent experimental studies [2, 7, 14] have found linear probing to be the fastest hash table organization for moderate load factors (30-70%). While linear probing is known to require more instructions than other open addressing methods, ...

10 | How caching affects hashing
- Heileman, Luo
- 2005
Citation Context ... assembly program by Samuel, Amdahl, Boehme (cf. [10]). It is one of the simplest schemes to implement in dynamic settings where keys can be inserted and deleted. Several recent experimental studies [2, 7, 14] have found linear probing to be the fastest hash table organization for moderate load factors (30-70%). While linear probing is known to require more instructions than other open addressing methods, ...

6 | Notes on "open" addressing
- Knuth
- 1963
Citation Context ...ularly sensitive to a bad choice of hash function, Heileman and Luo [7] advise against linear probing for general-purpose use. Analysis. Linear probing was first analyzed by Knuth in a 1963 memorandum [9] now considered to be the birth of the area of analysis of algorithms [15]. Knuth’s analysis, as well as most of the work that has since gone into understanding the properties of linear probing, is ba...

6 | Closed hashing is computable and optimally randomizable with universal hash functions
- Siegel, Schmidt
- 1995
Citation Context ... Often the uniformity does not have to be exact, but the independence is critical for the analysis. The first analysis of linear probing based on k-universal hashing was given by Siegel and Schmidt in [16, 17]. Specifically, they show that O(log n)-universal hashing is sufficient to achieve essentially the same performance as in the fully random case. Here n denotes the number of keys inserted in the hash t...

3 | Special issue on average case analysis of algorithms
- Prodinger, S
- 1998
Citation Context ...dvice against linear probing for general-purpose use. Analysis. Linear probing was first analyzed by Knuth in a 1963 memorandum [9] now considered to be the birth of the area of analysis of algorithms [15]. Knuth’s analysis, as well as most of the work that has since gone into understanding the properties of linear probing, is based on the assumption that h is a truly random function, mapping all keys ...