## Cache-Oblivious Hashing (2010)

Citations: 1 (0 self)

### BibTeX

@MISC{Pagh10cache-oblivioushashing,
    author = {Rasmus Pagh and Zhewei Wei and Ke Yi and Qin Zhang},
    title = {Cache-Oblivious Hashing},
    year = {2010}
}

### Abstract

The hash table, especially its external memory version, is one of the most important index structures in large databases. Assuming a truly random hash function, it is known that in a standard external hash table with block size $b$, searching for a particular key takes only an expected average of $t_q = 1 + 1/2^{\Omega(b)}$ disk accesses for any load factor $\alpha$ bounded away from 1. However, such near-perfect performance is achieved only when $b$ is known and the hash table is specifically tuned to work with that block size. In this paper we study whether it is possible to build a cache-oblivious hash table that works well with any blocking. Such a hash table will automatically perform well across all levels of the memory hierarchy and does not need any hardware-specific tuning, an important feature in autonomous databases. We first show that linear probing, a classical collision resolution strategy for hash tables, can easily be made cache-oblivious, but it only achieves $t_q = 1 + O(\alpha/b)$. Then we demonstrate that it is possible to obtain $t_q = 1 + 1/2^{\Omega(b)}$, thus matching the cache-aware bound, if the following two conditions hold: (a) $b$ is a power of 2; and (b) every block starts at a memory address divisible by $b$. Both conditions hold on a real machine, although they are not stated in the cache-oblivious model. Interestingly, we also show that neither condition is dispensable: if either of them is removed, the best obtainable bound is $t_q = 1 + O(\alpha/b)$, which is exactly what linear probing achieves.
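The gap between probes and block reads described in the abstract can be illustrated with a small simulation. The following is only a sketch under stated assumptions, not the paper's construction: the function names are hypothetical, a "truly random" hash function is simulated by assigning each key an independent uniform slot, and blocks are assumed to be aligned so that slot `i` lives in block `i // b`. It counts how many distinct blocks a successful linear-probing search touches for a given block size `b`.

```python
import random


def build_linear_probing_table(keys, r, hash_fn):
    """Insert keys into a table of r slots using plain linear probing."""
    table = [None] * r
    for key in keys:
        pos = hash_fn(key)
        while table[pos] is not None:  # walk right until a free slot is found
            pos = (pos + 1) % r
        table[pos] = key
    return table


def blocks_touched(table, key, hash_fn, b):
    """Count distinct size-b blocks read while searching for a stored key.

    Assumption: blocks are aligned, so slot i belongs to block i // b.
    """
    r = len(table)
    pos = hash_fn(key)
    seen = set()
    while True:
        seen.add(pos // b)
        if table[pos] == key:
            return len(seen)
        if table[pos] is None:
            raise KeyError(key)
        pos = (pos + 1) % r


def average_search_cost(n, r, b, seed=0):
    """Average number of blocks touched per successful search."""
    rng = random.Random(seed)
    keys = rng.sample(range(10**9), n)
    # Simulated truly random hash function: one independent uniform slot per key.
    slot = {k: rng.randrange(r) for k in keys}
    table = build_linear_probing_table(keys, r, slot.__getitem__)
    return sum(blocks_touched(table, k, slot.__getitem__, b) for k in keys) / n


if __name__ == "__main__":
    # The same table layout is measured under several block sizes.
    for b in (1, 4, 16, 64):
        print(b, average_search_cost(n=500, r=1024, b=b))
```

Because the table layout does not depend on `b`, the same structure can be evaluated under every block size, which is the sense in which plain linear probing is cache-oblivious; the simulation makes no attempt to reproduce the constants in the paper's bounds.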

### Citations

8523 | Introduction to Algorithms
- Cormen, Leiserson, et al.
- 2001
Citation Context ...t also in a memory hierarchy that consists of any number of levels, where each level has a different capacity m and block size b. Among them, the most successful approach is the cache-oblivious model [10] due to its elegance and simplicity. This model actually only features two levels of memory: a data structure is laid out in external memory and accessed in exactly the same way as in the standard two...

1870 | Randomized Algorithms
- Motwani, Raghavan
- 1995
Citation Context ...search on $I_u$ where the average is taken over all keys in $I_u$. We will only consider deterministic hash tables; the lower bounds also hold for randomized hash tables by invoking Yao's minimax principle [19] because we are using a random input. The hash table can employ any hash functions to distribute the input. We assume $u > n^3$, then with probability $1 - O(1/n)$ all keys in $I_u$ are distinct by the birt...

667 | Universal classes of hash functions
- Carter, Wegman
- 1979
Citation Context ... assume a truly random function, as we do in this paper. Since such a function requires a large space to describe, there are also a lot of works on hashing using explicit and efficient hash functions [6, 20]. Meanwhile, although most works focus on the expected search cost, there are also hashing schemes that guarantee good worst-case search costs [9, 21]. Hashing has been well studied in the external me...

537 | The input/output complexity of sorting and related problems
- Aggarwal, Vitter
- 1988
Citation Context ...sion of chaining also achieves the same bound. These results basically have explained why hash tables work so well in external memory. These classical analyses assumed a simple two-level memory model [2], where the (sufficiently large) external memory is partitioned into blocks of size b and are fetched into the internal memory of size m as they are probed. Here both sizes are measured in terms of (l...

329 | New hash functions and their use in authentication and set equality
- Wegman, Carter
- 1981
Citation Context .... We assume that all families used in this paper are uniform so we do not distinguish $\bar{\alpha}$ from $\alpha$. For non-uniform families, all results in this paper hold if we substitute $\alpha$ with $\bar{\alpha}$. Carter and Wegman [26] exhibited the following family of k-wise independent hash functions where U = [p], R = [r], and p is a prime: $H_k = \{\, h : h(x) = ((a_{k-1}x^{k-1} + \cdots + a_0) \bmod p) \bmod r,\ a_j \in [p] \,\}$. This could be ea...

253 | Sorting and Searching, volume 3 of The Art of Computer Programming
- Knuth
- 1998
Citation Context ...ce due to its sequential access pattern, provided that the load factor α = n/r is not too close to 1. The mathematical analysis of hashing is usually considered as the birth of analysis of algorithms [14], and it is still attracting a lot of attention nowadays. Most analyses on hashing assume h to be a truly random function, i.e., each h(x) is independently uniformly distributed on [r]. It has been ob...

217 | Storing a sparse table with O(1) worst case access time
- Fredman, Komlós, et al.
- 1984
Citation Context ...shing using explicit and efficient hash functions [6, 20]. Meanwhile, although most works focus on the expected search cost, there are also hashing schemes that guarantee good worst-case search costs [9, 21]. Hashing has been well studied in the external memory model. The $1 + 1/2^{\Omega(b)}$ search cost holds as long as the load factor α [footnote 2] Chaining would perform worse cache-obliviously because the list associat...

216 | Introduction to analytic and probabilistic number theory, Cambridge
- Tenenbaum
- 1995
Citation Context ...$I_u)] \ge \frac{\alpha}{17b} n$, proving Theorem 4, since otherwise we would have $\sum_{b \in P} E[\rho(b, I_u)] \le \frac{\alpha}{17} n \sum_{b \in P} \frac{1}{b} \le \frac{\alpha}{17} n (\log\log r + O(1))$. Here we use the following approximation for the prime harmonic series [22]: $\sum_{b \in P_r} \frac{1}{b} = \log\log r + O(1)$. Thus $\sum_{b \in P} E[\rho(b, I_u)] \le \frac{\alpha}{17} n (\log\log r + O(1))$, contradicting Lemma 3. Proof of Lemma 3. In the rest of this subsection we prove Lemma 3. We need the following fact...

133 | Concurrent cache-oblivious B-trees
- Bender, Fineman, et al.
- 2005
Citation Context ...eceived a lot of attention, and most fundamental problems have been solved. For example, cache-oblivious sorting takes $O(\frac{n}{b} \log_{m/b} \frac{n}{b})$ I/Os [10], and a cache-oblivious B-tree takes $O(\log_b n)$ I/Os for a search [4]. Please see the survey [7] for other results. However, hashing has not been considered in the cache-oblivious model so far. In most cases the cache-oblivious bounds match their cache-awa...

123 | Cuckoo hashing
- Pagh, Rodler
- 2004
Citation Context ...hing using explicit and efficient hash functions [6, 20]. Meanwhile, although most work focuses on the expected search cost, there are also hashing schemes that guarantee good worst-case search costs [9, 21]. Hashing has been well studied in the external memory model. The $1 + 1/2^{\Omega(b)}$ search cost holds as long as the load factor α is bounded away from 1 [14], and there are various techniques in the databa...

102 | Chernoff-Hoeffding Bounds for Applications with Limited Independence
- Schmidt, Siegel, et al.
- 1994
Citation Context ... 2 j−1 That h is a truly random hash function is an unrealistic assumption. To analyze blocked probing with limited independence, we need the following variant of the Chernoff bound by Schmidt et al. [23]: Lemma 2 ([23]) Let $X_1, \ldots, X_n$ be a sequence of k-wise independent random variables that satisfy $|X_i - E[X_i]| \le 1$. Let $X = \sum_{i=1}^{n} X_i$ with $E[X] = \mu$, and let $\delta^2[X]$ denote the variance of X, so th...

51 | Extendible hashing-A fast access method for dynamic files
- Fagin, Nievergelt, et al.
- 1979
Citation Context ...position is not laid out consecutively. ... is bounded away from 1 [14], and there are various techniques in the database literature to keep the load factor in a desired range, such as extensible hashing [8] or linear hashing [17]. Jensen and Pagh [13] designed a hashing scheme that has $\alpha = 1 - O(1/\sqrt{b})$ while supporting searches with $1 + O(1/\sqrt{b})$ I/Os. In all these hashing schemes a small fraction of th...

49 | Dynamic hash tables
- Larson
- 1988
Citation Context ...alysis. However, the above solution has a poor space utilization. A number of methods have been proposed that maintain a higher load factor, and also allow the rehashing to be done incrementally; see [15] for an overview. To our best knowledge these methods are all cache-aware; however, we now describe how they can be made cache-oblivious while maintaining the load factor of α = 1 − Θ(ε). Suppose ini...

45 | Linear hashing: A new tool for file and table addressing
- Litwin
- 1980
Citation Context ...ut consecutively. ... is bounded away from 1 [14], and there are various techniques in the database literature to keep the load factor in a desired range, such as extensible hashing [8] or linear hashing [17]. Jensen and Pagh [13] designed a hashing scheme that has $\alpha = 1 - O(1/\sqrt{b})$ while supporting searches with $1 + O(1/\sqrt{b})$ I/Os. In all these hashing schemes a small fraction of the keys still need two o...

39 | On the limits of cache-obliviousness
- Brodal, Fagerberg
- 2003
Citation Context ...s always be an interesting problem to see for what problems we have a separation between the cache-oblivious model and the cache-aware model. Until today there have been only three separation results [1, 3, 5]. Our lower bound adds to that list, furthering our understanding of cache-obliviousness. 2. ANALYSIS OF LINEAR PROBING IN THE CACHE-OBLIVIOUS MODEL Linear probing while ignoring the blocking is natur...

34 | Cache-oblivious algorithms and data structures
- Demaine
- 2002
Citation Context ... and most fundamental problems have been solved. For example, cache-oblivious sorting takes $O(\frac{n}{b} \log_{m/b} \frac{n}{b})$ I/Os [10], and a cache-oblivious B-tree takes $O(\log_b n)$ I/Os for a search [4]. Please see the survey [7] for other results. However, hashing has not been considered in the cache-oblivious model so far. In most cases the cache-oblivious bounds match their cache-aware versions, and it has alw...

33 | Why Simple Hash Functions Work: Exploiting the Entropy in a Data Stream
- Mitzenmacher, Vadhan
- 2008
Citation Context ...ed on [r]. It has been observed that these analyses match what actually happens on real-world data surprisingly well, even with some very simple hash functions. Recently, some theoretical explanation [18] has also been put forward justifying such an assumption. We will also adopt the truly random hash function assumption in this paper. Under such an assumption, Knuth [14] showed that the expected aver...

17 | The cost of cache-oblivious searching
- Bender, Brodal, et al.
- 2003
Citation Context ...s always be an interesting problem to see for what problems we have a separation between the cache-oblivious model and the cache-aware model. Until today there have been only three separation results [1, 3, 5]. Our lower bound adds to that list, furthering our understanding of cache-obliviousness. 2. ANALYSIS OF LINEAR PROBING IN THE CACHE-OBLIVIOUS MODEL Linear probing while ignoring the blocking is natur...

14 | Linear probing with constant independence
- Pagh, Pagh, et al.
- 2007
Citation Context ...ε keys need two or more I/Os). Next, we explore other collision resolution strategies to see if they work better in the cache-oblivious model. In Section 3, we show that the blocked probing algorithm [20] achieves the desired $1 + 1/2^{\Omega(b)}$ search cost, but under the following two conditions: (a) b is a power of 2; and (b) every block starts at a memory address divisible by b. Neither of these condition...

6 | Linear hashing with separators—a dynamic hashing scheme achieving one-access retrieval
- Larson
- 1988
Citation Context ...√b) I/Os. In all these hashing schemes a small fraction of the keys still need two or more disk accesses to retrieve. Meanwhile there are also schemes that guarantee a single I/O to retrieve any key [11, 16], but they all need the internal memory to have size m = Θ(n/b). Note that on the other hand, all the other hashing schemes achieving $t_q = 1 + \varepsilon$ only need the internal memory to store a constant number ...

6 | Dynamic external hashing: The limit of buffering
- Wei, Yi, et al.
- 2009
Citation Context ...ed update cost of 1 + O(1/b) I/Os, but can we improve it to o(1) I/Os, possibly by buffering the updates in internal memory and write them to external memory in batches? A recent result by Wei et al. [24] has eliminated this possibility by proving a $1 - 1/2^{\Omega(b)}$ lower bound (in the cache-aware model) on the amortized update cost if the successful query cost is to be $t_q = 1 + 1/2^{\Omega(b)}$. Even more recen...

5 | External hashing with limited internal storage
- Gonnet, Larson
- 1988
Citation Context ...√b) I/Os. In all these hashing schemes a small fraction of the keys still need two or more disk accesses to retrieve. Meanwhile there are also schemes that guarantee a single I/O to retrieve any key [11, 16], but they all need the internal memory to have size m = Θ(n/b). Note that on the other hand, all the other hashing schemes achieving $t_q = 1 + \varepsilon$ only need the internal memory to store a constant number ...

5 | Cache-oblivious databases: Limitations and opportunities
- He, Luo
- 2008
Citation Context ...e-specific tuning. This is particularly important in autonomous databases, and is in fact the main motivation of the recent efforts in bringing cache-oblivious techniques to databases, such as EaseDB [12]. Note that the external versions of linear probing and chaining mentioned above only work for a single b, so they are not cache-oblivious. In this paper we investigate whether it is possible to lay o...

4 | Cache-Oblivious Range Reporting with Optimal Queries Requires Superlinear Space
- Afshani, Hamilton, et al.
Citation Context ...s always be an interesting problem to see for what problems we have a separation between the cache-oblivious model and the cache-aware model. Until today there have been only three separation results [1, 3, 5]. Our lower bound adds to that list, furthering our understanding of cache-obliviousness. 2. ANALYSIS OF LINEAR PROBING IN THE CACHE-OBLIVIOUS MODEL Linear probing while ignoring the blocking is natur...

4 | Optimality in external memory hashing
- Jensen, Pagh
- 2008
Citation Context ...ounded away from 1 [14], and there are various techniques in the database literature to keep the load factor in a desired range, such as extensible hashing [8] or linear hashing [17]. Jensen and Pagh [13] designed a hashing scheme that has $\alpha = 1 - O(1/\sqrt{b})$ while supporting searches with $1 + O(1/\sqrt{b})$ I/Os. In all these hashing schemes a small fraction of the keys still need two or more disk accesses t...

4 | The limits of buffering: A tight lower bound for dynamic membership in the external memory model
- Verbin, Zhang
- 2010
Citation Context ...ity by proving a $1 - 1/2^{\Omega(b)}$ lower bound (in the cache-aware model) on the amortized update cost if the successful query cost is to be $t_q = 1 + 1/2^{\Omega(b)}$. Even more recently, Verbin and Zhang proved [23] that if $t_q$ is $o(\log_{b \log n} n)$ for both successful and unsuccessful queries, then the update cost has to be Ω(1). These results show that for external hashing, buffering is essentially useless and mo...

4 | Design and Analysis of Hashing Algorithms with Cache Effects
- Qi, Martel
- 1998
Citation Context ...st is 1 + O(α/b) I/Os even assuming a truly random hash function. In fact, we also derive the constant in the big-Oh, which depends on $C_n$ and $C'_n$. This phenomenon was first studied by Qi and Martel [22]; however, we will give a detailed proof in Section 3 for completeness. This is worse than its cache-aware version that is particularly tuned to work with a single b. The gap is in some sense exponent...

3 | The effect of table expansion on the program complexity of perfect hash functions
- Mairson
- 1992
Citation Context ...al in the cache-aware model (or in the cache-oblivious model with the two more conditions). It is known that we can achieve $t_q = 1$ (namely, perfect hashing) with an internal memory of size m = Θ(n/b) [11, 16, 18]. On the other hand, external linear probing and blocked probing achieve $t_q = 1 + 1/2^{\Omega(b)}$ with only m = Θ(b). There seems to be a tradeoff between m and $t_q$ but this tradeoff is yet to be understood. ...

1 | Linear Probing with 5-wise Independence
- Pagh, Pagh, et al.
- 2011
Citation Context ...on strategy for hash tables, can be easily made cache-oblivious but it only achieves $t_q = 1 + \Theta(\alpha/b)$ even if a truly random hash function is used. Then we demonstrate that the block probing algorithm [20] achieves $t_q = 1 + 1/2^{\Omega(b)}$, thus matching the cache-aware bound, if the following two conditions hold: (a) b is a power of 2; and (b) every block starts at a memory address divisible by b. Note that...
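The k-wise independent family quoted in the citation context for Wegman and Carter above, h(x) = ((a_{k−1} x^{k−1} + · · · + a_0) mod p) mod r with each a_j drawn from [p], can be sketched as follows. This is a minimal illustration rather than code from the paper: the function name `make_kwise_hash` and its parameters are hypothetical, and the caller is assumed to supply a prime `p` at least as large as the key universe.

```python
import random


def make_kwise_hash(k, p, r, rng=random):
    """Draw one function from the polynomial family
    h(x) = ((a_{k-1} x^{k-1} + ... + a_0) mod p) mod r,
    with each coefficient a_j chosen uniformly from [p].

    Assumption: p is prime and keys x lie in [p]; this is not checked here.
    """
    coeffs = [rng.randrange(p) for _ in range(k)]  # a_0, ..., a_{k-1}

    def h(x):
        acc = 0
        for a in reversed(coeffs):  # Horner's rule, highest degree first
            acc = (acc * x + a) % p
        return acc % r

    return h


if __name__ == "__main__":
    # 2**31 - 1 is a Mersenne prime; k = 5 gives a degree-4 polynomial.
    h = make_kwise_hash(k=5, p=2**31 - 1, r=1024)
    print([h(x) for x in range(8)])
```

Evaluating the polynomial with Horner's rule keeps every intermediate value below `p * p`, so the cost per hash is k multiplications and reductions.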