Why simple hash functions work: Exploiting the entropy in a data stream
In Proceedings of the 19th Annual ACM-SIAM Symposium on Discrete Algorithms, 2008
Cited by 50 (9 self)
Hashing is fundamental to many algorithms and data structures widely used in practice. For theoretical analysis of hashing, there have been two main approaches. First, one can assume that the hash function is truly random, mapping each data item independently and uniformly to the range. This idealized model is unrealistic because a truly random hash function requires an exponential number of bits to describe. Alternatively, one can provide rigorous bounds on performance when explicit families of hash functions are used, such as 2-universal or O(1)-wise independent families. For such families, performance guarantees are often noticeably weaker than for ideal hashing. In practice, however, it is commonly observed that weak hash functions, including 2-universal hash functions, perform as predicted by the idealized analysis for truly random hash functions. In this paper, we try to explain this phenomenon. We demonstrate that the strong performance of universal hash functions in practice can arise naturally from a combination of the randomness of the hash function and the data. Specifically, following the large body of literature on random sources and randomness extraction, we model the data as coming from a “block source,” whereby
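To make the notion of a 2-universal family concrete, here is a minimal Python sketch of one classic example, the Carter-Wegman family h(x) = ((a*x + b) mod p) mod m. The abstract does not say which families the paper analyzes, so this construction, the prime `P`, and the names `make_universal_hash` and `collision_rate` are illustrative assumptions.

```python
import random

# Assumption: keys are non-negative integers smaller than P.
P = (1 << 61) - 1  # the Mersenne prime 2^61 - 1


def make_universal_hash(m, rng=random):
    """Draw one function h(x) = ((a*x + b) mod P) mod m from the family."""
    a = rng.randrange(1, P)  # a must be nonzero
    b = rng.randrange(0, P)

    def h(x):
        return ((a * x + b) % P) % m

    return h


def collision_rate(x, y, m, trials=20000, seed=0):
    """Estimate Pr[h(x) == h(y)] over random draws of h (hypothetical helper)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        h = make_universal_hash(m, rng)
        if h(x) == h(y):
            hits += 1
    return hits / trials
```

For two fixed distinct keys, the estimated collision rate should come out close to 1/m, which is the only guarantee a 2-universal family gives. Anything stronger has to come from elsewhere, for example from entropy in the data as the abstract suggests.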
Strongly history-independent hashing with applications
In Proceedings of the 48th Annual IEEE Symposium on Foundations of Computer Science, 2007
Cited by 20 (5 self)
We present a strongly history independent (SHI) hash table that supports search in O(1) worst-case time, and insert and delete in O(1) expected time using O(n) data space. This matches the bounds for dynamic perfect hashing, and improves on the best previous results by Naor and Teague on history independent hashing, which were either weakly history independent, or only supported insertion and search (no delete) each in O(1) expected time. The results can be used to construct many other SHI data structures. We show straightforward constructions for SHI ordered dictionaries: for n keys from {1, ..., n^k} searches take O(log log n) worst-case time and updates (insertions and deletions) O(log log n) expected time, and for keys in the comparison model searches take O(log n) worst-case time and updates O(log n) expected time. We also describe a SHI data structure for the order-maintenance problem. It supports comparisons in O(1) worst-case time, and updates in O(1) expected time. All structures use O(n) data space.
String hashing for linear probing
In Proc. 20th SODA, 2009
Cited by 13 (4 self)
Linear probing is one of the most popular implementations of dynamic hash tables storing all keys in a single array. When we get a key, we first hash it to a location. Next we probe consecutive locations until the key or an empty location is found. At STOC’07, Pagh et al. presented data sets where the standard implementation of 2-universal hashing leads to an expected number of Ω(log n) probes. They also showed that with 5-universal hashing, the expected number of probes is constant. Unfortunately, we do not have 5-universal hashing for, say, variable-length strings. When we want to do such complex hashing from a complex domain, the generic standard solution is that we first do collision-free hashing (w.h.p.) into a simpler intermediate domain, and second do the complicated hash function on this intermediate domain. Our contribution is that for an expected constant number of linear probes, it suffices that each key has O(1) expected collisions with the first hash function, as long as the second hash function is 5-universal. This means that the intermediate domain can be n times smaller, and such a smaller intermediate domain typically means that the overall hash function can be made simpler and at least twice as fast. The same doubling of hashing speed for O(1) expected probes follows for most domains bigger than 32-bit integers, e.g., 64-bit integers and fixed-length strings. In addition, we study how the overhead from linear probing diminishes as the array gets larger, and what happens if strings are stored directly as intervals of the array. These cases were not considered by Pagh et al.
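The probing procedure described at the start of this abstract can be sketched in a few lines of Python. This is an illustrative sketch only: the class name `LinearProbingTable`, the pluggable `hash_fn` parameter, and the choice to keep one slot always empty are assumptions, and the paper's two-stage string hashing is not reproduced here.

```python
class LinearProbingTable:
    """Minimal linear probing set; None marks an empty slot, so None cannot be a key."""

    def __init__(self, capacity, hash_fn=hash):
        self.slots = [None] * capacity
        self.hash_fn = hash_fn  # assumption: any function from keys to integers
        self.size = 0

    def _probe(self, key):
        """Return (index, probes) for the slot holding key or the first empty slot."""
        n = len(self.slots)
        i = self.hash_fn(key) % n
        probes = 1
        # One slot is always kept empty, so this loop terminates.
        while self.slots[i] is not None and self.slots[i] != key:
            i = (i + 1) % n  # move to the next consecutive location
            probes += 1
        return i, probes

    def insert(self, key):
        """Insert key and return the number of locations probed."""
        i, probes = self._probe(key)
        if self.slots[i] is None:
            if self.size + 1 >= len(self.slots):
                raise RuntimeError("table full")
            self.slots[i] = key
            self.size += 1
        return probes

    def contains(self, key):
        i, _ = self._probe(key)
        return self.slots[i] is not None
```

Passing a deliberately bad `hash_fn`, such as one that maps every key to 0, makes the probe counts grow with each insertion. That is the clustering behavior which the probe bounds in the abstract are about.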
On the k-independence required by linear probing and min-wise independence
In Proc. 37th International Colloquium on Automata, Languages and Programming (ICALP), 2010
Cited by 13 (4 self)
We show that linear probing requires 5-independent hash functions for expected constant-time performance, matching an upper bound of [Pagh et al. STOC’07]. For (1 + ε)-approximate min-wise independence, we show that Ω(lg 1/ε)-independent hash functions are required, matching an upper bound of [Indyk, SODA’99]. We also show that the multiply-shift scheme of Dietzfelbinger, most commonly used in practice, fails badly in both applications.
On risks of using cuckoo hashing with simple universal hash classes
In Proc. 20th ACM/SIAM Symposium on Discrete Algorithms (SODA), 2009
Cited by 7 (0 self)
Cuckoo hashing, introduced by Pagh and Rodler [10], is a dynamic dictionary data structure for storing a set S of n keys from a universe U, with constant lookup time and amortized expected constant insertion time. For the analysis, space (2+ε)n and Ω(log n)-wise independence of the hash functions is sufficient. In experiments mentioned in [10], several weaker hash classes worked well; however, a certain simple multiplicative hash family worked badly. In this paper, we prove that the failure probability is high when cuckoo hashing is run with the multiplicative class or with the very common class of linear hash functions over a prime field, even if space 4n is provided. The key set S is fully random, but it must be relatively dense in the universe U of all keys (like |S| ≥ |U|^(11/12)). The bad behavior and the fact that this effect depends on the density of S in U can also be observed in experiments. The result transfers to larger universes if the keys are chosen from a suitable smaller domain. Viewed from a different perspective, our result illustrates that care must be taken when applying a recent result of Mitzenmacher and Vadhan ([12], SODA 2008) proving good behavior of universal hash classes in combination with key sets that have some entropy. Their result is applicable to cuckoo hashing. A technical hypothesis in [12], namely the assumption that either the “collision probability” or the “maximum probability” is small, translates into the condition that |S| is relatively small in comparison to |U|. Our result shows that the result from [12] on 2-universal classes ceases to hold if |S|/|U| is not small enough, even for very common 2-universal hash classes and fully random key sets.
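For readers unfamiliar with the data structure, the following Python sketch shows a basic two-table cuckoo hash set using linear hash functions over a prime field, the combination this abstract warns about for dense key sets. The class name `CuckooTable`, the prime `P`, the `max_kicks` cutoff, and the rehash-on-failure policy are all illustrative assumptions rather than details from the paper.

```python
import random

P = (1 << 31) - 1  # prime; assumption: keys are integers in [0, P)


class CuckooTable:
    """Two tables of `size` slots each; every key lives in one of its two cells."""

    def __init__(self, size, max_kicks=64, seed=None):
        # Assumption: far fewer than 2 * size keys are stored, else rehashing never ends.
        self.size = size
        self.max_kicks = max_kicks
        self.rng = random.Random(seed)
        self.tables = [[None] * size, [None] * size]
        self._new_hashes()

    def _new_hashes(self):
        # One linear function x -> ((a*x + b) mod P) mod size per table.
        self.coeffs = [(self.rng.randrange(1, P), self.rng.randrange(P))
                       for _ in range(2)]

    def _h(self, t, key):
        a, b = self.coeffs[t]
        return ((a * key + b) % P) % self.size

    def contains(self, key):
        # Constant lookup time: exactly two cells are inspected.
        return any(self.tables[t][self._h(t, key)] == key for t in (0, 1))

    def insert(self, key):
        if self.contains(key):
            return
        t = 0
        for _ in range(self.max_kicks):
            i = self._h(t, key)
            # Place the key and pick up whatever occupied the cell.
            key, self.tables[t][i] = self.tables[t][i], key
            if key is None:
                return
            t = 1 - t  # the evicted key tries its cell in the other table
        self._rehash(key)

    def _rehash(self, pending):
        # Insertion failed: draw new hash functions and reinsert everything.
        keys = [k for tab in self.tables for k in tab if k is not None]
        keys.append(pending)
        self.tables = [[None] * self.size, [None] * self.size]
        self._new_hashes()
        for k in keys:
            self.insert(k)
```

The failure event studied in the abstract corresponds to reaching `_rehash`. Counting how often that happens for dense versus sparse key sets would be one way to observe the effect experimentally.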
Tabulation Based 5-Universal Hashing and Linear Probing
Cited by 7 (4 self)
Previously [SODA’04] we devised the fastest known algorithm for 4-universal hashing. The hashing was based on small precomputed 4-universal tables. This led to a five-fold improvement in speed over direct methods based on degree 3 polynomials. In this paper, we show that if the precomputed tables are made 5-universal, then the hash value becomes 5-universal without any other change to the computation. Relatively, this leads to even bigger gains since the direct methods for 5-universal hashing use degree 4 polynomials. Experimentally, we find that our method can gain up to an order of magnitude in speed over direct 5-universal hashing. Some of the most popular randomized algorithms have been proved to have the desired expected running time using
6.897: Advanced data structures (Spring 2005), Lecture 3, February 8, 2005
Cited by 4 (0 self)
Recall from last lecture that we are looking at the document-retrieval problem. The problem can be stated as follows: Given a set of texts T1, T2, ..., Tk and a pattern P, determine the distinct texts in which the pattern occurs. In particular, we are allowed to preprocess the texts in order to be able to answer the query faster. Our preprocessing choice was the use of a single suffix tree, in which all the suffixes of all the texts appear, each suffix ending with a distinct symbol that determines the text in which the suffix appears. In order to answer the query we reduced the problem to range-min queries, which in turn was reduced to the least common ancestor (LCA) problem on the Cartesian tree of an array of numbers. The Cartesian tree is constructed recursively by setting its root to be the minimum element of the array and recursively constructing its two subtrees using the left and right partitions of the array. The range-min query of an interval [i, j] is then equivalent to finding the LCA of the two nodes of the Cartesian tree that correspond to i and j. In this lecture we continue to see how we can solve the LCA problem on any static tree. This will involve a reduction of the LCA problem back to the range-min query problem (!) and then a
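The reduction from range-min queries to LCA on the Cartesian tree can be illustrated with a short Python sketch. It follows the recursive construction described above, but the function names are assumptions and the LCA routine is a naive parent-pointer walk, not the constant-time method the lecture goes on to develop.

```python
def cartesian_tree_parents(a):
    """Return parent[i] for every index i of a; the root has parent -1."""
    parent = [-1] * len(a)

    def build(lo, hi, par):
        if lo > hi:
            return
        # The root of a[lo..hi] is its (leftmost) minimum element.
        root = min(range(lo, hi + 1), key=lambda k: a[k])
        parent[root] = par
        build(lo, root - 1, root)  # left partition becomes the left subtree
        build(root + 1, hi, root)  # right partition becomes the right subtree

    build(0, len(a) - 1, -1)
    return parent


def lca(parent, i, j):
    """Naive LCA: collect the ancestors of i, then climb from j until one is hit."""
    ancestors = set()
    while i != -1:
        ancestors.add(i)
        i = parent[i]
    while j not in ancestors:
        j = parent[j]
    return j


def range_min_index(parent, i, j):
    """Index of a minimum of a[i..j], answered as an LCA query on the tree."""
    return lca(parent, i, j)
```

The recursive build takes quadratic time in the worst case and the LCA walk takes time proportional to the tree height, so this sketch only demonstrates the equivalence, not the efficient solution.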
Bottom-k and priority sampling, set similarity and subset sums with minimal independence
In Proc. 45th STOC, 2013
Uniquely Represented Data Structures for Computational Geometry
2008
Cited by 4 (3 self)
We present new techniques for the construction of uniquely represented data structures in a RAM, and use them to construct efficient uniquely represented data structures for orthogonal range queries, line intersection tests, point location, and 2-D dynamic convex hull. Uniquely represented data structures represent each logical state with a unique machine state. Such data structures are strongly history-independent. This eliminates the possibility of privacy violations caused by the leakage of information about the historical use of the data structure. Uniquely represented data structures may also simplify the debugging of complex parallel computations, by ensuring that two runs of a program that reach the same logical state reach the same physical state, even if various parallel processes executed in different orders during the two runs.
Cache-Oblivious Hashing
2010
Cited by 3 (0 self)
The hash table, especially its external memory version, is one of the most important index structures in large databases. Assuming a truly random hash function, it is known that in a standard external hash table with block size b, searching for a particular key only takes expected average tq = 1 + 1/2^Ω(b) disk accesses for any load factor α bounded away from 1. However, such near-perfect performance is achieved only when b is known and the hash table is particularly tuned for working with such a blocking. In this paper we study if it is possible to build a cache-oblivious hash table that works well with any blocking. Such a hash table will automatically perform well across all levels of the memory hierarchy and does not need any hardware-specific tuning, an important feature in autonomous databases. We first show that linear probing, a classical collision resolution strategy for hash tables, can be easily made cache-oblivious but it only achieves tq = 1 + O(α/b). Then we demonstrate that it is possible to obtain tq = 1 + 1/2^Ω(b), thus matching the cache-aware bound, if the following two conditions hold: (a) b is a power of 2; and (b) every block starts at a memory address divisible by b. Both conditions hold on a real machine, although they are not stated in the cache-oblivious model. Interestingly, we also show that neither condition is dispensable: if either of them is removed, the best obtainable bound is tq = 1 + O(α/b), which is exactly what linear probing achieves.