Results 1  10
of
12
Why simple hash functions work: Exploiting the entropy in a data stream
 In Proceedings of the 19th Annual ACMSIAM Symposium on Discrete Algorithms
, 2008
"... Hashing is fundamental to many algorithms and data structures widely used in practice. For theoretical analysis of hashing, there have been two main approaches. First, one can assume that the hash function is truly random, mapping each data item independently and uniformly to the range. This idealiz ..."
Abstract

Cited by 33 (6 self)
 Add to MetaCart
Hashing is fundamental to many algorithms and data structures widely used in practice. For theoretical analysis of hashing, there have been two main approaches. First, one can assume that the hash function is truly random, mapping each data item independently and uniformly to the range. This idealized model is unrealistic because a truly random hash function requires an exponential number of bits to describe. Alternatively, one can provide rigorous bounds on performance when explicit families of hash functions are used, such as 2universal or O(1)wise independent families. For such families, performance guarantees are often noticeably weaker than for ideal hashing. In practice, however, it is commonly observed that weak hash functions, including 2universal hash functions, perform as predicted by the idealized analysis for truly random hash functions. In this paper, we try to explain this phenomenon. We demonstrate that the strong performance of universal hash functions in practice can arise naturally from a combination of the randomness of the hash function and the data. Specifically, following the large body of literature on random sources and randomness extraction, we model the data as coming from a “block source, ” whereby
Strongly historyindependent hashing with applications
 In Proceedings of the 48th Annual IEEE Symposium on Foundations of Computer Science
, 2007
"... We present a strongly history independent (SHI) hash table that supports search in O(1) worstcase time, and insert and delete in O(1) expected time using O(n) data space. This matches the bounds for dynamic perfect hashing, and improves on the best previous results by Naor and Teague on history ind ..."
Abstract

Cited by 11 (3 self)
 Add to MetaCart
We present a strongly history independent (SHI) hash table that supports search in O(1) worstcase time, and insert and delete in O(1) expected time using O(n) data space. This matches the bounds for dynamic perfect hashing, and improves on the best previous results by Naor and Teague on history independent hashing, which were either weakly history independent, or only supported insertion and search (no delete) each in O(1) expected time. The results can be used to construct many other SHI data structures. We show straightforward constructions for SHI ordered dictionaries: for n keys from {1,..., n k} searches take O(log log n) worstcase time and updates (insertions and deletions) O(log log n) expected time, and for keys in the comparison model searches take O(log n) worstcase time and updates O(log n) expected time. We also describe a SHI data structure for the ordermaintenance problem. It supports comparisons in O(1) worstcase time, and updates in O(1) expected time. All structures use O(n) data space. 1
String hashing for linear probing
 In Proc. 20th SODA
, 2009
"... Linear probing is one of the most popular implementations of dynamic hash tables storing all keys in a single array. When we get a key, we first hash it to a location. Next we probe consecutive locations until the key or an empty location is found. At STOC’07, Pagh et al. presented data sets where t ..."
Abstract

Cited by 8 (3 self)
 Add to MetaCart
Linear probing is one of the most popular implementations of dynamic hash tables storing all keys in a single array. When we get a key, we first hash it to a location. Next we probe consecutive locations until the key or an empty location is found. At STOC’07, Pagh et al. presented data sets where the standard implementation of 2universal hashing leads to an expected number of Ω(log n) probes. They also showed that with 5universal hashing, the expected number of probes is constant. Unfortunately, we do not have 5universal hashing for, say, variable length strings. When we want to do such complex hashing from a complex domain, the generic standard solution is that we first do collision free hashing (w.h.p.) into a simpler intermediate domain, and second do the complicated hash function on this intermediate domain. Our contribution is that for an expected constant number of linear probes, it is suffices that each key has O(1) expected collisions with the first hash function, as long as the second hash function is 5universal. This means that the intermediate domain can be n times smaller, and such a smaller intermediate domain typically means that the overall hash function can be made simpler and at least twice as fast. The same doubling of hashing speed for O(1) expected probes follows for most domains bigger than 32bit integers, e.g., 64bit integers and fixed length strings. In addition, we study how the overhead from linear probing diminishes as the array gets larger, and what happens if strings are stored directly as intervals of the array. These cases were not considered by Pagh et al. 1
Tabulation Based 5Universal Hashing and Linear Probing
"... Previously [SODA’04] we devised the fastest known algorithm for 4universal hashing. The hashing was based on small precomputed4universal tables. This led to a fivefold improvement in speed over direct methods based on degree 3 polynomials. In this paper, we show that if the precomputed tables a ..."
Abstract

Cited by 3 (3 self)
 Add to MetaCart
Previously [SODA’04] we devised the fastest known algorithm for 4universal hashing. The hashing was based on small precomputed4universal tables. This led to a fivefold improvement in speed over direct methods based on degree 3 polynomials. In this paper, we show that if the precomputed tables are made 5universal, then the hash value becomes 5universal without any other change to the computation. Relatively this leads to even bigger gains since the direct methods for 5universal hashing use degree 4 polynomials. Experimentally, we find that our method can gain up to an order of magnitude in speed over direct 5universal hashing. Some of the most popular randomized algorithms have been proved to have the desired expected running time using
Uniquely Represented Data Structures for Computational Geometry
, 2008
"... We present new techniques for the construction of uniquely represented data structures in a RAM, and use them to construct efficient uniquely represented data structures for orthogonal range queries, line intersection tests, point location, and 2D dynamic convex hull. Uniquely represented data stru ..."
Abstract

Cited by 2 (2 self)
 Add to MetaCart
We present new techniques for the construction of uniquely represented data structures in a RAM, and use them to construct efficient uniquely represented data structures for orthogonal range queries, line intersection tests, point location, and 2D dynamic convex hull. Uniquely represented data structures represent each logical state with a unique machine state. Such data structures are strongly historyindependent. This eliminates the possibility of privacy violations caused by the leakage of information about the historical use of the data structure. Uniquely represented data structures may also simplify the debugging of complex parallel computations, by ensuring that two runs of a program that reach the same logical state reach the same physical state, even if various parallel processes executed in different orders during the two runs. 1
CacheOblivious Hashing
, 2010
"... The hash table, especially its external memory version, is one of the most important index structures in large databases. Assuming a truly random hash function, it is known that in a standard external hash table with block size b, searching for a particular key only takes expected average tq = 1+1/2 ..."
Abstract

Cited by 1 (0 self)
 Add to MetaCart
The hash table, especially its external memory version, is one of the most important index structures in large databases. Assuming a truly random hash function, it is known that in a standard external hash table with block size b, searching for a particular key only takes expected average tq = 1+1/2 Ω(b) disk accesses for any load factor α bounded away from 1. However, such nearperfect performance is achieved only when b is known and the hash table is particularly tuned for working with such a blocking. In this paper we study if it is possible to build a cacheoblivious hash table that works well with any blocking. Such a hash table will automatically perform well across all levels of the memory hierarchy and does not need any hardwarespecific tuning, an important feature in autonomous databases. We first show that linear probing, a classical collision resolution strategy for hash tables, can be easily made cacheoblivious but it only achieves tq = 1 + O(α/b). Then we demonstrate that it is possible to obtain tq = 1 + 1/2 Ω(b), thus matching the cacheaware bound, if the following two conditions hold: (a) b is a power of 2; and (b) every block starts at a memory address divisible by b. Both conditions hold on a real machine, although they are not stated in the cacheoblivious model. Interestingly, we also show that neither condition is dispensable: if either of them is removed, the best obtainable bound is tq = 1 + O(α/b), which is exactly what linear probing achieves.
LINEAR PROBING WITH 5WISE INDEPENDENCE ∗
"... Abstract. Hashing with linear probing dates back to the 1950s, and is among the most studied algorithms for storing (key,value) pairs. In recent years it has become one of the most important hash table organizations since it uses the cache of modern computers very well. Unfortunately, previous analy ..."
Abstract

Cited by 1 (1 self)
 Add to MetaCart
Abstract. Hashing with linear probing dates back to the 1950s, and is among the most studied algorithms for storing (key,value) pairs. In recent years it has become one of the most important hash table organizations since it uses the cache of modern computers very well. Unfortunately, previous analyses rely either on complicated and space consuming hash functions, or on the unrealistic assumption of free access to a hash function with random and independent function values. Carter and Wegman, in their seminal paper on universal hashing, raised the question of extending their analysis to linear probing. However, we show in this paper that linear probing using a 2wise independent hash function may have expected logarithmic cost per operation. Recently, Pǎtra¸scu and Thorup have shown that also 3 and 4wise independent hash functions may give rise to logarithmic expected query time. On the positive side, we show that 5wise independence is enough to ensure constant expected time per operation. This resolves the question of finding a space and time efficient hash function that provably ensures good performance for hashing with linear probing.
Fast, AllPurpose State Storage
"... Abstract. Existing techniques for approximate storage of visited states in a model checker are too specialpurpose and too DRAMintensive. Bitstate hashing, based on Bloom filters, is good for exploring most of very large state spaces, and hash compaction is good for highassurance verification of m ..."
Abstract
 Add to MetaCart
Abstract. Existing techniques for approximate storage of visited states in a model checker are too specialpurpose and too DRAMintensive. Bitstate hashing, based on Bloom filters, is good for exploring most of very large state spaces, and hash compaction is good for highassurance verification of more tractable problems. We describe a scheme that is good at both, because it adapts at run time to the number of states visited. It does this within a fixed memory space and with remarkable speed and accuracy. In many cases, it is faster than existing techniques, because it only ever requires one random access to main memory per operation; existing techniques require several to have good accuracy. Adapting to accomodate more states happens in place using streaming access to memory; traditional rehashing would require extra space, random memory accesses, and hash computation. The structure can also incorporate search stack matching for partialorder reductions, saving the need for extra resources dedicated to an additional structure. Our scheme is wellsuited for a future in which random accesses to memory are more of a limiting factor than the size of memory. 1
The Power of Simple Tabulation Hashing Mihai Pǎtras¸cu AT&T Labs
, 2011
"... Randomized algorithms are often enjoyed for their simplicity, but the hash functions used to yield the desired theoretical guarantees are often neither simple nor practical. Here we show that the simplest possible tabulation hashing provides unexpectedly strong guarantees. The scheme itself dates ba ..."
Abstract
 Add to MetaCart
Randomized algorithms are often enjoyed for their simplicity, but the hash functions used to yield the desired theoretical guarantees are often neither simple nor practical. Here we show that the simplest possible tabulation hashing provides unexpectedly strong guarantees. The scheme itself dates back to Carter and Wegman (STOC’77). Keys are viewed as consisting of c characters. We initialize c tables T1,..., Tc mapping characters to random hash codes. A key x = (x1,..., xq) is hashed to T1[x1] ⊕ · · · ⊕ Tc[xc], where ⊕ denotes xor. While this scheme is not even 4independent, we show that it provides many of the guarantees that are normally obtained via higher independence, e.g., Chernofftype concentration, minwise hashing for estimating set intersection, and cuckoo hashing. An important target of the analysis of algorithms is to determine whether there exist practical schemes, which enjoy mathematical guarantees on performance. Hashing and hash tables are one of the most common inner loops in realworld computation, and are even builtin “unit cost ” operations in high level programming languages that offer associative arrays. Often,
Effect of cache lines in arraybased hashing algorithms
"... Abstract — Hashing algorithms and their efficiency is modeled with their expected probe lengths. This value measures the number of algorithmic steps required to find the position of an item inside the table. This metric, however, has an implicit assumption that all of these steps have a uniform cost ..."
Abstract
 Add to MetaCart
Abstract — Hashing algorithms and their efficiency is modeled with their expected probe lengths. This value measures the number of algorithmic steps required to find the position of an item inside the table. This metric, however, has an implicit assumption that all of these steps have a uniform cost. In this paper we show that this is not true on modern computers, and that caches and especially cache lines have a great impact on the performance and effectiveness of hashing algorithms that use arraybased structures. Spatial locality of memory accesses plays a major role in the effectiveness of an algorithm. We show a novel model of evaluating hashing schemes; this model is based on the number of cache misses the algorithms suffer. This new approach is shown to model the real performance of hash tables more precisely than previous methods. For developing this model the sketch of proof of the expected probe length of linear probing is included.