Results 1 - 6 of 6
On the k-independence required by linear probing and minwise independence
In Proc. 37th International Colloquium on Automata, Languages and Programming (ICALP), 2010
Abstract

Cited by 13 (4 self)
We show that linear probing requires 5-independent hash functions for expected constant-time performance, matching an upper bound of [Pagh et al. STOC’07]. For (1 + ε)-approximate minwise independence, we show that Ω(lg 1/ε)-independent hash functions are required, matching an upper bound of [Indyk, SODA’99]. We also show that the multiply-shift scheme of Dietzfelbinger, most commonly used in practice, fails badly in both applications.
b-bit minwise hashing in practice
In Internetware, 2013
Abstract

Cited by 3 (1 self)
Minwise hashing is a standard technique in the context of search for approximating set similarities. The recent work [26, 32] demonstrated a potential use of b-bit minwise hashing [23, 24] for efficient search and learning on massive, high-dimensional, binary data (which are typical for many applications in Web search and text mining). In this paper, we focus on a number of critical issues which must be addressed before one can apply b-bit minwise hashing to the volumes of data often used in industrial applications. Minwise hashing requires an expensive preprocessing step that computes k (e.g., 500) minimal values after applying the corresponding permutations for each data vector. We developed a parallelization scheme using GPUs and observed that the preprocessing time can be reduced by a factor of 20 to 80 and becomes substantially smaller than the data loading time. Reducing the preprocessing time is highly beneficial in practice, e.g., for duplicate Web page detection (where minwise hashing is a major step in the crawling pipeline) or for increasing the testing speed of online classifiers. Another critical issue is that for very large data sets it becomes impossible to store a (fully) random permutation matrix, due to its space requirements. Our paper is the first study to demonstrate that b-bit minwise hashing implemented using simple hash functions, e.g., the 2-universal (2U) and 4-universal (4U) hash families, can produce learning results very similar to those obtained with fully random permutations. Experiments on datasets of up to 200GB are presented.
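The preprocessing step described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' GPU implementation: it assumes a 2-universal family of the form h(x) = (a*x + c) mod P in place of random permutations, and the names `make_2u_hashes`, `bbit_signature`, `match_fraction`, and the prime `P` are hypothetical choices made for this example.

```python
import random

# Assumption: a Mersenne prime larger than any feature index in the data.
P = (1 << 61) - 1

def make_2u_hashes(k, seed=0):
    """Draw k hash functions h(x) = (a*x + c) mod P, stored as (a, c) pairs."""
    rng = random.Random(seed)
    return [(rng.randrange(1, P), rng.randrange(P)) for _ in range(k)]

def bbit_signature(features, hashes, b):
    """For each hash function, keep the lowest b bits of the minimum hash value.

    `features` is a non-empty collection of integer feature indices
    (the positions of the nonzero entries of a binary data vector).
    """
    mask = (1 << b) - 1
    return [min((a * x + c) % P for x in features) & mask for a, c in hashes]

def match_fraction(sig1, sig2):
    """Fraction of positions where two b-bit signatures agree."""
    return sum(u == v for u, v in zip(sig1, sig2)) / len(sig1)
```

With k = 500 and b = 2, each data vector is reduced to 500 two-bit values. Note that the raw match fraction overstates set similarity, because two different minima can share the same lowest b bits by chance; a resemblance estimate has to correct for such collisions.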
b-Bit Minwise Hashing in Practice: Large-Scale Batch and Online Learning and Using GPUs for Fast Preprocessing with Simple Hash Functions
The Power of Simple Tabulation Hashing
Mihai Pătraşcu, AT&T Labs, 2011
Abstract
Randomized algorithms are often enjoyed for their simplicity, but the hash functions used to yield the desired theoretical guarantees are often neither simple nor practical. Here we show that the simplest possible tabulation hashing provides unexpectedly strong guarantees. The scheme itself dates back to Carter and Wegman (STOC’77). Keys are viewed as consisting of c characters. We initialize c tables T1,..., Tc mapping characters to random hash codes. A key x = (x1,..., xc) is hashed to T1[x1] ⊕ · · · ⊕ Tc[xc], where ⊕ denotes xor. While this scheme is not even 4-independent, we show that it provides many of the guarantees that are normally obtained via higher independence, e.g., Chernoff-type concentration, minwise hashing for estimating set intersection, and cuckoo hashing. An important target of the analysis of algorithms is to determine whether there exist practical schemes, which enjoy mathematical guarantees on performance. Hashing and hash tables are one of the most common inner loops in real-world computation, and are even built-in “unit cost” operations in high-level programming languages that offer associative arrays. Often,
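The scheme described in this abstract is short enough to sketch directly. The following is a minimal illustration under assumed parameters: 8-bit characters, 32-bit hash codes, and x1 taken as the lowest-order character of an integer key. The names `make_tables` and `tab_hash` are hypothetical and chosen for this example.

```python
import random

def make_tables(c, char_bits=8, hash_bits=32, seed=0):
    """Build c tables, each mapping every possible character to a random hash code."""
    rng = random.Random(seed)
    return [[rng.getrandbits(hash_bits) for _ in range(1 << char_bits)]
            for _ in range(c)]

def tab_hash(x, tables, char_bits=8):
    """Hash key x as T1[x1] xor ... xor Tc[xc].

    Assumption: x is a non-negative integer of at most c * char_bits bits,
    and x1 is its lowest-order character.
    """
    mask = (1 << char_bits) - 1
    h = 0
    for table in tables:
        h ^= table[x & mask]   # look up the current character
        x >>= char_bits        # move on to the next character
    return h
```

For a 32-bit key, `make_tables(4)` gives four tables of 256 entries each, so evaluating the hash costs four table lookups and three xors.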
Tabulation Based 5-independent Hashing with Applications to Linear Probing and Second Moment Estimation
Abstract
In the framework of Carter and Wegman, a k-independent hash function maps any k keys independently. It is known that 5-independent hashing provides good expected performance in applications such as linear probing and second moment estimation for data streams. The classic 5-independent hash function evaluates a degree 4 polynomial over a prime field containing the key domain [n] = {0,...,n−1}. Here we present an efficient 5-independent hash function that uses no multiplications. Instead, for any parameter c, we make 2c−1 lookups in tables of size O(n^(1/c)). In experiments on different computers, our scheme gained factors 1.8 to 10 in speed over the polynomial method. We also conducted experiments on the performance of hash functions inside the above applications. In particular, we give realistic examples of inputs that make the most popular 2-independent hash function perform quite poorly. This illustrates the advantage of using schemes with provably good expected performance for all inputs.
1 Introduction
We consider “k-independent hashing” in the classic framework of Carter and Wegman [32]. For any i ≥ 1, let [i] = {0,1,...,i − 1}. We consider “hash” functions from “keys” in [n] to “hash values” in [m]. A class H of hash functions is k-independent if for any distinct x_0,...,x_{k−1} ∈ [n] and any possibly identical
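The classic polynomial method that this abstract uses as its baseline can be sketched as below. This illustrates only the degree 4 polynomial over a prime field, not the paper's tabulation-based scheme. The prime `P`, the final reduction modulo m, and the names `make_poly_hash` and `poly_hash` are assumptions made for this example.

```python
import random

# Assumption: a Mersenne prime larger than the key domain [n].
P = (1 << 61) - 1

def make_poly_hash(k=5, seed=0):
    """Draw k random coefficients in [P], defining a polynomial of degree k-1."""
    rng = random.Random(seed)
    return [rng.randrange(P) for _ in range(k)]

def poly_hash(x, coeffs, m):
    """Evaluate the polynomial at key x over the field Z_P, then map into [m].

    Coefficients are ordered from highest degree to lowest and the
    evaluation uses Horner's rule. Assumption: 0 <= x < P.
    """
    h = 0
    for a in coeffs:
        h = (h * x + a) % P
    return h % m
```

With five coefficients, each evaluation costs four multiplications modulo P, which is the cost the tabulation-based scheme in the abstract is designed to avoid. The last step `h % m` is only approximately uniform over [m] unless m divides P.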