Results 11 – 20 of 48
Densifying one permutation hashing via rotation for fast near neighbor search
, 2013
Abstract

Cited by 11 (6 self)
The query complexity of locality sensitive hashing (LSH) based similarity search is dominated by the number of hash evaluations, and this number grows with the data size (Indyk & Motwani, 1998). In industrial applications such as search where the data are often high-dimensional and binary (e.g., text n-grams), minwise hashing is widely adopted, which requires applying a large number of permutations on the data. This is costly in computation and energy consumption. In this paper, we propose a hashing technique which generates all the necessary hash evaluations needed for similarity search, using one single permutation. The heart of the proposed hash function is a “rotation” scheme which densifies the sparse sketches of one permutation hashing (Li et al., 2012) in an unbiased fashion, thereby maintaining the LSH property. This makes the obtained sketches suitable for hash table construction. This idea of rotation presented in this paper could be of independent interest for densifying other types of sparse sketches. Using our proposed hashing method, the query time of a (K, L)-parameterized LSH is reduced from the typical O(dKL) complexity to merely O(KL + dL), where d is the number of nonzeros of the data vector, K is the number of hashes in each hash table, and L is the number of hash tables. Our experimental evaluation on real data confirms that the proposed scheme significantly reduces the query processing time over minwise hashing without loss in retrieval accuracies.
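The rotation scheme the abstract describes can be sketched in a few lines. This is a minimal toy version under my own assumptions (binary data given as a set of nonzero indices; function and parameter names are hypothetical, not the authors' code): one permutation is split into k bins, each bin keeps its minimum permuted position, and an empty bin borrows from the next non-empty bin with an offset recording the borrow distance.

```python
import random

def densified_one_permutation_hash(nonzeros, dim, k, seed=0):
    """Toy sketch of one-permutation hashing with rotation densification.

    nonzeros: indices of the 1-bits of a binary vector of length dim.
    k: number of bins, i.e. number of hash values produced.
    """
    rng = random.Random(seed)
    perm = list(range(dim))
    rng.shuffle(perm)                      # the one single permutation
    bin_size = -(-dim // k)                # ceil(dim / k)
    bins = [None] * k
    for idx in nonzeros:
        p = perm[idx]                      # permuted position of this 1-bit
        b, v = p // bin_size, p % bin_size # bin index and offset within bin
        if bins[b] is None or v < bins[b]:
            bins[b] = v                    # keep the minimum per bin
    # Rotation: an empty bin copies the next non-empty bin (circularly),
    # shifted by the borrow distance so the sketch stays usable for LSH.
    out = []
    for j in range(k):
        b, t = j, 0
        while bins[b] is None:
            b = (b + 1) % k
            t += 1
        out.append(bins[b] + t * bin_size)
    return out
```

With at least one nonzero, every bin receives a densified value, so the sketch always has exactly k entries.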
Dynamic external hashing: The limit of buffering
 In Proc. ACM Symposium on Parallelism in Algorithms and Architectures
, 2009
Abstract

Cited by 9 (3 self)
Hash tables are one of the most fundamental data structures in computer science, in both theory and practice. They are especially useful in external memory, where their query performance approaches the ideal cost of just one disk access. Knuth [16] gave an elegant analysis showing that with some simple collision resolution strategies such as linear probing or chaining, the expected average number of disk I/Os of a lookup is merely 1 + 1/2^Ω(b), where each I/O can read and/or write a disk block containing b items. Inserting a new item into the hash table also costs 1 + 1/2^Ω(b) I/Os, which is again almost the best one can do if the hash table is entirely stored on disk. However, this requirement is unrealistic since any algorithm operating on an external hash table must have some internal memory (at least Ω(1) blocks) to work with. The availability of a small internal memory buffer can dramatically reduce the amortized insertion cost to o(1) I/Os for many external memory data structures. In this paper we study the inherent query-insertion tradeoff of external hash tables in the presence of a memory buffer. In particular, we show that for any constant c > 1, if the expected average successful query cost is targeted at 1 + O(1/b^c) I/Os, then it is not possible to support insertions in less than 1 − O(1/b^((c−1)/6)) I/Os amortized, which means that the memory buffer is essentially useless. While if the query cost is relaxed to 1 + O(1/b^c) I/Os for any constant c < 1, there is a simple dynamic hash table with o(1) insertion cost. Categories and Subject Descriptors: F.2.3 [Analysis of algorithms and problem complexity]: Tradeoffs between complexity measures; E.2 [Data storage]: hash-table representations
An Analysis of Random-Walk Cuckoo Hashing
Abstract

Cited by 9 (3 self)
In this paper, we provide a polylogarithmic bound that holds with high probability on the insertion time for cuckoo hashing under the random-walk insertion method. Cuckoo hashing provides a useful methodology for building practical, high-performance hash tables. The essential idea of cuckoo hashing is to combine the power of schemes that allow multiple hash locations for an item with the power to dynamically change the location of an item among its possible locations. Previous work on the case where the number of choices is larger than two has required a breadth-first search analysis, which is both inefficient in practice and currently has only a polynomial high probability upper bound on the insertion time. Here we significantly advance the state of the art by proving a polylogarithmic bound on the more efficient random-walk method, where items repeatedly kick out random blocking items until a free location for an item is found.
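The random-walk insertion rule the abstract analyzes is simple to state in code. A minimal sketch under my own assumptions (one shared table, d hash functions passed in as callables; names are hypothetical): if every candidate slot is full, evict a uniformly random victim and continue the walk with it.

```python
import random

def cuckoo_insert(table, hashes, key, max_kicks=500, rng=random):
    """Random-walk insertion for cuckoo hashing with d >= 2 choices.

    table: a list of slots (None = empty); hashes: d functions key -> slot.
    Returns True on success, False if the walk exceeds max_kicks
    (which in a real implementation would trigger a rehash).
    """
    for _ in range(max_kicks):
        slots = [h(key) for h in hashes]
        for s in slots:
            if table[s] is None:           # a free choice: place and stop
                table[s] = key
                return True
        s = rng.choice(slots)              # all full: pick a random victim
        key, table[s] = table[s], key      # swap and keep walking with it
    return False
```

The breadth-first alternative would instead explore all eviction sequences level by level; the random walk needs no queue and no bookkeeping, which is why it is preferred in practice.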
Private Search in the Real World
Abstract

Cited by 7 (0 self)
Encrypted search — performing queries on protected data — has been explored in the past; however, its inherent inefficiency has raised questions of practicality. Here, we focus on improving the performance and extending its functionality enough to make it practical. We do this by optimizing the system, and by stepping back from the goal of achieving maximal privacy guarantees in an encrypted search scenario to consider efficiency and functionality as priorities. We design and analyze the privacy implications of two practical extensions applicable to any keyword-based private search system. We evaluate their efficiency by building them on top of a private search system called SADS. Additionally, we improve SADS’ performance, privacy guarantees and functionality. The extended SADS system offers improved efficiency parameters that meet practical usability requirements in a relaxed adversarial model. We present the experimental results and evaluate the performance of the system. We also demonstrate analytically that our scheme can meet the basic needs of a major hospital complex’s admissions records. Overall, we achieve performance comparable to a simply configured MySQL database system.
Tabulation Based 5-Universal Hashing and Linear Probing
Abstract

Cited by 7 (4 self)
Previously [SODA’04] we devised the fastest known algorithm for 4-universal hashing. The hashing was based on small precomputed 4-universal tables. This led to a fivefold improvement in speed over direct methods based on degree-3 polynomials. In this paper, we show that if the precomputed tables are made 5-universal, then the hash value becomes 5-universal without any other change to the computation. Relatively, this leads to even bigger gains, since the direct methods for 5-universal hashing use degree-4 polynomials. Experimentally, we find that our method can gain up to an order of magnitude in speed over direct 5-universal hashing. Some of the most popular randomized algorithms have been proved to have the desired expected running time using
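The table-lookup structure the paper builds on is easy to illustrate. Below is a toy sketch of plain tabulation hashing under my own assumptions (32-bit keys split into 4 bytes, one random 256-entry table per byte position, outputs XORed); with fully random tables this is the classic 3-independent scheme, and the paper's point is that making the precomputed tables 4- or 5-universal lifts the guarantee with no change to this inner loop. Names are hypothetical, not the authors' implementation.

```python
import random

def make_tabulation_hash(key_bytes=4, seed=0):
    """Build a tabulation hash: split the key into characters (bytes) and
    XOR together one precomputed random table entry per character."""
    rng = random.Random(seed)
    tables = [[rng.getrandbits(32) for _ in range(256)]
              for _ in range(key_bytes)]

    def h(x):
        v = 0
        for i in range(key_bytes):
            v ^= tables[i][(x >> (8 * i)) & 0xFF]  # one lookup per byte
        return v
    return h
```

The per-key work is key_bytes table lookups and XORs, with no multiplications, which is the source of the speedup over evaluating a degree-4 polynomial per key.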
Coordinated Weighted Sampling for Estimating Aggregates Over Multiple Weight Assignments
, 2009
Abstract

Cited by 7 (6 self)
Many data sources are naturally modeled by multiple weight assignments over a set of keys: snapshots of an evolving database at multiple points in time, measurements collected over multiple time periods, requests for resources served at multiple locations, and records with multiple numeric attributes. Over such vector-weighted data we are interested in aggregates with respect to one set of weights, such as weighted sums, and aggregates over multiple sets of weights, such as the L1 difference. Sample-based summarization is highly effective for data sets that are too large to be stored or manipulated. The summary facilitates approximate processing of queries that may be specified after the summary was generated. Current designs, however, are geared for data sets where a single scalar weight is associated with each key. We develop a sampling framework based on coordinated weighted samples that is suited for multiple weight assignments and obtain estimators that are orders of magnitude tighter than previously possible. We demonstrate the power of our methods through an extensive empirical evaluation on diverse data sets ranging from IP network to stock quotes data.
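Coordination between samples of different weight assignments can be illustrated with a standard priority (bottom-k) sample in which each key's randomness comes from a hash of the key alone, so every weight assignment reuses the same uniforms. This is a generic sketch of that idea, not the paper's estimators; the function names and the SHA-256-based uniform are my own choices.

```python
import hashlib

def shared_uniform(key):
    """Map a key to a reproducible uniform in (0, 1) via hashing, so that
    samples drawn for different weight assignments are coordinated."""
    d = hashlib.sha256(str(key).encode()).digest()
    return (int.from_bytes(d[:8], "big") + 1) / (2**64 + 2)

def priority_sample(weights, k):
    """Priority sample of a dict key -> weight: keep the k keys with the
    largest priority w / u(key). Since u depends only on the key, two
    weight assignments over the same keys share their randomness."""
    prios = {key: w / shared_uniform(key)
             for key, w in weights.items() if w > 0}
    return set(sorted(prios, key=prios.get, reverse=True)[:k])
```

Coordination is what makes estimators for cross-assignment aggregates like the L1 difference tight: a key that is sampled under one assignment tends to be sampled under a similar assignment as well.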
On risks of using cuckoo hashing with simple universal hash classes
 In Proc. 20th ACM/SIAM Symposium on Discrete Algorithms (SODA
, 2009
Abstract

Cited by 7 (0 self)
Cuckoo hashing, introduced by Pagh and Rodler [10], is a dynamic dictionary data structure for storing a set S of n keys from a universe U, with constant lookup time and amortized expected constant insertion time. For the analysis, space (2+ε)n and Ω(log n)-wise independence of the hash functions is sufficient. In experiments mentioned in [10], several weaker hash classes worked well; however, a certain simple multiplicative hash family worked badly. In this paper, we prove that the failure probability is high when cuckoo hashing is run with the multiplicative class or with the very common class of linear hash functions over a prime field, even if space 4n is provided. The key set S is fully random, but it must be relatively dense in the universe U of all keys (like |S| ≥ |U|^(11/12)). The bad behavior and the fact that this effect depends on the density of S in U can also be observed in experiments. The result transfers to larger universes if the keys are chosen from a suitable smaller domain. Viewed from a different perspective, our result illustrates that care must be taken when applying a recent result of Mitzenmacher and Vadhan ([12], SODA 2008) proving good behavior of universal hash classes in combination with key sets that have some entropy. Their result is applicable to cuckoo hashing. A technical hypothesis in [12], namely the assumption that either the “collision probability” or the “maximum probability” is small, translates into the condition that |S| is relatively small in comparison to |U|. Our result shows that the result from [12] on 2-universal classes ceases to hold if |S|/|U| is not small enough, even for very common 2-universal hash classes and fully random key sets.
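For reference, the simple multiplicative class at issue is the textbook family h_a(x) = ((a·x) mod 2^w) >> (w − l) with odd multiplier a. A small sketch of the family itself (parameter names w, l and the function name are my own, for illustration only):

```python
import random

def multiplicative_hash_family(w=32, l=10, seed=0):
    """Draw one member of the multiplicative class:
    h_a(x) = ((a * x) mod 2^w) >> (w - l), with a a random odd w-bit
    multiplier. Output is an l-bit bucket index."""
    rng = random.Random(seed)
    a = rng.getrandbits(w) | 1            # force the multiplier to be odd
    mask = (1 << w) - 1
    return lambda x: ((a * x) & mask) >> (w - l)
```

The family is 2-universal up to constant factors and extremely fast, which is exactly why the paper's negative result for it (and for linear functions over a prime field) matters in practice.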
Biff (Bloom Filter) Codes: Fast Error Correction for Large Data Sets
Abstract

Cited by 5 (2 self)
Large data sets are increasingly common in cloud and virtualized environments. For example, transfers of multiple gigabytes are commonplace, as are replicated blocks of such sizes. There is a need for fast error-correction or data reconciliation in such settings even when the expected number of errors is small. Motivated by such cloud reconciliation problems, we consider error-correction schemes designed for large data, after explaining why previous approaches appear unsuitable. We introduce Biff codes, which are based on Bloom filters and are designed for large data. For Biff codes with a message of length L and E errors, the encoding time is O(L), decoding time is O(L + E) and the space overhead is O(E). Biff codes are low-density parity-check codes; they are similar to Tornado codes, but are designed for errors instead of erasures. Further, Biff codes are designed to be very simple, removing any explicit graph structures and based entirely on hash tables. We derive Biff codes by a simple reduction from a set reconciliation algorithm for a recently developed data structure, invertible Bloom lookup tables. While the underlying theory is extremely simple, what makes this code especially attractive is the ease with which it can be implemented and the speed of decoding. We present results from a prototype implementation that decodes messages of 1 million words with thousands of errors in well under a second.
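The invertible Bloom lookup table that Biff codes reduce from can be sketched compactly. This is a toy version under my own assumptions (positive integer keys, three subtables, SHA-256 in place of the fast hashes a real implementation would use; class and method names are hypothetical): each key XORs itself into one cell per subtable, each cell keeps a count, a key XOR, and a checksum XOR, and peeling pure cells recovers the symmetric difference of two sets.

```python
import hashlib

def _h(tag, x):
    d = hashlib.sha256(f"{tag}:{x}".encode()).digest()
    return int.from_bytes(d[:8], "big")

class IBLT:
    """Toy invertible Bloom lookup table over positive integers."""
    def __init__(self, m=90, d=3):
        self.d, self.sub = d, m // d
        self.count = [0] * m
        self.keysum = [0] * m
        self.checksum = [0] * m

    def _cells(self, x):
        # one cell per subtable, so a key never cancels with itself
        return [i * self.sub + _h(i, x) % self.sub for i in range(self.d)]

    def _update(self, x, delta):
        for c in self._cells(x):
            self.count[c] += delta
            self.keysum[c] ^= x
            self.checksum[c] ^= _h("chk", x)

    def insert(self, x): self._update(x, +1)
    def delete(self, x): self._update(x, -1)

    def peel(self):
        """Pop pure cells (count = ±1, checksum matches) until stuck;
        returns (inserted-only keys, deleted-only keys)."""
        added, removed = set(), set()
        progress = True
        while progress:
            progress = False
            for c in range(len(self.count)):
                if self.count[c] in (1, -1) and \
                   self.checksum[c] == _h("chk", self.keysum[c]):
                    x = self.keysum[c]
                    delta = -self.count[c]
                    (added if delta == -1 else removed).add(x)
                    self._update(x, delta)   # remove x, exposing new pure cells
                    progress = True
        return added, removed
```

Decoding is exactly the hash-table peeling the abstract alludes to: no explicit graph is ever built, yet the process is the familiar LDPC-style peeling of degree-one nodes.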
Bottom-k and priority sampling, set similarity and subset sums with minimal independence
 In Proc. 45th STOC
, 2013
An Improved Analysis of the Lossy Difference Aggregator
Abstract

Cited by 3 (0 self)
We provide a detailed analysis of the Lossy Difference Aggregator, a recently developed data structure for measuring latency in a router environment where packet losses can occur. Our analysis provides stronger performance bounds than those given originally, and leads us to a model for how to optimize the parameters for the data structure when the loss rate is not known in advance by using competitive analysis.
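The Lossy Difference Aggregator itself is a small structure and easy to sketch. In this toy version (my own names and a SHA-256 bucket hash, not the paper's code), sender and receiver each hash every packet to one of m buckets accumulating a timestamp sum and a packet count; buckets whose counts agree on both sides suffered no loss, so their timestamp difference contributes an exact delay sum.

```python
import hashlib

def _bucket(pkt_id, m):
    d = hashlib.sha256(str(pkt_id).encode()).digest()
    return int.from_bytes(d[:4], "big") % m

class LDA:
    """Toy Lossy Difference Aggregator bank: per-bucket timestamp sums
    and packet counts, maintained identically at sender and receiver."""
    def __init__(self, m=8):
        self.m = m
        self.sum = [0.0] * m
        self.cnt = [0] * m

    def record(self, pkt_id, timestamp):
        b = _bucket(pkt_id, self.m)
        self.sum[b] += timestamp
        self.cnt[b] += 1

def mean_delay(sender, receiver):
    """Average one-way delay over loss-free buckets, or None if every
    bucket was contaminated by loss."""
    s = c = 0.0
    for b in range(sender.m):
        if sender.cnt[b] == receiver.cnt[b] and sender.cnt[b] > 0:
            s += receiver.sum[b] - sender.sum[b]
            c += sender.cnt[b]
    return s / c if c else None
```

A lost packet contaminates only its own bucket (the counts there no longer match), which is why the number of banks and buckets must be tuned to the anticipated loss rate; the competitive analysis in the paper addresses choosing those parameters when the loss rate is unknown.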