Results 1  10
of
33
Measuring independence of datasets
 CoRR
"... Approximating pairwise, or kwise, independence with sublinear memory is of considerable importance in the data stream model. In the streaming model the joint distribution is given by a stream of ktuples, with the goal of testing correlations among the components measured over the entire stream. In ..."
Abstract

Cited by 10 (1 self)
 Add to MetaCart
Approximating pairwise, or kwise, independence with sublinear memory is of considerable importance in the data stream model. In the streaming model the joint distribution is given by a stream of ktuples, with the goal of testing correlations among the components measured over the entire stream. Indyk and McGregor (SODA 08) recently gave exciting new results for measuring pairwise independence in this model. Statistical distance is one of the most fundamental metrics for measuring the similarity of two distributions, and it has been a metric of choice in many papers that discuss distribution closeness. For pairwise independence, the Indyk and McGregor methods provide log napproximation under statistical distance between the joint and product distributions in the streaming model. Indyk and McGregor leave, as their main open question, the problem of improving their log napproximation for the statistical distance metric. In this paper we solve the main open problem posed by Indyk and McGregor for the statistical distance for pairwise independence and extend this result to any constant k. In particular, we present an algorithm that computes an (ɛ, δ)approximation of the statistical distance between the joint and product distributions defined by a stream of 1 nm ktuples. Our algorithm requires O log( ɛ δ)) (30+k) k) memory and a single pass over the data stream.
Deamortized Cuckoo Hashing: Provable WorstCase Performance and Experimental Results
"... Cuckoo hashing is a highly practical dynamic dictionary: it provides amortized constant insertion time, worst case constant deletion time and lookup time, and good memory utilization. However, with a noticeable probability during the insertion of n elements some insertion requires Ω(log n) time. Whe ..."
Abstract

Cited by 10 (3 self)
 Add to MetaCart
Cuckoo hashing is a highly practical dynamic dictionary: it provides amortized constant insertion time, worst case constant deletion time and lookup time, and good memory utilization. However, with a noticeable probability during the insertion of n elements some insertion requires Ω(log n) time. Whereas such an amortized guarantee may be suitable for some applications, in other applications (such as highperformance routing) this is highly undesirable. Kirsch and Mitzenmacher (Allerton ’07) proposed a deamortization of cuckoo hashing using queueing techniques that preserve its attractive properties. They demonstrated a significant improvement to the worst case performance of cuckoo hashing via experimental results, but left open the problem of constructing a scheme with provable properties. In this work we present a deamortization of cuckoo hashing that provably guarantees constant worst case operations. Specifically, for any sequence of polynomially many operations, with overwhelming probability over the randomness of the initialization phase, each operation is performed in constant time. In addition, we present a general approach for proving that the performance guarantees are preserved when using hash functions with limited independence
HashBased Techniques for HighSpeed Packet Processing
"... Abstract Hashing is an extremely useful technique for a variety of highspeed packetprocessing applications in routers. In this chapter, we survey much of the recent work in this area, paying particular attention to the interaction between theoretical and applied research. We assume very little bac ..."
Abstract

Cited by 9 (1 self)
 Add to MetaCart
Abstract Hashing is an extremely useful technique for a variety of highspeed packetprocessing applications in routers. In this chapter, we survey much of the recent work in this area, paying particular attention to the interaction between theoretical and applied research. We assume very little background in either the theory or applications of hashing, reviewing the fundamentals as necessary. 1
Maximum matchings in random bipartite graphs and the space utilization of cuckoo hashtables
, 2009
"... We study the the following question in Random Graphs. We are given two disjoint sets L, R with L  = n = αm and R  = m. We construct a random graph G by allowing each x ∈ L to choose d random neighbours in R. The question discussed is as to the size µ(G) of the largest matching in G. When consi ..."
Abstract

Cited by 9 (0 self)
 Add to MetaCart
We study the the following question in Random Graphs. We are given two disjoint sets L, R with L  = n = αm and R  = m. We construct a random graph G by allowing each x ∈ L to choose d random neighbours in R. The question discussed is as to the size µ(G) of the largest matching in G. When considered in the context of Cuckoo Hashing, one key question is as to when is µ(G) = n whp? We answer this question exactly when d is at least three. We also establish a precise threshold for when Phase 1 of the KarpSipser Greedy matching algorithm suffices to compute a maximum matching whp.
String hashing for linear probing
 In Proc. 20th SODA
, 2009
"... Linear probing is one of the most popular implementations of dynamic hash tables storing all keys in a single array. When we get a key, we first hash it to a location. Next we probe consecutive locations until the key or an empty location is found. At STOC’07, Pagh et al. presented data sets where t ..."
Abstract

Cited by 8 (3 self)
 Add to MetaCart
Linear probing is one of the most popular implementations of dynamic hash tables storing all keys in a single array. When we get a key, we first hash it to a location. Next we probe consecutive locations until the key or an empty location is found. At STOC’07, Pagh et al. presented data sets where the standard implementation of 2universal hashing leads to an expected number of Ω(log n) probes. They also showed that with 5universal hashing, the expected number of probes is constant. Unfortunately, we do not have 5universal hashing for, say, variable length strings. When we want to do such complex hashing from a complex domain, the generic standard solution is that we first do collision free hashing (w.h.p.) into a simpler intermediate domain, and second do the complicated hash function on this intermediate domain. Our contribution is that for an expected constant number of linear probes, it is suffices that each key has O(1) expected collisions with the first hash function, as long as the second hash function is 5universal. This means that the intermediate domain can be n times smaller, and such a smaller intermediate domain typically means that the overall hash function can be made simpler and at least twice as fast. The same doubling of hashing speed for O(1) expected probes follows for most domains bigger than 32bit integers, e.g., 64bit integers and fixed length strings. In addition, we study how the overhead from linear probing diminishes as the array gets larger, and what happens if strings are stored directly as intervals of the array. These cases were not considered by Pagh et al. 1
HistoryIndependent Cuckoo Hashing
"... Cuckoo hashing is an efficient and practical dynamic dictionary. It provides expected amortized constant update time, worst case constant lookup time, and good memory utilization. Various experiments demonstrated that cuckoo hashing is highly suitable for modern computer architectures and distribute ..."
Abstract

Cited by 8 (4 self)
 Add to MetaCart
Cuckoo hashing is an efficient and practical dynamic dictionary. It provides expected amortized constant update time, worst case constant lookup time, and good memory utilization. Various experiments demonstrated that cuckoo hashing is highly suitable for modern computer architectures and distributed settings, and offers significant improvements compared to other schemes. In this work we construct a practical historyindependent dynamic dictionary based on cuckoo hashing. In a historyindependent data structure, the memory representation at any point in time yields no information on the specific sequence of insertions and deletions that led to its current content, other than the content itself. Such a property is significant when preventing unintended leakage of information, and was also found useful in several algorithmic settings. Our construction enjoys most of the attractive properties of cuckoo hashing. In particular, no dynamic memory allocation is required, updates are performed in expected amortized constant time, and membership queries are performed in worst case constant time. Moreover, with high probability, the lookup procedure queries only two memory entries which are independent and can be queried in parallel. The approach underlying our construction is to enforce a canonical memory representation on cuckoo hashing. That is, up to the initial randomness, each set of elements has a unique memory representation.
Dynamic external hashing: The limit of buffering
 In Proc. ACM Symposium on Parallelism in Algorithms and Architectures
, 2009
"... Hash tables are one of the most fundamental data structures in computer science, in both theory and practice. They are especially useful in external memory, where their query performance approaches the ideal cost of just one disk access. Knuth [16] gave an elegant analysis showing that with some sim ..."
Abstract

Cited by 6 (3 self)
 Add to MetaCart
Hash tables are one of the most fundamental data structures in computer science, in both theory and practice. They are especially useful in external memory, where their query performance approaches the ideal cost of just one disk access. Knuth [16] gave an elegant analysis showing that with some simple collision resolution strategies such as linear probing or chaining, the expected average number of disk I/Os of a lookup is merely 1 + 1/2 Ω(b) , where each I/O can read and/or write a disk block containing b items. Inserting a new item into the hash table also costs 1 + 1/2 Ω(b) I/Os, which is again almost the best one can do if the hash table is entirely stored on disk. However, this requirement is unrealistic since any algorithm operating on an external hash table must have some internal memory (at least Ω(1) blocks) to work with. The availability of a small internal memory buffer can dramatically reduce the amortized insertion cost to o(1) I/Os for many external memory data structures. In this paper we study the inherent queryinsertion tradeoff of external hash tables in the presence of a memory buffer. In particular, we show that for any constant c> 1, if the expected average successful query cost is targeted at 1 + O(1/b c) I/Os, then it is not possible to support insertions in less than 1 − O(1/b c−1 6) I/Os amortized, which means that the memory buffer is essentially useless. While if the query cost is relaxed to 1 + O(1/b c) I/Os for any constant c < 1, there is a simple dynamic hash table with o(1) insertion cost. Categories and Subject Descriptors F.2.3 [Analysis of algorithms and problem complexity]: Tradeoffs between complexity measures; E.2 [Data storage]: hashtable representations
Sampleoptimal averagecase sparse fourier transform in two dimensions. Unpublished manuscript
, 2012
"... We present the first sampleoptimal sublinear time algorithms for the sparse Discrete Fourier Transform over a twodimensional √ n × √ n grid. Our algorithms are analyzed for average case signals. For signals whose spectrum is exactly sparse, our algorithms use O(k) samples and run in O(klogk) time ..."
Abstract

Cited by 5 (4 self)
 Add to MetaCart
We present the first sampleoptimal sublinear time algorithms for the sparse Discrete Fourier Transform over a twodimensional √ n × √ n grid. Our algorithms are analyzed for average case signals. For signals whose spectrum is exactly sparse, our algorithms use O(k) samples and run in O(klogk) time, wherek is the expected sparsity of the signal. For signals whose spectrum is approximately sparse, our algorithm usesO(klogn) samples and runs in O(klog 2 n) time; the latter algorithm works for k = Θ ( √ n). The number of samples used by our algorithms matches the known lower bounds for the respective signal models. By a known reduction, our algorithms give similar results for the onedimensional sparse Discrete Fourier Transform whennis a power of a small composite number (e.g.,n = 6 t). 1
An Analysis of RandomWalk Cuckoo Hashing
"... In this paper, we provide a polylogarithmic bound that holds with high probability on the insertion time for cuckoo hashing under the randomwalk insertion method. Cuckoo hashing provides a useful methodology for building practical, highperformance hash tables. The essential idea of cuckoo hashing ..."
Abstract

Cited by 4 (2 self)
 Add to MetaCart
In this paper, we provide a polylogarithmic bound that holds with high probability on the insertion time for cuckoo hashing under the randomwalk insertion method. Cuckoo hashing provides a useful methodology for building practical, highperformance hash tables. The essential idea of cuckoo hashing is to combine the power of schemes that allow multiple hash locations for an item with the power to dynamically change the location of an item among its possible locations. Previous work on the case where the number of choices is larger than two has required a breadthfirst search analysis, which is both inefficient in practice and currently has only a polynomial high probability upper bound on the insertion time. Here we significantly advance the state of the art by proving a polylogarithmic bound on the more efficient randomwalk method, where items repeatedly kick out random blocking items until a free location for an item is found. 1
Software Defined Traffic Measurement with OpenSketch
"... Most network management tasks in softwaredefined networks (SDN) involve two stages: measurement and control. While many efforts have been focused on network control APIs for SDN, little attention goes into measurement. The key challenge of designing a new measurement API is to strike a careful bala ..."
Abstract

Cited by 4 (1 self)
 Add to MetaCart
Most network management tasks in softwaredefined networks (SDN) involve two stages: measurement and control. While many efforts have been focused on network control APIs for SDN, little attention goes into measurement. The key challenge of designing a new measurement API is to strike a careful balance between generality (supporting a wide variety of measurement tasks) and efficiency (enabling high link speed and low cost). We propose a software defined traffic measurement architecture OpenSketch, which separates the measurement data plane from the control plane. In the data plane, OpenSketch provides a simple threestage pipeline (hashing, filtering, and counting), which can be implemented with commodity switch components and support many measurement tasks. In the control plane, OpenSketch provides a measurement library that automatically configures the pipeline and allocates resources for different measurement tasks. Our evaluations of realworld packet traces, our prototype on NetFPGA, and the implementation of five measurement tasks on top of OpenSketch, demonstrate that OpenSketch is general, efficient and easily programmable. 1