## Succinct Data Structures for Retrieval and Approximate Membership

Citations: 14 (6 self)

### BibTeX

@TECHREPORT{Dietzfelbinger_succinctdata,

author = {Martin Dietzfelbinger and Rasmus Pagh},

title = {Succinct Data Structures for Retrieval and Approximate Membership},

institution = {},

year = {}

}

### Abstract

The retrieval problem is the problem of associating data with keys in a set. Formally, the data structure must store a function $f\colon U \to \{0,1\}^r$ that has specified values on the elements of a given set $S \subseteq U$, $|S| = n$, but may have any value on elements outside $S$. All known methods (e.g. those based on perfect hash functions) induce a space overhead of $\Theta(n)$ bits over the optimum, regardless of the evaluation time. We show that for any $k$, query time $O(k)$ can be achieved using space that is within a factor $1 + e^{-k}$ of optimal, asymptotically for large $n$. The expected time to construct the data structure is $O(n)$. If we allow logarithmic evaluation time, the additive overhead can be reduced to $O(\log\log n)$ bits with high probability. A general reduction transfers the results on retrieval into analogous results on approximate membership, a problem traditionally addressed using Bloom filters. Thus we obtain space bounds arbitrarily close to the lower bound for this problem as well. The evaluation procedures of our data structures are extremely simple. For the results stated above we assume free access to fully random hash functions. This assumption can be justified using space $o(n)$ to simulate full randomness on a RAM.
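The following is a minimal sketch of the kind of structure the abstract describes, under stated assumptions: a table of roughly $(1+\delta)n$ entries of $r$ bits, where the value for a key is the XOR of $k$ table entries and the table is found by solving a sparse linear system over GF(2). The names `positions`, `build_retrieval`, `query`, `fingerprint`, `build_filter` and `maybe_contains`, the default parameters, and the use of a seeded `random.Random` as a stand-in for fully random hash functions are all illustrative choices, not the authors' implementation. The sketch uses plain Gaussian elimination, so it does not achieve the expected $O(n)$ construction time claimed in the abstract.

```python
import random

def positions(key, seed, m, k):
    """k distinct slots in range(m) for key (stand-in for fully random hashing)."""
    return random.Random(f"{seed}:{key}").sample(range(m), k)

def build_retrieval(f, k=3, delta=0.3, max_tries=50):
    """Return (seed, table) with query(x, seed, table, k) == f[x] for all x in f.

    f maps keys to non-negative ints (the r-bit values to retrieve).
    """
    keys = list(f)
    m = max(k, int((1 + delta) * len(keys)) + 1)
    for seed in range(max_tries):
        pivots = {}          # leading column -> (row bitmask, value)
        consistent = True
        for x in keys:
            row = sum(1 << p for p in positions(x, seed, m, k))
            val = f[x]
            # Eliminate leading bits that already have a pivot row.
            while row and (row.bit_length() - 1) in pivots:
                prow, pval = pivots[row.bit_length() - 1]
                row ^= prow
                val ^= pval
            if row:
                pivots[row.bit_length() - 1] = (row, val)
            elif val:
                consistent = False   # equation 0 = nonzero: retry with a new seed
                break
        if not consistent:
            continue
        # Back-substitution in increasing column order; free columns stay 0.
        table = [0] * m
        for col in sorted(pivots):
            row, val = pivots[col]
            rest = row ^ (1 << col)
            while rest:
                j = rest.bit_length() - 1
                val ^= table[j]
                rest ^= 1 << j
            table[col] = val
        return seed, table
    raise RuntimeError("no solvable seed found; increase delta or max_tries")

def query(key, seed, table, k=3):
    """XOR of the k table entries selected by the key."""
    out = 0
    for p in positions(key, seed, len(table), k):
        out ^= table[p]
    return out

def fingerprint(key, r):
    """Pseudo-random r-bit fingerprint of key (hypothetical helper)."""
    return random.Random(f"fp:{key}").getrandbits(r)

def build_filter(keys, r, **kw):
    """Approximate membership: store each key's fingerprint in a retrieval table."""
    return build_retrieval({x: fingerprint(x, r) for x in keys}, **kw)

def maybe_contains(key, seed, table, r, k=3):
    """True for every stored key; true with probability about 2**-r otherwise."""
    return query(key, seed, table, k) == fingerprint(key, r)
```

The last three functions illustrate the reduction mentioned in the abstract: a key outside the set retrieves an essentially arbitrary value, so it matches its own $r$-bit fingerprint only with probability about $2^{-r}$, and choosing $r = \lceil \log_2(1/\varepsilon) \rceil$ targets a false positive rate of $\varepsilon$.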

### Citations

1898 | Randomized Algorithms
- Motwani, Raghavan
- 1995
Citation Context ... the mapping $i \mapsto \sigma(i)$ yields one term in the expansion of $\det(P')$ as a sum that does not vanish, and because the terms in this sum cannot cancel in any way. By the Schwartz-Zippel Theorem (see e.g. [26]) we know that if we substitute random elements $g_\ell(x_i)$ from $F$ for the variables $X_{ij}$ in $P'$, the probability that the resulting matrix $P'[X_{ij} | g_\ell(x_i)]$ (with $j = h_\ell(x_i)$) is regular is at least 1 − deg...

1539 | Space/Time Trade-offs in Hash Coding with Allowable Errors
- Bloom
- 1970
Citation Context ...taining a space-optimal RAM data structure that answers range queries in constant time [1, 18]. 1.2 Previous results. Approximate membership. The study of approximate membership was initiated by Bloom [2] who described the Bloom filter data structure which provides an elegant, near-optimal solution to the problem. Bloom showed [2, 4] that a space usage of $n \log_2(1/\varepsilon) \log_2 e$ bits suffices for a fal...

370 | Network applications of Bloom filters: A survey
- Broder, Mitzenmacher
- 2002
Citation Context ...racted significant interest in recent years due to a number of applications, mainly in distributed systems and database systems, where false positives can be tolerated and space usage is crucial (see [4] for a survey). Often the false positive probability that can be tolerated is relatively large, say, in the range 1%–10%, which entails that the space usage can be made much smaller than what would ...

205 | Compressed Bloom Filters
- Mitzenmacher
- 2001
Citation Context ...tructure differs from the lower bound $n \log_2(1/\varepsilon)$ by the space required for the minimum perfect hash function, and improves upon Bloom filters when $\varepsilon \le 2^{-4}$ and $n$ is sufficiently large. Mitzenmacher [19] considered the encoding problem where the task is to represent and transmit an approximate set representation (no fast queries required). However, even in this case existing techniques have a space o...

69 | The Bloomier filter: an efficient data structure for static support lookup tables
- Chazelle, Kilian, et al.
- 2004
Citation Context ...nally been addressed through the use of perfect hashing. Using the Hagerup-Tholey data structure yields a space usage of $nr + n \log_2 e + o(n)$ bits with constant query time. Recently, Chazelle et al. [8] presented a different approach to the problem. Each key is associated with k = O(1) locations in an array with O(n) entries of r bits. The answer to a retrieval query on x is found by combining the v...

50 | Low redundancy in static dictionaries with constant query time
- Pagh
- 2001
Citation Context ...rbitrarily, for example equal to 0. This observation implies that there is an alternative data structure whose redundancy is independent of r. A constant time rank data structure (e.g. Theorem 4.4 in [27]) can be used to identify the entries in $\{b_1, \ldots, b_n\}$ and map them to entries in a “compressed” array of size n. The space usage of the rank data structure is within a lower order term of the entr...

46 | Space efficient hash tables with worst case constant access time
- Fotakis, Pagh, et al.
Citation Context ...made to run in linear time. Section 4 describes a close relationship between the space requirements for dictionary implementations based on the balanced allocation paradigm (like k-ary cuckoo hashing [15]) and the space requirements for retrieval structures. 2 Retrieval in constant time and almost optimal space. In this section, we give the basic construction of a data structure for retrieval with cons...

31 | Exact and approximate membership testers
- Carter, Floyd, et al.
- 1978
Citation Context ...re which provides an elegant, near-optimal solution to the problem. Bloom showed [2, 4] that a space usage of $n \log_2(1/\varepsilon) \log_2 e$ bits suffices for a false positive probability of ε. Carter et al. [7] showed that $n \log_2(1/\varepsilon)$ bits are required for solving the approximate membership problem when |U| ≫ n (see also [13] for details). Thus Bloom filters have space usage within a factor $\log_2 e \approx 1.44$ ...

30 | Balanced allocation and dictionaries with tightly packed constant size bins. Theoretical Computer Science
- Dietzfelbinger, Weidling
- 2007
Citation Context ... ∈ S) is suitable for S.) Choose one such mapping and store $x_i$ in $T[\sigma(i)]$. Examples of constructions that follow this scheme are cuckoo hashing [20], k-ary cuckoo hashing [15], blocked cuckoo hashing [12, 21], and perfectly balanced allocation [10]. In [5, 14] threshold densities for blocked cuckoo hashing were determined exactly. These schemes are the most space-efficient dictionary structures known, amo...

25 | New classes and applications of hash functions
- Wegman, Carter
- 1979

22 | A family of perfect hashing methods
- Majewski, Wormald, et al.
- 1996
Citation Context ...es of entries associated with x, using bit-wise XOR. In place of the XOR operation, any abelian group operation may be used. In fact, this idea was used earlier by Majewski, Wormald, Havas, and Czech [17] and by Seiden and Hirschberg [23] to address the special case of order-preserving minimal perfect hashing. It is not hard to see that these data structures in fact solve the retrieval problem. The mai...

18 | Efficient minimal perfect hashing in nearly minimal space
- Hagerup, Tholey
Citation Context ...e membership is perfect hashing. A function $h\colon U \to [n]$ is a minimal perfect hash function for S if it maps the keys of S ⊆ U bijectively to $[n] = \{0, \ldots, n-1\}$, where n = |S|. Hagerup and Tholey [16] showed how to store a minimal perfect hash function h in a data structure of $n \log_2 e + o(n)$ bits such that h can be evaluated on a given input in constant time. This space usage is optimal. Now sto...

17 | Optimal static range reporting in one dimension
- Alstrup, Brodal, et al.
- 2001
Citation Context ...e the ranking of a given URL, without having to store the URL itself. The retrieval problem is also the key to obtaining a space-optimal RAM data structure that answers range queries in constant time [1, 18]. 1.2 Previous results. Approximate membership. The study of approximate membership was initiated by Bloom [2] who described the Bloom filter data structure which provides an elegant, near-optimal so...

17 | Perfectly balanced allocation
- Czumaj, Riley, et al.
- 2003
Citation Context ...pping and store $x_i$ in $T[\sigma(i)]$. Examples of constructions that follow this scheme are cuckoo hashing [20], k-ary cuckoo hashing [15], blocked cuckoo hashing [12, 21], and perfectly balanced allocation [10]. In [5, 14] threshold densities for blocked cuckoo hashing were determined exactly. These schemes are the most space-efficient dictionary structures known, among schemes that store the keys explicitl...

17 | Efficient hashing with lookups in two memory accesses
- Panigrahy
- 2005
Citation Context ... ∈ S) is suitable for S.) Choose one such mapping and store $x_i$ in $T[\sigma(i)]$. Examples of constructions that follow this scheme are cuckoo hashing [20], k-ary cuckoo hashing [15], blocked cuckoo hashing [12, 21], and perfectly balanced allocation [10]. In [5, 14] threshold densities for blocked cuckoo hashing were determined exactly. These schemes are the most space-efficient dictionary structures known, amo...

16 | On dynamic range reporting in one dimension
- Mortensen, Pagh, et al.
- 2005
Citation Context ...ing of a given URL, without having to store the URL itself. The retrieval problem is also the key to obtaining a space-optimal RAM data structure that is able to answer range queries in constant time [1, 25]. 1.3 Previous results. Approximate membership. The study of approximate membership was initiated by Bloom [5] who described the Bloom filter data structure which provides an elegant, near-optimal so...

14 | An optimal bloom filter replacement based on matrix solving
- Porat
- 2009
Citation Context ...mediately after a draft full version of this work appeared ([13], March 26, 2008), we were informed that E. Porat had independently worked on the same problems. His results are described in a report ([22], April 11, 2008). He also uses linear equations, however without restricting the weight of rows. The resulting problems with construction and evaluation time are circumvented by using a two-level spli...

14 | Simple and space-efficient minimal perfect hash functions
- Botelho, Pagh, et al.
- 2007
Citation Context ...d construction time O(n). Our results have a couple of other implications in data structures. We improve the space usage of a recent simple construction of (minimal) perfect hashing of Botelho et al. [4] (Section 7). In addition, we show a close relationship between “cuckoo hashing”-like dictionaries and retrieval structures (Section 6). This implies improved upper bounds on the space usage of k-ary ...

13 | The random graph threshold for k-orientiability and a fast algorithm for optimal multiple-choice allocation
- Cain, Sanders, et al.
- 2007
Citation Context ... store $x_i$ in $T[\sigma(i)]$. Examples of constructions that follow this scheme are cuckoo hashing [20], k-ary cuckoo hashing [15], blocked cuckoo hashing [12, 21], and perfectly balanced allocation [10]. In [5, 14] threshold densities for blocked cuckoo hashing were determined exactly. These schemes are the most space-efficient dictionary structures known, among schemes that store the keys explicitly in a hash ...

12 | Dependent sets of constant weight binary vectors
- Calkin
- 1997
Citation Context ...ce In this section, we give the basic construction of a data structure for retrieval with constant time lookup operation and $(1 + \delta)nr$ space. As a technical basis, we first describe results by Calkin [6]. 2.1 Calkin’s results. All calculations are over the field $\mathrm{GF}(2) = \mathbb{Z}_2$ with 2 elements. We consider binary matrices $M = (p_{ij})_{1 \le i \le n,\, 0 \le j < m}$ with n rows and m columns. If M is such a matrix, then row vecto...

11 | The k-orientability thresholds for $G_{n,p}$
- Fernholz, Ramachandran
- 2007
Citation Context ... store $x_i$ in $T[\sigma(i)]$. Examples of constructions that follow this scheme are cuckoo hashing [20], k-ary cuckoo hashing [15], blocked cuckoo hashing [12, 21], and perfectly balanced allocation [10]. In [5, 14] threshold densities for blocked cuckoo hashing were determined exactly. These schemes are the most space-efficient dictionary structures known, among schemes that store the keys explicitly in a hash ...

8 | Static Dictionaries Supporting Rank
- Raman, Rao
Citation Context ...to a minimal perfect hash function. There are several plausible techniques for this, one of them as follows: One stores the set of locations in $\{0, \ldots, m-1\} - h(S)$ in a succinct rank data structure [30]. This table requires additional space of $0.035n \cdot \log_2(1.035/0.035) + n \cdot \log_2(1.035/1) \approx 0.22n + o(n)$ bits. The total space needed for the minimal perfect hash function is $2.29n + o(n)$ bits, which is a ...

7 | Architecture-conscious hashing
- Zukowski, Héman, et al.
- 2006
Citation Context ...mory lookups are nonadaptive, i.e., the memory addresses can be determined from the query only. This can be exploited by modern CPU architectures that are able to parallelize memory lookups (see e.g. [24]). In fact, Chazelle et al. also show how approximate membership can be incorporated into their data structure by extending array entries to $r + \log_2(1/\varepsilon)$ bits. This generalized data structure is cal...

4 | Random graphs and systems of linear equations in finite fields
- Kolchin
- 1994
Citation Context ... small values of k. (See the row for $\beta_k^{\mathrm{appr}}$ in Table 1.) Remark 3. Results similar to those of Calkin [7, 8], but for a different model, were obtained independently by Balakin, Kolchin, and Khokhlov [2, 20, 21]. Further results in a similar vein can be found in a paper by Cooper [12]. 2.2 The basic retrieval data structure. Now we are ready to describe a retrieval data structure. Assume $f\colon S \to \{0,1\}^r$ is gi...

3 | Hypercycles in a random hypergraph
- Balakin, Kolchin, et al.
- 1992
Citation Context ... small values of k. (See the row for $\beta_k^{\mathrm{appr}}$ in Table 1.) Remark 3. Results similar to those of Calkin [7, 8], but for a different model, were obtained independently by Balakin, Kolchin, and Khokhlov [2, 20, 21]. Further results in a similar vein can be found in a paper by Cooper [12]. 2.2 The basic retrieval data structure. Now we are ready to describe a retrieval data structure. Assume $f\colon S \to \{0,1\}^r$ is gi...

3 | A threshold effect for systems of random equations of a special form
- Kolchin, Khokhlov
- 1992
Citation Context ... small values of k. (See the row for $\beta_k^{\mathrm{appr}}$ in Table 1.) Remark 3. Results similar to those of Calkin [7, 8], but for a different model, were obtained independently by Balakin, Kolchin, and Khokhlov [2, 20, 21]. Further results in a similar vein can be found in a paper by Cooper [12]. 2.2 The basic retrieval data structure. Now we are ready to describe a retrieval data structure. Assume $f\colon S \to \{0,1\}^r$ is gi...

2 | Design Strategies for Minimal Perfect Hash Functions
- Dietzfelbinger
- 2007
Citation Context ..., and the scratch space needed is $O(n^2)$. Remark 3. At first glance, the time complexity of the construction seems to be forbiddingly large. However, a trick (“split-and-share”, described in [11] and in [13]) makes it possible to obtain a data structure with the same functionality and space bounds (up to a o(n) term) in time $O(n^{1+\delta})$ for any given δ > 0. In Section 3 we show how to construct...

2 | Dependent sets of constant weight vectors
- Calkin
- 1996
Citation Context ... seems that the approximation obtained by omitting the last term in (2) is quite good already for small values of k. (See the row for $\beta_k^{\mathrm{appr}}$ in Table 1.) Remark 3. Results similar to those of Calkin [7, 8], but for a different model, were obtained independently by Balakin, Kolchin, and Khokhlov [2, 20, 21]. Further results in a similar vein can be found in a paper by Cooper [12]. 2.2 The basic retrieva...

2 | Asymptotics for dependent sums of random vectors. Random Structures and Algorithms
- Cooper
- 1999
Citation Context ...ar to those of Calkin [7, 8], but for a different model, were obtained independently by Balakin, Kolchin, and Khokhlov [2, 20, 21]. Further results in a similar vein can be found in a paper by Cooper [12]. 2.2 The basic retrieval data structure. Now we are ready to describe a retrieval data structure. Assume $f\colon S \to \{0,1\}^r$ is given, for a set $S = \{x_1, \ldots, x_n\}$. For a given (fixed) k ≥ 3 let 1 + δ > β −...

2 | On the rank of random matrices, Random Struct.
- Cooper
- 2000
Citation Context ...ructure and its analysis, using a result due to Calkin [8]. This leads to part (b) of Theorem 1, except that the construction time is $O(n^3)$. Part (a) is shown in Section 3, using a result of Cooper [13]. The reduction of approximate membership to retrieval, Theorem 2, is presented in Section 4. Section 5 completes the proof of part (b) of Theorem 1 by showing how the construction algorithm can be ma...

1 | On the rank of random matrices, Random Struct. Algorithms 16(2) 2001
- Cooper
Citation Context ...we use k(x) hash functions for key x, where k(x), x ∈ S, are independent random variables, each approximately binomially distributed with expectation Θ(log n), and a range size m = n. Theorem 2(a) in [9] entails that the resulting square matrix will be regular with probability > 0.28. It takes $O(n^3)$ time to test one matrix; trying O(log n) sets of hash functions will be sufficient with high probability to find a se...

1 | Finding succinct ordered minimal perfect hash functions
- Seiden, Hirschberg
- 1994
Citation Context ...sing bit-wise XOR. In place of the XOR operation, any abelian group operation may be used. In fact, this idea was used earlier by Majewski, Wormald, Havas, and Czech [17] and by Seiden and Hirschberg [23] to address the special case of order-preserving minimal perfect hashing. It is not hard to see that these data structures in fact solve the retrieval problem. The main result of [17] is that for k = 3...

1 | Cuckoo hashing. Journal of Algorithms 51:122–144 (2004)
- Pagh, Rodler
Citation Context ...s $\sigma(i) \in A_{x_i}$, for 1 ≤ i ≤ n. (In this case we say $(A_x, x \in S)$ is suitable for S.) Choose one such mapping and store $x_i$ in $T[\sigma(i)]$. Examples of constructions that follow this scheme are cuckoo hashing [28], k-ary cuckoo hashing [18], blocked cuckoo hashing [16, 29], and perfectly balanced allocation [14]. In [6, 17] threshold densities for blocked cuckoo hashing were determined exactly. These schemes a...

1 | Architecture-conscious hashing
- Zukowski, Héman, et al.
Citation Context ...mory lookups are nonadaptive, i.e., the memory addresses can be determined from the query only. This can be exploited by modern CPU architectures that are able to parallelize memory lookups (see e.g. [33]). In fact, Chazelle et al. also show how approximate membership can be incorporated into their data structure by extending array entries to $r + \log_2(1/\varepsilon)$ bits. This generalized data structure is 1 B...