## Fast evaluation of union-intersection expressions (2007)

Citations: 5 (1 self)

### BibTeX

@TECHREPORT{Bille07fastevaluation,
    author = {Philip Bille and Anna Pagh and Rasmus Pagh},
    title = {Fast evaluation of union-intersection expressions},
    institution = {},
    year = {2007}
}

### Abstract

We show how to represent sets in a linear space data structure such that expressions involving unions and intersections of sets can be computed in a worst-case efficient way. This problem has applications in, e.g., information retrieval and database systems. We mainly consider the RAM model of computation, and sets of machine words, but also state our results in the I/O model. On a RAM with word size $w$, a special case of our result is that the intersection of $m$ (preprocessed) sets, containing $n$ elements in total, can be computed in expected time $O(n(\log w)^2/w + km)$, where $k$ is the number of elements in the intersection. If the first of the two terms dominates, this is a factor $w^{1-o(1)}$ faster than the standard solution of merging sorted lists. We show a cell probe lower bound of time $\Omega(n/(wm \log m) + (1 - \frac{\log k}{w})k)$, meaning that our upper bound is nearly optimal for small $m$. Our algorithm uses a novel combination of approximate set representations and word-level parallelism.
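
For orientation, the "standard solution of merging sorted lists" that the abstract compares against can be sketched as follows. This is only a minimal illustration of the comparison-based baseline, not the paper's algorithm; the function names `intersect_sorted` and `intersect_all` are hypothetical, and each input is assumed to be a sorted list of distinct integers.

```python
from functools import reduce


def intersect_sorted(a, b):
    """Intersect two sorted lists of distinct items by a linear merge."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            i += 1
        elif a[i] > b[j]:
            j += 1
        else:
            # Element present in both lists.
            out.append(a[i])
            i += 1
            j += 1
    return out


def intersect_all(sorted_lists):
    """Intersect m sorted lists by folding the pairwise merge over them."""
    if not sorted_lists:
        raise ValueError("need at least one list")
    return reduce(intersect_sorted, sorted_lists)
```

Each pairwise merge touches every element of both lists once, so the total work is linear in the input size regardless of how small the intersection is; this is the cost the abstract's first term improves on by roughly a factor of $w$.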

### Citations

1597 | Space/time trade-offs in hash coding with allowable errors
- Bloom
- 1970
Citation Context ...xtensive previous work on approximate set representations, mainly motivated by applications in networking and distributed systems [8]. Much of this work builds upon the seminal paper on Bloom filters [7]. A Bloom filter for a set S is an approximate representation of S in the sense that for any x ∉ S the filter can be used to determine that x ∉ S with probability close to 1. However, for an ε fract...

710 | Universal classes of hash functions
- Carter, Wegman
- 1979
Citation Context ...e $\{0, 1\}^r$, where $r = \log n + O(\log w)$ and $n$ is the total size of the input sets. The hash functions will all be derived from a single "mother" hash function $h^*$, a strongly universal hash function [10, 19] with values in the range $\{0, 1\}^w$. This is a global hash function that is shared for all sets. The hash function $h_r$, for $1 \le r \le w$, is defined by $h_r(x) = h^*(x) \operatorname{div} 2^{w-r}$, where "div" denotes integ...
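
The derivation of $h_r$ from the mother hash function quoted above amounts to keeping the top $r$ bits of a $w$-bit hash value. The sketch below illustrates this under stated assumptions: the mother function is a multiply-add-shift scheme chosen here as a stand-in (the excerpt does not say which strongly universal family the authors use), and the names `make_mother_hash`, `derive`, and `W` are hypothetical.

```python
import random

W = 64  # assumed word size w in bits


def make_mother_hash(w=W, seed=None):
    """Return a stand-in mother hash h* mapping w-bit keys to w-bit values.

    Assumption: multiply-add-shift with random 2w-bit parameters a and b;
    the excerpt does not specify the family used in the paper.
    """
    rng = random.Random(seed)
    a = rng.getrandbits(2 * w)
    b = rng.getrandbits(2 * w)
    mask = (1 << (2 * w)) - 1

    def h_star(x):
        # Keep the high w bits of (a*x + b) mod 2^(2w).
        return ((a * x + b) & mask) >> w

    return h_star


def derive(h_star, r, w=W):
    """Return h_r with h_r(x) = h*(x) div 2^(w - r), for 1 <= r <= w."""
    if not 1 <= r <= w:
        raise ValueError("r must satisfy 1 <= r <= w")
    return lambda x: h_star(x) >> (w - r)
```

Because every $h_r$ is a prefix of the same $w$-bit value, $h_r(x)$ can be obtained from $h_{r+1}(x)$ by dropping one bit, which is what lets a single global hash function serve sets of different sizes.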

640 | Communication complexity
- Kushilevitz, Nisan
- 1997
Citation Context ...e all bit strings have a 1. We allow the protocol to behave in any way if this is not the case. Solving EQ exactly requires communication of $\Omega(n)$ bits, for both deterministic and randomized protocols [16,20]. That is, the trivial protocol where one player communicates her entire bit string is optimal. Chakrabarti et al. [12], based on work by Bar-Yossef et al. [4], showed that solving $\mathrm{DISJ}_{n,t}$ exactly req...

561 | The Input/Output complexity of sorting and related problems
- Aggarwal, Vitter
- 1988
Citation Context ...word length is sufficiently large, e.g. $w = (\log n)^{\omega(1)}$, our algorithm gains a factor $w^{1-o(1)}$ compared to [13]. We observe that all our results immediately imply nontrivial results in the I/O model [1]. For the upper bounds, this is because any RAM algorithm can be simulated in the same I/O bound as long as w is bounded by the number of bits in a disk block. In other words, if B is the number of wo...

396 | Some complexity questions related to distributive computing
- Yao
Citation Context ...e all bit strings have a 1. We allow the protocol to behave in any way if this is not the case. Solving EQ exactly requires communication of $\Omega(n)$ bits, for both deterministic and randomized protocols [16,20]. That is, the trivial protocol where one player communicates her entire bit string is optimal. Chakrabarti et al. [12], based on work by Bar-Yossef et al. [4], showed that solving $\mathrm{DISJ}_{n,t}$ exactly req...

390 | Network applications of Bloom filters: A survey
- Broder, Mitzenmacher
- 2005
Citation Context ... [13] in section 1.2. Approximate set representations: There has been extensive previous work on approximate set representations, mainly motivated by applications in networking and distributed systems [8]. Much of this work builds upon the seminal paper on Bloom filters [7]. A Bloom filter for a set S is an approximate representation of S in the sense that for any x ∉ S the filter can be used to dete...

162 | Information statistics approach to data stream and communication complexity
- Bar-Yossef, Jayram, et al.
- 2004
Citation Context ...terministic and randomized protocols [16,20]. That is, the trivial protocol where one player communicates her entire bit string is optimal. Chakrabarti et al. [12], based on work by Bar-Yossef et al. [4], showed that solving $\mathrm{DISJ}_{n,t}$ exactly requires $\Omega(n/(t \log t))$ bits of communication in expectation, even under the unique intersection assumption and when the protocol is randomized. Our main observat...

86 | Sorting in linear time
- Andersson, Hagerup, et al.
- 1995
Citation Context ...t representation of h(S) that is suitable for word-parallel set operations. Specifically, we show how set operations on small integers packed in words can be efficiently implemented, using ideas from [2, 3]. This allows us to quickly approximate the intersection of any two sets in the sense that we get a compressed list of references to the elements in the intersection plus a small fraction of the eleme...

73 | Near-optimal lower bounds on the multi-party communication complexity of set disjointness
- Chakrabarti, Khot, et al.
- 2003
Citation Context ...s communication of $\Omega(n)$ bits, for both deterministic and randomized protocols [16,20]. That is, the trivial protocol where one player communicates her entire bit string is optimal. Chakrabarti et al. [12], based on work by Bar-Yossef et al. [4], showed that solving $\mathrm{DISJ}_{n,t}$ exactly requires $\Omega(n/(t \log t))$ bits of communication in expectation, even under the unique intersection assumption and when the p...

69 | Membership in constant time and almost minimum space
- Brodnik, Munro
- 1999
Citation Context ...lement) does not lead to a significant speedup in general. Instead, a compact representation of the set h(S) is needed. We use a bucketed set representation, as in the dictionary of Brodnik and Munro [9], to get a compact representation of h(S) that is suitable for word-parallel set operations. Specifically, we show how set operations on small integers packed in words can be efficiently implemented, ...

65 | Adaptive set intersections, unions, and differences
- Demaine, López-Ortiz, et al.
- 2000

44 | A simple algorithm for merging two disjoint linearly ordered sets
- Hwang, Lin
- 1972
Citation Context ..., the number of sublists needed to form the sorted list of the union of the sets) is less than around $n/w$. Another idea that has been studied is, roughly speaking, to exploit asymmetry. Hwang and Lin [15] show that merging two sorted lists $S_1$ and $S_2$ requires $\Theta(|S_1| \log(1 + |S_2|/|S_1|))$ comparisons, for $|S_1| < |S_2|$, in the worst case over all input lists. This is significantly less than $O(|S_1| + |S_2|)$ ...
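
To make the idea of exploiting asymmetry concrete, the sketch below intersects a short sorted list with a much longer one by galloping (doubling) search followed by binary search. It is not Hwang and Lin's merging algorithm, only a common comparison-based technique in the same spirit; the name `intersect_asymmetric` is hypothetical, and both inputs are assumed to be sorted lists of distinct integers.

```python
from bisect import bisect_left


def intersect_asymmetric(small, large):
    """Intersect a short sorted list with a long one using galloping search."""
    out = []
    lo = 0  # every element of large before index lo is smaller than x
    for x in small:
        # Gallop: double the step until we pass x or run off the end.
        step = 1
        hi = lo
        while hi < len(large) and large[hi] < x:
            lo = hi + 1
            hi += step
            step *= 2
        # Binary search inside the bracketed window [lo, hi].
        pos = bisect_left(large, x, lo, min(hi + 1, len(large)))
        if pos < len(large) and large[pos] == x:
            out.append(x)
        lo = pos
    return out
```

When `small` is much shorter than `large`, the number of comparisons grows with the length of `small` times a logarithmic factor, rather than with the combined length of both lists as in a plain linear merge.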

40 | Improved parallel integer sorting without concurrent writing
- Albers, Hagerup
- 1992
Citation Context ...t representation of h(S) that is suitable for word-parallel set operations. Specifically, we show how set operations on small integers packed in words can be efficiently implemented, using ideas from [2, 3]. This allows us to quickly approximate the intersection of any two sets in the sense that we get a compressed list of references to the elements in the intersection plus a small fraction of the eleme...

40 | An Optimal Bloom Filter Replacement
- Pagh, Pagh, et al.
- 2005
Citation Context ...the data structure. This makes it hard to locate the set of input elements represented by a particular Bloom filter. Instead, we use the approximate set representation of Carter et al. [11] (see also [18]), which consists of storing, in a compact way, the image of the set under a universal hash function. 1.2 Setup and results: We consider fully parenthesized expressions with binary operators. That is, ...

33 | Exact and approximate membership testers
- Carter, Floyd, et al.
- 1978
Citation Context ...ributed across the data structure. This makes it hard to locate the set of input elements represented by a particular Bloom filter. Instead, we use the approximate set representation of Carter et al. [11] (see also [18]), which consists of storing, in a compact way, the image of the set under a universal hash function. 1.2 Setup and results: We consider fully parenthesized expressions with binary opera...

32 | Adaptive intersection and t-threshold problems
- Barbay, Kenyon
- 2002
Citation Context ... to spend time on preprocessing all sets, to decrease the time for answering queries. The search engine application has been the main motivation in several recent works on computing set intersections [5,13,14]. All these papers assume that elements are taken from an ordered set, and are accessed through comparisons. In particular, creating the canonical representation, a sorted list, is the best possible p...

12 | Even strongly universal hashing is pretty fast
- Thorup
- 2000
Citation Context ...e $\{0, 1\}^r$, where $r = \log n + O(\log w)$ and $n$ is the total size of the input sets. The hash functions will all be derived from a single "mother" hash function $h^*$, a strongly universal hash function [10, 19] with values in the range $\{0, 1\}^w$. This is a global hash function that is shared for all sets. The hash function $h_r$, for $1 \le r \le w$, is defined by $h_r(x) = h^*(x) \operatorname{div} 2^{w-r}$, where "div" denotes integ...

8 | Worst case optimal union-intersection expression evaluation
- Chiniforooshan, Farzan, et al.
- 1999
Citation Context ... to spend time on preprocessing all sets, to decrease the time for answering queries. The search engine application has been the main motivation in several recent works on computing set intersections [5,13,14]. All these papers assume that elements are taken from an ordered set, and are accessed through comparisons. In particular, creating the canonical representation, a sorted list, is the best possible p...

3 | Adaptive comparison-based algorithms for evaluating set queries
- Mirzazadeh
- 2004
Citation Context ... in which case the adaptive algorithm performs no better than standard merging. However, adaptive algorithms are able to exploit "easy" cases to achieve smaller running time. Mirzazadeh in his thesis [17] extended this line of work to arbitrary expressions with unions and intersections. These results are incomparable to those obtained in this paper: Our algorithm is faster for most problem instances, ...