Results

**1 - 2** of **2**

### Beyond Set Disjointness: The Communication Complexity of Finding the Intersection

Abstract

We consider the following fundamental communication problem: there is data that is distributed among servers, and the servers want to compute the intersection of their data sets, e.g., the common records in a relational database. They want to do this with as little communication and as few messages (rounds) as possible. They are willing to use randomization, and fail with a tiny probability. Given a protocol for computing the intersection, it can also be used to compute the exact Jaccard similarity, the rarity, the number of distinct elements, and joins between databases. Computing the intersection is at least as hard as the set disjointness problem, which asks whether the intersection is empty. Formally, in the two-server setting, the players hold subsets S, T ⊆ [n]. In many realistic scenarios, the sizes of S and T are significantly smaller than n, so we impose the constraint that |S|, |T| ≤ k. We study the minimum number of bits the parties need to communicate in order to compute the intersection set S ∩ T, given a certain number r of messages that are allowed to be exchanged. While O(k log(n/k)) bits is achieved trivially and deterministically with a single message, we ask what is possible with more than one message and with randomization. We give a smooth communication/round tradeoff which shows that with O(log* k) rounds, O(k) bits of communication is possible, which improves upon the trivial protocol by an order of magnitude. This is in contrast to other basic problems such as computing the union or symmetric difference, for which Ω(k log(n/k)) bits of communication is required for any number of rounds. For two players, known lower bounds for the easier problem of set disjointness imply our algorithms are optimal up to constant factors in communication and number of rounds. We extend our protocols to m-player protocols, obtaining an optimal O(mk) bits of communication with a similarly small number of rounds.
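The abstract notes that an intersection protocol also yields the exact Jaccard similarity and the number of distinct elements, and that a single deterministic message suffices at O(k log(n/k)) bits. The following is a minimal Python sketch of both observations, not the paper's protocol. The function names are hypothetical, a local set intersection stands in for the protocol's output, and the sketch assumes each server already knows both set sizes (which costs only O(log k) extra bits to exchange).

```python
from math import comb


def derived_statistics(s: set, t: set) -> dict:
    """Statistics that follow once S ∩ T and the sizes |S|, |T| are known."""
    # Stand-in for the output of an intersection protocol (assumption:
    # the real protocol would deliver this set over the network).
    inter = s & t
    # Inclusion-exclusion: |S ∪ T| = |S| + |T| - |S ∩ T|.
    distinct = len(s) + len(t) - len(inter)
    # Convention chosen here (an assumption): two empty sets have similarity 1.
    jaccard = len(inter) / distinct if distinct else 1.0
    return {"intersection": inter, "distinct": distinct, "jaccard": jaccard}


def trivial_one_message_bits(n: int, k: int) -> int:
    """Bits needed to name one subset of [n] of size at most k.

    This is the cost of the trivial protocol in which one server simply
    sends its whole set; it grows as O(k log(n/k)).
    """
    num_subsets = sum(comb(n, i) for i in range(k + 1))
    # Exact integer ceil(log2(num_subsets)) without floating point.
    return (num_subsets - 1).bit_length()


if __name__ == "__main__":
    print(derived_statistics({1, 2, 3}, {2, 3, 4}))
    print(trivial_one_message_bits(2**20, 100))
```

Comparing `trivial_one_message_bits(n, k)` against a small constant times `k` gives a rough feel for the gap that the O(k)-bit result closes when n is much larger than k.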

### Selectivity Estimation on Streaming Spatio-Textual Data Using Local Correlations

Abstract

In this paper, we investigate the selectivity estimation problem for streaming spatio-textual data, which arises in many social network and geo-location applications. Specifically, given a set of continuously and rapidly arriving spatio-textual objects, each of which is described by a geo-location and a short text, we aim to accurately estimate the cardinality of a spatial keyword query on objects seen so far, where a spatial keyword query consists of a search region and a set of query keywords. To the best of our knowledge, this is the first work to address this important problem. We first extend two existing techniques to solve this problem, and show their limitations. Inspired by two key observations on the “locality” of the correlations among query keywords, we propose a local correlation based method by utilizing an augmented adaptive space partition tree (A2SP-tree for short) to approximately learn a local Bayesian network on the fly for a given query and estimate its selectivity. A novel local boosting approach is presented to further enhance the learning accuracy of local Bayesian networks. Our comprehensive experiments on real-life datasets demonstrate the superior performance of the local correlation based algorithm in terms of estimation accuracy compared to other competitors.
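To make the estimated quantity concrete, here is a small Python sketch of the exact answer that a selectivity estimator tries to approximate. It is a brute-force baseline, not the A2SP-tree method from the abstract. The class and function names are hypothetical, and two details are assumptions the abstract does not state: the search region is an axis-aligned rectangle, and an object matches only if its text contains every query keyword.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class SpatioTextualObject:
    x: float
    y: float
    keywords: frozenset  # tokens of the object's short text


def exact_selectivity(
    objects: Iterable[SpatioTextualObject],
    region: Tuple[float, float, float, float],
    query_keywords: Iterable[str],
) -> int:
    """Count objects seen so far that fall in the region and match the keywords.

    region is (x_min, y_min, x_max, y_max), an assumed rectangular shape.
    Matching uses AND semantics over the query keywords (also an assumption).
    """
    x_min, y_min, x_max, y_max = region
    wanted = frozenset(query_keywords)
    return sum(
        1
        for obj in objects
        if x_min <= obj.x <= x_max
        and y_min <= obj.y <= y_max
        and wanted <= obj.keywords
    )


if __name__ == "__main__":
    seen = [
        SpatioTextualObject(1.0, 1.0, frozenset({"coffee", "wifi"})),
        SpatioTextualObject(2.0, 2.0, frozenset({"coffee"})),
        SpatioTextualObject(9.0, 9.0, frozenset({"coffee", "wifi"})),
    ]
    print(exact_selectivity(seen, (0.0, 0.0, 5.0, 5.0), ["coffee", "wifi"]))
```

A scan like this touches every stored object per query, which is impractical for a rapidly arriving stream; that cost is the motivation for estimating the count from a compact summary instead.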