Annotations in Data Streams
2009
Abstract

Cited by 10 (5 self)
The central goal of data stream algorithms is to process massive streams of data using sublinear storage space. Motivated by work in the database community on outsourcing database and data stream processing, we ask whether the space usage of such algorithms can be further reduced by enlisting a more powerful “helper” who can annotate the stream as it is read. We do not wish to blindly trust the helper, so we require that the algorithm be convinced of having computed a correct answer. We show upper bounds that achieve a nontrivial tradeoff between the amount of annotation used and the space required to verify it. We also prove lower bounds on such tradeoffs, often nearly matching the upper bounds, via notions related to Merlin-Arthur communication complexity. Our results cover the classic data stream problems of selection, frequency moments, and fundamental graph problems such as triangle-freeness and connectivity. Our work is also part of a growing trend (including recent studies of multipass streaming, read/write streams and randomly ordered streams) of asking more complexity-theoretic questions about data stream processing. It is a recognition that, in addition to practical relevance, the data stream model raises many interesting theoretical questions in its own right.
Testing Closeness of Discrete Distributions
Abstract

Cited by 5 (0 self)
Given samples from two distributions over an n-element set, we wish to test whether these distributions are statistically close. We present an algorithm which uses a number of independent samples from each distribution that is sublinear in n, specifically $O(n^{2/3} \epsilon^{-8/3} \log n)$, runs in time linear in the sample size, makes no assumptions about the structure of the distributions, and distinguishes the cases when the distance between the distributions is small (less than $\max\{\epsilon^{4/3} n^{-1/3}/32,\ \epsilon n^{-1/2}/4\}$) or large (more than $\epsilon$) in $\ell_1$ distance. This result can be compared to the lower bound of $\Omega(n^{2/3} \epsilon^{-2/3})$ for this problem given by Valiant [2008]. Our algorithm has applications to the problem of testing whether a given Markov process is rapidly mixing. We present sublinear algorithms for several variants of this problem as well.
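To get a feel for the sample bounds quoted in this abstract, the following sketch evaluates them numerically. It is only an illustration: the hidden constants in the O and Ω notation are set to 1 and the logarithm is taken to be natural, both of which are assumptions, and the function names are hypothetical. It does not implement the testing algorithm itself.

```python
import math


def upper_bound_samples(n, eps):
    """n^(2/3) * eps^(-8/3) * log n, with the hidden constant assumed to be 1."""
    return n ** (2 / 3) * eps ** (-8 / 3) * math.log(n)


def lower_bound_samples(n, eps):
    """n^(2/3) * eps^(-2/3), with the hidden constant assumed to be 1."""
    return n ** (2 / 3) * eps ** (-2 / 3)


def small_distance_threshold(n, eps):
    """max{eps^(4/3) * n^(-1/3) / 32, eps * n^(-1/2) / 4} from the abstract."""
    return max(eps ** (4 / 3) * n ** (-1 / 3) / 32, eps * n ** (-1 / 2) / 4)


if __name__ == "__main__":
    n, eps = 10 ** 9, 0.5
    print("upper bound:", upper_bound_samples(n, eps))
    print("lower bound:", lower_bound_samples(n, eps))
    print("'small' threshold:", small_distance_threshold(n, eps))
```

Under these assumptions the two sample bounds differ by a factor of $\epsilon^{-2} \log n$, which is the gap between the stated upper bound and the cited lower bound.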
Rectangle-Efficient Aggregation in Spatial Data Streams
Abstract

Cited by 1 (0 self)
We consider the estimation of aggregates over a data stream of multidimensional axis-aligned rectangles. Rectangles are a basic primitive object in spatial databases, and efficient aggregation of rectangles is a fundamental task. The data stream model has emerged as a de facto model for processing massive databases in which the data resides in external memory or the cloud and is streamed through main memory. For a point p, let n(p) denote the sum of the weights of all rectangles in the stream that contain p. We give near-optimal solutions for basic problems, including (1) the kth frequency moment $F_k = \sum_{\text{points } p} n(p)^k$, (2) the counting version of stabbing queries, which seeks an estimate of n(p) given p, and (3) identification of heavy hitters, i.e., points p for which n(p) is large. An important special case of $F_k$ is $F_0$, which corresponds to the volume of the union of the rectangles. This is a celebrated problem in computational geometry known as “Klee’s measure problem”, and our work yields the first solution in the streaming model for dimensions greater than one.
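The quantities defined in this abstract can be made concrete with a brute-force reference computation. The sketch below is not the streaming algorithm from the paper: it stores every covered cell explicitly, and it assumes two-dimensional rectangles with integer corners, treated as half-open boxes over unit grid cells, so that each cell plays the role of a point p. All function names and the heavy-hitter threshold parameter are hypothetical.

```python
from collections import defaultdict


def coverage(rects):
    """Return n(p) for every covered cell p.

    rects: iterable of (x1, y1, x2, y2, weight), read as the half-open
    integer box [x1, x2) x [y1, y2). This layout is an assumption.
    """
    n = defaultdict(int)
    for x1, y1, x2, y2, w in rects:
        for x in range(x1, x2):
            for y in range(y1, y2):
                n[(x, y)] += w
    return n


def frequency_moment(rects, k):
    """F_k = sum over cells p of n(p)^k; F_0 counts cells with n(p) != 0."""
    n = coverage(rects)
    if k == 0:
        return sum(1 for v in n.values() if v != 0)
    return sum(v ** k for v in n.values())


def stabbing_count(rects, p):
    """Exact n(p) for a single query cell p."""
    return coverage(rects).get(p, 0)


def heavy_hitters(rects, threshold):
    """Cells p with n(p) >= threshold (threshold is a hypothetical parameter)."""
    return sorted(p for p, v in coverage(rects).items() if v >= threshold)


if __name__ == "__main__":
    # Two unit-weight 2x2 squares that overlap in exactly one cell, (1, 1).
    rects = [(0, 0, 2, 2, 1), (1, 1, 3, 3, 1)]
    print(frequency_moment(rects, 0))  # area of the union, in cells
    print(frequency_moment(rects, 2))
    print(stabbing_count(rects, (1, 1)))
    print(heavy_hitters(rects, 2))
```

With unit cells, $F_0$ is the area of the union of the rectangles, which is the Klee's measure quantity mentioned in the abstract. A reference implementation like this one is mainly useful as a ground truth for checking an approximate estimator on small inputs.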
Near-Optimal Private Approximation Protocols via a Black Box Transformation
Abstract

Cited by 1 (0 self)
We show the following transformation: any two-party protocol for outputting a (1 + ε)-approximation to $f(x, y) = \sum_{j=1}^{n} g(x_j, y_j)$ with probability at least 2/3, for any nonnegative efficiently computable function g, can be transformed into a two-party private approximation protocol with only a polylogarithmic factor loss in communication, computation, and round complexity. In general it is insufficient to use secure function evaluation or fully homomorphic encryption on a standard, non-private protocol for approximating f. This is because the approximation may reveal information about x and y that does not follow from f(x, y). Applying our transformation and variations of it, we obtain near-optimal private approximation protocols for a wide range of problems in the data stream literature for which previously nothing was known. We give near-optimal private approximation protocols for the $\ell_p$-distance for every p ≥ 0, for the heavy hitters and importance sampling problems with respect to any $\ell_p$-norm, for the max-dominance and other dominant $\ell_p$-norms, for the distinct summation problem, for entropy, for cascaded frequency moments, for subspace approximation and block sampling, and for measuring independence of datasets. Using a result for data streams, we obtain private approximation protocols with polylogarithmic communication for every nondecreasing and symmetric function $g(x_j, y_j) = h(x_j - y_j)$ with at most quadratic growth. If the original (non-private) protocol is a simultaneous protocol, e.g., a sketching algorithm, then our only cryptographic assumption is efficient symmetric computationally-private information retrieval; otherwise it is fully homomorphic encryption. For all but one of these problems, the original protocol is a sketching algorithm. Our protocols generalize straightforwardly to more than two parties.
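As a small illustration of the function class in this abstract, the sketch below computes f exactly as a coordinate-wise sum of g and checks whether an estimate lies within a (1 ± ε) factor of it. It says nothing about privacy or about the transformation itself. The two-sided reading of "(1 + ε)-approximation", the choice of squared difference as an example g of the form h(x_j − y_j), and all function names are assumptions made for illustration.

```python
def f(x, y, g):
    """Exact f(x, y) = sum over j of g(x_j, y_j) for equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return sum(g(a, b) for a, b in zip(x, y))


def squared_difference(a, b):
    """Example g of the form h(a - b) with h(t) = t^2 (an illustrative choice)."""
    return (a - b) ** 2


def is_approximation(estimate, exact, eps):
    """Two-sided check: (1 - eps) * exact <= estimate <= (1 + eps) * exact.

    The two-sided form is an assumption; the abstract only says (1 + eps).
    """
    return (1 - eps) * exact <= estimate <= (1 + eps) * exact


if __name__ == "__main__":
    x, y = [1, 2, 3], [3, 2, 0]
    exact = f(x, y, squared_difference)
    print(exact)
    print(is_approximation(13.5, exact, 0.1))
```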
BERTINORO WORKSHOP PARTICIPANTS:
2011
Abstract
ABSTRACT. This document contains a list of open problems and research directions that have been suggested
Testing Properties of Collections of Distributions (www.theoryofcomputing.org)
2011
Abstract
We propose a framework for studying property testing of collections of distributions, where the number of distributions in the collection is a parameter of the problem. Previous work on property testing of distributions considered single distributions or pairs of distributions. We suggest two models that differ in the way the algorithm is given access to samples from the distributions. In one model the algorithm may ask for a sample from any distribution of its choice, and in the other the choice of the distribution is random. Our main focus is on the basic problem of distinguishing between the case that all the distributions in the collection are the same (or very similar), and the case that it is necessary to modify the distributions in the collection in a non-negligible manner so as to obtain this property. We give almost tight upper and lower bounds for this testing problem, and we also study an extension to a clusterability property. One of our lower bounds directly implies a lower bound on testing independence of a joint distribution, a result which was left open by previous work.
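The two access models described in this abstract can be sketched as two methods on one object. This is only an illustration of the interface, not of any testing algorithm. The class and method names are hypothetical, each distribution is represented as a dictionary from outcome to probability, and the random model is assumed to pick the distribution index uniformly (the abstract only says the choice is random).

```python
import random


class DistributionCollection:
    """A collection of discrete distributions with two kinds of sample access."""

    def __init__(self, distributions, seed=None):
        # distributions: list of dicts mapping outcome -> probability.
        self.distributions = distributions
        self.rng = random.Random(seed)

    def _draw(self, i):
        """Draw one outcome from the i-th distribution."""
        outcomes = list(self.distributions[i].keys())
        weights = list(self.distributions[i].values())
        return self.rng.choices(outcomes, weights=weights, k=1)[0]

    def query_sample(self, i):
        """Model 1: the algorithm chooses which distribution to sample from."""
        return self._draw(i)

    def random_sample(self):
        """Model 2: the index is random (uniform here, by assumption)."""
        i = self.rng.randrange(len(self.distributions))
        return i, self._draw(i)


if __name__ == "__main__":
    coll = DistributionCollection(
        [{"a": 0.5, "b": 0.5}, {"a": 0.9, "b": 0.1}], seed=0
    )
    print(coll.query_sample(1))
    print(coll.random_sample())
```

Separating the two access methods makes it explicit which model a tester relies on: a tester written against `random_sample` alone cannot direct its samples toward a particular distribution.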