Testing Closeness of Discrete Distributions
Abstract

Cited by 28 (1 self)
Given samples from two distributions over an n-element set, we wish to test whether these distributions are statistically close. We present an algorithm which uses a number of independent samples from each distribution that is sublinear in n, specifically $O(n^{2/3} \epsilon^{-8/3} \log n)$, runs in time linear in the sample size, makes no assumptions about the structure of the distributions, and distinguishes the cases when the distance between the distributions is small (less than $\max\{\epsilon^{4/3} n^{-1/3}/32,\ \epsilon n^{-1/2}/4\}$) or large (more than $\epsilon$) in $\ell_1$ distance. This result can be compared to the lower bound of $\Omega(n^{2/3} \epsilon^{-2/3})$ for this problem given by Valiant [2008]. Our algorithm has applications to the problem of testing whether a given Markov process is rapidly mixing. We present sublinear algorithms for several variants of this problem as well.
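The thresholds quoted in the abstract above can be evaluated numerically. The following is a minimal sketch that only computes the stated quantities; it is not the paper's testing algorithm, and the function names (`l1_distance`, `closeness_thresholds`, `sample_bound`) are hypothetical. The constant hidden by the $O(\cdot)$ notation is unknown and is taken to be 1 here purely for illustration.

```python
import math


def l1_distance(p, q):
    """Exact l1 distance between two distributions given as equal-length probability lists."""
    return sum(abs(a - b) for a, b in zip(p, q))


def closeness_thresholds(n, eps):
    """Return the (small, large) l1 thresholds stated in the abstract.

    small = max(eps^(4/3) * n^(-1/3) / 32, eps * n^(-1/2) / 4), large = eps.
    """
    small = max(eps ** (4 / 3) * n ** (-1 / 3) / 32, eps * n ** (-1 / 2) / 4)
    return small, eps


def sample_bound(n, eps):
    """n^(2/3) * eps^(-8/3) * log n, with the hidden constant assumed to be 1."""
    return n ** (2 / 3) * eps ** (-8 / 3) * math.log(n)
```

Distances that fall strictly between the two thresholds are not covered by the guarantee, so a tester is free to answer either way in that gap.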
Pan-private streaming algorithms
In Proceedings of ICS, 2010
Abstract

Cited by 27 (3 self)
Collectors of confidential data, such as governmental agencies, hospitals, or search engine providers, can be pressured to permit data to be used for purposes other than that for which they were collected. To support the data curators, we initiate a study of pan-private algorithms; roughly speaking, these algorithms retain their privacy properties even if their internal state becomes visible to an adversary. Our principal focus is on streaming algorithms, where each datum may be discarded immediately after processing.
Privacy-preserving statistical estimation with optimal convergence rate
In Proceedings of the 43rd Annual ACM Symposium on Theory of Computing, 2011
Abstract

Cited by 25 (1 self)
Consider an analyst who wants to release aggregate statistics about a data set containing sensitive information. Using differentially private algorithms guarantees that the released statistics reveal very little about any particular record in the data set. In this paper we study the asymptotic properties of differentially private algorithms for statistical inference. We show that for a large class of statistical estimators T and input distributions P, there is a differentially private estimator $A_T$ with the same asymptotic distribution as T. That is, the random variables $A_T(X)$ and $T(X)$ converge in distribution when X consists of an i.i.d. sample from P of increasing size. This implies that $A_T(X)$ is essentially as good as the original statistic $T(X)$ for statistical inference, for sufficiently large samples. Our technique applies to (almost) any pair T, P such that T is asymptotically normal on i.i.d. samples from P; in particular, it applies to parametric maximum likelihood estimators and estimators for logistic and linear regression under standard regularity conditions. A consequence of our techniques is the existence of low-space streaming algorithms whose output converges to the same asymptotic distribution as a given estimator T (for the same class of estimators and input distributions as above).
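To make the notion of a differentially private estimator concrete, here is a minimal sketch of the standard Laplace mechanism applied to a bounded sample mean. This is a textbook construction swapped in for illustration, not the estimator $A_T$ constructed in the paper; the names `laplace_noise` and `private_mean` and the clipping bounds `lo`, `hi` are assumptions of the sketch.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two i.i.d. exponentials with mean `scale`."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_mean(xs, lo, hi, epsilon, rng=None):
    """Epsilon-differentially private mean of values clipped to [lo, hi].

    Assumption: the sample size n is public and fixed, so replacing one
    record changes the clipped mean by at most (hi - lo) / n.
    """
    rng = rng or random.Random()
    n = len(xs)
    clipped = [min(max(x, lo), hi) for x in xs]
    sensitivity = (hi - lo) / n
    return sum(clipped) / n + laplace_noise(sensitivity / epsilon, rng)
```

In this simple case the added noise has scale (hi - lo) / (n * epsilon), which shrinks faster than the 1/sqrt(n) sampling error of the mean, so for fixed epsilon the private and non-private means share the same limiting distribution. That is the flavor of guarantee the abstract states for a much larger class of estimators.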
The Shifting Sands Algorithm
Abstract

Cited by 2 (1 self)
We resolve the problem of small-space approximate selection in random-order streams. Specifically, we present an algorithm that reads the n elements of a set in random order and returns an element whose rank differs from the true median by at most $n^{1/3+o(1)}$ while storing a constant number of elements and counters at any one time. This is optimal: it was previously shown that achieving better accuracy required poly(n) memory. However, it was conjectured that the lower bound was not tight and that a previous algorithm achieving an $n^{1/2+o(1)}$ approximation was optimal. We therefore consider the new result a surprising resolution to a natural and basic question.
Stochastic Streams: Sample Complexity vs. Space Complexity
Abstract
We address the tradeoff between the computational resources needed to process a large data set and the number of samples available from the data set. Specifically, we consider the following abstraction: we receive a potentially infinite stream of IID samples from some unknown distribution D, and are tasked with computing some function f(D). If the stream is observed for time t, how much memory, s, is required to estimate f(D)? We refer to t as the sample complexity and s as the space complexity. The main focus of this paper is investigating the tradeoffs between the space and sample complexity. We study these tradeoffs for two canonical problems: undirected graph connectivity and estimating frequency moments. Our algorithms are based on techniques for emulating random walks and simulating different sampling procedures given a sequence of IID samples.
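The stream abstraction above can be illustrated with a simple baseline for connectivity. The sketch below assumes D is the uniform distribution over the edges of a fixed undirected graph on n vertices (an assumption made for illustration) and uses a union-find structure, so its space is s = O(n) words. It does not implement the paper's random-walk emulation; `UnionFind`, `iid_edges`, and `connected_after` are hypothetical names.

```python
import random


class UnionFind:
    """Disjoint-set forest over vertices 0..n-1, using O(n) words of memory."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.components = n

    def find(self, x):
        # Path halving: point each visited node at its grandparent.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[ra] = rb
            self.components -= 1


def iid_edges(edges, rng):
    """Infinite stream of IID samples from the uniform distribution over `edges`."""
    while True:
        yield rng.choice(edges)


def connected_after(stream, n, t):
    """Observe the first t samples and report whether the edges seen so far connect all n vertices."""
    uf = UnionFind(n)
    for _, (u, v) in zip(range(t), stream):
        uf.union(u, v)
    return uf.components == 1
```

With this baseline, observing the stream for longer (larger t) only affects the chance that every edge has been seen; the memory stays at O(n). The tradeoff studied in the abstract concerns how much smaller s can be made when t is allowed to grow.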