Results 1 - 10 of 45
The art of uninformed decisions: A primer to property testing
- Science, 2001
Cited by 108 (17 self)
Property testing is a new field in computational theory that deals with the information that can be deduced from an input when the number of allowable queries (reads from the input) is significantly smaller than its size.
Detecting Change in Data Streams
- 2004
Cited by 68 (2 self)
Detecting changes in a data stream is an important area of research with many applications.
Analysis of representations for domain adaptation
- In NIPS, 2007
Cited by 67 (9 self)
Domain: a distribution D on an instance set X. Domain adaptation of a classifier. A classification task. Source domain (D_S).
Sampling Algorithms: Lower Bounds and Applications (Extended Abstract)
- 2001
Cited by 43 (2 self)
Ziv Bar-Yossef (Computer Science Division, U.C. Berkeley), Ravi Kumar (IBM Almaden), D. Sivakumar (IBM Almaden). We develop a framework to study probabilistic sampling algorithms that approximate general functions of the form f : A^n → B, where A and B are arbitrary sets. Our goal is to obtain lower bounds on the query complexity of functions, namely the number of input variables x_i that any sampling algorithm needs to query to approximate f(x_1, ..., x_n). We define two quantitative properties of functions, the block sensitivity and the minimum Hellinger distance, that give us techniques to prove lower bounds on the query complexity. These techniques are quite general, easy to use, yet powerful enough to yield tight results. Our applications include the mean and higher statistical moments, the median and other selection functions, and the frequency moments, where we obtain lower bounds that are close to the corresponding upper bounds. We also point out some connections between sampling and streaming algorithms and lossy compression schemes.
Testing random variables for independence and identity
- Proceedings of the 41st Annual Symposium on Foundations of Computer Science, 2000
Cited by 36 (14 self)
Given access to independent samples of a distribution A over [n] × [m], we show how to test whether the distributions formed by projecting A to each coordinate are independent, i.e., whether A is ε-close in the L1 norm to the product distribution A1 × A2 for some distributions A1 over [n] and A2 over [m]. The sample complexity of our test is Õ(n^(2/3) m^(1/3) poly(ε^(-1))), assuming without loss of generality that m ≤ n. We also give a matching lower bound, up to poly(log n, ε^(-1)) factors. Furthermore, given access to samples of a distribution X over [n], we show how to test if X is ε-close in L1 norm to an explicitly specified distribution Y. Our test uses Õ(n^(1/2) poly(ε^(-1))) samples, which nearly matches the known tight bounds for the case when Y is uniform.
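The uniform special case mentioned at the end of the abstract above is often illustrated with a collision statistic. The Python sketch below is only an illustration under stated assumptions, not the test from the paper: it relies on the standard fact that the uniform distribution over n elements has collision probability 1/n, while any distribution ε-far from uniform in L1 norm has collision probability at least (1 + ε²)/n. The function names and the midpoint threshold are hypothetical choices, and the sketch says nothing about how many samples are needed for the estimate to be reliable.

```python
from collections import Counter


def collision_rate(samples):
    """Fraction of unordered sample pairs that land on the same value."""
    m = len(samples)
    if m < 2:
        raise ValueError("need at least two samples")
    counts = Counter(samples)
    colliding_pairs = sum(c * (c - 1) // 2 for c in counts.values())
    return colliding_pairs / (m * (m - 1) // 2)


def looks_uniform(samples, n, eps):
    """Accept when the collision rate is below an illustrative threshold.

    The threshold (assumption) is the midpoint between 1/n, the collision
    probability of the uniform distribution on n elements, and
    (1 + eps^2)/n, a lower bound for distributions eps-far in L1 norm.
    """
    threshold = (1 + eps * eps / 2) / n
    return collision_rate(samples) <= threshold
```

With too few samples the estimate is noisy, so the accept/reject decision is only meaningful once the sample size is large enough for collisions to be observed.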
Streaming and sublinear approximation of entropy and information distances
- In ACM-SIAM Symposium on Discrete Algorithms, 2006
Cited by 33 (9 self)
In most algorithmic applications which compare two distributions, information theoretic distances are more natural than standard ℓp norms. In this paper we design streaming and sublinear time property testing algorithms for entropy and various information theoretic distances. Batu et al. posed the problem of property testing with respect to the Jensen-Shannon distance. We present optimal algorithms for estimating bounded, symmetric f-divergences (including the Jensen-Shannon divergence and the Hellinger distance) between distributions in various property testing frameworks. Along the way, we close a (log n)/H gap between the upper and lower bounds for estimating entropy H, yielding an optimal algorithm over all values of the entropy. In a data stream setting (sublinear space), we give the first algorithm for estimating the entropy of a distribution. Our algorithm runs in polylogarithmic space and yields an asymptotic constant factor approximation scheme. An integral part of the algorithm is an interesting use of an F0 (the number of distinct elements in a set) estimation algorithm; we also provide other results along the space/time/approximation tradeoff curve. Our results have interesting structural implications that connect sublinear time and space constrained algorithms. The mediating model is the random order streaming model, which assumes the input is a random permutation of a multiset and was first considered by Munro and Paterson in 1980. We show that any property testing algorithm in the combined oracle model for calculating a permutation-invariant function can be simulated in the random order model in a single pass. This addresses a question raised by Feigenbaum et al. regarding the relationship between property testing and stream algorithms. Further, we give a polylog-space PTAS for estimating the entropy of a one-pass random order stream. This bound cannot be achieved in the combined oracle (generalized property testing) model.
The complexity of approximating the entropy
- SIAM Journal on Computing
Cited by 26 (8 self)
We consider the problem of approximating the entropy of a discrete distribution under several different models of oracle access to the distribution. In the evaluation oracle model, the algorithm is given access to the explicit array of probabilities specifying the distribution. In this model, linear time in the size of the domain is both necessary and sufficient for approximating the entropy. In the generation oracle model, the algorithm has access only to independent samples from the distribution. In this case, we show that a γ-multiplicative approximation to the entropy can be obtained in O(n^((1+η)/γ²) log n) time for distributions with entropy Ω(γ/η), where n is the size of the domain of the distribution and η is an arbitrarily small positive constant. We show that this model does not permit a multiplicative approximation to the entropy in general. For the class of distributions to which our upper bound applies, we obtain a lower bound of Ω(n^(1/(2γ²))). We next consider a combined oracle model in which the algorithm has access to both the
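To make the two access models in this abstract concrete, here is a minimal Python sketch. It is an illustration only and not the algorithm analysed in the paper: `entropy_from_probs` computes the exact Shannon entropy when the full probability array is available (the evaluation oracle setting), and `plugin_entropy` is the naive plug-in estimate from independent samples (the generation oracle setting). Both function names are hypothetical, and the plug-in estimate carries none of the guarantees stated in the abstract.

```python
import math
from collections import Counter


def entropy_from_probs(probs):
    """Exact Shannon entropy (in bits) of an explicit probability array.

    Reads every entry once, i.e. linear time in the domain size.
    """
    return -sum(p * math.log2(p) for p in probs if p > 0)


def plugin_entropy(samples):
    """Naive plug-in entropy estimate (in bits) from independent samples.

    Uses empirical frequencies in place of the true probabilities; this
    is a baseline sketch, not the estimator from the paper.
    """
    m = len(samples)
    if m == 0:
        raise ValueError("need at least one sample")
    return entropy_from_probs(c / m for c in Counter(samples).values())
```

The plug-in estimate tends to underestimate the entropy when the number of samples is small relative to the domain size, which is one reason sample-efficient estimators need more care.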
Testing monotone high-dimensional distributions
- In STOC, 2005
Cited by 14 (2 self)
A monotone distribution P over a (partially) ordered domain assigns higher probability to y than to x if y ≥ x in the order. We study several natural problems concerning testing properties of monotone distributions over the n-dimensional Boolean cube, given access to random draws from the distribution being tested. We give a poly(n)-time algorithm for testing whether a monotone distribution is equivalent to or ɛ-far (in the L1 norm) from the uniform distribution. A key ingredient of the algorithm is a generalization of a known isoperimetric inequality for the Boolean cube. We also introduce a method for proving lower bounds on various problems of testing monotone distributions over the n-dimensional Boolean cube, based on a new decomposition technique for monotone distributions. We use this method to show that our uniformity testing algorithm is optimal up to polylog(n) factors, and also to give exponential lower bounds on the complexity of several other problems, including testing whether a monotone distribution is identical to or ɛ-far from a fixed known monotone product distribution and approximating the entropy of an unknown monotone distribution.
Testing k-wise and almost k-wise independence
- In 39th Annual ACM Symposium on Theory of Computing, 2007
Cited by 14 (5 self)
In this work, we consider the problems of testing whether a distribution over {0, 1}^n is k-wise (resp. (ɛ, k)-wise) independent using samples drawn from that distribution. For the problem of distinguishing k-wise independent distributions from those that are δ-far from k-wise independence in statistical distance, we upper bound the number of required samples by Õ(n^k/δ²) and lower bound it by Ω(n^((k−1)/2)/δ) (these bounds hold for constant k, and essentially the same bounds hold for general k). To achieve these bounds, we use Fourier analysis to relate a distribution's distance from k-wise independence to its biases, a measure of the parity imbalance it induces on a set of variables. The relationships we derive are tighter than previously known, and may be of independent interest. To distinguish (ɛ, k)-wise independent distributions from those that are δ-far from (ɛ, k)-wise independence in statistical distance, we upper bound the number of required samples by O(k log n / (δ²ɛ²)) and lower bound it by
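The notion of bias used in this abstract can be sketched in a few lines of Python. This is a hedged illustration, not the tester from the paper: it assumes the usual convention that a k-wise independent distribution over {0, 1}^n is one in which every k coordinates are jointly uniform, so that the parity of any nonempty set of at most k coordinates is perfectly balanced. The function names `empirical_bias` and `max_bias_up_to_k` are hypothetical.

```python
from itertools import combinations


def empirical_bias(samples, subset):
    """Empirical |Pr[parity even] - Pr[parity odd]| on the given coordinates."""
    if not samples:
        raise ValueError("need at least one sample")
    # Each sample contributes +1 for even parity and -1 for odd parity.
    signed = sum(-1 if sum(x[i] for i in subset) % 2 else 1 for x in samples)
    return abs(signed) / len(samples)


def max_bias_up_to_k(samples, n, k):
    """Largest empirical bias over all nonempty coordinate sets of size <= k.

    Enumerates on the order of n^k subsets, so this is only practical
    for small n and k.
    """
    return max(
        empirical_bias(samples, subset)
        for size in range(1, k + 1)
        for subset in combinations(range(n), size)
    )
```

For example, samples in which the first two bits are always equal have zero bias on every single coordinate but bias 1 on the pair (0, 1), so they look 1-wise independent and fail at k = 2.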
Sublinear time algorithms
- SIGACT News, 2003
Cited by 14 (2 self)
Sublinear time algorithms represent a new paradigm in computing, where an algorithm must give some sort of an answer after inspecting only a very small portion of the input. We discuss the sorts of answers that one might be able to achieve in this new setting.
1 Introduction
The goal of algorithmic research is to design efficient algorithms, where efficiency is typically measured as a function of the length of the input. For instance, the elementary school algorithm for multiplying two n-digit integers takes roughly n^2 steps, while more sophisticated algorithms have been devised which run in less than n log^2 n steps. It is still not known whether a linear time algorithm is achievable for integer multiplication. Obviously any algorithm for this task, as for any other nontrivial task, would need to take at least linear time in n, since this is what it would take to read the entire input and write the output. Thus, showing the existence of a linear time algorithm for a problem was traditionally considered to be the gold standard of achievement. Nevertheless, due to the recent tremendous increase in computational power that is inundating us with a multitude of data, we are now encountering a paradigm shift from traditional computational models. The scale of these data sets, coupled with the typical situation in which there is very little time to perform our computations, raises the issue of whether there is time to consider any more than a minuscule fraction of the data in our computations. Analogous to the reasoning that we used for multiplication, for most natural problems, an algorithm which runs in sublinear time must necessarily use randomization and must give an answer which is in some sense imprecise. Nevertheless, there are many situations in which a fast approximate solution is more useful than a slower exact solution.
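A small Python sketch can illustrate the kind of randomized, approximate answer the introduction above describes. It is an assumption-laden example, not taken from the article: `estimate_fraction` (a hypothetical name) estimates the fraction of positions in a length-n input that satisfy a predicate by querying only `num_samples` uniformly random positions. By the Hoeffding bound, on the order of (1/ε²) log(1/δ) queries suffice for additive error ε with probability at least 1 − δ, regardless of n.

```python
import random


def estimate_fraction(query, n, num_samples, rng=None):
    """Estimate the fraction of indices i in [0, n) with query(i) true.

    Reads only num_samples randomly chosen positions instead of all n,
    so the answer is approximate and correct only with high probability.
    """
    if num_samples <= 0:
        raise ValueError("num_samples must be positive")
    rng = rng or random.Random()
    hits = sum(1 for _ in range(num_samples) if query(rng.randrange(n)))
    return hits / num_samples
```

A caller might pass `lambda i: data[i]` as `query` for a large Boolean array `data`; the number of reads is then fixed by `num_samples` and does not grow with the length of `data`.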

