An Optimal Algorithm for the Distinct Elements Problem
Cited by 26 (3 self)
We give the first optimal algorithm for estimating the number of distinct elements in a data stream, closing a long line of theoretical research on this problem begun by Flajolet and Martin in their seminal paper in FOCS 1983. This problem has applications to query optimization, Internet routing, network topology, and data mining. For a stream of indices in $\{1, \ldots, n\}$, our algorithm computes a $(1 \pm \varepsilon)$-approximation using an optimal $O(\varepsilon^{-2} + \log n)$ bits of space with 2/3 success probability, where $0 < \varepsilon < 1$ is given. This probability can be amplified by independent repetition. Furthermore, our algorithm processes each stream update in $O(1)$ worst-case time, and can report an estimate at any point midstream in $O(1)$ worst-case time, thus settling both the space and time complexities simultaneously.
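To make the problem statement concrete, here is a minimal sketch of a classical and much simpler estimator, the "k minimum values" (KMV) sketch. It is not the optimal algorithm of the abstract above: it uses roughly k hash values of space rather than $O(\varepsilon^{-2} + \log n)$ bits, and the names `hash_to_unit`, `kmv_estimate` and the default `k=256` are illustrative assumptions.

```python
import hashlib
import heapq


def hash_to_unit(item):
    """Map an item to a pseudo-random float in [0, 1) via SHA-256 (an assumed hash choice)."""
    digest = hashlib.sha256(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def kmv_estimate(stream, k=256):
    """Estimate the number of distinct items by tracking the k smallest hash values."""
    heap = []     # max-heap (values negated) holding the k smallest hashes seen so far
    kept = set()  # the same hash values, for duplicate detection
    for item in stream:
        h = hash_to_unit(item)
        if h in kept:
            continue  # repeated item (or an already-kept hash): nothing to do
        if len(heap) < k:
            heapq.heappush(heap, -h)
            kept.add(h)
        elif h < -heap[0]:
            # New hash is smaller than the current k-th smallest: swap it in.
            evicted = -heapq.heapreplace(heap, -h)
            kept.discard(evicted)
            kept.add(h)
    if len(heap) < k:
        return len(heap)  # fewer than k distinct hashes seen: the count is exact
    kth_smallest = -heap[0]
    return int((k - 1) / kth_smallest)  # standard KMV estimate


if __name__ == "__main__":
    stream = [i % 5000 for i in range(100_000)]  # 5000 distinct values, each repeated
    print(kmv_estimate(stream))
```

The relative error of this sketch shrinks roughly like $1/\sqrt{k}$, so halving the error requires about four times as many stored hash values.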
Fast moment estimation in data streams in optimal space
In Proceedings of the 43rd ACM Symposium on Theory of Computing (STOC), 2011
Cited by 20 (6 self)
We give a space-optimal algorithm with update time $O(\log^2(1/\varepsilon) \log\log(1/\varepsilon))$ for $(1 \pm \varepsilon)$-approximating the $p$th frequency moment, $0 < p < 2$, of a length-$n$ vector updated in a data stream. This provides a nearly exponential improvement in the update time complexity over the previous space-optimal algorithm of [Kane-Nelson-Woodruff, SODA 2010], which had update time $\Omega(1/\varepsilon^2)$.
Bounded Independence Fools Degree-2 Threshold Functions
Cited by 14 (6 self)
Let $x$ be a random vector coming from any $k$-wise independent distribution over $\{-1, 1\}^n$. For an $n$-variate degree-2 polynomial $p$, we prove that $E[\mathrm{sgn}(p(x))]$ is determined up to an additive $\varepsilon$ for $k = \mathrm{poly}(1/\varepsilon)$. This answers an open question of Diakonikolas et al. (FOCS 2009). Using standard constructions of $k$-wise independent distributions, we obtain a broad class of explicit generators that $\varepsilon$-fool the class of degree-2 threshold functions with seed length $\log n \cdot \mathrm{poly}(1/\varepsilon)$. Our approach is quite robust: it easily extends to yield that the intersection of any constant number of degree-2 threshold functions is $\varepsilon$-fooled by $\mathrm{poly}(1/\varepsilon)$-wise independence. Our results also hold if the entries of $x$ are $k$-wise independent standard normals, implying for example that bounded independence derandomizes the Goemans-Williamson hyperplane rounding scheme. To achieve our results, we introduce a technique we dub multivariate FT-mollification, a generalization of the univariate form introduced by Kane et al. (SODA 2010) in the context of streaming algorithms. Along the way we prove a generalized hypercontractive inequality for quadratic forms which takes the operator norm of the associated matrix into account. These techniques may be of independent interest.
1-Pass Relative-Error Lp-Sampling with Applications
Cited by 14 (6 self)
For any $p \in [0, 2]$, we give a 1-pass $\mathrm{poly}(\varepsilon^{-1} \log n)$-space algorithm which, given a data stream of length $m$ with insertions and deletions of an $n$-dimensional vector $a$, with updates in the range $\{-M, -M + 1, \ldots, M - 1, M\}$, outputs a sample of $[n] = \{1, 2, \ldots, n\}$ for which for all $i$ the probability that $i$ is returned is $(1 \pm \varepsilon) \frac{|a_i|^p}{F_p(a)} \pm n^{-C}$, where $a_i$ denotes the (possibly negative) value of coordinate $i$, $F_p(a) = \sum_{i=1}^{n} |a_i|^p = \|a\|_p^p$ denotes the $p$th frequency moment (i.e., the $p$th power of the $L_p$ norm), and $C > 0$ is an arbitrarily large constant. Here we assume that $n$, $m$, and $M$ are polynomially related. Our generic sampling framework improves and unifies algorithms for several communication and streaming problems, including cascaded norms, heavy hitters, and moment estimation. It also gives the first relative-error forward sampling algorithm in a data stream with deletions, answering an open question of Cormode et al.
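As a worked illustration of the target distribution $|a_i|^p / F_p(a)$ defined above, the following sketch computes it exactly and offline from a list of updates. It stores the whole vector, so it is not a small-space streaming sampler and not the paper's method; the function name `lp_sampling_distribution` and the `(index, delta)` update format are assumptions made for the example.

```python
from collections import defaultdict


def lp_sampling_distribution(updates, p):
    """Return {i: |a_i|^p / F_p(a)} for the vector a defined by (index, delta) updates."""
    a = defaultdict(int)
    for i, delta in updates:
        a[i] += delta  # insertions and deletions both accumulate into coordinate i
    # Zero coordinates are dropped, so for p = 0 every nonzero coordinate has weight 1.
    weights = {i: abs(v) ** p for i, v in a.items() if v != 0}
    f_p = sum(weights.values())  # the p-th frequency moment F_p(a)
    if f_p == 0:
        raise ValueError("all coordinates are zero, so the distribution is undefined")
    return {i: w / f_p for i, w in weights.items()}


if __name__ == "__main__":
    # Coordinate 3 is inserted and then fully deleted, so it never gets sampled.
    updates = [(1, 3), (2, -4), (3, 5), (3, -5)]
    print(lp_sampling_distribution(updates, 2))  # weights 9 and 16 out of F_2 = 25
```

A streaming $L_p$-sampler in the sense of the abstract would return index $i$ with probability within a $(1 \pm \varepsilon)$ factor (plus $n^{-C}$) of these values, without storing $a$ explicitly.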
Coresets and Sketches for High Dimensional Subspace Approximation Problems
Cited by 12 (7 self)
We consider the problem of approximating a set $P$ of $n$ points in $\mathbb{R}^d$ by a $j$-dimensional subspace under the $\ell_p$ measure, in which we wish to minimize the sum of $\ell_p$ distances from each point of $P$ to this subspace. More generally, the $F_q(\ell_p)$-subspace approximation problem asks for a $j$-subspace that minimizes the sum of $q$th powers of $\ell_p$-distances to this subspace, up to a multiplicative factor of $(1 + \varepsilon)$. We develop techniques for subspace approximation, regression, and matrix approximation that can be used to deal with massive data sets in high dimensional spaces. In particular, we develop coresets and sketches, i.e. small space representations that approximate the input point set $P$ with respect to the subspace approximation problem. Our results are:
• A dimensionality reduction method that can be applied to $F_q(\ell_p)$-clustering and shape fitting problems, such as those in [8, 15].
• The first strong coreset for $F_1(\ell_2)$-subspace approximation in high-dimensional spaces, i.e. of size polynomial in the dimension of the space. This coreset approximates the distances to any $j$-subspace (not just the optimal one).
• A $(1 + \varepsilon)$-approximation algorithm for the $j$-dimensional $F_1(\ell_2)$-subspace approximation problem with running time $nd \cdot (j/\varepsilon)^{O(1)} + (n + d) \cdot 2^{\mathrm{poly}(j/\varepsilon)}$.
• A streaming algorithm that maintains a coreset for the $F_1(\ell_2)$-subspace approximation problem and uses a space log n
Optimal Bounds for Johnson-Lindenstrauss Transforms and Streaming Problems with Sub-Constant Error
Cited by 11 (1 self)
The Johnson-Lindenstrauss transform is a dimensionality reduction technique with a wide range of applications to theoretical computer science. It is specified by a distribution over projection matrices from $\mathbb{R}^d \to \mathbb{R}^k$ where $k \ll d$ and states that $k = O(\varepsilon^{-2} \log(1/\delta))$ dimensions suffice to approximate the norm of any fixed vector in $\mathbb{R}^d$ to within a factor of $1 \pm \varepsilon$ with probability at least $1 - \delta$. In this paper we show that this bound on $k$ is optimal up to a constant factor, improving upon a previous $\Omega(\varepsilon^{-2} \log(1/\delta) / \log(1/\varepsilon))$ dimension bound of Alon. Our techniques are based on lower bounding the information cost of a novel one-way communication game and yield the first space lower bounds in a data stream model that depend on the error probability $\delta$. For many streaming problems, the most naïve way of achieving error probability $\delta$ is to first achieve constant probability, then take the median of $O(\log(1/\delta))$ independent repetitions. Our techniques show that for a wide range of problems this is in fact optimal! As an example, we show that estimating the $\ell_p$-distance for any $p \in [0, 2]$ requires $\Omega(\varepsilon^{-2} \log n \log(1/\delta))$ space, even for vectors in $\{0, 1\}^n$. This is optimal in all parameters and closes a long line of work on this problem. We also show the number of distinct elements requires $\Omega(\varepsilon^{-2} \log(1/\delta) + \log n)$ space, which is optimal if $\varepsilon^{-2} = \Omega(\log n)$. We also improve previous lower bounds for entropy in the strict turnstile and general turnstile models by a multiplicative factor of $\Omega(\log(1/\delta))$. Finally, we give an application to one-way communication complexity under product distributions, showing that unlike in the case of constant $\delta$, the VC-dimension does not characterize the complexity when $\delta = o(1)$.
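The "median of $O(\log(1/\delta))$ independent repetitions" step mentioned in the abstract above can be sketched as follows. This is an illustrative sketch under assumptions: `noisy_estimator` is a hypothetical toy estimator that is accurate with probability 2/3, and the constant `c=18` comes from a Hoeffding bound for that success probability (the chance that at least half of $r$ runs fail is at most $e^{-r/18}$), not from the paper.

```python
import math
import random
import statistics


def repetitions_needed(delta, c=18):
    """Odd number of runs r with exp(-r / c) <= delta (c=18 assumes success probability 2/3)."""
    reps = max(1, math.ceil(c * math.log(1 / delta)))
    return reps if reps % 2 == 1 else reps + 1  # odd, so the median is one of the runs


def amplify_by_median(estimator, delta, rng):
    """Run a constant-probability estimator repeatedly and return the median result."""
    reps = repetitions_needed(delta)
    return statistics.median(estimator(rng) for _ in range(reps))


def noisy_estimator(rng, truth=100.0, eps=0.05):
    """Hypothetical estimator: within (1 +/- eps) of truth w.p. 2/3, arbitrary otherwise."""
    if rng.random() < 2 / 3:
        return truth * (1 + rng.uniform(-eps, eps))
    return rng.choice([0.0, 10 * truth])  # a wildly wrong answer


if __name__ == "__main__":
    rng = random.Random(0)
    print(repetitions_needed(1e-6), amplify_by_median(noisy_estimator, 1e-6, rng))
```

The median is wrong only if at least half of the runs are wrong, which is why the number of repetitions grows with $\log(1/\delta)$ rather than with $1/\delta$.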
Recognizing Well-Parenthesized Expressions in the Streaming Model
Cited by 8 (2 self)
Motivated by a concrete problem and with the goal of understanding the relationship between the complexity of streaming algorithms and the computational complexity of formal languages, we investigate the problem Dyck($s$) of checking matching parentheses, with $s$ different types of parenthesis. We present a one-pass randomized streaming algorithm for Dyck(2) with space $O(\sqrt{n} \log n)$ bits, time per letter $\mathrm{polylog}(n)$, and one-sided error. We prove that this one-pass algorithm is optimal, up to a $\log n$ factor, even when two-sided error is allowed, and conjecture that a similar bound holds for any constant number of passes over the input. Surprisingly, the space requirement shrinks drastically if we have access to the input stream in reverse. We present a two-pass randomized streaming algorithm for Dyck(2)
Zero-One Frequency Laws
Cited by 5 (0 self)
Data streams emerged as a critical model for multiple applications that handle vast amounts of data. One of the most influential and celebrated papers in streaming is the “AMS” paper on computing frequency moments by Alon, Matias and Szegedy. The main question left open (and explicitly asked) by AMS in 1996 is to give the precise characterization for which functions $G$ on frequency vectors $m_i$ ($1 \le i \le n$) can $\sum_{i \in [n]} G(m_i)$ be approximated efficiently, where “efficiently” means by a single pass over the data stream and polylogarithmic memory. No such characterization was known despite a tremendous amount of research on frequency-based functions in the streaming literature. In this paper we finally resolve the AMS main question and give a precise characterization (in fact, a zero-one law) for all monotonically increasing functions on frequencies that are zero at the origin. That is, we consider all monotonic functions $G : \mathbb{R} \to \mathbb{R}$ such that $G(0) = 0$ and $G$ can be computed in polylogarithmic time and space and ask, for which $G$ in this class is there a $(1 \pm \varepsilon)$-approximation algorithm for computing $\sum_{i \in [n]} G(m_i)$ for any polylogarithmic $\varepsilon$? We give an algebraic characterization for all such $G$ so that:
• For all functions $G$ in our class that satisfy our algebraic condition, we provide a very general and constructive way to derive an efficient $(1 \pm \varepsilon)$-approximation algorithm for computing $\sum_{i \in [n]} G(m_i)$ with polylogarithmic memory and a single pass over the data stream; while
• For all functions $G$ in our class that do not satisfy our algebraic characterization, we show a lower bound
Lower bounds for testing computability by small width OBDDs
In Proc. 8th Annual Theory and Applications of Models of Computation, 2011
Cited by 4 (1 self)
We consider the problem of testing whether a function $f : \{0, 1\}^n \to \{0, 1\}$ is computable by a read-once, width-2 ordered binary decision diagram (OBDD), also known as a branching program. This problem has two variants: one where the variables must occur in a fixed, known order, and one where the variables are allowed to occur in an arbitrary order. We show that for both variants, any nonadaptive testing algorithm must make $\Omega(n)$ queries, and thus any adaptive testing algorithm must make $\Omega(\log n)$ queries. We also consider the more general problem of testing computability by width-$w$ OBDDs where the variables occur in a fixed order. We show that for any constant $w \ge 4$, $\Omega(n)$ queries are required, resolving a conjecture of Goldreich [15]. We prove all of our lower bounds using a new technique of Blais, Brody, and Matulef [6], giving simple reductions from known hard problems in communication complexity to the testing problems at hand. Our result for width-2 OBDDs provides the first example of the power of this technique for proving strong nonadaptive bounds.
Information cost tradeoffs for augmented index and streaming language recognition
In Proc. IEEE Symposium on Foundations of Computer Science, 2010
Cited by 4 (1 self)
This paper makes three main contributions to the theory of communication complexity and stream computation. First, we present new bounds on the information complexity of AUGMENTED-INDEX. In contrast to analogous results for INDEX by Jain, Radhakrishnan and Sen [J. ACM, 2009], we have to overcome the significant technical challenge that protocols for AUGMENTED-INDEX may violate the “rectangle property” due to the inherent input sharing. Second, we use these bounds to resolve an open problem of Magniez, Mathieu and Nayak [STOC, 2010] that asked about the multi-pass complexity of recognizing Dyck languages. This results in a natural separation between the standard multi-pass model and the multi-pass model that permits reverse passes. Third, we present the first passive memory checkers that verify the interaction transcripts of priority queues, stacks, and double-ended queues. We obtain tight upper and lower bounds for these problems, thereby addressing an important subclass of the memory checking framework of Blum et al. [Algorithmica, 1994].