Results 1  10
of
129
Data Streams: Algorithms and Applications
, 2005
"... In the data stream scenario, input arrives very rapidly and there is limited memory to store the input. Algorithms have to work with one or few passes over the data, space less than linear in the input size or time significantly less than the input size. In the past few years, a new theory has emerg ..."
Abstract

Cited by 375 (21 self)
 Add to MetaCart
In the data stream scenario, input arrives very rapidly and there is limited memory to store the input. Algorithms have to work with one or few passes over the data, space less than linear in the input size or time significantly less than the input size. In the past few years, a new theory has emerged for reasoning about algorithms that work within these constraints on space, time, and number of passes. Some of the methods rely on metric embeddings, pseudorandom computations, sparse approximation theory and communication complexity. The applications for this scenario include IP network traffic analysis, mining text message streams and processing massive data sets in general. Researchers in Theoretical Computer Science, Databases, IP Networking and Computer Systems are working on the data stream challenges. This article is an overview and survey of data stream algorithmics and is an updated version of [175].1
Nearoptimal lower bounds on the multiparty communication complexity of set disjointness
 In IEEE Conference on Computational Complexity
, 2003
"... We study the communication complexity of the set disjointness problem in the general multiparty model. For t players, each holding a subset of a universe of size n, we establish a nearoptimal lower bound of Ω(n/(t log t)) on the communication complexity of the problem of determining whether their ..."
Abstract

Cited by 72 (7 self)
 Add to MetaCart
We study the communication complexity of the set disjointness problem in the general multiparty model. For t players, each holding a subset of a universe of size n, we establish a nearoptimal lower bound of Ω(n/(t log t)) on the communication complexity of the problem of determining whether their sets are disjoint. In the more restrictive oneway communication model, in which the players are required to speak in a predetermined order, we improve our bound to an optimal Ω(n/t). These results improve upon the earlier bounds of Ω(n/t 2) in the general model, and Ω(ε 2 n/t 1+ε) in the oneway model, due to BarYossef, Jayram, Kumar, and Sivakumar [5]. As in the case of earlier results, our bounds apply to the unique intersection promise problem. This communication problem is known to have connections with the space complexity of approximating frequency moments in the data stream model. Our results lead to an improved space complexity lower bound of Ω(n 1−2/k / log n) for approximating the k th frequency moment with a constant number of passes over the input, and a technical improvement to Ω(n 1−2/k) if only one pass over the input is permitted. Our proofs rely on the information theoretic direct sum decomposition paradigm of BarYossef et al [5]. Our improvements stem from novel analytical tech
Optimal space lower bounds for all frequency moments
 In SODA
, 2004
"... Abstract We prove that any onepass streaming algorithm which (ffl, ffi)approximates the kth frequency moment Fk, for any real k 6 = 1 and any ffl = \Omega i 1pm j, must use \Omega \Gamma 1ffl2 \Delta bits of space, where m is the size of the universe. This is optimal in terms of ffl, resolves the ..."
Abstract

Cited by 60 (12 self)
 Add to MetaCart
Abstract We prove that any onepass streaming algorithm which (ffl, ffi)approximates the kth frequency moment Fk, for any real k 6 = 1 and any ffl = \Omega i 1pm j, must use \Omega \Gamma 1ffl2 \Delta bits of space, where m is the size of the universe. This is optimal in terms of ffl, resolves the open questions of BarYossef et al in [3, 4], and extends the \Omega \Gamma 1ffl2 \Delta lower bound for F0 in [11] to much smaller ffl by applying novel techniques. Along the way we lower bound the oneway communication complexity of approximating the Hamming distance and the number of bipartite graphs with minimum/maximum degree constraints. 1 Introduction Computing statistics on massive data sets is increasinglyimportant these days. Advances in communication and storage technology enable large bodies of raw datato be generated daily, and consequently, there is a rising demand to process this data efficiently. Sinceit is impractical for an algorithm to store even a small fraction of the data stream, its performance istypically measured by the amount of space it uses. In many scenarios, such as internet routing, once a streamelement is examined it is lost forever unless explicitly saved by the processing algorithm. This, along with thesheer size of the data, makes multiple passes over the data infeasible. In this paper we restrict our attention toonepass streaming algorithms and we investigate their space complexity.Let a =
Streaming and sublinear approximation of entropy and information distances
 In ACMSIAM Symposium on Discrete Algorithms
, 2006
"... In most algorithmic applications which compare two distributions, information theoretic distances are more natural than standard ℓp norms. In this paper we design streaming and sublinear time property testing algorithms for entropy and various information theoretic distances. Batu et al posed the pr ..."
Abstract

Cited by 55 (13 self)
 Add to MetaCart
In most algorithmic applications which compare two distributions, information theoretic distances are more natural than standard ℓp norms. In this paper we design streaming and sublinear time property testing algorithms for entropy and various information theoretic distances. Batu et al posed the problem of property testing with respect to the JensenShannon distance. We present optimal algorithms for estimating bounded, symmetric fdivergences (including the JensenShannon divergence and the Hellinger distance) between distributions in various property testing frameworks. Along the way, we close a (log n)/H gap between the upper and lower bounds for estimating entropy H, yielding an optimal algorithm over all values of the entropy. In a data stream setting (sublinear space), we give the first algorithm for estimating the entropy of a distribution. Our algorithm runs in polylogarithmic space and yields an asymptotic constant factor approximation scheme. An integral part of the algorithm is an interesting use of an F0 (the number of distinct elements in a set) estimation algorithm; we also provide other results along the space/time/approximation tradeoff curve. Our results have interesting structural implications that connect sublinear time and space constrained algorithms. The mediating model is the random order streaming model, which assumes the input is a random permutation of a multiset and was first considered by Munro and Paterson in 1980. We show that any property testing algorithm in the combined oracle model for calculating a permutation invariant functions can be simulated in the random order model in a single pass. This addresses a question raised by Feigenbaum et al regarding the relationship between property testing and stream algorithms. Further, we give a polylogspace PTAS for estimating the entropy of a one pass random order stream. This bound cannot be achieved in the combined oracle (generalized property testing) model. 1
Simpler algorithm for estimating frequency moments of data streams
 PROCEEDINGS OF THE SEVENTEENTH ANNUAL ACMSIAM SYMPOSIUM ON DISCRETE ALGORITHM
, 2006
"... The problem of estimating the kth frequency moment Fk over a data stream by looking at the items exactly once as they arrive was posed in [1, 2]. A succession of algorithms have been proposed for this problem [1, 2, 6, 8, 7]. Recently, Indyk and Woodruff [11] have presented the first algorithm for e ..."
Abstract

Cited by 39 (3 self)
 Add to MetaCart
The problem of estimating the kth frequency moment Fk over a data stream by looking at the items exactly once as they arrive was posed in [1, 2]. A succession of algorithms have been proposed for this problem [1, 2, 6, 8, 7]. Recently, Indyk and Woodruff [11] have presented the first algorithm for estimating Fk, for k > 2, using space Õ(n12/k), matching the space lower bound (up to polylogarithmic factors) for this problem [1, 2, 3, 4, 13] (n is the number of distinct items occurring in the stream.) In this paper, we present a simpler 1pass algorithm for estimating Fk.
An optimal randomised cell probe lower bounds for approximate nearest neighbor searching
 In Proceedings of the Symposium on Foundations of Computer Science
"... Abstract We consider the approximate nearest neighbour search problem on the Hamming Cube {0, 1}d.We show that a randomised cell probe algorithm that uses polynomial storage and word size dO(1)requires a worst case query time of \Omega (log log d / log log log d). The approximation factor may beas l ..."
Abstract

Cited by 22 (2 self)
 Add to MetaCart
Abstract We consider the approximate nearest neighbour search problem on the Hamming Cube {0, 1}d.We show that a randomised cell probe algorithm that uses polynomial storage and word size dO(1)requires a worst case query time of \Omega (log log d / log log log d). The approximation factor may beas loose as 2log 1j d for any fixed j> 0. This generalises an earlier result [6] on the deterministic complexity of the same problem and, more importantly, fills a major gap in the study of thisproblem since all earlier lower bounds either did not allow randomisation [6, 19] or did not allow approximation [5, 2, 16]. We also give a cell probe algorithm which proves that our lower boundis optimal. Our proof uses a lower bound on the round complexity of the related communication problem.We show, additionally, that considerations of bit complexity alone cannot prove any nontrivial cell probe lower bound for the problem. This shows that the Richness Technique [20] used in a lot ofrecent research around this problem would not have helped here.
Robust lower bounds for communication and stream computation
 in Proceedings of the 40th Annual ACM Symposium on Theory of Computing (British
, 2008
"... We study the communication complexity of evaluating functions when the input data is randomly allocated (according to some known distribution) amongst two or more players, possibly with information overlap. This naturally extends previously studied variable partition models such as the bestcase and ..."
Abstract

Cited by 21 (6 self)
 Add to MetaCart
We study the communication complexity of evaluating functions when the input data is randomly allocated (according to some known distribution) amongst two or more players, possibly with information overlap. This naturally extends previously studied variable partition models such as the bestcase and worstcase partition models [32, 29]. We aim to understand whether the hardness of a communication problem holds for almost every allocation of the input, as opposed to holding for perhaps just a few atypical partitions. A key application is to the heavily studied data stream model. There is a strong connection between our communication lower bounds and lower bounds in the data stream model that are “robust” to the ordering of the data. That is, we prove lower bounds for when the order of the items in the stream is chosen not adversarially but rather uniformly (or nearuniformly) from the set of all permuations. This randomorder data stream model has attracted recent interest, since lower bounds here give stronger evidence for the inherent hardness of streaming problems. Our results include the first randompartition communication lower bounds for problems including multiparty set disjointness and gapHammingdistance. Both are tight. We also extend and improve previous results [19, 7] for a form of pointer jumping that is relevant to the problem of selection (in particular, median finding). Collectively, these results yield lower bounds for a variety of problems in the randomorder data stream model, including estimating the number of distinct elements, approximating frequency moments, and quantile estimation.
Index Coding with Side Information
, 2006
"... Motivated by a problem of transmitting supplemental data over broadcast channels (Birk and Kol, INFOCOM 1998), we study the following coding problem: a sender communicates with n receivers R1,..., Rn. He holds an input x ∈ {0, 1} n and wishes to broadcast a single message so that each receiver Ri c ..."
Abstract

Cited by 20 (0 self)
 Add to MetaCart
Motivated by a problem of transmitting supplemental data over broadcast channels (Birk and Kol, INFOCOM 1998), we study the following coding problem: a sender communicates with n receivers R1,..., Rn. He holds an input x ∈ {0, 1} n and wishes to broadcast a single message so that each receiver Ri can recover the bit xi. Each Ri has prior side information about x, induced by a directed graph G on n nodes; Ri knows the bits of x in the positions {j  (i, j) is an edge of G}. G is known to the sender and to the receivers. We call encoding schemes that achieve this goal INDEX codes for {0, 1} n with side information graph G. In this paper we identify a measure on graphs, the minrank, which exactly characterizes the minimum length of linear and certain types of nonlinear INDEX codes. We show that for natural classes of side information graphs, including directed acyclic graphs, perfect graphs, odd holes, and odd antiholes, minrank is the optimal length of arbitrary INDEX codes. For arbitrary INDEX codes and arbitrary graphs, we obtain a lower bound in terms of the size of the maximum acyclic induced subgraph. This bound holds even for randomized codes, but is shown not to be tight.
Fast moment estimation in data streams in optimal space
 In Proceedings of the 43rd ACM Symposium on Theory of Computing (STOC
, 2011
"... We give a spaceoptimal algorithm with update time O(log 2 (1/ε) log log(1/ε)) for (1 ± ε)approximating the pth frequency moment, 0 < p < 2, of a lengthn vector updated in a data stream. This provides a nearly exponential improvement in the update time complexity over the previous spaceoptimal alg ..."
Abstract

Cited by 20 (6 self)
 Add to MetaCart
We give a spaceoptimal algorithm with update time O(log 2 (1/ε) log log(1/ε)) for (1 ± ε)approximating the pth frequency moment, 0 < p < 2, of a lengthn vector updated in a data stream. This provides a nearly exponential improvement in the update time complexity over the previous spaceoptimal algorithm of [KaneNelsonWoodruff, SODA 2010], which had update time Ω(1/ε 2). 1
Distributed verification and hardness of distributed approximation
 CoRR
"... We study the verification problem in distributed networks, stated as follows. Let H be a subgraph of a network G where each vertex of G knows which edges incident on it are in H. We would like to verify whether H has some properties, e.g., if it is a tree or if it is connected (every node knows in t ..."
Abstract

Cited by 19 (6 self)
 Add to MetaCart
We study the verification problem in distributed networks, stated as follows. Let H be a subgraph of a network G where each vertex of G knows which edges incident on it are in H. We would like to verify whether H has some properties, e.g., if it is a tree or if it is connected (every node knows in the end of the process whether H has the specified property or not). We would like to perform this verification in a decentralized fashion via a distributed algorithm. The time complexity of verification is measured as the number of rounds of distributed communication. In this paper we initiate a systematic study of distributed verification, and give almost tight lower bounds on the running time of distributed verification algorithms for many A full version of this paper is available as [5] at