Results 1 - 10 of 228
Data Streams: Algorithms and Applications
, 2005
"... In the data stream scenario, input arrives very rapidly and there is limited memory to store the input. Algorithms have to work with one or few passes over the data, space less than linear in the input size or time significantly less than the input size. In the past few years, a new theory has emerg ..."
Abstract
-
Cited by 533 (22 self)
- Add to MetaCart
In the data stream scenario, input arrives very rapidly and there is limited memory to store the input. Algorithms have to work with one or few passes over the data, space less than linear in the input size or time significantly less than the input size. In the past few years, a new theory has emerged for reasoning about algorithms that work within these constraints on space, time, and number of passes. Some of the methods rely on metric embeddings, pseudo-random computations, sparse approximation theory and communication complexity. The applications for this scenario include IP network traffic analysis, mining text message streams and processing massive data sets in general. Researchers in Theoretical Computer Science, Databases, IP Networking and Computer Systems are working on the data stream challenges. This article is an overview and survey of data stream algorithmics and is an updated version of [175].
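As a concrete illustration of the one-pass, sublinear-space regime the survey describes, the sketch below implements the classic Misra-Gries frequent-items summary. This is an illustrative example of a streaming algorithm, not a construction taken from the survey itself; the parameter k and the function name are our own.

```python
def misra_gries(stream, k):
    """One-pass frequent-items summary using at most k-1 counters.

    Any item occurring more than len(stream)/k times is guaranteed
    to appear among the returned candidates (counts are lower bounds).
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # Decrement every counter; drop those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

# Each item is seen exactly once as it arrives; memory is O(k) regardless of stream length.
print(misra_gries(["a", "b", "a", "c", "a", "a", "b", "d"], k=3))
```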
Index Coding with Side Information
, 2006
"... Motivated by a problem of transmitting supplemental data over broadcast channels (Birk and Kol, INFOCOM 1998), we study the following coding problem: a sender communicates with n receivers R1,..., Rn. He holds an input x ∈ {0, 1} n and wishes to broadcast a single message so that each receiver Ri c ..."
Abstract
-
Cited by 103 (0 self)
- Add to MetaCart
(Show Context)
Motivated by a problem of transmitting supplemental data over broadcast channels (Birk and Kol, INFOCOM 1998), we study the following coding problem: a sender communicates with n receivers R1,..., Rn. He holds an input x ∈ {0,1}^n and wishes to broadcast a single message so that each receiver Ri can recover the bit xi. Each Ri has prior side information about x, induced by a directed graph G on n nodes; Ri knows the bits of x in the positions {j | (i, j) is an edge of G}. G is known to the sender and to the receivers. We call encoding schemes that achieve this goal INDEX codes for {0,1}^n with side information graph G. In this paper we identify a measure on graphs, the minrank, which exactly characterizes the minimum length of linear and certain types of non-linear INDEX codes. We show that for natural classes of side information graphs, including directed acyclic graphs, perfect graphs, odd holes, and odd anti-holes, minrank is the optimal length of arbitrary INDEX codes. For arbitrary INDEX codes and arbitrary graphs, we obtain a lower bound in terms of the size of the maximum acyclic induced subgraph. This bound holds even for randomized codes, but is shown not to be tight.
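A toy instance may help fix the model. The sketch below is our own illustration, not code from the paper: it takes G to be the complete directed graph, so every receiver already knows all bits except its own, the minrank is 1, and broadcasting the single XOR of all bits lets each Ri recover xi.

```python
from functools import reduce

def encode_parity(x):
    """Linear INDEX code of length 1 for the complete side-information graph:
    the sender broadcasts the XOR of all input bits."""
    return reduce(lambda a, b: a ^ b, x)

def decode(broadcast_bit, known_bits):
    """Receiver R_i XORs the broadcast bit with the bits it already knows."""
    return reduce(lambda a, b: a ^ b, known_bits, broadcast_bit)

x = [1, 0, 1, 1]                     # sender's input in {0,1}^n
b = encode_parity(x)                 # one broadcast bit instead of n
for i in range(len(x)):
    side_info = [x[j] for j in range(len(x)) if j != i]  # R_i knows all other bits
    assert decode(b, side_info) == x[i]
print("every receiver recovers its own bit from a single broadcast bit")
```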
Near-optimal lower bounds on the multi-party communication complexity of set disjointness
- In IEEE Conference on Computational Complexity
, 2003
"... We study the communication complexity of the set disjointness problem in the general multi-party model. For t players, each holding a subset of a universe of size n, we establish a near-optimal lower bound of Ω(n/(t log t)) on the communication complexity of the problem of determining whether their ..."
Abstract
-
Cited by 86 (7 self)
- Add to MetaCart
(Show Context)
We study the communication complexity of the set disjointness problem in the general multi-party model. For t players, each holding a subset of a universe of size n, we establish a near-optimal lower bound of Ω(n/(t log t)) on the communication complexity of the problem of determining whether their sets are disjoint. In the more restrictive one-way communication model, in which the players are required to speak in a predetermined order, we improve our bound to an optimal Ω(n/t). These results improve upon the earlier bounds of Ω(n/t^2) in the general model, and Ω(ε^2 n/t^{1+ε}) in the one-way model, due to Bar-Yossef, Jayram, Kumar, and Sivakumar [5]. As in the case of earlier results, our bounds apply to the unique intersection promise problem. This communication problem is known to have connections with the space complexity of approximating frequency moments in the data stream model. Our results lead to an improved space complexity lower bound of Ω(n^{1-2/k}/log n) for approximating the k-th frequency moment with a constant number of passes over the input, and a technical improvement to Ω(n^{1-2/k}) if only one pass over the input is permitted. Our proofs rely on the information theoretic direct sum decomposition paradigm of Bar-Yossef et al [5]. Our improvements stem from novel analytical techniques.
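The connection to streaming space follows a standard reduction; the sketch below is our paraphrase of its shape, with constants and logarithmic factors suppressed.

```latex
% A p-pass algorithm using S bits of space yields a one-way t-party protocol
% in which the memory state is handed from player to player p times, so the
% total communication is at most p * t * S.  Combining this with the one-way
% lower bound Omega(n/t) for unique-intersection disjointness, and choosing
% t = Theta(n^{1/k}) so that a constant-factor F_k approximation separates
% the disjoint and intersecting cases:
\[
  p\, t\, S \;\ge\; \Omega\!\left(\frac{n}{t}\right)
  \quad\Longrightarrow\quad
  S \;\ge\; \Omega\!\left(\frac{n}{p\, t^{2}}\right)
  \;=\; \Omega\!\left(\frac{n^{1-2/k}}{p}\right).
\]
```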
Optimal space lower bounds for all frequency moments
- In SODA
, 2004
"... We prove that any one-pass streaming algorithm which (ffl, ffi)-approximates the kth frequency moment Fk, for any real k 6 = 1 and any ffl = \Omega i 1pm j, must use \Omega \Gamma 1ffl2 \Delta bits of space, where m is the size of the universe. This is optimal in terms of ffl, resolves the open qu ..."
Abstract
-
Cited by 78 (13 self)
- Add to MetaCart
(Show Context)
We prove that any one-pass streaming algorithm which (ε, δ)-approximates the k-th frequency moment F_k, for any real k ≠ 1 and any ε = Ω(1/√m), must use Ω(1/ε^2) bits of space, where m is the size of the universe. This is optimal in terms of ε, resolves the open questions of Bar-Yossef et al in [3, 4], and extends the Ω(1/ε^2) lower bound for F_0 in [11] to much smaller ε by applying novel techniques. Along the way we lower bound the one-way communication complexity of approximating the Hamming distance and the number of bipartite graphs with minimum/maximum degree constraints.
Streaming and sublinear approximation of entropy and information distances
- In ACM-SIAM Symposium on Discrete Algorithms
, 2006
"... In most algorithmic applications which compare two distributions, information theoretic distances are more natural than standard ℓp norms. In this paper we design streaming and sublinear time property testing algorithms for entropy and various information theoretic distances. Batu et al posed the pr ..."
Abstract
-
Cited by 67 (13 self)
- Add to MetaCart
(Show Context)
In most algorithmic applications which compare two distributions, information theoretic distances are more natural than standard ℓp norms. In this paper we design streaming and sublinear time property testing algorithms for entropy and various information theoretic distances. Batu et al posed the problem of property testing with respect to the Jensen-Shannon distance. We present optimal algorithms for estimating bounded, symmetric f-divergences (including the Jensen-Shannon divergence and the Hellinger distance) between distributions in various property testing frameworks. Along the way, we close a (log n)/H gap between the upper and lower bounds for estimating entropy H, yielding an optimal algorithm over all values of the entropy. In a data stream setting (sublinear space), we give the first algorithm for estimating the entropy of a distribution. Our algorithm runs in polylogarithmic space and yields an asymptotic constant factor approximation scheme. An integral part of the algorithm is an interesting use of an F_0 (the number of distinct elements in a set) estimation algorithm; we also provide other results along the space/time/approximation tradeoff curve. Our results have interesting structural implications that connect sublinear time and space constrained algorithms. The mediating model is the random order streaming model, which assumes the input is a random permutation of a multiset and was first considered by Munro and Paterson in 1980. We show that any property testing algorithm in the combined oracle model for computing a permutation-invariant function can be simulated in the random order model in a single pass. This addresses a question raised by Feigenbaum et al regarding the relationship between property testing and stream algorithms. Further, we give a polylog-space PTAS for estimating the entropy of a one-pass random-order stream. This bound cannot be achieved in the combined oracle (generalized property testing) model.
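One standard building block in this line of work is an unbiased single-sample estimator for the empirical entropy of a stream. The derivation below is background we supply for orientation; it underlies later random-position sampling approaches and is not claimed to be the construction used in this particular paper.

```latex
% Stream a_1,...,a_m; item i occurs m_i times; empirical entropy
%   H = sum_i (m_i/m) log(m/m_i).
% Pick J uniform in [m] and let r = |{ j >= J : a_j = a_J }|.  Define
\[
  X \;=\; r \log\frac{m}{r} \;-\; (r-1)\log\frac{m}{r-1},
  \qquad\text{with } (r-1)\log\tfrac{m}{r-1} := 0 \text{ when } r = 1 .
\]
% For each item i, as J ranges over its m_i positions, r takes each value
% 1,...,m_i exactly once, so the inner sum telescopes and X is unbiased:
\[
  \mathbb{E}[X]
  \;=\; \frac{1}{m}\sum_i \sum_{r=1}^{m_i}
        \Bigl( r\log\tfrac{m}{r} - (r-1)\log\tfrac{m}{r-1} \Bigr)
  \;=\; \frac{1}{m}\sum_i m_i \log\frac{m}{m_i}
  \;=\; H .
\]
```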
Quantum and Classical Strong Direct Product Theorems and Optimal Time-Space Tradeoffs
- SIAM Journal on Computing
, 2004
"... A strong direct product theorem says that if we want to compute k independent instances of a function, using less than k times the resources needed for one instance, then our overall success probability will be exponentially small in k. We establish such theorems for the classical as well as quantum ..."
Abstract
-
Cited by 65 (13 self)
- Add to MetaCart
A strong direct product theorem says that if we want to compute k independent instances of a function, using less than k times the resources needed for one instance, then our overall success probability will be exponentially small in k. We establish such theorems for the classical as well as quantum query complexity of the OR function. This implies slightly weaker direct product results for all total functions. We prove a similar result for quantum communication protocols computing k instances of the Disjointness function. Our direct product theorems...
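For orientation, the generic shape of such a statement is recorded below; this is our paraphrase, with the exact constants left unspecified.

```latex
% Let T be the number of queries needed to compute one instance of OR_N with
% bounded error (T = Theta(N) classically, Theta(sqrt(N)) quantumly).  A strong
% direct product theorem asserts that for some constant alpha > 0, any algorithm
% making at most alpha * k * T queries to k independent instances satisfies
\[
  \Pr\bigl[\text{all } k \text{ instances answered correctly}\bigr]
  \;\le\; 2^{-\Omega(k)} .
\]
```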
How to Compress Interactive Communication
, 2009
"... We describe new ways to simulate 2-party communication protocols to get protocols with potentially smaller communication. We show that every communication protocol that communicates C bits and reveals I bits of information to the participating parties can be simulated by a new protocol involving at ..."
Abstract
-
Cited by 50 (7 self)
- Add to MetaCart
We describe new ways to simulate 2-party communication protocols to get protocols with potentially smaller communication. We show that every communication protocol that communicates C bits and reveals I bits of information to the participating parties can be simulated by a new protocol involving at most Õ(√(C·I)) bits of communication. In the case that the parties have inputs that are independent of each other, we get much better results, showing how to carry out the simulation with Õ(I) bits of communication. These results lead to a direct sum theorem for randomized communication complexity. Ignoring polylogarithmic factors, we show that for worst case computation, computing n copies of a function requires √n times the communication required for computing one copy of the function. For average case complexity, given any distribution µ on inputs, computing n copies of the function on n independent inputs sampled according to µ requires √n times the communication for computing one copy. If µ is a product distribution, computing n copies on n independent inputs sampled according to µ requires n times the communication required for computing the function. We also study the complexity of computing the sum (or parity) of n evaluations of f,
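The quantitative statements can be condensed as follows; this is our own restatement, with polylogarithmic factors hidden in the tilde notation.

```latex
% A protocol with communication C that reveals I bits of information can be
% simulated with communication
\[
  \tilde{O}\bigl(\sqrt{C \cdot I}\,\bigr),
  \qquad\text{and } \tilde{O}(I) \text{ when the two inputs are independent.}
\]
% Resulting direct sum bounds for n copies (worst case, and for a distribution mu):
\[
  R(f^{n}) \;\ge\; \tilde{\Omega}\bigl(\sqrt{n}\bigr)\cdot R(f),
  \qquad
  D^{\mu^{n}}(f^{n}) \;\ge\; \tilde{\Omega}\bigl(\sqrt{n}\bigr)\cdot D^{\mu}(f),
\]
% with the factor improving from sqrt(n) to n when mu is a product distribution.
```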
Simpler algorithm for estimating frequency moments of data streams
- In Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms
, 2006
"... The problem of estimating the kth frequency moment Fk over a data stream by looking at the items exactly once as they arrive was posed in [1, 2]. A succession of algorithms have been proposed for this problem [1, 2, 6, 8, 7]. Recently, Indyk and Woodruff [11] have presented the first algorithm for e ..."
Abstract
-
Cited by 45 (4 self)
- Add to MetaCart
The problem of estimating the k-th frequency moment F_k over a data stream by looking at the items exactly once as they arrive was posed in [1, 2]. A succession of algorithms have been proposed for this problem [1, 2, 6, 8, 7]. Recently, Indyk and Woodruff [11] have presented the first algorithm for estimating F_k, for k > 2, using space Õ(n^{1-2/k}), matching the space lower bound (up to poly-logarithmic factors) for this problem [1, 2, 3, 4, 13] (n is the number of distinct items occurring in the stream). In this paper, we present a simpler 1-pass algorithm for estimating F_k.
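For contrast with the Õ(n^{1-2/k})-space algorithms discussed here, the sketch below implements the classic Alon-Matias-Szegedy estimator for F_k: reservoir-sample a uniformly random stream position and count later occurrences of that item. This is background illustration, not the simpler algorithm of this paper, and a single copy has high variance, so practical use averages many independent copies.

```python
import random

def ams_fk_estimate(stream, k):
    """One-pass unbiased estimate of F_k = sum_i (frequency of item i)^k.

    Reservoir-samples a uniformly random position J and counts occurrences
    of stream[J] from J onward; m * (r^k - (r-1)^k) is unbiased for F_k.
    """
    m, sample, r = 0, None, 0
    for item in stream:
        m += 1
        if random.randrange(m) == 0:   # with probability 1/m, restart at this position
            sample, r = item, 1
        elif item == sample:
            r += 1
    return m * (r ** k - (r - 1) ** k)

def fk_estimate(stream, k, copies=200):
    # Average independent copies to reduce variance.
    return sum(ams_fk_estimate(stream, k) for _ in range(copies)) / copies

stream = ["a"] * 8 + ["b"] * 4 + ["c"] * 2
print(fk_estimate(stream, k=2))   # true F_2 = 64 + 16 + 4 = 84
```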
Distributed verification and hardness of distributed approximation
- CoRR
"... We study the verification problem in distributed networks, stated as follows. Let H be a subgraph of a network G where each vertex of G knows which edges incident on it are in H. We would like to verify whether H has some properties, e.g., if it is a tree or if it is connected (every node knows in t ..."
Abstract
-
Cited by 44 (12 self)
- Add to MetaCart
(Show Context)
We study the verification problem in distributed networks, stated as follows. Let H be a subgraph of a network G where each vertex of G knows which edges incident on it are in H. We would like to verify whether H has some properties, e.g., if it is a tree or if it is connected (every node knows at the end of the process whether H has the specified property or not). We would like to perform this verification in a decentralized fashion via a distributed algorithm. The time complexity of verification is measured as the number of rounds of distributed communication. In this paper we initiate a systematic study of distributed verification, and give almost tight lower bounds on the running time of distributed verification algorithms for many fundamental problems.
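To make the round-based cost model concrete, here is a small synchronous simulation; it is entirely our own toy illustration under an assumed edge-list representation, not an algorithm from the paper. Each node repeatedly exchanges the smallest identifier it has seen over its incident H-edges, so labels stabilize after a number of rounds proportional to the diameter of H.

```python
def verify_connected(nodes, h_edges):
    """Toy synchronous simulation: nodes flood the minimum ID seen over H-edges.
    H is connected iff a single label remains.  (In a real distributed setting
    the final aggregation over G would itself cost extra rounds.)"""
    neighbors = {v: set() for v in nodes}
    for u, v in h_edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    label = {v: v for v in nodes}          # each node starts with its own ID
    rounds = 0
    while True:
        # One synchronous round: every node sends its label to its H-neighbors.
        new_label = {v: min([label[v]] + [label[u] for u in neighbors[v]])
                     for v in nodes}
        rounds += 1
        if new_label == label:
            break
        label = new_label
    return len(set(label.values())) == 1, rounds

nodes = [1, 2, 3, 4, 5]
print(verify_connected(nodes, [(1, 2), (2, 3), (4, 5)]))           # not connected
print(verify_connected(nodes, [(1, 2), (2, 3), (3, 4), (4, 5)]))   # connected
```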
Information Equals Amortized Communication
, 2010
"... We show how to efficiently simulate the sending of a message M to a receiver who has partial information about the message, sothat the expected number of bits communicated in the simulationis closeto the amount ofadditionalinformationthatthemessagerevealstothereceiver. Thisisageneralizationandstreng ..."
Abstract
-
Cited by 38 (6 self)
- Add to MetaCart
We show how to efficiently simulate the sending of a message M to a receiver who has partial information about the message, so that the expected number of bits communicated in the simulation is close to the amount of additional information that the message reveals to the receiver. This is a generalization and strengthening of the Slepian-Wolf theorem, which shows how to carry out such a simulation with low amortized communication in the case that M is a deterministic function of X. A caveat is that our simulation is interactive. As a consequence, we obtain new relationships between the randomized amortized communication complexity of a function, and its information complexity. We prove that for any given distribution on inputs, the internal information cost (namely the information revealed to the parties) involved in computing any relation or function using a two party interactive protocol is exactly equal to the amortized communication complexity of computing independent copies of the same relation or function. Here by amortized communication complexity we mean the average per copy communication in the best protocol for computing multiple copies, with a bound on the error in each copy. This significantly simplifies the relationships between the various measures of complexity for average case communication protocols, and proves that if a function's information cost is smaller than its communication complexity, then multiple copies of the function can be computed more
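In symbols, the central result can be stated as follows; this is our condensed restatement, where IC denotes internal information cost and D the distributional communication complexity with error at most ε on each copy.

```latex
\[
  \mathrm{IC}_{\mu}(f)
  \;=\;
  \lim_{n\to\infty} \frac{D^{\mu^{n}}_{\varepsilon}\!\bigl(f^{n}\bigr)}{n}
\]
% i.e. the information a single-copy protocol must reveal to the players equals
% the average per-copy communication needed to compute n independent copies
% drawn from mu.
```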