Results 1–10 of 66
A near-optimal algorithm for computing the entropy of a stream
In ACM-SIAM Symposium on Discrete Algorithms, 2007
Cited by 74 (20 self)

Abstract: We describe a simple algorithm for approximating the empirical entropy of a stream of m values in a single pass, using O(ε⁻² log(δ⁻¹) log m) words of space. Our algorithm is based upon a novel extension of a method introduced by Alon, Matias, and Szegedy [1]. We show a space lower bound of Ω(ε⁻² / log(ε⁻¹)), meaning that our algorithm is near-optimal in terms of its dependency on ε. This improves over previous work on this problem [8, 13, 17, 5]. We show that generalizing to kth-order entropy requires close to linear space for all k ≥ 1, and give additive approximations using our algorithm. Lastly, we show how to compute a multiplicative approximation to the entropy of a random walk on an undirected graph.

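The basic AMS-style trick behind this line of work can be illustrated in a few lines. The sketch below is a simplification for intuition, not the paper's full algorithm: pick a uniformly random position J in the stream, let R be the number of occurrences of stream[J] from position J onward, and output m·(g(R) − g(R−1)) with g(r) = (r/m)·log₂(m/r). Averaging over all m choices of J shows the estimator is exactly unbiased for the empirical entropy; the code verifies this by computing the expectation by brute force.

```python
import math
from collections import Counter

def empirical_entropy(stream):
    """Exact empirical entropy H = sum_i (f_i/m) * log2(m/f_i)."""
    m = len(stream)
    return sum((f / m) * math.log2(m / f) for f in Counter(stream).values())

def estimator_expectation(stream):
    """Exact expectation of the basic single-sample estimator:
    pick a uniform position J, let R = occurrences of stream[J]
    in the suffix stream[J:], output m * (g(R) - g(R-1)),
    with g(r) = (r/m) * log2(m/r) and g(0) = 0.
    Enumerating all m positions computes E[X] exactly."""
    m = len(stream)
    g = lambda r: (r / m) * math.log2(m / r) if r > 0 else 0.0
    total = 0.0
    for j in range(m):
        r = stream[j:].count(stream[j])  # suffix count of the sampled value
        total += m * (g(r) - g(r - 1))
    return total / m
```

For each item of frequency f, the suffix counts R take each value 1..f exactly once, so the telescoping sum collapses to Σᵢ g(fᵢ), which is the entropy.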
Data streaming algorithms for estimating entropy of network traffic
In Proceedings of the Joint International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS), ACM, 2006
Cited by 72 (12 self)

Abstract: Given n flows of sizes a_1, ..., a_n, let s ≡ Σ_i a_i. The empirical entropy is defined as ...

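The definition that the snippet truncates is the standard one, H = −Σᵢ (aᵢ/s)·log(aᵢ/s); a minimal sketch (using log base 2, which is an assumption, as the paper may use natural logarithms):

```python
import math

def empirical_entropy(flow_sizes):
    """Empirical entropy of n flows of sizes a_1..a_n:
    H = -sum_i (a_i/s) * log2(a_i/s), where s = sum_i a_i."""
    s = sum(flow_sizes)
    return -sum((a / s) * math.log2(a / s) for a in flow_sizes)
```

For example, four equal-sized flows give the maximum entropy log₂ 4 = 2 bits, while a single flow gives 0.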
Estimating PageRank on Graph Streams
Cited by 46 (5 self)

Abstract: This study focuses on computations on large graphs (e.g., the webgraph) where the edges of the graph are presented as a stream. The objective in the streaming model is to use a small amount of memory (preferably sublinear in the number of nodes n) and a few passes. In the streaming model, we show how to perform several graph computations, including estimating the probability distribution after a random walk of length l, the mixing time, and the conductance. We estimate the mixing time M of a random walk in Õ(nα + Mα√(Mn) + √(Mn/α)) space and Õ(√(M/α)) passes. Furthermore, the relation between mixing time and conductance gives us an estimate for the conductance of the graph. By applying our algorithm for computing the probability distribution on the webgraph, we can estimate the PageRank p of any node up to an additive error of √(εp) in Õ(√(M/α)) passes and Õ(min(nα + ...

Estimating entropy and entropy norm on data streams
In Proceedings of the 23rd International Symposium on Theoretical Aspects of Computer Science (STACS), 2006
Cited by 43 (3 self)

Abstract: We consider the problem of computing information-theoretic functions such as entropy on a data stream, using sublinear space. Our first result deals with a measure we call the “entropy norm” of an input stream: it is closely related to entropy but is structurally similar to the well-studied notion of frequency moments. We give a polylogarithmic-space one-pass algorithm for estimating this norm under certain conditions on the input stream. We also prove a lower bound that rules out such an algorithm if these conditions do not hold. Our second group of results is for estimating the empirical entropy of an input stream. We first present a sublinear-space one-pass algorithm for this problem. For a stream of m items and a given real parameter α, our algorithm uses space Õ(m^{2α}) and provides an approximation of 1/α in the worst case and (1 + ε) in “most” cases. We then present a two-pass polylogarithmic-space (1+ε)-approximation algorithm. All our algorithms are quite simple.

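The structural similarity between the entropy norm and frequency moments is easy to see concretely. A sketch, under the assumption that the entropy norm is F_H = Σᵢ fᵢ log fᵢ over item frequencies fᵢ (the base-2 logarithm is also an assumption); the identity H = log₂ m − F_H/m is then the standard algebraic link to empirical entropy:

```python
import math
from collections import Counter

def entropy_norm(stream):
    """Entropy norm F_H = sum_i f_i * log2(f_i), summed over
    the distinct items' frequencies f_i (compare F_2 = sum_i f_i^2)."""
    return sum(f * math.log2(f) for f in Counter(stream).values())

def empirical_entropy(stream):
    """H = log2(m) - F_H / m for a stream of m items, by expanding
    H = -sum_i (f_i/m) * log2(f_i/m)."""
    m = len(stream)
    return math.log2(m) - entropy_norm(stream) / m
```

So any estimate of F_H with additive error ε·m translates directly into an additive-ε estimate of H, which is why approximating the norm is useful.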
Sketching and Streaming Entropy via Approximation Theory
Cited by 32 (0 self)

Abstract: We conclude a sequence of work by giving near-optimal sketching and streaming algorithms for estimating Shannon entropy in the most general streaming model, with arbitrary insertions and deletions. This improves on prior results that obtain suboptimal space bounds in the general model, and near-optimal bounds in the insertion-only model without sketching. Our high-level approach is simple: we give algorithms to estimate Rényi and Tsallis entropy, and use them to extrapolate an estimate of Shannon entropy. The accuracy of our estimates is proven using approximation-theory arguments and extremal properties of Chebyshev polynomials, a technique which may be useful for other problems. Our work also yields the best-known and near-optimal additive approximations for entropy, and hence also for conditional entropy and mutual information.

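The extrapolation idea rests on the identity that Shannon entropy is the α → 1 limit of Rényi entropy, H_α = (1/(1−α))·log Σᵢ pᵢ^α. The snippet below checks this numerically; it illustrates only the mathematical identity, not the paper's sketching algorithm or its Chebyshev-based extrapolation scheme.

```python
import math

def renyi_entropy(p, alpha):
    """Renyi entropy H_a = (1/(1-a)) * log2(sum_i p_i^a), for a != 1."""
    return math.log2(sum(pi ** alpha for pi in p)) / (1 - alpha)

def shannon_entropy(p):
    """Shannon entropy H = -sum_i p_i * log2(p_i)."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)
```

Evaluating H_α at α close to 1 already gives a good additive approximation to H; the paper's contribution is doing this extrapolation from streamable estimates of H_α with provable accuracy.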
Robust lower bounds for communication and stream computation
In Proceedings of the 40th Annual ACM Symposium on Theory of Computing (STOC), 2008
Cited by 28 (7 self)

Abstract: We study the communication complexity of evaluating functions when the input data is randomly allocated (according to some known distribution) amongst two or more players, possibly with information overlap. This naturally extends previously studied variable-partition models such as the best-case and worst-case partition models [32, 29]. We aim to understand whether the hardness of a communication problem holds for almost every allocation of the input, as opposed to holding for perhaps just a few atypical partitions. A key application is to the heavily studied data stream model. There is a strong connection between our communication lower bounds and lower bounds in the data stream model that are “robust” to the ordering of the data. That is, we prove lower bounds for when the order of the items in the stream is chosen not adversarially but rather uniformly (or near-uniformly) from the set of all permutations. This random-order data stream model has attracted recent interest, since lower bounds here give stronger evidence for the inherent hardness of streaming problems. Our results include the first random-partition communication lower bounds for problems including multiparty set disjointness and gap-Hamming-distance. Both are tight. We also extend and improve previous results [19, 7] for a form of pointer jumping that is relevant to the problem of selection (in particular, median finding). Collectively, these results yield lower bounds for a variety of problems in the random-order data stream model, including estimating the number of distinct elements, approximating frequency moments, and quantile estimation.

Approximate quantiles and the order of the stream
In Proceedings of the ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS)
Cited by 25 (10 self)

Abstract: Recently, there has been an increased focus on modeling uncertainty by distributions. Suppose we wish to compute a function of a stream whose elements are samples drawn independently from some distribution. The distribution is unknown, but the order in which the samples are presented to us will not be completely adversarial. In this paper, we investigate the importance of the ordering of a data stream, without making any assumptions about the actual distribution of the data. Using quantiles as an example application, we show that we can design provably better algorithms, and settle several open questions on the impact of order on streams. With the recent impetus in the investigation of models for sensor networks, we believe that our approach will allow the construction of novel and significantly improved algorithms.

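As a baseline for the quantile application discussed above, the simplest one-pass estimator keeps a small uniform sample and returns its sample quantile. This is a generic sketch using classic reservoir sampling, not this paper's order-aware algorithm, and the sample size k = 999 is an arbitrary choice:

```python
import random

def reservoir_sample(stream, k, rng):
    """Classic reservoir sampling: a uniform sample of k items in one pass."""
    sample = []
    for i, x in enumerate(stream):
        if i < k:
            sample.append(x)
        else:
            j = rng.randrange(i + 1)   # replace with probability k/(i+1)
            if j < k:
                sample[j] = x
    return sample

def approx_quantile(stream, phi, k=999, rng=None):
    """Estimate the phi-quantile of the stream by the phi-quantile
    of a uniform reservoir sample (error ~ 1/sqrt(k) in rank)."""
    rng = rng or random.Random(0)      # fixed seed for reproducibility
    s = sorted(reservoir_sample(stream, k, rng))
    return s[min(int(phi * len(s)), len(s) - 1)]
```

The rank error of such a sample-based estimate is Θ(n/√k) regardless of stream order; the paper's point is that random arrival order allows provably better guarantees than the adversarial-order worst case.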
Lower bounds for quantile estimation in random-order and multi-pass streaming
In International Colloquium on Automata, Languages and Programming (ICALP), 2007
Cited by 20 (8 self)

Abstract: We present lower bounds on the space required to estimate the quantiles of a stream of numerical values. Quantile estimation is perhaps the most studied problem in the data stream model, and it is relatively well understood in the basic single-pass data stream model in which the values are ordered adversarially. Natural extensions of this basic model include the random-order model in which the values are ordered randomly (e.g., [21, 5, 13, 11, 12]) and the multi-pass model in which an algorithm is permitted a limited number of passes over the stream (e.g., [6, 7, 1, 19, 2]). We present lower bounds that complement existing upper bounds [21, 11] in both models. One consequence is an exponential separation between the random-order and adversarial-order models: using polylog(n) space, exact selection requires Ω(log n) passes in the adversarial-order model while O(log log n) passes are sufficient in the random-order model.

Optimal Sampling from Sliding Windows
In ACM PODS, 2009
Cited by 19 (3 self)

Abstract: The sliding-windows model is an important special case of the streaming model, in which only the most “recent” elements of the stream remain active and the rest are discarded. The sliding-windows model is important for many applications (see, e.g., Babcock, Babu, Datar, Motwani and Widom (PODS 02); and Datar, Gionis, Indyk and Motwani (SODA 02)). There are two equally important types of sliding windows: windows of fixed size (e.g., where items arrive one at a time and only the most recent n items remain active, for some fixed parameter n), and bursty windows (e.g., where many items can arrive in “bursts” at a single step and only items from the last t steps remain active, again for some fixed parameter t). Random sampling is a fundamental tool for data streams, as numerous algorithms operate on the sampled data instead of on the entire stream. Effective sampling from sliding windows is a non-trivial problem, as elements eventually expire. In fact, the deletions are implicit; i.e., it is not possible to identify deleted elements without storing the entire window. The implicit nature of deletions on sliding windows does not allow the existing methods (even those that support explicit deletions, e.g., Cormode, Muthukrishnan and Rozenbaum (VLDB 05); Frahling, Indyk and Sohler (SOCG 05)) to be directly “translated” to the sliding-windows model. One trivial approach to overcoming the problem of implicit deletions is that of oversampling: when k samples are required, the oversampling method maintains k′ > k samples in the hope that at least k samples have not expired. The obvious disadvantages of this method are twofold: (a) it introduces additional costs and thus decreases performance; and (b) the memory bounds are not deterministic, which is atypical for ...
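For fixed-size windows, a classical alternative to oversampling is random-priority sampling: assign each arriving item an independent random priority and keep only items that no newer item outranks; the highest-priority unexpired item is then a uniform sample of the window. The sketch below illustrates that known technique as context; it is not this paper's algorithm.

```python
import random
from collections import deque

def window_samples(stream, n, rng):
    """After each arrival, yield a uniform random sample of the last n items.
    Each item gets a random priority; candidates are kept in a deque with
    strictly decreasing priorities, so the front is always the maximum-priority
    (hence uniformly random) item of the current window."""
    cands = deque()                    # entries: (index, value, priority)
    for i, x in enumerate(stream):
        u = rng.random()
        while cands and cands[-1][2] < u:
            cands.pop()                # dominated: a newer item outranks it
        cands.append((i, x, u))
        while cands[0][0] <= i - n:
            cands.popleft()            # expired: fell out of the window
        yield cands[0][1]
```

Expirations are handled for free, since a candidate's index tells us exactly when it leaves the window, but the number of stored candidates is O(log n) only in expectation, which connects to the paper's point about non-deterministic memory bounds.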