A near-optimal algorithm for computing the entropy of a stream
In ACM-SIAM Symposium on Discrete Algorithms, 2007
Cited by 36 (17 self)
We describe a simple algorithm for approximating the empirical entropy of a stream of m values in a single pass, using O(ε^-2 log(δ^-1) log m) words of space. Our algorithm is based upon a novel extension of a method introduced by Alon, Matias, and Szegedy [1]. We show a space lower bound of Ω(ε^-2 / log(ε^-1)), meaning that our algorithm is near optimal in terms of its dependency on ε. This improves over previous work on this problem [8, 13, 17, 5]. We show that generalizing to kth-order entropy requires close to linear space for all k ≥ 1, and give additive approximations using our algorithm. Lastly, we show how to compute a multiplicative approximation to the entropy of a random walk on an undirected graph.
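As a rough illustration of the Alon-Matias-Szegedy sampling idea that this abstract builds on, the sketch below computes the empirical entropy exactly and also evaluates one basic AMS-style estimate. It is not the paper's algorithm: the function names are hypothetical, the stream is held in memory for clarity, and a real one-pass version would choose the position j by reservoir sampling and average many estimates.

```python
import math
from collections import Counter


def empirical_entropy(stream):
    """Exact empirical entropy (in bits) of a finite sequence."""
    m = len(stream)
    counts = Counter(stream)
    return sum((c / m) * math.log2(m / c) for c in counts.values())


def ams_basic_estimate(stream, j):
    """One AMS-style basic estimate anchored at position j (assumed helper).

    r counts how often stream[j] occurs at positions j, j+1, ..., m-1.
    With f(x) = (x/m) * log2(m/x) and f(0) = 0, the value m * (f(r) - f(r-1))
    has expectation equal to the empirical entropy when j is chosen
    uniformly at random, because the differences telescope per item.
    """
    m = len(stream)
    r = sum(1 for x in stream[j:] if x == stream[j])

    def f(x):
        return 0.0 if x == 0 else (x / m) * math.log2(m / x)

    return m * (f(r) - f(r - 1))
```

Averaging `ams_basic_estimate(stream, j)` over every position j reproduces `empirical_entropy(stream)` exactly, which is the unbiasedness property the sampling approach relies on; the space bound quoted in the abstract comes from the paper's own analysis, not from this sketch.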
Data Streaming Algorithms for Estimating Entropy of Network Traffic
In ACM SIGMETRICS, 2006
Cited by 35 (10 self)
Using entropy of traffic distributions has been shown to aid a wide variety of network monitoring applications such as anomaly detection, clustering to reveal interesting patterns, and traffic classification. However, realizing this potential benefit in practice requires accurate algorithms that can operate on high-speed links, with low CPU and memory requirements. In this paper, we investigate the problem of estimating the entropy in a streaming computation model. We give lower bounds for this problem, showing that neither approximation nor randomization alone will let us compute the entropy efficiently. We present two algorithms for randomly approximating the entropy in a time- and space-efficient manner, applicable for use on very high speed (greater than OC-48) links. The first algorithm for entropy estimation is inspired by the structural similarity with the seminal work of Alon et al. for estimating frequency moments, and we provide strong theoretical guarantees on the error and resource usage. Our second algorithm utilizes the observation that the performance of the streaming algorithm can be enhanced by separating the high-frequency items (or elephants) from the low-frequency items (or mice). We evaluate our algorithms on traffic traces from different deployment scenarios.
Estimating entropy and entropy norm on data streams
In Proceedings of the 23rd International Symposium on Theoretical Aspects of Computer Science (STACS), 2006
Cited by 22 (3 self)
We consider the problem of computing information theoretic functions such as entropy on a data stream, using sublinear space. Our first result deals with a measure we call the “entropy norm” of an input stream: it is closely related to entropy but is structurally similar to the well-studied notion of frequency moments. We give a polylogarithmic space one-pass algorithm for estimating this norm under certain conditions on the input stream. We also prove a lower bound that rules out such an algorithm if these conditions do not hold. Our second group of results are for estimating the empirical entropy of an input stream. We first present a sublinear space one-pass algorithm for this problem. For a stream of m items and a given real parameter α, our algorithm uses space Õ(m^(2α)) and provides an approximation of 1/α in the worst case and (1 + ε) in “most” cases. We then present a two-pass polylogarithmic space (1 + ε)-approximation algorithm. All our algorithms are quite simple.
Estimating Entropy over Data Streams
In ESA, 2006
Cited by 19 (2 self)
We present an algorithm for estimating entropy of data streams consisting of insertion and deletion operations using Õ(1) space.
Lower bounds for quantile estimation in random-order and multi-pass streaming
In International Colloquium on Automata, Languages and Programming, 2007
Cited by 17 (7 self)
We present lower bounds on the space required to estimate the quantiles of a stream of numerical values. Quantile estimation is perhaps the most studied problem in the data stream model and it is relatively well understood in the basic single-pass data stream model in which the values are ordered adversarially. Natural extensions of this basic model include the random-order model in which the values are ordered randomly (e.g. [21, 5, 13, 11, 12]) and the multi-pass model in which an algorithm is permitted a limited number of passes over the stream (e.g. [6, 7, 1, 19, 2]). We present lower bounds that complement existing upper bounds [21, 11] in both models. One consequence is an exponential separation between the random-order and adversarial-order models: using Ω(polylog n) space, exact selection requires Ω(log n) passes in the adversarial-order model while O(log log n) passes are sufficient in the random-order model.
Robust lower bounds for communication and stream computation
In Proceedings of the 40th Annual ACM Symposium on Theory of Computing (British, 2008
Cited by 16 (5 self)
We study the communication complexity of evaluating functions when the input data is randomly allocated (according to some known distribution) amongst two or more players, possibly with information overlap. This naturally extends previously studied variable partition models such as the best-case and worst-case partition models [32, 29]. We aim to understand whether the hardness of a communication problem holds for almost every allocation of the input, as opposed to holding for perhaps just a few atypical partitions. A key application is to the heavily studied data stream model. There is a strong connection between our communication lower bounds and lower bounds in the data stream model that are “robust” to the ordering of the data. That is, we prove lower bounds for when the order of the items in the stream is chosen not adversarially but rather uniformly (or near-uniformly) from the set of all permutations. This random-order data stream model has attracted recent interest, since lower bounds here give stronger evidence for the inherent hardness of streaming problems. Our results include the first random-partition communication lower bounds for problems including multi-party set disjointness and gap-Hamming-distance. Both are tight. We also extend and improve previous results [19, 7] for a form of pointer jumping that is relevant to the problem of selection (in particular, median finding). Collectively, these results yield lower bounds for a variety of problems in the random-order data stream model, including estimating the number of distinct elements, approximating frequency moments, and quantile estimation.
Sketching and Streaming Entropy via Approximation Theory
Cited by 12 (0 self)
We conclude a sequence of work by giving near-optimal sketching and streaming algorithms for estimating Shannon entropy in the most general streaming model, with arbitrary insertions and deletions. This improves on prior results that obtain suboptimal space bounds in the general model, and near-optimal bounds in the insertion-only model without sketching. Our high-level approach is simple: we give algorithms to estimate Rényi and Tsallis entropy, and use them to extrapolate an estimate of Shannon entropy. The accuracy of our estimates is proven using approximation theory arguments and extremal properties of Chebyshev polynomials, a technique which may be useful for other problems. Our work also yields the best-known and near-optimal additive approximations for entropy, and hence also for conditional entropy and mutual information.
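To make the relationship between these entropies concrete, the sketch below evaluates the standard definitions of Rényi and Tsallis entropy on an explicit distribution and shows numerically that both approach Shannon entropy (in nats) as the order α approaches 1. This is only the limiting identity, computed offline with hypothetical function names; it is not the paper's sketching algorithm or its Chebyshev-based extrapolation.

```python
import math


def shannon(p):
    """Shannon entropy in nats of a probability vector p."""
    return -sum(x * math.log(x) for x in p if x > 0)


def renyi(p, alpha):
    """Renyi entropy of order alpha (alpha != 1), in nats."""
    return math.log(sum(x ** alpha for x in p if x > 0)) / (1.0 - alpha)


def tsallis(p, alpha):
    """Tsallis entropy of order alpha (alpha != 1)."""
    return (1.0 - sum(x ** alpha for x in p if x > 0)) / (alpha - 1.0)


if __name__ == "__main__":
    p = [0.5, 0.25, 0.25]
    for alpha in (1.5, 1.1, 1.01, 1.001):
        # Both columns drift toward shannon(p) as alpha gets closer to 1.
        print(alpha, renyi(p, alpha), tsallis(p, alpha), shannon(p))
```

Evaluating at a single α close to 1 is the crudest form of extrapolation; the abstract indicates that the paper instead combines estimates and bounds the error with approximation-theory arguments.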
Sketching information divergences
In Conference on Learning Theory, 2007
Cited by 8 (4 self)
When comparing discrete probability distributions, natural measures of similarity are not ℓp distances but rather are information divergences such as Kullback-Leibler and Hellinger. This paper considers some of the issues related to constructing small-space sketches of distributions in the data-stream model, a concept related to dimensionality reduction, such that these measures can be approximated from the sketches. Related problems for ℓp distances are reasonably well understood via a series of results by Johnson & Lindenstrauss (1984), Alon et al. (1999), Indyk (2000), and Brinkman & Charikar (2003). In contrast, almost no analogous results are known to date about constructing sketches for the information divergences used in statistics and learning theory. Our main result is an impossibility result that shows that no small-space sketches exist for the multiplicative approximation of any commonly used f-divergences and Bregman divergences with the notable exceptions of ℓ1 and ℓ2 where small-space sketches exist. We then present data-stream algorithms for the additive approximation of a wide range of information divergences. Throughout, our emphasis is on providing general characterizations.
Estimating PageRank on Graph Streams
Cited by 7 (3 self)
This study focuses on computations on large graphs (e.g., the web-graph) where the edges of the graph are presented as a stream. The objective in the streaming model is to use a small amount of memory (preferably sub-linear in the number of nodes n) and a few passes. In the streaming model, we show how to perform several graph computations including estimating the probability distribution after a random walk of length l, mixing time, and the conductance. We estimate the mixing time M of a random walk in Õ(nα + Mα√n + √(Mn/α)) space and Õ(√(M/α)) passes. Furthermore, the relation between mixing time and conductance gives us an estimate for the conductance of the graph. By applying our algorithm for computing probability distribution on the web-graph, we can estimate the PageRank p of any node up to an additive error of √(εp) in Õ(√(M/α)) passes and Õ(min(nα + 1
Stream Order and Order Statistics: Quantile Estimation in Random-Order Streams
Cited by 5 (1 self)
When trying to process a data-stream in small space, how important is the order in which the data arrives? Are there problems that are unsolvable when the ordering is worst-case, that can be solved (with high probability) when the order is chosen uniformly at random? If we consider the stream as if ordered by an adversary, what happens if we restrict the power of the adversary? We study these questions in the context of quantile estimation, one of the most studied problems in the data-stream model. Our results include an O(polylog n)-space, O(log log n)-pass algorithm for exact selection in a randomly ordered stream of n elements. This resolves an open question of Munro and Paterson [Theor. Comput. Sci., 23 (1980), pp. 315–323]. We then demonstrate an exponential separation between the random-order and adversarial-order models: using O(polylog n) space, exact selection requires Ω(log n / log log n) passes in the adversarial-order model. This lower bound, in contrast to previous results, applies to fully-general randomized algorithms and is established via a new bound on the communication complexity of a natural pointer-chasing style problem. We also prove the first fully general lower bounds in the random-order model: finding an element with rank n/2 ± n^δ in the single-pass random-order model with probability at least 9/10 requires Ω(√(n^(1−3δ) / log n)) space.

