## Streaming and sublinear approximation of entropy and information distances (2006)

Venue: ACM-SIAM Symposium on Discrete Algorithms

Citations: 54 (12 self)

### BibTeX

@INPROCEEDINGS{Guha06streamingand,

author = {Sudipto Guha and Andrew McGregor and Suresh Venkatasubramanian},

title = {Streaming and sublinear approximation of entropy and information distances},

booktitle = {ACM-SIAM Symposium on Discrete Algorithms},

year = {2006},

pages = {733--742}

}

### Abstract

In most algorithmic applications that compare two distributions, information-theoretic distances are more natural than standard ℓp norms. In this paper we design streaming and sublinear-time property-testing algorithms for entropy and various information-theoretic distances. Batu et al. posed the problem of property testing with respect to the Jensen-Shannon distance. We present optimal algorithms for estimating bounded, symmetric f-divergences (including the Jensen-Shannon divergence and the Hellinger distance) between distributions in various property-testing frameworks. Along the way, we close a (log n)/H gap between the upper and lower bounds for estimating the entropy H, yielding an optimal algorithm over all values of the entropy. In a data stream setting (sublinear space), we give the first algorithm for estimating the entropy of a distribution. Our algorithm runs in polylogarithmic space and yields an asymptotic constant-factor approximation scheme. An integral part of the algorithm is an interesting use of an F0 (the number of distinct elements in a set) estimation algorithm; we also provide other results along the space/time/approximation tradeoff curve. Our results have interesting structural implications that connect sublinear-time and space-constrained algorithms. The mediating model is the random-order streaming model, which assumes the input is a random permutation of a multiset and was first considered by Munro and Paterson in 1980. We show that any property-testing algorithm in the combined oracle model for computing permutation-invariant functions can be simulated in the random-order model in a single pass. This addresses a question raised by Feigenbaum et al. regarding the relationship between property testing and stream algorithms. Further, we give a polylog-space PTAS for estimating the entropy of a one-pass random-order stream. This bound cannot be achieved in the combined oracle (generalized property testing) model.
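To make the quantities named in the abstract concrete, the following is a minimal sketch that computes them exactly on small, fully known distributions. It is not any of the paper's streaming or sublinear algorithms. The generator functions f(u) and the identity relating the JS-divergence to entropy follow the paper's definitions as quoted in the citation contexts on this page; the function names, the choice to measure H in bits, and the example vectors `p` and `q` are assumptions made for illustration.

```python
import math

def entropy_bits(p):
    """Shannon entropy in bits (assumed unit); zero-probability terms contribute 0."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def f_divergence(p, q, f):
    """D_f(q, p) = sum_x p(x) * f(q(x) / p(x)); assumes p(x) > 0 for every x."""
    return sum(px * f(qx / px) for px, qx in zip(p, q))

def hellinger_sq(p, q):
    """Squared Hellinger distance, generated by f(u) = (sqrt(u) - 1)^2."""
    return f_divergence(p, q, lambda u: (math.sqrt(u) - 1) ** 2)

def triangle(p, q):
    """Triangle distance, generated by f(u) = (u - 1)^2 / (u + 1)."""
    return f_divergence(p, q, lambda u: (u - 1) ** 2 / (u + 1))

def jensen_shannon(p, q):
    """JS(p, q) = ln 2 * (2 H((p + q)/2) - H(p) - H(q)), with H in bits."""
    mid = [(a + b) / 2 for a, b in zip(p, q)]
    return math.log(2) * (2 * entropy_bits(mid) - entropy_bits(p) - entropy_bits(q))

# Hypothetical example distributions over three items.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]

h, t, js = hellinger_sq(p, q), triangle(p, q), jensen_shannon(p, q)

# Chain of constant-factor relations quoted as Eqn. 3.1 in the citation contexts:
# Hellinger/2 <= Triangle/2 <= JS <= ln(2) * Triangle <= 2 ln(2) * Hellinger.
chain_holds = h / 2 <= t / 2 <= js <= math.log(2) * t <= 2 * math.log(2) * h
print(h, t, js, chain_holds)
```

Because the three divergences stay within constant factors of one another, an estimator for one of them gives a constant-factor estimate of the others, which is why the paper can treat them together.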

### Citations

8609 | Elements of information theory - Cover, Thomas - 1991 |
Citation Context: ...ion matrix) [13] in a way that other plausible measures (most notably ℓ2) are not. In addition, the log-likelihood ratio ln(q(x)/p(x)) is a crucial parameter in Neyman-Pearson style hypothesis testing [17], and distances based on this (like the KL-distance and the JS-distance) appear as exponents of error probabilities for optimal classifiers. Recently, these distance measures have been used in more alg...

1497 | Probability inequalities for sums of bounded random variables - Hoeffding - 1963 |

704 | The space complexity of approximating the frequency moments - Alon, Matias, et al. - 1996 |
Citation Context: ...ata stream model. We are given a base domain [1, ..., n] over integers and a function fp() is specified as 〈p, i, +〉 which corresponds to fp(i) ← fp(i) + 1. This is the model used by Alon et al. in [2]. The model naturally captures fp(i) ← fp(i) + ∆i; however, we do not consider fp(i) ← fp(i) − 1 (deletions) since the negative term does not correspond to any operation over distributions. An alternate mo...

437 | The information bottleneck method - Tishby, Pereira, et al. - 1999 |
Citation Context: ...troduction There are many settings where the natural unit of data, rather than being a point in a high dimensional vector space, is a distribution defined on n items. Examples include soft clustering [33], where the membership of a point in a cluster is described by a distribution, and anomaly detection [27], where the distance between two empirical distributions is used to detect anomalies. Typically...

299 | An improved data stream summary: the count-min sketch and its applications - Cormode, Muthukrishnan - 2005 |
Citation Context: ...ul sampling technique reminiscent of the online facility location algorithm of Meyerson [29] in the context of stream clustering. There are also some similarities with the count/count-min sketches of [14, 16]. Our algorithm will “track” a few items, i.e., maintain explicit counters for them. Let H be the true entropy and pi the true probability of i. Assume m > n ≥ 3. The algorithm is presented in Figure ...

265 | Finding frequent items in data streams - Charikar, Chen, et al. - 2002 |
Citation Context: ...ul sampling technique reminiscent of the online facility location algorithm of Meyerson [29] in the context of stream clustering. There are also some similarities with the count/count-min sketches of [14, 16]. Our algorithm will “track” a few items, i.e., maintain explicit counters for them. Let H be the true entropy and pi the true probability of i. Assume m > n ≥ 3. The algorithm is presented in Figure ...

227 | Differential-geometrical methods in statistics, Lecture notes in statistics - Amari - 1985 |
Citation Context: ... Triangle distance with f(u) = (u − 1)²/(u + 1). Matsusita’s Divergence or the (squared) Hellinger distance has f(u) = (√u − 1)². The ℓ1 or variational distance is realized with f(u) = |u − 1|. [3] and many others show that f-divergences are the unique class of distances on distributions that arise from a fairly simple set of axioms, e.g., permutation invariance, non-decreasing projections, cer...

179 | Why least squares and maximum entropy? An axiomatic approach to inference for linear inverse problems - Csiszár - 1991 |
Citation Context: ...retic considerations are often more natural than distances based on ℓp norms. In the first half of the paper we focus on the Ali-Silvey distances or f-divergences, discovered independently by Csiszár [18], and Ali and Silvey [1]. The class of f-divergences includes many commonly used information theoretic distances, e.g., the (asymmetric) Kullback-Leibler (KL) divergence and its symmetrization, the J...

158 | A general class of coefficients of divergence of one distribution from another - Ali, Silvey - 1966 |
Citation Context: ...often more natural than distances based on ℓp norms. In the first half of the paper we focus on the Ali-Silvey distances or f-divergences, discovered independently by Csiszár [18], and Ali and Silvey [1]. The class of f-divergences includes many commonly used information theoretic distances, e.g., the (asymmetric) Kullback-Leibler (KL) divergence and its symmetrization, the Jensen-Shannon (JS) diver...

156 | An information statistics approach to data stream and communication complexity - Bar-Yossef, Jayram, et al. - 2002 |
Citation Context: .... The random permutation assumption can be removed using an extra pass to give a two-pass simulation in the regular streaming model. The simulation builds upon the reductions used by Bar-Yossef et al. [8, 6, 7] in deriving strong lower bounds for sampling. However, we use the reductions for upper bounds. In a natural sense, if we exclude permutation dependent functions, stream testing in the random permutati...

144 | Counting distinct elements in a data stream - Bar-Yossef, Jayram, et al. - 2002 |
Citation Context: ...proximating the F0. There has been a long history of papers for computing the frequency moments of streams. We focus our attention on the best known (ɛ, δ) approximation algorithm of Bar-Yossef et al. [9], where F0 is approximated up to a factor (1 + ɛ) with probability 1 − δ. Their result shows that the (ɛ, δ)-approximation can be performed in O(((1/ɛ²) log log n + log n) log(1/δ)) space. We will only f...

144 | Frequency estimation of internet packet streams with limited space - Demaine, López-Ortiz, et al. - 2002 |
Citation Context: ...ut from its read-only input tape. Alternate definitions are possible, but this definition dates back to Munro and Paterson [30] and we will restrict ourselves to this definition. (It also appeared in [19].) All other features are the same as a general stream algorithm. As usual, the complexity of the algorithm is measured primarily in terms of the amount of space used on the work tape (for which the a...

108 | A divisive information-theoretic feature clustering algorithm for text classification - Dhillon, Mallela, et al. - 2003 |
Citation Context: ...ear as exponents of error probabilities for optimal classifiers. Recently, these distance measures have been used in more algorithmic contexts, as natural distances for clustering distributional data [33, 20, 5]. Batu et al. [12] gave algorithms for testing closeness of distributions for the ℓ1 and ℓ2 distances, and raised the question of testing closeness of distributions under the JS-divergence. They state ...

93 | On the learnability of discrete distributions - Kearns, Mansour, et al. - 1994 |
Citation Context: ...perties of distributions. These are the generative and evaluative models introduced by Kearns et al. [26]. (Footnote 3: Note that the Hellinger distance is sometimes defined as the square root of the above quantity.) The black-box or generative model of a distribution permits only one operation: taking a sample from the distribution. In other words, given a distribution p = {p1, ..., pn}, sample(p) returns i wi...

78 | Testing that Distributions are Close - Batu, Fortnow, et al. - 2000 |
Citation Context: ...r probabilities for optimal classifiers. Recently, these distance measures have been used in more algorithmic contexts, as natural distances for clustering distributional data [33, 20, 5]. Batu et al. [12] gave algorithms for testing closeness of distributions for the ℓ1 and ℓ2 distances, and raised the question of testing closeness of distributions under the JS-divergence. They state that they suspect...

76 | Some inequalities for information divergence and related measures of discrimination - Topsoe |
Citation Context: ...ivergences are constant factor related to the Triangle divergence as follows: (3.1) Hellinger(p, q)/2 ≤ ∆(p, q)/2 ≤ JS(p, q) ≤ ln(2) ∆(p, q) ≤ 2 ln(2) Hellinger(p, q) (Parts of Eqn. 3.1 are proved in [34].) Therefore the results presented here naturally imply analogous results for them as well. Our algorithm is similar to that in [12], and is presented in Figure 1. It relies on an ℓ2 tester given in [...

58 | Statistical decision rules and optimal inference - Cencov - 1981 |
Citation Context: ... certain direct sum theorems etc., in much the same way that ℓ2 is a natural measure for points in Rⁿ. Moreover, all of these distances are related to each other (via the Fisher information matrix) [13] in a way that other plausible measures (most notably ℓ2) are not. In addition, the log-likelihood ratio ln(q(x)/p(x)) is a crucial parameter in Neyman-Pearson style hypothesis testing [17], and distan...

53 | Online facility location - Meyerson - 2001 |
Citation Context: ...ge f(i) and estimate them and (ii) Uses a worst case bound for small f(i). The first step is achieved by a careful sampling technique reminiscent of the online facility location algorithm of Meyerson [29] in the context of stream clustering. There are also some similarities with the count/count-min sketches of [14, 16]. Our algorithm will “track” a few items, i.e., maintain explicit counters for them....

50 | Sampling algorithms: lower bounds and applications - Bar-Yossef, Kumar, et al. - 2001 |
Citation Context: ... [21]) that are easy in the property testing model but hard to test in streams. This was surprising since many sampling based techniques can be extended to data streams. For example, Bar-Yossef et al. [10] showed that non-adaptive sampling can be easily simulated in an aggregate (all occurrences of item i are grouped together) streaming model with a small blowup in space. The aggregation assumption can...

42 | Approximating the minimum spanning tree weight in sublinear time - Chazelle, Rubinfeld, et al. - 2005 |
Citation Context: ...pace. We will only focus on the fact that the space bounds are polylogarithmic. The basic intuition of our algorithm is similar to the sublinear time minimum spanning tree algorithm of Chazelle et al. [15] and the streaming geometry algorithms by Indyk [25]. The idea is to count objects at various resolutions. Our algorithm works by randomly generating conceptual sub-streams from the data stream. Each ...

42 | Property testing (a tutorial), In: Handbook of Randomized Computing - Ron - 2001 |
Citation Context: ... and data streams. We discuss both of these aspects below. We will not be able to review the extensive literature on either of these topics; however, several good surveys exist, including those by Ron [32], Babcock et al. [4] and Muthukrishnan [31]. 1.1 Problems When dealing with distributions, distances arising from information-theoretic considerations are often more natural than distances based on ℓp ...

41 | Algorithms for dynamic geometric problems over data streams - Indyk |
Citation Context: ...bounds are polylogarithmic. The basic intuition of our algorithm is similar to the sublinear time minimum spanning tree algorithm of Chazelle et al. [15] and the streaming geometry algorithms by Indyk [25]. The idea is to count objects at various resolutions. Our algorithm works by randomly generating conceptual sub-streams from the data stream. Each substream has an associated level j and we will perfo...

36 | small-space algorithms for approximate histogram maintenance - Gilbert, Guha, et al. - 2002 |
Citation Context: ...of the projection (irrespective of membership in MS), denoted by Prefix and keep track of the new elements seen. At the end of the stream, we know the length m, and can now simulate the combined oracle algorithm using MS and Prefix. (Footnote 6: We can view the setting as a “robust distribution” as in Gilbert et al. [23].) The algorithm: At any point of time we are maintaining a set of items A. For each i ∈ A we ar...

33 | Models and Issues in Data Stream Systems - Babcock, Babu, Datar, Motwani, Widom - 2002 |
Citation Context: ...e discuss both of these aspects below. We will not be able to review the extensive literature on either of these topics; however, several good surveys exist, including those by Ron [32], Babcock et al. [4] and Muthukrishnan [31]. 1.1 Problems When dealing with distributions, distances arising from information-theoretic considerations are often more natural than distances based on ℓp norms. In the first...

25 | The Complexity of Massive Data Set Computations - Bar-Yossef |
Citation Context: .... The random permutation assumption can be removed using an extra pass to give a two-pass simulation in the regular streaming model. The simulation builds upon the reductions used by Bar-Yossef et al. [8, 6, 7] in deriving strong lower bounds for sampling. However, we use the reductions for upper bounds. In a natural sense, if we exclude permutation dependent functions, stream testing in the random permutati...

14 | Selection and Sorting with - Munro, Paterson - 1980 |
Citation Context: ...tream algorithm is a data stream algorithm that reads a randomly permuted input from its read-only input tape. Alternate definitions are possible, but this definition dates back to Munro and Paterson [30] and we will restrict ourselves to this definition. (It also appeared in [19].) All other features are the same as a general stream algorithm. As usual, the complexity of the algorithm is measured pri...

13 | Extensions of Lipschitz Mapping into Hilbert - Johnson, Lindenstrauss - 1984 |

12 | The complexity of approximating the entropy - Batu, Dasgupta, Rubinfeld, et al. |
Citation Context: ...ons. We consider the problem of estimating the entropy H of a distribution, providing optimal (up to constants) upper bounds for testing entropy. This improves the previous result of Batu et al. [11] by a factor of (log n)/H(p). Entropy is naturally related to the JS-divergence since JS(p, q) = ln 2 (2H((p + q)/2) − H(p) − H(q)) where (p + q)/2 is the average of the two distributions. Switching from subline...

11 | Convex statistical distances, in Teubner-Texte zur Mathematik - Liese, Vajda - 1987 |
Citation Context: ... the Triangle distance. Every convex function f gives rise to an f-divergence D_f(q, p) = Σ_{x∈Ω} p(x) f(q(x)/p(x)) if f(1) = 0 and f is strictly convex at 1. Results of Csiszár [18], Liese and Vajda [28], Amari ... (Footnote 1: Many of the measures we consider in this paper are not metrics, and several authors use constant multiples of the definitions in this paper. Traditionally, the term ‘divergence’ has been us...)

3 | Data streams: Algorithms and applications, Survey available on request at muthu@research.att.com - Muthukrishnan - 2003 |
Citation Context: ...e aspects below. We will not be able to review the extensive literature on either of these topics; however, several good surveys exist, including those by Ron [32], Babcock et al. [4] and Muthukrishnan [31]. 1.1 Problems When dealing with distributions, distances arising from information-theoretic considerations are often more natural than distances based on ℓp norms. In the first half of the paper we f...

1 | Clustering with Bregman divergences, JMLR - Banerjee, Merugu, et al. - 2005 |
Citation Context: ...ear as exponents of error probabilities for optimal classifiers. Recently, these distance measures have been used in more algorithmic contexts, as natural distances for clustering distributional data [33, 20, 5]. Batu et al. [12] gave algorithms for testing closeness of distributions for the ℓ1 and ℓ2 distances, and raised the question of testing closeness of distributions under the JS-divergence. They state ...

1 | Testing and spot-checking of data streams (extended abstract) - Feigenbaum, Kannan, Viswanathan, et al. - 2000 |
Citation Context: ...ltiset. 1.2 Models As it turns out, sublinear algorithms for testing distributions reveal interesting structure about the relationship between property testing and stream algorithms. Feigenbaum et al. [21] considered the problem of property testing in a data stream model. They showed that there exist functions (e.g., SortedSuperset, a variant of permutation, [21]) that are easy in the property testing ...

1 | Streaming and sublinear approximation of entropy and information distances, CoRR cs/0508122 - Guha, McGregor, Venkatasubramanian - 2005 |
Citation Context: ... ɛ/2, the algorithm passes with probability at least 1 − δ, but if ℓ2(p, q) ≥ ɛ the algorithm passes with probability less than δ. The proofs of the following lemmas can be found in the full version [24]. Lemma 3.2. We say an estimate is heavy if it is greater than 1/n^α. Then, with m = O(n^α log n log(1/δ)/γ²) samples, with probability 1 − δ/2, for any heavy estimate p̃i, p̃i is at most piγ/100 from...

1 | On stationarity in internet measurements through an information-theoretic lens - Krishnamurthy, Madhyastha, Venkatasubramanian - 2005 |
Citation Context: ...mensional vector space, is a distribution defined on n items. Examples include soft clustering [33], where the membership of a point in a cluster is described by a distribution, and anomaly detection [27], where the distance between two empirical distributions is used to detect anomalies. Typically, such settings involve large data sets, and ...

1 | The complexity of massive data set computations - Bar-Yossef, Ziv - 2002 |

1 | Stable distributions, pseudorandom generators, embeddings and data stream computation - Indyk, Piotr - 2000 |