Results 1-10 of 133
Divergence measures based on the Shannon entropy
IEEE Transactions on Information Theory, 1991
Cited by 404 (0 self)
Abstract: A new class of information-theoretic divergence measures based on the Shannon entropy is introduced. Unlike the well-known Kullback divergences, the new measures do not require the condition of absolute continuity to be satisfied by the probability distributions involved. More importantly, their close relationship with the variational distance and the probability of misclassification error are established in terms of bounds. These bounds are crucial in many applications of divergence measures. The new measures are also well characterized by the properties of nonnegativity, finiteness, semiboundedness, and boundedness. Index Terms: Divergence, dissimilarity measure, discrimination information, entropy, probability of error bounds.
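To make the "no absolute continuity required" point concrete, here is a minimal sketch of one widely used measure of this kind, the Jensen-Shannon divergence. The abstract does not spell out a formula, so the definition used here, H((p+q)/2) - (H(p)+H(q))/2 with entropies in bits, and the function names `shannon_entropy` and `js_divergence` are assumptions made for illustration.

```python
import math

def shannon_entropy(p):
    """Shannon entropy in bits; zero-probability terms contribute 0."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def js_divergence(p, q):
    """Assumed definition: H((p+q)/2) - (H(p) + H(q)) / 2, in bits."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return shannon_entropy(m) - (shannon_entropy(p) + shannon_entropy(q)) / 2

# Two distributions with disjoint supports: a Kullback divergence between
# them would be infinite, while this measure stays finite.
p = [1.0, 0.0]
q = [0.0, 1.0]
js_disjoint = js_divergence(p, q)
js_identical = js_divergence(p, p)
```

Under this definition the value is symmetric in its two arguments and never involves a ratio of probabilities, which is why zero entries in either distribution cause no trouble.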
Streaming and sublinear approximation of entropy and information distances
In ACM-SIAM Symposium on Discrete Algorithms, 2006
Cited by 55 (13 self)
In most algorithmic applications which compare two distributions, information-theoretic distances are more natural than standard ℓp norms. In this paper we design streaming and sublinear-time property testing algorithms for entropy and various information-theoretic distances. Batu et al. posed the problem of property testing with respect to the Jensen-Shannon distance. We present optimal algorithms for estimating bounded, symmetric f-divergences (including the Jensen-Shannon divergence and the Hellinger distance) between distributions in various property testing frameworks. Along the way, we close a (log n)/H gap between the upper and lower bounds for estimating entropy H, yielding an optimal algorithm over all values of the entropy. In a data stream setting (sublinear space), we give the first algorithm for estimating the entropy of a distribution. Our algorithm runs in polylogarithmic space and yields an asymptotic constant-factor approximation scheme. An integral part of the algorithm is an interesting use of an F0 (the number of distinct elements in a set) estimation algorithm; we also provide other results along the space/time/approximation tradeoff curve. Our results have interesting structural implications that connect sublinear-time and space-constrained algorithms. The mediating model is the random-order streaming model, which assumes the input is a random permutation of a multiset and was first considered by Munro and Paterson in 1980. We show that any property testing algorithm in the combined oracle model for calculating permutation-invariant functions can be simulated in the random-order model in a single pass. This addresses a question raised by Feigenbaum et al. regarding the relationship between property testing and stream algorithms. Further, we give a polylog-space PTAS for estimating the entropy of a one-pass random-order stream. This bound cannot be achieved in the combined oracle (generalized property testing) model.
Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization
In Advances in Neural Information Processing Systems (NIPS), 2007
"... by convex risk minimization ..."
Divergence Measures and Message Passing
2005
Cited by 48 (2 self)
This paper presents a unifying view of message-passing algorithms, as methods to approximate a complex Bayesian network by a simpler network with minimum information divergence. In this view, the difference between mean-field methods and belief propagation is not the amount of structure they model, but only the measure of loss they minimize ('exclusive' versus 'inclusive' Kullback-Leibler divergence). In each case, message-passing arises by minimizing a localized version of the divergence, local to each factor. By examining these divergence measures, we can intuit the types of solution they prefer (symmetry-breaking, for example) and their suitability for different tasks. Furthermore, by considering a wider variety of divergence measures (such as alpha-divergences), we can achieve different complexity and performance goals.
Probability of error, equivocation, and the Chernoff bound
IEEE Transactions on Information Theory, 1970
Cited by 36 (0 self)
Abstract: Relationships between the probability of error, the equivocation, and the Chernoff bound are examined for the two-hypothesis decision problem. The effect of rejections on these bounds is derived. Finally, the results are extended to the case of any finite number of hypotheses.
Supervised Learning of Quantizer Codebooks by Information Loss Minimization
2007
Cited by 33 (0 self)
This paper proposes a technique for jointly quantizing continuous features and the posterior distributions of their class labels based on minimizing empirical information loss, such that the index K of the quantizer region to which a given feature X is assigned approximates a sufficient statistic for its class label Y. We derive an alternating minimization procedure for simultaneously learning codebooks in the Euclidean feature space and in the simplex of posterior class distributions. The resulting quantizer can be used to encode unlabeled points outside the training set and to predict their posterior class distributions, and has an elegant interpretation in terms of lossless source coding. The proposed method is extensively validated on synthetic and real datasets, and is applied to two diverse problems: learning discriminative visual vocabularies for bag-of-features image classification, and image segmentation.
Information-theoretic image formation
IEEE Transactions on Information Theory, 1998
Cited by 28 (5 self)
Abstract: The emergent role of information theory in image formation is surveyed. Unlike the subject of information-theoretic communication theory, information-theoretic imaging is far from a mature subject. The possible role of information theory in problems of image formation is to provide a rigorous framework for defining the imaging problem, for defining measures of optimality used to form estimates of images, for addressing issues associated with the development of algorithms based on these optimality criteria, and for quantifying the quality of the approximations. The definition of the imaging problem consists of an appropriate model for the data and an appropriate model for the reproduction space, which is the space within which image estimates take values. Each problem statement has an associated optimality criterion that measures the overall quality of an estimate. The optimality criteria include maximizing the likelihood function and minimizing mean squared error for stochastic problems, and minimizing squared error and discrimination for deterministic problems. The development of algorithms is closely tied to the definition of the imaging problem and the associated optimality criterion. Algorithms with a strong information-theoretic motivation are obtained by the method of expectation maximization. Related alternating minimization algorithms are discussed. In quantifying the quality of approximations, global and local measures are discussed. Global measures include the (mean) squared error and discrimination between an estimate and the truth, and probability of error for recognition or hypothesis testing problems. Local measures include Fisher information. Index Terms: Image analysis, image formation, image processing, image reconstruction, image restoration, imaging, inverse problems, maximum-likelihood estimation, pattern recognition.
Symmetrizing the Kullback-Leibler Distance
IEEE Transactions on Information Theory, 2000
Cited by 24 (0 self)
We define a new distance measure, the resistor-average distance, between two probability distributions that is closely related to the Kullback-Leibler distance. While the Kullback-Leibler distance is asymmetric in the two distributions, the resistor-average distance is not. It arises from geometric considerations similar to those used to derive the Chernoff distance. Determining its relation to well-known distance measures reveals a new way to depict how commonly used distance measures relate to each other. 1 Introduction. The Kullback-Leibler distance [15, 16] is perhaps the most frequently used information-theoretic "distance" measure from a viewpoint of theory. If $p_0$, $p_1$ are two probability densities, the Kullback-Leibler distance is defined to be $D(p_1 \| p_0) = \int p_1(x) \log \frac{p_1(x)}{p_0(x)} \, dx$ (1). In this paper, $\log(\cdot)$ has base two. The Kullback-Leibler distance is but one example of the Ali-Silvey class of information-theoretic distance measures [1], which are defined to ...
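As an illustration of the asymmetry mentioned above, the following sketch evaluates a discrete analogue of Eq. (1) with base-two logarithms. The excerpt does not state the formula for the resistor-average distance; the parallel-resistor combination 1/R = 1/D(p1||p0) + 1/D(p0||p1) used below is an assumption suggested by the name, and the function names `kl_distance` and `resistor_average` are hypothetical.

```python
import math

def kl_distance(p1, p0):
    """Discrete analogue of Eq. (1): sum over x of p1(x) * log2(p1(x) / p0(x))."""
    total = 0.0
    for a, b in zip(p1, p0):
        if a == 0:
            continue            # a zero-probability term contributes 0
        if b == 0:
            return math.inf     # p1 puts mass where p0 has none
        total += a * math.log2(a / b)
    return total

def resistor_average(p1, p0):
    """Assumed form: 1/R = 1/D(p1||p0) + 1/D(p0||p1), like parallel resistors."""
    d10 = kl_distance(p1, p0)
    d01 = kl_distance(p0, p1)
    if d10 == 0 or d01 == 0:
        return 0.0
    if math.isinf(d10) and math.isinf(d01):
        return math.inf
    return 1.0 / (1.0 / d10 + 1.0 / d01)

p1 = [0.5, 0.5]
p0 = [0.9, 0.1]
forward = kl_distance(p1, p0)    # D(p1 || p0)
backward = kl_distance(p0, p1)   # D(p0 || p1), a different number
symmetric = resistor_average(p1, p0)
```

Swapping the arguments changes `kl_distance` but leaves `resistor_average` unchanged, which is the property the abstract highlights.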
Hilbert Space Embeddings and Metrics on Probability Measures
Cited by 21 (9 self)
A Hilbert space embedding for probability measures has recently been proposed, with applications including dimensionality reduction, homogeneity testing, and independence testing. This embedding represents any probability measure as a mean element in a reproducing kernel Hilbert space (RKHS). A pseudometric on the space of probability measures can be defined as the distance between distribution embeddings: we denote this as γ_k, indexed by the kernel function k that defines the inner product in the RKHS. We present three theoretical properties of γ_k. First, we consider the question of determining the conditions on the kernel k for which γ_k is a metric: such k are denoted characteristic kernels. Unlike pseudometrics, a metric is zero only when two distributions coincide, thus ensuring the RKHS embedding maps all distributions uniquely (i.e., the embedding is injective). While previously published conditions may apply only in restricted circumstances (e.g., on compact domains), and are difficult to check, our conditions are straightforward and intuitive: integrally strictly positive definite kernels are characteristic. Alternatively, if a bounded continuous kernel is translation-invariant on R^d, then it is characteristic if and only if the support of its Fourier transform is the entire R^d.
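The distance between mean embeddings can be estimated from samples using only kernel evaluations. The sketch below is a minimal, assumed illustration on one-dimensional data: it uses a Gaussian kernel (translation-invariant, with a Fourier transform supported on the whole real line) and the simple biased plug-in estimate of γ_k squared. The function names `gaussian_kernel` and `mmd_squared` and the bandwidth `sigma` are hypothetical choices, not taken from the paper.

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel on the real line; sigma is an assumed bandwidth."""
    return math.exp(-((x - y) ** 2) / (2.0 * sigma ** 2))

def mmd_squared(xs, ys, kernel=gaussian_kernel):
    """Biased plug-in estimate of gamma_k^2 between two samples.

    Expands ||mean_x k(x, .) - mean_y k(y, .)||^2 in the RKHS into
    averages of kernel evaluations within and across the samples.
    """
    kxx = sum(kernel(a, b) for a in xs for b in xs) / len(xs) ** 2
    kyy = sum(kernel(a, b) for a in ys for b in ys) / len(ys) ** 2
    kxy = sum(kernel(a, b) for a in xs for b in ys) / (len(xs) * len(ys))
    return kxx + kyy - 2.0 * kxy

near = mmd_squared([0.0, 0.1, 0.2], [0.0, 0.1, 0.2])   # identical samples
far = mmd_squared([0.0, 0.1, 0.2], [5.0, 5.1, 5.2])    # well-separated samples
```

The estimate is zero for identical samples and grows as the samples separate; for a non-characteristic kernel, distinct distributions could also yield zero, which is the distinction the abstract draws between a pseudometric and a metric.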
Information, Divergence and Risk for Binary Experiments
Journal of Machine Learning Research, 2009
Cited by 17 (6 self)
We unify f-divergences, Bregman divergences, surrogate regret bounds, proper scoring rules, cost curves, ROC curves and statistical information. We do this by systematically studying integral and variational representations of these various objects and in so doing identify their primitives which all are related to cost-sensitive binary classification. As well as developing relationships between generative and discriminative views of learning, the new machinery leads to tight and more general surrogate regret bounds and generalised Pinsker inequalities relating f-divergences to variational divergence. The new viewpoint also illuminates existing algorithms: it provides a new derivation of Support Vector Machines in terms of divergences and relates Maximum Mean Discrepancy to Fisher Linear Discriminants.