Results 1 - 9 of 9
Mutual Information, Metric Entropy, and Cumulative Relative Entropy Risk
 Annals of Statistics
, 1996
"... Assume fP ` : ` 2 \Thetag is a set of probability distributions with a common dominating measure on a complete separable metric space Y . A state ` 2 \Theta is chosen by Nature. A statistician gets n independent observations Y 1 ; : : : ; Y n from Y distributed according to P ` . For each time ..."
Abstract

Cited by 40 (2 self)
Assume {P_θ : θ ∈ Θ} is a set of probability distributions with a common dominating measure on a complete separable metric space Y. A state θ ∈ Θ is chosen by Nature. A statistician gets n independent observations Y_1, ..., Y_n from Y distributed according to P_θ. For each time t between 1 and n, based on the observations Y_1, ..., Y_{t-1}, the statistician produces an estimated distribution P_t for P_θ and suffers a loss L(P_θ, P_t). The cumulative risk for the statistician is the average total loss up to time n. Of special interest in information theory, data compression, mathematical finance, computational learning theory and statistical mechanics is the special case when the loss L(P_θ, P_t) is the relative entropy between the true distribution P_θ and the estimated distribution P_t. Here the cumulative Bayes risk from time 1 to n is the mutual information between the random parameter Θ and the observations Y_1, ..., Y_n.
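The identity mentioned at the end of this abstract can be written out briefly. The LaTeX sketch below assumes a prior π on Θ and takes the estimated distribution P_t to be the Bayes predictive distribution; these are standard choices, not details quoted from the paper.

% Cumulative Bayes risk under relative entropy loss equals mutual information
% (chain rule); the prior \pi and the Bayes predictive P_t are assumed choices.
\begin{align*}
  L(P_\theta, P_t) &= D(P_\theta \,\|\, P_t)
      = \int \log\frac{dP_\theta}{dP_t}\, dP_\theta, \\
  P_t(\cdot) &= P(\,\cdot \mid Y_1,\dots,Y_{t-1})
      \quad \text{(Bayes predictive distribution)}, \\
  \sum_{t=1}^{n} \mathbb{E}\, D(P_\Theta \,\|\, P_t)
      &= \sum_{t=1}^{n} I(\Theta;\, Y_t \mid Y_1,\dots,Y_{t-1})
       = I(\Theta;\, Y_1,\dots,Y_n).
\end{align*}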
A General Minimax Result for Relative Entropy
 IEEE Trans. Inform. Theory
, 1996
"... : Suppose Nature picks a probability measure P ` on a complete separable metric space X at random from a measurable set P \Theta = fP ` : ` 2 \Thetag. Then, without knowing `, a statistician picks a measure Q on X. Finally, the statistician suffers a loss D(P ` jjQ), the relative entropy between P ..."
Abstract

Cited by 35 (2 self)
Suppose Nature picks a probability measure P_θ on a complete separable metric space X at random from a measurable set P_Θ = {P_θ : θ ∈ Θ}. Then, without knowing θ, a statistician picks a measure Q on X. Finally, the statistician suffers a loss D(P_θ || Q), the relative entropy between P_θ and Q. We show that the minimax and maximin values of this game are always equal, and there is always a minimax strategy in the closure of the set of all Bayes strategies. This generalizes previous results of Gallager, and of Davisson and Leon-Garcia. Index terms: minimax theorem, minimax redundancy, minimax risk, Bayes risk, relative entropy, Kullback-Leibler divergence, density estimation, source coding, channel capacity, computational learning theory. 1 Introduction. Consider a sequential estimation game in which a statistician is given n independent observations Y_1, ..., Y_n distributed according to an unknown distribution P̃_θ chosen at random by Nature from the set {P̃_θ : θ ∈ ...
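For reference, the game described here can be written compactly in LaTeX. The capacity form of the maximin value below uses the standard fact that the Bayes-optimal Q under a prior π is the mixture ∫ P_θ dπ(θ); that step is an assumption of this sketch rather than a quotation from the paper.

% Minimax and maximin values of the relative entropy game.
\begin{align*}
  \text{minimax:}\quad & \inf_{Q}\, \sup_{\theta \in \Theta}\, D(P_\theta \,\|\, Q), \\
  \text{maximin:}\quad & \sup_{\pi}\, \inf_{Q}\, \int_\Theta D(P_\theta \,\|\, Q)\, d\pi(\theta)
      \;=\; \sup_{\pi}\, I_\pi(\Theta; X).
\end{align*}
% The paper's result is that these two values coincide, with a minimax strategy
% in the closure of the Bayes strategies { \int P_\theta \, d\pi(\theta) : \pi }.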
How Well Do Bayes Methods Work for On-Line Prediction of {±1} Values?
 In Proceedings of the Third NEC Symposium on Computation and Cognition. SIAM
, 1992
"... We look at sequential classification and regression problems in which f\Sigma1glabeled instances are given online, one at a time, and for each new instance, before seeing the label, the learning system must either predict the label, or estimate the probability that the label is +1. We look at the ..."
Abstract

Cited by 18 (11 self)
We look at sequential classification and regression problems in which {±1}-labeled instances are given on-line, one at a time, and for each new instance, before seeing the label, the learning system must either predict the label or estimate the probability that the label is +1. We look at the performance of Bayes methods for this task, as measured by the total number of mistakes for the classification problem, and by the total log loss (or information gain) for the regression problem. Our results are given by comparing the performance of Bayes methods to the performance of a hypothetical "omniscient scientist" who is able to use extra information about the labeling process that would not be available in the standard learning protocol. The results show that Bayes methods perform only slightly worse than the omniscient scientist in many cases. These results generalize previous results of Haussler, Kearns and Schapire, and Opper and Haussler. 1 Introduction. Several recent papers in...
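The prediction protocol described above is easy to simulate. The Python sketch below is illustrative only (the hypothesis class of coin biases, the prior, and the function name are assumptions, not material from the paper): a Bayes mixture predicts the probability that the next ±1 label is +1, and its cumulative log loss is compared with that of an omniscient predictor that knows the true bias.

# Minimal sketch of sequential Bayes prediction of ±1 labels under log loss,
# compared to an omniscient predictor that knows the true label probability.
import math
import random

def bayes_vs_omniscient(biases, prior, true_bias, n, seed=0):
    rng = random.Random(seed)
    posterior = list(prior)
    bayes_loss = omni_loss = 0.0
    for _ in range(n):
        # Bayes predictive probability that the next label is +1.
        p_bayes = sum(w * b for w, b in zip(posterior, biases))
        label = +1 if rng.random() < true_bias else -1
        if label == +1:
            bayes_loss += -math.log(p_bayes)
            omni_loss += -math.log(true_bias)
            posterior = [w * b for w, b in zip(posterior, biases)]
        else:
            bayes_loss += -math.log(1.0 - p_bayes)
            omni_loss += -math.log(1.0 - true_bias)
            posterior = [w * (1.0 - b) for w, b in zip(posterior, biases)]
        z = sum(posterior)
        posterior = [w / z for w in posterior]   # renormalize the posterior
    return bayes_loss, omni_loss

if __name__ == "__main__":
    # Example: three candidate biases, uniform prior, true bias 0.7, 200 rounds.
    bayes, omni = bayes_vs_omniscient([0.3, 0.5, 0.7], [1/3, 1/3, 1/3], 0.7, 200)
    print(f"Bayes log loss: {bayes:.2f}  omniscient log loss: {omni:.2f}")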
General bounds on the mutual information between a parameter and n conditionally independent observations
 In Proceedings of the Seventh Annual ACM Workshop on Computational Learning Theory
, 1995
"... Each parameter in an abstract parameter space is associated with a di erent probability distribution on a set Y. A parameter is chosen at random from according to some a priori distribution on, and n conditionally independent random variables Y n = Y1�:::Yn are observed with common distribution dete ..."
Abstract

Cited by 14 (5 self)
Each parameter θ in an abstract parameter space Θ is associated with a different probability distribution on a set Y. A parameter θ is chosen at random from Θ according to some a priori distribution on Θ, and n conditionally independent random variables Y^n = Y_1, ..., Y_n are observed with common distribution determined by θ. We obtain bounds on the mutual information between the random variable Θ, giving the choice of parameter, and the random variable Y^n, giving the sequence of observations. We also bound the supremum of the mutual information over choices of the prior distribution on Θ. These quantities have applications in density estimation, computational learning theory, universal coding, hypothesis testing, and portfolio selection theory. The bounds are given in terms of the metric and information dimensions of the parameter space with respect to the Hellinger distance.
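The quantities named in this abstract can be summarized as follows; the displayed bound is only the schematic form commonly associated with these results (constants and exact conditions are omitted here and should be taken from the paper itself).

% Hellinger distance, metric entropy, and the schematic form of the bound.
\begin{align*}
  d_H^2(P_\theta, P_{\theta'})
      &= \tfrac{1}{2}\int \Bigl(\sqrt{dP_\theta} - \sqrt{dP_{\theta'}}\Bigr)^2
      && \text{(Hellinger distance)},\\
  \mathcal{H}(\varepsilon)
      &= \log N(\varepsilon, \Theta, d_H)
      && \text{(metric entropy: log $\varepsilon$-covering number)},\\
  \sup_{\pi}\, I_\pi(\Theta;\, Y^n)
      &\lesssim \inf_{\varepsilon > 0}
         \bigl( \mathcal{H}(\varepsilon) + n\,\varepsilon^2 \bigr)
      && \text{(schematic upper bound, up to constants)}.
\end{align*}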
Asymptotic Normality of the Posterior in Relative Entropy
 IEEE Trans. Inform. Theory
, 1999
"... We show that the relative entropy between a posterior density formed from a smooth likelihood and prior and a limiting normal form tends to zero in the independent and identically distributed case. The mode of convergence is in probability and in mean. Applications to codelengths in stochastic compl ..."
Abstract

Cited by 6 (0 self)
We show that the relative entropy between a posterior density formed from a smooth likelihood and prior and a limiting normal form tends to zero in the independent and identically distributed case. The mode of convergence is in probability and in mean. Applications to codelengths in stochastic complexity and to sample size selection are briefly discussed. Index Terms: Posterior density, asymptotic normality, relative entropy. Revision submitted to IEEE Trans. Inform. Theory, 22 May 1998. This research was partially supported by NSERC Operating Grant 554891. The author is with the Department of Statistics, University of British Columbia, Room 333, 6356 Agricultural Road, Vancouver, BC, Canada V6T 1Z2.
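Written out, the statement has roughly the following shape. The centering at the maximum likelihood estimate and the scaling by the Fisher information are the usual normal form, and the argument order of the relative entropy shown here is one conventional choice; both should be checked against the paper.

% Asymptotic normality of the posterior in relative entropy (schematic form).
\[
  D\Bigl( p(\theta \mid Y_1,\dots,Y_n) \,\Big\|\,
          N\bigl(\hat\theta_n,\, (n\, I(\hat\theta_n))^{-1}\bigr) \Bigr)
  \;\longrightarrow\; 0
  \quad \text{as } n \to \infty, \text{ in probability and in mean,}
\]
% where \hat\theta_n is the maximum likelihood estimate and I(\cdot) the Fisher
% information of a single observation.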
A Minimally Informative Likelihood for Decision Analysis: Robustness and Illustration
 Canadian Journal Statistics
, 1999
"... Here we use a class of likelihoods which makes weak assumptions on data generating mechanisms. These likelihoods may be appropriate for data sets where it is difficult to propose physically motivated models. We give some properties of these likelihoods, showing how they can be computed numerically b ..."
Abstract

Cited by 3 (2 self)
Here we use a class of likelihoods which make weak assumptions on data-generating mechanisms. These likelihoods may be appropriate for data sets where it is difficult to propose physically motivated models. We give some properties of these likelihoods, showing how they can be computed numerically by use of the Blahut-Arimoto algorithm. Then, in the context of a data set for which no plausible physical model is apparent, we show how these likelihoods give useful inferences for the location of a distribution. The plausibility of the inferences is enhanced by the extensive robustness analysis these likelihoods permit.
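The Blahut-Arimoto algorithm mentioned here is, in its standard form, an alternating-maximization scheme for channel capacity. The Python sketch below implements that generic iteration; the example channel matrix and the function name are illustrative assumptions, and the paper's specific adaptation for minimally informative likelihoods is not reproduced.

# Generic Blahut-Arimoto iteration for the capacity of a discrete channel.
import math

def blahut_arimoto(W, iters=200):
    """W[x][y] = p(y | x); returns (capacity in nats, capacity-achieving input)."""
    nx, ny = len(W), len(W[0])
    r = [1.0 / nx] * nx                       # input distribution, start uniform
    for _ in range(iters):
        # Output distribution q(y) = sum_x r(x) W(x, y).
        qy = [sum(r[x] * W[x][y] for x in range(nx)) for y in range(ny)]
        # Update r(x) proportional to exp( sum_y W(x,y) log[ r(x)W(x,y)/q(y) ] ).
        logr = []
        for x in range(nx):
            s = sum(W[x][y] * math.log(r[x] * W[x][y] / qy[y])
                    for y in range(ny) if W[x][y] > 0)
            logr.append(s)
        m = max(logr)
        r = [math.exp(v - m) for v in logr]
        z = sum(r)
        r = [v / z for v in r]
    # Mutual information achieved by r (equals capacity at convergence).
    qy = [sum(r[x] * W[x][y] for x in range(nx)) for y in range(ny)]
    cap = sum(r[x] * W[x][y] * math.log(W[x][y] / qy[y])
              for x in range(nx) for y in range(ny) if W[x][y] > 0)
    return cap, r

if __name__ == "__main__":
    bsc = [[0.9, 0.1], [0.1, 0.9]]            # binary symmetric channel, p = 0.1
    cap, r = blahut_arimoto(bsc)
    print(f"capacity ≈ {cap / math.log(2):.4f} bits, input distribution {r}")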
Mutual Information, Metric Entropy, and Risk in Estimation of Probability Distributions
, 1996
"... Assume fP ` : ` 2 \Thetag is a set of probability distributions with a common dominating measure on a complete separable metric space Y . A state ` 2 \Theta is chosen by Nature. A statistician gets n independent observations Y 1 ; : : : ; Y n from Y distributed according to P ` . For each time ..."
Abstract

Cited by 3 (2 self)
Assume {P_θ : θ ∈ Θ} is a set of probability distributions with a common dominating measure on a complete separable metric space Y. A state θ ∈ Θ is chosen by Nature. A statistician gets n independent observations Y_1, ..., Y_n from Y distributed according to P_θ. For each time t between 1 and n, based on the observations Y_1, ..., Y_{t-1}, the statistician produces an estimated distribution P_t for P_θ and suffers a loss L(P_θ, P_t). The cumulative risk for the statistician is the average total loss up to time n. Of special interest in information theory, data compression, mathematical finance, computational learning theory and statistical mechanics is the special case when the loss L(P_θ, P_t) is the relative entropy between the true distribution P_θ and the estimated distribution P_t. Here the cumulative Bayes risk from time 1 to n is the mutual information between the random parameter Θ and the observations Y_1, ..., Y_n.
Mutual information and Bayes methods for learning a distribution
 In Proc. Workshop on the Theory of Neural Networks: The Statistical Mechanics Perspective. World Scientific
, 1995
"... Each parameter w in an abstract parameter space W is associated with a di erent probability distribution on a set Y. A parameter w is chosen at random from W according to some a priori distribution on W,andn conditionally independent random variables Y n = Y1�:::Y n are observed with common distribu ..."
Abstract

Cited by 2 (1 self)
Each parameter w in an abstract parameter space W is associated with a different probability distribution on a set Y. A parameter w is chosen at random from W according to some a priori distribution on W, and n conditionally independent random variables Y^n = Y_1, ..., Y_n are observed with common distribution determined by w. Viewing W as a random variable, we obtain bounds on the mutual information between the random variable W, giving the choice of parameter, and the random variable Y^n, giving the sequence of observations. This quantity is the cumulative risk in predicting Y_1, ..., Y_n under the log loss, minus the risk if the true parameter w is known. The upper bounds are stated in terms of the Laplace transform of the rate of growth of the volume of relative entropy neighborhoods in the parameter space W, and the lower bounds are given in terms of the corresponding quantity using Hellinger neighborhoods. We show how these bounds can be interpreted in terms of an average local dimension of the parameter space W under suitable conditions.
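The sentence relating this mutual information to cumulative log-loss risk is an instance of the chain rule. The LaTeX sketch below assumes the Bayes predictive distribution P(· | Y^{t-1}) as the predictor, which is the standard choice and not a detail quoted from the paper.

% Chain rule: mutual information as cumulative log-loss regret of Bayes prediction.
\begin{align*}
  I(W;\, Y^n)
    &= \sum_{t=1}^{n} I\bigl(W;\, Y_t \mid Y^{t-1}\bigr) \\
    &= \sum_{t=1}^{n} \mathbb{E}\Bigl[ -\log P\bigl(Y_t \mid Y^{t-1}\bigr)
                                       + \log P\bigl(Y_t \mid W\bigr) \Bigr],
\end{align*}
% i.e. the expected total log loss of the Bayes predictor minus the expected
% total log loss of a predictor that knows the true parameter w.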
Metric Entropy and Minimax Risk in Classification
 In Lecture Notes in Comp. Sci.: Studies in Logic and
, 1997
"... . We apply recent results on the minimax risk in density estimation to the related problem of pattern classification. The notion of loss we seek to minimize is an information theoretic measure of how well we can predict the classification of future examples, given the classification of previously se ..."
Abstract
We apply recent results on the minimax risk in density estimation to the related problem of pattern classification. The notion of loss we seek to minimize is an information-theoretic measure of how well we can predict the classification of future examples, given the classification of previously seen examples. We give an asymptotic characterization of the minimax risk in terms of the metric entropy properties of the class of distributions that might be generating the examples. We then use these results to characterize the minimax risk in the special case of noisy two-valued classification problems in terms of the Assouad density and the Vapnik-Chervonenkis dimension. 1 Introduction. The most basic problem in pattern recognition is the problem of classifying instances consisting of vectors of measurements into one of a finite number of types or classes. One standard example is the recognition of isolated capital characters, in which the instances are measurements on images of letters ...
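One way to write the minimax quantity studied here is the following. The cumulative log-loss (relative entropy) form of the prediction loss is an assumption of this sketch, standing in for the "information-theoretic measure" named in the abstract, and the exact definition should be taken from the paper.

% Schematic minimax cumulative prediction risk for classification.
\[
  R_n \;=\; \inf_{\hat P}\; \sup_{P \in \mathcal{P}}\;
     \sum_{t=1}^{n} \mathbb{E}_P\,
       D\Bigl( P(Y_t \mid X_t) \,\Big\|\, \hat P_t\bigl(Y_t \mid X_t,\, (X,Y)^{t-1}\bigr) \Bigr),
\]
% where X_t is the t-th instance, Y_t its class label, \mathcal{P} the class of
% distributions that might generate the examples, and \hat P_t the learner's
% predicted label distribution after the first t-1 labeled examples.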