Results 1 – 8 of 8
Information-theoretic asymptotics of Bayes methods
IEEE Transactions on Information Theory, 1990
"... AbstractIn the absence of knowledge of the true density function, Bayesian models take the joint density function for a sequence of n random variables to be an average of densities with respect to a prior. We examine the relative entropy distance D,, between the true density and the Bayesian densit ..."
Abstract

Cited by 107 (10 self)
 Add to MetaCart
In the absence of knowledge of the true density function, Bayesian models take the joint density function for a sequence of n random variables to be an average of densities with respect to a prior. We examine the relative entropy distance D_n between the true density and the Bayesian density and show that the asymptotic distance is (d/2) log n + c, where d is the dimension of the parameter vector. Therefore, the relative entropy rate D_n/n converges to zero at rate (log n)/n. The constant c, which we explicitly identify, depends only on the prior density function and the Fisher information matrix evaluated at the true parameter value. Consequences are given for density estimation, universal data compression, composite hypothesis testing, and stock-market portfolio selection.
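As a worked illustration of the (d/2) log n + c asymptotics (a sketch constructed here, not taken from the paper): for a Bernoulli(theta0) model with a uniform prior, the Bayes mixture and hence D_n can be computed exactly, and the gap D_n − (1/2) log n settles near the Clarke–Barron constant c = (1/2) log(I(theta0)/(2πe)), where I is the Fisher information and the prior density w ≡ 1.

```python
import math

def bayes_redundancy(n, theta0):
    """Exact D_n between Bernoulli(theta0)^n and the uniform-prior Bayes mixture.

    For a sequence with k ones, the mixture is m(x^n) = 1 / ((n+1) * C(n, k)),
    so log p(x^n | theta0) - log m(x^n) depends on x^n only through k.
    """
    total = 0.0
    for k in range(n + 1):
        log_binom = (math.lgamma(n + 1) - math.lgamma(k + 1)
                     - math.lgamma(n - k + 1))
        log_lik = k * math.log(theta0) + (n - k) * math.log(1 - theta0)
        prob_k = math.exp(log_binom + log_lik)  # P(k ones under theta0)
        total += prob_k * (log_lik + math.log(n + 1) + log_binom)
    return total

theta0 = 0.3
n = 2000
d_n = bayes_redundancy(n, theta0)
# Clarke-Barron constant for d = 1, uniform prior (w = 1):
fisher = 1.0 / (theta0 * (1 - theta0))
c = 0.5 * math.log(fisher / (2 * math.pi * math.e))
print(d_n - 0.5 * math.log(n), c)  # the two values should nearly agree
```

The exact sum avoids any simulation noise; the residual difference between the two printed numbers is the o(1) term in the expansion.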
Mutual Information, Metric Entropy, and Cumulative Relative Entropy Risk
Annals of Statistics, 1996
"... Assume fP ` : ` 2 \Thetag is a set of probability distributions with a common dominating measure on a complete separable metric space Y . A state ` 2 \Theta is chosen by Nature. A statistician gets n independent observations Y 1 ; : : : ; Y n from Y distributed according to P ` . For each time ..."
Abstract

Cited by 39 (2 self)
 Add to MetaCart
Assume {P_θ : θ ∈ Θ} is a set of probability distributions with a common dominating measure on a complete separable metric space Y. A state θ ∈ Θ is chosen by Nature. A statistician gets n independent observations Y_1, ..., Y_n from Y distributed according to P_θ. For each time t between 1 and n, based on the observations Y_1, ..., Y_{t-1}, the statistician produces an estimated distribution P_t for P_θ and suffers a loss L(P_θ, P_t). The cumulative risk for the statistician is the average total loss up to time n. Of special interest in information theory, data compression, mathematical finance, computational learning theory, and statistical mechanics is the special case when the loss L(P_θ, P_t) is the relative entropy between the true distribution P_θ and the estimated distribution P_t. Here the cumulative Bayes risk from time 1 to n is the mutual information between the random parameter Θ and the observations Y_1, ..., Y_n.
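The closing identity — cumulative relative-entropy Bayes risk equals mutual information — follows from the chain rule; a sketch in the abstract's notation, taking P_t to be the Bayes predictive distribution:

```latex
% Cumulative Bayes risk under relative-entropy loss, with the Bayes
% predictive P_t(\cdot) = P(Y_t = \cdot \mid Y_1,\dots,Y_{t-1}):
\sum_{t=1}^{n} \mathbb{E}\, D\!\left(P_\theta \,\middle\|\, P_t\right)
  = \sum_{t=1}^{n} \mathbb{E} \log
      \frac{p(Y_t \mid \theta)}{p(Y_t \mid Y_1,\dots,Y_{t-1})}
  = \mathbb{E} \log \frac{p(Y_1,\dots,Y_n \mid \theta)}{p(Y_1,\dots,Y_n)}
  = I(\Theta;\, Y_1,\dots,Y_n).
```

The middle step telescopes by the chain rule for densities; the last equality is the definition of mutual information.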
General bounds on the mutual information between a parameter and n conditionally independent observations
In Proceedings of the Seventh Annual ACM Workshop on Computational Learning Theory, 1995
"... Each parameter in an abstract parameter space is associated with a di erent probability distribution on a set Y. A parameter is chosen at random from according to some a priori distribution on, and n conditionally independent random variables Y n = Y1�:::Yn are observed with common distribution dete ..."
Abstract

Cited by 15 (5 self)
 Add to MetaCart
Each parameter θ in an abstract parameter space Θ is associated with a different probability distribution on a set Y. A parameter θ is chosen at random from Θ according to some a priori distribution on Θ, and n conditionally independent random variables Y^n = Y_1, ..., Y_n are observed with common distribution determined by θ. We obtain bounds on the mutual information between the random variable Θ, giving the choice of parameter, and the random variable Y^n, giving the sequence of observations. We also bound the supremum of the mutual information over choices of the prior distribution on Θ. These quantities have applications in density estimation, computational learning theory, universal coding, hypothesis testing, and portfolio selection theory. The bounds are given in terms of the metric and information dimensions of the parameter space with respect to the Hellinger distance.
The Asymptotic Redundancy of Bayes Rules for Markov Chains
IEEE Trans. on Information Theory, 1997
"... Abstract We derive the asymptotics of the redundancy of Bayes rules for Markov chains with known order, extending the work of Barron and Clarke[6, 5] on i.i.d. sources. These asymptotics are derived when the actual source is in the class of OEmixing sources which includes Markov chains and functi ..."
Abstract

Cited by 11 (0 self)
 Add to MetaCart
We derive the asymptotics of the redundancy of Bayes rules for Markov chains with known order, extending the work of Barron and Clarke [6, 5] on i.i.d. sources. These asymptotics are derived when the actual source is in the class of φ-mixing sources, which includes Markov chains and functions of Markov chains. These results can be used to derive minimax asymptotic rates of convergence for universal codes when a Markov chain of known order is used as a model. Index terms: universal coding, Markov chains, Bayesian statistics, asymptotics.

1 Introduction

Given data generated by a known stochastic process, methods of encoding the data to achieve the minimal average coding length, such as Huffman and arithmetic coding, are known [7]. Universal codes [15, 8] encode data such that, asymptotically, the average per-symbol code length is equal to its minimal value (the entropy rate) for any source within a wide class. For the well-known Lempel-Ziv code, the average per-symbol code l...
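A minimal sketch of the redundancy in question, restricted to the order-0 (i.i.d. binary) special case rather than the paper's Markov setting: coding a binary sequence with k ones among n symbols via the uniform-prior Bayes mixture m(x^n) = 1/((n+1) C(n,k)) costs −log2 m(x^n) bits, and the excess over the empirical entropy n·h(k/n) grows like (1/2) log2 n, the per-parameter redundancy rate.

```python
import math

def code_length_bits(n, k):
    """Bayes-mixture code length -log2 m(x^n) for a binary sequence with k ones."""
    log2_binom = (math.lgamma(n + 1) - math.lgamma(k + 1)
                  - math.lgamma(n - k + 1)) / math.log(2)
    return math.log2(n + 1) + log2_binom

def empirical_entropy_bits(n, k):
    """Ideal code length n * h(k/n) if the empirical frequency were known."""
    p = k / n
    return -n * (p * math.log2(p) + (1 - p) * math.log2(1 - p))

n, k = 10000, 3000
redundancy = code_length_bits(n, k) - empirical_entropy_bits(n, k)
print(redundancy, 0.5 * math.log2(n))  # redundancy tracks (1/2) log2 n
```

With one free parameter the redundancy is roughly (1/2) log2 n bits; a Markov chain of order r over a b-letter alphabet multiplies this by its parameter count, which is the shape of the minimax rates the paper derives.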
Asymptotic Normality of the Posterior in Relative Entropy
IEEE Trans. Inform. Theory, 1999
"... We show that the relative entropy between a posterior density formed from a smooth likelihood and prior and a limiting normal form tends to zero in the independent and identically distributed case. The mode of convergence is in probability and in mean. Applications to codelengths in stochastic compl ..."
Abstract

Cited by 6 (0 self)
 Add to MetaCart
We show that the relative entropy between a posterior density, formed from a smooth likelihood and prior, and a limiting normal form tends to zero in the independent and identically distributed case. The mode of convergence is in probability and in mean. Applications to code lengths in stochastic complexity and to sample size selection are briefly discussed. Index Terms: posterior density, asymptotic normality, relative entropy. Revision submitted to Trans. Inform. Theory, 22 May 1998. This research was partially supported by NSERC Operating Grant 554891. The author is with the Department of Statistics, University of British Columbia, Room 333, 6356 Agricultural Road, Vancouver, BC, Canada V6T 1Z2.
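Schematically, the result is a Bernstein–von Mises-type statement in relative entropy; one standard form of the limiting normal is sketched below (the precise centering and scaling, and the regularity conditions, are as in the paper):

```latex
% Posterior density p(\cdot \mid X^n), estimator \hat\theta_n, Fisher
% information I(\cdot); the relative entropy to the normal form vanishes:
D\!\left( p(\cdot \mid X^n) \;\middle\|\;
   N\!\left(\hat\theta_n,\; \bigl(n\, I(\hat\theta_n)\bigr)^{-1}\right)
   \right) \;\longrightarrow\; 0
\quad \text{as } n \to \infty,
% in probability and in mean, in the i.i.d. case.
```

This is stronger than convergence in distribution: relative entropy controls code-length differences directly, which is what makes the stochastic-complexity application immediate.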
Mutual Information, Metric Entropy, and Risk in Estimation of Probability Distributions
1996
"... Assume fP ` : ` 2 \Thetag is a set of probability distributions with a common dominating measure on a complete separable metric space Y . A state ` 2 \Theta is chosen by Nature. A statistician gets n independent observations Y 1 ; : : : ; Y n from Y distributed according to P ` . For each time ..."
Abstract

Cited by 3 (2 self)
 Add to MetaCart
Assume {P_θ : θ ∈ Θ} is a set of probability distributions with a common dominating measure on a complete separable metric space Y. A state θ ∈ Θ is chosen by Nature. A statistician gets n independent observations Y_1, ..., Y_n from Y distributed according to P_θ. For each time t between 1 and n, based on the observations Y_1, ..., Y_{t-1}, the statistician produces an estimated distribution P_t for P_θ and suffers a loss L(P_θ, P_t). The cumulative risk for the statistician is the average total loss up to time n. Of special interest in information theory, data compression, mathematical finance, computational learning theory, and statistical mechanics is the special case when the loss L(P_θ, P_t) is the relative entropy between the true distribution P_θ and the estimated distribution P_t. Here the cumulative Bayes risk from time 1 to n is the mutual information between the random parameter Θ and the observations Y_1, ..., Y_n.
Mutual information and Bayes methods for learning a distribution
In Proc. Workshop on the Theory of Neural Networks: The Statistical Mechanics Perspective, World Scientific, 1995
"... Each parameter w in an abstract parameter space W is associated with a di erent probability distribution on a set Y. A parameter w is chosen at random from W according to some a priori distribution on W,andn conditionally independent random variables Y n = Y1�:::Y n are observed with common distribu ..."
Abstract

Cited by 2 (1 self)
 Add to MetaCart
Each parameter w in an abstract parameter space W is associated with a different probability distribution on a set Y. A parameter w is chosen at random from W according to some a priori distribution on W, and n conditionally independent random variables Y^n = Y_1, ..., Y_n are observed with common distribution determined by w. Viewing W as a random variable, we obtain bounds on the mutual information between the random variable W, giving the choice of parameter, and the random variable Y^n, giving the sequence of observations. This quantity is the cumulative risk in predicting Y_1, ..., Y_n under the log loss, minus the risk if the true parameter w is known. The upper bounds are stated in terms of the Laplace transform of the rate of growth of the volume of relative entropy neighborhoods in the parameter space W, and the lower bounds are given in terms of the corresponding quantity using Hellinger neighborhoods. We show how these bounds can be interpreted in terms of an average local dimension of the parameter space W under suitable conditions.
Metric Entropy and Minimax Risk in Classification
In Lecture Notes in Comp. Sci.: Studies in Logic and …, 1997
"... . We apply recent results on the minimax risk in density estimation to the related problem of pattern classification. The notion of loss we seek to minimize is an information theoretic measure of how well we can predict the classification of future examples, given the classification of previously se ..."
Abstract
 Add to MetaCart
We apply recent results on the minimax risk in density estimation to the related problem of pattern classification. The notion of loss we seek to minimize is an information-theoretic measure of how well we can predict the classification of future examples, given the classification of previously seen examples. We give an asymptotic characterization of the minimax risk in terms of the metric entropy properties of the class of distributions that might be generating the examples. We then use these results to characterize the minimax risk in the special case of noisy two-valued classification problems in terms of the Assouad density and the Vapnik-Chervonenkis dimension.

1 Introduction

The most basic problem in pattern recognition is the problem of classifying instances consisting of vectors of measurements into one of a finite number of types or classes. One standard example is the recognition of isolated capital characters, in which the instances are measurements on images of letters ...