Results 1-10 of 47
Universal prediction
IEEE Transactions on Information Theory, 1998
Cited by 136 (11 self)
This paper consists of an overview on universal prediction from an information-theoretic perspective. Special attention is given to the notion of probability assignment under the self-information loss function, which is directly related to the theory of universal data compression. Both the probabilistic setting and the deterministic setting of the universal prediction problem are described with emphasis on the analogy and the differences between results in the two settings. Index Terms: Bayes envelope, entropy, finite-state machine, linear prediction, loss function, probability assignment, redundancy-capacity, stochastic complexity, universal coding, universal prediction.
Information-Theoretic Determination of Minimax Rates of Convergence
Ann. Stat., 1997
Cited by 98 (18 self)
In this paper, we present some general results determining minimax bounds on statistical risk for density estimation based on certain information-theoretic considerations. These bounds depend only on metric entropy conditions and are used to identify the minimax rates of convergence.
Learning Bayesian Belief Networks Based on the Minimum Description Length Principle: Basic Properties
1996
Cited by 51 (0 self)
This paper was partially presented at the 9th conference on Uncertainty in Artificial Intelligence, July 1993.
A strong version of the redundancy-capacity theorem of universal coding
IEEE Trans. Inform. Theory, 1995
Cited by 47 (9 self)
The capacity of the channel induced by a given class of sources is well known to be an attainable lower bound on the redundancy of universal codes with respect to this class, both in the minimax sense and in the Bayesian (maximin) sense. We show that this capacity is essentially a lower bound also in a stronger sense, that is, for “most” sources in the class. This result extends Rissanen’s lower bound for parametric families. We demonstrate the applicability of this result in several examples, e.g., parametric families with growing dimensionality, piecewise-fixed sources, arbitrarily varying sources, and noisy samples of learnable functions. Finally, we discuss implications of our results to statistical inference.
Predicting a Binary Sequence Almost as Well as the Optimal Biased Coin
1996
Cited by 40 (5 self)
We apply the exponential weight algorithm, introduced by Littlestone and Warmuth [17] and by Vovk [24], to the problem of predicting a binary sequence almost as well as the best biased coin. We first show that for the case of the logarithmic loss, the derived algorithm is equivalent to the Bayes algorithm with Jeffreys' prior, which was studied by Xie and Barron under probabilistic assumptions [26]. We derive a uniform bound on the regret which holds for any sequence. We also show that if the empirical distribution of the sequence is bounded away from 0 and from 1, then, as the length of the sequence increases to infinity, the difference between this bound and a corresponding bound on the average case regret of the same algorithm (which is asymptotically optimal in that case) is only 1/2. We show that this gap of 1/2 is necessary by calculating the regret of the minmax optimal algorithm for this problem and showing that the asymptotic upper bound is tight. We also study the application...
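The abstract above identifies its log-loss predictor with the Bayes algorithm under Jeffreys' prior. For a binary sequence that mixture has a well-known closed form (often called the Krichevsky-Trofimov estimator), which can be sketched as follows. This is a minimal illustration, not the paper's own code: the function names are hypothetical, losses are measured in nats, and the regret is taken against the best biased coin chosen in hindsight.

```python
import math

def kt_predict(ones, total):
    """Probability that the next bit is 1 under a Beta(1/2, 1/2) (Jeffreys) prior,
    after seeing `ones` ones among `total` bits."""
    return (ones + 0.5) / (total + 1.0)

def kt_log_loss(bits):
    """Cumulative log loss (in nats) of the sequential predictor on `bits`."""
    loss = 0.0
    ones = 0
    for t, b in enumerate(bits):
        p1 = kt_predict(ones, t)
        loss -= math.log(p1 if b == 1 else 1.0 - p1)
        ones += b
    return loss

def best_coin_log_loss(bits):
    """Log loss of the best fixed biased coin in hindsight: n times the
    empirical entropy of the sequence (with 0 * log 0 taken as 0)."""
    n = len(bits)
    ones = sum(bits)
    loss = 0.0
    for count in (ones, n - ones):
        if count > 0:
            loss -= count * math.log(count / n)
    return loss

def regret(bits):
    """Extra log loss paid by the sequential predictor over the best coin."""
    return kt_log_loss(bits) - best_coin_log_loss(bits)
```

Calling `regret([1, 0, 1, 1, 0, 1, 1, 1])` returns a small positive number. For this estimator the regret on any binary sequence of length n is known to stay below (1/2) ln n + ln 2, which matches the logarithmic growth the abstract refers to.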
Mutual Information, Metric Entropy, and Cumulative Relative Entropy Risk
Annals of Statistics, 1996
Cited by 39 (2 self)
Assume $\{P_\theta : \theta \in \Theta\}$ is a set of probability distributions with a common dominating measure on a complete separable metric space $Y$. A state $\theta \in \Theta$ is chosen by Nature. A statistician gets $n$ independent observations $Y_1, \ldots, Y_n$ from $Y$ distributed according to $P_\theta$. For each time $t$ between 1 and $n$, based on the observations $Y_1, \ldots, Y_{t-1}$, the statistician produces an estimated distribution $P_t$ for $P_\theta$, and suffers a loss $L(P_\theta, P_t)$. The cumulative risk for the statistician is the average total loss up to time $n$. Of special interest in information theory, data compression, mathematical finance, computational learning theory and statistical mechanics is the special case when the loss $L(P_\theta, P_t)$ is the relative entropy between the true distribution $P_\theta$ and the estimated distribution $P_t$. Here the cumulative Bayes risk from time 1 to $n$ is the mutual information between the random parameter $\Theta$ and the observations $Y_1, \ldots$ ...
On predictive distributions and Bayesian networks
Statistics and Computing, 2000
Cited by 38 (29 self)
In this paper we are interested in discrete prediction problems for a decision-theoretic setting, where the ...
A General Minimax Result for Relative Entropy
IEEE Trans. Inform. Theory, 1996
Cited by 35 (2 self)
Suppose Nature picks a probability measure $P_\theta$ on a complete separable metric space $X$ at random from a measurable set $P_\Theta = \{P_\theta : \theta \in \Theta\}$. Then, without knowing $\theta$, a statistician picks a measure $Q$ on $X$. Finally, the statistician suffers a loss $D(P_\theta \| Q)$, the relative entropy between $P_\theta$ and $Q$. We show that the minimax and maximin values of this game are always equal, and there is always a minimax strategy in the closure of the set of all Bayes strategies. This generalizes previous results of Gallager, and Davisson and Leon-Garcia. Index terms: minimax theorem, minimax redundancy, minimax risk, Bayes risk, relative entropy, Kullback-Leibler divergence, density estimation, source coding, channel capacity, computational learning theory. 1 Introduction. Consider a sequential estimation game in which a statistician is given $n$ independent observations $Y_1, \ldots, Y_n$ distributed according to an unknown distribution $\tilde{P}_\theta$ chosen at random by Nature from the set $\{\tilde{P}_\theta : \theta \in$ ...
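The relative-entropy game in the abstract above can be made concrete for a finite family over a finite alphabet. The sketch below is illustrative only: the two-distribution `family`, the uniform `prior`, and all function names are assumptions, not taken from the paper. It relies on a standard fact: for a fixed prior, the statistician's average loss is minimized by the prior-weighted mixture of the family, and any other choice `q` costs exactly an extra `kl(mixture, q)`.

```python
import math

def kl(p, q):
    """Relative entropy D(p || q) in nats for discrete distributions given as lists.
    Assumes q[x] > 0 wherever p[x] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mixture(family, prior):
    """Prior-weighted mixture of the distributions in `family`."""
    size = len(family[0])
    return [sum(w * p[x] for w, p in zip(prior, family)) for x in range(size)]

def bayes_risk(family, prior, q):
    """Average loss of playing q when Nature draws its distribution from `prior`."""
    return sum(w * kl(p, q) for w, p in zip(prior, family))

# Hypothetical example: two coins and a uniform prior over them.
family = [[0.9, 0.1], [0.2, 0.8]]
prior = [0.5, 0.5]
q_mix = mixture(family, prior)
```

Here `bayes_risk(family, prior, q_mix)` is the mutual information between Nature's choice and a single observation, and no other `q` achieves a smaller average loss under this prior.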
Efficient Bayesian Parameter Estimation in Large Discrete Domains
Advances in Neural Information Processing Systems, 1999
Cited by 29 (1 self)
In this paper we examine the problem of estimating the parameters of a multinomial distribution over a large number of discrete outcomes, most of which do not appear in the training data. We analyze this problem from a Bayesian perspective and develop a hierarchical prior that incorporates the assumption that the observed outcomes constitute only a small subset of the possible outcomes. We show how to efficiently perform exact inference with this form of hierarchical prior and compare our method to standard approaches and demonstrate its merits. Category: Algorithms and Architectures. Presentation preference: none. This paper was not submitted elsewhere nor will be submitted during NIPS review period. 1 Introduction. One of the most important problems in statistical inference is multinomial estimation: Given a past history of observations of independent trials with a discrete set of outcomes, predict the probability of the next trial. Such estimators are the basic building blocks in mor...
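The multinomial estimation task described above can be illustrated with a simple baseline. The sketch below is not the hierarchical prior developed in the paper. It uses a symmetric Dirichlet prior (additive smoothing), assumed here to be a typical baseline, and the function name and the `alpha` parameter are hypothetical.

```python
from collections import Counter

def dirichlet_predictive(observations, num_outcomes, alpha=0.5):
    """Return a function giving the predictive probability of each outcome
    under a symmetric Dirichlet(alpha) prior over `num_outcomes` outcomes.
    The caller is responsible for querying only outcomes in that universe."""
    counts = Counter(observations)
    total = len(observations)
    denom = total + alpha * num_outcomes

    def prob(outcome):
        # Unseen outcomes have count 0 and receive alpha / denom.
        return (counts.get(outcome, 0) + alpha) / denom

    return prob
```

With `prob = dirichlet_predictive(["a", "b", "a"], num_outcomes=4, alpha=1.0)`, the seen outcome `"a"` gets probability 3/7 and each unseen outcome gets 1/7. When most outcomes are unseen, a large share of the probability mass is spread over them, which is the regime the abstract is concerned with.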
Discrete Component Analysis
Subspace, Latent Structure and Feature Selection Techniques, 2006
Cited by 27 (4 self)
This article presents a unified theory for analysis of components in discrete data, and compares the methods with techniques such as independent component analysis, nonnegative matrix factorisation and latent Dirichlet allocation. The main families of algorithms discussed are a variational approximation, Gibbs sampling, and Rao-Blackwellised Gibbs sampling. Applications are presented for voting records from the United States Senate for 2003, and for the Reuters-21578 newswire collection.