Results 1  10
of
140
Bayesian Data Analysis
, 1995
"... I actually own a copy of Harold Jeffreys’s Theory of Probability but have only read small bits of it, most recently over a decade ago to confirm that, indeed, Jeffreys was not too proud to use a classical chisquared pvalue when he wanted to check the misfit of a model to data (Gelman, Meng and Ste ..."
Abstract

Cited by 2132 (59 self)
 Add to MetaCart
(Show Context)
I actually own a copy of Harold Jeffreys’s Theory of Probability but have only read small bits of it, most recently over a decade ago to confirm that, indeed, Jeffreys was not too proud to use a classical chisquared pvalue when he wanted to check the misfit of a model to data (Gelman, Meng and Stern, 2006). I do, however, feel that it is important to understand where our probability models come from, and I welcome the opportunity to use the present article by Robert, Chopin and Rousseau as a platform for further discussion of foundational issues. 2 In this brief discussion I will argue the following: (1) in thinking about prior distributions, we should go beyond Jeffreys’s principles and move toward weakly informative priors; (2) it is natural for those of us who work in social and computational sciences to favor complex models, contra Jeffreys’s preference for simplicity; and (3) a key generalization of Jeffreys’s ideas is to explicitly include model checking in the process of data analysis.
The minimum description length principle in coding and modeling
 IEEE TRANS. INFORM. THEORY
, 1998
"... We review the principles of Minimum Description Length and Stochastic Complexity as used in data compression and statistical modeling. Stochastic complexity is formulated as the solution to optimum universal coding problems extending Shannon’s basic source coding theorem. The normalized maximized ..."
Abstract

Cited by 396 (17 self)
 Add to MetaCart
We review the principles of Minimum Description Length and Stochastic Complexity as used in data compression and statistical modeling. Stochastic complexity is formulated as the solution to optimum universal coding problems extending Shannon’s basic source coding theorem. The normalized maximized likelihood, mixture, and predictive codings are each shown to achieve the stochastic complexity to within asymptotically vanishing terms. We assess the performance of the minimum description length criterion both from the vantage point of quality of data compression and accuracy of statistical inference. Context tree modeling, density estimation, and model selection in Gaussian linear regression serve as examples.
Model Selection and the Principle of Minimum Description Length
 Journal of the American Statistical Association
, 1998
"... This paper reviews the principle of Minimum Description Length (MDL) for problems of model selection. By viewing statistical modeling as a means of generating descriptions of observed data, the MDL framework discriminates between competing models based on the complexity of each description. This ..."
Abstract

Cited by 195 (8 self)
 Add to MetaCart
(Show Context)
This paper reviews the principle of Minimum Description Length (MDL) for problems of model selection. By viewing statistical modeling as a means of generating descriptions of observed data, the MDL framework discriminates between competing models based on the complexity of each description. This approach began with Kolmogorov's theory of algorithmic complexity, matured in the literature on information theory, and has recently received renewed interest within the statistics community. In the pages that follow, we review both the practical as well as the theoretical aspects of MDL as a tool for model selection, emphasizing the rich connections between information theory and statistics. At the boundary between these two disciplines, we find many interesting interpretations of popular frequentist and Bayesian procedures. As we will see, MDL provides an objective umbrella under which rather disparate approaches to statistical modeling can coexist and be compared. We illustrate th...
InformationTheoretic Determination of Minimax Rates of Convergence
 Ann. Stat
, 1997
"... In this paper, we present some general results determining minimax bounds on statistical risk for density estimation based on certain informationtheoretic considerations. These bounds depend only on metric entropy conditions and are used to identify the minimax rates of convergence. ..."
Abstract

Cited by 158 (24 self)
 Add to MetaCart
In this paper, we present some general results determining minimax bounds on statistical risk for density estimation based on certain informationtheoretic considerations. These bounds depend only on metric entropy conditions and are used to identify the minimax rates of convergence.
Schapire R., Bounds on the Sample Complexity of Bayesian Learning Using Information Theory and the VC Dimension
"... ..."
(Show Context)
A Bayesian/information theoretic model of learning to learn via multiple task sampling
 Machine Learning
, 1997
"... Abstract. A Bayesian model of learning to learn by sampling from multiple tasks is presented. The multiple tasks are themselves generated by sampling from a distribution over an environment of related tasks. Such an environment is shown to be naturally modelled within a Bayesian context by the conce ..."
Abstract

Cited by 95 (2 self)
 Add to MetaCart
(Show Context)
Abstract. A Bayesian model of learning to learn by sampling from multiple tasks is presented. The multiple tasks are themselves generated by sampling from a distribution over an environment of related tasks. Such an environment is shown to be naturally modelled within a Bayesian context by the concept of an objective prior distribution. It is argued that for many common machine learning problems, although in general we do not know the true (objective) prior for the problem, we do have some idea of a set of possible priors to which the true prior belongs. It is shown that under these circumstances a learner can use Bayesian inference to learn the true prior by learning sufficiently many tasks from the environment. In addition, bounds are given on the amount of information required to learn a task when it is simultaneously learnt with several other tasks. The bounds show that if the learner has little knowledge of the true prior, but the dimensionality of the true prior is small, then sampling multiple tasks is highly advantageous. The theory is applied to the problem of learning a common feature set or equivalently a lowdimensionalrepresentation (LDR) for an environment of related tasks.
Strong Optimality of the Normalized ML Models as Universal Codes
 IEEE Transactions on Information Theory
, 2000
"... We show that the normalized maximum likelihood (NML) distribution as a universal code for a parametric class of models is closest to the negative logarithm of the maximized likelihood in the mean code length distance, where the mean is taken with respect to the worst case model inside or outside ..."
Abstract

Cited by 82 (8 self)
 Add to MetaCart
We show that the normalized maximum likelihood (NML) distribution as a universal code for a parametric class of models is closest to the negative logarithm of the maximized likelihood in the mean code length distance, where the mean is taken with respect to the worst case model inside or outside the parametric class. We strengthen this result by showing that the same minmax bound results even when the data generating models are restricted to be most `benevolent' in minimizing the mean of the negative logarithm of the maximized likelihood. Further, we show for the class of exponential models that the bound cannot be beaten in essence by any code except when the mean is taken with respect to the most benevolent data generating models in a set of vanishing size. These results allow us to decompose the data into two parts, the first having all the useful information that can be extracted with the parametric models and the rest which has none. We also show that, if we change Ak...
The Horseshoe Estimator for Sparse Signals
, 2008
"... This paper proposes a new approach to sparsity called the horseshoe estimator. The horseshoe is a close cousin of other widely used Bayes rules arising from, for example, doubleexponential and Cauchy priors, in that it is a member of the same family of multivariate scale mixtures of normals. But th ..."
Abstract

Cited by 79 (15 self)
 Add to MetaCart
This paper proposes a new approach to sparsity called the horseshoe estimator. The horseshoe is a close cousin of other widely used Bayes rules arising from, for example, doubleexponential and Cauchy priors, in that it is a member of the same family of multivariate scale mixtures of normals. But the horseshoe enjoys a number of advantages over existing approaches, including its robustness, its adaptivity to different sparsity patterns, and its analytical tractability. We prove two theorems that formally characterize both the horseshoe’s adeptness at large outlying signals, and its superefficient rate of convergence to the correct estimate of the sampling density in sparse situations. Finally, using a combination of real and simulated data, we show that the horseshoe estimator corresponds quite closely to the answers one would get by pursuing a full Bayesian modelaveraging approach using a discrete mixture prior to model signals and noise.
Statistical Inference, Occam’s Razor, and Statistical Mechanics on the Space of Probability Distributions
, 1997
"... The task of parametric model selection is cast in terms of a statistical mechanics on the space of probability distributions. Using the techniques of lowtemperature expansions, I arrive at a systematic series for the Bayesian posterior probability of a model family that significantly extends known ..."
Abstract

Cited by 77 (3 self)
 Add to MetaCart
The task of parametric model selection is cast in terms of a statistical mechanics on the space of probability distributions. Using the techniques of lowtemperature expansions, I arrive at a systematic series for the Bayesian posterior probability of a model family that significantly extends known results in the literature. In particular, I arrive at a precise understanding of how Occam’s razor, the principle that simpler models should be preferred until the data justify more complex models, is automatically embodied by probability theory. These results require a measure on the space of model parameters and I derive and discuss an interpretation of Jeffreys ’ prior distribution as a uniform prior over the distributions indexed by a family. Finally, I derive a theoretical index of the complexity of a parametric family relative to some true distribution that I call the razor of the model. The form of the razor immediately suggests several interesting questions in the theory of learning that can be studied using the techniques of statistical mechanics.
Hypothesis Selection and Testing by the MDL Principle
 The Computer Journal
, 1998
"... ses where the variance is known or taken as a parameter. 1. INTRODUCTION Although the term `hypothesis' in statistics is synonymous with that of a probability `model' as an explanation of data, hypothesis testing is not quite the same problem as model selection. This is because usually ..."
Abstract

Cited by 72 (3 self)
 Add to MetaCart
(Show Context)
ses where the variance is known or taken as a parameter. 1. INTRODUCTION Although the term `hypothesis' in statistics is synonymous with that of a probability `model' as an explanation of data, hypothesis testing is not quite the same problem as model selection. This is because usually a particular hypothesis, called the `null hypothesis', has already been selected as a favorite model and it will be abandoned in favor of another model only when it clearly fails to explain the currently available data. In model selection, by contrast, all the models considered are regarded on the same footing and the objective is simply to pick the one that best explains the data. For the Bayesians certain models may be favored in terms of a prior probability, but in the minimum description length (MDL) approach to be outlined below, prior knowledge of any kind is to be used in selecting the tentative models, which in the end, unlike in the Bayesians' case, can and will be fitted to data