Results 1–8 of 8
Model Selection by Normalized Maximum Likelihood
2005
Abstract
Cited by 22 (9 self)
The Minimum Description Length (MDL) principle is an information-theoretic approach to inductive inference that originated in algorithmic coding theory. In this approach, data are viewed as codes to be compressed by the model. From this perspective, models are compared on their ability to compress a data set by extracting useful information in the data apart from random noise. The goal of model selection is to identify the model, from a set of candidate models, that permits the shortest description length (code) of the data. Since Rissanen originally formalized the problem using the crude ‘two-part code’ MDL method in the 1970s, many significant strides have been made, especially in the 1990s, culminating in the development of the refined ‘universal code’ MDL method, dubbed Normalized Maximum Likelihood (NML). It represents an elegant solution to the model selection problem. The present paper provides a tutorial review of these latest developments with a special focus on NML. An application example of NML in cognitive modeling is also provided.
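The NML criterion described in this abstract can be made concrete with a small sketch. The code below is illustrative only (the function names and the choice of a two-outcome Bernoulli model are our assumptions, not taken from the paper): it computes the NML code length, i.e. the negative log maximized likelihood plus the log of the parametric-complexity normalizer, where the normalizer sums the maximized likelihood over all possible binary sequences of length n (grouped by their count of ones).

```python
from math import comb, log2

def bernoulli_ml(k, n):
    """Maximized Bernoulli likelihood of a binary sequence with k ones out of n."""
    if k == 0 or k == n:
        return 1.0
    p = k / n
    return p**k * (1 - p)**(n - k)

def nml_complexity(n):
    """Normalizer sum over all y^n of p(y^n | theta_hat(y^n)).
    Sequences with the same count of ones share the same value."""
    return sum(comb(n, k) * bernoulli_ml(k, n) for k in range(n + 1))

def nml_code_length(k, n):
    """NML description length in bits: -log2 p(x | theta_hat) + log2 C(n)."""
    return -log2(bernoulli_ml(k, n)) + log2(nml_complexity(n))

# Example: code length of a 10-bit sequence containing 7 ones.
L = nml_code_length(7, 10)
```

The model with the smallest such code length over the observed data would be selected; richer models pay a larger complexity term log2 C(n).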
An empirical study of MDL model selection with infinite parametric complexity
J. Mathematical Psychology, 2006
Prequential plugin codes that achieve optimal redundancy rates even if the model is wrong. arXiv:1002.0757
2010
Abstract
Cited by 2 (2 self)
We analyse the prequential plug-in codes relative to one-parameter exponential families M. We show that if data are sampled i.i.d. from some distribution outside M, then the redundancy of any plug-in prequential code grows at a rate larger than (1/2) ln n in the worst case. This means that plug-in codes, such as the Rissanen-Dawid ML code, may behave worse than other important universal codes such as the two-part MDL, Shtarkov, and Bayes codes, for which the redundancy is always (1/2) ln n + O(1). However, we also show that a slight modification of the ML plug-in code, “almost” in the model, does achieve the optimal redundancy even if the true distribution is outside M.
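The plug-in codes under discussion can be sketched for a Bernoulli family (a minimal illustration under our own assumptions; the function names and smoothing scheme are not from the paper). The predictor codes each outcome with a smoothed ML estimate of the probability of a one; with a = 0.5 this is the Krichevsky-Trofimov predictor, i.e. the Jeffreys-prior Bayes code from this literature, and the redundancy is the code length minus the shortest code length achievable within the model.

```python
from math import log

def plugin_code_length(xs, a=1.0):
    """Cumulative log loss (nats) of a prequential plug-in code for the
    Bernoulli family, predicting with (count + a) / (n + 2a) at each step."""
    total, ones = 0.0, 0
    for i, x in enumerate(xs):
        p1 = (ones + a) / (i + 2 * a)   # predicted P(next outcome = 1)
        total += -log(p1 if x == 1 else 1 - p1)
        ones += x
    return total

def best_in_model(xs):
    """Shortest code length achievable by any single Bernoulli distribution."""
    n, k = len(xs), sum(xs)
    if k in (0, n):
        return 0.0
    p = k / n
    return -(k * log(p) + (n - k) * log(1 - p))

xs = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
redundancy = plugin_code_length(xs) - best_in_model(xs)
```

The paper's point is about what happens to this redundancy, in expectation, when the data-generating distribution lies outside the Bernoulli family.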
Following the Flattened Leader
Abstract
Cited by 1 (1 self)
We analyze the regret, measured in terms of log loss, of the maximum likelihood (ML) sequential prediction strategy. This “follow the leader” strategy also defines one of the main versions of Minimum Description Length model selection. We proved in prior work for single-parameter exponential family models that (a) in the misspecified case, the redundancy of follow-the-leader is not (1/2) log n + O(1), as it is for other universal prediction strategies; as such, the strategy also yields suboptimal individual-sequence regret and inferior model selection performance; and (b) in general it is not possible to achieve the optimal redundancy when predictions are constrained to the distributions in the considered model. Here we describe a simple “flattening” of the sequential ML and related predictors that does achieve the optimal worst-case individual-sequence regret of (k/2) log n + O(1) for k-parameter exponential family models on bounded outcome spaces; for unbounded spaces, we provide almost-sure results. Simulations show a major improvement of the resulting model selection criterion.
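A toy Bernoulli example shows why the plain follow-the-leader strategy needs some form of flattening. The mixing-with-uniform scheme below is only a stand-in for illustration (it is our assumption, not the paper's actual construction, and all names are ours), but it exhibits the failure mode: the pure ML predictor can assign probability zero to an outcome that then occurs, incurring infinite log loss.

```python
from math import log

def ftl_prediction(ones, n):
    """Follow-the-leader: predict with the ML Bernoulli estimate so far
    (with an arbitrary 0.5 fallback before any data is seen)."""
    return 0.5 if n == 0 else ones / n

def flattened_prediction(ones, n, eps):
    """Illustrative 'flattening': mix the ML prediction with the uniform
    distribution. NOTE: this mixing scheme is an assumption made for
    illustration, not the paper's exact flattening."""
    return (1 - eps) * ftl_prediction(ones, n) + eps * 0.5

def code_length(xs, predict):
    """Cumulative log loss (nats) of a sequential prediction strategy."""
    total, ones = 0.0, 0
    for n, x in enumerate(xs):
        p1 = predict(ones, n)
        q = p1 if x == 1 else 1 - p1
        total += float('inf') if q == 0 else -log(q)
        ones += x
    return total

xs = [1, 1, 1, 0, 1]
# Pure follow-the-leader assigns probability 0 to the first 0 in xs,
# giving infinite log loss; the flattened predictor stays finite.
```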
Maximum Likelihood vs. Sequential Normalized Maximum Likelihood in Online Density Estimation
Abstract
The paper considers sequential prediction of individual sequences with log loss (online density estimation) using an exponential family of distributions. We first analyze the regret of the maximum likelihood (“follow the leader”) strategy. We find that this strategy is (1) suboptimal and (2) requires an additional assumption about boundedness of the data sequence. We then show that both problems can be addressed by adding the currently predicted outcome to the calculation of the maximum likelihood, followed by normalization of the distribution. The strategy obtained in this way is known in the literature as the sequential normalized maximum likelihood or last-step minimax strategy. We show for the first time that for general exponential families, the regret is bounded by the familiar (k/2) log n and is thus optimal up to O(1). We also show the relationship to the Bayes strategy with Jeffreys’ prior.
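The last-step minimax construction the abstract describes — tentatively append each candidate outcome, maximize the likelihood over the extended sequence, then normalize — can be sketched for a Bernoulli model (the function names are ours, but the construction is the one stated in the abstract):

```python
def ml_likelihood(k, n):
    """Maximized Bernoulli likelihood of a sequence with k ones out of n."""
    if n == 0 or k in (0, n):
        return 1.0
    p = k / n
    return p**k * (1 - p)**(n - k)

def snml_prediction(ones, n):
    """Sequential NML (last-step minimax) probability that the next outcome
    is 1, after seeing `ones` ones among n outcomes: append each candidate
    outcome, maximize the likelihood of the extended sequence, normalize."""
    w1 = ml_likelihood(ones + 1, n + 1)   # sequence extended by a 1
    w0 = ml_likelihood(ones, n + 1)       # sequence extended by a 0
    return w1 / (w0 + w1)

# Example: predicted probability of a 1 after seeing 3 ones in 4 outcomes.
p_next = snml_prediction(3, 4)
```

Unlike the plain ML plug-in, this prediction is always strictly between 0 and 1, so the incurred log loss is always finite.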
Bounds on Individual Risk for Log-loss Predictors
Abstract
In sequential prediction with log loss, as well as density estimation with risk measured by KL divergence, one is often interested in the expected instantaneous loss or, equivalently, the individual risk at a given fixed sample size n. For Bayesian prediction and estimation methods, it is often easy to obtain bounds on the cumulative risk. Such results are based on bounding the individual-sequence regret, a technique that is very well known in the COLT community. Motivated by the ease of the proofs for the cumulative risk, our open problem is to use the results on cumulative risk to prove corresponding individual-risk bounds.
Background
We consider sequential prediction (online learning) with log loss (Cesa-Bianchi and Lugosi, 2006). In each iteration n = 1, 2, ..., after observing a sequence of past outcomes x^n = x_1, x_2, ..., x_n ∈ X^n, a prediction strategy assigns a probability distribution on X, denoted P̂(· | x^n). Then a next outcome x_{n+1} is revealed and the strategy incurs the log loss −log P̂(x_{n+1} | x^n). The goal of the prediction strategy is to be not much worse than the best in a reference set of distributions (also called “experts”), which we call the model M.
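The cumulative quantities in this setting can be sketched for the Bernoulli model with the Krichevsky-Trofimov predictor, i.e. the Bayes strategy with Jeffreys' prior (a minimal illustration; the function names are our own). The individual-sequence regret is the strategy's cumulative log loss minus that of the best single distribution in the model in hindsight; for KT it is (1/2) ln n + O(1) on every sequence.

```python
from math import log

def kt_code_length(xs):
    """Cumulative log loss (nats) of the Krichevsky-Trofimov predictor,
    which predicts P(next = 1) = (ones + 1/2) / (n + 1)."""
    total, ones = 0.0, 0
    for n, x in enumerate(xs):
        p1 = (ones + 0.5) / (n + 1)
        total += -log(p1 if x == 1 else 1 - p1)
        ones += x
    return total

def best_bernoulli(xs):
    """Cumulative log loss of the best single Bernoulli in hindsight."""
    n, k = len(xs), sum(xs)
    if k in (0, n):
        return 0.0
    p = k / n
    return -(k * log(p) + (n - k) * log(1 - p))

xs = [1, 0, 1, 1, 0, 1, 0, 1]
regret = kt_code_length(xs) - best_bernoulli(xs)
```

The open problem stated in the abstract is to turn bounds on such cumulative quantities into bounds on the individual risk at a fixed sample size n.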