Results 1–6 of 6
Conditional NML Universal Models
Abstract

Cited by 9 (1 self)
The NML (Normalized Maximum Likelihood) universal model has certain minimax-optimal properties, but it has two shortcomings: the normalizing coefficient can be evaluated in closed form only for special model classes, and it does …
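For context, the NML distribution referred to above is standard and can be written as

```latex
P_{\mathrm{NML}}(x^n) \;=\; \frac{P\bigl(x^n \mid \hat{\theta}(x^n)\bigr)}{\sum_{y^n} P\bigl(y^n \mid \hat{\theta}(y^n)\bigr)}
```

where \(\hat{\theta}(\cdot)\) is the maximum likelihood estimator; the denominator is the normalizing coefficient that, as the abstract notes, admits a closed form only for special model classes.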
Bayesian Network Structure Learning using Factorized NML Universal Models, 2008
Abstract

Cited by 7 (4 self)
Universal codes/models can be used for data compression and model selection by the minimum description length (MDL) principle. For many interesting model classes, such as Bayesian networks, the minimax regret optimal normalized maximum likelihood (NML) universal model is computationally very demanding. We suggest a computationally feasible alternative to NML for Bayesian networks, the factorized NML universal model, where the normalization is done locally for each variable. This can be seen as an approximate sum-product algorithm. We show that this new universal model performs extremely well in model selection, compared to the existing state-of-the-art, even for small sample sizes.
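As a concrete illustration of the local normalization idea, the NML normalizer for a single binary variable can be computed by summing over all possible counts; this per-variable sum is the kind of building block that fNML normalizes locally instead of evaluating one intractable global sum. A minimal sketch (the function name is illustrative, not from the paper):

```python
import math

def bernoulli_nml_normalizer(n):
    # C_n = sum_{k=0}^{n} C(n, k) * (k/n)^k * ((n-k)/n)^(n-k)
    # Python evaluates 0.0 ** 0 as 1.0, which handles the k = 0 and k = n terms.
    return sum(math.comb(n, k) * (k / n) ** k * ((n - k) / n) ** (n - k)
               for k in range(n + 1))
```

For a Bayesian network, fNML would compute such a normalizer separately for each variable (and each parent configuration) rather than over all joint data tables at once.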
Freezing and sleeping: Tracking experts that learn by evolving past posteriors. Extended Abstracts (Local Dissemination), 2009
Abstract

Cited by 4 (3 self)
A problem posed by Freund is how to efficiently track a small pool of experts out of a much larger set. This problem was solved when Bousquet and Warmuth introduced their mixing past posteriors (MPP) algorithm in 2001. In Freund’s problem the experts would normally be considered black boxes. However, in this paper we re-examine Freund’s problem in the case where the experts have internal structure that enables them to learn. In this case the problem has two possible interpretations: should the experts learn from all data, or only from the subsequence on which they are being tracked? The MPP algorithm solves the first case. Our contribution is to generalise MPP to address the second option. The results we obtain apply to any expert structure that can be formalised using (expert) hidden Markov models. Curiously enough, for our interpretation there are two natural reference schemes: freezing and sleeping. For each scheme, we provide an efficient prediction strategy and prove the relevant loss bound.
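MPP builds on weight-sharing schemes such as Herbster and Warmuth's fixed share. A minimal fixed-share sketch (a simpler relative of MPP, not MPP itself; parameter names are illustrative) conveys the tracking mechanism that the paper's freezing and sleeping schemes refine:

```python
def fixed_share(expert_probs, outcomes, alpha=0.1):
    """Track binary-outcome experts with fixed-share weight mixing.

    expert_probs[t][i]: probability expert i assigns to outcome 1 at time t.
    Returns the posterior weight vector after each round.
    """
    n = len(expert_probs[0])
    w = [1.0 / n] * n
    history = []
    for probs, y in zip(expert_probs, outcomes):
        # Bayes step: reweight each expert by the likelihood of the outcome.
        w = [wi * (p if y == 1 else 1.0 - p) for wi, p in zip(w, probs)]
        z = sum(w)
        w = [wi / z for wi in w]
        # Fixed-share step: each expert donates a fraction alpha of its weight,
        # redistributed uniformly, so a past leader can be dethroned quickly.
        w = [(1.0 - alpha) * wi + alpha / n for wi in w]
        history.append(list(w))
    return history
```

With two constant experts (one predicting outcome 1 with probability 0.9, the other with 0.1) and a sequence that switches from all-ones to all-zeros halfway, the posterior weight follows the currently better expert in each regime.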
Learning Locally Minimax Optimal Bayesian Networks
Abstract

Cited by 3 (1 self)
We consider the problem of learning Bayesian network models in a non-informative setting, where the only available information is a set of observational data, and no background knowledge is available. The problem can be divided into two different subtasks: learning the structure of the network (a set of independence relations), and learning the parameters of the model (that fix the probability distribution from the set of all distributions consistent with the chosen structure). There are not many theoretical frameworks that consistently handle both these problems together, the Bayesian framework being an exception. In this paper we propose an alternative, information-theoretic framework which sidesteps some of the technical problems facing the Bayesian approach. The framework is based on the minimax-optimal Normalized Maximum Likelihood (NML) distribution, which is motivated by the Minimum Description Length (MDL) principle. The resulting model selection criterion is consistent, and it provides a way to construct highly predictive Bayesian network models. Our empirical tests show that the proposed method compares favorably with alternative approaches in both model selection and prediction tasks.
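A criterion of this flavor — pick the structure with the smallest total NML code length — can be sketched for the toy case of two binary variables. This is a hand-rolled illustration under that assumption, not the paper's implementation; helper names are made up, and the dependent structure codes Y once per observed value of X:

```python
import math

def bern_nml_codelen(data):
    # NML code length (in nats) of a binary sequence under the Bernoulli model.
    if not data:
        return 0.0
    n, k = len(data), sum(data)
    maxlik = (k / n) ** k * ((n - k) / n) ** (n - k)   # 0.0 ** 0 == 1.0
    norm = sum(math.comb(n, j) * (j / n) ** j * ((n - j) / n) ** (n - j)
               for j in range(n + 1))
    return -math.log(maxlik) + math.log(norm)

def select_structure(xs, ys):
    # Structure A: X and Y independent -> encode X and Y separately.
    indep = bern_nml_codelen(xs) + bern_nml_codelen(ys)
    # Structure B: X -> Y -> encode Y separately for each observed value of X.
    dep = (bern_nml_codelen(xs)
           + bern_nml_codelen([y for x, y in zip(xs, ys) if x == 0])
           + bern_nml_codelen([y for x, y in zip(xs, ys) if x == 1]))
    return 'independent' if indep <= dep else 'dependent'
```

When Y simply copies X, the dependent structure compresses better; when Y is constant regardless of X, the extra per-parent code lengths make independence the cheaper description.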
Following the Flattened Leader
Abstract

Cited by 1 (1 self)
We analyze the regret, measured in terms of log loss, of the maximum likelihood (ML) sequential prediction strategy. This “follow the leader” strategy also defines one of the main versions of Minimum Description Length model selection. We proved in prior work for single-parameter exponential family models that (a) in the misspecified case, the redundancy of follow-the-leader is not (1/2) log n + O(1), as it is for other universal prediction strategies; as such, the strategy also yields suboptimal individual-sequence regret and inferior model selection performance; and (b) that in general it is not possible to achieve the optimal redundancy when predictions are constrained to the distributions in the considered model. Here we describe a simple “flattening” of the sequential ML and related predictors that does achieve the optimal worst-case individual-sequence regret of (k/2) log n + O(1) for k-parameter exponential family models for bounded outcome spaces; for unbounded spaces, we provide almost-sure results. Simulations show a major improvement of the resulting model selection criterion.
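To see why follow-the-leader misbehaves, compare the plug-in ML predictor with a smoothed variant on a short Bernoulli sequence. Laplace add-one smoothing is used here purely as a stand-in for the paper's flattening, and the initial prediction of 1/2 for the ML strategy is an arbitrary convention:

```python
import math

def cumulative_log_loss(outcomes, predict):
    # predict(history) returns the probability assigned to outcome 1.
    loss, history = 0.0, []
    for y in outcomes:
        p = predict(history)
        q = p if y == 1 else 1.0 - p
        loss += float('inf') if q == 0.0 else -math.log(q)
        history.append(y)
    return loss

def ml_predict(history):
    # "Follow the leader": plug in the maximum likelihood estimate.
    return sum(history) / len(history) if history else 0.5

def smoothed_predict(history):
    # Laplace add-one smoothing, a stand-in for the paper's flattening.
    return (sum(history) + 1) / (len(history) + 2)

seq = [0, 0, 0, 1, 0, 1]
```

On `seq`, the ML predictor assigns probability zero to the first 1 and suffers infinite log loss, while the smoothed predictor stays bounded, which is the failure mode a flattening is designed to repair.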
Learning Eigenvectors for Free
Abstract
We extend the classical problem of predicting a sequence of outcomes from a finite alphabet to the matrix domain. In this extension, the alphabet of n outcomes is replaced by the set of all dyads, i.e. outer products uu^T where u is a unit-length vector in R^n. Whereas in the classical case the goal is to learn (i.e. sequentially predict as well as) the best multinomial distribution, in the matrix case we wish to learn the density matrix that best explains the observed sequence of dyads. We show how popular online algorithms for learning a multinomial distribution can be extended to learn density matrices. Intuitively, learning the n^2 parameters of a density matrix is much harder than learning the n parameters of a multinomial distribution. Surprisingly, we prove that the worst-case regrets of certain classical algorithms and their matrix generalizations are identical. The reason is that the worst-case sequences of dyads share a common eigensystem, i.e. the worst-case regret is achieved in the classical case. So these matrix algorithms learn the eigenvectors without any regret.
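In the standard density-matrix formulation of this setting, the log loss of a density matrix W on a dyad uu^T is

```latex
\ell(W, u) \;=\; -\log\bigl(u^{\top} W u\bigr) \;=\; -\log \operatorname{tr}\!\bigl(W\,uu^{\top}\bigr)
```

which reduces to the multinomial log loss \(-\log w_i\) when W is diagonal and u is the i-th standard basis vector — consistent with the claim that the worst-case regret is already achieved in the classical case.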