Results 11-20 of 67
Combining Expert Advice Efficiently
Cited by 9 (5 self)
We show how models for prediction with expert advice can be defined concisely and clearly using hidden Markov models (HMMs); standard HMM algorithms can then be used to efficiently calculate how the expert predictions should be weighted according to the model. We cast many existing models as HMMs and recover the best known running times in each case. We also describe two new models: the switch distribution, which was recently developed to improve Bayesian/Minimum Description Length model selection, and a new generalisation of the fixed share algorithm based on run-length coding. We give loss bounds for all models and shed new light on the relationships between them.
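As a rough illustration of how expert weights can evolve under one of the models named in this abstract, here is a minimal sketch of a fixed-share style update under log loss. It is not the paper's HMM formulation: the function name `fixed_share_weights`, the input layout `expert_probs[t][i]`, and the particular share rule (a fraction `alpha` of each expert's posterior weight is spread evenly over the other experts) are assumptions made for this example.

```python
def fixed_share_weights(expert_probs, alpha):
    """Return the weight vector used at each trial (hypothetical helper).

    expert_probs[t][i] is the probability expert i assigned to the outcome
    that was actually observed at trial t. Requires at least two experts.
    """
    n = len(expert_probs[0])
    weights = [1.0 / n] * n          # uniform prior over experts
    history = []
    for probs in expert_probs:
        history.append(list(weights))
        # Loss update: Bayesian posterior over experts given the outcome.
        joint = [w * p for w, p in zip(weights, probs)]
        total = sum(joint)
        posterior = [j / total for j in joint]
        # Share update: keep (1 - alpha) of the weight, spread alpha of the
        # remaining mass evenly over the other n - 1 experts.
        weights = [(1 - alpha) * v + alpha * (1 - v) / (n - 1)
                   for v in posterior]
    return history
```

With `alpha = 0` the share step does nothing and the weights reduce to the ordinary Bayesian posterior over experts; a larger `alpha` lets weight flow back to experts that performed poorly in the past.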
The minimax strategy for Gaussian density estimation
In COLT, 2000
Cited by 8 (1 self)
We consider online density estimation with a Gaussian of unit variance. In each trial $t$ the learner predicts a mean $\theta_t$. Then it receives an instance $x_t$ chosen by the adversary and incurs loss $\frac{1}{2}(\theta_t - x_t)^2$. The performance of the learner is measured by the regret, defined as the total loss of the learner minus the total loss of the best mean parameter chosen offline. We assume that the horizon $T$ of the protocol is fixed and known to both parties. We give the optimal strategies for both the learner and the adversary. The value of the game is $\frac{1}{2}X^2(\ln T - \ln\ln T + O(\ln\ln T/\ln T))$, where $X$ is an upper bound of the 2-norm of instances. We also consider the standard algorithm that predicts with $\theta_t = \sum_{q=1}^{t-1} x_q/(t-1+a)$ for a fixed $a$. We show that the regret of this algorithm is $\frac{1}{2}X^2(\ln T - O(1))$ regardless of the choice of $a$. This work was done while Eiji Takimoto was on a sabbatical
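The "standard algorithm" and the regret described in this abstract can be sketched as follows for scalar instances. This is an illustrative sketch only: the function name `online_mean_regret` is hypothetical, instances are assumed to be plain floats rather than vectors, and `a` is assumed positive so the first prediction is well defined.

```python
def online_mean_regret(xs, a=1.0):
    """Regret of predicting theta_t = sum_{q<t} x_q / (t - 1 + a).

    Loss per trial is 0.5 * (theta_t - x_t)**2; the comparator is the
    single best mean chosen with hindsight. Assumes xs is non-empty, a > 0.
    """
    running_sum = 0.0
    learner_loss = 0.0
    for t, x in enumerate(xs, start=1):
        theta = running_sum / (t - 1 + a)   # prediction before seeing x_t
        learner_loss += 0.5 * (theta - x) ** 2
        running_sum += x
    best_mean = running_sum / len(xs)       # offline minimizer of squared loss
    best_loss = sum(0.5 * (best_mean - x) ** 2 for x in xs)
    return learner_loss - best_loss
```

For example, `online_mean_regret([1.0, 1.0], a=1.0)` evaluates the two predictions 0 and 0.5 against the offline mean 1.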
The Last-Step Minimax Algorithm
Pages 279-290 of: Proc. 11th International Conference on Algorithmic Learning Theory, 2000
Cited by 8 (1 self)
We consider online density estimation with a parameterized density from an exponential family. In each trial $t$ the learner predicts a parameter $\theta_t$. Then it receives an instance $x_t$ chosen by the adversary and incurs loss $-\ln p(x_t \mid \theta_t)$, which is the negative log-likelihood of $x_t$ w.r.t. the predicted density of the learner. The performance of the learner is measured by the regret, defined as the total loss of the learner minus the total loss of the best parameter chosen offline. We develop an algorithm called the Last-step Minimax Algorithm that predicts with the minimax optimal parameter assuming that the current trial is the last one. For one-dimensional exponential families, we give an explicit form of the prediction of the Last-step Minimax Algorithm and show that its regret is $O(\ln T)$, where $T$ is the number of trials. In particular, for Bernoulli density estimation the Last-step Minimax Algorithm is slightly better than the standard Laplace estimator. This work was done while...
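For reference, the Laplace estimator that this abstract uses as a baseline can be sketched as below for Bernoulli data under log loss. This sketch covers only the baseline, not the Last-step Minimax prediction, whose explicit form is not given in the abstract. The function names are hypothetical and the input is assumed to be a non-empty list of 0/1 values.

```python
from math import log

def laplace_log_loss(bits):
    """Total log loss of the Laplace (add-one) estimator on a 0/1 sequence."""
    ones = 0
    loss = 0.0
    for t, b in enumerate(bits):
        p_one = (ones + 1) / (t + 2)        # add-one estimate from t past bits
        loss -= log(p_one if b == 1 else 1 - p_one)
        ones += b
    return loss

def best_offline_log_loss(bits):
    """Total log loss of the best fixed Bernoulli parameter in hindsight."""
    n, k = len(bits), sum(bits)
    loss = 0.0
    for count in (k, n - k):
        if count > 0:                       # treat 0 * log(0) as 0
            loss -= count * log(count / n)
    return loss

def laplace_regret(bits):
    return laplace_log_loss(bits) - best_offline_log_loss(bits)
```

The probability the Laplace estimator assigns to a whole sequence with `k` ones out of `n` is `k! (n-k)! / (n+1)!`, which gives a quick sanity check on `laplace_log_loss`.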
Calculating the normalized maximum likelihood distribution for Bayesian forests
in Proc. IADIS International Conference on Intelligent Systems and Agents, 2007
Cited by 7 (6 self)
When learning Bayesian network structures from sample data, an important issue is how to evaluate the goodness of alternative network structures. Perhaps the most commonly used model (class) selection criterion is the marginal likelihood, which is obtained by integrating over a prior distribution for the model parameters. However, the problem of determining a reasonable prior for the parameters is a highly controversial issue, and no completely satisfying Bayesian solution has yet been presented in the noninformative setting. The normalized maximum likelihood (NML), based on Rissanen’s information-theoretic MDL methodology, offers an alternative, theoretically solid criterion that is objective and noninformative, while no parameter prior is required. It has been previously shown that for discrete data, this criterion can be computed in linear time for Bayesian networks with no arcs, and in quadratic time for the so-called Naive Bayes network structure. Here we extend the previous results by showing how to compute the NML criterion in polynomial time for tree-structured Bayesian networks. The order of the polynomial depends on the number of values of the variables, but not on the number of variables or on the sample size.
Tight Bounds on Profile Redundancy and Distinguishability
Cited by 6 (3 self)
The minimax KL-divergence of any distribution from all distributions in a collection P has several practical implications. In compression, it is called redundancy and represents the least additional number of bits over the entropy needed to encode the output of any distribution in P. In online estimation and learning, it is the lowest expected log-loss regret when guessing a sequence of random values generated by a distribution in P. In hypothesis testing, it upper bounds the largest number of distinguishable distributions in P. Motivated by problems ranging from population estimation to text classification and speech recognition, several machine-learning and information-theory researchers have recently considered label-invariant observations and properties induced by i.i.d. distributions. A sufficient statistic for all these properties is the data’s profile, the multiset of the number of times each data element appears. Improving on a sequence of previous works, we show that the redundancy of the collection of distributions induced over profiles by length-$n$ i.i.d. sequences is between $0.3 \cdot n^{1/3}$ and $n^{1/3} \log^2 n$, in particular, establishing its exact growth power.
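The notion of a profile defined in this abstract is easy to make concrete. The sketch below is only an illustration: the function name `profile` is hypothetical, and the multiset is represented as a list of multiplicities sorted in decreasing order, which is one of several equivalent choices.

```python
from collections import Counter

def profile(sequence):
    """Multiset of multiplicities of the elements of `sequence`.

    Returned as a list sorted in decreasing order, so two sequences have
    equal profiles exactly when the returned lists are equal.
    """
    counts = Counter(sequence)              # element -> number of occurrences
    return sorted(counts.values(), reverse=True)
```

Because only the multiplicities are kept, relabeling the symbols leaves the profile unchanged: `profile("abracadabra")` and `profile("xyzxwxvxyzx")` are the same list.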
Analyzing the Stochastic Complexity via Tree Polynomials
2005
Cited by 6 (5 self)
Stochastic complexity of a data set is defined as the shortest possible code length for the data obtainable by using some fixed set of models. This measure
NML Computation Algorithms for Tree-Structured Multinomial Bayesian Networks
2007
Cited by 6 (5 self)
Typical problems in bioinformatics involve large discrete datasets. Therefore, in order to apply statistical methods in such domains, it is important to develop efficient algorithms suitable for discrete data. The minimum description length (MDL) principle is a theoretically well-founded, general framework for performing statistical inference. The mathematical formalization of MDL is based on the normalized maximum likelihood (NML) distribution, which has several desirable theoretical properties. In the case of discrete data, straightforward computation of the NML distribution requires exponential time with respect to the sample size, since the definition involves a sum over all the possible data samples of a fixed size. In this paper, we first review some existing algorithms for efficient NML computation in the case of multinomial and naive Bayes model families. Then we proceed by extending these algorithms to more complex, tree-structured Bayesian networks.
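To see why the straightforward computation is exponential, and how grouping samples can avoid that, consider the simplest case of binary (Bernoulli) data. The sketch below is an illustration under that assumption and is not one of the algorithms from the paper; all function names are hypothetical. The normalizer is the sum of maximized likelihoods over all samples of size `n`, computed once by brute force over all `2**n` sequences and once by grouping sequences with the same number of ones.

```python
from itertools import product
from math import comb

def max_likelihood(k, n):
    """Maximized Bernoulli likelihood of a length-n sequence with k ones."""
    p = k / n
    # Python evaluates 0.0 ** 0 as 1.0, matching the convention 0^0 = 1.
    return p ** k * (1 - p) ** (n - k)

def normalizer_bruteforce(n):
    """Sum over all 2**n binary sequences: exponential in n."""
    return sum(max_likelihood(sum(seq), n)
               for seq in product((0, 1), repeat=n))

def normalizer_by_counts(n):
    """Same sum, grouping sequences by their number of ones: n + 1 terms."""
    return sum(comb(n, k) * max_likelihood(k, n) for k in range(n + 1))

def nml_probability(seq):
    """NML probability of a non-empty binary sequence."""
    n = len(seq)
    return max_likelihood(sum(seq), n) / normalizer_by_counts(n)
```

Both normalizers agree (for instance, both give 2.5 for `n = 2`), and `nml_probability` sums to one over all sequences of a given length.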
Computing the Regret Table for Multinomial Data
2005
Cited by 6 (2 self)
Stochastic complexity of a data set is defined as the shortest possible code length for the data obtainable by using some fixed set of models. This measure is of great theoretical and practical importance as a tool for tasks such as model selection or data clustering. In the case