Results 1–10 of 10
Mixability is Bayes Risk Curvature Relative to Log Loss
Abstract

Cited by 9 (6 self)
Given K codes, a standard result from source coding tells us how to design a single universal code with codelengths within log(K) bits of the best code, on any data sequence. Translated to the online learning setting of prediction with expert advice, this result implies that for logarithmic loss one can guarantee constant regret, which does not grow with the number of outcomes that need to be predicted. In this setting, it is known for which other losses the same guarantee can be given: these are the losses that are mixable. We show that among the mixable losses, log loss is special: in fact, one may understand the class of mixable losses as those that behave like log loss in an essential way. More specifically, a loss is mixable if and only if the curvature of its Bayes risk is at least as large as the curvature of the Bayes risk for log loss (for which the Bayes risk equals the entropy).
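As a minimal sketch of the constant-regret guarantee described above: under log loss, the Bayesian mixture of K experts never incurs more than ln(K) extra loss relative to the best expert, on any data sequence. The function name and interface below are our own illustrative choices; `expert_probs[t][k]` is the probability that expert k assigned to the outcome that actually occurred in round t.

```python
import math

def bayes_mixture_log_loss(expert_probs):
    """Online Bayes mixture over K experts under log loss.

    Returns (mixture_loss, best_expert_loss); the regret
    mixture_loss - best_expert_loss is at most ln(K).
    """
    K = len(expert_probs[0])
    weights = [1.0 / K] * K          # uniform prior over experts
    mixture_loss = 0.0
    expert_losses = [0.0] * K
    for probs in expert_probs:
        # mixture probability of the realized outcome
        p_mix = sum(w * p for w, p in zip(weights, probs))
        mixture_loss += -math.log(p_mix)
        for k in range(K):
            expert_losses[k] += -math.log(probs[k])
        # Bayesian posterior update of the expert weights
        weights = [w * p / p_mix for w, p in zip(weights, probs)]
    return mixture_loss, min(expert_losses)
```

The per-round updates telescope, so the cumulative mixture loss equals the negative log of the prior-weighted mixture of each expert's joint probability, which is what yields the ln(K) bound.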
Putting Bayes to sleep
 In NIPS, 2012
Abstract

Cited by 7 (2 self)
We consider sequential prediction algorithms that are given the predictions from a set of models as inputs. If the nature of the data is changing over time, in that different models predict well on different segments of the data, then adaptivity is typically achieved by mixing into the weights in each round a bit of the initial prior (a kind of weak restart). However, what if the favored models in each segment are from a small subset, i.e. the data is likely to be predicted well by models that predicted well before? Curiously, fitting such “sparse composite models” is achieved by mixing in a bit of all the past posteriors. This self-referential updating method is rather peculiar, but it is efficient and gives superior performance on many natural data sets. It is also important because it introduces a long-term memory: any model that has done well in the past can be recovered quickly. While Bayesian interpretations can be found for mixing in a bit of the initial prior, no Bayesian interpretation is known for mixing in past posteriors. We build atop the “specialist” framework from the online learning literature to give the Mixing Past Posteriors update a proper Bayesian foundation. We apply our method to a well-studied multitask learning problem and obtain a new, intriguing, efficient update that achieves a significantly better bound.
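The update described in this abstract can be sketched as follows under log loss. This is an illustrative rendition only: the function name, the uniform averaging over all past posteriors, and the mixing rate `alpha` are our simplifying assumptions, not the paper's exact scheme.

```python
import math

def mixing_past_posteriors(expert_probs, alpha=0.1):
    """Sketch of a Mixing Past Posteriors style update under log loss.

    Each round: predict with the current weights, perform a Bayesian
    update, then mix in a fraction alpha of the average of all past
    posteriors (the initial uniform prior counts as round 0).  Mixing
    in past posteriors is what provides the long-term memory: a model
    that did well before keeps mass in the average and can be
    recovered quickly.
    """
    K = len(expert_probs[0])
    uniform = [1.0 / K] * K
    past = [uniform[:]]              # history of posteriors
    weights = uniform[:]
    total_loss = 0.0
    for probs in expert_probs:
        p_mix = sum(w * p for w, p in zip(weights, probs))
        total_loss += -math.log(p_mix)
        # Bayesian update on this round's evidence
        post = [w * p / p_mix for w, p in zip(weights, probs)]
        past.append(post[:])
        # mix in a bit of all the past posteriors
        avg_past = [sum(col) / len(past) for col in zip(*past)]
        weights = [(1 - alpha) * q + alpha * a
                   for q, a in zip(post, avg_past)]
    return total_loss, weights
```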
Mixability in statistical learning
 In Advances in Neural Information Processing Systems
Abstract

Cited by 4 (3 self)
Statistical learning and sequential prediction are two different but related formalisms to study the quality of predictions. Mapping out their relations and transferring ideas is an active area of investigation. We provide another piece of the puzzle by showing that an important concept in sequential prediction, the mixability of a loss, has a natural counterpart in the statistical setting, which we call stochastic mixability. Just as ordinary mixability characterizes fast rates for the worst-case regret in sequential prediction, stochastic mixability characterizes fast rates in statistical learning. We show that, in the special case of log loss, stochastic mixability reduces to a well-known (but usually unnamed) martingale condition, which is used in existing convergence theorems for minimum description length and Bayesian inference. In the case of 0/1 loss, it reduces to the margin condition of Mammen and Tsybakov, and in the case that the model under consideration contains all possible predictors, it is equivalent to ordinary mixability.
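For reference, the stochastic mixability condition alluded to above can be stated as follows. This is our paraphrase of the condition as it appears in this line of work, with η > 0 the mixability parameter, f* the best predictor in the model F, ℓ the loss, and Z the data:

```latex
% A model F is eta-stochastically mixable (for some eta > 0) if there
% is an f* in F such that, for every f in F,
\[
  \mathbb{E}_{Z}\!\left[ e^{\eta\,\bigl(\ell(f^{*}, Z) - \ell(f, Z)\bigr)} \right] \le 1 .
\]
% For log loss, where ell(f, Z) = -log f(Z), taking eta = 1 turns this
% into E[ f(Z) / f*(Z) ] <= 1, the martingale condition mentioned in
% the abstract.
```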
Prediction with expert evaluators’ advice
 Proceedings of the Twentieth International Conference on Algorithmic Learning Theory, 2009
Abstract

Cited by 4 (2 self)
We introduce a new protocol for prediction with expert advice in which each expert evaluates the learner’s and his own performance using a loss function that may change over time and may be different from the loss functions used by the other experts. The learner’s goal is to perform better than, or not much worse than, each expert, as evaluated by that expert, for all experts simultaneously. If the loss functions used by the experts are all proper scoring rules and all mixable, we show that the defensive forecasting algorithm enjoys the same performance guarantee as that attainable by the Aggregating Algorithm in the standard setting, which is known to be optimal. This result is also applied to the case of “specialist” (or “sleeping”) experts, in which the defensive forecasting algorithm reduces to a simple modification of the Aggregating Algorithm.
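To make the Aggregating Algorithm mentioned above concrete in the standard single-loss setting, here is an illustrative sketch for square loss on binary outcomes, which is mixable with η = 2. The function name and interface are our own; the substitution step is the standard one for this loss.

```python
import math

def aggregating_algorithm_square_loss(expert_preds, outcomes):
    """Aggregating Algorithm for square loss (y - gamma)^2, y in {0,1}.

    Square loss on [0,1] is eta-mixable for eta = 2, so the AA
    guarantees regret at most ln(K) / 2 against the best expert.
    """
    eta = 2.0
    K = len(expert_preds[0])
    weights = [1.0 / K] * K
    total_loss = 0.0
    expert_losses = [0.0] * K
    for preds, y in zip(expert_preds, outcomes):
        # generalized prediction evaluated at both possible outcomes
        g = {}
        for out in (0, 1):
            s = sum(w * math.exp(-eta * (out - p) ** 2)
                    for w, p in zip(weights, preds))
            g[out] = -math.log(s) / eta
        # substitution function for square loss
        gamma = min(1.0, max(0.0, 0.5 + (g[0] - g[1]) / 2.0))
        total_loss += (y - gamma) ** 2
        for k, p in enumerate(preds):
            expert_losses[k] += (y - p) ** 2
        # exponential-weights update
        new_w = [w * math.exp(-eta * (y - p) ** 2)
                 for w, p in zip(weights, preds)]
        s = sum(new_w)
        weights = [w / s for w in new_w]
    return total_loss, min(expert_losses)
```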
Cascading Randomized Weighted Majority: A New Online Ensemble Learning Algorithm
Abstract
With the increasing volume of data in the world, the best approach for learning from this data is to exploit an online learning algorithm. Online ensemble methods are online algorithms which take advantage of an ensemble of classifiers to predict labels of data. Prediction with expert advice is a well-studied problem in the online ensemble learning literature. The Weighted Majority algorithm and the randomized weighted majority (RWM) algorithm are the most well-known solutions to this problem, aiming to converge to the best expert. Since the best expert does not necessarily have the minimum error in all regions of the data space, defining specific regions and converging to the best expert in each of these regions leads to a better result. In this paper, we aim to resolve this defect of RWM algorithms by proposing a novel online ensemble algorithm for the problem of prediction with expert advice. We propose a cascading version of RWM to achieve not only better experimental results but also a better error bound for sufficiently large datasets.
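As background for the cascading variant proposed above, here is a minimal sketch of the base RWM algorithm for 0/1 loss. This is our own illustrative implementation, not the paper's; `beta` is the multiplicative penalty applied to experts that err.

```python
import random

def randomized_weighted_majority(expert_preds, labels, beta=0.5, seed=0):
    """Randomized Weighted Majority over K experts for 0/1 loss.

    Each round the learner follows a single expert, chosen with
    probability proportional to its weight; every expert that errs
    is penalized by the factor beta in (0, 1).
    """
    rng = random.Random(seed)
    weights = [1.0] * len(expert_preds[0])
    mistakes = 0
    for preds, y in zip(expert_preds, labels):
        # sample an expert proportionally to the weights
        r = rng.random() * sum(weights)
        acc = 0.0
        chosen = len(weights) - 1
        for k, w in enumerate(weights):
            acc += w
            if r <= acc:
                chosen = k
                break
        if preds[chosen] != y:
            mistakes += 1
        # multiplicative penalty for experts that were wrong
        weights = [w * (beta if p != y else 1.0)
                   for w, p in zip(weights, preds)]
    return mistakes
```

The standard analysis bounds the expected number of mistakes by (m ln(1/β) + ln K) / (1 − β), where m is the mistake count of the best expert; the cascading idea in the paper layers such learners over regions of the data space.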
Fast Rates in Statistical and Online Learning, by Tim van Erven
Abstract
The speed with which a learning algorithm converges as it is presented with more data is a central problem in machine learning: a fast rate of convergence means less data is needed for the same level of performance. The pursuit of fast rates in online and statistical learning has led to the discovery of many conditions in learning theory under which fast learning is possible. We show that most of these conditions are special cases of a single, unifying condition that comes in two forms: the central condition for ‘proper’ learning algorithms that always output a hypothesis in the given model, and stochastic mixability for online algorithms that may make predictions outside of the model. We show that under surprisingly weak assumptions both conditions are, in a certain sense, equivalent. The central condition has a reinterpretation in terms of convexity of a set of pseudo-probabilities, linking it to density estimation under misspecification. For bounded losses, we show how the central condition enables a direct proof of fast rates, and we prove its equivalence to the Bernstein condition, itself a generalization of the Tsybakov margin condition, both of which have played a central role in obtaining fast rates in statistical learning. Yet, while the ...