Results 11-20 of 150
Online algorithms in machine learning
In Fiat and Woeginger, eds., Online Algorithms: The State of the Art, 1998
Cited by 75 (2 self)
The areas of On-Line Algorithms and Machine Learning are both concerned with problems of making decisions about the present based only on knowledge of the past. Although these areas differ in terms of their emphasis and the problems typically studied, there is a collection of results in Computational Learning Theory that fit nicely into the "on-line algorithms" framework. This survey article discusses some of the results, models, and open problems from Computational Learning Theory that seem particularly interesting from the point of view of on-line algorithms. The emphasis in this article is on describing some of the simpler, more intuitive results, whose proofs can be given in their entirety. Pointers to the literature are given for more sophisticated versions of these algorithms.
Improved second-order bounds for prediction with expert advice
In COLT, 2005
Cited by 69 (14 self)
This work studies external regret in sequential prediction games with both positive and negative payoffs. External regret measures the difference between the payoff obtained by the forecasting strategy and the payoff of the best action. In this setting, we derive new and sharper regret bounds for the well-known exponentially weighted average forecaster and for a new forecaster with a different multiplicative update rule. Our analysis has two main advantages: first, no preliminary knowledge about the payoff sequence is needed, not even its range; second, our bounds are expressed in terms of sums of squared payoffs, replacing larger first-order quantities appearing in previous bounds. In addition, our most refined bounds have the natural and desirable property of being stable under rescalings and general translations of the payoff sequence.
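For orientation, the exponentially weighted average forecaster named in this abstract can be sketched as follows. This is a minimal illustration with a fixed learning rate `eta`, which is an assumption made here; the paper's refined bounds rely on tuning that is not reproduced. The function name `exp_weights` and its arguments are hypothetical.

```python
import math

def exp_weights(payoffs, eta=0.5):
    """Run an exponentially weighted average forecaster on a payoff table.

    payoffs: list of rounds, each a list with one payoff per action
             (payoffs may be positive or negative).
    eta:     fixed learning rate (an assumption of this sketch).
    Returns (expected forecaster payoff, external regret).
    """
    n_actions = len(payoffs[0])
    cumulative = [0.0] * n_actions
    forecaster_payoff = 0.0
    for round_payoffs in payoffs:
        # Weights are proportional to exp(eta * cumulative payoff);
        # subtracting the maximum keeps the exponentials numerically stable.
        top = max(cumulative)
        weights = [math.exp(eta * (c - top)) for c in cumulative]
        total = sum(weights)
        probs = [w / total for w in weights]
        # Expected payoff of playing an action drawn from probs.
        forecaster_payoff += sum(p * x for p, x in zip(probs, round_payoffs))
        cumulative = [c + x for c, x in zip(cumulative, round_payoffs)]
    # External regret: best single action in hindsight minus the forecaster.
    regret = max(cumulative) - forecaster_payoff
    return forecaster_payoff, regret
```

Calling `exp_weights([[1.0, -1.0]] * 20)` returns the forecaster's expected payoff together with its regret against the better of the two actions.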
Adaptive Regression by Mixing
Journal of the American Statistical Association
Cited by 67 (11 self)
Adaptation over different procedures is of practical importance. Different procedures perform well under different conditions. In many practical situations, it is rather hard to assess which conditions are (approximately) satisfied so as to identify the best procedure for the data at hand. Thus automatic adaptation over various scenarios is desirable. A practically feasible method, named Adaptive Regression by Mixing (ARM), is proposed to convexly combine general candidate regression procedures. Under mild conditions, the resulting estimator is theoretically shown to perform optimally in rates of convergence without knowing which of the original procedures works best. Simulations are conducted in several settings, including comparing a parametric model with nonparametric alternatives, comparing a neural network with a projection pursuit in multidimensional regression, and combining bandwidths in kernel regression. The results clearly support the theoretical property of ARM. The ARM ...
Universal compression of memoryless sources over unknown alphabets
IEEE Transactions on Information Theory, 2004
Cited by 58 (22 self)
It has long been known that the compression redundancy of independent and identically distributed (i.i.d.) strings increases to infinity as the alphabet size grows. It is also apparent that any string can be described by separately conveying its symbols and its pattern, the order in which the symbols appear. Concentrating on the latter, we show that the patterns of i.i.d. strings over all alphabets, including infinite and even unknown ones, can be compressed with diminishing redundancy, both in block and sequentially, and that the compression can be performed in linear time. To establish these results, we show that the number of patterns is the Bell number, that the number of patterns with a given number of symbols is the Stirling number of the second kind, and that the redundancy of patterns can be bounded using results of Hardy and Ramanujan on the number of integer partitions. The results also imply an asymptotically optimal solution for the Good-Turing probability-estimation problem.
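The counting claims in this abstract can be checked on small cases. The sketch below assumes the usual reading of "pattern" (each symbol replaced by the index of its first appearance) and brute-forces the count for short strings. The helper names `pattern`, `stirling2`, `bell`, and `count_patterns` are illustrative and not taken from the paper.

```python
from itertools import product

def pattern(s):
    """Replace each symbol by the order in which it first appeared (1-based)."""
    first_seen = {}
    out = []
    for ch in s:
        if ch not in first_seen:
            first_seen[ch] = len(first_seen) + 1
        out.append(first_seen[ch])
    return tuple(out)

def stirling2(n, m):
    """Stirling number of the second kind via S(n, m) = m*S(n-1, m) + S(n-1, m-1)."""
    table = [[0] * (m + 1) for _ in range(n + 1)]
    table[0][0] = 1
    for i in range(1, n + 1):
        for j in range(1, min(i, m) + 1):
            table[i][j] = j * table[i - 1][j] + table[i - 1][j - 1]
    return table[n][m]

def bell(n):
    """Bell number as the sum of Stirling numbers of the second kind."""
    return sum(stirling2(n, m) for m in range(n + 1))

def count_patterns(n):
    """Brute-force count of distinct patterns of length n.

    An alphabet of size n suffices, since a length-n string
    has at most n distinct symbols.
    """
    return len({pattern(s) for s in product(range(n), repeat=n)})
```

For example, `pattern("abracadabra")` gives `(1, 2, 3, 1, 4, 1, 5, 1, 2, 3, 1)`, and `count_patterns(4)` agrees with `bell(4)`, which is 15.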
Predicting a Binary Sequence Almost as Well as the Optimal Biased Coin
1996
Cited by 50 (5 self)
We apply the exponential weight algorithm, introduced by Littlestone and Warmuth [17] and by Vovk [24], to the problem of predicting a binary sequence almost as well as the best biased coin. We first show that for the case of the logarithmic loss, the derived algorithm is equivalent to the Bayes algorithm with Jeffreys' prior, which was studied by Xie and Barron under probabilistic assumptions [26]. We derive a uniform bound on the regret which holds for any sequence. We also show that if the empirical distribution of the sequence is bounded away from 0 and from 1, then, as the length of the sequence increases to infinity, the difference between this bound and a corresponding bound on the average case regret of the same algorithm (which is asymptotically optimal in that case) is only 1/2. We show that this gap of 1/2 is necessary by calculating the regret of the min-max optimal algorithm for this problem and showing that the asymptotic upper bound is tight. We also study the application...
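As a rough illustration of the logarithmic-loss case, the Bayes predictor with Jeffreys' prior on the coin bias reduces to the add-one-half rule (the Krichevsky-Trofimov estimator). The sketch below compares its cumulative log loss with that of the best biased coin chosen in hindsight. The function names are hypothetical, losses are in nats, and none of the paper's bounds are reproduced.

```python
import math

def jeffreys_log_loss(bits):
    """Cumulative log loss of the Bayes predictor with Jeffreys' prior.

    After t bits containing `ones` ones, the predicted probability
    of a 1 is (ones + 1/2) / (t + 1).
    """
    ones = 0
    loss = 0.0
    for t, b in enumerate(bits):
        p_one = (ones + 0.5) / (t + 1.0)
        loss -= math.log(p_one if b == 1 else 1.0 - p_one)
        ones += b
    return loss

def best_coin_log_loss(bits):
    """Log loss of the best fixed biased coin, chosen in hindsight."""
    n, ones = len(bits), sum(bits)
    loss = 0.0
    for count in (ones, n - ones):
        if count > 0:  # a count of zero contributes no loss
            loss -= count * math.log(count / n)
    return loss

def regret(bits):
    """Extra log loss paid relative to the best biased coin."""
    return jeffreys_log_loss(bits) - best_coin_log_loss(bits)
```

On any binary sequence `regret(bits)` is nonnegative, and for this estimator it is known to stay within (1/2) ln n + ln 2 of the best coin for a sequence of length n.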
Analysis of two gradient-based algorithms for online regression
Journal of Computer and System Sciences, 1999
Cited by 49 (5 self)
In this paper we present a new analysis of two algorithms, Gradient Descent and Exponentiated Gradient, for solving regression problems in the online framework. Both these algorithms compute a prediction that depends linearly on the current instance, and then update the coefficients of this linear combination according to the gradient of the loss function. However, the two algorithms have distinctive ways of using the gradient information for updating the coefficients. For each algorithm, we show general regression bounds for any convex loss function. Furthermore, we show special bounds for the absolute and the square loss functions, thus extending previous results by Kivinen and Warmuth. In the nonlinear regression case, we show general bounds for pairs of transfer and loss functions satisfying a certain condition. We apply this result to the Hellinger loss and the entropic loss in case of logistic regression (similar results, but only for the entropic loss, were also obtained by Helmbold et al. using a different analysis). Finally, we describe the connection between our approach and a general family of gradient-based algorithms proposed by Warmuth et al. in recent works. © 1999 Academic Press.
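The two ways of using the gradient can be contrasted in a short sketch. It assumes the square loss (y_hat - y)^2 and a fixed learning rate `eta`, and it uses the commonly stated forms of the updates: additive for Gradient Descent, multiplicative with renormalization onto the probability simplex for Exponentiated Gradient. Function names are illustrative, and the paper's general convex-loss setting is not covered.

```python
import math

def predict(w, x):
    """Linear prediction w . x."""
    return sum(wi * xi for wi, xi in zip(w, x))

def gd_update(w, x, y, eta):
    """Gradient Descent: subtract eta times the gradient of the square loss."""
    g = 2.0 * (predict(w, x) - y)  # d/d(y_hat) of (y_hat - y)^2
    return [wi - eta * g * xi for wi, xi in zip(w, x)]

def eg_update(w, x, y, eta):
    """Exponentiated Gradient: multiply each weight by exp(-eta * gradient
    component), then renormalize so the weights sum to one.
    Assumes w starts as a probability vector."""
    g = 2.0 * (predict(w, x) - y)
    unnormalized = [wi * math.exp(-eta * g * xi) for wi, xi in zip(w, x)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]
```

Starting from `w = [0.5, 0.5]` with `x = [1.0, 0.0]` and `y = 1.0`, both updates increase the first coefficient, but only the Exponentiated Gradient update keeps the weights on the simplex.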
Using Additive Expert Ensembles to Cope with Concept Drift
In Proceedings of the 22nd International Conference on Machine Learning (ICML 2005), 2005
Cited by 44 (2 self)
We consider online learning where the target concept can change over time. Previous work on expert prediction algorithms has bounded the worst-case performance on any subsequence of the training data relative to the performance of the best expert. However, because these “experts” may be difficult to implement, we take a more general approach and bound performance relative to the actual performance of any online learner on this single subsequence. We present the additive expert ensemble algorithm AddExp, a new, general method for using any online learner for drifting concepts. We adapt techniques for analyzing expert prediction algorithms to prove mistake and loss bounds for a discrete and a continuous version of AddExp. Finally, we present pruning methods and empirical results for data sets with concept drift.
Fast learning rates in statistical inference through aggregation
Submitted to the Annals of Statistics, 2008
Cited by 42 (8 self)
We develop minimax optimal risk bounds for the general learning task consisting in predicting as well as the best function in a reference set G up to the smallest possible additive term, called the convergence rate. When the reference set is finite and when n denotes the size of the training data, we provide minimax convergence rates of the form $C \left( \frac{\log |G|}{n} \right)^{v}$ with tight evaluation of the positive constant $C$ and with exact $0 < v \le 1$, the latter value depending on the convexity of the loss function and on the level of noise in the output distribution. The risk upper bounds are based on a sequential randomized algorithm, which at each step concentrates on functions having both low risk and low variance with respect to the previous step prediction function. Our analysis puts forward the links between the probabilistic and worst-case viewpoints, and allows one to obtain risk bounds unachievable with the standard statistical learning approach. One of the key ideas of this work is to use probabilistic inequalities with respect to appropriate (Gibbs) distributions on the prediction function space instead of using them with respect to the distribution generating the data. The risk lower bounds are based on refinements of the Assouad lemma taking particularly into account the properties of the loss function. Our key example to illustrate the upper and lower bounds is to consider the $L_q$-regression setting, for which an exhaustive analysis of the convergence rates is given while q ranges in [1; +∞[.
Potential-based Algorithms in Online Prediction and Game Theory
Cited by 42 (4 self)
In this paper we show that several known algorithms for sequential prediction problems (including Weighted Majority and the quasi-additive family of Grove, Littlestone, and Schuurmans), for playing iterated games (including Freund and Schapire's Hedge and MW, as well as the strategies of Hart and Mas-Colell), and for boosting (including AdaBoost) are special cases of a general decision strategy based on the notion of potential. By analyzing this strategy we derive known performance bounds, as well as new bounds, as simple corollaries of a single general theorem. Besides offering a new and unified view on a large family of algorithms, we establish a connection between potential-based analysis in learning and their counterparts independently developed in game theory. By exploiting this connection, we show that certain learning problems are instances of more general game-theoretic problems. In particular, we describe a notion of generalized regret and show its applications in learning theory.
On Prediction of Individual Sequences
1998
Cited by 38 (5 self)
Sequential randomized prediction of an arbitrary binary sequence is investigated. No assumption is made on the