Results 1 - 7 of 7
Sequential Prediction of Individual Sequences Under General Loss Functions
IEEE Transactions on Information Theory, 1998
Cited by 75 (7 self)
We consider adaptive sequential prediction of arbitrary binary sequences when the performance is evaluated using a general loss function. The goal is to predict on each individual sequence nearly as well as the best prediction strategy in a given comparison class of (possibly adaptive) prediction strategies, called experts. By using a general loss function, we generalize previous work on universal prediction, forecasting, and data compression. However, here we restrict ourselves to the case when the comparison class is finite. For a given sequence, we define the regret as the total loss on the entire sequence suffered by the adaptive sequential predictor, minus the total loss suffered by the predictor in the comparison class that performs best on that particular sequence. We show that for a large class of loss functions, the minimax regret is either $\Theta(\log N)$ or $\Omega(\sqrt{\ell \log N})$, depending on the loss function, where $N$ is the number of predictors in the comparison class a...
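The regret defined in the abstract above can be illustrated with a small sketch. It uses an exponentially weighted average forecaster, a standard technique for finite expert classes, which is not necessarily the algorithm analyzed in the paper. The function names, the learning rate `eta`, and the choice of absolute loss are assumptions made for illustration only.

```python
import math


def absolute_loss(prediction, outcome):
    """Absolute loss |prediction - outcome| (an illustrative choice of loss)."""
    return abs(prediction - outcome)


def run_weighted_forecaster(expert_predictions, outcomes, eta, loss=absolute_loss):
    """Run an exponentially weighted average forecaster (hypothetical helper).

    expert_predictions[t][i] is expert i's prediction at trial t.
    Returns (forecaster_loss, best_expert_loss, regret), where regret is the
    forecaster's total loss minus the total loss of the best single expert.
    """
    n_experts = len(expert_predictions[0])
    weights = [1.0] * n_experts
    expert_losses = [0.0] * n_experts
    forecaster_loss = 0.0

    for preds, y in zip(expert_predictions, outcomes):
        # Predict with the weighted average of the experts' predictions.
        total = sum(weights)
        p = sum(w * x for w, x in zip(weights, preds)) / total
        forecaster_loss += loss(p, y)

        # Charge each expert its loss and shrink its weight accordingly.
        for i, x in enumerate(preds):
            expert_loss = loss(x, y)
            expert_losses[i] += expert_loss
            weights[i] *= math.exp(-eta * expert_loss)

    best_expert_loss = min(expert_losses)
    return forecaster_loss, best_expert_loss, forecaster_loss - best_expert_loss


# Two experts on a binary sequence of ones: one always says 1, one always says 0.
outcomes = [1] * 10
expert_predictions = [[1.0, 0.0] for _ in outcomes]
total, best, regret = run_weighted_forecaster(expert_predictions, outcomes, eta=1.0)
```

In this toy run the first expert never errs, so the regret equals the forecaster's own total loss, and it stays bounded as the weight of the poor expert decays.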
Tracking the Best Linear Predictor
Journal of Machine Learning Research, 2001
Cited by 53 (11 self)
In most online learning research the total online loss of the algorithm is compared to the total loss of the best offline predictor $u$ from a comparison class of predictors. We call such bounds static bounds. The interesting feature of these bounds is that they hold for an arbitrary sequence of examples. Recently some work has been done where the predictor $u_t$ at each trial $t$ is allowed to change with time, and the total online loss of the algorithm is compared to the sum of the losses of $u_t$ at each trial plus the total "cost" for shifting to successive predictors. This is to model situations in which the examples change over time, and different predictors from the comparison class are best for different segments of the sequence of examples. We call such bounds shifting bounds. They hold for arbitrary sequences of examples and arbitrary sequences of predictors. Naturally, shifting bounds are much harder to prove. The only known bounds are for the case when the comparison class consists of sequences of experts or boolean disjunctions. In this paper we develop the methodology for lifting known static bounds to the shifting case. In particular we obtain bounds when the comparison class consists of linear neurons (linear combinations of experts). Our essential technique is to project the hypothesis of the static algorithm at the end of each trial into a suitably chosen convex region. This keeps the hypothesis of the algorithm well-behaved and the static bounds can be converted to shifting bounds.
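The "update, then project" pattern described in the abstract above can be sketched as follows. The abstract does not say which convex region or which notion of projection the authors use, so this sketch assumes a Euclidean ball of radius `radius` and Euclidean projection; the helper names are hypothetical.

```python
import math


def project_onto_ball(w, radius):
    """Euclidean projection of w onto the ball {v : ||v||_2 <= radius}."""
    norm = math.sqrt(sum(v * v for v in w))
    if norm <= radius:
        return list(w)  # Already inside the convex region: leave unchanged.
    scale = radius / norm
    return [v * scale for v in w]


def shifting_step(w, static_update, radius):
    """Apply a static algorithm's update, then project the new hypothesis.

    static_update is any function mapping the current weight vector to the
    next one (assumed here; it stands in for the static algorithm's update).
    """
    return project_onto_ball(static_update(w), radius)


# Example: a static update that doubles the weights, confined to the unit ball.
w = shifting_step([1.5, 2.0], lambda v: [2.0 * x for x in v], radius=1.0)
```

Keeping the hypothesis inside a bounded region limits how far it can drift from any later comparison predictor, which is the intuition behind converting static bounds into shifting ones.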
Tight Worst-Case Loss Bounds for Predicting With Expert Advice
1994
Cited by 53 (10 self)
this paper is somewhat different from the one just described. Assume that there are $N$ experts $E_i$, $i = 1, \ldots, N$, each trying to predict the outcomes $y_t$ as best they can. Let $x_{t,i}$ be the prediction of the $i$th expert $E_i$ about the
On-Line Learning of Linear Functions
Computational Complexity, 1991
Cited by 41 (18 self)
this paper, we present near-optimal strategies for combining opinions in situations like this. In more abstract terms, we study the online learning of linear functions. We assume that learning proceeds in a sequence of trials. At trial number $t$ the learning algorithm (the advisor) is presented with an instance $\vec{x}_t \in [0, 1]$
Worst-case Quadratic Loss Bounds for Online Prediction of Linear Functions by Gradient Descent
IEEE Transactions on Neural Networks, 1993
Cited by 31 (12 self)
this paper we study the performance of gradient descent when applied to the problem of online linear prediction in arbitrary inner product spaces. We show worst-case bounds on the sum of the squared prediction errors under various assumptions concerning the amount of a priori information about the sequence to predict. The algorithms we use are variants and extensions of online gradient descent. Whereas our algorithms always predict using linear functions as hypotheses, none of our results requires the data to be linearly related. In fact, the bounds proved on the total prediction loss are typically expressed as a function of the total loss of the best fixed linear predictor with bounded norm. All the upper bounds are tight to within constants. Matching lower bounds are provided in some cases. Finally, we apply our results to the problem of online prediction for classes of smooth functions. Keywords: prediction, Widrow-Hoff algorithm, gradient descent, smoothing, inner product spaces, computational learning theory, online learning, linear systems.
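The online gradient descent setting named in the abstract above (the Widrow-Hoff rule appears among its keywords) can be sketched as a short loop. This is a minimal illustration of the basic rule, not the paper's tuned variants; the fixed learning rate `eta`, the zero starting vector, and the function name are assumptions.

```python
def widrow_hoff(examples, eta):
    """Online linear prediction with the Widrow-Hoff (LMS) update.

    examples is a list of (x, y) pairs, with x a list of floats.
    Returns (final_weights, total_squared_loss), where the loss is the sum of
    squared prediction errors over all trials.
    """
    dim = len(examples[0][0])
    w = [0.0] * dim  # Assumed starting hypothesis: the zero vector.
    total_squared_loss = 0.0

    for x, y in examples:
        # Predict with the current linear hypothesis before seeing y.
        y_hat = sum(wi * xi for wi, xi in zip(w, x))
        error = y - y_hat
        total_squared_loss += error * error

        # Move the weights along the negative gradient of the squared error
        # (the constant factor 2 is absorbed into eta).
        w = [wi + eta * error * xi for wi, xi in zip(w, x)]

    return w, total_squared_loss


# Three trials on the one-dimensional target y = 2x with x = 1.
weights, loss = widrow_hoff([([1.0], 2.0)] * 3, eta=0.5)
```

The bounds described in the abstract compare a total such as `loss` with the total loss of the best fixed linear predictor of bounded norm on the same sequence.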
Worst-case Quadratic Loss Bounds for Prediction Using Linear Functions and Gradient Descent
1996
Cited by 27 (4 self)
In this paper we study the performance of gradient descent when applied to the problem of online linear prediction in arbitrary inner product spaces. We show worst-case bounds on the sum of the squared prediction errors under various assumptions concerning the amount of a priori information about the sequence to predict. The algorithms we use are variants and extensions of online gradient descent. Whereas our algorithms always predict using linear functions as hypotheses, none of our results requires the data to be linearly related. In fact, the bounds proved on the total prediction loss are typically expressed as a function of the total loss of the best fixed linear predictor with bounded norm. All the upper bounds are tight to within constants. Matching lower bounds are provided in some cases. Finally, we apply our results to the problem of online prediction for classes of smooth functions.
On the Complexity of Function Learning
In Proc. 6th Annu. Workshop on Comput. Learning Theory, 1994
Cited by 4 (2 self)
The majority of results in computational learning theory are concerned with concept learning, i.e. with the special case of function learning for classes of functions with range $\{0, 1\}$. Much less is known about the theory of learning functions with a larger range such as $\mathbb{N}$ or $\mathbb{R}$. In particular relatively few results exist about the general structure of common models for function learning, and there are only very few nontrivial function classes for which positive learning results have been exhibited in any of these models. We introduce in this paper the notion of a binary branching adversary tree for function learning, which allows us to give a somewhat surprising equivalent characterization of the optimal learning cost for learning a class of real-valued functions (in terms of a max-min definition which does not involve any "learning" model). Another general structural result of this paper relates the cost for learning a union of function classes to the learning costs for the individ...