Sequential Prediction of Individual Sequences Under General Loss Functions
IEEE Transactions on Information Theory, 1998
Cited by 58 (7 self)
We consider adaptive sequential prediction of arbitrary binary sequences when the performance is evaluated using a general loss function. The goal is to predict on each individual sequence nearly as well as the best prediction strategy in a given comparison class of (possibly adaptive) prediction strategies, called experts. By using a general loss function, we generalize previous work on universal prediction, forecasting, and data compression. However, here we restrict ourselves to the case when the comparison class is finite. For a given sequence, we define the regret as the total loss on the entire sequence suffered by the adaptive sequential predictor, minus the total loss suffered by the predictor in the comparison class that performs best on that particular sequence. We show that for a large class of loss functions, the minimax regret is either $\Theta(\log N)$ or $\Omega(\sqrt{\ell \log N})$, depending on the loss function, where N is the number of predictors in the comparison class a...
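The regret defined in this abstract can be illustrated with a small Python sketch. The function name `regret` and the list-of-losses inputs are illustrative assumptions, not notation from the paper; the per-trial losses are taken as given, so any loss function could have produced them.

```python
def regret(predictor_losses, expert_losses):
    """Total loss of the predictor minus the total loss of the best expert.

    predictor_losses: per-trial losses of the sequential predictor.
    expert_losses: one list of per-trial losses per expert (hypothetical layout).
    """
    total = sum(predictor_losses)
    # The comparator is the single expert that did best on this sequence.
    best = min(sum(losses) for losses in expert_losses)
    return total - best


# Toy run: the predictor loses 2 in total, the best expert loses 1.
example_regret = regret([1, 0, 1], [[0, 0, 1], [1, 1, 1]])
```

In this toy run the regret is 1, because the first expert's total loss of 1 is the smallest in the comparison class.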
Tight Worst-Case Loss Bounds for Predicting With Expert Advice
1994
Cited by 51 (10 self)
this paper is somewhat different from the one just described. Assume that there are N experts $E_i$, $i = 1, \ldots, N$, each trying to predict the outcomes $y_t$ as best they can. Let $x_{t,i}$ be the prediction of the ith expert $E_i$ about the
Tracking the Best Linear Predictor
Journal of Machine Learning Research, 2001
Cited by 43 (11 self)
In most on-line learning research the total on-line loss of the algorithm is compared to the total loss of the best off-line predictor u from a comparison class of predictors. We call such bounds static bounds. The interesting feature of these bounds is that they hold for an arbitrary sequence of examples. Recently some work has been done where the predictor $u_t$ at each trial t is allowed to change with time, and the total on-line loss of the algorithm is compared to the sum of the losses of $u_t$ at each trial plus the total "cost" for shifting to successive predictors. This is to model situations in which the examples change over time, and different predictors from the comparison class are best for different segments of the sequence of examples. We call such bounds shifting bounds. They hold for arbitrary sequences of examples and arbitrary sequences of predictors. Naturally, shifting bounds are much harder to prove. The only known bounds are for the case when the comparison class consists of sequences of experts or boolean disjunctions. In this paper we develop the methodology for lifting known static bounds to the shifting case. In particular we obtain bounds when the comparison class consists of linear neurons (linear combinations of experts). Our essential technique is to project the hypothesis of the static algorithm at the end of each trial into a suitably chosen convex region. This keeps the hypothesis of the algorithm well-behaved and the static bounds can be converted to shifting bounds.
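The projection step described in this abstract can be sketched in Python under strong simplifying assumptions. Here the convex region is an axis-aligned box and the projection is Euclidean (coordinate-wise clipping); the abstract specifies neither, and the names `project_onto_box` and `shifting_update` are hypothetical.

```python
def project_onto_box(w, lo, hi):
    """Euclidean projection of the weight vector w onto the box [lo, hi]^n."""
    return [min(max(wi, lo), hi) for wi in w]


def shifting_update(static_update, w, example, lo, hi):
    """Run one trial of a static algorithm, then project its hypothesis.

    static_update: any function (w, example) -> new weights; an assumption
    standing in for whichever static algorithm is being lifted.
    """
    w_new = static_update(w, example)
    # Projecting at the end of each trial keeps the hypothesis inside the region.
    return project_onto_box(w_new, lo, hi)
```

Passing the static update as a function keeps the sketch independent of any particular static algorithm; only the projection at the end of the trial is added.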
On-Line Learning of Linear Functions
Computational Complexity, 1991
Cited by 40 (18 self)
this paper, we present near-optimal strategies for combining opinions in situations like this. In more abstract terms, we study the on-line learning of linear functions. We assume that learning proceeds in a sequence of trials. At trial number t the learning algorithm (the advisor) is presented with an instance $\vec{x}_t \in [0, 1]$
Worst-case Quadratic Loss Bounds for On-line Prediction of Linear Functions by Gradient Descent
IEEE Transactions on Neural Networks, 1993
Cited by 29 (12 self)
this paper we study the performance of gradient descent when applied to the problem of on-line linear prediction in arbitrary inner product spaces. We show worst-case bounds on the sum of the squared prediction errors under various assumptions concerning the amount of a priori information about the sequence to predict. The algorithms we use are variants and extensions of on-line gradient descent. Whereas our algorithms always predict using linear functions as hypotheses, none of our results requires the data to be linearly related. In fact, the bounds proved on the total prediction loss are typically expressed as a function of the total loss of the best fixed linear predictor with bounded norm. All the upper bounds are tight to within constants. Matching lower bounds are provided in some cases. Finally, we apply our results to the problem of on-line prediction for classes of smooth functions. Keywords: prediction, Widrow-Hoff algorithm, gradient descent, smoothing, inner product spaces, computational learning theory, on-line learning, linear systems.
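As a rough illustration of the setting in this abstract, the following Python sketch runs on-line gradient descent in the style of the Widrow-Hoff rule and accumulates the sum of squared prediction errors. The function name, the zero initial weights, and the fixed learning rate `eta` (with the constant factor of the squared-loss gradient absorbed into it) are assumptions for illustration; the paper's own variants and tuning are not reproduced here.

```python
def widrow_hoff(examples, eta, dim):
    """On-line gradient descent for linear prediction under squared loss.

    examples: iterable of (x, y) pairs, with x a list of `dim` floats.
    Returns the final weights and the total squared prediction error.
    """
    w = [0.0] * dim
    total_loss = 0.0
    for x, y in examples:
        # Predict with the current linear hypothesis before seeing y.
        y_hat = sum(wi * xi for wi, xi in zip(w, x))
        err = y_hat - y
        total_loss += err * err
        # Move the weights against the gradient of the squared error.
        w = [wi - eta * err * xi for wi, xi in zip(w, x)]
    return w, total_loss
```

The total returned by this function is the quantity that worst-case bounds of this kind compare against the total loss of the best fixed linear predictor with bounded norm.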
Worst-case Quadratic Loss Bounds for Prediction Using Linear Functions and Gradient Descent
1996
Cited by 23 (3 self)
In this paper we study the performance of gradient descent when applied to the problem of on-line linear prediction in arbitrary inner product spaces. We show worst-case bounds on the sum of the squared prediction errors under various assumptions concerning the amount of a priori information about the sequence to predict. The algorithms we use are variants and extensions of on-line gradient descent. Whereas our algorithms always predict using linear functions as hypotheses, none of our results requires the data to be linearly related. In fact, the bounds proved on the total prediction loss are typically expressed as a function of the total loss of the best fixed linear predictor with bounded norm. All the upper bounds are tight to within constants. Matching lower bounds are provided in some cases. Finally, we apply our results to the problem of on-line prediction for classes of smooth functions.
On the Complexity of Function Learning
In Proc. 6th Annu. Workshop on Comput. Learning Theory, 1994
Cited by 4 (2 self)
The majority of results in computational learning theory are concerned with concept learning, i.e. with the special case of function learning for classes of functions with range $\{0, 1\}$. Much less is known about the theory of learning functions with a larger range such as $\mathbb{N}$ or $\mathbb{R}$. In particular relatively few results exist about the general structure of common models for function learning, and there are only very few nontrivial function classes for which positive learning results have been exhibited in any of these models. We introduce in this paper the notion of a binary branching adversary tree for function learning, which allows us to give a somewhat surprising equivalent characterization of the optimal learning cost for learning a class of real-valued functions (in terms of a max-min definition which does not involve any "learning" model). Another general structural result of this paper relates the cost for learning a union of function classes to the learning costs for the individ...

