Exponentiated Gradient Versus Gradient Descent for Linear Predictors
Information and Computation, 1995
Cited by 247 (12 self)
... this paper, we concentrate on linear predictors. To any vector $u \in \mathbb{R}$ ...
Relative Loss Bounds for Online Density Estimation with the Exponential Family of Distributions
Machine Learning, 2000
Cited by 116 (11 self)
We consider online density estimation with a parameterized density from the exponential family. The online algorithm receives one example at a time and maintains a parameter that is essentially an average of the past examples. After receiving an example the algorithm incurs a loss, which is the negative log-likelihood of the example with respect to the past parameter of the algorithm. An offline algorithm can choose the best parameter based on all the examples. We prove bounds on the additional total loss of the online algorithm over the total loss of the best offline parameter. These relative loss bounds hold for an arbitrary sequence of examples. The goal is to design algorithms with the best possible relative loss bounds. We use a Bregman divergence to derive and analyze each algorithm. These divergences are relative entropies between two exponential distributions. We also use our methods to prove relative loss bounds for linear regression.
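The setting in this abstract can be illustrated with a minimal sketch. It assumes a unit-variance Gaussian as the exponential-family member, and it treats an initial mean of 0 as one pseudo-example so that the parameter stays "essentially an average"; both choices, and the names `nll_gaussian` and `online_vs_offline`, are illustrative assumptions rather than the paper's algorithm.

```python
import math

def nll_gaussian(x, mean):
    """Negative log-likelihood of x under a unit-variance Gaussian."""
    return 0.5 * (x - mean) ** 2 + 0.5 * math.log(2 * math.pi)

def online_vs_offline(examples, initial_mean=0.0):
    """Return (online_loss, offline_loss) for a sequence of real examples."""
    # Assumption: the initial mean counts as one pseudo-example.
    mean, count = initial_mean, 1
    online_loss = 0.0
    for x in examples:
        # Loss is charged against the parameter held before seeing x.
        online_loss += nll_gaussian(x, mean)
        count += 1
        mean += (x - mean) / count  # incremental average update
    # Best fixed parameter in hindsight: the sample mean.
    best = sum(examples) / len(examples)
    offline_loss = sum(nll_gaussian(x, best) for x in examples)
    return online_loss, offline_loss

online, offline = online_vs_offline([1.0, 2.0, 3.0])
print("additional loss of the online algorithm:", online - offline)
```

The printed difference is the quantity that a relative loss bound would control for this particular sequence.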
Sequential Prediction of Individual Sequences Under General Loss Functions
IEEE Transactions on Information Theory, 1998
Cited by 75 (7 self)
We consider adaptive sequential prediction of arbitrary binary sequences when the performance is evaluated using a general loss function. The goal is to predict on each individual sequence nearly as well as the best prediction strategy in a given comparison class of (possibly adaptive) prediction strategies, called experts. By using a general loss function, we generalize previous work on universal prediction, forecasting, and data compression. However, here we restrict ourselves to the case when the comparison class is finite. For a given sequence, we define the regret as the total loss on the entire sequence suffered by the adaptive sequential predictor, minus the total loss suffered by the predictor in the comparison class that performs best on that particular sequence. We show that for a large class of loss functions, the minimax regret is either $\Theta(\log N)$ or $\Omega(\sqrt{\ell \log N})$, depending on the loss function, where $N$ is the number of predictors in the comparison class a...
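The regret defined in this abstract can be computed directly for any predictor. The sketch below uses an exponentially weighted average of the experts as the adaptive predictor; that predictor, the square loss, the learning rate `eta`, and the function names are illustrative assumptions and are not taken from the paper.

```python
import math

def square_loss(prediction, outcome):
    return (prediction - outcome) ** 2

def regret_of_weighted_average(expert_predictions, outcomes, eta=0.5,
                               loss=square_loss):
    """Learner's total loss minus the total loss of the best single expert.

    expert_predictions[t][i] is expert i's prediction at trial t.
    """
    n_experts = len(expert_predictions[0])
    weights = [1.0] * n_experts
    expert_totals = [0.0] * n_experts
    learner_total = 0.0
    for preds, y in zip(expert_predictions, outcomes):
        # Predict with the weighted average of the experts' predictions.
        total_weight = sum(weights)
        prediction = sum(w * p for w, p in zip(weights, preds)) / total_weight
        learner_total += loss(prediction, y)
        # Shrink each expert's weight exponentially in its own loss.
        for i, p in enumerate(preds):
            expert_loss = loss(p, y)
            expert_totals[i] += expert_loss
            weights[i] *= math.exp(-eta * expert_loss)
    return learner_total - min(expert_totals)

# Two constant experts (always 0, always 1) on a short binary sequence.
print(regret_of_weighted_average([[0.0, 1.0]] * 4, [1, 1, 0, 1]))
```

The best expert is only known after the whole sequence has been seen, which is why the subtraction happens at the end.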
Competitive online statistics
International Statistical Review, 1999
Cited by 63 (10 self)
A radically new approach to statistical modelling, which combines mathematical techniques of Bayesian statistics with the philosophy of the theory of competitive online algorithms, has arisen over the last decade in computer science (to a large degree, under the influence of Dawid’s prequential statistics). In this approach, which we call “competitive online statistics”, it is not assumed that data are generated by some stochastic mechanism; the bounds derived for the performance of competitive online statistical procedures are guaranteed to hold (and not just hold with high probability or on the average). This paper reviews some results in this area; the new material in it includes the proofs for the performance of the Aggregating Algorithm in the problem of linear regression with square loss. Keywords: Bayes’s rule, competitive online algorithms, linear regression, prequential statistics, worst-case analysis.
Adaptive and Self-Confident On-Line Learning Algorithms
2000
Cited by 62 (7 self)
We study online learning in the linear regression framework. Most of the performance bounds for online algorithms in this framework assume a constant learning rate. To achieve these bounds the learning rate must be optimized based on a posteriori information. This information depends on the whole sequence of examples and thus it is not available to any strictly online algorithm. We introduce new techniques for adaptively tuning the learning rate as the data sequence is progressively revealed. Our techniques allow us to prove essentially the same bounds as if we knew the optimal learning rate in advance. Moreover, such techniques apply to a wide class of online algorithms, including p-norm algorithms for generalized linear regression and Weighted Majority for linear regression with absolute loss. Our adaptive tunings are radically different from previous techniques, such as the so-called doubling trick. Whereas the doubling trick restarts the online algorithm several ti...
Tracking the Best Linear Predictor
Journal of Machine Learning Research, 2001
Cited by 53 (11 self)
In most online learning research the total online loss of the algorithm is compared to the total loss of the best offline predictor $u$ from a comparison class of predictors. We call such bounds static bounds. The interesting feature of these bounds is that they hold for an arbitrary sequence of examples. Recently some work has been done where the predictor $u_t$ at each trial $t$ is allowed to change with time, and the total online loss of the algorithm is compared to the sum of the losses of $u_t$ at each trial plus the total "cost" for shifting to successive predictors. This is to model situations in which the examples change over time, and different predictors from the comparison class are best for different segments of the sequence of examples. We call such bounds shifting bounds. They hold for arbitrary sequences of examples and arbitrary sequences of predictors. Naturally, shifting bounds are much harder to prove. The only known bounds are for the case when the comparison class consists of sequences of experts or boolean disjunctions. In this paper we develop the methodology for lifting known static bounds to the shifting case. In particular we obtain bounds when the comparison class consists of linear neurons (linear combinations of experts). Our essential technique is to project the hypothesis of the static algorithm at the end of each trial into a suitably chosen convex region. This keeps the hypothesis of the algorithm well-behaved and the static bounds can be converted to shifting bounds.
Tight Worst-Case Loss Bounds for Predicting With Expert Advice
1994
Cited by 53 (10 self)
... this paper is somewhat different from the one just described. Assume that there are $N$ experts $E_i$, $i = 1, \ldots, N$, each trying to predict the outcomes $y_t$ as best they can. Let $x_{t,i}$ be the prediction of the $i$th expert $E_i$ about the ...
Relative Loss Bounds for Single Neurons
IEEE Transactions on Neural Networks, 1996
Cited by 36 (4 self)
We analyze and compare the well-known Gradient Descent algorithm and the more recent Exponentiated Gradient algorithm for training a single neuron with an arbitrary transfer function. Both algorithms are easily generalized to larger neural networks, and the generalization of Gradient Descent is the standard backpropagation algorithm. In this paper we prove worst-case loss bounds for both algorithms in the single neuron case. Since local minima make it difficult to prove worst-case bounds for gradient-based algorithms, we must use a loss function that prevents the formation of spurious local minima. We define such a matching loss function for any strictly increasing differentiable transfer function and prove worst-case loss bounds for any such transfer function and its corresponding matching loss. For example, the matching loss for the identity function is the square loss and the matching loss for the logistic transfer function is the entropic loss. The different forms of the two algori...
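The two examples of matching losses named in this abstract can be checked numerically. The sketch assumes the integral form of the matching loss commonly given in the literature, the integral of (transfer(z) - y) for z running from inverse(y) to inverse(y_hat); the abstract does not state this formula, so treat it and the helper names below as assumptions. Under that form, the identity transfer yields half the squared error.

```python
import math

def matching_loss(transfer, inverse, y, y_hat, steps=10000):
    """Midpoint-rule integral of (transfer(z) - y) from inverse(y) to inverse(y_hat)."""
    a, b = inverse(y), inverse(y_hat)
    h = (b - a) / steps
    return sum((transfer(a + (k + 0.5) * h) - y) * h for k in range(steps))

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Inverse of the logistic function, defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))

def entropic_loss(y, y_hat):
    """Relative entropy between Bernoulli(y) and Bernoulli(y_hat), 0 < y, y_hat < 1."""
    return y * math.log(y / y_hat) + (1 - y) * math.log((1 - y) / (1 - y_hat))

# Identity transfer: compare with half the squared error.
print(matching_loss(lambda z: z, lambda v: v, 0.2, 0.9), 0.5 * (0.9 - 0.2) ** 2)
# Logistic transfer: compare with the entropic loss.
print(matching_loss(logistic, logit, 0.3, 0.8), entropic_loss(0.3, 0.8))
```

Each printed pair should agree up to the numerical integration error.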
From Noise-Free to Noise-Tolerant and from Online to Batch Learning
In Proceedings of the Eighth Annual Conference on Computational Learning Theory, 1995
Cited by 17 (0 self)
A simple method is presented which, loosely speaking, virtually removes noise or misfit from data, and thereby converts a "noise-free" algorithm $A$, which learns linear functions online from data without noise or misfit, into a "noise-tolerant" algorithm $A_{nt}$ which learns linear functions from data containing noise or misfit. Given some technical conditions, this conversion preserves optimality. For instance, the optimal noise-free algorithm $B$ of Bernstein from [3] is converted into an optimal noise-tolerant algorithm $B_{nt}$. The conversion also works properly for all function classes which are closed under addition and contain linear functions as a subclass. In the second part of the paper, we show that Bernstein's online learning algorithm B can be converted into a batch learning algorithm B which consumes an (almost) minimal number of random training examples. This is true for a whole class of "PAC-style" batch learning models (including learning with an $(\epsilon, \gamma)$-good model...
Incomplete Tree Search using Adaptive Probing
2001
Cited by 15 (2 self)
When not enough time is available to fully explore a search tree, different algorithms will visit different leaves. Depth-first search and depth-bounded discrepancy search, for example, make opposite assumptions about the distribution of good leaves. Unfortunately, it is rarely clear a priori which algorithm will be most appropriate for a particular problem. Rather than fixing strong assumptions in advance, we propose an approach in which an algorithm attempts to adjust to the distribution of leaf costs in the tree while exploring it. By sacrificing completeness, such flexible algorithms can exploit information gathered during the search using only weak assumptions. As an example, we show how a simple depth-based additive cost model of the tree can be learned online. Empirical analysis using a generic tree search problem shows that adaptive probing is competitive with systematic algorithms on a variety of hard trees and outperforms them when the node-ordering heuristic makes many mistakes. Results on boolean satisfiability and two different representations of number partitioning confirm these observations. Adaptive probing combines the flexibility and robustness of local search with the ability to take advantage of constructive heuristics.