Results 1-10 of 12
Exponentiated Gradient Versus Gradient Descent for Linear Predictors
Information and Computation, 1995
Cited by 247 (12 self)
... this paper, we concentrate on linear predictors. To any vector u ∈ R ...
Tracking the best expert
In Proceedings of the 12th International Conference on Machine Learning, 1995
Cited by 198 (18 self)
We generalize the recent relative loss bounds for online algorithms where the additional loss of the algorithm on the whole sequence of examples over the loss of the best expert is bounded. The generalization allows the sequence to be partitioned into segments, and the goal is to bound the additional loss of the algorithm over the sum of the losses of the best experts for each segment. This is to model situations in which the examples change and different experts are best for certain segments of the sequence of examples. In the single segment case, the additional loss is proportional to log n, where n is the number of experts and the constant of proportionality depends on the loss function. Our algorithms do not produce the best partition; however, the loss bound shows that our predictions are close to those of the best partition. When the number of segments is k + 1 and the sequence is of length ℓ, we can bound the additional loss of our algorithm over the best partition by O(k log n + k log(ℓ/k)). For the case when the loss per trial is bounded by one, we obtain an algorithm whose additional loss over the loss of the best partition is independent of the length of the sequence. The additional loss becomes O(k log n + k log(L/k)), where L is the loss of the best partition with k + 1 segments. Our algorithms for tracking the predictions of the best expert are simple adaptations of Vovk’s original algorithm for the single best expert case. As in the original algorithms, we keep one weight per expert, and spend O(1) time per weight in each trial.
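The abstract above does not spell out the update rule. As a rough illustration of "one weight per expert, O(1) time per weight in each trial", the following sketch uses a share-style update of the kind commonly associated with tracking the best expert. The function names, the learning rate eta, the share rate alpha, and the loss data are all assumptions made for illustration, not values from the paper.

```python
import math

def fixed_share_step(weights, losses, eta=1.0, alpha=0.05):
    """One trial: exponential loss update, then share a fraction alpha.

    eta and alpha are illustrative defaults, not tuned values from the paper.
    """
    n = len(weights)
    # Loss update: each weight shrinks exponentially in its expert's loss.
    scaled = [w * math.exp(-eta * loss) for w, loss in zip(weights, losses)]
    total = sum(scaled)
    scaled = [s / total for s in scaled]  # normalise so the weights sum to 1
    if n == 1:
        return scaled
    # Share update: every expert gives away a fraction alpha of its weight,
    # split evenly among the other n - 1 experts, so no weight can collapse.
    return [(1 - alpha) * s + alpha * (1 - s) / (n - 1) for s in scaled]

def track(loss_sequence, eta=1.0, alpha=0.05):
    """Run the update over a sequence of per-expert loss vectors."""
    n = len(loss_sequence[0])
    weights = [1.0 / n] * n
    for losses in loss_sequence:
        weights = fixed_share_step(weights, losses, eta, alpha)
    return weights

# Hypothetical data: expert 0 is best for 50 trials, then expert 1 takes over.
segments = [[0.0, 1.0, 1.0]] * 50 + [[1.0, 0.0, 1.0]] * 50
final_weights = track(segments)
```

Because the share step keeps every weight bounded away from zero, the weight of expert 1 can recover within a few trials of the switch, which is the behaviour the segment-wise bounds are meant to capture.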
Relative Loss Bounds for Online Density Estimation with the Exponential Family of Distributions
Machine Learning, 2000
Cited by 116 (11 self)
We consider online density estimation with a parameterized density from the exponential family. The online algorithm receives one example at a time and maintains a parameter that is essentially an average of the past examples. After receiving an example the algorithm incurs a loss, which is the negative log-likelihood of the example with respect to the past parameter of the algorithm. An offline algorithm can choose the best parameter based on all the examples. We prove bounds on the additional total loss of the online algorithm over the total loss of the best offline parameter. These relative loss bounds hold for an arbitrary sequence of examples. The goal is to design algorithms with the best possible relative loss bounds. We use a Bregman divergence to derive and analyze each algorithm. These divergences are relative entropies between two exponential distributions. We also use our methods to prove relative loss bounds for linear regression.
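As a minimal sketch of the protocol described above, assume the simplest exponential-family member, a unit-variance Gaussian with unknown mean. The online parameter is the running average of past examples, each example is charged its negative log-likelihood under the previous parameter, and the total is compared with the best fixed mean chosen offline. The initial parameter mu0 and the sample data are assumptions for illustration.

```python
import math

def gaussian_nll(x, mu):
    """Negative log-likelihood of x under a unit-variance Gaussian with mean mu."""
    return 0.5 * math.log(2 * math.pi) + 0.5 * (x - mu) ** 2

def relative_loss(xs, mu0=0.0):
    """Online total loss minus the total loss of the best offline mean.

    mu0 is an assumed starting parameter used before any example is seen.
    """
    mu, online = mu0, 0.0
    for t, x in enumerate(xs, start=1):
        online += gaussian_nll(x, mu)  # loss uses the parameter from the past
        mu += (x - mu) / t             # running average of examples 1..t
    best = sum(xs) / len(xs)           # offline maximum-likelihood mean
    offline = sum(gaussian_nll(x, best) for x in xs)
    return online - offline

extra = relative_loss([1.0, 2.0, 3.0])
```

For the three examples shown, the online squared errors are 1, 1, and 2.25 and the offline ones are 1, 0, and 1, so the additional loss is (4.25 - 2) / 2 = 1.125.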
Online portfolio selection using multiplicative updates
Mathematical Finance, 1998
Cited by 80 (10 self)
We present an online investment algorithm which achieves almost the same wealth as the best constant-rebalanced portfolio determined in hindsight from the actual market outcomes. The algorithm employs a multiplicative update rule derived using a framework introduced by Kivinen and Warmuth. Our algorithm is very simple to implement and requires only constant storage and computing time per stock in each trading period. We tested the performance of our algorithm on real stock data from the New York Stock Exchange accumulated during a 22-year period. On this data, our algorithm clearly outperforms the best single stock as well as Cover's universal portfolio selection algorithm. We also present results for the situation in which the ...
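The abstract does not give the multiplicative update itself. The sketch below uses an exponentiated-gradient style rule in which each portfolio weight is multiplied by exp(eta * x_i / (w · x)) and the result is renormalised. The learning rate eta, the function names, and the two-stock market data are made up for illustration and are not the NYSE data mentioned above.

```python
import math

def eg_portfolio_step(weights, price_relatives, eta=0.05):
    """One trading period: return the updated portfolio and the period's growth.

    price_relatives[i] is the closing price of stock i divided by its opening
    price. eta is an assumed learning rate.
    """
    growth = sum(w * x for w, x in zip(weights, price_relatives))
    # Multiplicative update: constant time and storage per stock.
    unnorm = [w * math.exp(eta * x / growth)
              for w, x in zip(weights, price_relatives)]
    z = sum(unnorm)
    return [u / z for u in unnorm], growth

def run_market(market, eta=0.05):
    """Start from the uniform portfolio and rebalance after every period."""
    n = len(market[0])
    weights, wealth = [1.0 / n] * n, 1.0
    for price_relatives in market:
        weights, growth = eg_portfolio_step(weights, price_relatives, eta)
        wealth *= growth
    return weights, wealth

# Hypothetical market: stock 0 is flat, stock 1 gains 10% every period.
portfolio, wealth = run_market([[1.0, 1.1]] * 20)
```

On this toy market the weight on the rising stock grows every period, since its exponent is always the larger of the two.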
Adaptive and Self-Confident On-Line Learning Algorithms
2000
Cited by 62 (7 self)
We study online learning in the linear regression framework. Most of the performance bounds for online algorithms in this framework assume a constant learning rate. To achieve these bounds the learning rate must be optimized based on a posteriori information. This information depends on the whole sequence of examples and thus it is not available to any strictly online algorithm. We introduce new techniques for adaptively tuning the learning rate as the data sequence is progressively revealed. Our techniques allow us to prove essentially the same bounds as if we knew the optimal learning rate in advance. Moreover, such techniques apply to a wide class of online algorithms, including p-norm algorithms for generalized linear regression and Weighted Majority for linear regression with absolute loss. Our adaptive tunings are radically different from previous techniques, such as the so-called doubling trick. Whereas the doubling trick restarts the online algorithm several ti...
Averaging Expert Predictions
Computational Learning Theory: 4th European Conference (EuroCOLT ’99), 1999
Cited by 58 (14 self)
We consider algorithms for combining advice from a set of experts. In each trial, the algorithm receives the predictions of the experts and produces its own prediction. A loss function is applied to measure the discrepancy between the predictions and actual observations.
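The trial protocol described above can be sketched as follows. The abstract does not say how predictions are combined or which loss is used, so this sketch assumes a weighted average of the experts' predictions, exponentially updated weights, and square loss. The learning rate eta and the two-expert data are illustrative assumptions.

```python
import math

def weighted_average_forecaster(expert_preds, outcomes, eta=0.5):
    """Predict with a weighted average of experts; return (total loss, weights)."""
    n = len(expert_preds[0])
    weights = [1.0 / n] * n
    total_loss = 0.0
    for preds, y in zip(expert_preds, outcomes):
        # The algorithm sees the experts' predictions and commits to its own.
        y_hat = sum(w * p for w, p in zip(weights, preds))
        # The outcome is revealed and the square loss is charged.
        total_loss += (y_hat - y) ** 2
        # Experts with a large loss on this trial lose weight.
        weights = [w * math.exp(-eta * (p - y) ** 2)
                   for w, p in zip(weights, preds)]
        z = sum(weights)
        weights = [w / z for w in weights]
    return total_loss, weights

# Hypothetical data: expert 0 is always right, expert 1 is always wrong.
outcomes = [0.0, 1.0] * 5
expert_preds = [[y, 1.0 - y] for y in outcomes]
loss, weights = weighted_average_forecaster(expert_preds, outcomes)
```

The first prediction is 0.5 because both experts start with equal weight. After that the weight on the accurate expert grows and the per-trial loss shrinks.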
Linear Hinge Loss and Average Margin
1998
Cited by 37 (12 self)
We describe a unifying method for proving relative loss bounds for online linear threshold classification algorithms, such as the Perceptron and the Winnow algorithms. For classification problems the discrete loss is used, i.e., the total number of prediction mistakes. We introduce a continuous loss function, called the "linear hinge loss", that can be employed to derive the updates of the algorithms. We first prove bounds w.r.t. the linear hinge loss and then convert them to the discrete loss. We introduce a notion of "average margin" of a set of examples. We show how relative loss bounds based on the linear hinge loss can be converted to relative loss bounds in terms of the discrete loss using the average margin.
Tracking the Best Regressor
In Proc. 11th Annu. Conf. on Comput. Learning Theory, 1998
Cited by 18 (6 self)
In most of the online learning research the total online loss of the algorithm is compared to the total loss of the best offline predictor u from a comparison class of predictors. We call such bounds static bounds. The interesting feature of these bounds is that they hold for an arbitrary sequence of examples. Recently some work has been done where the comparison vector u_t at each trial t is allowed to change with time, and the total online loss of the algorithm is compared to the sum of the losses of u_t at each trial plus the total "cost" for shifting to successive comparison vectors. This is to model situations in which the examples change over time and different predictors from the comparison class are best for different segments of the sequence of examples. We call such bounds shifting bounds. Shifting bounds still hold for arbitrary sequences of examples and also for arbitrary partitions. The algorithm does not know the offline partition and the sequence of predictors that i...
Learning of Depth Two Neural Networks with Constant Fan-in at the Hidden Nodes (Extended Abstract)
In Proc. 9th Annu. Conf. on Comput. Learning Theory, 1996
Cited by 9 (1 self)
We present algorithms for learning depth two neural networks where the hidden nodes are threshold gates with constant fan-in. The transfer function of the output node might be more general: we have results for the cases when the threshold function, the logistic function or the identity function is used as the transfer function at the output node. We give batch and online learning algorithms for these classes of neural networks and prove bounds on the performance of our algorithms. The batch algorithms work for real valued inputs whereas the online algorithms assume that the inputs are discretized. The hypotheses of our algorithms are essentially also neural networks of depth two. However, their number of hidden nodes might be much larger than the number of hidden nodes of the neural network that has to be learned. Our algorithms can handle such a large number of hidden nodes since they rely on multiplicative weight updates at the output node, and the performance of these algorithms s...
Continuous and Discrete-Time Nonlinear Gradient Descent: Relative Loss Bounds and Convergence
In Fifth International Symposium on Artificial Intelligence and Mathematics, 1998
Cited by 8 (0 self)
We introduce a general algorithm for continuous and discrete-time nonlinear gradient descent. The nonlinearity is captured by the choice of a link function. The discrete-time algorithm yields, for various choices of link function, the conventional gradient descent algorithm as well as several exponentiated gradient ones. We obtain relative loss bounds for the general algorithm in an online setting for both the continuous and discrete-time versions. These bounds reveal the dependence on the link function and show that an additional term is present in the discrete-time case which disappears in the continuous-time case. This additional term is responsible for the pair of dual norms that appear in the relative loss bounds for linear and logistic regression. The continuous-time version is also shown to have a simple proof of convergence in the batch setting. Convergence of Hopfield recurrent neural networks is seen as a special case.
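One way to read the role of the link function, sketched here under assumptions rather than taken from the paper: keep a parameter vector theta, form the weights as w = link(theta) componentwise, and take an ordinary gradient step on theta using the gradient of the loss with respect to w. With the identity link this is plain gradient descent, and with the exponential link the weights change multiplicatively, as in an unnormalised exponentiated gradient update. Square loss, a linear predictor, and the name link_gd_step are illustrative choices.

```python
import math

def link_gd_step(theta, x, y, link, eta=0.1):
    """One discrete-time step on theta; the weights are w_i = link(theta_i)."""
    w = [link(t) for t in theta]
    y_hat = sum(wi * xi for wi, xi in zip(w, x))
    # Gradient of the square loss (y_hat - y)^2 with respect to w.
    grad = [2.0 * (y_hat - y) * xi for xi in x]
    return [t - eta * g for t, g in zip(theta, grad)]

def identity(t):
    return t

x, y = [1.0, 2.0], 1.0

# Identity link: theta is the weight vector, so this is gradient descent.
gd_weights = link_gd_step([0.0, 0.0], x, y, identity)

# Exponential link: starting from w = (1, 1), the additive step on theta
# multiplies each weight by exp(-eta * grad_i).
eg_theta = link_gd_step([0.0, 0.0], x, y, math.exp)
eg_weights = [math.exp(t) for t in eg_theta]
```

In the example, the identity link moves the weights from (0, 0) to (0.2, 0.4), while the exponential link scales the weights (1, 1) by exp(-0.4) and exp(-0.8).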