Results 1–10 of 49
Solving Large Scale Linear Prediction Problems Using Stochastic Gradient Descent Algorithms
 ICML 2004: PROCEEDINGS OF THE TWENTY-FIRST INTERNATIONAL CONFERENCE ON MACHINE LEARNING. OMNIPRESS
, 2004
Abstract

Cited by 116 (10 self)
Linear prediction methods, such as least squares for regression, logistic regression and support vector machines for classification, have been extensively used in statistics and machine learning. In this paper, we study stochastic gradient descent (SGD) algorithms on regularized forms of linear prediction methods. This class of methods, related to online algorithms such as the perceptron, is both efficient and very simple to implement.
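As a rough sketch of the kind of method the abstract describes (not the paper's exact algorithm), SGD for L2-regularized logistic regression fits in a few lines; the function name `sgd_logistic` and all hyperparameter values here are illustrative assumptions:

```python
import numpy as np

def sgd_logistic(X, y, lam=0.01, lr=0.1, epochs=5, seed=0):
    """SGD on L2-regularized logistic loss; labels y in {-1, +1}."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):
            margin = y[i] * (X[i] @ w)
            # gradient of log(1 + exp(-margin)) + (lam/2)*||w||^2 at one example
            grad = -y[i] * X[i] / (1.0 + np.exp(margin)) + lam * w
            w -= lr * grad
    return w

# toy linearly separable data
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = sgd_logistic(X, y)
preds = np.sign(X @ w)
```

Each update touches a single example, which is what makes the approach attractive at large scale.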
Adaptive and Self-Confident On-Line Learning Algorithms
, 2000
Abstract

Cited by 99 (8 self)
We study online learning in the linear regression framework. Most of the performance bounds for online algorithms in this framework assume a constant learning rate. To achieve these bounds the learning rate must be optimized based on a posteriori information. This information depends on the whole sequence of examples and thus it is not available to any strictly online algorithm. We introduce new techniques for adaptively tuning the learning rate as the data sequence is progressively revealed. Our techniques allow us to prove essentially the same bounds as if we knew the optimal learning rate in advance. Moreover, such techniques apply to a wide class of online algorithms, including p-norm algorithms for generalized linear regression and Weighted Majority for linear regression with absolute loss. Our adaptive tunings are radically different from previous techniques, such as the so-called doubling trick. Whereas the doubling trick restarts the online algorithm several ti...
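A minimal sketch of the self-tuning idea, assuming square-loss online linear regression and a learning rate that shrinks with the accumulated squared gradient norm (a simplification of the paper's tunings; the function name and constants are hypothetical):

```python
import numpy as np

def adaptive_gd(stream, d, c=1.0):
    """Online linear regression where the step size eta_t is set from the
    data seen so far (roughly c / sqrt(sum of squared gradient norms)),
    so no horizon or loss bound must be known in advance."""
    w = np.zeros(d)
    grad_sq = 0.0
    losses = []
    for x, y in stream:
        pred = w @ x
        losses.append((pred - y) ** 2)
        g = 2.0 * (pred - y) * x          # gradient of the square loss
        grad_sq += g @ g
        eta = c / np.sqrt(grad_sq + 1e-12)
        w -= eta * g
    return w, losses

rng = np.random.default_rng(1)
w_true = np.array([0.5, -0.3])
stream = [(x, x @ w_true) for x in rng.normal(size=(200, 2))]
w, losses = adaptive_gd(stream, d=2)
```

The rate needs no a posteriori information: it depends only on gradients already observed, which is the strictly-online constraint the abstract emphasizes.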
Relative Loss Bounds for Multidimensional Regression Problems
 MACHINE LEARNING
, 2001
Abstract

Cited by 90 (15 self)
We study online generalized linear regression with multidimensional outputs, i.e., neural networks with multiple output nodes but no hidden nodes. We allow at the final layer transfer functions such as the softmax function that need to consider the linear activations to all the output neurons. The weight vectors used to produce the linear activations are represented indirectly by maintaining separate parameter vectors. We get the weight vector by applying a particular parameterization function to the parameter vector. Updating the parameter vectors upon seeing new examples is done additively, as in the usual gradient descent update. However, by using a nonlinear parameterization function between the parameter vectors and the weight vectors, we can make the resulting update of the weight vector quite different from a true gradient descent update. To analyse such updates, we define a notion of a matching loss function and apply it both to the transfer function and to the parameterization function. The loss function that matches the transfer function is used to measure the goodness of the predictions of the algorithm. The loss function that matches the parameterization function can be used both as a measure of divergence between models in motivating the update rule of the algorithm and as a measure of progress in analyzing its relative performance compared to an arbitrary fixed model. As a result, we have a unified treatment that generalizes earlier results for the gradient descent and exponentiated gradient algorithms to multidimensional outputs, including multiclass logistic regression.
Sequential procedures for aggregating arbitrary estimators of a conditional mean
, 2005
Abstract

Cited by 47 (2 self)
In this paper we describe and analyze a sequential procedure for aggregating linear combinations of a finite family of regression estimates, with particular attention to linear combinations having coefficients in the generalized simplex. The procedure is based on exponential weighting, and has a computationally tractable approximation. Analysis of the procedure is based in part on techniques from the sequential prediction of nonrandom sequences. Here these techniques are applied in a stochastic setting to obtain cumulative loss bounds for the aggregation procedure. From the cumulative loss bounds we derive an oracle inequality for the aggregate estimator for an unbounded response having a suitable moment generating function. The inequality shows that the risk of the aggregate estimator is less than the risk of the best candidate linear combination in the generalized simplex, plus a complexity term that depends on the size of the coefficient set. The inequality readily yields convergence rates for aggregation over the unit simplex that are within logarithmic factors of known minimax bounds. Some preliminary results on model selection are also presented.
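A toy sketch of exponential weighting over a finite family of estimators, with the aggregate kept as a convex combination in the simplex (the paper's procedure and its oracle-inequality analysis are more general; the function name and the choice eta=0.5 are assumptions):

```python
import numpy as np

def exp_weight_aggregate(preds, y, eta=0.5):
    """Sequential exponential weighting over K candidate estimators.
    preds: (T, K) candidate predictions; y: (T,) targets.
    Weights are multiplied by the exponentiated negative square loss."""
    T, K = preds.shape
    w = np.full(K, 1.0 / K)
    agg = np.empty(T)
    for t in range(T):
        agg[t] = w @ preds[t]                    # convex combination
        w *= np.exp(-eta * (preds[t] - y[t]) ** 2)
        w /= w.sum()
    return agg

# two candidates: one matches the target, one is biased by +1
T = 100
y = np.sin(np.linspace(0, 3, T))
preds = np.stack([y, y + 1.0], axis=1)
agg = exp_weight_aggregate(preds, y)
```

The weight of the biased candidate decays geometrically, so the aggregate rapidly approaches the better estimator, mirroring the cumulative-loss bounds the abstract refers to.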
Potential-based Algorithms in Online Prediction and Game Theory
Abstract

Cited by 42 (4 self)
In this paper we show that several known algorithms for sequential prediction problems (including Weighted Majority and the quasi-additive family of Grove, Littlestone, and Schuurmans), for playing iterated games (including Freund and Schapire's Hedge and MW, as well as the strategies of Hart and Mas-Colell), and for boosting (including AdaBoost) are special cases of a general decision strategy based on the notion of potential. By analyzing this strategy we derive known performance bounds, as well as new bounds, as simple corollaries of a single general theorem. Besides offering a new and unified view on a large family of algorithms, we establish a connection between potential-based analysis in learning and its counterparts independently developed in game theory. By exploiting this connection, we show that certain learning problems are instances of more general game-theoretic problems. In particular, we describe a notion of generalized regret and show its applications in learning theory.
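The exponential potential recovers the familiar Hedge strategy; a minimal sketch, assuming losses in [0, 1] and a fixed learning rate (the names and constants are illustrative, not the paper's general potential-based framework):

```python
import numpy as np

def hedge(loss_matrix, eta=0.5):
    """Hedge: play a distribution over N experts each round, then update
    weights multiplicatively by exp(-eta * loss). losses in [0, 1]."""
    T, N = loss_matrix.shape
    w = np.full(N, 1.0 / N)
    total = 0.0
    for t in range(T):
        p = w / w.sum()
        total += p @ loss_matrix[t]         # expected loss of the mixture
        w *= np.exp(-eta * loss_matrix[t])
    return total

rng = np.random.default_rng(0)
T, N = 300, 5
losses = rng.uniform(size=(T, N))
losses[:, 2] *= 0.1                         # expert 2 is clearly best
algo_loss = hedge(losses)
best_loss = losses.sum(axis=0).min()
```

The standard regret bound for this forecaster, total loss at most best expert's loss plus ln(N)/eta + eta*T/8, is one of the performance bounds that follows from the potential-based view.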
Relative Loss Bounds for Single Neurons
 IEEE Transactions on Neural Networks
, 1996
Abstract

Cited by 40 (6 self)
We analyze and compare the well-known Gradient Descent algorithm and the more recent Exponentiated Gradient algorithm for training a single neuron with an arbitrary transfer function. Both algorithms are easily generalized to larger neural networks, and the generalization of Gradient Descent is the standard backpropagation algorithm. In this paper we prove worst-case loss bounds for both algorithms in the single neuron case. Since local minima make it difficult to prove worst-case bounds for gradient-based algorithms, we must use a loss function that prevents the formation of spurious local minima. We define such a matching loss function for any strictly increasing differentiable transfer function and prove worst-case loss bounds for any such transfer function and its corresponding matching loss. For example, the matching loss for the identity function is the square loss and the matching loss for the logistic transfer function is the entropic loss. The different forms of the two algori...
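A minimal sketch of one Exponentiated Gradient step for a linear neuron (identity transfer, so the matching loss is the square loss), with the weights kept in the probability simplex by the multiplicative update; this is an illustration only, and the helper name and eta value are assumptions:

```python
import numpy as np

def eg_update(w, x, y, eta=0.5):
    """One Exponentiated Gradient step for a linear neuron, square loss.
    Weights stay in the probability simplex via a multiplicative update."""
    pred = w @ x
    g = 2.0 * (pred - y) * x            # gradient of (pred - y)^2 w.r.t. w
    w = w * np.exp(-eta * g)
    return w / w.sum()

# target is a sparse convex combination of the inputs
rng = np.random.default_rng(2)
d = 10
w_star = np.zeros(d)
w_star[3] = 1.0
w = np.full(d, 1.0 / d)
errs = []
for _ in range(500):
    x = rng.uniform(size=d)
    y = w_star @ x
    errs.append((w @ x - y) ** 2)
    w = eg_update(w, x, y)
```

Contrast with Gradient Descent, whose additive step `w -= eta * g` can leave the simplex; the multiplicative form is what gives EG its different (often better for sparse targets) loss bounds.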
On Prediction of Individual Sequences
, 1998
Abstract

Cited by 38 (5 self)
Sequential randomized prediction of an arbitrary binary sequence is investigated. No assumption is made on the
A Zero-Delay Sequential Scheme for Lossy Coding of Individual Sequences
, 2000
Abstract

Cited by 26 (5 self)
We consider adaptive sequential lossy coding of bounded individual sequences when the performance is measured by the sequentially accumulated mean squared distortion. The encoder and the decoder are connected via a noiseless channel of capacity R and both are assumed to have zero delay. No probabilistic assumptions are made on how the sequence to be encoded is generated. For any bounded sequence of length n, the distortion redundancy is defined as the normalized cumulative distortion of the sequential scheme minus the normalized cumulative distortion of the best scalar quantizer of rate R which is matched to this particular sequence. We demonstrate the existence of a zero-delay sequential scheme which uses common randomization in the encoder and the decoder such that the normalized maximum distortion redundancy converges to zero at a rate n^{-1/5} log n as the length of the encoded sequence n increases without bound.
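For intuition about the comparison class in the redundancy definition, a rate-R scalar quantizer, here is a sketch that measures the normalized cumulative squared distortion of a uniform quantizer with 2^R levels (the function name and the choice of a uniform grid on [lo, hi] are my assumptions, not the paper's scheme):

```python
import numpy as np

def scalar_quantizer_distortion(seq, R, lo=0.0, hi=1.0):
    """Normalized cumulative squared distortion of a uniform scalar
    quantizer of rate R (2**R levels) on a bounded sequence in [lo, hi]."""
    levels = 2 ** R
    # level centers of a uniform quantizer on [lo, hi]
    centers = lo + (np.arange(levels) + 0.5) * (hi - lo) / levels
    seq = np.asarray(seq, dtype=float)
    idx = np.clip(((seq - lo) / (hi - lo) * levels).astype(int), 0, levels - 1)
    q = centers[idx]
    return np.mean((seq - q) ** 2)

rng = np.random.default_rng(3)
x = rng.uniform(size=1000)
d1 = scalar_quantizer_distortion(x, R=1)
d3 = scalar_quantizer_distortion(x, R=3)
```

Raising the rate shrinks the cell width and hence the distortion; the paper's redundancy measures how far a zero-delay sequential scheme falls short of the best such quantizer for the particular sequence.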
Online Oblivious Routing
 In Proceedings of the ACM Symposium on Parallelism in Algorithms and Architectures (SPAA)
, 2003
Abstract

Cited by 18 (2 self)
We consider an online version of the oblivious routing problem. Oblivious routing is the problem of picking a routing between each pair of nodes (or a set of flows), without knowledge of the traffic or demand between each pair, with the goal of minimizing the maximum congestion on any edge in the graph. In the online version of the problem, we consider a "repeated-game" setting, in which the algorithm is allowed to choose a new routing each night, but is still oblivious to the demands that will occur the next day. The cost of the algorithm at every time step is its competitive ratio, or the ratio of its congestion to the minimum possible congestion for the demands at that time step.
Strategies for Prediction under Imperfect Monitoring
 Mimeo
, 2007
Abstract

Cited by 18 (6 self)
We propose simple randomized strategies for sequential prediction under imperfect monitoring, that is, when the forecaster does not have access to the past outcomes but rather to a feedback signal. The proposed strategies are consistent in the sense that they achieve, asymptotically, the best possible average reward. It was Rustichini [25] who first proved the existence of such consistent predictors. The forecasters presented here offer the first constructive proof of consistency. Moreover, the proposed algorithms are computationally efficient. We also establish upper bounds for the rates of convergence. In the case of deterministic feedback, these rates are optimal up to logarithmic terms.