Results 1–10 of 69
Accelerated training of conditional random fields with stochastic gradient methods
In ICML, 2006
Abstract

Cited by 140 (6 self)
We apply Stochastic Meta-Descent (SMD), a stochastic gradient optimization method with gain vector adaptation, to the training of Conditional Random Fields (CRFs). On several large data sets, the resulting optimizer converges to the same quality of solution over an order of magnitude faster than limited-memory BFGS, the leading method reported to date. We report results for both exact and inexact inference techniques.
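The core mechanism described above, per-parameter gains adapted using Hessian-vector products, can be sketched on a deterministic quadratic where the Hessian-vector product is exact. This is a simplified rendering of Schraudolph-style SMD, not the paper's CRF training loop; hyperparameters, sign conventions, and names are illustrative assumptions.

```python
import numpy as np

# Simplified Stochastic Meta-Descent (SMD) on f(w) = 0.5 * sum(h * w**2).
# Per-parameter gains eta are adapted via v = dw/d(log eta), which SMD
# maintains cheaply with Hessian-vector products (exact here: h * v).
# mu, lam, rho and the quadratic test problem are illustrative choices.

def smd_quadratic(h, w0, eta0=0.05, mu=0.05, lam=0.99, rho=0.5, steps=500):
    w = w0.astype(float).copy()
    eta = np.full_like(w, eta0)   # per-parameter gain vector
    v = np.zeros_like(w)          # sensitivity of w to the log-gains
    for _ in range(steps):
        g = h * w                             # gradient of the quadratic
        # meta-step: descend the loss w.r.t. log(eta); the max() floor keeps
        # a gain from collapsing to zero in a single update
        eta *= np.maximum(rho, 1.0 - mu * g * v)
        w -= eta * g                          # per-parameter gradient step
        v = lam * v - eta * (g + lam * h * v) # uses the HV product h * v
    return w, eta

h = np.array([1.0, 10.0])                     # ill-conditioned diagonal Hessian
w, eta = smd_quadratic(h, np.array([1.0, 1.0]))
```

Because gain growth is driven by the product of successive gradients and sensitivities, adaptation stalls automatically as the iterate approaches the minimum.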
Fast iterative alignment of pose graphs with poor initial estimates
In IEEE Intl. Conf. on Robotics and Automation (ICRA), 2006
Abstract

Cited by 83 (11 self)
A robot exploring an environment can estimate its own motion and the relative positions of features in the environment. Simultaneous Localization and Mapping (SLAM) algorithms attempt to fuse these estimates to produce a map and a robot trajectory. The constraints are generally nonlinear, thus SLAM can be viewed as a nonlinear optimization problem. The optimization can be difficult, due to poor initial estimates arising from odometry data, and due to the size of the state space. We present a fast nonlinear optimization algorithm that rapidly recovers the robot trajectory, even when given a poor initial estimate. Our approach uses a variant of Stochastic Gradient Descent on an alternative state-space representation that has good stability and computational properties. We compare our algorithm to several others, using both real and synthetic data sets.
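A toy rendering of the idea, stochastic gradient descent over an incremental (relative) state space for a 1-D pose chain with one loop closure, might look like the following. The graph, step-size schedule, and function names are invented for illustration and are far simpler than the paper's 2-D formulation.

```python
import numpy as np

# SGD alignment of a 1-D pose chain in an incremental state space.
# delta[i] holds the relative motion p[i+1] - p[i]; a constraint (a, b, d)
# demands p[b] - p[a] = d. Updating the whole slice delta[a:b] at once is
# what makes the relative parameterization handle loop closures well.

def sgd_align(n_poses, constraints, iters=2000, seed=0):
    rng = np.random.default_rng(seed)
    delta = np.zeros(n_poses - 1)
    for t in range(iters):
        eta = 1.0 / (1.0 + t)                  # harmonic step-size decay
        a, b, d = constraints[rng.integers(len(constraints))]
        r = delta[a:b].sum() - d               # residual of this constraint
        delta[a:b] -= eta * r / (b - a)        # spread the correction evenly
    return np.concatenate(([0.0], np.cumsum(delta)))

# three odometry edges plus one slightly inconsistent loop-closure edge
constraints = [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 1.0), (0, 3, 3.3)]
poses = sgd_align(4, constraints)
sse = sum((poses[b] - poses[a] - d) ** 2 for a, b, d in constraints)
```

Starting from all-zero increments (a deliberately poor initial estimate), the residual over all edges shrinks to near the least-squares compromise between the conflicting measurements.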
A Stochastic Gradient Method with an Exponential Convergence Rate for Strongly-Convex Optimization with Finite Training Sets. arXiv preprint arXiv:1202.6258
, 2012
Abstract

Cited by 76 (11 self)
We propose a new stochastic gradient method for optimizing the sum of a finite set of smooth functions, where the sum is strongly convex. While standard stochastic gradient methods converge at sublinear rates for this problem, the proposed method incorporates a memory of previous gradient values in order to achieve a linear convergence rate. In a machine learning context, numerical experiments indicate that the new algorithm can dramatically outperform standard algorithms, both in terms of optimizing the training objective and reducing the testing objective quickly.
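The "memory of previous gradient values" idea can be sketched as a Stochastic Average Gradient (SAG) style loop on ridge-regularized least squares. The step size, iteration count, and problem setup below are illustrative assumptions, not the paper's analyzed configuration.

```python
import numpy as np

# SAG-style sketch for minimizing the strongly convex average
# (1/n) * sum_i f_i(w), with f_i(w) = 0.5*(a_i.w - b_i)^2 + 0.5*lam*||w||^2.
# A table of the last gradient seen for each f_i is kept; each iteration
# refreshes one entry and steps along the average of the stored gradients.

def sag(A, b, lam=0.1, iters=20000, seed=0):
    rng = np.random.default_rng(seed)
    n, d = A.shape
    w = np.zeros(d)
    grads = np.zeros((n, d))   # memory of the last gradient per f_i
    g_sum = np.zeros(d)        # running sum of the stored gradients
    step = 1.0 / (np.max(np.sum(A * A, axis=1)) + lam)  # conservative ~1/L
    for _ in range(iters):
        i = rng.integers(n)
        g_i = (A[i] @ w - b[i]) * A[i] + lam * w  # fresh gradient of f_i
        g_sum += g_i - grads[i]                   # swap it into the sum
        grads[i] = g_i
        w -= step * g_sum / n                     # step along the average
    return w

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 5))
b = rng.standard_normal(50)
w_sag = sag(A, b)
# exact minimizer of the same objective, for comparison
w_star = np.linalg.solve(A.T @ A / 50 + 0.1 * np.eye(5), A.T @ b / 50)
```

Unlike plain SGD, each full-gradient estimate here combines n stored per-example gradients, which is what buys the linear rate on strongly convex finite sums.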
Fast Curvature Matrix-Vector Products for Second-Order Gradient Descent
Neural Computation, 2002
Abstract

Cited by 56 (15 self)
We propose a generic method for iteratively approximating various second-order gradient steps (Newton, Gauss-Newton, Levenberg-Marquardt, and natural gradient) in linear time per iteration, using special curvature matrix-vector products that can be computed in O(n). Two recent acceleration techniques for online learning, matrix momentum and stochastic meta-descent (SMD), in fact implement this approach. Since both were originally derived by very different routes, this offers fresh insight into their operation, resulting in further improvements to SMD.
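The paper's exact curvature matrix-vector products are computed with algorithmic differentiation; a cheaper, approximate stand-in that also costs two gradient evaluations and O(n) work is a directional finite difference of the gradient, sketched here on a quadratic where the result can be checked exactly. This finite-difference version is an assumption of this sketch, not the paper's method.

```python
import numpy as np

# Hessian-vector product H @ v approximated by a finite difference of the
# gradient: H v ~ (grad(w + eps*v) - grad(w)) / eps. Two gradient calls,
# O(n) time and memory, no explicit Hessian ever formed.

def hvp_fd(grad, w, v, eps=1e-5):
    return (grad(w + eps * v) - grad(w)) / eps

# on a quadratic f(w) = 0.5 * w' A w the gradient is linear, so the
# finite difference is exact up to floating-point rounding
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])
grad = lambda w: A @ w
w = np.array([0.3, -0.7])
v = np.array([1.0, 2.0])
hv = hvp_fd(grad, w, v)   # should match A @ v
```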
Piecewise pseudolikelihood for efficient CRF training
In International Conference on Machine Learning (ICML), 2007
Abstract

Cited by 33 (2 self)
Discriminative training of graphical models can be expensive if the variables have large cardinality, even if the graphical structure is tractable. In such cases, pseudolikelihood is an attractive alternative, because its running time is linear in the variable cardinality, but on some data its accuracy can be poor. Piecewise training (Sutton & McCallum, 2005) can have better accuracy but does not scale as well in the variable cardinality. In this paper, we introduce piecewise pseudolikelihood, which retains the computational efficiency of pseudolikelihood but can have much better accuracy. On several benchmark NLP data sets, piecewise pseudolikelihood has better accuracy than standard pseudolikelihood, and in many cases is nearly equivalent to maximum likelihood, with five to ten times less training time than batch CRF training.
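As a concrete illustration of the plain pseudolikelihood objective (not the paper's piecewise variant), the log-pseudolikelihood of a small binary pairwise MRF is a sum of local logistic conditionals, which is why its cost is linear in the variable cardinality. When all couplings are zero the model factorizes and pseudolikelihood coincides with the exact log-likelihood, giving a handy sanity check. The model and names here are invented for illustration.

```python
import numpy as np
from itertools import product

# Binary pairwise MRF with score s(x) = theta.x + x' J x, where J is
# upper-triangular with zero diagonal and x is in {0,1}^n.

def log_pseudolikelihood(x, theta, J):
    # sum over i of log p(x_i | x_rest); each conditional is logistic
    # in the local field theta[i] + sum_j (J[i,j] + J[j,i]) * x[j]
    lpl = 0.0
    for i in range(len(x)):
        field = theta[i] + J[i] @ x + J[:, i] @ x
        lpl += x[i] * field - np.log1p(np.exp(field))
    return lpl

def log_likelihood(x, theta, J):
    # exact log-likelihood via brute-force partition function (tiny n only)
    score = lambda y: theta @ y + y @ J @ y
    logZ = np.log(sum(np.exp(score(np.array(y)))
                      for y in product([0, 1], repeat=len(x))))
    return score(np.array(x)) - logZ

theta = np.array([0.2, -0.5, 1.0])
J = np.zeros((3, 3))          # no couplings: PL equals the exact likelihood
x = np.array([1, 0, 1])
lpl = log_pseudolikelihood(x, theta, J)
ll = log_likelihood(x, theta, J)
```

With nonzero couplings the two objectives differ, and that gap is the accuracy loss the paper's piecewise pseudolikelihood aims to close.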
Empirical Risk Minimization of Graphical Model Parameters Given Approximate Inference, Decoding, and Model Structure
Abstract

Cited by 29 (6 self)
Graphical models are often used “inappropriately,” with approximations in the topology, inference, and prediction. Yet it is still common to train their parameters to approximately maximize training likelihood. We argue that instead, one should seek the parameters that minimize the empirical risk of the entire imperfect system. We show how to locally optimize this risk using backpropagation and stochastic meta-descent. Over a range of synthetic-data problems, compared to the usual practice of choosing approximate MAP parameters, our approach significantly reduces loss on test data, sometimes by an order of magnitude.
On the role of tracking in stationary environments
In Proceedings of the Twenty-Fourth International Conference on Machine Learning (ICML), 2007
Abstract

Cited by 22 (9 self)
It is often thought that learning algorithms that track the best solution, as opposed to converging to it, are important only on non-stationary problems. We present three results suggesting that this is not so. First, we illustrate in a simple concrete example, the Black and White problem, that tracking can perform better than any converging algorithm on a stationary problem. Second, we show the same point on a larger, more realistic problem, an application of temporal-difference learning to computer Go. Our third result suggests that tracking in stationary problems could be important for meta-learning research (e.g., learning to learn, feature selection, transfer). We apply a meta-learning algorithm for step-size adaptation, IDBD (Sutton, 1992a), to the Black and White problem, showing that meta-learning has a dramatic long-term effect on performance whereas, on an analogous converging problem, meta-learning has only a small second-order effect. This result suggests a way of eventually overcoming a major obstacle to meta-learning research: the lack of an independent methodology for task selection.
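The IDBD rule cited above maintains one meta-learned step size per input weight. A compact sketch for linear (LMS-style) regression follows; the meta step size, initial values, and the noise-free test problem are arbitrary choices for illustration, not the paper's experimental setup.

```python
import numpy as np

# IDBD (Sutton, 1992): per-weight step sizes alpha_i = exp(beta_i), adapted
# by a meta step-size theta_meta; h_i traces the recent influence of weight i
# so that correlated errors push its step size up and oscillation pushes it down.

def idbd(xs, ys, theta_meta=0.01, beta0=np.log(0.05)):
    d = xs.shape[1]
    w = np.zeros(d)
    beta = np.full(d, beta0)
    h = np.zeros(d)
    for x, y in zip(xs, ys):
        delta = y - w @ x                    # prediction error
        beta += theta_meta * delta * x * h   # meta-gradient on log step sizes
        alpha = np.exp(beta)
        w += alpha * delta * x               # LMS update with per-weight rates
        h = h * np.clip(1.0 - alpha * x * x, 0.0, None) + alpha * delta * x
    return w

rng = np.random.default_rng(2)
w_true = np.array([1.0, -2.0, 0.5])
xs = rng.standard_normal((2000, 3))
ys = xs @ w_true                             # noise-free linear targets
w_hat = idbd(xs, ys)
```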
Neural network-based colonoscopic diagnosis using on-line learning and differential evolution
Applied Soft Computing 4, 2004
Abstract

Cited by 20 (6 self)
In this paper, on-line training of neural networks is investigated in the context of computer-assisted colonoscopic diagnosis. A memory-based adaptation of the learning rate for on-line Backpropagation is proposed and used to seed an on-line evolution process that applies a Differential Evolution Strategy to (re)adapt the neural network to modified environmental conditions. Our approach looks at on-line training from the perspective of tracking the changing location of an approximate solution of a pattern-based, and thus dynamically changing, error function. The proposed hybrid strategy is compared with other standard training methods that have traditionally been used for training neural networks off-line. Results in interpreting colonoscopy images and frames of video sequences are promising and suggest that networks trained with this strategy accurately detect malignant regions of interest.
No More Pesky Learning Rates
2012
Abstract

Cited by 20 (3 self)
The performance of stochastic gradient descent (SGD) depends critically on how learning rates are tuned and decreased over time. We propose a method to automatically adjust multiple learning rates so as to minimize the expected error at any one time. The method relies on local gradient variations across samples. In our approach, learning rates can increase as well as decrease, making it suitable for non-stationary problems. Using a number of convex and non-convex learning tasks, we show that the resulting algorithm matches the performance of SGD or other adaptive approaches with their best settings obtained through systematic search, and effectively removes the need for learning rate tuning.
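A stripped-down version of the idea, a per-parameter rate built from the ratio of the squared mean gradient to the mean squared gradient, scaled by curvature, can be sketched on a noisy 1-D quadratic. The fixed memory constant, the known curvature, and the initialization are simplifications assumed for this sketch, not the paper's full method (which also adapts the memory).

```python
import numpy as np

# Simplified adaptive learning rate in the spirit of the paper:
#   eta = E[g]^2 / (E[g^2] * h)
# where h is the (here: known) curvature. Since E[g]^2 <= E[g^2], the rate
# never exceeds 1/h, and it anneals automatically as E[g] -> 0 near the
# optimum while the gradient variance persists.

def adaptive_sgd_1d(h, c_samples, w0=5.0, tau=10.0):
    w = w0
    g_bar, v_bar = 0.0, 1.0          # running mean / mean-square of gradients
    for c in c_samples:
        g = h * (w - c)              # stochastic gradient of 0.5*h*(w - c)^2
        g_bar += (g - g_bar) / tau
        v_bar += (g * g - v_bar) / tau
        eta = g_bar * g_bar / (v_bar * h)
        w -= eta * g
    return w

rng = np.random.default_rng(3)
c_samples = rng.normal(0.0, 1.0, size=5000)   # noisy targets around c* = 0
w_final = adaptive_sgd_1d(1.0, c_samples)
```

The rate is large while the mean gradient dominates the noise (far from the optimum) and shrinks inside the noise ball, so no hand-tuned decay schedule is needed.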