On the limited memory BFGS method for large scale optimization
Mathematical Programming, 1989
Efficient BackProp
1998
The convergence of backpropagation learning is analyzed so as to explain common phenomena observed by practitioners. Many undesirable behaviors of backprop can be avoided with tricks that are rarely exposed in serious technical publications. This paper gives some of those tricks, and offers explanations of why they work. Many authors have suggested that second-order optimization methods are advantageous for neural net training. It is shown that most "classical" second-order methods are impractical for large neural networks. A few methods are proposed that do not have these limitations. 1 Introduction Backpropagation is a very popular neural network learning algorithm because it is conceptually simple, computationally efficient, and because it often works. However, getting it to work well, and sometimes to work at all, can seem more of an art than a science. Designing and training a network using backprop requires making many seemingly arbitrary choices such as the number ...
First- and Second-Order Methods for Learning: Between Steepest Descent and Newton's Method
Neural Computation, 1992
On-line first-order backpropagation is sufficiently fast and effective for many large-scale classification problems, but for very high precision mappings, batch processing may be the method of choice. This paper reviews first- and second-order optimization methods for learning in feedforward neural networks. The viewpoint is that of optimization: many methods can be cast in the language of optimization techniques, allowing the transfer to neural nets of detailed results about computational complexity and safety procedures to ensure convergence and to avoid numerical problems. The review is not intended to deliver detailed prescriptions for the most appropriate methods in specific applications, but to illustrate the main characteristics of the different methods and their mutual relations.
Representations of Quasi-Newton Matrices and Their Use in Limited Memory Methods
1996
We derive compact representations of BFGS and symmetric rank-one matrices for optimization. These representations allow us to efficiently implement limited memory methods for large constrained optimization problems. In particular, we discuss how to compute projections of limited memory matrices onto subspaces. We also present a compact representation of the matrices generated by Broyden's update for solving systems of nonlinear equations.
Learning Dependency-Based Compositional Semantics
Compositional question answering begins by mapping questions to logical forms, but training a semantic parser to perform this mapping typically requires the costly annotation of the target logical forms. In this paper, we learn to map questions to answers via latent logical forms, which are induced automatically from question-answer pairs. In tackling this challenging learning problem, we introduce a new semantic representation which highlights a parallel between dependency syntax and efficient evaluation of logical forms. On two standard semantic parsing benchmarks (GEO and JOBS), our system obtains the highest published accuracies, despite requiring no annotated logical forms.
Theory of Algorithms for Unconstrained Optimization
1992
In this article I will attempt to review the most recent advances in the theory of unconstrained optimization, and will also describe some important open questions. Before doing so, I should point out that the value of the theory of optimization is not limited to its capacity for explaining the behavior of the most widely used techniques. The question
Feature Forest Models for Probabilistic HPSG Parsing
In Computational Linguistics, 2008
Probabilistic modeling of lexicalized grammars is difficult because these grammars exploit complicated data structures, such as typed feature structures. This prevents us from applying common methods of probabilistic modeling in which a complete structure is divided into substructures under the assumption of statistical independence among substructures. For example, part-of-speech tagging of a sentence is decomposed into tagging of each word, and CFG parsing is split into applications of CFG rules. These methods have relied on the structure of the target problem, namely lattices or trees, and cannot be applied to graph structures including typed feature structures. This article proposes the feature forest model as a solution to the problem of probabilistic modeling of complex data structures including typed feature structures. The feature forest model provides a method for probabilistic modeling without the independence assumption when probabilistic events are represented with feature forests. Feature forests are generic data structures that represent ambiguous trees in a packed forest structure. Feature forest models are maximum entropy models defined over feature forests. A dynamic programming algorithm is proposed for maximum entropy estimation without unpacking feature forests. Thus probabilistic modeling of
A new conjugate gradient method with guaranteed descent and an efficient line search
SIAM J. Optim., 2005
A new nonlinear conjugate gradient method and an associated implementation, based on an inexact line search, are proposed and analyzed. With exact line search, our method reduces to a nonlinear version of the Hestenes–Stiefel conjugate gradient scheme. For any (inexact) line search, our scheme satisfies the descent condition $g_k^{T} d_k \le -\tfrac{7}{8}\,\|g_k\|^{2}$. Moreover, a global convergence result is established when the line search fulfills the Wolfe conditions. A new line search scheme is developed that is efficient and highly accurate. Efficiency is achieved by exploiting properties of linear interpolants in a neighborhood of a local minimizer. High accuracy is achieved by using a convergence criterion, which we call the "approximate Wolfe" conditions, obtained by replacing the sufficient decrease criterion in the Wolfe conditions with an approximation that can be evaluated with greater precision in a neighborhood of a local minimum than the usual sufficient decrease criterion. Numerical comparisons are given with both L-BFGS and conjugate gradient methods using the unconstrained optimization problems in the CUTE library.
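The two conditions named in this abstract can be illustrated with a short sketch. It is not the paper's line search or its "approximate Wolfe" criterion. It only checks the standard (weak) Wolfe conditions and the 7/8 descent bound for a given step. The function names, the constants `c1` and `c2`, and the quadratic test function are assumptions chosen for illustration.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def satisfies_wolfe(f, grad, x, d, alpha, c1=1e-4, c2=0.9):
    """Check the standard (weak) Wolfe conditions for step length alpha.

    c1 and c2 are conventional illustrative defaults, not values from the paper.
    """
    slope0 = dot(grad(x), d)  # directional derivative at x along d
    x_new = [xi + alpha * di for xi, di in zip(x, d)]
    # Sufficient decrease (Armijo) condition.
    sufficient_decrease = f(x_new) <= f(x) + c1 * alpha * slope0
    # Curvature condition.
    curvature = dot(grad(x_new), d) >= c2 * slope0
    return sufficient_decrease and curvature


def satisfies_descent(g, d, c=7.0 / 8.0):
    """Check the descent condition g^T d <= -c * ||g||^2 (c = 7/8 in the abstract)."""
    return dot(g, d) <= -c * dot(g, g)


# Hypothetical test problem: f(x) = x1^2 + x2^2 with gradient 2x.
def f(x):
    return sum(xi * xi for xi in x)


def grad(x):
    return [2.0 * xi for xi in x]


x0 = [1.0, 1.0]
d0 = [-gi for gi in grad(x0)]  # steepest-descent direction

ok_half = satisfies_wolfe(f, grad, x0, d0, 0.5)   # step lands on the minimizer
ok_full = satisfies_wolfe(f, grad, x0, d0, 1.0)   # overshoots, no decrease
ok_tiny = satisfies_wolfe(f, grad, x0, d0, 1e-6)  # too short, fails curvature
ok_descent = satisfies_descent(grad(x0), d0)
```

On this quadratic, a step of 0.5 passes both Wolfe conditions. A step of 1.0 fails sufficient decrease, and a very small step fails the curvature condition. That is the behavior a Wolfe line search relies on to reject steps that are too long or too short.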
A reliable effective terascale linear learning system
2011
We present a system and a set of techniques for learning linear predictors with convex losses on terascale data sets, with trillions of features, billions of training examples and millions of parameters in an hour using a cluster of 1000 machines. Individually none of the component techniques are new, but the careful synthesis required to obtain an efficient implementation is. The result is, to the best of our knowledge, the most scalable and efficient linear learning system reported in the literature. We describe and thoroughly evaluate the components of the system, showing the importance of the various design choices.
The Maximum Likelihood Ensemble Filter as a . . .
2008
The Maximum Likelihood Ensemble Filter (MLEF) equations are derived without the differentiability requirement for the prediction model and for the observation operators. Derivation reveals that a new non-differentiable minimization method can be defined as a generalization of the gradient-based unconstrained methods, such as the preconditioned conjugate-gradient and quasi-Newton methods. In the new minimization algorithm the vector of first order increments of the cost function is defined as a generalized gradient, while the symmetric matrix of second order increments of the cost function is defined as a generalized Hessian matrix. In the case of differentiable observation operators, the minimization algorithm reduces to the standard gradient-based form. The non-differentiable aspect of the MLEF algorithm is illustrated in an example with a one-dimensional Burgers model and simulated observations. The MLEF algorithm has a robust performance, producing satisfactory results for tested non-differentiable observation operators.