## Accelerated training of conditional random fields with stochastic gradient methods (2006)

### Download Links

- [www.cs.ubc.ca]
- [www.ai.mit.edu]
- [people.cs.ubc.ca]
- [cnl.salk.edu]
- [imls.engr.oregonstate.edu]
- [www.icml2006.org]
- DBLP

### Other Repositories/Bibliography

Venue: ICML

Citations: 101 (4 self)

### BibTeX

@INPROCEEDINGS{Vishwanathan06acceleratedtraining,
  author    = {S. V. N. Vishwanathan and Nicol N. Schraudolph and Mark W. Schmidt and Kevin P. Murphy},
  title     = {Accelerated training of conditional random fields with stochastic gradient methods},
  booktitle = {Proceedings of the 23rd International Conference on Machine Learning (ICML)},
  year      = {2006},
  pages     = {969--976}
}

### Abstract

We apply Stochastic Meta-Descent (SMD), a stochastic gradient optimization method with gain vector adaptation, to the training of Conditional Random Fields (CRFs). On several large data sets, the resulting optimizer converges to the same quality of solution over an order of magnitude faster than limited-memory BFGS, the leading method reported to date. We report results for both exact and inexact inference techniques.
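As the abstract notes, SMD adapts a per-parameter gain vector using second-order information. The sketch below illustrates the general shape of such a gain-adapted gradient loop on a simple deterministic quadratic; the update rules follow common presentations of SMD, but the constants (`eta0`, `mu`, `lam`) and the exact sign conventions are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def smd_quadratic(A, theta0, eta0=0.02, mu=0.01, lam=0.95, steps=300):
    """Minimize f(theta) = 0.5 * theta^T A theta with an SMD-style loop.

    g = A @ theta is the gradient; the Hessian-vector product H @ v is
    simply A @ v here, standing in for an implicit (Pearlmutter-style)
    computation in a real CRF trainer.
    """
    theta = np.asarray(theta0, dtype=float).copy()
    eta = np.full_like(theta, eta0)      # per-parameter gains
    v = np.zeros_like(theta)             # v tracks d(theta)/d(log eta)
    for _ in range(steps):
        g = A @ theta
        # multiplicative gain adaptation, clipped below at 1/2
        eta *= np.maximum(0.5, 1.0 - mu * g * v)
        theta -= eta * g                 # gradient step with adapted gains
        Hv = A @ v                       # Hessian-vector product
        v = lam * v - eta * (g + lam * Hv)
    return theta
```

In a CRF the gradient would come from (possibly approximate) inference on a sampled batch, and `Hv` from an automatic-differentiation R-operator rather than an explicit matrix.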

### Citations

2456 | Conditional random fields: Probabilistic models for segmenting and labeling sequence data
- Lafferty, McCallum, et al.
- 2001
Citation Context: ...eported to date. We report results for both exact and inexact inference techniques. 1. Introduction Conditional Random Fields (CRFs) have recently gained popularity in the machine learning community (Lafferty et al., 2001; Sha & Pereira, 2003; Kumar & Hebert, 2004). Current training methods for CRFs include generalized iterative scaling (GIS), conjugate gradient (CG), and limited-memory BFGS. These are all batch-onl...

1457 | Fast approximate energy minimization via graph cuts
- Boykov, Veksler, et al.
Citation Context: ...od than the MF free energy (Weiss, 2001). Although LBP can sometimes oscillate, convergent versions have been developed (e.g., Kolmogorov, 2004). For some kinds of potentials, one can use graph cuts (Boykov et al., 2001) to find an approximate MAP estimate of the labels, which can be used inside a Viterbi training procedure. However, this produces a very discontinuous estimate of the gradient (though one could presu...

962 | On the statistical analysis of dirty pictures - Besag - 1986

510 | Discriminative training methods for hidden Markov models: Theory and experiments with the perceptron algorithm - Collins - 2002

463 | Shallow parsing with conditional random fields
- Sha, Pereira
Citation Context: ...ort results for both exact and inexact inference techniques. 1. Introduction Conditional Random Fields (CRFs) have recently gained popularity in the machine learning community (Lafferty et al., 2001; Sha & Pereira, 2003; Kumar & Hebert, 2004). Current training methods for CRFs include generalized iterative scaling (GIS), conjugate gradient (CG), and limited-memory BFGS. These are all batch-only algorithms that do ...

399 | A separator theorem for planar graphs
- Lipton, Tarjan
- 1979
Citation Context: ...tly. Unfortunately for many CRFs the treewidth is too large for exact inference (and hence exact gradient computation) to be tractable. The treewidth of an N = k × k grid, for instance, is w = O(2k) (Lipton & Tarjan, 1979), so exact inference takes O(|Y|^{2k}) time. Various approximate inference methods have been used in parameter learning algorithms (Parise & Welling, 2005). Here we consider two of the simplest: mea...

325 | Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation
- Griewank
Citation Context: ...s by using second-order information to adapt the gradient step sizes (Schraudolph, 1999, 2002). Key to SMD’s efficiency is the implicit computation of fast Hessian-vector products (Pearlmutter, 1994; Griewank, 2000). In this paper we marry the above two techniques and show how SMD can be used to significantly accelerate the training of CRFs. The rest of the paper is organized as follows: Section 2 gives a brief...

323 | Convergent tree-reweighted message passing for energy minimization
- Kolmogorov
Citation Context: ...t has been found empirically to often better approximate the log-likelihood than the MF free energy (Weiss, 2001). Although LBP can sometimes oscillate, convergent versions have been developed (e.g., Kolmogorov, 2004). For some kinds of potentials, one can use graph cuts (Boykov et al., 2001) to find an approximate MAP estimate of the labels, which can be used inside a Viterbi training procedure. However, this pr...

217 | Image Analysis, Random Fields and Dynamic Monte Carlo Methods
- Winkler
- 1995
Citation Context: ...y a good approximation to the likelihood, as the amount of training data (or the size of the lattice, when using tied parameters) tends to infinity, its maximum coincides with that of the likelihood (Winkler, 1995). Note that pseudo-likelihood estimates the parameters conditional on i’s neighbors being observed. As a consequence, PL tends to place too much emphasis on the edge potentials, and not enough on the...

173 | Information and Exponential Families in Statistical Theory
- Barndorff-Nielsen
- 1978
Citation Context: ...) Here φ(x, y) is called the sufficient statistics of the distribution, ⟨·, ·⟩ denotes the inner product, and z(·) the log-partition function z(θ|x) := ln Σ_y exp(⟨φ(x, y), θ⟩) (2). It is well-known (Barndorff-Nielsen, 1978) that the log-partition function is a C∞ convex function. Furthermore, it is also the cumulant generating function of the exponential family, i.e., ∂/∂θ z(θ|x) = E_{p(y|x;θ)}[φ(x, y)] (3), ∂²/(∂θ)² z...
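The snippet above quotes the standard exponential-family identity that the gradient of the log-partition function equals the expected sufficient statistics (equation 3 in the paper). For a small discrete label set this is easy to verify numerically; the sketch below uses random placeholder features `phi`, not anything from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
Y, D = 5, 3                      # |Y| labels, D-dimensional sufficient statistics
phi = rng.normal(size=(Y, D))    # phi(x, y) for each label y (x held fixed)
theta = rng.normal(size=D)

def log_partition(theta):
    """z(theta|x) = ln sum_y exp(<phi(x,y), theta>), via a stable logsumexp."""
    s = phi @ theta
    m = s.max()
    return m + np.log(np.exp(s - m).sum())

# analytic gradient: expected sufficient statistics under p(y|x; theta)
s = phi @ theta
p = np.exp(s - log_partition(theta))     # normalized label distribution
grad_analytic = p @ phi

# central finite-difference gradient of z for comparison
eps = 1e-6
grad_fd = np.array([
    (log_partition(theta + eps * e) - log_partition(theta - eps * e)) / (2 * eps)
    for e in np.eye(D)
])
```

The second derivative (the covariance of the sufficient statistics) can be checked the same way, which is what makes z a cumulant generating function.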

158 | Interactive image segmentation using an adaptive GMMRF model - Blake, Rother, et al.

149 | Overview of BioCreAtIvE: Critical assessment of information extraction for biology
- Hirschman
Citation Context: ... a far slower rate. We also obtained comparable results (not reported here) with a similar setup on the first BioCreAtIvE (Critical Assessment of Information Extraction in Biology) challenge task 1A (Hirschman et al., 2005). 5. Experiments on 2D Lattice CRFs For the 2D CRF experiments we compare four optimization algorithms: SGD, SMD, BFGS as implemented in Matlab’s fminunc function (with ‘largeScale’ set to ‘off’), an...

135 | Introduction to the CoNLL-2000 shared task: Chunking
- Sang, Buchholz
Citation Context: ...ndom permutations of the data substantially identical results to those reported below. 4.1. CoNLL-2000 Base NP Chunking Task Our first experiment uses the well-known CoNLL-2000 Base NP chunking task (Sang & Buchholz, 2000). Text 2 Available under LGPL from http://chasen.org/~taku/software/CRF++/. Our modified code, as well as the data sets, configuration files, and results for all experiments reported here will be a...

108 | Discriminative Fields for Modeling Spatial Dependencies in Natural Images
- Kumar, Hebert
- 2003
Citation Context: ...exact and inexact inference techniques. 1. Introduction Conditional Random Fields (CRFs) have recently gained popularity in the machine learning community (Lafferty et al., 2001; Sha & Pereira, 2003; Kumar & Hebert, 2004). Current training methods for CRFs include generalized iterative scaling (GIS), conjugate gradient (CG), and limited-memory BFGS. These are all batch-only algorithms that do not work well in an on...

82 | Introduction to the bio-entity recognition task at JNLPBA
- Kim, Ohta, et al.
- 2004
Citation Context: ... 92.7% from a peak of 92.9% reached earlier. 4.2. BioNLP/NLPBA-2004 Shared Task Our second experiment uses the BioNLP/NLPBA-2004 shared task of biomedical named-entity recognition on the GENIA corpus (Kim et al., 2004). Named-entity recognition aims to identify and classify technical terms in a given domain (here: molecular biology) that refer to concepts of interest to domain experts (Kim et al., 2004). Following ...

72 | Fast exact multiplication by the Hessian
- Pearlmutter
- 1994
Citation Context: ...elerate this process by using second-order information to adapt the gradient step sizes (Schraudolph, 1999, 2002). Key to SMD’s efficiency is the implicit computation of fast Hessian-vector products (Pearlmutter, 1994; Griewank, 2000). In this paper we marry the above two techniques and show how SMD can be used to significantly accelerate the training of CRFs. The rest of the paper is organized as follows: Section...
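Pearlmutter's method computes an exact Hessian-vector product H v via automatic differentiation for roughly the cost of one extra gradient evaluation, without ever forming H. The sketch below shows the quantity involved on a small smooth objective with random placeholder data: it compares an analytically exact H v against a centered finite difference of the gradient, a common cheap stand-in when an R-operator is unavailable.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(20, 4))     # data rows a_i (placeholder data)
theta = rng.normal(size=4)
v = rng.normal(size=4)           # direction for the Hessian-vector product

def grad(theta):
    """Gradient of f(theta) = sum_i log(1 + exp(a_i . theta))."""
    s = 1.0 / (1.0 + np.exp(-(A @ theta)))   # sigmoid(a_i . theta)
    return A.T @ s

# exact Hessian-vector product: H = A^T diag(s * (1 - s)) A
s = 1.0 / (1.0 + np.exp(-(A @ theta)))
Hv_exact = A.T @ (s * (1 - s) * (A @ v))

# H v is the directional derivative of the gradient along v,
# approximated here by a centered finite difference
eps = 1e-6
Hv_fd = (grad(theta + eps * v) - grad(theta - eps * v)) / (2 * eps)
```

The finite-difference form makes clear why the product is cheap: it needs only gradient evaluations, never the d × d Hessian, which is what lets SMD scale to CRF-sized parameter vectors.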

60 | Local gain adaptation in stochastic gradient descent
- Schraudolph
- 1999
Citation Context: ...vergence to the optimum is often painfully slow. Gain adaptation methods like Stochastic Meta-Descent (SMD) accelerate this process by using second-order information to adapt the gradient step sizes (Schraudolph, 1999, 2002). Key to SMD’s efficiency is the implicit computation of fast Hessian-vector products (Pearlmutter, 1994; Griewank, 2000). In this paper we marry the above two techniques and show how SMD can b...

54 | Man-made structure detection in natural images using a causal multiscale random field - Kumar, Hebert

40 | Fast curvature matrix-vector products for second-order gradient descent - Schraudolph

30 | Comparing the Mean Field Method and Belief Propagation for Approximate Inference in MRFs
- Weiss
Citation Context: ...rious approximate inference methods have been used in parameter learning algorithms (Parise & Welling, 2005). Here we consider two of the simplest: mean field (MF) and loopy belief propagation (LBP) (Weiss, 2001; Yedidia et al., 2003). The MF free energy is a lower bound on the log-likelihood, and hence an upper bound on our negative log-likelihood objective. The Bethe free energy minimized by LBP is not a bo...

15 | Learning in Markov random fields: An empirical study
- Parise, Welling
- 2005
Citation Context: ... N = k × k grid, for instance, is w = O(2k) (Lipton & Tarjan, 1979), so exact inference takes O(|Y|^{2k}) time. Various approximate inference methods have been used in parameter learning algorithms (Parise & Welling, 2005). Here we consider two of the simplest: mean field (MF) and loopy belief propagation (LBP) (Weiss, 2001; Yedidia et al., 2003). The MF free energy is a lower bound on the log-likelihood, and hence an ...

12 | Step size adaptation in reproducing kernel Hilbert space - Vishwanathan, Schraudolph, et al. - 2006

10 | Combining Conjugate Direction Methods with Stochastic Approximation of Gradients
- Schraudolph, Graepel
Citation Context: ...d to be computationally most efficient. Unfortunately most advanced gradient methods do not tolerate the sampling noise inherent in stochastic approximation: it collapses conjugate search directions (Schraudolph & Graepel, 2003) and confuses the line searches that both conjugate gradient and quasi-Newton methods depend upon. Full second-order methods are unattractive here because the computational cost of inverting the Hessi...

2 | Online SVM with multiclass classification and SMD step size adaptation - Vishwanathan, Schraudolph, et al. - 2006

1 | Biomedical named entity recognition using conditional random fields and rich feature sets - Settles - 2004
