## A tutorial on energy-based learning (2006)

Venue: | Predicting Structured Data |

Citations: | 42 - 6 self |

### BibTeX

```bibtex
@INPROCEEDINGS{Lecun06atutorial,
  author    = {Yann LeCun and Sumit Chopra and Raia Hadsell and Fu Jie Huang},
  editor    = {G. Bakir and T. Hofmann and B. Schölkopf and A. Smola and B. Taskar},
  title     = {A tutorial on energy-based learning},
  booktitle = {Predicting Structured Data},
  year      = {2006},
  publisher = {MIT Press}
}
```

### Abstract

Energy-Based Models (EBMs) capture dependencies between variables by associating a scalar energy to each configuration of the variables. Inference consists in clamping the value of observed variables and finding configurations of the remaining variables that minimize the energy. Learning consists in finding an energy function in which observed configurations of the variables are given lower energies than unobserved ones. The EBM approach provides a common theoretical framework for many learning models, including traditional discriminative and generative approaches, as well as graph-transformer networks, conditional random fields, maximum margin Markov networks, and several manifold learning methods. Probabilistic models must be properly normalized, which sometimes requires evaluating intractable integrals over the space of all possible variable configurations. Since EBMs have no requirement for proper normalization, this problem is naturally circumvented. EBMs can be viewed as a form of non-probabilistic factor graphs, and they provide considerably more flexibility in the design of architectures and training criteria than probabilistic approaches.
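The inference and learning loop the abstract describes can be sketched in a few lines. This is an illustrative toy, not code from the tutorial: the linear energy function, the perceptron-style update rule, and the synthetic argmax-labeling task are all assumptions made for the example.

```python
import numpy as np

# Toy energy-based model (illustrative assumption, not from the tutorial):
# a linear energy E(W, X, Y) = -W[Y] . X over a small discrete label set.
rng = np.random.default_rng(0)
n_labels = 3
W = rng.normal(scale=0.1, size=(n_labels, n_labels))

def energy(W, x, y):
    """Scalar energy of the configuration (x, y); lower means more compatible."""
    return -W[y] @ x

def infer(W, x):
    """Inference: clamp the observed X and minimize the energy over Y."""
    return min(range(n_labels), key=lambda y: energy(W, x, y))

def update(W, x, y_obs, lr=0.1):
    """Learning: give the observed configuration lower energy than the
    current (wrong) energy minimizer -- a perceptron-style EBM update."""
    y_hat = infer(W, x)
    if y_hat != y_obs:
        W[y_obs] += lr * x  # push down energy of the observed answer
        W[y_hat] -= lr * x  # push up energy of the offending answer
    return W

# Synthetic task (assumed for the demo): the label is the index of the
# largest feature, which this linear energy can represent exactly (W = I).
X = rng.normal(size=(200, n_labels))
Y = X.argmax(axis=1)
for _ in range(5):  # a few passes suffice on this separable toy set
    for x, y in zip(X, Y):
        W = update(W, x, y)

accuracy = float(np.mean([infer(W, x) == y for x, y in zip(X, Y)]))
```

Note that nothing here requires normalizing the energies into probabilities; the loss only cares that observed configurations end up below competing ones, which is the flexibility the abstract points to.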

### Citations

8973 | Statistical Learning Theory - Vapnik - 1998 |

2309 | Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data - Lafferty, McCallum, et al. - 2001 |

1166 | Factor graphs and the sum-product algorithm - Kschischang, Frey, et al. - 2001 |

1156 | Information Theory, Inference, and Learning Algorithms - MacKay - 2005 |

732 | Gradient-based learning applied to document recognition - LeCun, Bottou, et al. - 1998 |

508 | Training products of experts by minimizing contrastive divergence - Hinton - 2002 |

488 | Discriminative training methods for hidden Markov models: theory and experiments with perceptron algorithms - Collins |

439 | Maximum entropy markov models for information extraction and segmentation - McCallum, Freitag, et al. - 2000 |

436 | Max-margin Markov networks - Taskar, Guestrin, et al. - 2003 |

413 | Constructing Free-Energy Approximations and Generalized Belief Propagation Algorithms - Yedidia, Freeman, et al. |

303 | Finite-State Transducers in Language and Speech Processing - Mohri - 1997 |

269 | Discriminative reranking for natural language parsing - Collins, Koo - 2005 |

163 | Maximum Mutual Information Estimation of Hidden Markov Model Parameters for Speech Recognition - Souza, et al. - 1986 |

146 | A Neural Probabilistic Language Model - Bengio, Ducharme, et al. |

129 | Minimum classification error rate methods for speech recognition - Juang, Chou, et al. - 1997 |

125 | Efficient backprop - LeCun, Bottou, et al. - 1998 |

108 | An input output HMM architecture - Bengio, Frasconi - 1995 |

108 | Discriminative fields for modeling spatial dependencies in natural images - Kumar, Hebert |

97 | Improving the Convergence of Back-Propagation Learning with Second Order Methods - Becker, LeCun - 1988 |

95 | Learning a similarity metric discriminatively, with application to face verification - Chopra, Hadsell, et al. - 2005 |

95 | Accelerated training of conditional random fields with stochastic gradient methods - Vishwanathan, Schraudolph, et al. - 2006 |

69 | Global optimization of a neural network-hidden Markov model hybrid - Bengio, Mori, et al. - 1992 |

63 | Stochastic learning - Bottou - 2004 |

56 | Neural Networks for Speech and Sequence Recognition - Bengio - 1996 |

56 | Synergistic Face Detection and Pose Estimation with Energy-Based Models - Osadchy, LeCun, et al. - 2007 |

56 | Accelerated learning in layered neural networks - Solla, Levin, et al. - 1988 |

54 | Une Approche théorique de l'Apprentissage Connexionniste: Applications à la Reconnaissance de la Parole [A Theoretical Approach to Connectionist Learning: Applications to Speech Recognition] - Bottou - 1991 |

52 | Trading convexity for scalability - Collobert, Sinz, et al. |

51 | Energy-based models for sparse overcomplete representations - Teh, Welling, et al. - 2003 |

46 | Continuous speech recognition: An introduction to the hybrid hmm/connectionist approach - Bourlard, Morgan - 1995 |

43 | Lerec: A NN/HMM hybrid for on-line handwriting recognition - Bengio, LeCun, et al. - 1995 |

41 | Dimensionality reduction by learning an invariant mapping - Hadsell, Chopra, et al. - 2006 |

41 | REMAP: Recursive Estimation and Maximization of A Posteriori Probabilities. Application to Transition-Based Connectionist Speech Recognition”, Internal report of ICSI - Bourlard, Konig, et al. - 1995 |

33 | Integrating Time Alignment and Neural Networks for High Performance Continuous Speech Recognition - HAFFNER, FRANZINI, et al. - 1991 |

32 | Investigating loss functions and optimization methods for discriminative learning of label sequences - Altun, Johnson, et al. - 2003 |

32 | Globally trained handwritten word recognizer using spatial representation, space displacement neural networks and hidden Markov models - Bengio, LeCun, et al. - 1994 |

31 | Signature verification using a ”siamese” time delay neural network - Bromley, Bentz, et al. - 1993 |

30 | Connectionist Viterbi training: a new hybrid method for continuous speech recognition - Franzini, Lee, et al. - 1990 |

30 | Loss functions for discriminative training of energy-based models - LeCun, Huang - 2005 |

27 | Discriminative Training for Speech Recognition - McDermott - 1997 |

23 | Large-scale learning with svm and convolutional nets for generic object categorization - Huang, LeCun - 2006 |

22 | Multi-state time delay neural networks for continuous speech recognition - Haffner, Waibel - 1991 |

21 | Estimation of Hidden Markov Model Parameters by Minimizing Empirical Error Rate - Ephraim, Rabiner - 1990 |

18 | A continuous speech recognition system embedding MLP into HMM - Bourlard, Morgan - 1990 |

15 | Toward Automatic Phenotyping of Developing Embryos from Videos - Ning, Delhomme, et al. - 2005 |

13 | Speaker-Independent Word Recognition using Dynamic Programming Neural Networks - Sakoe, Isotani, et al. - 1989 |

11 | Global training of document processing systems using graph transformer networks - Bottou, Bengio, et al. - 1997 |

10 | Large margin methods for label sequence learning - Altun, Hofmann - 2003 |

9 | Word-level training of a handwritten word recognizer based on convolutional neural networks - LeCun, Bengio - 1994 |

9 | Reading checks with graph transformer networks - LeCun, Bottou, et al. - 1997 |