Results 1–10 of 59
Conditional random fields: Probabilistic models for segmenting and labeling sequence data
2001
Cited by 3396 (88 self)
We present conditional random fields, a framework for building probabilistic models to segment and label sequence data. Conditional random fields offer several advantages over hidden Markov models and stochastic grammars for such tasks, including the ability to relax strong independence assumptions made in those models. Conditional random fields also avoid a fundamental limitation of maximum entropy Markov models (MEMMs) and other discriminative Markov models based on directed graphical models, which can be biased towards states with few successor states. We present iterative parameter estimation algorithms for conditional random fields and compare the performance of the resulting models to HMMs and MEMMs on synthetic and natural-language data.
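The global normalization that lets a CRF avoid the bias toward low-fanout states can be sketched in a few lines: a label sequence's probability is its unnormalized score minus log Z(x), where the partition function Z(x) sums over all label sequences via a forward recursion. The emission and transition scores below are hand-set toy values, not anything from the paper.

```python
import math

def sequence_score(emit, trans, labels):
    """Unnormalized log-score of one label sequence."""
    s = emit[0][labels[0]]
    for t in range(1, len(labels)):
        s += trans[labels[t - 1]][labels[t]] + emit[t][labels[t]]
    return s

def log_partition(emit, trans, n_states):
    """Forward recursion computing log Z(x) over all label sequences."""
    alpha = [emit[0][j] for j in range(n_states)]
    for t in range(1, len(emit)):
        alpha = [
            math.log(sum(math.exp(alpha[i] + trans[i][j])
                         for i in range(n_states))) + emit[t][j]
            for j in range(n_states)
        ]
    return math.log(sum(math.exp(a) for a in alpha))

# Toy 2-state, length-3 problem (hypothetical scores).
emit = [[1.0, 0.2], [0.3, 1.5], [0.8, 0.1]]   # per-position label scores
trans = [[0.5, -0.2], [-0.1, 0.4]]            # label-transition scores
logZ = log_partition(emit, trans, 2)
logp = sequence_score(emit, trans, [0, 1, 0]) - logZ  # log P(y | x)
```

Because Z(x) is computed over whole sequences rather than per state, the exponentiated scores of all eight label sequences sum to exactly one.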
Gradient-based learning applied to document recognition
Proceedings of the IEEE, 1998
Cited by 1468 (84 self)
Multilayer neural networks trained with the backpropagation algorithm constitute the best example of a successful gradient-based learning technique. Given an appropriate network architecture, gradient-based learning algorithms can be used to synthesize a complex decision surface that can classify high-dimensional patterns, such as handwritten characters, with minimal preprocessing. This paper reviews various methods applied to handwritten character recognition and compares them on a standard handwritten digit recognition task. Convolutional neural networks, which are specifically designed to deal with the variability of two-dimensional (2D) shapes, are shown to outperform all other techniques. Real-life document recognition systems are composed of multiple modules including field extraction, segmentation, recognition, and language modeling. A new learning paradigm, called graph transformer networks (GTNs), allows such multi-module systems to be trained globally using gradient-based methods so as to minimize an overall performance measure. Two systems for online handwriting recognition are described. Experiments demonstrate the advantage of global training, and the flexibility of graph transformer networks. A graph transformer network for reading a bank check is also described. It uses convolutional neural network character recognizers combined with global training techniques to provide record accuracy on business and personal checks. It is deployed commercially and reads several million checks per day.
Markovian Models for Sequential Data
1996
Cited by 117 (2 self)
Hidden Markov Models (HMMs) are statistical models of sequential data that have been used successfully in many machine learning applications, especially for speech recognition. Furthermore, in the last few years, many new and promising probabilistic models related to HMMs have been proposed. We first summarize the basics of HMMs, and then review several recent related learning algorithms and extensions of HMMs, including in particular hybrids of HMMs with artificial neural networks, Input-Output HMMs (which are conditional HMMs using neural networks to compute probabilities), weighted transducers, variable-length Markov models and Markov switching state-space models. Finally, we discuss some of the challenges of future research in this very active area.
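The basic HMM computation the survey builds on is the forward algorithm: the probability of an observation sequence is obtained by summing over all hidden-state paths in time linear in the sequence length. A minimal sketch, with hypothetical hand-set parameters for a 2-state, 2-symbol model:

```python
def forward(pi, A, B, obs):
    """P(obs) for an HMM.

    pi[i]  : initial state probabilities
    A[i][j]: transition probability i -> j
    B[i][o]: probability of emitting symbol o in state i
    """
    n = len(pi)
    # alpha[j] = P(observations so far, current state = j)
    alpha = [pi[j] * B[j][obs[0]] for j in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    return sum(alpha)

# Hypothetical toy parameters.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
p = forward(pi, A, B, [0, 1, 0])
```

The recursion gives the same answer as brute-force enumeration of all state paths, but in O(T·n²) instead of O(nᵀ).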
Convergence Properties of the K-Means Algorithms
Cited by 107 (2 self)
This paper studies the convergence properties of the well-known K-Means clustering algorithm. The K-Means algorithm can be described either as a gradient descent algorithm or by slightly extending the mathematics of the EM algorithm to this hard-threshold case. We show that the K-Means algorithm actually minimizes the quantization error using the very fast Newton algorithm.
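The two views in the abstract correspond to the two steps of the standard Lloyd iteration: a hard assignment step (EM with a hard threshold) and a centroid update, which is exactly the Newton step on the within-cluster quantization error. A minimal 1-D sketch with hypothetical toy data:

```python
def kmeans(xs, centers, iters=20):
    """Lloyd's K-Means on 1-D points; returns the final centers."""
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center
        # (the hard-threshold version of the EM E-step).
        clusters = [[] for _ in centers]
        for x in xs:
            j = min(range(len(centers)), key=lambda k: (x - centers[k]) ** 2)
            clusters[j].append(x)
        # Update step: the centroid is the Newton step on the
        # quantization error sum((x - c)^2) for that cluster.
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return centers

# Hypothetical toy data with two obvious groups.
centers = kmeans([1.0, 1.2, 0.8, 5.0, 5.2, 4.8], [0.0, 6.0])
```

On this toy data the iteration settles on the two group means after a single pass.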
Trading convexity for scalability
ICML '06, 23rd International Conference on Machine Learning, 2006
Cited by 88 (3 self)
Convex learning algorithms, such as Support Vector Machines (SVMs), are often seen as highly desirable because they offer strong practical properties and are amenable to theoretical analysis. However, in this work we show how non-convexity can provide scalability advantages over convexity. We show how concave-convex programming can be applied to produce (i) faster SVMs where training errors are no longer support vectors, and (ii) much faster Transductive SVMs.
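Concave-convex programming (CCCP) works by splitting the objective into a convex part plus a concave part, repeatedly linearizing the concave part at the current point, and minimizing the resulting convex surrogate; each step provably decreases the objective. The sketch below applies it to a hypothetical 1-D toy objective, not to the paper's SVM losses: f(x) = x⁴/4 (convex) − (x − 1)² (concave).

```python
import math

def f(x):
    # convex part x**4/4 plus concave part -(x - 1)**2
    return x ** 4 / 4.0 - (x - 1.0) ** 2

def cccp(x, iters=50):
    """Concave-convex procedure on f above."""
    for _ in range(iters):
        # Linearize the concave part at x: its gradient is -2*(x - 1).
        slope = -2.0 * (x - 1.0)
        # Minimize the convex surrogate x**4/4 + slope*x in closed form:
        # setting x**3 + slope = 0 gives x = cbrt(-slope).
        x = math.copysign(abs(slope) ** (1.0 / 3.0), -slope)
    return x

x_star = cccp(0.0)
```

The iterate converges to a stationary point of f (a root of x³ − 2x + 2 = 0), and f never increases along the way, which is the monotone-descent guarantee CCCP provides.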
Guided Learning for Bidirectional Sequence Classification
2007
Cited by 77 (2 self)
In this paper, we propose guided learning, a new learning framework for bidirectional sequence classification. The tasks of learning the order of inference and training the local classifier are dynamically incorporated into a single Perceptron-like learning algorithm. We apply this novel learning algorithm to POS tagging. It obtains an error rate of 2.67% on the standard PTB test set, which represents a 3.3% relative error reduction over the previous best result on the same data set, while using fewer features.
Conditional Structure versus Conditional Estimation in NLP Models
In EMNLP 2002
Cited by 63 (2 self)
This paper separates conditional parameter estimation, which consistently raises test set accuracy on statistical NLP tasks, from conditional model structures, such as the conditional Markov model used for maximum-entropy tagging, which tend to lower accuracy. Error analysis on part-of-speech tagging shows that the actual tagging errors made by the conditionally structured model derive not only from label bias, but also from other ways in which the independence assumptions of the conditional model structure are unsuited to linguistic sequences. The paper presents new word-sense disambiguation and POS tagging experiments, and integrates apparently conflicting reports from other recent work.
A tutorial on energy-based learning
Predicting Structured Data, 2006
Cited by 55 (6 self)
Energy-Based Models (EBMs) capture dependencies between variables by associating a scalar energy to each configuration of the variables. Inference consists in clamping the value of observed variables and finding configurations of the remaining variables that minimize the energy. Learning consists in finding an energy function in which observed configurations of the variables are given lower energies than unobserved ones. The EBM approach provides a common theoretical framework for many learning models, including traditional discriminative and generative approaches, as well as graph transformer networks, conditional random fields, maximum margin Markov networks, and several manifold learning methods. Probabilistic models must be properly normalized, which sometimes requires evaluating intractable integrals over the space of all possible variable configurations. Since EBMs have no requirement for proper normalization, this problem is naturally circumvented. EBMs can be viewed as a form of non-probabilistic factor graphs, and they provide considerably more flexibility in the design of architectures and training criteria than probabilistic approaches.
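The inference procedure described in the abstract (clamp the observed variables, minimize energy over the rest) can be shown with a deliberately tiny, hypothetical energy table over two binary variables; note that no partition function is ever computed:

```python
def energy(x, y):
    """Hand-set toy energy: prefers y == x, with a slight bias toward y = 0."""
    return (x - y) ** 2 + 0.5 * y

def infer(x, y_domain=(0, 1)):
    """Clamp observed x and pick the free variable y minimizing the energy."""
    return min(y_domain, key=lambda y: energy(x, y))

prediction = infer(1)  # y that makes (x=1, y) lowest-energy
```

A probabilistic treatment of the same model would require normalizing exp(−E) over all configurations; the argmin sidesteps that entirely.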
Stochastic gradient learning in neural networks
In Proceedings of Neuro-Nîmes. EC2, 1991
Cited by 50 (1 self)
Many connectionist learning algorithms consist of minimizing a cost of the form C(w) = E(J(z,w)) = ∫ J(z,w) dP(z), where dP is an unknown probability distribution that characterizes the problem to learn, and J, the loss function, defines the learning system itself. This popular statistical formulation has led to many theoretical results. The minimization of such a cost may be achieved with a stochastic gradient descent algorithm, e.g.: w_{t+1} = w_t − ε_t ∇_w J(z, w_t). With some restrictions on J and C, this algorithm converges, even if J is non-differentiable on a set of measure zero. Links with simulated annealing are depicted.
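The update w_{t+1} = w_t − ε_t ∇_w J(z, w_t) can be made concrete with a toy choice of loss (an assumption, not the paper's): J(z, w) = ½(w − z)², whose minimizer in expectation is the mean of the data distribution. With step sizes ε_t = 1/t, the iterate is then exactly the running average of the samples seen so far.

```python
def sgd(samples, w=0.0):
    """Stochastic gradient descent on J(z, w) = 0.5*(w - z)**2."""
    for t, z in enumerate(samples, start=1):
        eps = 1.0 / t          # decreasing step sizes
        grad = w - z           # grad_w J(z, w) for this sample
        w -= eps * grad
    return w

# With eps_t = 1/t and this quadratic loss, w is the exact running mean.
w_final = sgd([1.0, 2.0, 3.0])
```

More general losses lose this closed form, but the same schedule (steps summing to infinity with summable squares) is the classical condition under which such iterations converge.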
An Alternate Objective Function for Markovian Fields
2002
Cited by 37 (0 self)
In labelling or prediction tasks, a trained model's test performance is often based on the quality of its single-time marginal distributions over labels rather than its joint distribution over label sequences. We propose using a new cost function for discriminative learning that more accurately reflects such test-time conditions. We present an efficient method to compute the gradient of this cost for Maximum Entropy Markov Models, Conditional Random Fields, and for an extension of these models involving hidden states. Our experimental results show that the new cost can give significant improvements and that it provides a novel and effective way of dealing with the 'label-bias' problem.