Shallow Parsing with Conditional Random Fields
, 2003
Cited by 473 (8 self)
Conditional random fields for sequence labeling offer advantages over both generative models like HMMs and classifiers applied at each sequence position. Among sequence labeling tasks in language processing, shallow parsing has received much attention, with the development of standard evaluation datasets and extensive comparison among methods. We show here how to train a conditional random field to achieve performance as good as any reported base noun-phrase chunking method on the CoNLL task, and better than any reported single model. Improved training methods based on modern optimization algorithms were critical in achieving these results. We present extensive comparisons between models and training methods that confirm and strengthen previous results on shallow parsing and training methods for maximum-entropy models.
Discriminative probabilistic models for relational data
, 2002
Cited by 360 (13 self)
In many supervised learning tasks, the entities to be labeled are related to each other in complex ways and their labels are not independent. For example, in hypertext classification, the labels of linked pages are highly correlated. A standard approach is to classify each entity independently, ignoring the correlations between them. Recently, Probabilistic Relational Models, a relational version of Bayesian networks, were used to define a joint probabilistic model for a collection of related entities. In this paper, we present an alternative framework that builds on (conditional) Markov networks and addresses two limitations of the previous approach. First, undirected models do not impose the acyclicity constraint that hinders representation of many important relational dependencies in directed models. Second, undirected models are well suited for discriminative training, where we optimize the conditional likelihood of the labels given the features, which generally improves classification accuracy. We show how to train these models effectively, and how to use approximate probabilistic inference over the learned model for collective classification of multiple related entities. We provide experimental results on a webpage classification task, showing that accuracy can be significantly improved by modeling relational dependencies.
A Comparison of Algorithms for Maximum Entropy Parameter Estimation
Cited by 239 (2 self)
Conditional maximum entropy (ME) models provide a general purpose machine learning technique which has been successfully applied to fields as diverse as computer vision and econometrics, and which is used for a wide variety of classification problems in natural language processing. However, the flexibility of ME models is not without cost. While parameter estimation for ME models is conceptually straightforward, in practice ME models for typical natural language tasks are very large, and may well contain many thousands of free parameters. In this paper, we consider a number of algorithms for estimating the parameters of ME models, including iterative scaling, gradient ascent, conjugate gradient, and variable metric methods. Surprisingly, the standardly used iterative scaling algorithms perform quite poorly in comparison to the others, and for all of the test problems, a limited-memory variable metric algorithm outperformed the other choices.
Hidden Markov Support Vector Machines
, 2003
Cited by 202 (8 self)
This paper presents a novel discriminative learning technique for label sequences based on a combination of the two most successful learning algorithms, Support Vector Machines and Hidden Markov Models, which we call the Hidden Markov Support Vector Machine.
Discriminative random fields: A discriminative framework for contextual interaction in classification
 In ICCV
, 2003
Efficient structure learning of Markov networks using L1-regularization
 In NIPS
, 2006
Cited by 106 (2 self)
Markov networks are widely used in a wide variety of applications, in problems ranging from computer vision, to natural language, to computational biology. In most current applications, even those that rely heavily on learned models, the structure of the Markov network is constructed by hand, due to the lack of effective algorithms for learning Markov network structure from data. In this paper, we provide a computationally effective method for learning Markov network structure from data. Our method is based on the use of L1 regularization on the weights of the log-linear model, which has the effect of biasing the model towards solutions where many of the parameters are zero. This formulation converts the Markov network learning problem into a convex optimization problem in a continuous space, which can be solved using efficient gradient methods. A key issue in this setting is the (unavoidable) use of approximate inference, which can lead to errors in the gradient computation when the network structure is dense. Thus, we explore the use of different feature introduction schemes and compare their performance. We provide results for our method on synthetic data, and on two real world data sets: modeling the joint distribution of pixel values in the MNIST data, and modeling the joint distribution of genetic sequence variations in the human HapMap data. We show that our L1-based method achieves considerably higher generalization performance than the more standard L2-based method (a Gaussian parameter prior) or pure maximum-likelihood learning. We also show that we can learn MRF network structure at a computational cost that is not much greater than learning parameters alone, demonstrating the existence of a feasible method for this important problem.
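The sparsity effect of L1 regularization mentioned in this abstract can be illustrated on a much simpler log-linear model. The sketch below is not the paper's method: it swaps in plain proximal gradient descent (ISTA) on an L1-penalized logistic regression, with no Markov network, approximate inference, or feature introduction scheme. The names `soft_threshold` and `fit_l1_logistic` and the parameters `lam` and `lr` are hypothetical and chosen only for illustration.

```python
import numpy as np


def soft_threshold(w, t):
    """Proximal operator of t * ||w||_1: shrink each weight toward zero by t."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)


def fit_l1_logistic(X, y, lam=0.05, lr=0.1, n_iter=500):
    """L1-penalized logistic regression via proximal gradient descent (ISTA).

    Illustrative assumption: a binary log-linear model stands in for the
    Markov network of the abstract. X is (n, d); y holds 0/1 labels.
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))   # predicted probabilities
        grad = X.T @ (p - y) / n             # gradient of mean negative log-likelihood
        # Gradient step on the smooth part, then shrinkage for the L1 penalty.
        w = soft_threshold(w - lr * grad, lr * lam)
    return w
```

The shrinkage step sets a weight to exactly zero whenever the gradient step leaves it within `lr * lam` of zero, which is why an L1 penalty yields sparse solutions while an L2 penalty (a Gaussian prior) only scales weights down.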
Structural extension to logistic regression: Discriminative parameter learning of belief net classifiers
 In Proceedings of the Eighteenth Annual National Conference on Artificial Intelligence (AAAI-02)
, 2002
Cited by 65 (9 self)
Bayesian belief nets (BNs) are often used for classification tasks, typically to return the most likely class label for each specified instance. Many BN-learners, however, attempt to find the BN that maximizes a different objective function (viz., likelihood, rather than classification accuracy), typically by first learning an appropriate graphical structure, then finding the parameters for that structure that maximize the likelihood of the data. As these parameters may not maximize the classification accuracy, “discriminative parameter learners” follow the alternative approach of seeking the parameters that maximize conditional likelihood (CL), over the distribution of instances the BN will have to classify. This paper first formally specifies this task, shows how it extends standard logistic regression, and analyzes its inherent sample and computational complexity. We then present a general algorithm for this task, ELR, that applies to arbitrary BN structures and that works effectively even when given incomplete training data. Unfortunately, ELR is not guaranteed to find the parameters that optimize conditional likelihood; moreover, even the optimal-CL parameters need not have minimal classification error. This paper therefore presents empirical evidence that ELR produces effective classifiers, often superior to the ones produced by the standard “generative” algorithms, especially in common situations where the given BN-structure is incorrect. Keywords: (Bayesian) belief nets, Logistic regression, Classification, PAC-learning, Computational/sample complexity
Word-Sense Disambiguation for Machine Translation
 In EMNLP
, 2005
Cited by 58 (0 self)
In word sense disambiguation, a system attempts to determine the sense of a word from contextual features. Major barriers to building a high-performing word sense disambiguation system include the difficulty of labeling data for this task and of predicting fine-grained sense distinctions. These issues stem partly from the fact that the task is being treated in isolation from possible uses of automatically disambiguated data. In this paper, we consider the related task of word translation, where we wish to determine the correct translation of a word from context. We can use parallel language corpora as a large supply of partially labeled data for this task. We present algorithms for solving the word translation problem and demonstrate a significant improvement over a baseline system. We then show that the word-translation system can be used to improve performance on a simplified machine-translation task and can effectively and accurately prune the set of candidate translations for a word.
A maximum entropy approach to species distribution modeling
 In Proceedings of the Twenty-First International Conference on Machine Learning
, 2004
Cited by 52 (7 self)
We study the problem of modeling species geographic distributions, a critical problem in conservation biology. We propose the use of maximum-entropy techniques for this problem, specifically, sequential-update algorithms that can handle a very large number of features. We describe experiments comparing maxent with a standard distribution-modeling tool, called GARP, on a dataset containing observation data for North American breeding birds. We also study how well maxent performs as a function of the number of training examples and training time, analyze the use of regularization to avoid overfitting when the number of examples is small, and explore the interpretability of models constructed using maxent.
Making Logistic Regression A Core Data Mining Tool: A Practical Investigation of Accuracy, Speed, and Simplicity
, 2004
Cited by 35 (0 self)
Binary classification is a core data mining task. For large datasets or real-time applications, desirable classifiers are accurate, fast, and need no parameter tuning. We present a simple implementation of logistic regression that meets these requirements. A combination of regularization, truncated Newton methods, and iteratively reweighted least squares makes it faster and more accurate than modern SVM implementations, and relatively insensitive to parameters. It is robust to linear dependencies and some scaling problems, making most data preprocessing unnecessary.
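As a rough illustration of the ingredients this abstract names, the sketch below fits an L2-regularized logistic regression by iteratively reweighted least squares (Newton's method). It is a minimal sketch under stated assumptions and not the authors' implementation: each Newton system is solved exactly with `numpy.linalg.solve`, whereas a truncated Newton method would solve it only approximately (for example with a few conjugate-gradient steps). The function name `fit_logistic_irls` and the parameters `l2`, `n_iter`, and `tol` are hypothetical.

```python
import numpy as np


def fit_logistic_irls(X, y, l2=1.0, n_iter=25, tol=1e-8):
    """L2-regularized logistic regression fit by IRLS / Newton steps.

    X is (n, d); y holds 0/1 labels. Assumption: an exact linear solve
    replaces the approximate (truncated) solve described in the abstract.
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))      # predicted probabilities
        s = p * (1.0 - p)                        # IRLS weights, one per example
        grad = X.T @ (p - y) + l2 * w            # gradient of penalized negative log-likelihood
        hess = X.T @ (X * s[:, None]) + l2 * np.eye(d)  # positive definite because l2 > 0
        step = np.linalg.solve(hess, grad)
        w = w - step
        if np.max(np.abs(step)) < tol:           # stop once the update is negligible
            break
    return w
```

The ridge term `l2 * np.eye(d)` keeps the Hessian invertible even when columns of `X` are linearly dependent, which is one plausible reading of the robustness to linear dependencies claimed in the abstract.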