Incorporating non-local information into information extraction systems by gibbs sampling
- In ACL, 2005
Cited by 192 (15 self)
Most current statistical natural language processing models use only local features so as to permit dynamic programming in inference, but this makes them unable to fully account for the long-distance structure that is prevalent in language use. We show how to solve this dilemma with Gibbs sampling, a simple Monte Carlo method used to perform approximate inference in factored probabilistic models. By using simulated annealing in place of Viterbi decoding in sequence models such as HMMs, CMMs, and CRFs, it is possible to incorporate non-local structure while preserving tractable inference. We use this technique to augment an existing CRF-based information extraction system with long-distance dependency models, enforcing label consistency and extraction template consistency constraints. This technique results in an error reduction of up to 9% over state-of-the-art systems on two established information extraction tasks.
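As a rough illustration of the idea in this abstract (not the authors' implementation), the sketch below runs Gibbs sampling with a cooling temperature over a toy sequence labeling, where a hypothetical non-local penalty discourages giving identical tokens different labels. The label set, the `local` score tables, the `penalty` value, and the linear cooling schedule are all assumptions made for the example.

```python
import math
import random

LABELS = ["O", "PER", "ORG"]  # hypothetical label set


def score(tokens, labels, local, penalty):
    """Log-score of a labeling: per-token local scores minus a non-local
    penalty for every pair of identical tokens with different labels."""
    s = sum(local[i][y] for i, y in enumerate(labels))
    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens)):
            if tokens[i] == tokens[j] and labels[i] != labels[j]:
                s -= penalty
    return s


def annealed_gibbs(tokens, local, penalty, sweeps=200, seed=0):
    """Resample one label at a time from its conditional distribution,
    lowering the temperature so that samples concentrate on high scores."""
    rng = random.Random(seed)
    labels = [rng.choice(LABELS) for _ in tokens]
    for t in range(sweeps):
        # Assumed schedule: linear cooling from 1.0, floored at 0.05.
        temp = max(0.05, 1.0 - t / sweeps)
        for i in range(len(tokens)):
            scores = []
            for y in LABELS:
                labels[i] = y
                scores.append(score(tokens, labels, local, penalty) / temp)
            m = max(scores)  # subtract the max for numerical stability
            weights = [math.exp(s - m) for s in scores]
            labels[i] = rng.choices(LABELS, weights=weights)[0]
    return labels
```

Here `local[i]` is a dict from label to score for token `i`. Recomputing the full score for each candidate label keeps the sketch short; a practical sampler would only recompute the terms that involve position `i`.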
Piecewise training of undirected models
- In Proc. of UAI, 2005
Cited by 55 (5 self)
For many large undirected models that arise in real-world applications, exact maximum-likelihood training is intractable, because it requires computing marginal distributions of the model. Conditional training is even more difficult, because the partition function depends not only on the parameters, but also on the observed input, requiring repeated inference over each training example. An appealing idea for such models is to independently train a local undirected classifier over each clique, afterwards combining the learned weights into a single global model. In this paper, we show that this piecewise method can be justified as minimizing a new family of upper bounds on the log partition function. On three natural-language data sets, piecewise training is more accurate than pseudolikelihood, and often performs comparably to global training using belief propagation.
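The upper-bound claim can be checked numerically in its simplest form: when every variable appears in at least one factor, the exact log partition function is no larger than the sum of the per-factor local log partition functions. The sketch below is an illustration under that assumption, with a hypothetical factor representation (a tuple of variable indices plus a table of log-potentials); it is not the family of bounds derived in the paper.

```python
import itertools
import math


def log_partition(n_vars, n_states, factors):
    """Exact log Z by brute-force enumeration over all joint assignments.

    factors is a list of (vars, table) pairs, where vars is a tuple of
    variable indices and table maps a tuple of states to a log-potential.
    """
    total = 0.0
    for x in itertools.product(range(n_states), repeat=n_vars):
        log_score = sum(table[tuple(x[v] for v in vs)] for vs, table in factors)
        total += math.exp(log_score)
    return math.log(total)


def piecewise_bound(factors):
    """Sum of local log partition functions, one per factor, each computed
    as if that factor were the whole model."""
    return sum(
        math.log(sum(math.exp(val) for val in table.values()))
        for _, table in factors
    )
```

The bound holds because the product of local sums expands into one positive term per combination of local assignments, and the globally consistent combinations are a subset of those.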
Dependency parsing by belief propagation
- In Proceedings of EMNLP, 2008
Cited by 47 (7 self)
We formulate dependency parsing as a graphical model with the novel ingredient of global constraints. We show how to apply loopy belief propagation (BP), a simple and effective tool for approximate learning and inference. As a parsing algorithm, BP is both asymptotically and empirically efficient. Even with second-order features or latent variables, which would make exact parsing considerably slower or NP-hard, BP needs only O(n^3) time with a small constant factor. Furthermore, such features significantly improve parse accuracy over exact first-order methods. Incorporating additional features would increase the runtime additively rather than multiplicatively.
First-order probabilistic models for coreference resolution
- In Proceedings of HLT-NAACL 2007, 2007
Cited by 40 (14 self)
Traditional noun phrase coreference resolution systems represent features only of pairs of noun phrases. In this paper, we propose a machine learning method that enables features over sets of noun phrases, resulting in a first-order probabilistic model for coreference. We outline a set of approximations that make this approach practical, and apply our method to the ACE coreference dataset, achieving a 45% error reduction over a comparable method that only considers features of pairs of noun phrases. This result demonstrates an example of how a first-order logic representation can be incorporated into a probabilistic model and scaled efficiently.
Structured learning with approximate inference
- Advances in Neural Information Processing Systems
Cited by 31 (1 self)
In many structured prediction problems, the highest-scoring labeling is hard to compute exactly, leading to the use of approximate inference methods. However, when inference is used in a learning algorithm, a good approximation of the score may not be sufficient. We show in particular that learning can fail even with an approximate inference method with rigorous approximation guarantees. There are two reasons for this. First, approximate methods can effectively reduce the expressivity of an underlying model by making it impossible to choose parameters that reliably give good predictions. Second, approximations can respond to parameter changes in such a way that standard learning algorithms are misled. In contrast, we give two positive results in the form of learning bounds for the use of LP-relaxed inference in structured perceptron and empirical risk minimization settings. We argue that without understanding combinations of inference and learning, such as these, that are appropriately compatible, learning performance under approximate inference cannot be guaranteed.
Extracting personal names from emails: Applying named entity recognition to informal text
- In HLT-EMNLP, 2005
Cited by 29 (8 self)
There has been little prior work on Named Entity Recognition for "informal" documents like email. We present two methods for improving performance of person name recognizers for email: email-specific structural features and a recall-enhancing method which exploits name repetition across multiple documents.
An effective two-stage model for exploiting non-local dependencies in named entity recognition
- In ACL-COLING'06: Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, 2006
Cited by 24 (0 self)
This paper shows that a simple two-stage approach to handle non-local dependencies in Named Entity Recognition (NER) can outperform existing approaches that handle non-local dependencies, while being much more computationally efficient. NER systems typically use sequence models for tractable inference, but this makes them unable to capture the long-distance structure present in text. We use a Conditional Random Field (CRF) based NER system using local features to make predictions and then train another CRF which uses both local information and features extracted from the output of the first CRF. Using features capturing non-local dependencies from the same document, our approach yields a 12.6% relative error reduction on the F1 score, over state-of-the-art NER systems using local information alone, when compared to the 9.3% relative error reduction offered by the best systems that exploit non-local information. Our approach also makes it easy to incorporate non-local information from other documents in the test corpus, and this gives us a 13.3% error reduction over NER systems using local information alone. Additionally, our running time for inference is just the inference time of two sequential CRFs, which is much less than that of other more complicated approaches that directly model the dependencies and do approximate inference.
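To make the second stage concrete, here is a minimal sketch of one plausible feature "extracted from the output of the first CRF": for each token, the label the first-stage model most often gave to the same token string anywhere in the document. The abstract does not list the actual features, so the function name and the choice of a majority-label feature are assumptions for illustration only.

```python
from collections import Counter, defaultdict


def majority_label_features(tokens, first_stage_labels):
    """For each position, return the label most frequently assigned by the
    first-stage model to that token string anywhere in the document."""
    counts = defaultdict(Counter)
    for tok, lab in zip(tokens, first_stage_labels):
        counts[tok][lab] += 1
    # One feature value per token position, aligned with the input.
    return [counts[tok].most_common(1)[0][0] for tok in tokens]
```

A second-stage sequence model could then take these values as extra per-token features alongside the local ones, so that both stages still use ordinary sequence inference.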
A Skip-Chain Conditional Random Field for Ranking Meeting Utterances by Importance
- Association for Computational Linguistics, 2006
Cited by 23 (0 self)
We describe a probabilistic approach to content selection for meeting summarization. We use skip-chain Conditional Random Fields (CRF) to model non-local pragmatic dependencies between paired utterances such as QUESTION-ANSWER that typically appear together in summaries, and show that these models outperform linear-chain CRFs and Bayesian models in the task. We also discuss different approaches for ranking all utterances in a sequence using CRFs. Our best performing system achieves 91.3% of human performance when evaluated with the Pyramid evaluation metric, which represents a 3.9% absolute increase compared to our most competitive non-sequential classifier.
Efficient inference with cardinality-based clique potentials
- In Proc. 24th ICML, 2007
Cited by 15 (0 self)
Many collective labeling tasks require inference on graphical models where the clique potentials depend only on the number of nodes that get a particular label. We design efficient inference algorithms for various families of such potentials. Our algorithms are exact for arbitrary cardinality-based clique potentials on binary labels and for max-like and majority-like clique potentials on multiple labels. Moving towards more complex potentials, we show that inference becomes NP-hard even on cliques with homogeneous Potts potentials. We present a 13/15-approximation algorithm with runtime sub-quadratic in the clique size. In contrast, the best known previous guarantee for graphs with Potts potentials is only 0.5. We perform empirical comparisons on real and synthetic data, and show that our proposed methods are an order of magnitude faster than the well-known Tree-based reparameterization (TRW) and graph-cut algorithms.
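For the binary-label case, exact inference under a cardinality-based potential has a simple sort-based form: for each possible count k of nodes labeled 1, the best choice is the k nodes with the largest gain from switching 0 to 1. The sketch below assumes a single clique covering all nodes, with hypothetical inputs `node_scores` (a pair of scores per node) and `clique_potential` (a function of the count of 1s). It illustrates the general idea and is not claimed to be the paper's algorithm.

```python
def map_cardinality_binary(node_scores, clique_potential):
    """Exact MAP labeling for binary nodes under one clique whose potential
    depends only on how many nodes take label 1.

    node_scores: list of (score_if_0, score_if_1) per node.
    clique_potential: function mapping the count of 1-labels to a score.
    Returns (best_score, labels).
    """
    n = len(node_scores)
    # Sort nodes by the gain from flipping 0 -> 1, largest gain first.
    order = sorted(
        range(n),
        key=lambda i: node_scores[i][1] - node_scores[i][0],
        reverse=True,
    )
    base = sum(s0 for s0, _ in node_scores)
    best_score, best_k = base + clique_potential(0), 0
    gain = 0.0
    for k, i in enumerate(order, start=1):
        gain += node_scores[i][1] - node_scores[i][0]
        total = base + gain + clique_potential(k)
        if total > best_score:
            best_score, best_k = total, k
    labels = [0] * n
    for i in order[:best_k]:
        labels[i] = 1
    return best_score, labels
```

Sorting dominates the cost, so this runs in O(n log n) time for n nodes, plus n + 1 evaluations of the potential.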
Factorie: Probabilistic programming via imperatively defined factor graphs
- In Advances in Neural Information Processing Systems 22, 2009
Cited by 15 (2 self)
Discriminatively trained undirected graphical models have had wide empirical success, and there has been increasing interest in toolkits that ease their application to complex relational data. The power in relational models is in their repeated structure and tied parameters; at issue is how to define these structures in a powerful and flexible way. Rather than using a declarative language, such as SQL or first-order logic, we advocate using an imperative language to express various aspects of model structure, inference, and learning. By combining the traditional, declarative, statistical semantics of factor graphs with imperative definitions of their construction and operation, we allow the user to mix declarative and procedural domain knowledge, and also gain significant efficiencies. We have implemented such imperatively defined factor graphs in a system we call FACTORIE, a software library for an object-oriented, strongly-typed, functional language. In experimental comparisons to Markov Logic Networks on joint segmentation and coreference, we find our approach to be 3-15 times faster while reducing error by 20-25%, achieving a new state of the art.

