Results 1-10 of 87

Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling
In ACL, 2005
Cited by 669 (25 self)
Abstract
Most current statistical natural language processing models use only local features so as to permit dynamic programming in inference, but this makes them unable to fully account for the long-distance structure that is prevalent in language use. We show how to solve this dilemma with Gibbs sampling, a simple Monte Carlo method used to perform approximate inference in factored probabilistic models. By using simulated annealing in place of Viterbi decoding in sequence models such as HMMs, CMMs, and CRFs, it is possible to incorporate non-local structure while preserving tractable inference. We use this technique to augment an existing CRF-based information extraction system with long-distance dependency models, enforcing label consistency and extraction template consistency constraints. This technique results in an error reduction of up to 9% over state-of-the-art systems on two established information extraction tasks.
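The decoding idea in this abstract, replacing Viterbi with Gibbs sampling plus simulated annealing so that non-local factors stay tractable, can be sketched on a toy model. This is a minimal sketch under assumed ingredients: a hypothetical global score with local emission terms plus a consistency bonus for designated token pairs, and an illustrative cooling schedule (the real system's features and schedule differ).

```python
import math
import random

def score(labels, emissions, pairs, lam=2.0):
    """Global score: local emission log-scores plus a non-local
    label-consistency bonus for token pairs that should agree
    (e.g. repeated occurrences of the same word)."""
    s = sum(emissions[i][y] for i, y in enumerate(labels))
    s += sum(lam for i, j in pairs if labels[i] == labels[j])
    return s

def annealed_gibbs(emissions, pairs, n_labels=2, sweeps=200, seed=0):
    """Gibbs sampling with simulated annealing: resample one position
    at a time from its conditional distribution, lowering the
    temperature so the chain settles into a high-scoring labeling."""
    rng = random.Random(seed)
    n = len(emissions)
    labels = [rng.randrange(n_labels) for _ in range(n)]
    for t in range(sweeps):
        temp = max(0.05, 0.97 ** t)  # illustrative cooling schedule
        for i in range(n):
            # conditional over labels[i] with everything else fixed
            logits = []
            for y in range(n_labels):
                labels[i] = y
                logits.append(score(labels, emissions, pairs) / temp)
            m = max(logits)
            probs = [math.exp(l - m) for l in logits]
            z = sum(probs)
            r, acc = rng.random() * z, 0.0
            for y, p in enumerate(probs):
                acc += p
                if r <= acc:
                    labels[i] = y
                    break
    return labels
```

In a document where the last occurrence of a repeated token locally favors the wrong label, independent local decoding gets it wrong, while the annealed sampler settles on the consistent, higher-scoring labeling.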

Piecewise training of undirected models
In Proc. of UAI, 2005
Cited by 100 (5 self)
Abstract
For many large undirected models that arise in real-world applications, exact maximum-likelihood training is intractable, because it requires computing marginal distributions of the model. Conditional training is even more difficult, because the partition function depends not only on the parameters, but also on the observed input, requiring repeated inference over each training example. An appealing idea for such models is to independently train a local undirected classifier over each clique, afterwards combining the learned weights into a single global model. In this paper, we show that this piecewise method can be justified as minimizing a new family of upper bounds on the log partition function. On three natural-language data sets, piecewise training is more accurate than pseudolikelihood, and often performs comparably to global training using belief propagation.
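A minimal sketch of the piecewise objective on a toy chain CRF: each node factor and each edge factor is normalized locally, so evaluating the objective never requires the global partition function or chain inference. The weight dictionaries and toy features here are assumptions for illustration.

```python
import math
from collections import defaultdict

def piecewise_nll(w_node, w_edge, seqs, n_labels=2):
    """Piecewise negative log-likelihood for a toy chain CRF.
    w_node maps (observation, label) to a weight; w_edge maps
    (label, label) to a weight. Each factor is treated as its own
    locally normalized classifier (a "piece")."""
    nll = 0.0
    for xs, ys in seqs:
        # node pieces: p(y_t | x_t) under the node factor alone
        for x, y in zip(xs, ys):
            scores = [w_node[(x, lab)] for lab in range(n_labels)]
            nll -= scores[y] - math.log(sum(math.exp(s) for s in scores))
        # edge pieces: p(y_t, y_{t+1}) under the edge factor alone
        for a, b in zip(ys, ys[1:]):
            scores = [w_edge[(i, j)]
                      for i in range(n_labels) for j in range(n_labels)]
            nll -= w_edge[(a, b)] - math.log(sum(math.exp(s) for s in scores))
    return nll
```

Minimizing this sum of per-piece losses trains each factor as an independent local classifier; the learned weights are then simply reused together as one global model at test time.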

FACTORIE: Probabilistic programming via imperatively defined factor graphs
In Advances in Neural Information Processing Systems 22, 2009
Cited by 82 (16 self)
Abstract
Discriminatively trained undirected graphical models have had wide empirical success, and there has been increasing interest in toolkits that ease their application to complex relational data. The power in relational models is in their repeated structure and tied parameters; at issue is how to define these structures in a powerful and flexible way. Rather than using a declarative language, such as SQL or first-order logic, we advocate using an imperative language to express various aspects of model structure, inference, and learning. By combining the traditional, declarative, statistical semantics of factor graphs with imperative definitions of their construction and operation, we allow the user to mix declarative and procedural domain knowledge, and also gain significant efficiencies. We have implemented such imperatively defined factor graphs in a system we call FACTORIE, a software library for an object-oriented, strongly-typed, functional language. In experimental comparisons to Markov Logic Networks on joint segmentation and coreference, we find our approach to be 3-15 times faster while reducing error by 20-25%, achieving a new state of the art.
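The core idea, letting ordinary code define a factor's neighbor set and score, can be illustrated in a few lines. This is a hypothetical Python analogy, not FACTORIE's actual Scala API:

```python
class Factor:
    """A factor whose neighbors and score function are plain code,
    in the spirit of imperatively defined factor graphs."""
    def __init__(self, variables, score_fn):
        self.variables = variables  # names of the variables it touches
        self.score_fn = score_fn    # arbitrary procedural scoring code

    def score(self, assignment):
        return self.score_fn(*(assignment[v] for v in self.variables))

def model_score(factors, assignment):
    """Unnormalized log-score of a full assignment."""
    return sum(f.score(assignment) for f in factors)
```

Because the score function is arbitrary code, repeated structure and tied parameters fall out of ordinary programming constructs (loops, closures, shared objects) rather than a declarative template language.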

Dependency parsing by belief propagation
In Proceedings of EMNLP, 2008
Cited by 80 (9 self)
Abstract
We formulate dependency parsing as a graphical model with the novel ingredient of global constraints. We show how to apply loopy belief propagation (BP), a simple and effective tool for approximate learning and inference. As a parsing algorithm, BP is both asymptotically and empirically efficient. Even with second-order features or latent variables, which would make exact parsing considerably slower or NP-hard, BP needs only O(n³) time with a small constant factor. Furthermore, such features significantly improve parse accuracy over exact first-order methods. Incorporating additional features would increase the runtime additively rather than multiplicatively.
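Loopy BP itself is easy to state on a small pairwise model. The sketch below runs synchronous sum-product on a binary pairwise MRF and includes a brute-force check; the parser's actual factor graph, with its global tree constraints, is far richer than this toy.

```python
import itertools

def loopy_bp(n, unary, pair, edges, iters=50):
    """Synchronous sum-product on a pairwise MRF over binary variables.
    unary[i][x] and pair[(i, j)][x][y] are (non-log) potentials.
    Returns approximate marginals, one [p(0), p(1)] list per variable."""
    nbrs = {i: [] for i in range(n)}
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)
    # directed messages, initialized uniform
    msg = {(i, j): [1.0, 1.0] for i, j in edges}
    msg.update({(j, i): [1.0, 1.0] for i, j in edges})

    def psi(i, j, xi, xj):
        return pair[(i, j)][xi][xj] if (i, j) in pair else pair[(j, i)][xj][xi]

    for _ in range(iters):
        new = {}
        for (i, j) in msg:
            out = [0.0, 0.0]
            for xj in (0, 1):
                for xi in (0, 1):
                    prod = unary[i][xi] * psi(i, j, xi, xj)
                    for k in nbrs[i]:
                        if k != j:
                            prod *= msg[(k, i)][xi]
                    out[xj] += prod
            z = sum(out)
            new[(i, j)] = [v / z for v in out]  # normalize for stability
        msg = new

    beliefs = []
    for i in range(n):
        b = [unary[i][x] for x in (0, 1)]
        for k in nbrs[i]:
            b = [b[x] * msg[(k, i)][x] for x in (0, 1)]
        z = sum(b)
        beliefs.append([v / z for v in b])
    return beliefs

def exact_marginals(n, unary, pair, edges):
    """Brute-force marginals for checking BP on tiny graphs."""
    marg = [[0.0, 0.0] for _ in range(n)]
    z = 0.0
    for xs in itertools.product((0, 1), repeat=n):
        p = 1.0
        for i in range(n):
            p *= unary[i][xs[i]]
        for i, j in edges:
            p *= pair[(i, j)][xs[i]][xs[j]]
        z += p
        for i in range(n):
            marg[i][xs[i]] += p
    return [[v / z for v in m] for m in marg]
```

On tree-structured graphs the messages converge to the exact marginals; on graphs with cycles the same updates give the approximation the paper relies on.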

First-order probabilistic models for coreference resolution
In HLT/NAACL, 2007
Cited by 80 (19 self)
Abstract
Traditional noun phrase coreference resolution systems represent features only of pairs of noun phrases. In this paper, we propose a machine learning method that enables features over sets of noun phrases, resulting in a first-order probabilistic model for coreference. We outline a set of approximations that make this approach practical, and apply our method to the ACE coreference dataset, achieving a 45% error reduction over a comparable method that only considers features of pairs of noun phrases. This result demonstrates an example of how a first-order logic representation can be incorporated into a probabilistic model and scaled efficiently.

Structured learning with approximate inference
In Advances in Neural Information Processing Systems
Cited by 75 (2 self)
Abstract
In many structured prediction problems, the highest-scoring labeling is hard to compute exactly, leading to the use of approximate inference methods. However, when inference is used in a learning algorithm, a good approximation of the score may not be sufficient. We show in particular that learning can fail even with an approximate inference method with rigorous approximation guarantees. There are two reasons for this. First, approximate methods can effectively reduce the expressivity of an underlying model by making it impossible to choose parameters that reliably give good predictions. Second, approximations can respond to parameter changes in such a way that standard learning algorithms are misled. In contrast, we give two positive results in the form of learning bounds for the use of LP-relaxed inference in structured perceptron and empirical risk minimization settings. We argue that learning performance under approximate inference cannot be guaranteed without understanding which combinations of inference and learning are appropriately compatible.

A Skip-Chain Conditional Random Field for Ranking Meeting Utterances by Importance
In Association for Computational Linguistics, 2006
Cited by 65 (0 self)
Abstract
We describe a probabilistic approach to content selection for meeting summarization. We use skip-chain Conditional Random Fields (CRFs) to model non-local pragmatic dependencies between paired utterances, such as QUESTION-ANSWER pairs, that typically appear together in summaries, and show that these models outperform linear-chain CRFs and Bayesian models on the task. We also discuss different approaches for ranking all utterances in a sequence using CRFs. Our best-performing system achieves 91.3% of human performance when evaluated with the Pyramid evaluation metric, which represents a 3.9% absolute increase compared to our most competitive non-sequential classifier.

An effective two-stage model for exploiting non-local dependencies in named entity recognition
In ACL-COLING ’06: Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, 2006
Cited by 46 (0 self)
Abstract
This paper shows that a simple two-stage approach to handling non-local dependencies in Named Entity Recognition (NER) can outperform existing approaches that handle non-local dependencies, while being much more computationally efficient. NER systems typically use sequence models for tractable inference, but this makes them unable to capture the long-distance structure present in text. We use a Conditional Random Field (CRF) based NER system using local features to make predictions, and then train another CRF which uses both local information and features extracted from the output of the first CRF. Using features capturing non-local dependencies from the same document, our approach yields a 12.6% relative error reduction on the F1 score over state-of-the-art NER systems using local information alone, compared to the 9.3% relative error reduction offered by the best systems that exploit non-local information. Our approach also makes it easy to incorporate non-local information from other documents in the test corpus, and this gives us a 13.3% error reduction over NER systems using local information alone. Additionally, our running time for inference is just the inference time of two sequential CRFs, which is much less than that of other, more complicated approaches that directly model the dependencies and do approximate inference.
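The two-stage recipe can be mimicked without real CRFs: stage 1 labels each position from local scores alone, and stage 2 re-scores each position with an aggregate feature built from stage-1 predictions for the same token elsewhere in the document. The per-position scores, the weight lam, and the majority-vote feature below are stand-ins for the trained models.

```python
from collections import Counter

def two_stage(tokens, pos_scores, lam=1.0):
    """Two-stage decoding sketch. pos_scores[i] maps label -> local
    score for position i. Stage 1 is a purely local argmax; stage 2
    adds, for each position, the fraction of other occurrences of the
    same token that stage 1 gave each label (non-local consistency)."""
    labels = sorted(pos_scores[0])
    first = [max(s, key=s.get) for s in pos_scores]

    counts = {}
    for tok, lab in zip(tokens, first):
        counts.setdefault(tok, Counter())[lab] += 1

    second = []
    for i, (tok, s) in enumerate(zip(tokens, pos_scores)):
        c = counts[tok].copy()
        c[first[i]] -= 1          # exclude this occurrence's own vote
        total = sum(c.values())
        rescored = {lab: s[lab] + lam * (c[lab] / total if total else 0.0)
                    for lab in labels}
        second.append(max(rescored, key=rescored.get))
    return first, second
```

In the test below, the last occurrence of a repeated name is locally ambiguous; stage 1 mislabels it, and stage 2's document-level feature corrects it.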

Extracting personal names from emails: Applying named entity recognition to informal text
In HLT-EMNLP, 2005
Cited by 44 (9 self)
Abstract
There has been little prior work on Named Entity Recognition for “informal” documents like email. We present two methods for improving the performance of person-name recognizers for email: email-specific structural features and a recall-enhancing method which exploits name repetition across multiple documents.

Piecewise pseudolikelihood for efficient CRF training
In International Conference on Machine Learning (ICML), 2007
Cited by 33 (2 self)
Abstract
Discriminative training of graphical models can be expensive if the variables have large cardinality, even if the graphical structure is tractable. In such cases, pseudolikelihood is an attractive alternative, because its running time is linear in the variable cardinality, but on some data its accuracy can be poor. Piecewise training (Sutton & McCallum, 2005) can have better accuracy but does not scale as well in the variable cardinality. In this paper, we introduce piecewise pseudolikelihood, which retains the computational efficiency of pseudolikelihood but can have much better accuracy. On several benchmark NLP data sets, piecewise pseudolikelihood has better accuracy than standard pseudolikelihood, and in many cases is nearly equivalent to maximum likelihood, with five to ten times less training time than batch CRF training.
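Pseudolikelihood's appeal, conditionals that sum over a single variable only, can be seen in a short sketch on a toy chain model; piecewise pseudolikelihood further localizes each conditional to one factor at a time. The unary and transition potentials below are illustrative.

```python
import math

def pseudo_loglik(ys, unary, trans, n_labels=2):
    """Pseudo-log-likelihood for a chain model: sum over positions of
    log p(y_i | y_{i-1}, y_{i+1}). Each conditional normalizes over the
    single variable y_i, so the cost is linear in label cardinality and
    no global partition function is needed."""
    def local(i, y):
        s = unary[i][y]
        if i > 0:
            s += trans[ys[i - 1]][y]
        if i + 1 < len(ys):
            s += trans[y][ys[i + 1]]
        return s

    ll = 0.0
    for i, y in enumerate(ys):
        scores = [local(i, lab) for lab in range(n_labels)]
        m = max(scores)
        ll += scores[y] - (m + math.log(sum(math.exp(s - m) for s in scores)))
    return ll
```

A labeling consistent with the potentials scores higher than an inconsistent one, which is what maximizing this objective exploits during training.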