## Piecewise training of undirected models (2005)

Venue: Proc. of UAI

Citations: 84 (5 self)

### BibTeX

@INPROCEEDINGS{Sutton05piecewisetraining,
  author = {Charles Sutton and Andrew McCallum},
  title = {Piecewise training of undirected models},
  booktitle = {Proc. of UAI},
  year = {2005}
}

### Abstract

For many large undirected models that arise in real-world applications, exact maximum-likelihood training is intractable, because it requires computing marginal distributions of the model. Conditional training is even more difficult, because the partition function depends not only on the parameters, but also on the observed input, requiring repeated inference over each training example. An appealing idea for such models is to independently train a local undirected classifier over each clique, afterwards combining the learned weights into a single global model. In this paper, we show that this piecewise method can be justified as minimizing a new family of upper bounds on the log partition function. On three natural-language data sets, piecewise training is more accurate than pseudolikelihood, and often performs comparably to global training using belief propagation.
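The idea the abstract describes, training a local, locally normalized classifier over each clique and then keeping the learned weights as the global model, can be sketched on a toy edge-factored chain. All sizes, names, and the gradient-descent loop below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy edge-factored chain: T positions, K labels, D input features per
# position. Each edge factor (y_t, y_{t+1}) shares one weight matrix theta
# scoring the K*K joint label pairs from the features at position t.
K, D, T = 3, 4, 6

def piecewise_loss_grad(theta, X, Y):
    """Negative piecewise log-likelihood for one instance.

    Each edge factor is normalized locally over its own K*K outcomes,
    i.e. trained as if it were an independent classifier."""
    loss, grad = 0.0, np.zeros_like(theta)
    for t in range(T - 1):
        logits = theta @ X[t]                 # scores for K*K label pairs
        p = np.exp(logits - logits.max())
        p /= p.sum()                          # local normalization only
        target = Y[t] * K + Y[t + 1]          # index of the observed pair
        loss -= np.log(p[target])
        err = p.copy()
        err[target] -= 1.0                    # expected minus observed
        grad += np.outer(err, X[t])
    return loss, grad

# One random training instance; a few steps of gradient descent on the
# piecewise objective. The learned theta is the global model's edge
# weights: "combining the pieces" is simply keeping them.
X = rng.normal(size=(T, D))
Y = rng.integers(0, K, size=T)
theta = np.zeros((K * K, D))
losses = []
for _ in range(200):
    loss, grad = piecewise_loss_grad(theta, X, Y)
    losses.append(loss)
    theta -= 0.05 * grad
# At theta = 0 each edge is uniform over K*K pairs, so the first loss
# equals (T-1) * log(K*K); training drives it down.
```

Because each piece is locally normalized, no global partition function (and hence no global inference) is ever computed during training, which is the computational appeal.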

### Citations

2506 | Conditional random fields: probabilistic modeling for segmenting and labeling sequence data
- Lafferty, McCallum, et al.
- 2001
Citation Context: ...lihood p(y|x) instead of the generative likelihood p(y, x). This allows inclusion of rich, overlapping features of x without needing to model their distribution, which can greatly improve performance [8]. Conditional training can be expensive, however, because the partition function Z(x) depends not only on the model parameters but also on the input data. This means that parameter estimation requires...

2235 | Building a Large Annotated Corpus for English: The Penn Treebank
- Marcus, Santorini, et al.
- 1993
Citation Context: ...e has a two-level grid structure, shown in Figure 1. Our data comes from the CoNLL 2000 shared task [11], and consists of sentences from the Wall Street Journal annotated by the Penn Treebank project [9]. We consider each sentence to be a training instance, with single words as tokens. We report results here on subsets of 223 training sentences, and the standard test set of 2012 sentences. Results ar...

515 | Factorial hidden markov models
- Ghahramani, Jordan
- 1997
Citation Context: ...5.2 FACTORIAL CRF The first loopy model we consider is the factorial CRF introduced by Sutton, Rohanimanesh, and McCallum [13]. Factorial CRFs are the conditionally-trained analogue of factorial HMMs [6]; each consists of a series of undirected linear chains with connections between cotemporal labels. This is a natural model for jointly performing multiple dependent sequence labeling tasks. We consider...

476 | Learning Low-Level Vision
- Freeman, Pasztor, et al.
- 2000
Citation Context: ...such an intuitively appealing method, it has been used in several scattered places in the literature, for tasks such as information extraction [16], collective classification [7], and computer vision [4]. In these papers, the piecewise method is reported as a successful heuristic for training large models, but its performance is not compared against other training methods. We are unaware of previous...

360 | Discriminative probabilistic models for relational data, in
- Taskar, Abbeel, et al.
- 2002
Citation Context: ...performs comparably to global training using belief propagation. 1 INTRODUCTION Large graphical models are becoming increasingly common in applications including computer vision, relational learning [14], and natural language processing [16, 3]. Often the cheapest way to build such models is to estimate their parameters from labeled training data. But exact maximum-likelihood estimation requires repe...

287 | Statistical analysis of non-lattice data
- Besag
- 1975
Citation Context: ...d by Wainwright, Jaakkola, and Willsky [15], a connection which motivates several generalizations of the basic piecewise procedure. The piecewise estimator is also closely related to pseudolikelihood [1, 2]. Both estimators are based on locally normalizing small pieces of the full model. But pseudolikelihood conditions on the true value of neighboring nodes, which has the effect of coupling parameters i...
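The contrast this excerpt draws can be written out explicitly: pseudolikelihood normalizes each node's conditional distribution given the observed values of its neighbors, while the piecewise objective normalizes each factor locally over its own variables. The notation below is a standard rendering of these two objectives, not taken verbatim from the paper:

```latex
% Pseudolikelihood: condition each node i on its observed neighbors N(i).
\ell_{\mathrm{PL}}(\theta) = \sum_{i} \log p_\theta\bigl(y_i \mid y_{N(i)}, \mathbf{x}\bigr)

% Piecewise: normalize each factor a locally over its own labels y_a,
% with no conditioning on neighboring nodes.
\ell_{\mathrm{PW}}(\theta) = \sum_{a} \Bigl[\, \theta_a^{\top} \mathbf{f}_a(\mathbf{x}, \mathbf{y}_a)
  - \log \sum_{\mathbf{y}'_a} \exp\bigl(\theta_a^{\top} \mathbf{f}_a(\mathbf{x}, \mathbf{y}'_a)\bigr) \Bigr]
```

The conditioning in the first objective is what couples parameters across neighboring factors; the second trains each piece in isolation.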

191 | Early results for Named Entity Recognition with Conditional Random Fields, Feature Induction and Web-Enhanced Lexicons
- McCallum, Li
- 2003
Citation Context: ...lopment set of 3,466 sentences. Evaluation is done using precision and recall on the extracted chunks, and we report F1 = 2PR/(P + R). We use a linear-chain CRF, whose features are described elsewhere [10]. [Figure 2: graphical model for a skip-chain CRF]...

163 | A new class of upper bounds on the log partition function
- Wainwright, Jaakkola, et al.
- 2002
Citation Context: ...an upper bound on the exact log partition function. Although this bound can be proved directly, it can also be derived from the variational upper bounds presented by Wainwright, Jaakkola, and Willsky [15], a connection which motivates several generalizations of the basic piecewise procedure. The piecewise estimator is also closely related to pseudolikelihood [1, 2]. Both estimators are based on locall...

129 | Dynamic conditional random fields: Factorized probabilistic models for labeling and segmenting sequence data
- Sutton, McCallum, et al.
Citation Context: ...node, we use each edge, hoping to take more of the sequential interactions into account. We evaluate piecewise training on three models used in previous work: a linear-chain CRF [8], a factorial CRF [13], and a skip-chain CRF [12]. All of these models use input features such as word identity, part-of-speech tags, capitalization, and membership in domain-specific lexicons; these are described fully in...

110 | Efficiency of pseudo-likelihood estimation for simple Gaussian fields, Biometrika 64
- Besag
- 1977
Citation Context: ...d by Wainwright, Jaakkola, and Willsky [15], a connection which motivates several generalizations of the basic piecewise procedure. The piecewise estimator is also closely related to pseudolikelihood [1, 2]. Both estimators are based on locally normalizing small pieces of the full model. But pseudolikelihood conditions on the true value of neighboring nodes, which has the effect of coupling parameters i...

104 | Machine learning for information extraction in informal domains. Machine Learning
- Freitag
- 1999
Citation Context: ...mail messages announcing seminars at Carnegie Mellon University. The messages are annotated with the seminar’s starting time, ending time, location, and speaker. This data set is due to Dayne Freitag [5], and has been used in much previous work. Often the speaker is listed multiple times in the same message. For example, the speaker’s name might be included both near the beginning and later on, in a...

80 | An integrated, conditional model of information extraction and coreference with application to citation matching
- Wellner, McCallum, et al.
- 2004
Citation Context: ...g using belief propagation. 1 INTRODUCTION Large graphical models are becoming increasingly common in applications including computer vision, relational learning [14], and natural language processing [16, 3]. Often the cheapest way to build such models is to estimate their parameters from labeled training data. But exact maximum-likelihood estimation requires repeatedly computing marginals of the model d...

79 | Collective segmentation and labeling of distant entities in information extraction
- Sutton, McCallum
- 2004
Citation Context: ...ping to take more of the sequential interactions into account. We evaluate piecewise training on three models used in previous work: a linear-chain CRF [8], a factorial CRF [13], and a skip-chain CRF [12]. All of these models use input features such as word identity, part-of-speech tags, capitalization, and membership in domain-specific lexicons; these are described fully in the original papers. In al...

77 | Collective Information Extraction with Relational Markov Networks. ACL’2004
- Bunescu, Mooney
- 2004
Citation Context: ...g using belief propagation. 1 INTRODUCTION Large graphical models are becoming increasingly common in applications including computer vision, relational learning [14], and natural language processing [16, 3]. Often the cheapest way to build such models is to estimate their parameters from labeled training data. But exact maximum-likelihood estimation requires repeatedly computing marginals of the model d...

27 | Introduction to the CoNLL-2000 Shared Task: Chunking
- Tjong Kim Sang, Buchholz
- 2000
Citation Context: ...tly predicting part-of-speech tags and segmenting noun phrases in newswire text. Thus, the FCRF we use has a two-level grid structure, shown in Figure 1. Our data comes from the CoNLL 2000 shared task [11], and consists of sentences from the Wall Street Journal annotated by the Penn Treebank project [9]. We consider each sentence to be a training instance, with single words as tokens. We report results...

6 | Learning coordination classifiers
- Greiner, Guo, et al.
- 2005
Citation Context: ...e piecewise estimator is such an intuitively appealing method, it has been used in several scattered places in the literature, for tasks such as information extraction [16], collective classification [7], and computer vision [4]. In these papers, the piecewise method is reported as a successful heuristic for training large models, but its performance is not compared against other training methods. We...

1 | http://lcg-www.uia.ac.be/~erikt/research/np-chunking.html - See
