Conditional random fields: Probabilistic models for segmenting and labeling sequence data (2001)

by John Lafferty, Andrew McCallum, Fernando C. N. Pereira
Citations: 3482 (85 self)

Citations

3495 A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55(1) - Freund, Schapire - 1997

Citation Context

...ications and deserve further study. In this section we briefly mention just two. Conditional random fields can be trained using the exponential loss objective function used by the AdaBoost algorithm (Freund & Schapire, 1997). Typically, boosting is applied to classification problems with a small, fixed number of classes; applications of boosting to sequence labeling have treated each label as a separate classification p...

1529 Gradient-based learning applied to document recognition - LeCun, Bottou, et al. - 1998

Citation Context

...tates can be traded off against each other. We can also think of a CRF as a finite state model with unnormalized transition probabilities. However, unlike some other weighted finite-state approaches (LeCun et al., 1998), CRFs assign a well-defined probability distribution over possible labelings, trained by maximum likelihood or MAP estimation. Furthermore, the loss function is convex, guaranteeing convergence to ...
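As a worked restatement of the convexity claim in this excerpt (a sketch under standard assumptions, not a quotation from the paper; F_k here denotes assumed global feature sums over edges and vertices), the conditional log-likelihood of a CRF is a linear term minus a log-partition term:

    % Sketch only: F_k(y, x) are assumed global feature counts, not symbols taken from the excerpt.
    \ell(\theta) = \sum_j \log p_\theta\!\left(\mathbf{y}^{(j)} \mid \mathbf{x}^{(j)}\right)
                 = \sum_j \left( \sum_k \theta_k F_k\!\left(\mathbf{y}^{(j)}, \mathbf{x}^{(j)}\right)
                                 - \log Z_\theta\!\left(\mathbf{x}^{(j)}\right) \right)

Since log Z_θ(x) = log Σ_y exp Σ_k θ_k F_k(y, x) is convex in θ, the negative log-likelihood is convex, which is the guarantee of convergence to a global optimum referred to above.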

1365 A maximum entropy approach to natural language processing - Berger, Pietra, et al. - 1996

Citation Context

...cations of exponential models in sequence modeling have either attempted to build generative models (Rosenfeld, 1997), which involve a hard normalization problem, or adopted local conditional models (Berger et al., 1996; Ratnaparkhi, 1996; McCallum et al., 2000) that may suffer from label bias. Non-probabilistic local decision models have also been widely used in segmentation and tagging (Brill, 1995; Roth, 1998; Ab...

1218 Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids - Durbin - 1998

Citation Context

...ational biology, HMMs and stochastic grammars have been successfully used to align biological sequences, find sequences homologous to a known evolutionary family, and analyze RNA secondary structure (Durbin et al., 1998). In computational linguistics and computer science, HMMs and stochastic grammars have been applied to a wide variety of problems in text and speech processing, including topic segmentation, part-ofs...

1138 Foundations of Statistical Natural Language Processing - Manning, Schütze - 1999
924 Transformation-based error-driven learning and natural language processing: A case study in part-of-speech tagging - Brill - 1995

Citation Context

...models (Berger et al., 1996; Ratnaparkhi, 1996; McCallum et al., 2000) that may suffer from label bias. Non-probabilistic local decision models have also been widely used in segmentation and tagging (Brill, 1995; Roth, 1998; Abney et al., 1999). Because of the computational complexity of global training, these models are only trained to minimize the error of individual label decisions assuming that neighbori...

670 Inducing features of random fields - Pietra, Pietra, et al. - 1997
580 A maximum entropy model for part-of-speech tagging - Ratnaparkhi - 1996

Citation Context

...entical state structure on a part-of-speech tagging task. 2. The Label Bias Problem Classical probabilistic automata (Paz, 1971), discriminative Markov models (Bottou, 1991), maximum entropy taggers (Ratnaparkhi, 1996), and MEMMs, as well as non-probabilistic sequence tagging and segmentation models with independently trained next-state classifiers (Punyakanok & Roth, 2001) are all potential victims of the label b...

561 Maximum entropy Markov models for information extraction and segmentation - McCallum, Freitag, et al. - 2000

Citation Context

...stance conditional independence given the labels, to achieve tractability. Maximum entropy Markov models (MEMMs) are conditional probabilistic sequence models that attain all of the above advantages (McCallum et al., 2000). In MEMMs, each source state has an exponential model that takes the observation features as input, and outputs a distribution over possible next states. These exponential models are trained by an a...
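To make the per-state construction described in this excerpt concrete, here is a minimal sketch in Python (illustrative only, not the authors' code; the names W, b, and next_state_distribution are invented for this example) of one source state's exponential model: a softmax over candidate next states, normalized locally within that state.

    # Illustrative sketch of one MEMM source state's exponential model.
    # Names and shapes are assumptions for this example, not from the paper.
    import numpy as np

    def next_state_distribution(W, b, features):
        """Map observation features to a distribution over next states.

        W: (num_states, num_features) weight matrix for this source state
        b: (num_states,) bias vector
        features: (num_features,) observation feature vector
        """
        scores = W @ features + b      # one score per candidate next state
        scores -= scores.max()         # numerical stability
        probs = np.exp(scores)
        return probs / probs.sum()     # normalization is local to this state

The last line is the locally normalized step that the label bias discussion elsewhere on this page turns on: each state's outgoing probabilities must sum to one on their own.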

511 Generalized iterative scaling for log-linear models. - Darroch, Ratcliff - 1972
392 Finite-state transducers in language and speech processing - Mohri - 1997

Citation Context

... structure of the model. In the above example we could collapse states 1 and 4, and delay the branching until we get a discriminating observation. This operation is a special case of determinization (Mohri, 1997), but determinization of weighted finite-state machines is not always possible, and even when possible, it may lead to combinatorial explosion. The other solution mentioned is to start with a fullyco...

333 Discriminative reranking for natural language parsing - Collins, Koo - 2004

Citation Context

... use a more global discriminative model to rerank those candidates. This approach is standard in large-vocabulary speech recognition (Schwartz & Austin, 1993), and has also been proposed for parsing (Collins, 2000). However, these methods fail when the correct output is pruned away in the first pass. Closest to our proposal are gradient-descent methods that adjust the parameters of all of the local classifiers...

222 Introduction to Probabilistic Automata - Paz - 1971

Citation Context

...laimed advantages of conditional models by evaluating HMMs, MEMMs and CRFs with identical state structure on a part-of-speech tagging task. 2. The Label Bias Problem Classical probabilistic automata (Paz, 1971), discriminative Markov models (Bottou, 1991), maximum entropy taggers (Ratnaparkhi, 1996), and MEMMs, as well as non-probabilistic sequence tagging and segmentation models with independently trained...

173 Markov Fields on Finite Graphs and Lattices. Unpublished manuscript - Hammersley, Clifford - 1968

Citation Context

...= (Y1, Y2, . . . , Yn). If the graph G = (V, E) of Y is a tree (of which a chain is the simplest example), its cliques are the edges and vertices. Therefore, by the fundamental theorem of random fields (Hammersley & Clifford, 1971), the joint distribution over the label sequence Y given X has the form

    p_θ(y | x) ∝ exp( Σ_{e∈E,k} λ_k f_k(e, y|_e, x) + Σ_{v∈V,k} µ_k g_k(v, y|_v, x) ),    (1)

where x is a data sequence, y a label sequence, and ...
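To show how Eq. (1) specializes to a linear chain, here is a small brute-force sketch in Python (illustrative only; the edge and vertex feature sums are collapsed into assumed lookup tables lam and mu, and Z(x) is computed by enumeration rather than by the forward algorithm, purely for clarity):

    # Brute-force evaluation of Eq. (1) for a linear-chain CRF.
    # lam, mu, and the integer encoding of x are assumptions for this example.
    import itertools
    import numpy as np

    def chain_crf_prob(y, x, lam, mu, num_labels):
        """p(y | x) for a linear-chain CRF with tabular edge/vertex scores.

        y: sequence of label indices, x: sequence of observation indices
        lam: (num_labels, num_labels) edge scores lam[y_i, y_{i+1}]
        mu:  (num_labels, num_obs)    vertex scores mu[y_i, x_i]
        """
        def score(labels):
            s = sum(mu[labels[i], x[i]] for i in range(len(x)))                 # vertex terms
            s += sum(lam[labels[i], labels[i + 1]] for i in range(len(x) - 1))  # edge terms
            return s

        # Observation-dependent normalizer Z(x), summed over all label sequences.
        Z = sum(np.exp(score(labels))
                for labels in itertools.product(range(num_labels), repeat=len(x)))
        return np.exp(score(tuple(y))) / Z

In practice Z(x) and the marginals are computed with a forward-backward recursion over the chain, but the enumeration above matches the definition directly.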

172 Learning to resolve natural language ambiguities: A unified approach - Roth - 1998

Citation Context

...r et al., 1996; Ratnaparkhi, 1996; McCallum et al., 2000) that may suffer from label bias. Non-probabilistic local decision models have also been widely used in segmentation and tagging (Brill, 1995; Roth, 1998; Abney et al., 1999). Because of the computational complexity of global training, these models are only trained to minimize the error of individual label decisions assuming that neighboring labels ar...

130 Information extraction with HMM structures learned by stochastic optimization - Freitag, McCallum - 2000

Citation Context

...yconnected model and let the training procedure figure out a good structure. But that would preclude the use of prior structural knowledge that has proven so valuable in information extraction tasks (Freitag & McCallum, 2000). Proper solutions require models that account for whole state sequences at once by letting some transitions “vote” more strongly than others depending on the corresponding observations. This implies...

73 Boosting Applied to Tagging and PP Attachment - Abney, Schapire, et al. - 1999

Citation Context

...ally, boosting is applied to classification problems with a small, fixed number of classes; applications of boosting to sequence labeling have treated each label as a separate classification problem (Abney et al., 1999). However, it is possible to apply the parallel update algorithm of Collins et al. (2000) to optimize the per-sequence exponential loss. This requires a forward-backward algorithm to compute efficien...

64 Minimization algorithms for sequential transducers - Mohri - 2000

Citation Context

...e paper we tacitly assume that the graph G is fixed. In the simplest and most important³ example for modeling sequences, G is a simple chain or line: G = (V = {1, 2, . . . , m}, E = {(i, i + 1)}). X may also have a natural g... [³ Weighted determinization and minimization techniques shift transition weights while preserving overall path weight (Mohri, 2000); their connection to this discussion deserves further study.]

61 Une approche théorique de l'apprentissage connexionniste: Applications à la reconnaissance de la parole. Doctoral dissertation, Université de - Bottou - 1991

Citation Context

...ng recall and doubling precision relative to HMMs in a FAQ segmentation task. MEMMs and other non-generative finite-state models based on next-state classifiers, such as discriminative Markov models (Bottou, 1991), share a weakness we call here the label bias problem: the transitions leaving a given state compete only against each other, rather than against all other transitions in the model. In probabilistic...
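A toy numerical illustration (not from the paper) of the weakness described in this excerpt: under per-state normalization, a state with a single outgoing transition must pass probability 1 regardless of the observation, so observations can never penalize paths that go through it.

    # Toy illustration of label bias under local (per-state) normalization.
    import numpy as np

    def local_softmax(scores):
        """Locally normalized transition distribution for one state."""
        e = np.exp(scores - scores.max())
        return e / e.sum()

    # A state with two successors: the observation can shift mass between them.
    print(local_softmax(np.array([2.0, -1.0])))   # roughly [0.95, 0.05]

    # A state with a single successor: however badly the observation matches,
    # the only possible output is [1.0], so this step never down-weights the path.
    print(local_softmax(np.array([-10.0])))       # [1.0]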

60 Boltzmann Chains and Hidden Markov Models - Saul, Jordan - 1995

Citation Context

..., y), g_{y,x}(v, y|_v, x) = δ(y_v, y) δ(x_v, x). The corresponding parameters λ_{y′,y} and µ_{y,x} play a similar role to the (logarithms of the) usual HMM parameters p(y′ | y) and p(x | y). Boltzmann chain models (Saul & Jordan, 1996; MacKay, 1996) have a similar form but use a single normalization constant to yield a joint distribution, whereas CRFs use the observation-dependent normalization Z(x) for conditional distributions. ...
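The f feature definition is cut off at the start of this excerpt; the LaTeX restatement below uses the standard HMM-like chain-CRF indicator features and should be read as an assumption consistent with the visible g_{y,x} definition, not as a quotation:

    % Assumed standard indicator features; the f form is not visible in the excerpt above.
    f_{y',y}\big(e=(v,v'),\, \mathbf{y}|_e, \mathbf{x}\big) = \delta(y_v, y')\,\delta(y_{v'}, y),
    \qquad
    g_{y,x}\big(v,\, \mathbf{y}|_v, \mathbf{x}\big) = \delta(y_v, y)\,\delta(x_v, x)

With only these features, λ_{y',y} acts like log p(y' | y) and µ_{y,x} like log p(x | y), but Z(x) normalizes per observation sequence, which is exactly the contrast with Boltzmann chains drawn in the excerpt.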

30 A whole sentence maximum entropy language model, in Automatic Speech Recognition and Understanding - Rosenfeld - 1997

Citation Context

...he benefits of conditional models with the global normalization of random field models. Other applications of exponential models in sequence modeling have either attempted to build generative models (Rosenfeld, 1997), which involve a hard normalization problem, or adopted local conditional models (Berger et al., 1996; Ratnaparkhi, 1996; McCallum et al., 2000) that may suffer from label bias. Non-probabilistic lo...

9 Equivalence of linear Boltzmann chains and hidden Markov models. Neural Computation - MacKay - 1996

Citation Context

... δ(y_v, y) δ(x_v, x). The corresponding parameters λ_{y′,y} and µ_{y,x} play a similar role to the (logarithms of the) usual HMM parameters p(y′ | y) and p(x | y). Boltzmann chain models (Saul & Jordan, 1996; MacKay, 1996) have a similar form but use a single normalization constant to yield a joint distribution, whereas CRFs use the observation-dependent normalization Z(x) for conditional distributions. Although it en...

2 The use of classifiers in sequential inference. NIPS 13. Forthcoming - Punyakanok, Roth - 2001

Citation Context

...models (Bottou, 1991), maximum entropy taggers (Ratnaparkhi, 1996), and MEMMs, as well as non-probabilistic sequence tagging and segmentation models with independently trained next-state classifiers (Punyakanok & Roth, 2001) are all potential victims of the label bias problem. For example, Figure 1 represents a simple finite-state model designed to distinguish between the two words rib and rob. Suppose that the observat...
