## Generalized expectation criteria for semi-supervised learning of conditional random fields (2008)

Venue: In Proc. ACL, pages 870–878

Citations: 64 (8 self)

### BibTeX

@INPROCEEDINGS{Mann08generalizedexpectation,
  author    = {Gideon S. Mann and Andrew McCallum},
  title     = {Generalized expectation criteria for semi-supervised learning of conditional random fields},
  booktitle = {Proc. ACL},
  pages     = {870--878},
  year      = {2008}
}

### Abstract

This paper presents a semi-supervised training method for linear-chain conditional random fields that makes use of labeled features rather than labeled instances. This is accomplished by using generalized expectation criteria to express a preference for parameter settings in which the model’s distribution on unlabeled data matches a target distribution. We induce target conditional probability distributions of labels given features from both annotated feature occurrences in context and ad hoc feature majority label assignment. The use of generalized expectation criteria allows for a dramatic reduction in annotation time by shifting from traditional instance-labeling to feature-labeling, and the methods presented outperform traditional CRF training and other semi-supervised methods when limited human effort is available.
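The mechanism the abstract describes — preferring parameter settings whose predicted label distribution, averaged over unlabeled instances containing a labeled feature, matches a target distribution — can be sketched as a KL-divergence penalty. This is an illustrative sketch, not the paper's implementation; the function name and data layout are hypothetical.

```python
import math

def ge_penalty(target_dist, model_probs_for_feature):
    """KL(target || model average) for one labeled feature: compares a target
    label distribution with the model's average predicted label distribution
    over the unlabeled instances that contain the feature. (Hypothetical.)"""
    n = len(model_probs_for_feature)
    avg = [sum(p[y] for p in model_probs_for_feature) / n
           for y in range(len(target_dist))]
    return sum(t * math.log(t / a)
               for t, a in zip(target_dist, avg) if t > 0)

# A feature annotated as indicating label 0 about 90% of the time, and the
# model's current predictions on three unlabeled instances containing it:
target = [0.9, 0.1]
preds = [[0.8, 0.2], [0.7, 0.3], [0.95, 0.05]]
print(ge_penalty(target, preds))
```

Training would add one such penalty per labeled feature to the objective and follow its gradient; the paper derives that gradient for linear-chain CRFs.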

### Citations

8134 | Maximum likelihood from incomplete data via the EM algorithm
- Dempster, Laird, et al.
- 1977
Citation Context: ...ntion in order to avoid performance loss during the bootstrapping process, such as in Riloff and Shepherd (2000). 2.1.2 EXPECTATION MAXIMIZATION Generative models trained by expectation maximization (Dempster et al., 1977) have been widely studied for semi-supervised learning. EM consists of two steps: an expectation step $Q(\theta \mid \theta^{(t)}) = E_{p(y|x,\theta^{(t)})}[\log L(\theta; x, y)]$, and a maximization step $\theta^{(t+1)} = \arg\max_{\theta} Q(\theta \mid \theta^{(t)})$...
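The E- and M-steps quoted in this excerpt can be made concrete on a toy problem. Below is a minimal sketch of EM for a two-component Bernoulli (coin) mixture — a textbook example, not the cited paper's model; all names and data are illustrative.

```python
def em_two_coins(flips, theta=(0.6, 0.4), iters=20):
    """EM for a two-coin Bernoulli mixture. flips: (heads, tails) per session."""
    a, b = theta
    for _ in range(iters):
        # E-step: posterior responsibility of coin A for each session.
        resp = []
        for h, t in flips:
            la = (a ** h) * ((1 - a) ** t)
            lb = (b ** h) * ((1 - b) ** t)
            resp.append(la / (la + lb))
        # M-step: responsibility-weighted maximum-likelihood re-estimates.
        a = (sum(r * h for r, (h, t) in zip(resp, flips))
             / sum(r * (h + t) for r, (h, t) in zip(resp, flips)))
        b = (sum((1 - r) * h for r, (h, t) in zip(resp, flips))
             / sum((1 - r) * (h + t) for r, (h, t) in zip(resp, flips)))
    return a, b

# Mixed sessions from a heads-heavy and a tails-heavy coin:
a, b = em_two_coins([(9, 1), (8, 2), (2, 8), (1, 9)])
print(a, b)
```

With the mixed sessions above, the two estimates separate toward the heads-heavy and tails-heavy components.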

2320 | Conditional random fields: Probabilistic models for segmenting and labeling sequence data
- Lafferty, McCallum, et al.
- 2001
Citation Context: ...expectation criteria to classification models. However, GE can additionally be applied to structured models. In this section, we examine the case of linear-chain structured conditional random fields (Lafferty et al., 2001), and derive the GE gradient for this model. Linear-chain CRFs are a discriminative probabilistic model over sequences $x = \langle x_1 \ldots x_n \rangle$ of feature vectors and label sequences $y = \langle y_1 \ldots y_n \rangle$, where $|x| = |y$...
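The linear-chain CRF quoted here defines p(y|x) as a globally normalized product of local scores, with the partition function Z(x) computable by the forward algorithm. A minimal log-space sketch with made-up scores (illustrative; not the paper's code):

```python
import math

def logsumexp(vals):
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

def crf_log_prob(emit, trans, y):
    """log p(y | x) for a linear-chain CRF with given local scores.
    emit[t][label]: per-position score; trans[a][b]: transition score."""
    n, k = len(emit), len(emit[0])
    # Unnormalized log-score of the given label path.
    score = emit[0][y[0]] + sum(trans[y[t - 1]][y[t]] + emit[t][y[t]]
                                for t in range(1, n))
    # Forward recursion computes log Z(x) over all k**n label paths.
    alpha = list(emit[0])
    for t in range(1, n):
        alpha = [emit[t][cur] + logsumexp([alpha[prev] + trans[prev][cur]
                                           for prev in range(k)])
                 for cur in range(k)]
    return score - logsumexp(alpha)

emit = [[1.0, 0.0], [0.0, 2.0], [0.5, 0.5]]   # 3 positions, 2 labels
trans = [[0.3, -0.2], [-0.2, 0.3]]
print(math.exp(crf_log_prob(emit, trans, [0, 1, 1])))
```

Summing exp(log p) over all label sequences of a given length returns 1, which is a convenient sanity check on the forward recursion.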

1245 | Combining labeled and unlabeled data with co-training
- Blum, Mitchell
- 1998
Citation Context: ..., $f^{(t-1)}(x_i))$; $f^{(t)} \leftarrow J(D \cup U_B)$ until done. One of the most successful examples of this work is Yarowsky (1995), where a small "seed set" of labeled instances is incrementally augmented. Co-training (Blum and Mitchell, 1998) looks at the case where two complementary classifiers can both be applied to a particular problem. Abney (2004) provides a deeper understanding of these methods by demonstrating that they optimize a...

1079 | A maximum entropy approach to natural language processing
- Berger, Pietra, et al.
- 1996
Citation Context: ...tion, we describe how to apply the GE criteria proposed above to conditionally trained log-linear models, starting with conditional maximum-entropy models, a.k.a. multinomial logistic regression models (Berger et al., 1996). In these models, there are $k$ scalar feature functions $\psi_k(x, y)$, and the probability of the label $y$ for input $x$ is calculated by $p(y|x;\theta) = \frac{1}{Z(x)} \exp\left(\sum_k \theta_k \psi_k(x, y)\right)$, where $Z(x) = \sum_{y'} \exp(\sum_k \theta_k \psi$...
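The quoted probability $p(y|x;\theta) = \frac{1}{Z(x)}\exp(\sum_k \theta_k \psi_k(x,y))$ is a softmax over feature scores and is easy to compute directly. A small sketch with an invented feature function (all names illustrative):

```python
import math

def maxent_prob(theta, psi, x, labels):
    """p(y|x) = exp(sum_k theta_k * psi_k(x, y)) / Z(x)."""
    scores = {y: sum(t * f for t, f in zip(theta, psi(x, y))) for y in labels}
    z = sum(math.exp(s) for s in scores.values())  # Z(x)
    return {y: math.exp(s) / z for y, s in scores.items()}

# Invented two-feature binary example: a value feature and a bias feature
# that both fire only for label 1.
psi = lambda x, y: [x, 1.0] if y == 1 else [0.0, 0.0]
probs = maxent_prob([2.0, -1.0], psi, 1.5, labels=(0, 1))
print(probs)
```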

702 | Class-based n-gram models of natural language (Computational Linguistics) - Brown, Pietra, et al. - 1990 |

681 | Transductive inference for text classification using support vector machines
- Joachims
- 1999
Citation Context: ...cases in which there is a small amount of fully labeled data and a much larger amount of unlabeled data, presumably from the same data source. For example, EM (Nigam et al., 1998), transductive SVMs (Joachims, 1999), entropy regularization (Grandvalet and Bengio, 2004), and graph-based...

490 | Unsupervised word sense disambiguation rivaling supervised methods - Yarowsky - 1995 |

317 | A framework for learning predictive structures from multiple tasks and unlabeled data - Ando, Zhang - 2005 |

266 | Learning from labeled and unlabeled data using graph mincuts - Blum, Chawla - 2001 |

244 | Tagging English text with a probabilistic model
- Mérialdo
- 1993
Citation Context: ...This conclusion reflects the experimental evidence and theoretical support from a large span of work. Expectation maximization is notoriously fickle for semi-supervised learning. In a classic result, Merialdo (1994) attempts semi-supervised learning to improve HMM part-of-speech tagging and finds that EM with unlabeled data reduces accuracy. Ng and Cardie (2003) also apply EM but find that it fails to improve ...

206 | Partially labeled classification with Markov random walks
- Szummer, Jaakkola
- 2001
Citation Context: ...ch noncontiguous feature occurrences in context are labeled for the purpose of deriving a conditional probability distribution of labels given a particular feature. methods (Zhu and Ghahramani, 2002; Szummer and Jaakkola, 2002) have all been applied to a limited amount of fully labeled data in conjunction with unlabeled data to improve the accuracy of a classifier. In this paper, we explore an alternative approach in which...

171 | Corpus-based induction of syntactic structure: Models of dependency and constituency
- Klein, Manning
- 2004
Citation Context: ...98) which presents a naïve Bayes model for text classification trained using EM and semi-supervised data. EM has also been applied to structured classification problems such as part-of-speech tagging (Klein and Manning, 2004), where EM can succeed after very careful and clever initialization. While these models can often be very effective, especially when used with "prototypes" (Haghighi and Klein, 2006b), they cannot ef...

162 | Learning to classify text from labeled and unlabeled documents
- Nigam, McCallum, et al.
- 2000
Citation Context: ...semi-supervised learning are applied to cases in which there is a small amount of fully labeled data and a much larger amount of unlabeled data, presumably from the same data source. For example, EM (Nigam et al., 1998), transductive SVMs (Joachims, 1999), entropy regularization (Grandvalet and Bengio, 2004), and graph-based...

119 | Contrastive estimation: Training log-linear models on unlabeled data
- Smith, Eisner
- 2005
Citation Context: ... expected class distribution over each instance. Unlike Schapire et al. (2002), these distributions start from a fixed point and then are allowed to change during training. In contrastive estimation (Smith and Eisner, 2005), EM is performed over a restricted log-likelihood function, where instead of $L(\theta) = \sum_i \log p(x_i;\theta)$, the contrastive-estimation log-likelihood function is $L_{CE}(\theta) = \sum_i \log p(x_i \mid \mathcal{N}(x_i);\theta)$. The neighborh...
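The contrastive-estimation likelihood quoted above replaces the full partition function with a sum over a neighborhood N(x_i) of each observed input. A toy sketch under assumed scores and neighborhoods (the scorer and neighborhood function are invented for illustration):

```python
import math

def ce_log_likelihood(score, examples, neighborhood):
    """Sum_i log p(x_i | N(x_i)): each observed x_i is normalized against
    its neighborhood rather than the whole input space."""
    total = 0.0
    for x in examples:
        log_z = math.log(sum(math.exp(score(xp)) for xp in neighborhood(x)))
        total += score(x) - log_z
    return total

# Toy scorer: count adjacent in-order character pairs. Neighborhood:
# the string itself plus all adjacent transpositions.
score = lambda s: sum(1.0 for a, b in zip(s, s[1:]) if a <= b)
def neighborhood(s):
    return [s] + [s[:i] + s[i + 1] + s[i] + s[i + 2:]
                  for i in range(len(s) - 1)]
print(ce_log_likelihood(score, ["abc", "bca"], neighborhood))
```

Because the observed input is always in its own neighborhood, each per-example term is a valid log-probability (at most 0).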

107 | Learning from labeled and unlabeled data with label propagation
- Zhu, Ghahramani
- 2002
Citation Context: ...m: Feature-labeling in which noncontiguous feature occurrences in context are labeled for the purpose of deriving a conditional probability distribution of labels given a particular feature. methods (Zhu and Ghahramani, 2002; Szummer and Jaakkola, 2002) have all been applied to a limited amount of fully labeled data in conjunction with unlabeled data to improve the accuracy of a classifier. In this paper, we explore an a...

89 | A probability analysis on the value of unlabeled data for classification problems - Zhang, Oles - 2000 |

80 | Semi-supervised learning by entropy minimization
- Grandvalet, Bengio
- 2005
Citation Context: ... fully labeled data and a much larger amount of unlabeled data, presumably from the same data source. For example, EM (Nigam et al., 1998), transductive SVMs (Joachims, 1999), entropy regularization (Grandvalet and Bengio, 2004), and graph-based...

75 | Prototype-driven learning for sequence models
- Haghighi, Klein
- 2006
Citation Context: ...radient, showing that GE provides significant improvements. We achieve competitive performance in comparison to alternate model families, in particular generative models such as MRFs trained with EM (Haghighi and Klein, 2006) and HMMs trained with soft constraints (Chang et al., 2007). Finally, in Section 5.3 we show that feature-labeling can lead to dramatic reductions in the annotation time that is required in order to...

63 | Learning from labeled features using generalized expectation criteria
- Druck, Mann, et al.
- 2008
Citation Context: ...es here in order to compare with previous results. Though in practice we have found that feature selection is often intuitive, recent work has experimented with automatic feature selection using LDA (Druck et al., 2008). For some of the experiments we also use two sets of 33 additional features that we chose by the same method as HK06, the first 33 of which are also shown in Table 1. We use the same tokenization of...

63 | Semi-supervised protein classification using cluster kernels
- Weston, Leslie, et al.
- 2005
Citation Context: ...o significantly improve accuracy. On the secondary structure prediction (SecStr), we had access to published results for a supervised SVM using a radial-basis function (RBF) kernel, a Cluster Kernel (Weston et al., 2006) and a graph-based method, the Quadratic Cost Criterion with Class Mean Normalization (Bengio et al., 2006), trained using various data sub-sampling schemes (Delalleau et al., 2006): a random sampler ...

59 | Guiding semi-supervision with constraint-driven learning
- Chang, Ratinov, et al.
- 2007
Citation Context: ...ieve competitive performance in comparison to alternate model families, in particular generative models such as MRFs trained with EM (Haghighi and Klein, 2006) and HMMs trained with soft constraints (Chang et al., 2007). Finally, in Section 5.3 we show that feature-labeling can lead to dramatic reductions in the annotation time that is required in order to achieve the same level of accuracy as traditional instance-...

58 | Semi-supervised conditional random fields for improved sequence segmentation and labeling
- Jiao, Wang, et al.
- 2006
Citation Context: ...py regularization is fragile, and accuracy gains can come only with precise settings of λ. High values of λ fall into the minimal entropy trap, while low values of λ have no effect on the model (see (Jiao et al., 2006) for an example). When some instances have partial labelings (i.e., labels for some of their tokens), it is possible to train CRFs via expected gradient methods (Salakhutdinov et al., 2003). Here a re...
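The λ-sensitivity this excerpt describes concerns the entropy term that entropy regularization adds to the training objective on unlabeled data. A sketch of that term alone (illustrative; not the cited implementation):

```python
import math

def entropy_term(posteriors, lam):
    """lam * sum_i H(p(y|x_i)) over unlabeled instances, where H is the
    Shannon entropy of the model's posterior on instance x_i."""
    total = 0.0
    for p in posteriors:
        total += -sum(q * math.log(q) for q in p if q > 0)
    return lam * total

# A uniform (uncertain) posterior contributes log 2; a confident one, near 0.
print(entropy_term([[0.5, 0.5], [0.9, 0.1]], lam=1.0))
```

Minimizing this term pushes posteriors toward confidence; as the excerpt notes, too large a λ locks in early mistakes (the "minimal entropy trap") while too small a λ leaves the model unchanged.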

56 | Name tagging with word clusters and discriminative training - Miller, Guinness, et al. - 2004 |

55 | Incorporating prior knowledge into boosting - Schapire, Rochery, et al. - 2002 |

52 | Large scale semi-supervised linear SVMs - Sindhwani, Keerthi - 2006 |

51 | Maximum margin semi-supervised learning for structured variables - Altun, McAllester, et al. - 2005 |

51 | A conditional random field for discriminatively-trained finite-state string edit distance - McCallum, Bellare, et al. - 2005 |

50 | Expectation maximization and posterior constraints - Graca, Ganchev, et al. - 2007 |

48 | Simple, robust, scalable semi-supervised learning via expectation regularization - Mann, McCallum - 2007 |

45 | Video suggestion and discovery for YouTube: taking random walks through the view graph - Baluja, Seth, et al. |

43 | Understanding the Yarowsky algorithm - Abney - 2004 |

42 | A new metric-based approach to model selection - Schuurmans - 1997 |

39 | Label propagation and quadratic criterion - Delalleau, Le Roux, et al. - 2006 |

38 | Unsupervised Learning of Field Segmentation Models for Information Extraction - Grenager, Klein, et al. - 2005 |

37 | Optimization with EM and expectation-conjugate-gradient
- Salakhutdinov, Roweis, et al.
- 2003
Citation Context: ...t on the model (see (Jiao et al., 2006) for an example). When some instances have partial labelings (i.e., labels for some of their tokens), it is possible to train CRFs via expected gradient methods (Salakhutdinov et al., 2003). Here a reformulation is presented in which the gradient is computed for a probability distribution with a marginalized hidden variable, $z$, and observed training labels $y$: ∇L(θ) = ∂ ∑ log p(x, y, z;...

37 | Harmonic mixtures: combining mixture models and graph-based methods for inductive and scalable semi-supervised learning - Zhu, Lafferty - 2005 |

36 | On information regularization
- Corduneanu, Jaakkola
- 2003
Citation Context: ...Benchmark tests have shown that entropy regularization performs as well as TSVMs (when the SVM is given a linear kernel) (Chapelle et al., 2006). Another related method is information regularization (Corduneanu and Jaakkola, 2003), which measures distance via the mutual information between a classifier and the marginal distribution p(x). 2.1.5 GRAPH-BASED METHODS Graph-based (manifold) methods can be very accurate when applie...

36 | Weakly supervised natural language learning without redundant views - Ng, Cardie - 2003 |

35 | Learning from measurements in exponential families - Liang, Jordan, et al. - 2009 |

32 | An alternate objective function for Markovian fields - Kakade, Teh, et al. - 2002 |

29 | Active learning by labeling features - Druck, Settles, et al. - 2009 |

27 | Trained named entity recognition using distributional clusters
- Freitag
- 2004
Citation Context: ...or a given sentence, in addition to standard features, additional features corresponding to the latent clusters of the tokens in the sentence are added. This technique, along with similar approaches (Freitag, 2004; Li and McCallum, 2005), has yielded small but consistent success. This method can be applied independently of the particular training method, and in Section 6.3 we explore combining our method with...

26 | Semi-supervised sequential labeling and segmentation using giga-word scale unlabeled data
- Suzuki, Isozaki
- 2008
Citation Context: ... and discriminative models, by combining ML estimates over the labeled data with EM parameter estimates over the unlabeled data, for a joint model which combines a CRF and an HMM (Suzuki et al., 2007; Suzuki and Isozaki, 2008). In this formulation, the log-likelihood can be viewed as two separate log-likelihood functions $L_1(\theta)$ and $L_2(\theta)$ which respectively correspond to the CRF and HMM log-likelihoods. When optimizing...

24 | Semi-supervised sequence modeling with syntactic topic models
- Li, McCallum
- 2005
Citation Context: ...ence, in addition to standard features, additional features corresponding to the latent clusters of the tokens in the sentence are added. This technique, along with similar approaches (Freitag, 2004; Li and McCallum, 2005), has yielded small but consistent success. This method can be applied independently of the particular training method, and in Section 6.3 we explore combining our method with those described by Mil...

22 | Semi-supervised classification with hybrid generative/discriminative methods
- Druck, Pal, et al.
- 2007
Citation Context: ...semi-supervised model in the parametric model family, such as expected gradient methods (Salakhutdinov et al., 2003), can be easily combined with GE, and certain generative models such as naïve MRFs (Druck et al., 2007) can be simply combined as well. More distantly, just as various models can be augmented with regularization terms (as in ridge regression for linear regression models), GE may be augmented in the sa...

22 | Dependency grammar induction via bitext projection constraints - Ganchev, Gillenwater, et al. - 2009 |

19 | Active learning with feedback on both features and instances - Raghavan, Madani, et al. - 2006 |

19 | Alternating projections for learning with expectation constraints - Bellare, Druck, et al. - 2009 |

19 | Word Sense Disambiguation Using Label Propagation Based Semi-Supervised Learning - Niu, Ji, et al. - 2005 |

17 | The latent maximum entropy principle - Wang, Rosenfeld, et al. - 2002 |