Results 1  10
of
14
Contrastive estimation: Training loglinear models on unlabeled data
 In Proc. of ACL
, 2005
"... Conditional random fields (Lafferty et al., 2001) are quite effective at sequence labeling tasks like shallow parsing (Sha and Pereira, 2003) and namedentity extraction (McCallum and Li, 2003). CRFs are loglinear, allowing the incorporation of arbitrary features into the model. To train on unlabele ..."
Abstract

Cited by 115 (14 self)
 Add to MetaCart
Conditional random fields (Lafferty et al., 2001) are quite effective at sequence labeling tasks like shallow parsing (Sha and Pereira, 2003) and namedentity extraction (McCallum and Li, 2003). CRFs are loglinear, allowing the incorporation of arbitrary features into the model. To train on unlabeled data, we require unsupervised estimation methods for loglinear models; few exist. We describe a novel approach, contrastive estimation. We show that the new technique can be intuitively understood as exploiting implicit negative evidence and is computationally efficient. Applied to a sequence labeling problem—POS tagging given a tagging dictionary and unlabeled text—contrastive estimation outperforms EM (with the same feature set), is more robust to degradations of the dictionary, and can largely recover by modeling additional features. 1
Generalized expectation criteria for semisupervised learning of conditional random fields
 In In Proc. ACL, pages 870 – 878
, 2008
"... This paper presents a semisupervised training method for linearchain conditional random fields that makes use of labeled features rather than labeled instances. This is accomplished by using generalized expectation criteria to express a preference for parameter settings in which the model’s distri ..."
Abstract

Cited by 64 (8 self)
 Add to MetaCart
This paper presents a semisupervised training method for linearchain conditional random fields that makes use of labeled features rather than labeled instances. This is accomplished by using generalized expectation criteria to express a preference for parameter settings in which the model’s distribution on unlabeled data matches a target distribution. We induce target conditional probability distributions of labels given features from both annotated feature occurrences in context and adhoc feature majority label assignment. The use of generalized expectation criteria allows for a dramatic reduction in annotation time by shifting from traditional instancelabeling to featurelabeling, and the methods presented outperform traditional CRF training and other semisupervised methods when limited human effort is available. 1
2004. Annealing techniques for unsupervised statistical language learning
 In Proc. of ACL
, 2004
"... Exploiting unannotated natural language data is hard largely because unsupervised parameter estimation is hard. We describe deterministic annealing (Rose et al., 1990) as an appealing alternative to the ExpectationMaximization algorithm (Dempster et al., 1977). Seeking to avoid search error, DA beg ..."
Abstract

Cited by 29 (8 self)
 Add to MetaCart
Exploiting unannotated natural language data is hard largely because unsupervised parameter estimation is hard. We describe deterministic annealing (Rose et al., 1990) as an appealing alternative to the ExpectationMaximization algorithm (Dempster et al., 1977). Seeking to avoid search error, DA begins by globally maximizing an easy concave function and maintains a local maximum as it gradually morphs the function into the desired nonconcave likelihood function. Applying DA to parsing and tagging models is shown to be straightforward; significant improvements over EM are shown on a partofspeech tagging task. We describe a variant, skewed DA, which can incorporate a good initializer when it is available, and show significant improvements over EM on a grammar induction task. 1
Novel Estimation Methods for Unsupervised Discovery of Latent Structure in Natural Language Text
, 2006
"... This thesis is about estimating probabilistic models to uncover useful hidden structure in data; specifically, we address the problem of discovering syntactic structure in natural language text. We present three new parameter estimation techniques that generalize the standard approach, maximum likel ..."
Abstract

Cited by 27 (8 self)
 Add to MetaCart
This thesis is about estimating probabilistic models to uncover useful hidden structure in data; specifically, we address the problem of discovering syntactic structure in natural language text. We present three new parameter estimation techniques that generalize the standard approach, maximum likelihood estimation, in different ways. Contrastive estimation maximizes the conditional probability of the observed data given a “neighborhood” of implicit negative examples. Skewed deterministic annealing locally maximizes likelihood using a cautious parameter search strategy that starts with an easier optimization problem than likelihood, and iteratively moves to harder problems, culminating in likelihood. Structural annealing is similar, but starts with a heavy bias toward simple syntactic structures and gradually relaxes the bias. Our estimation methods do not make use of annotated examples. We consider their performance in both an unsupervised model selection setting, where models trained under different initialization and regularization settings are compared by evaluating the training objective on a small set of unseen, unannotated development data, and supervised model selection, where the most accurate model on the development set (now with annotations)
Guiding unsupervised grammar induction using contrastive estimation
 In Proc. of IJCAI Workshop on Grammatical Inference Applications
, 2005
"... We describe a novel training criterion for probabilistic grammar induction models, contrastive estimation [Smith and Eisner, 2005], which can be interpreted as exploiting implicit negative evidence and includes a wide class of likelihoodbased objective functions. This criterion is a generalization ..."
Abstract

Cited by 25 (7 self)
 Add to MetaCart
We describe a novel training criterion for probabilistic grammar induction models, contrastive estimation [Smith and Eisner, 2005], which can be interpreted as exploiting implicit negative evidence and includes a wide class of likelihoodbased objective functions. This criterion is a generalization of the function maximized by the ExpectationMaximization algorithm [Dempster et al., 1977]. CE is a natural fit for loglinear models, which can include arbitrary features but for which EM is computationally difficult. We show that, using the same features, loglinear dependency grammar models trained using CE can drastically outperform EMtrained generative models on the task of matching human linguistic annotations (the MATCHLINGUIST task). The selection of an implicit negative evidence class—a “neighborhood”—appropriate to a given task has strong implications, but a good neighborhood one can target the objective of grammar induction to a specific application. 1
Mixtures of Conditional Maximum Entropy Models
 In Proc. of ICML2003
, 2002
"... Driven by successes in several application areas, maximum entropy modeling has recently gained considerable popularity. We generalize the standard maximum entropy formulation of classi cation problems to better handle the case where complex data distributions arise from a mixture of simpler u ..."
Abstract

Cited by 14 (8 self)
 Add to MetaCart
Driven by successes in several application areas, maximum entropy modeling has recently gained considerable popularity. We generalize the standard maximum entropy formulation of classi cation problems to better handle the case where complex data distributions arise from a mixture of simpler underlying (latent) distributions. We develop a theoretical framework for characterizing data as a mixture of maximum entropy models. We formulate a maximumlikelihood interpretation of the mixture model learning, and derive a generalized EM algorithm to solve the corresponding optimization problem. We present empirical results for a number of data sets showing that modeling the data as a mixture of latent maximum entropy models gives signi cant improvement over the standard, single component, maximum entropy approach.
Leveraging the Margin More Carefully
 In International Conference on Machine Learning, ICML
, 2004
"... Boosting is a popular approach for building accurate classifiers. Despite the initial popular belief, boosting algorithms do exhibit overfitting and are sensitive to label noise. Part of the sensitivity of boosting algorithms to outliers and noise can be attributed to the unboundedness of the ..."
Abstract

Cited by 14 (0 self)
 Add to MetaCart
Boosting is a popular approach for building accurate classifiers. Despite the initial popular belief, boosting algorithms do exhibit overfitting and are sensitive to label noise. Part of the sensitivity of boosting algorithms to outliers and noise can be attributed to the unboundedness of the marginbased loss functions that they employ.
Exploiting Syntactic, Semantic and Lexical Regularities in Language Modeling via Directed Markov Random Fields
 In Proceedings of ICML 2005
, 2005
"... We present a directed Markov random field (MRF) model that combines ngram models, probabilistic context free grammars (PCFGs) and probabilistic latent semantic analysis (PLSA) for the purpose of statistical language modeling. Even though the composite directed MRF model potentially has an exponenti ..."
Abstract

Cited by 8 (3 self)
 Add to MetaCart
We present a directed Markov random field (MRF) model that combines ngram models, probabilistic context free grammars (PCFGs) and probabilistic latent semantic analysis (PLSA) for the purpose of statistical language modeling. Even though the composite directed MRF model potentially has an exponential number of loops and becomes a context sensitive grammar, we are nevertheless able to estimate its parameters in cubic time using an efficient modified EM method, the generalized insideoutside algorithm, which extends the insideoutside algorithm to incorporate the effects of the ngram and PLSA language models. We generalize various smoothing techniques to alleviate the sparseness of ngram counts in cases where there are hidden variables. We also derive an analogous algorithm to calculate the probability of initial subsequence of a sentence, generated by the composite language model. Our experimental results on the Wall Street Journal corpus show that we obtain significant reductions in perplexity compared to the stateoftheart baseline trigram model with GoodTuring and KneserNey smoothings. 1.
Sequence Modeling with Mixtures of Conditional Maximum Entropy Distributions
 IN: PROCEEDINGS OF THE THIRD IEEE INTERNATIONAL CONFERENCE ON DATA MINING (ICDM’03
, 2003
"... We present a novel approach to modeling sequences using mixtures of conditional maximum entropy distributions. Our method generalizes the mixture of firstorder Markov models by including the "longterm" dependencies in model components. The "longterm" dependencies are represented by the freque ..."
Abstract

Cited by 6 (1 self)
 Add to MetaCart
We present a novel approach to modeling sequences using mixtures of conditional maximum entropy distributions. Our method generalizes the mixture of firstorder Markov models by including the "longterm" dependencies in model components. The "longterm" dependencies are represented by the frequently used in the natural language processing (NLP) domain probabilistic triggers or rules (such as "A occurred k positions back =) the current symbol is B with probability P "). The maximum entropy framework is then used to create a coherent probabilistic model from all triggers selected for modeling. In order to represent hidden or unobserved effects in the data we use probabilistic mixtures with maximum entropy models as components. We demonstrate how our mixture of conditional maximum entropy models can be learned from data using the EM algorithm that scales linearly in the dimensions of the data and the number of mixture components. We present empirical results on the simulated and realworld data sets and demonstrate that the proposed approach enables us to create better quality models than the mixtures of firstorder Markov models and resist overfitting and curse of dimensionality that would inevitably present themselves for the higher order Markov models.
Learning Mixture Models with the Regularized Latent Maximum Entropy Principle
 IEEE Transactions on Neural Networks
, 2004
"... Abstract—This paper presents a new approach to estimating mixture models based on a recent inference principle we have proposed: the latent maximum entropy principle (LME). LME is different from Jaynes ’ maximum entropy principle, standard maximum likelihood, and maximum a posteriori probability est ..."
Abstract

Cited by 2 (1 self)
 Add to MetaCart
Abstract—This paper presents a new approach to estimating mixture models based on a recent inference principle we have proposed: the latent maximum entropy principle (LME). LME is different from Jaynes ’ maximum entropy principle, standard maximum likelihood, and maximum a posteriori probability estimation. We demonstrate the LME principle by deriving new algorithms for mixture model estimation, and show how robust new variants of the expectation maximization (EM) algorithm can be developed. We show that a regularized version of LME (RLME), is effective at estimating mixture models. It generally yields better results than plain LME, which in turn is often better than maximum likelihood and maximum a posterior estimation, particularly when inferring latent variable models from small amounts of data. Index Terms—Expectation maximization (EM), iterative scaling, latent variables, maximum entropy, mixture models, regularization. I.