## Why Doesn’t EM Find Good HMM POS-Taggers? (2007)

Venue: Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), Prague, Czech Republic. Association for Computational Linguistics.

Citations: 26 (2 self)

### BibTeX

@INPROCEEDINGS{Johnson07why,
  author = {Mark Johnson},
  title = {Why Doesn't {EM} Find Good {HMM} {POS}-Taggers?},
  booktitle = {Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning ({EMNLP-CoNLL})},
  address = {Prague, Czech Republic},
  publisher = {Association for Computational Linguistics},
  year = {2007},
  pages = {296--305}
}

### Abstract

This paper investigates why the HMMs estimated by Expectation-Maximization (EM) produce such poor results as Part-of-Speech (POS) taggers. We find that the HMMs estimated by EM generally assign a roughly equal number of word tokens to each hidden state, while the empirical distribution of tokens to POS tags is highly skewed. This motivates a Bayesian approach using a sparse prior to bias the estimator toward such a skewed distribution. We investigate Gibbs Sampling (GS) and Variational Bayes (VB) estimators and show that VB converges faster than GS for this task and that VB significantly improves 1-to-1 tagging accuracy over EM. We also show that EM does nearly as well as VB when the number of hidden HMM states is dramatically reduced. Finally, we point out that all of these estimators exhibit high variance and require many more iterations to approach convergence than is usually assumed.
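The abstract's central contrast, EM's near-uniform assignment of tokens to states versus a sparse-prior Bayesian estimator that favors skewed distributions, can be sketched numerically. The snippet below is an illustration under stated assumptions, not the paper's code: it compares the standard EM M-step for one multinomial HMM parameter (normalized expected counts) with the mean-field Variational Bayes update under a symmetric Dirichlet(α) prior, which replaces c_k / Σc with exp(ψ(c_k + α) − ψ(Σc + Kα)), where ψ is the digamma function. With a small α, low-count outcomes are damped toward zero, which is the sparsity bias the paper exploits.

```python
import math

def digamma(x):
    """Digamma psi(x) via the recurrence psi(x) = psi(x+1) - 1/x plus a
    short asymptotic series (adequate for x > 0; avoids a SciPy dependency)."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    return (result + math.log(x) - 1.0 / (2.0 * x)
            - 1.0 / (12.0 * x * x) + 1.0 / (120.0 * x ** 4))

def em_update(counts):
    """EM M-step: the new parameter vector is the normalized expected counts."""
    total = sum(counts)
    return [c / total for c in counts]

def vb_update(counts, alpha):
    """Mean-field VB update with a symmetric Dirichlet(alpha) prior.
    The exp-digamma weights sum to less than 1; with alpha << 1,
    low-count outcomes are suppressed relative to the EM estimate."""
    k = len(counts)
    norm = digamma(sum(counts) + k * alpha)
    return [math.exp(digamma(c + alpha) - norm) for c in counts]

# Hypothetical expected counts for one hidden state's emissions:
# one frequent outcome and two rare ones (a skewed, POS-tag-like profile).
counts = [90.0, 5.0, 5.0]
em = em_update(counts)               # [0.9, 0.05, 0.05]
vb = vb_update(counts, alpha=0.001)
# VB shrinks the rare outcomes more strongly than the frequent one:
print(vb[1] / vb[0] < em[1] / em[0])   # True
```

Running EM's forward-backward E-step with `vb_update` in place of `em_update` is the whole of the VB algorithm for HMMs studied in the paper; the sub-normalized weights are used exactly where EM would use probabilities.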

### Citations

2235 | Building a Large Annotated Corpus for English: The Penn Treebank
- Marcus, Santorini, et al.
- 1993
Citation Context: ...xperiments described below have the same basic structure: an estimator is used to infer a bitag HMM from the unsupervised training corpus (the words of Penn Treebank (PTB) Wall Street Journal corpus (Marcus et al., 1993)), and then the resulting model is used to label each word of that corpus with one of the HMM’s hidden states. This section describes how we evaluate how well these sequences of hidden states corresp...

1021 | Monte Carlo Statistical Methods
- Robert, Casella
- 1999
Citation Context: ...: Markov Chain Monte Carlo (MCMC) and Variational Bayes (VB). MCMC encompasses a broad range of sampling techniques, including component-wise Gibbs sampling, which is the MCMC technique we used here (Robert and Casella, 2004; Bishop, 2006). In general, MCMC techniques do not produce a single model that characterizes the posterior, but instead produce a stream of samples from the posterior. The application of MCMC techniq...

869 | An Introduction to Variational Methods for Graphical Models - Jordan, Ghahramani, et al. - 1999 |

798 | Statistical Methods for Speech Recognition
- Jelinek
- 1997
Citation Context: ...dentical. 3 Maximum Likelihood via Expectation-Maximization There are several excellent textbook presentations of Hidden Markov Models and the Forward-Backward algorithm for Expectation-Maximization (Jelinek, 1997; Manning and Schütze, 1999; Bishop, 2006), so we do not cover them in detail here. Conceptually, a Hidden Markov Model generates a sequence of observations x = (x0, ..., xn) (here, the words of th...

515 | Factorial hidden markov models - Ghahramani, Jordan - 1997 |

378 | Feature-rich part-of-speech tagging with a cyclic dependency network - Toutanova, Klein, et al. - 2003 |

258 | Tagging English Text with a Probabilistic Model
- Merialdo
- 1994
Citation Context: ...ach convergence than usually thought. 1 Introduction It is well known that Expectation-Maximization (EM) performs poorly in unsupervised induction of linguistic structure (Carroll and Charniak, 1992; Merialdo, 1994; Klein, 2005; Smith, 2006). In retrospect one can certainly find reasons to explain this failure: after all, likelihood does not appear in the wide variety of linguistic tests proposed for identifyin...

127 | A fully Bayesian approach to unsupervised part-of-speech tagging
- Goldwater, Griffiths
- 2007
Citation Context: ...ng POS tagging models has focused on semi-supervised methods in which the learner is provided with a lexicon specifying the possible tags for each word (Merialdo, 1994; Smith and Eisner, 2005; Goldwater and Griffiths, 2007) or a small number of “prototypes” for each POS (Haghighi and Klein, 2006). In the context of semi-supervised learning using a tag lexicon, Wang and Schuurmans (2005) observe discrepancies between the...

99 | Two experiments on learning probabilistic dependency grammars from corpora
- Carroll, Charniak
- 1992
Citation Context: ...any more iterations to approach convergence than usually thought. 1 Introduction It is well known that Expectation-Maximization (EM) performs poorly in unsupervised induction of linguistic structure (Carroll and Charniak, 1992; Merialdo, 1994; Klein, 2005; Smith, 2006). In retrospect one can certainly find reasons to explain this failure: after all, likelihood does not appear in the wide variety of linguistic tests propose...

94 | A bit of progress in language modeling
- Goodman
- 2001
Citation Context: ... ways still poorly understood; for example, smoothing is generally regarded as essential for higher-order HMMs, yet it is not clear how to integrate smoothing into unsupervised estimation procedures (Goodman, 2001; Wang and Schuurmans, 2005). Most previous work exploiting unsupervised training data for inferring POS tagging models has focused on semi-supervised methods in which the learner is provided w...

85 | Comparing clusterings by the variation of information - Meilă |

84 | Ensemble learning for hidden Markov models - MacKay - 1997 |

35 | An entropic estimator for structure discovery - Brand - 1999 |

30 | Novel Estimation Methods for Unsupervised Discovery of Latent Structure in Natural Language Text
- Smith
- 2006
Citation Context: ...thought. 1 Introduction It is well known that Expectation-Maximization (EM) performs poorly in unsupervised induction of linguistic structure (Carroll and Charniak, 1992; Merialdo, 1994; Klein, 2005; Smith, 2006). In retrospect one can certainly find reasons to explain this failure: after all, likelihood does not appear in the wide variety of linguistic tests proposed for identifying linguistic structure (Fr...

29 | Bayesian inference for PCFGs via Markov chain Monte Carlo. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics |

25 | Variational Bayesian grammar induction for natural language - Kurihara, Sato - 2006 |

21 | Part of speech tagging in context - Banko, Moore - 2004 |

11 | Linguistics: An Introduction to Linguistic Theory - Fromkin, editor - 2001 |

3 | An introduction to Markov Chain Monte Carlo methods - Besag - 2004 |