## Why doesn’t EM find good HMM POS-taggers (2007)

Venue: EMNLP

Citations: 24 (2 self)

### BibTeX

```bibtex
@inproceedings{Johnson07why,
  author    = {Mark Johnson},
  title     = {Why doesn't {EM} find good {HMM} {POS}-taggers?},
  booktitle = {EMNLP},
  year      = {2007},
  pages     = {296--305}
}
```

### Abstract

This paper investigates why the HMMs estimated by Expectation-Maximization (EM) produce such poor results as Part-of-Speech (POS) taggers. We find that the HMMs estimated by EM generally assign a roughly equal number of word tokens to each hidden state, while the empirical distribution of tokens to POS tags is highly skewed. This motivates a Bayesian approach using a sparse prior to bias the estimator toward such a skewed distribution. We investigate Gibbs Sampling (GS) and Variational Bayes (VB) estimators and show that VB converges faster than GS for this task and that VB significantly improves 1-to-1 tagging accuracy over EM. We also show that EM does nearly as well as VB when the number of hidden HMM states is dramatically reduced. Finally, we note the high variance of all of these estimators, and that they require many more iterations to approach convergence than is usually assumed.
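The sparsity effect described in the abstract can be illustrated with a minimal sketch (not the paper's code) contrasting EM's M-step update with the mean-field Variational Bayes update under a symmetric Dirichlet prior with concentration `alpha`. The digamma-based VB update suppresses low-count outcomes when `alpha < 1`, biasing the estimate toward the skewed distributions the abstract describes. The counts below are illustrative, not taken from the paper.

```python
import math

def digamma(x):
    """Approximate the digamma function via recurrence plus asymptotic series."""
    r = 0.0
    while x < 6.0:          # shift x into the range where the series is accurate
        r -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return r + math.log(x) - 0.5 / x - f * (1/12 - f * (1/120 - f / 252))

def em_update(counts):
    """Plain maximum-likelihood M-step: normalize the expected counts."""
    total = sum(counts)
    return [c / total for c in counts]

def vb_update(counts, alpha=0.1):
    """Mean-field VB M-step: exp(digamma(count + alpha)) damping of each count."""
    k = len(counts)
    denom = math.exp(digamma(sum(counts) + k * alpha))
    return [math.exp(digamma(c + alpha)) / denom for c in counts]

counts = [50.0, 3.0, 0.5, 0.1]   # expected counts from a hypothetical E-step
em = em_update(counts)
vb = vb_update(counts)
print(em)   # mass spread in proportion to raw counts
print(vb)   # rare outcomes shrunk much harder, i.e. a more skewed distribution
```

Comparing the two outputs shows why a small `alpha` acts as a sparse prior: the relative mass VB assigns to the rarest outcome is far below its EM share.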

### Citations

2124 | Building a large annotated corpus of English: The Penn Treebank - Marcus, Santorini, et al. - 1993

Citation context: "…experiments described below have the same basic structure: an estimator is used to infer a bitag HMM from the unsupervised training corpus (the words of the Penn Treebank (PTB) Wall Street Journal corpus (Marcus et al., 1993)), and then the resulting model is used to label each word of that corpus with one of the HMM's hidden states. This section describes how we evaluate how well these sequences of hidden states correspond…"

915 | Monte Carlo Statistical Methods - Robert, Casella - 2005

Citation context: "…Markov Chain Monte Carlo (MCMC) and Variational Bayes (VB). MCMC encompasses a broad range of sampling techniques, including component-wise Gibbs sampling, which is the MCMC technique we used here (Robert and Casella, 2004; Bishop, 2006). In general, MCMC techniques do not produce a single model that characterizes the posterior, but instead produce a stream of samples from the posterior. The application of MCMC techniques…"

838 | An introduction to variational methods for graphical models - Jordan, Ghahramani, et al. - 1999 |

744 | Statistical Methods for Speech Recognition - Jelinek - 1997

Citation context: "…identical. 3 Maximum Likelihood via Expectation-Maximization There are several excellent textbook presentations of Hidden Markov Models and the Forward-Backward algorithm for Expectation-Maximization (Jelinek, 1997; Manning and Schütze, 1999; Bishop, 2006), so we do not cover them in detail here. Conceptually, a Hidden Markov Model generates a sequence of observations x = (x0, . . . , xn) (here, the words of the…"
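The generative story named in the context above can be sketched concretely: a bitag HMM produces words x = (x0, …, xn) by walking a Markov chain over hidden states (tags) and emitting a word from each state's distribution. The two-tag toy parameters below are assumptions for illustration only, not taken from the paper.

```python
import random

TRANS = {"DT": {"NN": 0.9, "DT": 0.1},   # P(next tag | current tag)
         "NN": {"DT": 0.4, "NN": 0.6}}
EMIT  = {"DT": {"the": 0.7, "a": 0.3},   # P(word | tag)
         "NN": {"dog": 0.5, "cat": 0.5}}

def sample(dist, rng):
    """Draw one outcome from a {outcome: probability} dict."""
    r, acc = rng.random(), 0.0
    for outcome, p in dist.items():
        acc += p
        if r < acc:
            return outcome
    return outcome   # guard against floating-point round-off

def generate(n, rng=None, start="DT"):
    """Generate n (word, tag) pairs from the toy HMM."""
    rng = rng or random.Random(0)
    tag, pairs = start, []
    for _ in range(n):
        word = sample(EMIT[tag], rng)   # emit a word from the current state
        pairs.append((word, tag))
        tag = sample(TRANS[tag], rng)   # then transition to the next state
    return pairs
```

Unsupervised estimation inverts this story: only the words are observed, and the estimator must recover plausible transition and emission parameters.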

496 | Factorial hidden Markov models - Ghahramani, Jordan - 1997 |

350 | Feature-rich part-of-speech tagging with a cyclic dependency network - Toutanova, Klein, et al. - 2003 |

248 | Tagging English text with a probabilistic model - Merialdo - 1994

Citation context: "…approach convergence than usually thought. 1 Introduction It is well known that Expectation-Maximization (EM) performs poorly in unsupervised induction of linguistic structure (Carroll and Charniak, 1992; Merialdo, 1994; Klein, 2005; Smith, 2006). In retrospect one can certainly find reasons to explain this failure: after all, likelihood does not appear in the wide variety of linguistic tests proposed for identifying…"

119 | A fully Bayesian approach to unsupervised part-of-speech tagging - Goldwater, Griffiths - 2007

Citation context: "…inferring POS tagging models has focused on semi-supervised methods in which the learner is provided with a lexicon specifying the possible tags for each word (Merialdo, 1994; Smith and Eisner, 2005; Goldwater and Griffiths, 2007) or a small number of 'prototypes' for each POS (Haghighi and Klein, 2006). In the context of semi-supervised learning using a tag lexicon, Wang and Schuurmans (2005) observe discrepancies between the…"

98 | Two experiments on learning probabilistic dependency grammars from corpora - Carroll, Charniak - 1992

Citation context: "…many more iterations to approach convergence than usually thought. 1 Introduction It is well known that Expectation-Maximization (EM) performs poorly in unsupervised induction of linguistic structure (Carroll and Charniak, 1992; Merialdo, 1994; Klein, 2005; Smith, 2006). In retrospect one can certainly find reasons to explain this failure: after all, likelihood does not appear in the wide variety of linguistic tests proposed…"

87 | A bit of progress in language modeling - Goodman - 2001

Citation context: "…ways still poorly understood; for example, smoothing is generally regarded as essential for higher-order HMMs, yet it is not clear how to integrate smoothing into unsupervised estimation procedures (Goodman, 2001; Wang and Schuurmans, 2005). Most previous work exploiting unsupervised training data for inferring POS tagging models has focused on semi-supervised methods in which the learner is provided with…"

80 | Ensemble learning for hidden markov models - MacKay - 1997 |

35 | An entropic estimator for structure discovery - Brand - 1999 |

29 | Bayesian inference for PCFGs via Markov chain Monte Carlo |

29 | Novel estimation methods for unsupervised discovery of latent structure in natural language text - Smith, Eisner - 2007

Citation context: "…thought. 1 Introduction It is well known that Expectation-Maximization (EM) performs poorly in unsupervised induction of linguistic structure (Carroll and Charniak, 1992; Merialdo, 1994; Klein, 2005; Smith, 2006). In retrospect one can certainly find reasons to explain this failure: after all, likelihood does not appear in the wide variety of linguistic tests proposed for identifying linguistic structure (Fr…"

25 | Variational Bayesian grammar induction for natural language - Kurihara, Sato - 2006 |

20 | Part of speech tagging in context - Banko, Moore - 2004 |

8 | Linguistics: An Introduction to Linguistic Theory - Fromkin - 2000 |

3 | An introduction to Markov Chain Monte Carlo methods - Besag |