## Best-first Model Merging for Hidden Markov Model Induction (1994)

Citations: 97 (7 self)

### BibTeX

@TECHREPORT{Stolcke94best-firstmodel,

author = {Andreas Stolcke and Stephen M. Omohundro},

title = {Best-first Model Merging for Hidden Markov Model Induction},

institution = {},

year = {1994}

}

### Abstract

This report describes a new technique for inducing the structure of Hidden Markov Models from data which is based on the general 'model merging' strategy (Omohundro 1992). The process begins with a maximum likelihood HMM that directly encodes the training data. Successively more general models are produced by merging HMM states. A Bayesian posterior probability criterion is used to determine which states to merge and when to stop generalizing. The procedure may be considered a heuristic search for the HMM structure with the highest posterior probability. We discuss a variety of possible priors for HMMs, as well as a number of approximations which improve the computational efficiency of the algorithm. We studied three applications to evaluate the procedure. The first compares the merging algorithm with the standard Baum-Welch approach in inducing simple finite-state languages from small, positive-only training samples. We found that the merging procedure is more robust and accurate, part...
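The merging loop summarized in the abstract can be illustrated with a small Python sketch. It is not the report's implementation. The count-based likelihood (each sample is assumed to follow the single path it created), the prior that charges a fixed `penalty` per nonzero parameter, and all function names are assumptions made for illustration only.

```python
import math
from collections import Counter
from itertools import combinations

START, END = "START", "END"

def initial_model(samples):
    """Build the model that encodes the data directly: one state per symbol occurrence."""
    trans, emit = Counter(), Counter()
    next_id = 0
    for sample in samples:
        prev = START
        for sym in sample:
            state = next_id
            next_id += 1
            trans[(prev, state)] += 1
            emit[(state, sym)] += 1
            prev = state
        trans[(prev, END)] += 1
    return trans, emit

def log_likelihood(counts):
    """Log-likelihood of the counts under maximum likelihood multinomials per source state."""
    totals = Counter()
    for (src, _), c in counts.items():
        totals[src] += c
    return sum(c * math.log(c / totals[src]) for (src, _), c in counts.items())

def log_posterior(trans, emit, penalty):
    # Assumed prior (hypothetical): every nonzero parameter costs `penalty` nats.
    log_prior = -penalty * (len(trans) + len(emit))
    return log_prior + log_likelihood(trans) + log_likelihood(emit)

def merge(trans, emit, keep, drop):
    """Merge state `drop` into state `keep` by pooling their counts."""
    def rename(q):
        return keep if q == drop else q
    new_trans, new_emit = Counter(), Counter()
    for (a, b), c in trans.items():
        new_trans[(rename(a), rename(b))] += c
    for (q, sym), c in emit.items():
        new_emit[(rename(q), sym)] += c
    return new_trans, new_emit

def best_first_merge(samples, penalty=1.0):
    """Greedily apply the best-scoring merge until no merge improves the score."""
    trans, emit = initial_model(samples)
    score = log_posterior(trans, emit, penalty)
    while True:
        states = sorted({q for q, _ in emit})
        best = None
        for a, b in combinations(states, 2):
            cand = merge(trans, emit, a, b)
            cand_score = log_posterior(*cand, penalty)
            if best is None or cand_score > best[0]:
                best = (cand_score, cand)
        if best is None or best[0] <= score:
            return trans, emit, score
        score, (trans, emit) = best
```

Under these assumptions, `best_first_merge(["ab", "ab"])` collapses the four initial states into two, one emitting `a` and one emitting `b`. Merging those two as well would lower the score, so the search stops there. The `penalty` value controls how strongly the assumed prior favors smaller models.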

### Citations

9138 | Elements of information theory - Cover, Thomas - 1991 |

8919 | Maximum likelihood from incomplete data via the EM algorithm - Dempster, Laird, et al. - 1977 |

4099 | Introduction to Automata Theory, Languages and Computation, 2nd ed - Hopcroft, Motwani, et al. - 2000 |

1262 | Error bounds for convolutional codes and an asymptotically optimum decoding algorithm - Viterbi - 1967 |

1132 | A Bayesian method for the induction of probabilistic networks from data - Cooper, Herskovits - 1992 |

900 | An introduction to hidden Markov models - Rabiner, Juang - 1986 |

837 |
A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains
- Baum, Petrie, et al.
- 1972
Citation Context ...wn statistical techniques. Section 2 defines the HMM formalism and gives an overview of these standard estimation methods. In contrast to traditional HMM estimation based on the Baum-Welch technique (Baum et al. 1970), our method uses Bayesian evidence evaluation to adjust the model topology to the data. The approach is based on the idea that models should evolve from simply storing examples to representing more ... |

736 | Class-based n-gram models of natural language
- Brown, Pietra, et al.
- 1990
Citation Context ...es between classes, whereas emission probabilities correspond to the word distributions for each class (for $n > 2$, higher-order HMMs are required). The incremental word clustering algorithm given in (Brown et al. 1992) then becomes an instance of HMM merging, albeit one that is entirely based on likelihoods. 8 6 Evaluation We have evaluated the HMM merging algorithm experimentally in a series of applications. Such... |

700 | Estimation of probabilities from sparse data for the language model component of a speech recognizer
- Katz
- 1987
Citation Context ...the first one returns probability zero) do not yield consistent probabilities unless they are combined with 'discounting' of probabilities to ensure that the total probability mass sums up to unity (Katz 1987). The discounting scheme, as well as various smoothing approaches (e.g., adding a fixed number of virtual 'Dirichlet' samples into parameter estimates) tend to be specific to the model used, and are ... |

542 | Mixture densities, maximum likelihood, and the EM algorithm - Redner, Walker - 1984 |

483 | Connectionist Speech Recognition: A Hybrid Approach - Bourlard, Morgan - 1994 |

433 |
A universal prior for integers and estimation by minimum description length, The Annals of Statistics 11(2)
- Rissanen
- 1983
Citation Context ...s. Minimum Description Length Especially in the domain of discrete structures, it is useful to remember the standard duality between the Bayesian approach and inference by Minimum Description Length (Rissanen 1983; Wallace & Freeman 1987). Briefly, the maximization of $P(M, X) = P(M)\,P(X \mid M)$ implicit in Bayesian model inference is equivalent to minimizing $-\log P(M, X) = -\log P(M) - \log P(X$... |

381 |
The estimation of stochastic context-free grammars using the Inside-Outside algorithm, Computer Speech and Language
- Lari, Young
- 1990
Citation Context ... The structure finding problem in this domain is even more severe, as standard EM-based estimation methods have great difficulty when presented with unstructured, fully parameterized grammars (Lari & Young 1990; Pereira & Schabes 1992). We have achieved good results inducing SCFGs using a merging heuristic that is a direct generalization of the one used for HMMs; this approach will be described in a forthco... |

369 | A Practical Part-of-speech Tagger
- Cutting, Kupiec, et al.
- 1992
Citation Context ...ure. Some of their first uses were in the area of cryptanalysis and they are now the model of choice for speech recognition (Rabiner & Juang 1986). Recent applications include part-of-speech tagging (Cutting et al. 1992) and protein classification and alignment (Haussler et al. 1992; Baldi et al. 1993). Because HMMs can be seen as probabilistic generalizations of non-deterministic finite-state automata they are also... |

352 |
Interpolated estimation of Markov source parameters from sparse data
- Jelinek, Mercer
- 1980
Citation Context ... data is then used to estimate the parameters, including the mixture weights. This holding-out of training data makes the mixture model approach similar to the deleted interpolation method (Jelinek & Mercer 1980). The main difference is that the component parameters are estimated jointly with the mixture proportions. 16 In our experiments we always used half of the training data in the structure induction... |

314 |
Inductive inference: Theory and methods
- Angluin, Smith
- 1983
Citation Context ...n of state equivalence classes, and as such is pervasively used in much of automata theory (Hopcroft & Ullman 1979). It has also been applied to the induction of non-probabilistic automata (Angluin & Smith 1983). Still in the field of non-probabilistic automata induction, Tomita (1982) has used a simple hill-climbing procedure combined with a goodness measure based on positive/negative samples to search the... |

305 | Inferring Decision Trees Using the Minimum Description Length Principle - Quinlan, Rivest - 1989 |

277 | Inside-Outside Reestimation from Partially Bracketed Corpora - Pereira, Schabes - 1992 |

252 |
AutoClass: A Bayesian Classification System
- Cheeseman, Kelly, et al.
- 1988
Citation Context ... the prior: the prior expectation of $\theta_i$ is $\alpha_i / \alpha_0$, where $\alpha_0 = \sum_i \alpha_i$ is the total prior weight. One important reason for the use of the Dirichlet prior in the case of multinomial parameters (Cheeseman et al. 1988; Cooper & Herskovits 1992; Buntine 1992) is its mathematical expediency. It is a conjugate prior, i.e., of the same functional form as the likelihood function for the multinomial. The likelihood for ... |

198 | Theory of Refinement on Bayesian Networks
- Buntine
- 1991
Citation Context ...y, many aspects of the priors discussed in this section can be found in Bayesian approaches to the induction of graph-based models in other domains (e.g., Bayesian networks (Cooper & Herskovits 1992; Buntine 1991) and decision trees (Buntine 1992)). 3.4.1 Structural vs. parameter priors An HMM can be described in two stages: 1. A model structure or topology is specified as a set of states, transitions and emi... |

197 | Estimation and inference by compact coding - Wallace, Freeman - 1987 |

140 | Hidden Markov model induction by Bayesian model merging - Stolcke, Omohundro - 1993 |

132 | Learning classification trees
- Buntine
- 1992
Citation Context ...ussed in this section can be found in Bayesian approaches to the induction of graph-based models in other domains (e.g., Bayesian networks (Cooper & Herskovits 1992; Buntine 1991) and decision trees (Buntine 1992)). 3.4.1 Structural vs. parameter priors An HMM can be described in two stages: 1. A model structure or topology is specified as a set of states, transitions and emissions. Transitions and emissions ... |

81 |
Bayesian inductive inference and maximum entropy, in Foundations
- Gull
- 1988
Citation Context ...$-|Q|$ for some constant $C > 1$. However, as we will see below, the state-based priors by themselves produce a tendency towards reducing the number of states as a result of Bayesian 'Occam factors' (Gull 1988). In the case of narrow parameter priors we need to specify how the prior probability mass is distributed among all possible model topologies with a given number of states. For practical reasons it i... |

79 | The power of amnesia - Ron, Singer, et al. - 1994 |

75 | A study of grammatical inference - Horning - 1969 |

55 |
Getting started with the DARPA-TIMIT CD-ROM: An acoustic phonetic continuous speech database
- Garofolo
- 1988
Citation Context ...able for HMM modeling. The TIMIT (Texas Instruments-MIT) database is a collection of hand-labeled speech samples compiled for the purpose of training speaker-independent phonetic recognition systems (Garofolo 1988). It contains acoustic data segmented by words and aligned with discrete labels from an alphabet of 62 phones. For our purposes, we ignored the continuous, acoustic data and viewed the database simpl... |

46 |
A statistical model for generating pronunciation networks
- Riley
- 1991
Citation Context ...ules for pronunciations of individual phonemes based on their contexts (e.g., using decision tree induction), which can then be concatenated into networks representing word pronunciations (Chen 1990; Riley 1991). A detailed comparison of the two approaches is desirable, but so far hasn't been carried out. We simply remark that both approaches could be combined by generating allophone sequences from induced ... |

36 | Best-first model merging for dynamic learning and recognition
- Omohundro
- 1992
Citation Context ...003 January 1994 Revised April 1994 Abstract This report describes a new technique for inducing the structure of Hidden Markov Models from data which is based on the general 'model merging' strategy (Omohundro 1992). The process begins with a maximum likelihood HMM that directly encodes the training data. Successively more general models are produced by merging HMM states. A Bayesian posterior probability crite... |

35 | The Berkeley restaurant project
- Jurafsky, Wooters, et al.
- 1994
Citation Context ...t (BeRP). BeRP is medium vocabulary, speaker-independent spontaneous continuous speech understanding system that functions as a consultant for finding restaurants in the city of Berkeley, California (Jurafsky et al. 1994). In this application, the merging algorithm is run on strings of phone labels obtained by Viterbi-aligning previously existing word models to sample speech (using the TIMIT labels as the phone alpha... |

30 | Bayesian learning of Gaussian mixture densities for hidden Markov models - Gauvain, Lee - 1991 |

24 |
Identification of contextual factors for pronunciation networks
- Chen
- 1990
Citation Context ...t induces rules for pronunciations of individual phonemes based on their contexts (e.g., using decision tree induction), which can then be concatenated into networks representing word pronunciations (Chen 1990; Riley 1991). A detailed comparison of the two approaches is desirable, but so far hasn't been carried out. We simply remark that both approaches could be combined by generating allophone sequences f... |

19 | Dynamic construction of finite automata from examples using hill-climbing - Tomita - 1982 |

14 | Hidden Markov Models in molecular biology: New algorithms and applications, this volume
- Baldi, Chauvin, et al.
- 1993
Citation Context ...model of choice for speech recognition (Rabiner & Juang 1986). Recent applications include part-of-speech tagging (Cutting et al. 1992) and protein classification and alignment (Haussler et al. 1992; Baldi et al. 1993). Because HMMs can be seen as probabilistic generalizations of non-deterministic finite-state automata they are also of interest from the point of view of formal language induction. For most modeling... |

8 | Dynamic programming inference of Markov networks from finite set of sample strings - Thomason, Granum - 1986 |

4 |
Implicit learning of artificial grammars
- Reber
- 1967
Citation Context ...ction (set union) operator. 12 This test model was inspired by finite-state models with similar characteristics that have been the subject of investigations into human language learning capabilities (Reber 1969; Cleeremans 1991) [Figure 5: Case study I: HMM generating the test language $ac^*a \cup bc^*b$.] Alternatively, a sample of 20 rando... |

2 |
Mechanisms of Implicit Learning. A Parallel Distributed Processing Model of Sequence Acquisition
- Cleeremans
- 1991
Citation Context ...nion) operator. 12 This test model was inspired by finite-state models with similar characteristics that have been the subject of investigations into human language learning capabilities (Reber 1969; Cleeremans 1991) [Figure 5: Case study I: HMM generating the test language $ac^*a \cup bc^*b$.] Alternatively, a sample of 20 random strings was use... |

2 | Learning automata from ordered examples, Machine Learning 7:109–138 - Porat, Feldman - 1991 |