## Probabilistic independence networks for hidden Markov probability models (1996)


Citations: 173 (12 self)

### BibTeX

```bibtex
@MISC{Smyth96probabilisticindependence,
  author = {Padhraic Smyth and David Heckerman and Michael I. Jordan},
  title  = {Probabilistic independence networks for hidden Markov probability models},
  year   = {1996}
}
```


### Abstract

Graphical techniques for modeling the dependencies of random variables have been explored in a variety of different areas including statistics, statistical physics, artificial intelligence, speech recognition, image processing, and genetics. Formalisms for manipulating these models have been developed relatively independently in these research communities. In this paper we explore hidden Markov models (HMMs) and related structures within the general framework of probabilistic independence networks (PINs). The paper contains a self-contained review of the basic principles of PINs. It is shown that the well-known forward-backward (F-B) and Viterbi algorithms for HMMs are special cases of more general inference algorithms for arbitrary PINs. Furthermore, the existence of inference and estimation algorithms for more general graphical models provides a set of analysis tools for HMM practitioners who wish to explore a richer class of HMM structures. Examples of relatively complex models to handle sensor fusion and coarticulation in speech recognition are introduced and treated within the graphical model framework to illustrate the advantages of the general approach.
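The abstract's central claim, that the forward-backward (F-B) recursions are a special case of general PIN inference, can be illustrated with a minimal sketch for a discrete HMM; the transition and emission matrices below are hypothetical, chosen only for illustration and not taken from the paper:

```python
import numpy as np

def forward_backward(pi, A, B, obs):
    """Posterior state marginals p(H_t | O_1..O_N) for a discrete HMM.

    pi:  (m,) initial state distribution
    A:   (m, m) transition matrix, A[i, j] = p(H_t = j | H_{t-1} = i)
    B:   (m, k) emission matrix,   B[i, o] = p(O_t = o | H_t = i)
    obs: sequence of observation indices
    """
    N, m = len(obs), len(pi)
    alpha = np.zeros((N, m))
    beta = np.ones((N, m))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, N):                       # forward pass
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    for t in range(N - 2, -1, -1):              # backward pass
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    gamma = alpha * beta                        # unnormalized posteriors
    return gamma / gamma.sum(axis=1, keepdims=True)

# Hypothetical 2-state, 2-symbol example
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
post = forward_backward(pi, A, B, [0, 0, 1])
```

Running the JLO junction tree algorithm on the equivalent PIN would recover the same posterior marginals, which is the correspondence the paper develops.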

### Citations

9054 | Maximum likelihood from incomplete data via the EM algorithm - Dempster, Laird, et al. - 1977 |

7493 | Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference - Pearl - 1988 |

Citation Context: ...contains the variables in the intersection of the two cliques which it links. Given a junction tree representation, one can factorize p(U) as the product of clique marginals over separator marginals (Pearl 1988): p(u) = \prod_{C \in V_C} p(x_C) / \prod_{S \in V_S} p(x_S), where p(x_C) and p(x_S) are the marginal (joint) distributions for the variables in clique C and separator S respectively, and V_C and V_S are the sets of cliques and sepa... |
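The clique/separator factorization quoted in this context can be checked numerically on the smallest nontrivial case: a three-variable chain X1 - X2 - X3 with cliques {X1, X2} and {X2, X3} and separator {X2}. The probabilities below are made up for illustration and are not from the paper:

```python
import numpy as np

# Hypothetical joint over three binary variables, built as
# p(x1) p(x2|x1) p(x3|x2) so the chain independences hold exactly.
p1 = np.array([0.3, 0.7])
p2g1 = np.array([[0.8, 0.2], [0.4, 0.6]])   # p(x2 | x1)
p3g2 = np.array([[0.5, 0.5], [0.1, 0.9]])   # p(x3 | x2)
joint = p1[:, None, None] * p2g1[:, :, None] * p3g2[None, :, :]

# Clique marginals p(x1, x2), p(x2, x3) and separator marginal p(x2)
p12 = joint.sum(axis=2)
p23 = joint.sum(axis=0)
p2 = joint.sum(axis=(0, 2))

# Junction tree identity: product of clique marginals over separator marginals
reconstructed = p12[:, :, None] * p23[None, :, :] / p2[None, :, None]
assert np.allclose(reconstructed, joint)
```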

4597 | A tutorial on hidden Markov models and selected applications in speech processing - Rabiner - 1989 |

Citation Context: ...special cases of inference algorithms for UPINs and can be considerably less efficient (Shachter et al. 1994). 4 Modeling HMMs as PINs 4.1 PINs for HMMs In hidden Markov modeling problems (Poritz 1988; Rabiner 1989) we are interested in the set of random variables U = {H_1, O_1, H_2, O_2, ..., H_{N-1}, O_{N-1}, H_N, O_N}, where H_i is a discrete-valued hidden variable at index i, and O_i is the corresponding discrete-valued obs... |

4055 | Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images - Geman, Geman - 1984 |

Citation Context: ...ical specifications of a particular probability model consistent with the UPIN structure. Terms used in the literature to describe UPINs of one form or another include Markov random fields (Isham 1981, Geman and Geman 1984), Markov networks (Pearl 1988), Boltzmann machines (Hinton and Sejnowski 1986), and log-linear models (Bishop, Fienberg, & Holland 1973). 3.1.1 Conditional Independence Semantics of UPIN Structures L... |

2771 | Estimating the dimension of a model - Schwarz - 1978 |

1349 | Local computations with probabilities on graphical structures and their application to expert systems (with discussion) - Lauritzen, Spiegelhalter - 1988 |

1014 | Quantum Field Theory - Itzykson, Zuber - 1980 |

Citation Context: ...e statistical physics literature, where undirected graphical models in the form of chains, trees, lattices, and "decorated" variations on chains and trees have been studied for many years (see, e.g., Itzykson and Drouffe, 1991). The general methods developed there, notably the transfer matrix formalism (e.g., Morgenstern and Binder, 1983), support exact calculations on general undirected graphs. The transfer matrix recursi... |

981 | An Introduction to Bayesian Networks - Jensen - 1996 |

Citation Context: ..., adding links if necessary. If no node can be eliminated without adding links, then we choose the node that can be eliminated by adding the links that yield the clique with the smallest state-space (Jensen 1995). After triangulation the JLO algorithm constructs a junction tree from G', i.e., a clique tree satisfying the running intersection property. The junction tree construction is based on the ... |

953 | Learning Bayesian networks: The combination of knowledge and statistical data - Heckerman, Geiger, et al. - 1995 |

826 | Optimal Statistical Decisions - DeGroot - 1970 |

Citation Context: ...Dirichlet distribution for the parameters of discrete variables and the mixing coefficients of Gaussian-mixture codebooks, and the normal-Wishart distribution for the parameters of Gaussian codebooks (DeGroot 1970; Buntine 1994; Heckerman and Geiger 1995). These priors have also been used in MAP estimates of standard HMMs (e.g., Gauvain and Lee, 1994). Heckerman and Geiger (1995) describe a simple method for a... |

594 | Probabilistic inference using Markov chain Monte Carlo methods - Neal - 1993 |

Citation Context: ...maximum-a-posteriori (MAP), or full Bayesian methods, using traditional techniques such as gradient descent, expectation-maximization (EM) (e.g., Dempster et al., 1977), and Monte Carlo sampling (e.g., Neal, 1993). For the standard HMM(1,1) model discussed in this paper, where either discrete, Gaussian, or Gaussian-mixture codebooks are used, a ML or MAP estimate using EM is a well-known efficient approach (Por... |

582 | Bayesian interpolation - MacKay - 1991 |

Citation Context: ...the observation that, under certain conditions, the quantity p(\theta_s | S) p(D | \theta_s, S) converges to a multivariate Gaussian distribution as the sample size increases (see, e.g., Kass et al., 1988, and MacKay, 1992ab). Less accurate but more efficient approximations are based on the observation that the Gaussian distribution converges to a delta function centered at the maximum-a-posteriori (MAP) and eventually... |

546 | Maximum a Posteriori Estimation for Multivariate Gaussian Mixture Observations of Markov Chains - Gauvain, Lee - 1994 |

Citation Context: ...ine 1994; Heckerman and Geiger 1995). Heckerman and Geiger (1995) describe a simple method for assessing these priors. These priors have also been used for learning parameters in standard HMMs (e.g., Gauvain and Lee, 1994). Parameter independence is usually not assumed in general for HMM structures. For example, in the HMM(1,1) model, a standard assumption is that p(H_i | H_{i-1}) = p(H_j | H_{j-1}) and p(O_i | H_i) = p(O... |

522 | Discrete Multivariate Analysis: Theory and Practice - Bishop, Fienberg, et al. - 1975 |

Citation Context: ...nt probability distribution, i.e., a marginal representation. An algorithm known as Iterative Proportional Fitting (IPF) is available to perform this conversion. Classically, IPF proceeds as follows (Bishop, Fienberg, & Holland, 1973). Suppose for simplicity that all of the random variables are discrete (a Gaussian version of IPF is also available (Whittaker 1990)) such that the joint distribution can be represented as a table. T... |
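The classical IPF sweep described in this context can be sketched for two overlapping cliques over three binary variables; each step rescales the current table so that one clique marginal matches its target. The target marginals below are hypothetical, chosen only so that they agree on the shared variable X2:

```python
import numpy as np

# Hypothetical target clique marginals for cliques {X1, X2} and {X2, X3}.
# Both imply the same marginal over the shared variable X2 ([0.5, 0.5]).
t12 = np.array([[0.2, 0.1], [0.3, 0.4]])   # target p(x1, x2)
t23 = np.array([[0.2, 0.3], [0.1, 0.4]])   # target p(x2, x3)

p = np.full((2, 2, 2), 1 / 8)              # start from the uniform table
for _ in range(50):
    # Rescale so p(x1, x2) matches t12
    p *= (t12 / p.sum(axis=2))[:, :, None]
    # Rescale so p(x2, x3) matches t23
    p *= (t23 / p.sum(axis=0))[None, :, :]

assert np.allclose(p.sum(axis=2), t12)
assert np.allclose(p.sum(axis=0), t23)
```

Because the two cliques here form a tree (a decomposable model), IPF converges exactly after a single sweep; on non-decomposable structures the updates must be iterated to convergence.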

515 | Factorial hidden markov models - Ghahramani, Jordan - 1997 |

467 | Graphical models in applied multivariate statistics - Whittaker - 1990 |

Citation Context: ...er we restrict our attention to discrete-valued random variables; however, many of the results stated generalize directly to continuous and mixed sets of random variables (Lauritzen and Wermuth 1989; Whittaker 1990). Let lower case x_1 denote one of the values of variable X_1; the notation \sum_{x_1} is taken to mean the sum over all possible values of X_1. Let p(x_i) be shorthand for the particular probability p(X_i = x... |

429 | A Practical Bayesian Framework for Backpropagation Networks - MacKay - 1992 |

Citation Context: ...the observation that, under certain conditions, the quantity p(\theta_s | S) p(D | \theta_s, S) converges to a multivariate Gaussian distribution as the sample size increases (see, e.g., Kass et al., 1988, and MacKay, 1992ab). Less accurate but more efficient approximations are based on the observation that the Gaussian distribution converges to a delta function centered at the maximum-a-posteriori (MAP) and eventually... |

395 | Statistical inference for probabilistic functions of finite state Markov chains - Baum, Petrie - 1966 |

Citation Context: ...re in fact special cases of inference algorithms for UPINs and can be considerably less efficient (Shachter et al. 1994). 4 Modeling HMMs as PINs 4.1 PINs for HMMs In hidden Markov modeling problems (Baum and Petrie 1966; Poritz 1988; Rabiner 1989; Huang, Ariki, and Jack 1990; Elliott, Aggoun, and Moore 1995) we are interested in the set of random variables U = {H_1, O_1, H_2, O_2, ..., H_{N-1}, O_{N-1}, H_N, O_N}... |
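For the HMM(1,1) variable set U = {H_1, O_1, ..., H_N, O_N} quoted in this context, the Viterbi algorithm (which the paper identifies as a special case of max-propagation inference on the corresponding PIN) can be sketched as follows; the parameters are hypothetical, for illustration only:

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Most likely hidden state sequence argmax_h p(h, o) for a discrete HMM."""
    N, m = len(obs), len(pi)
    delta = np.zeros((N, m))            # best log-probability ending in each state
    psi = np.zeros((N, m), dtype=int)   # backpointers
    delta[0] = np.log(pi) + np.log(B[:, obs[0]])
    for t in range(1, N):
        scores = delta[t - 1][:, None] + np.log(A)   # scores[i, j]: from i to j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + np.log(B[:, obs[t]])
    path = [int(delta[-1].argmax())]
    for t in range(N - 1, 0, -1):       # backtrack along the stored pointers
        path.append(int(psi[t][path[-1]]))
    return path[::-1]

# Hypothetical 2-state, 2-symbol example
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
path = viterbi(pi, A, B, [0, 0, 1, 1])  # -> [0, 0, 1, 1]
```

Structurally this is the forward recursion with the sum replaced by a max, mirroring how max-propagation replaces sum-propagation in the junction tree.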

323 | Bayesian Model Selection in Social Research - Raftery - 1995 |

Citation Context: ...). The BIC score is the additive inverse of Rissanen's (1987) minimum description length (MDL). Other scores, which can be viewed as approximations to the marginal likelihood, are hypothesis testing (Raftery 1995) and cross validation (Fung and Crawford 1990). [Footnote: The BIC score is derived under the assumption that the parameter prior is positive throughout its domain.] Buntine (in press) provides ... |
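The BIC score discussed in this context approximates the log marginal likelihood by log p(D | theta_hat) - (d/2) log N, penalizing the maximized log-likelihood by half the parameter count times the log sample size. A minimal sketch, with hypothetical likelihoods and parameter counts standing in for two competing HMM structures:

```python
import numpy as np

def bic_score(log_likelihood, num_params, num_samples):
    """BIC approximation to the log marginal likelihood:
    log p(D | model) ~ log p(D | theta_hat) - (d / 2) * log(N)."""
    return log_likelihood - 0.5 * num_params * np.log(num_samples)

# Hypothetical comparison: a richer structure fits slightly better but
# pays a much larger complexity penalty, so the smaller model wins.
score_small = bic_score(log_likelihood=-1040.0, num_params=10, num_samples=500)
score_big = bic_score(log_likelihood=-1030.0, num_params=40, num_samples=500)
```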

305 | Learning and Relearning in Boltzmann Machines - Hinton, Sejnowski - 1986 |

Citation Context: ...IN structure. Terms used in the literature to describe UPINs of one form or another include Markov random fields (Isham 1981, Geman and Geman 1984), Markov networks (Pearl 1988), Boltzmann machines (Hinton and Sejnowski 1986), and log-linear models (Bishop, Fienberg, & Holland 1973). 3.1.1 Conditional Independence Semantics of UPIN Structures Let A, B, and S be any disjoint subsets of nodes in an undirected graph (UG) G.... |

295 | Hidden Markov Models for speech recognition - Huang, Ariki, et al. - 1990 |

293 | Model selection and accounting for model uncertainty in graphical models using Occam’s window - Madigan, Raftery - 1994 |

275 | Nonuniversal critical dynamics in Monte Carlo simulations. Physical Review Letters 58:86–8 - Swendsen, Wang - 1987 |

253 | Operations for learning with graphical models - Buntine - 1994 |

Citation Context: ...e distributions include the normal-Wishart distribution for the parameters of Gaussian codebooks and the Dirichlet distribution for the mixing coefficients of Gaussian-mixture codebooks (DeGroot 1970; Buntine 1994; Heckerman and Geiger 1995). Heckerman and Geiger (1995) describe a simple method for assessing these priors. These priors have also been used for learning parameters in standard HMMs (e.g., Gauvain ... |

206 | Sequential updating of conditional probabilities on directed graphical structures - Spiegelhalter, Lauritzen - 1990 |

179 | A guide to the literature on learning probabilistic networks from data - Buntine - 1996 |

174 | Graphical Models for Association Between Variables, Some of Which Are Qualitative and Some Quantitative - Lauritzen, Wermuth - 1989 |

Citation Context: ...For the purposes of this paper we restrict our attention to discrete-valued random variables; however, many of the results stated generalize directly to continuous and mixed sets of random variables (Lauritzen and Wermuth 1989; Whittaker 1990). Let lower case x_1 denote one of the values of variable X_1; the notation \sum_{x_1} is taken to mean the sum over all possible values of X_1. Let p(x_i) be shorthand for the particular prob... |

169 | Explaining phonetic variation: A sketch of the H&H theory - Lindblom - 1990 |

Citation Context: ...ple, equivalent shifts in formant frequencies can be caused by lip-rounding or tongue-raising; such phenomena are generically referred to as "trading relations" in the speech psychophysics literature (Lindblom 1990; Perkell et al. 1993). Once a particular acoustic pattern is observed, the causes become dependent; thus, for example, evidence that the lips are rounded would act to discount inferences that the tong... |

153 | Bayesian updating in recursive graphical models by local computation - Jensen, Lauritzen, et al. - 1990 |

Citation Context: ...ete one gets a new representation K_f such that the local potential on each clique is f(x_C) = p(x^h_C, e), i.e., the joint probability of the local unobserved clique variables and the observed evidence (Jensen et al. 1990) (similarly for the separator potential functions). If one marginalizes at the clique over the unobserved local clique variables, \sum_{x^h_C} p(x^h_C, e) = p(e), (16) one gets the probability of the observe... |

147 | Independence properties of directed Markov fields - Lauritzen, Dawid, et al. - 1990 |

Citation Context: ... complex interpretation in the directed context: S separates A from B in a directed graph if S separates A from B in the moral (undirected) graph of the smallest ancestral set containing A, B, and S (Lauritzen et al. 1990). It can be shown that this definition of a DPIN structure is equivalent to the more intuitive statement that, given the values of its parents, a variable X_i is independent of all other nodes in the... |

104 | Exploiting tractable substructures in intractable networks - Saul, Jordan - 1996 |

103 | Hidden Markov Models: Estimation and Control - Elliott, Aggoun, et al. - 1995 |

89 | Applications of a general propagation algorithm for probabilistic expert systems - Dawid - 1992 |

67 | Hidden Markov models: A guided tour - Poritz - 1988 |

Citation Context: ... are in fact special cases of inference algorithms for UPINs and can be considerably less efficient (Shachter et al. 1994). 4 Modeling HMMs as PINs 4.1 PINs for HMMs In hidden Markov modeling problems (Poritz 1988; Rabiner 1989) we are interested in the set of random variables U = {H_1, O_1, H_2, O_2, ..., H_{N-1}, O_{N-1}, H_N, O_N}, where H_i is a discrete-valued hidden variable at index i, and O_i is the corresponding discr... |

66 | Maximum A-Posteriori Estimation for Multivariate Gaussian Mixture Observations of Markov Chains - Gauvain, Lee - 1994 |

Citation Context: ...normal-Wishart distribution for the parameters of Gaussian codebooks (DeGroot 1970; Buntine 1994; Heckerman and Geiger 1995). These priors have also been used in MAP estimates of standard HMMs (e.g., Gauvain and Lee, 1994). Heckerman and Geiger (1995) describe a simple method for assessing these priors. The use of the EM algorithm for UPINs is similar. Suppose that the undirected model M consists of cliques C_ij such t... |

56 | Learning Bayesian networks with discrete variables from data - Spirtes, Meek - 1995 |

53 | Boltzmann chains and hidden Markov models - Saul, Jordan - 1995 |

Citation Context: ...istence of the JLO algorithm frees us from having to derive particular recursive algorithms on a case-by-case basis. The first model that we consider can be viewed as a coupling of two HMM(1,1) chains (Saul & Jordan, 1995). Such a model can be useful in general sensor fusion problems, for example in the fusion of an audio signal with a video signal in lipreading. Because different sensory signals generally have differen... |

50 | Prequential analysis, stochastic complexity and Bayesian inference, in Bayesian Statistics 4 - Dawid - 1992 |

Citation Context: ...erse of Rissanen's (1987) minimum description length (MDL). Other scores, which can be viewed as approximations to the marginal likelihood, are hypothesis testing (Raftery 1995) and cross validation (Dawid 1992b). Buntine (in press) provides a comprehensive review of scores for model selection and model averaging in the context of PINs. Another complication with Bayesian model averaging is that there may be... |

44 | Coarticulation in recent speech production models - Kent, Minifie - 1977 |

44 | Global conditioning for probabilistic inference in belief networks - Shachter, Andersen, et al. - 1994 |

43 | A Markov random field model-based approach to image interpretation - Modestino, Zhang - 1992 |

42 | Hidden Markov Models for Fault Detection in Dynamic Systems - Smyth - 1994 |

Citation Context: ...ticular hidden state value given the observed evidence. Inferring the posterior state probabilities is useful when the states have direct physical interpretations (as in fault monitoring applications (Smyth 1994)) and is also implicitly required during the standard Baum-Welch learning algorithm for HMM(1,1). In general, both of these computations scale as m^N where m is the number of states for each hidden v... |

41 | Stochastic Complexity (with discussion) - Rissanen - 1987 |

40 | Constructor: a system for the induction of probabilistic models - Fung, Crawford - 1990 |

Citation Context: ...erse of Rissanen's (1987) minimum description length (MDL). Other scores, which can be viewed as approximations to the marginal likelihood, are hypothesis testing (Raftery 1995) and cross validation (Fung and Crawford 1990). Buntine (in press) provides a comprehensive review of the literature on learning PINs. In the context of HMM(K, J) type structures, an obvious question is how one could learn such structure from da... |

29 | The logic of influence diagrams - PEARL, GEIGER, et al. - 1988 |

25 | Trading relations between tongue-body raising and lip rounding in production of the vowel /u/: A pilot 'motor equivalence' study - Perkell, Matthies, et al. - 1993 |

Citation Context: ... shifts in formant frequencies can be caused by lip-rounding or tongue-raising; such phenomena are generically referred to as "trading relations" in the speech psychophysics literature (Lindblom 1990; Perkell et al. 1993). Once a particular acoustic pattern is observed, the causes become dependent; thus, for example, evidence that the lips are rounded would act to discount inferences that the tongue has been raised. T... |

21 | An Introduction to Spatial Point Processes and Markov Random Fields - Isham - 1981 |

Citation Context: ... clique functions. The clique functions represent the particular parameters associated with the UPIN structure. This corresponds directly to the standard definition of a Markov random field (Isham 1981). The clique functions reflect the relative "compatibility" of the value assignments in the clique. A model p is said to be decomposable if it has a minimal UPIN structure G which is triangulated (Fig... |

20 | On the effective implementation of the iterative proportional fitting procedure - Jirousek, Preucil - 1995 |

19 | Independence properties of directed Markov fields - Lauritzen, Dawid, et al. - 1990 |

Citation Context: ... different interpretation in the directed context: S separates A from B in a directed graph if S separates A from B in the moral (undirected) graph of the smallest ancestral set containing A, B, and S (Lauritzen et al. 1990). It can be shown that this is equivalent to the statement that a variable X_i is independent of all other nodes in the graph except for its descendants, given the values of its parents. Thus, as with... |

19 | Asymptotics in Bayesian computation - Kass, Tierney, et al. - 1988 |

Citation Context: ...icient is one based on the observation that, under certain conditions, the quantity p(\theta_s | S) p(D | \theta_s, S) converges to a multivariate Gaussian distribution as the sample size increases (see, e.g., Kass et al., 1988, and MacKay, 1992ab). Less accurate but more efficient approximations are based on the observation that the Gaussian distribution converges to a delta function centered at the maximum-a-posteriori (M... |