## Factored sparse inverse covariance matrices (2000)

Venue: Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing

Citations: 38 (10 self)

### BibTeX

@INPROCEEDINGS{Bilmes00factoredsparse,

author = {Jeff A. Bilmes},

title = {Factored sparse inverse covariance matrices},

booktitle = {Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing},

year = {2000}

}

### Abstract

Most HMM-based speech recognition systems use Gaussian mixtures as observation probability density functions. An important goal in all such systems is to improve parsimony. One method is to adjust the type of covariance matrices used. In this work, factored sparse inverse covariance matrices are introduced. Based on $U'DU$ factorization, the inverse covariance matrix can be represented using linear regressive coefficients which 1) correspond to sparse patterns in the inverse covariance matrix (and therefore represent conditional independence properties of the Gaussian), and 2) result in a method of partial tying of the covariance matrices without requiring non-linear EM update equations. Results show that the performance of full-covariance Gaussians can be matched by factored sparse inverse covariance Gaussians having significantly fewer parameters.
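As a rough illustration of the factorization named in the abstract, here is a minimal sketch in plain Python. It assumes that $U$ is unit upper triangular and $D$ is diagonal, which the abstract does not state explicitly. The names `udu_factor`, `reconstruct`, and the 3×3 matrix `K` are hypothetical and chosen only for this example; this is not the paper's training procedure.

```python
def udu_factor(K):
    """Factor a symmetric positive-definite matrix K as U^T D U.

    Assumption for this sketch: U is unit upper triangular and D is diagonal.
    Returns (U, d) where d holds the diagonal of D.
    """
    n = len(K)
    # L is unit lower triangular with K = L D L^T; U is then L transposed.
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    d = [0.0] * n
    for j in range(n):
        d[j] = K[j][j] - sum(L[j][k] ** 2 * d[k] for k in range(j))
        for i in range(j + 1, n):
            s = sum(L[i][k] * L[j][k] * d[k] for k in range(j))
            L[i][j] = (K[i][j] - s) / d[j]
    U = [[L[j][i] for j in range(n)] for i in range(n)]
    return U, d


def reconstruct(U, d):
    """Rebuild K = U^T D U from the factors."""
    n = len(d)
    return [[sum(U[k][i] * d[k] * U[k][j] for k in range(n))
             for j in range(n)] for i in range(n)]


# Hypothetical sparse (tridiagonal) inverse covariance matrix.
K = [[2.0, -1.0, 0.0],
     [-1.0, 2.0, -1.0],
     [0.0, -1.0, 2.0]]

U, d = udu_factor(K)

# B = I - U collects the regression coefficients: x_i is predicted from
# x_j with j > i, and the residual has variance 1 / d[i].
B = [[(1.0 if i == j else 0.0) - U[i][j] for j in range(3)] for i in range(3)]

K_rec = reconstruct(U, d)
print("U =", U)
print("d =", d)
print("B =", B)
```

In this banded example the zero at `K[0][2]` shows up as a zero coefficient `B[0][2]`. For other sparsity patterns the factor can contain non-zeros where `K` has zeros, so the correspondence should be checked case by case.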

### Citations

1103 | Numerical Modelling of the
- S, ERRAUD, et al.
- 2000
Citation Context: ...in elements of the inverse covariance matrix to be zero, the number of parameters in the system can be reduced. This is the idea behind covariance selection, originally advocated in [4], described in [6, 9, 13], and proposed for speech in [1, 3]. Also, in [14], a procedure was given for learning the structure of mixtures of Bayesian networks which, in that work, corresponded to mixtures of Gaussians with sp...

362 | Towards optimal feature selection
- Koller, Sahami
- 1996
Citation Context: ...certain elements of […] to be non-zero, and trains the result. This is somewhat analogous to a forward approach [4] to covariance selection (and also similar to the forward feature selection procedure of [7]). This strategy has the advantage that it requires only a simple boot system containing relatively few but robustly estimated parameters. Adding additional dependencies can be seen as correcting defi...

187 | Covariance selection
- Dempster
- 1972
Citation Context: .... By forcing certain elements of the inverse covariance matrix to be zero, the number of parameters in the system can be reduced. This is the idea behind covariance selection, originally advocated in [4], described in [6, 9, 13], and proposed for speech in [1, 3]. Also, in [14], a procedure was given for learning the structure of mixtures of Bayesian networks which, in that work, corresponded to mixt...

181 | Semi-tied covariance matrices for hidden Markov models
- Gales
- 1999
Citation Context: ...ly, various matrix decomposition methods of the form $A \Lambda A^{T}$ (where $\Lambda$ is diagonal and $A$ is an arbitrary matrix) have been applied to covariance matrices along with different styles of partial parameter tying [5, 10, 16]. These methods could collectively be called partially tied covariance matrices since only a portion of the covariance matrix is not tied and remains uniquely associated with each mixture component of...

85 | Buried Markov models for speech recognition
- Bilmes
- 1999
Citation Context: ...olated-word, telephone-speech database [11]. Data is represented using 12 MFCCs plus […] and deltas resulting in a […] element feature vector sampled every 10 ms. The training and test sets are as defined in [2]. Test words do not occur in the training vocabulary, so test word models are constructed using phone models learned during training. Strictly left-to-right transition matrices were used except for an...

76 | A review of large-vocabulary continuous-speech recognition
- Young
- 1996
Citation Context: ... state-of-the-art speech recognition systems represent the joint distribution of features for each utterance using hidden Markov models (HMMs) with multivariate Gaussian mixture observation densities [15]. An important goal for designers of automatic speech recognition (ASR) systems is to achieve a high level of performance while minimizing the number of parameters used by the system. One way of contr...

74 | Linear predictive hidden Markov models and the speech signal
- Poritz
- 1982
Citation Context: ... is like a conditional Gaussian distribution with conditioning variables coming from the same feature vector rather than from somewhere else in time. The same optimization procedures as those used in [1, 2, 12] can therefore be used here. As in other matrix decompositions [5, 16], the covariance matrix for a particular mixture component and HMM state can be represented as $\Sigma_m^{-1} = U_r' D_m U_r$ where $U_r$ may be tied together...

60 | PhoneBook: A phonetically-rich isolated-word telephone-speech database
- Pitrelli, Fong, et al.
- 1995
Citation Context: ...owever, only the latter approach is evaluated. 5. RESULTS Speech recognition results were obtained using NYNEX PHONEBOOK, a large-vocabulary, phonetically-rich, isolated-word, telephone-speech database [11]. Data is represented using 12 MFCCs plus […] and deltas resulting in a […] element feature vector sampled every 10 ms. The training and test sets are as defined in [2]. Test words do not occur in the traini...

53 | Natural Statistical Models for Automatic Speech Recognition
- Bilmes
- 1999
Citation Context: ...ber of parameters) is by adjusting the inherent statistical dependencies made by a probabilistic model. Ideally, only the important statistical dependencies in the training data should be represented [1] and the direct relationships between the remaining random variables should be left unspecified. Covariance matrices are no exception to this rule. In general, the location of any zeros in the inverse...

27 | The importance of cepstral parameter correlations in speech recognition
- Ljolje
- 1994
Citation Context: ...mber of components. The alternative, requiring many more parameters, has been to use full covariance matrices where each component corresponds to a more complex distribution. It has been demonstrated [10] that, at least for the standard features used for speech recognition (cepstral features), representing correlation explicitly by including non-zero off-diagonal covariance elements can improve word a...

21 | Learning mixtures of Bayesian networks
- Thiesson, Meek, et al.
- 1997
Citation Context: ... the number of parameters in the system can be reduced. This is the idea behind covariance selection, originally advocated in [4], described in [6, 9, 13], and proposed for speech in [1, 3]. Also, in [14], a procedure was given for learning the structure of mixtures of Bayesian networks which, in that work, corresponded to mixtures of Gaussians with sparse inverse covariance matrices. This paper intro...

5 | Model selection in acoustic modeling. EUROSPEECH
- Chen, Gopinath
- 1999
Citation Context: ...atrix to be zero, the number of parameters in the system can be reduced. This is the idea behind covariance selection, originally advocated in [4], described in [6, 9, 13], and proposed for speech in [1, 3]. Also, in [14], a procedure was given for learning the structure of mixtures of Bayesian networks which, in that work, corresponded to mixtures of Gaussians with sparse inverse covariance matrices. T...

3 | Joint estimation of feature transformation parameters and Gaussian mixture model for speaker identification
- Yuo, Wang
- 1999
Citation Context: ...ly, various matrix decomposition methods of the form $A \Lambda A^{T}$ (where $\Lambda$ is diagonal and $A$ is an arbitrary matrix) have been applied to covariance matrices along with different styles of partial parameter tying [5, 10, 16]. These methods could collectively be called partially tied covariance matrices since only a portion of the covariance matrix is not tied and remains uniquely associated with each mixture component of...

2 | Covariance Selection
- Knuiman
- 1978
Citation Context: ...in elements of the inverse covariance matrix to be zero, the number of parameters in the system can be reduced. This is the idea behind covariance selection, originally advocated in [4], described in [6, 9, 13], and proposed for speech in [1, 3]. Also, in [14], a procedure was given for learning the structure of mixtures of Bayesian networks which, in that work, corresponded to mixtures of Gaussians with sp...

2 | Sparse inverse covariance matrices and efficient maximum likelihood classification of hyperspectral data
- Roger
- 1995
Citation Context: ...in elements of the inverse covariance matrix to be zero, the number of parameters in the system can be reduced. This is the idea behind covariance selection, originally advocated in [4], described in [6, 9, 13], and proposed for speech in [1, 3]. Also, in [14], a procedure was given for learning the structure of mixtures of Bayesian networks which, in that work, corresponded to mixtures of Gaussians with sp...

1 | A flexible method of creating HMM using block-diagonalization of covariance matrices
- Koshiba, Tachimori, et al.
- 1998
Citation Context: ... a variety of choices for covariance structure other than diagonal or full, some of which have been previously used as HMM state-conditioned observation densities. Two examples include block-diagonal [8] and banded-diagonal matrices. Another method often used by ASR systems to reduce parameters (and thereby increase estimation robustness) is tying, where certain parameters are shared amongst a number...