## Subspace constrained Gaussian mixture models for speech recognition (2005)

Venue: IEEE Transactions on Speech and Audio Processing

Citations: 10 (3 self)

### BibTeX

@ARTICLE{Axelrod05subspaceconstrained,
  author  = {Scott Axelrod and Vaibhava Goel and Ramesh A. Gopinath and Peder A. Olsen and Karthik Visweswariah},
  title   = {Subspace constrained {G}aussian mixture models for speech recognition},
  journal = {IEEE Transactions on Speech and Audio Processing},
  year    = {2005},
  note    = {submitted}
}

### Abstract

A standard approach to automatic speech recognition uses hidden Markov models whose state-dependent distributions are Gaussian mixture models. Each Gaussian can be viewed as an exponential model whose features are linear and quadratic monomials in the acoustic vector. We consider here models in which the weight vectors of these exponential models are constrained to lie in an affine subspace shared by all the Gaussians. This class of models includes Gaussian models with linear constraints placed on the precision (inverse covariance) matrices (such as diagonal covariance, MLLT, or EMLLT), as well as the LDA/HLDA models used for feature selection, which tie the part of the Gaussians in the directions not used for discrimination. In this paper we present algorithms for training these models using a maximum likelihood criterion. We present experiments on both small vocabulary, resource-constrained, grammar-based tasks and large vocabulary, unconstrained-resource tasks to explore the rather large parameter space of models that fit within our framework. In particular, we demonstrate that significant improvements can be obtained in both word error rate and computational complexity.
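As a concrete illustration of the model class the abstract describes, the sketch below evaluates the log-density of a Gaussian whose precision matrix is constrained to a subspace of symmetric matrices shared across all Gaussians. The 2-D example, the diagonal basis, and the weight values are hypothetical; this is a minimal sketch of the constraint, not the paper's training code.

```python
import numpy as np

def make_precision(lam, basis):
    """Precision matrix as a linear combination of shared symmetric
    basis matrices: P = sum_k lam[k] * basis[k]."""
    return np.einsum('k,kij->ij', lam, basis)

def log_gaussian(x, mu, lam, basis):
    """Log-density of a Gaussian whose precision lies in the subspace
    spanned by `basis` (the basis is tied across all Gaussians; only
    the coordinates `lam` and mean `mu` are per-Gaussian)."""
    P = make_precision(lam, basis)
    d = x - mu
    sign, logdet = np.linalg.slogdet(P)
    assert sign > 0, "precision must be positive definite"
    return 0.5 * (logdet - len(x) * np.log(2 * np.pi) - d @ P @ d)

# Tiny 2-D example: a basis spanning diagonal matrices, i.e. the
# diagonal-covariance model, the simplest instance of this constraint.
basis = np.array([[[1.0, 0.0], [0.0, 0.0]],
                  [[0.0, 0.0], [0.0, 1.0]]])
val = log_gaussian(np.array([0.5, -0.3]), np.zeros(2),
                   np.array([2.0, 0.5]), basis)
print(val)
```

Richer choices of basis (MLLT, EMLLT, general SPAM) fit the same evaluation code; only `basis` changes.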

### Citations

1064 | The use of multiple measures in taxonomic problems - Fisher

Citation Context: ...ve semi-tied full covariance models with varying numbers of Gaussians and cluster centers. The MLLT technique is often used in conjunction with the Fisher-Rao Linear Discriminant Analysis (LDA) technique [10]–[12] for feature selection. In this technique, one starts with an “unprojected” acoustic data vector with some large number of components. (Such a vector at a given time is typically obtained b...

625 | Perceptual linear predictive (PLP) analysis of speech - Hermansky - 1990

Citation Context: ...LDA projection of vectors which were themselves obtained by concatenating consecutive vectors of mean-normalized MFCC features. Speaker-adaptive VTLN warping factors [36] and FMLLR transformations [37] were trained based on an initial decoding. A trigram language model was used. We took as our baseline model the one with the given BIC penalty and number of Gaussians...

568 | Linear Statistical Inference and Its Applications - Rao - 1965

Citation Context: ...mi-tied full covariance models with varying numbers of Gaussians and cluster centers. The MLLT technique is often used in conjunction with the Fisher-Rao Linear Discriminant Analysis (LDA) technique [10]–[12] for feature selection. In this technique, one starts with an “unprojected” acoustic data vector with some large number of components. (Such a vector at a given time is typically obtained by con...

517 | On the limited memory BFGS method for large scale optimization - Liu, Nocedal - 1989

Citation Context: ...imply taken to be that of the gradient. However, there are many standard algorithms that have much faster convergence. We have used both the conjugate-gradient [27] and limited-memory BFGS techniques [28], which produce comparable results for us in roughly comparable runtime. Both techniques implicitly use information about the Hessian of the objective, without actually having to compute or store the Hessian, to a...
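The limited-memory BFGS optimization referenced in this context can be sketched with SciPy's `L-BFGS-B` implementation. The quadratic objective below is a toy stand-in for the paper's (much larger) likelihood objective; its minimum at `[1, 2]` is chosen for the example.

```python
import numpy as np
from scipy.optimize import minimize

# Toy smooth objective with known minimum at [1, 2].
def objective(w):
    return (w[0] - 1.0) ** 2 + 2.0 * (w[1] - 2.0) ** 2

def gradient(w):
    return np.array([2.0 * (w[0] - 1.0), 4.0 * (w[1] - 2.0)])

# L-BFGS builds a low-rank Hessian approximation from recent gradient
# differences, so it never forms or stores the full Hessian -- which is
# exactly why it suits large parameter spaces like the one in the paper.
result = minimize(objective, x0=np.zeros(2), jac=gradient,
                  method='L-BFGS-B')
print(result.x)
```

Conjugate gradient is available the same way via `method='CG'`, matching the paper's observation that the two behave comparably here.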

510 | Probabilistic principal component analysis - Tipping, Bishop - 1999

Citation Context: ...ed for full covariance models – computational cost and potential overtraining due to large numbers of parameters. Approaches such as Factor Analysis [1] and Probabilistic Principal Component Analysis [2] generalize diagonal models by adding a set of rank-one terms to a diagonal covariance model. They have been applied to speech recognition with moderate success [3]–[5]. However, these direct covarian...

434 | Maximum Likelihood Linear Transformations for HMM-Based Speech Recognition - Gales - 1997

Citation Context: ...l vectors, which were themselves obtained by concatenating consecutive vectors of mean-normalized MFCC features. Speaker-adaptive VTLN warping factors [36] and FMLLR transformations [37] were trained based on an initial decoding. A trigram language model was used. We took as our baseline model the one with the given BIC penalty and number of Gaussians, which had an err...

198 | Semi-Tied Covariance Matrices for Hidden Markov Models - Gales - 1999

Citation Context: ...ich the Gaussians are diagonal to be the columns of a linear transform which maximizes likelihood on the training data. One generalization of MLLT models is the “semi-tied full covariance models” of [7], in which the Gaussians are clustered and each cluster has its own basis in which all of its Gaussians are diagonal. Another class of models generalizing the MLLT models is the Extended Maximum Likeli...

181 | Determinant maximization with linear matrix inequality constraints - Vandenberghe, Boyd, et al. - 1998

Citation Context: ...f a PCGMM is the problem of finding a matrix with maximal determinant subject to some linear constraints. Although the discussion presented here is self-contained, the interested reader may refer to [26] and references cited therein for a thorough mathematical discussion of these dual optimization problems. The core of the optimizations for both the tied and untied parameters uses a gradient-based se...

174 | Estimating the Dimension of a Model, The Annals of Statistics - Schwarz - 1978

Citation Context: ...es. The simplest approach is to assign the same number of Gaussians to each state. However, a somewhat more principled, but still tractable, approach based on the Bayesian Information Criterion (BIC) [33] has been shown to yield more accurate speech recognition systems [34]. In our context, we can apply BIC to determine the most likely number of Gaussian components for the untied model for each state...

159 | Fundamentals of Statistical Exponential Families with Applications in Statistical Decision Theory - Brown - 1986

Citation Context: ...n the future [24]. APPENDIX I: DEFINITION AND SIMPLE PROPERTIES OF EXPONENTIAL MODELS. We first give some background about distributions in the exponential family. For further detail, see, for example, [38]. An exponential model associated with a feature function and weight vector is the probability dens...
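The definition quoted above is garbled by extraction; in standard exponential-family notation (a reconstruction, not the paper's own symbols) it reads:

```latex
p_\lambda(x) = \exp\bigl(\lambda \cdot \phi(x) - \log Z(\lambda)\bigr),
\qquad
Z(\lambda) = \int \exp\bigl(\lambda \cdot \phi(x)\bigr)\,dx .
```

For a Gaussian, the feature function \(\phi(x)\) collects the linear and quadratic monomials \(x_i\) and \(x_i x_j\), so the weight vector \(\lambda\) packs the precision matrix and the precision-weighted mean; constraining \(\lambda\) to a shared affine subspace is exactly the model class of this paper.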

103 | EM algorithms for ML factor analysis - Rubin, Thayer, et al. - 1996

Citation Context: ...is then leads back to the same problems that occurred for full covariance models – computational cost and potential overtraining due to large numbers of parameters. Approaches such as Factor Analysis [1] and Probabilistic Principal Component Analysis [2] generalize diagonal models by adding a set of rank-one terms to a diagonal covariance model. They have been applied to speech recognition with moder...

103 | Maximum likelihood modeling with Gaussian distributions for classification - Gopinath - 1998

Citation Context: ...xity problems. One approach that yields significant improvements in accuracy over simple diagonal modeling at minimal computational cost is the Maximum Likelihood Linear Transformation (MLLT) technique [6], where one still uses diagonal Gaussians, but chooses the basis in which the Gaussians are diagonal to be the columns of a linear transform which maximizes likelihood on the training data. One general...

91 | Heteroscedastic discriminant analysis and reduced rank HMMs for improved speech recognition, Speech Communication - Kumar, Andreou - 1998

Citation Context: ...ction of Gaussian distributions which are tied in directions complementary to the projection and which all have equal covariance matrices. The Heteroscedastic Linear Discriminant Analysis (HLDA) models of [14], [15] generalize Campbell’s models for LDA by allowing the covariance matrices in the projected directions to be Gaussian dependent. In this paper we consider state-dependent distributions which are Gaussian mix...

83 | Investigation of Silicon-Auditory Models and Generalization of Linear Discriminant Analysis for Improved Speech Recognition - Kumar - 1997

Citation Context: ...collection of Gaussian distributions which are tied in directions complementary to the projection and which all have equal covariance matrices. The Heteroscedastic Linear Discriminant Analysis (HLDA) models of [14], [15] generalize Campbell’s models for LDA by allowing the covariance matrices in the projected directions to be Gaussian dependent. In this paper we consider state-dependent distributions which are Gaussi...

81 | Maximum Likelihood Discriminant Feature Spaces - Saon, Padmanabhan, et al. - 2000

Citation Context: ...ons were good, we also used the Gaussian-level statistics of the full covariance models to construct LDA and HLDA projection matrices (as well as a successful variant of HLDA presented in [32]). The models gave WERs with less than, usually much less than, a small degradation relative to the best performing of all of the full covariance systems with the same projected dimension...

80 | Computational Methods in Optimization: A Unified Approach - Polak - 1971

Citation Context: ...cent, in which the search direction is simply taken to be that of the gradient. However, there are many standard algorithms that have much faster convergence. We have used both the conjugate-gradient [27] and limited-memory BFGS techniques [28], which produce comparable results for us in roughly comparable runtime. Both techniques implicitly use information about the Hessian of the objective, without actually havi...

77 | The method of projections for finding the common point of convex sets - Gubin, Polyak, et al. - 1967

Citation Context: ...ces whose directions form a basis. We handle the general case in this appendix with a simple adaptation of the method of alternating Projections Onto Convex Sets (POCS). The POCS theorem [39]–[41] states that if there is a point in the intersection of two closed convex sets (both contained in a suitable metric space), then the point can be found from an arbitrary initial point, by alternativel...
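The POCS theorem quoted in this context can be illustrated on two toy convex sets in the plane, a disk and a half-plane (these sets are chosen for the sketch; the paper applies the method to sets of matrices):

```python
import numpy as np

def project_disk(p, center, radius):
    """Nearest point of the disk ||x - center|| <= radius."""
    d = p - center
    n = np.linalg.norm(d)
    return p if n <= radius else center + radius * d / n

def project_halfplane(p):
    """Nearest point of the half-plane x >= 1."""
    return np.array([max(p[0], 1.0), p[1]])

# Alternately project onto each set; since the intersection is
# nonempty, the iterates converge to a point lying in both sets.
p = np.array([-3.0, 2.0])
for _ in range(100):
    p = project_halfplane(project_disk(p, center=np.zeros(2), radius=2.0))
print(p)
```

Each projection is cheap, which is why alternating projections is attractive when the intersection itself is hard to describe directly.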

57 | Vector Quantization for the Efficient Computation of Continuous Density - Bocchieri - 1993

Citation Context: ...COMPARISON OF WER FOR SOME FULL COVARIANCE MODELS DESCRIBED IN SECTION V-E, as defined in (20). The number of Gaussians that we need to evaluate will be reduced by using a Gaussian clustering technique [35]. These experiments were first reported on in [18], which the reader can refer to for further details. Table IV shows the error rate of this model. A comparable error rate of...

38 | Maximum likelihood and minimum classification error factor analysis for automatic speech recognition - Saul, Rahim - 1999

Citation Context: ...listic Principal Component Analysis [2] generalize diagonal models by adding a set of rank-one terms to a diagonal covariance model. They have been applied to speech recognition with moderate success [3]–[5]. However, these direct covariance modeling techniques suffer from computational complexity problems. One approach that yields significant improvements in accuracy over simple diagonal modeling at...

35 | Modeling inverse covariance matrices by basis expansion - Olsen, Gopinath

Citation Context: ...each cluster has its own basis in which all of its Gaussians are diagonal. Another class of models generalizing the MLLT models is the Extended Maximum Likelihood Linear Transformation (EMLLT) models [8], [9]. These models constrain the precision (a.k.a. inverse covariance) matrices to lie in a subspace of the space of symmetric matrices spanned by rank-one matrices, so that they may be writ...
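The EMLLT constraint described in this context, precision matrices built from shared rank-one matrices, can be sketched as below. The dimensions, random directions, and weights are hypothetical; note that for simplicity the weights are taken positive so positive semidefiniteness is automatic, whereas EMLLT allows general weights subject to the precision being positive definite.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, D = 3, 5                             # D > dim, as in EMLLT
directions = rng.standard_normal((D, dim))  # shared directions a_k

def precision_from_weights(w):
    """P = sum_k w[k] * a_k a_k^T.  The weights are per-Gaussian;
    the rank-one directions are tied across all Gaussians."""
    return sum(wk * np.outer(a, a) for wk, a in zip(w, directions))

w = np.abs(rng.standard_normal(D))        # positive weights -> P is PSD
P = precision_from_weights(w)
print(P.shape)
```

Storing only the `D` weights per Gaussian (instead of a full symmetric matrix) is what gives these models their favorable parameter count.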

34 | Modeling with a subspace constraint on inverse covariance matrices - Axelrod, Olsen, et al. - 2002

Citation Context: ...one can indeed obtain an improved complexity/performance curve as each generalization is introduced. Many of the theoretical and experimental results in this paper have appeared in a series of papers [16]–[20], which this paper both summarizes and extends. We have tried to make this paper as self-contained as possible, although we make a few references to the previous papers for technical points. The S...

19 | Canonical variate analysis – a general formulation - Campbell - 1984

Citation Context: ...matrix. The original formulation of LDA is motivated by an attempt to choose features that are most likely to have power to discriminate between states. Subsequently, Campbell [13] showed that the projection matrix may be calculated by maximizing the likelihood of a collection of Gaussian distributions which are tied in directions complementary to the projection and which all have...

18 | Factor analysis invariant to linear transformations of data - Gopinath, Ramabhadran, et al.

17 | Finding the common point of convex sets by the method of successive projection - Bregman - 1965

Citation Context: ...matrices whose directions form a basis. We handle the general case in this appendix with a simple adaptation of the method of alternating Projections Onto Convex Sets (POCS). The POCS theorem [39]–[41] states that if there is a point in the intersection of two closed convex sets (both contained in a suitable metric space), then the point can be found from an arbitrary initial point, by alterna...

16 | Performance of the IBM large vocabulary continuous speech recognition system - Bahl, et al. - 1995

Citation Context: ...the same grammar-based language models, HMM state transition probabilities, and Viterbi decoder, which is passed state-dependent probabilities for each frame vector, obtained by table lookup [31] based on the ranking of probabilities obtained with a constrained Gaussian mixture model. All model training in this section was done using a fixed Viterbi alignment of 300 hours of multi-style train...

15 | Model selection in acoustic modeling - Chen, Gopinath - 1999

Citation Context: ...each state. However, a somewhat more principled, but still tractable, approach based on the Bayesian Information Criterion (BIC) [33] has been shown to yield more accurate speech recognition systems [34]. In our context, we can apply BIC to determine the most likely number of Gaussian components for the untied model for each state, given a fixed tied model. By performing the steepest descent approx...

12 | Maximum likelihood training of subspaces for inverse covariance modeling - Visweswariah, Olsen, et al. - 2003

Citation Context: ...generalization of this to the algorithm described in section IV for efficient full ML training of all parameters of a precision constrained model, although implicit in [16], was first implemented in [17]. Reference [17] also introduced affine EMLLT models and hybrid models (see section III), as well as efficient methods for training all of them. For general SPAM models, reference [18] gave efficient...

10 | Large vocabulary conversational speech recognition with the Extended Maximum Likelihood Linear Transformation (EMLLT) model - Huang, Goel, et al. - 2002

Citation Context: ...an indeed obtain an improved complexity/performance curve as each generalization is introduced. Many of the theoretical and experimental results in this paper have appeared in a series of papers [16]–[20], which this paper both summarizes and extends. We have tried to make this paper as self-contained as possible, although we make a few references to the previous papers for technical points. The SPAM m...

10 | Mixtures of inverse covariances - Vanhoucke, Sankar

Citation Context: ...f precision constrained GMMs. That paper showed that the precision constrained models obtained good improvements over MLLT and EMLLT models with equal per-Gaussian computational cost. Subsequent work [21] also obtained significant gains with precision constrained GMMs (although the subspace basis in [21] was required to consist of positive definite basis matrices and the experiments there did not comp...

9 | Low-resource speech recognition of 500-word vocabularies - Deligne, Eide, et al. - 2001

Citation Context: ...on we report on results of maximum likelihood training for various constrained exponential models. The experiments were all performed with the training and test set and Viterbi decoder reported on in [30] and [8], [9]. Some of the results here appear in [16]–[19]. The test set consists of 73743 words from utterances in four small vocabulary grammar-based tasks (addresses, digits, command, and control)...

7 | Dimensional reduction, covariance modeling and computational complexity in ASR systems - Axelrod, Gopinath, et al. - 2003

Citation Context: ...implemented in [17]. Reference [17] also introduced affine EMLLT models and hybrid models (see section III), as well as efficient methods for training all of them. For general SPAM models, reference [18] gave efficient algorithms for finding an untied basis which approximately maximizes likelihood and for full ML training of the untied parameters given a tied basis. That paper also gave the first com...

7 | Discriminative estimation of subspace precision and mean (SPAM) models - Goel, Axelrod, et al. - 2003

Citation Context: ...d LDA as techniques for dimensional reduction. Full ML training of general subspace constrained Gaussian mixture models was described in [19]. Discriminative training of these models was described in [23] and will be presented in a more systematic and rigorous manner in future work [24]. II. THE MODEL. We consider speech recognition systems consisting of the following components: a frontend which proce...

5 | Acoustic modeling with mixtures of subspace constrained exponential models - Visweswariah, Axelrod, et al. - 2003

Citation Context: ...asis. That paper also gave the first comparison between SPAM and LDA as techniques for dimensional reduction. Full ML training of general subspace constrained Gaussian mixture models was described in [19]. Discriminative training of these models was described in [23] and will be presented in a more systematic and rigorous manner in future work [24]. II. THE MODEL. We consider speech recognition systems...

5 | Toward domain-independent conversational speech recognition - Kingsbury, Mangu, et al. - 2003

Citation Context: ...d GMMs (although the subspace basis in [21] was required to consist of positive definite basis matrices and the experiments there did not compare to the gains obtainable with ordinary MLLT). In [20], [22], precision constrained Gaussian mixture models with speaker-adaptive training were applied to large vocabulary conversational speech recognition. The precision constrained models in the experiments o...

3 | Covariance and precision modeling in shared multiple subspaces - Dharanipragada, Visweswariah - 2003

Citation Context: ...ic Principal Component Analysis [2] generalize diagonal models by adding a set of rank-one terms to a diagonal covariance model. They have been applied to speech recognition with moderate success [3]–[5]. However, these direct covariance modeling techniques suffer from computational complexity problems. One approach that yields significant improvements in accuracy over simple diagonal modeling at min...

2 | The quadratic eigenvalue problem, Society for Industrial and Applied Mathematics - Tisseur, Meerbergen - 2001

Citation Context: ...e search: when optimizing, the precision matrix is a quadratic function of the line search parameter. The parameter values where the determinant is zero are called “quadratic eigenvalues” [29]. They may be readily computed using an eigenvalue decomposition of a certain matrix. To perform efficient line search, we express the determinant of the precision matrix in terms of the...
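The quadratic eigenvalues mentioned in this context, the values t with det(M t² + C t + K) = 0, can be computed by the standard companion linearization (a textbook trick assumed here; see the cited survey), which reduces the quadratic problem to an ordinary eigenvalue problem of twice the size when M is invertible:

```python
import numpy as np

def quadratic_eigenvalues(M, C, K):
    """Roots t of det(M t^2 + C t + K) = 0 via the companion matrix
    [[0, I], [-M^{-1}K, -M^{-1}C]], assuming M is invertible."""
    n = M.shape[0]
    Minv = np.linalg.inv(M)
    companion = np.block([[np.zeros((n, n)), np.eye(n)],
                          [-Minv @ K, -Minv @ C]])
    return np.linalg.eigvals(companion)

# Scalar sanity check: t^2 - 3t + 2 = 0 has roots 1 and 2.
roots = quadratic_eigenvalues(np.array([[1.0]]),
                              np.array([[-3.0]]),
                              np.array([[2.0]]))
print(np.sort(roots.real))
```

Knowing where the determinant vanishes brackets the region where the precision stays positive definite, which is what a safe line search needs.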

1 |
Discriminatively trained acoustic models comprised of mixtures of exponentials with a tied subspace constraint,” in preparation
- Axelrod, Goel, et al.
(Show Context)
Citation Context ... constrained Gaussian mixture models was described in [19]. Discriminative training of these models was described in [23] and will be presented in a more systematic and rigorous manner in future work =-=[24]-=-. II. THE MODEL We consider speech recognition systems consisting of the following components: a frontend which processes a raw input acoustic waveform into a time series of acoustic feature £ vectors... |