## Linear Gaussian models for speech recognition (2004)

Venue: | CAMBRIDGE UNIVERSITY |

Citations: | 15 - 0 self |

### BibTeX

@TECHREPORT{Rosti04lineargaussian,

author = {Antti-Veikko Ilmari Rosti},

title = {Linear Gaussian models for speech recognition},

institution = {CAMBRIDGE UNIVERSITY},

year = {2004}

}

### Abstract

Currently the most popular acoustic model for speech recognition is the hidden Markov model (HMM). However, HMMs are based on a series of assumptions, some of which are known to be poor. In particular, the assumption that successive speech frames are conditionally independent given the discrete state that generated them is not a good assumption for speech recognition. State space models may be used to address some shortcomings of this assumption. State space models are based on a continuous state vector evolving through time according to a state evolution ...

### Citations

8539 | Maximum likelihood from incomplete data via the EM algorithm (with discussion)
- Dempster, Laird, et al.
- 1977
Citation Context ...rocesses are based on linear functions and Gaussian distributed noise sources. Linear Gaussian models are popular as many forms may be trained efficiently using the expectation maximisation algorithm [19]. This work generalises these models to include Gaussian mixture models as the noise sources. The observation process in this work will be assumed to be based on factor analysis, although linear discr... |

5086 | Neural Networks for Pattern Recognition
- Bishop
- 1995
Citation Context ...nd caching may be used to increase efficiency [34]. 2.3.2 Maximum Likelihood Parameter Estimation Maximum likelihood estimation is a standard scheme to learn a set of model parameters given some data [11]. The objective is to find parameters, $\hat{\theta}$, that maximise the likelihood function $p(O \mid \theta)$. If the data $O = \{o_1, \ldots, o_N\}$ are assumed independent, the objective function can be written as $p(O \mid \theta) = \prod^{N}$... |

4490 | A tutorial on hidden Markov models and selected applications in speech recognition
- Rabiner
- 1989
Citation Context ...he linear discriminant observation process is illustrated in case of HMM based [33, 45, 70] and linear first-order Gauss-Markov based state evolution processes [111]. The standard hidden Markov model [106, 127] can be viewed as a special case of both the observation processes when $k = p$ by just omitting the observation noise and setting the observation matrix to an identity matrix; that is, $C = I$. Also semi... |

2301 | A new approach to linear filtering and prediction problems - Kalman - 1960 |

1227 | Error bounds for convolutional codes and an asymptotically optimum decoding algorithm
- Viterbi
- 1967
Citation Context ...ical. Instead the word sequence producing the maximum likelihood state sequence is searched for. There is an efficient algorithm to find the maximum likelihood state sequence called the Viterbi algorithm [123]. The Viterbi algorithm is based on a variable $\phi_j(t)$ which represents the maximum likelihood of observing vectors $\{o_1, \ldots, o_t\}$ and being in state $j$ at time $t$. This variable differs from the forwar... |
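
The recursion described in this excerpt can be sketched as follows. This is a minimal illustration rather than the thesis implementation: the function name `viterbi`, the log-domain arithmetic, and the two-state toy model are all assumptions made here, and the per-frame observation likelihoods are supplied directly instead of being computed from an acoustic model.

```python
import math

def viterbi(log_init, log_trans, log_obs):
    """Return (best log-likelihood, best state path).

    log_init[j]     : log probability of starting in state j
    log_trans[i][j] : log probability of moving from state i to state j
    log_obs[t][j]   : log likelihood of frame t given state j
    """
    n_states = len(log_init)
    # phi[j]: maximum log-likelihood of the frames seen so far over all
    # state sequences that end in state j (the phi_j(t) of the excerpt).
    phi = [log_init[j] + log_obs[0][j] for j in range(n_states)]
    backpointers = []
    for t in range(1, len(log_obs)):
        new_phi, pointers = [], []
        for j in range(n_states):
            best_i = max(range(n_states), key=lambda i: phi[i] + log_trans[i][j])
            pointers.append(best_i)
            new_phi.append(phi[best_i] + log_trans[best_i][j] + log_obs[t][j])
        phi = new_phi
        backpointers.append(pointers)
    # Trace the best path backwards from the best final state.
    last = max(range(n_states), key=lambda j: phi[j])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return phi[last], path

# Hypothetical two-state toy model (all numbers are illustrative).
log = math.log
log_init = [log(0.6), log(0.4)]
log_trans = [[log(0.7), log(0.3)], [log(0.4), log(0.6)]]
log_obs = [[log(0.5), log(0.1)], [log(0.4), log(0.3)], [log(0.1), log(0.6)]]
best_ll, best_path = viterbi(log_init, log_trans, log_obs)
```

Working in the log domain avoids numerical underflow on long utterances; the maximisation replaces the summation used in the forward recursion.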

1185 | Condensation – conditional density propagation for visual tracking
- Isard, Blake
- 1998
Citation Context ...]. As an example of classical Monte Carlo methods, importance sampling is considered. Importance sampling is also of fundamental importance in modern sequential Monte Carlo and particle filter theory [25, 26, 59]. Importance sampling is based on drawing samples from a proposal distribution $q(x)$. As long as the proposal distribution and the objective are non-zero in the same region; that is, $p(x) \le Zq(x)$, $Z <$ ... |
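
A small sketch of importance sampling under stated assumptions: the target $p(x)$ is taken to be a standard normal, the proposal $q(x)$ a wider zero-mean normal (so the ratio $p/q$ stays bounded), and the self-normalised form of the estimator is used. The excerpt is truncated before it gives the estimator, so these choices and the names `normal_pdf` and `importance_estimate` are illustrative only.

```python
import math
import random

def normal_pdf(x, mean, std):
    """Density of a univariate Gaussian."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def importance_estimate(f, n_samples, rng):
    """Self-normalised importance sampling estimate of E_p[f(x)].

    Target p = N(0, 1); proposal q = N(0, 2^2). Both are assumptions
    chosen so that p(x) / q(x) is bounded for every x.
    """
    total_weight = 0.0
    total_weighted_f = 0.0
    for _ in range(n_samples):
        x = rng.gauss(0.0, 2.0)                                  # sample from q
        w = normal_pdf(x, 0.0, 1.0) / normal_pdf(x, 0.0, 2.0)    # weight p/q
        total_weight += w
        total_weighted_f += w * f(x)
    return total_weighted_f / total_weight

rng = random.Random(0)
# The second moment of a standard normal is 1, so the estimate should be close to 1.
second_moment = importance_estimate(lambda x: x * x, 20000, rng)
```

Normalising by the sum of the weights means the target density only needs to be known up to a constant, at the cost of a small bias that vanishes as the sample count grows.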

960 | Monte Carlo Statistical Methods
- Robert, Casella
- 2004
Citation Context ...with tractable sums $\hat{I}_N(f) = \frac{1}{N}\sum_{n=1}^{N} f(x^{(n)}) \xrightarrow[N \to \infty]{} I(f) = \int f(x)\,p(x)\,dx$ (4.18) which are unbiased and by the strong law of large numbers will converge almost surely as $N$ tends to infinity [110]. The first problem is how to draw samples from a given probability density function. This is straightforward only if the density, $p(x)$, is of a standard form, for example, in a Gaussian distribution ... |
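
The sample-average approximation in this excerpt can be illustrated with a short sketch. It assumes a density that is easy to sample from directly, here a Gaussian with mean 1 and unit variance, and an integrand $f(x) = x^2$ whose exact expectation is 2; the function name `monte_carlo_estimate` and these choices are hypothetical.

```python
import random

def monte_carlo_estimate(f, sampler, n_samples):
    """Average f over n_samples independent draws from sampler()."""
    return sum(f(sampler()) for _ in range(n_samples)) / n_samples

rng = random.Random(1)
draw = lambda: rng.gauss(1.0, 1.0)   # p(x) = N(1, 1), assumed for illustration

# E[x^2] = mean^2 + variance = 2 for this density.
estimates = {n: monte_carlo_estimate(lambda x: x * x, draw, n)
             for n in (100, 1000, 50000)}
```

The error of the estimate shrinks roughly as one over the square root of the number of samples, regardless of the dimension of $x$.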

950 | Sequential Monte Carlo Methods in Practice - Doucet, Freitas, et al. - 2001 |

856 | An introduction to variational methods for graphical models
- Jordan, Ghahramani, et al.
- 1999
Citation Context ...mulae for the example may be found in a variety of studies [89, 92]. 4.3.4 Variational Methods Variational methods have recently been introduced to carry out approximate inference in graphical models [62]. If the posterior distribution over the hidden variables is intractable and it has ... Algorithm 3 (Expectation Propagation): initialise $\tilde{t}_n(\theta)$... |

820 | A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains
- Baum, Petrie, et al.
- 1970
Citation Context ...tic modelling [60]. First, the generative model of HMM is presented along with typical observation density assumptions. The maximum likelihood (ML) parameter estimation and the Baum-Welch algorithm [7, 8] are then reviewed. Alternative training criteria are also discussed. 2.3.1 Generative Model of HMM In HMM based speech recognition, it is assumed that the sequence of $p$-dimensional observation vector... |

800 | Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences
- Davis, Mermelstein
- 1980
Citation Context ...sers produce sequences of observation vectors which represent the short-term spectrum of the speech signal. The two most commonly used parameterisations are Mel-frequency cepstral coefficients (MFCC) [17] and perceptual linear prediction (PLP) [51]. In both cases the speech signal is assumed to be quasi-stationary so that it can be divided into short frames, often 10 ms. In each frame period a new obse... |

708 | On sequential Monte Carlo sampling methods for Bayesian filtering
- Doucet, Godsill, et al.
- 2000
Citation Context ...]. As an example of classical Monte Carlo methods, importance sampling is considered. Importance sampling is also of fundamental importance in modern sequential Monte Carlo and particle filter theory [25, 26, 59]. Importance sampling is based on drawing samples from a proposal distribution $q(x)$. As long as the proposal distribution and the objective are non-zero in the same region; that is, $p(x) \le Zq(x)$, $Z <$ ... |

682 | Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer
- Katz
- 1987
Citation Context ...assigned, but given a finite training data, some valid word sequences may also be assigned a zero probability. A number of smoothing schemes such as discounting, backing off and deleted interpolation [68] have been proposed. There is often a mismatch between the contribution of the acoustic model and the language model in speech recognisers. This is due to different dynamic ranges of the discrete prob... |

634 | Applied Multivariate Statistical Analysis
- Johnson, Wichern
- 1988
Citation Context ...ce matrices are used. 2.5.1 Covariance Matrix Modelling Factor analysis is a statistical method for modelling the covariance structure of high dimensional data with a small number of hidden variables [61]. The use of factor analysis for covariance modelling in speech recognition has been investigated [120]. In factor analysis the covariance matrix assumes the following form where $\Sigma^{(o)}_j = \mathrm{diag}(\sigma^{(o)2}$ ... |

622 | Maximum likelihood linear regression for speaker adaptation of the parameters of continuous density hidden Markov models, Computer Speech and Language
- Leggetter, Woodland
- 1995
Citation Context ...ransform, $[b' \; A']'$. The transform parameters are optimised using EM algorithm with adaptation data from the new speaker. The $l$th row vector $\hat{m}_l$ of the extended transform matrix can be written as [76] $\hat{m}_l = k_l' G_l^{-1}$ where the matrix $G_l$ and column vector $k_l$ are defined as follows $G_l = \sum_{j=1}^{M}\sum_{n=1}^{N_s}\frac{1}{\sigma^2_{jnl}}\,\xi_{jn}\xi_{jn}'\sum_{t=1}^{T}\gamma_{jn}(t)$ (2.35) and, as (2.36), $k_l = \sum_{j=1}^{M}\sum_{n=1}^{N_s}\frac{1}{\sigma^2_{jnl}}\sum_{t=1}^{T}\gamma_{jn}(t)\,o_{tl}\,\xi$... |

588 | Dynamic Bayesian Networks: Representation, Inference and Learning
- Murphy
- 2002
Citation Context ...ar models [12]. In this work, the observation process is an important part of the correlation model for the high dimensional observation vectors. 3.2 Bayesian Networks In this work, Bayesian networks [37, 94] are used to illustrate the statistical independence assumptions between different random variables in probabilistic models. Bayesian networks are directed acyclic graphs, also known as graphical mode... |

541 | Telephone speech corpus for research and development
- Godfrey, Holliman, et al.
- 1992
Citation Context ...tractable. A number of approximate methods have been proposed [74, 80, 81, 83, 121]. Some gains compared to HMMs were reported in N-best rescoring experiments using various subsets of the Switchboard [43] corpus. 2.7 Summary The statistical framework for speech recognition has been described in this chapter. First, the standard front-ends were... |

508 | The Infinite Hidden Markov Model
- Beal, Ghahramani, et al.
- 2001
Citation Context ...tion matrix to an identity matrix; that is, $C = I$. Also semi-tied covariance matrix HMMs (STC) [32] can be described by both observation processes when $k = p$ and $v = 0$. Factorial hidden Markov models [41] use distributed representation of the discrete state space so that several independent HMMs can be viewed to have produced the observation vectors. 3.5.4 Example: Linear Dynamical System The linear d... |

500 | A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models
- Bilmes
- 1998
Citation Context ...essary to find the exact maximum of the auxiliary function in the M step. Any increase in the auxiliary function will increase the log-likelihood function. This is called the generalised EM algorithm [10]. For some models, it is not possible to optimise all the parameters simultaneously. Optimising the parameters one at a time is valid in the generalised EM framework and is useful for some of the mode... |

425 | Maximum likelihood linear transformations for HMM-based speech recognition
- Gales
- 1998
Citation Context ...ew observation matrix, $\hat{C}_j$, has to be optimised row by row as in SFA [46]. The scheme adopted in this work closely follows the maximum likelihood linear regression (MLLR) transform matrix optimisation [31]. The $l$th row vector $\hat{c}_{jl}$ of the new observation matrix can be written as $\hat{c}_{jl} = k_{jl}' G_{jl}^{-1}$ (5.18) where the $k$ by $k$ matrix $G_{jl}$ and the $k$-dimensional ... |

379 | Statistical inference for probabilistic functions of finite state Markov chains
- Baum, Petrie
- 1966
Citation Context ...tic modelling [60]. First, the generative model of HMM is presented along with typical observation density assumptions. The maximum likelihood (ML) parameter estimation and the Baum-Welch algorithm [7, 8] are then reviewed. Alternative training criteria are also discussed. 2.3.1 Generative Model of HMM In HMM based speech recognition, it is assumed that the sequence of $p$-dimensional observation vector... |

355 | On Gibbs Sampling for State Space Models
- Carter, Kohn
- 1994
Citation Context ...ference algorithms for linear dynamical systems and hidden Markov models [6]. These hybrid models have been called by many names such as switching Kalman filter [91, 93], conditionally Gaussian model [14, 15], jump Markov linear system [24] and switching linear dynamical system [52, 101, 102, 132]. In this work, the term switching linear dynamical system is used due to its intuitive connection to a linear... |

350 | New results in linear filtering and prediction theory
- Kalman, Bucy
- 1961
Citation Context ...tial observation sequence up to time $t$. The filtered estimates are defined as $x_{t|t} = E\{x_t \mid o_{1:t}\}$ (3.32) $\Sigma_{t|t} = E\{x_t x_t' \mid o_{1:t}\}$ (3.33) To evaluate these estimates the standard Kalman filter recursion [65, 66] can be written as follows $\Sigma_{t|t} = \Sigma_{t|t-1} - \Sigma_{t|t-1}C'\left(C\Sigma_{t|t-1}C' + \Sigma^{(o)}\right)^{-1}C\Sigma_{t|t-1}$ (3.34) $\Sigma_{t+1|t} = A\Sigma_{t|t}A' + \Sigma^{(x)}$ (3.35) with initial condition $\Sigma_{1|0} = \Sigma^{(i)}$ and the mean vectors are given by ... |
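
The covariance part of the recursion in this excerpt can be traced numerically with a scalar sketch. The reduction to one-dimensional state and observation, the function name `kalman_variances`, and the parameter values are assumptions for illustration; the mean updates are not shown because the excerpt is cut off before they appear.

```python
def kalman_variances(a, c, var_x, var_o, var_init, n_steps):
    """Scalar version of the Kalman covariance recursion.

    a, c     : state transition and observation coefficients (A and C)
    var_x    : state evolution noise variance
    var_o    : observation noise variance
    var_init : initial predicted variance
    Returns the filtered variances for t = 1 .. n_steps.
    """
    filtered = []
    predicted = var_init
    for _ in range(n_steps):
        innovation_var = c * predicted * c + var_o
        # Measurement update: shrink the predicted variance.
        current = predicted - predicted * c * (1.0 / innovation_var) * c * predicted
        filtered.append(current)
        # Time update: propagate through the state evolution process.
        predicted = a * current * a + var_x
    return filtered

# Static state observed in unit-variance noise (illustrative values):
# the filtered variance after t observations is 1 / (t + 1).
filtered = kalman_variances(a=1.0, c=1.0, var_x=0.0, var_o=1.0,
                            var_init=1.0, n_steps=4)
```

The variances do not depend on the observed data, so in the linear Gaussian case they can be computed once and reused across observation sequences of the same length.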

328 | Expectation propagation for approximate Bayesian inference
- Minka
- 2001
Citation Context ...hood by a Gaussian, $\tilde{q}_n(m_1) = \mathcal{N}(m_1; u, v)$. The dependence on the observation sequence is omitted since the function $\tilde{q}_n(m_1)$ is only an approximation of $p(m_1 \mid o_{1:n})$. In assumed density filtering (ADF) [88] the observations are processed sequentially by updating the approximate posterior according to the exact inference and projecting the re... |

289 | Discrete-Time Processing of Speech Signals
- Deller, Hansen, et al.
- 2000
Citation Context ... Comparing the sampled acoustic waveforms is not easy due to varying speaker and acoustic characteristics. Instead, the spectral shape of the speech signal conveys most of the significant information [18]. Acoustic front-ends in speech recognisers produce sequences of observation vectors which represent the short-term spectrum of the speech signal. The two most commonly used parameterisations are Mel-... |

288 | Variational Algorithms for Approximate Bayesian Inference
- Beal
- 2003
Citation Context ...so tend to be inaccurate if there is not enough training data. The training does not take any prior knowledge about the model parameters into account. 2.3.4 Bayesian Learning In the Bayesian learning [9], the parameters are also treated as random variables. The Bayesian approach attempts to integrate over the possible settings of all uncertain quantities rather than optimise them as ML learning in Eq... |

274 | A family of algorithms for approximate Bayesian inference
- Minka
- 2001
Citation Context ...the moment matching is equivalent to minimising the KL distance, $D(\tilde{p}_n(m_1), \tilde{q}_n(m_1))$. The ADF can be summarised as algorithm 2. The exact formulae for the example may be found in a variety of studies [89, 92]. Algorithm 2 (Assumed Density Filtering): initialise $\tilde{q}_0(\theta) = t_1(\theta)$; for $n = 1$ to $N$ do: $z_n = \int t_n(\theta)\,\tilde{q}_{n-1}(\theta)\,d\theta$; $\tilde{p}_n(\theta) = t_n(\theta)\,\tilde{q}_{n-1}(\theta)/z_n$; $E_{\tilde{q}_n}\{\theta\} = E_{\tilde{p}_n}\{\theta\}$; $E_{\tilde{q}_n}\{\theta\theta'\} = E_{\tilde{p}_n}\{\theta\theta'\}$; end for; $\hat{q}(\theta) = \tilde{q}_N(\theta)$ ... |

274 | A unifying review of linear Gaussian models
- Roweis, Ghahramani
- 1999
Citation Context ...ocess and an observation process which maps the current continuous state vector onto the observation space. This work considers forms of state space models known as generalised linear Gaussian models [111, 117]. In linear Gaussian models the state evolution and observation processes are based on linear functions and Gaussian distributed noise sources. Linear Gaussian models are popular as many forms may be ... |

268 | Tractable inference for complex stochastic processes
- Boyen, Koller
- 1998
Citation Context ...for independent observations is called generalised pseudo Bayesian algorithm of order 1. Though it may seem obvious that the errors introduced at each time step tend to accumulate, it has been shown [13] that the final error is bounded. The stochastic process ensures that the variance of the true distribution is high enough to overlap with the approximate distribution. 4.3.3 Expectation Propagation E... |

242 | An introduction to MCMC for machine learning
- Andrieu, Freitas, et al.
- 2003
Citation Context ...uted in common with all Monte Carlo methods $\hat{p}_{N_i}(x) = \frac{1}{N_i}\sum_{n=1}^{N_i}\delta(x - x^{(n)})$ (4.23) which converge towards its invariant density function, $p(x)$, if the Markov chain is irreducible and aperiodic [1]. Gibbs sampling explores the state space by a random walk steered by the conditional distributions. It may be slow to converge if the state space is large. Sometimes the structure of the model allows... |

233 | Independent factor analysis
- Attias
- 1999
Citation Context ...how the loading matrices are shared among a number of Gaussian components. For a global loading matrix the index may be omitted. An alternative sharing scheme called independent factor analysis (IFA) [1] has been proposed in machine learning literature. The covariance matrix for IFA assumes the following form $\Sigma_j = C_s\Sigma^{(x)}_j C_s' + \Sigma^{(o)}_s = \sum_{i=1}^{k}\sigma^{(x)2}_{ji}\,c_{si}c_{si}' + \sum_{i=1}^{p}\sigma^{(o)2}_{si}\,e_i e_i'$ (2.41)... |

229 | The EM algorithm for mixtures of factor analyzers
- Ghahramani, Hinton
- 1996
Citation Context ...rocess. In standard factor analysis the state vectors are assumed to be distributed according to a standard normal distribution, $\mathcal{N}(0, I)$. Independent factor analysis [1], mixture of factor analysers [38] and shared factor analysis [46] are based on the factor analysis model with different mixture assumptions. These models are discussed in the example below. Linear discriminant analysis [61] is anothe... |

228 | From HMM's to segment models: A unified view of stochastic modeling for speech recognition
- Ostendorf, Digalakis, et al.
- 1996
Citation Context ...mance, mathematically this technique conflicts with the independence assumption. This independence assumption is widely thought to be the major drawback of the use of HMMs for speech recognition (e.g. [22, 35, 58, 100, 103]). State space models may be used to address the shortcomings of HMM based speech recognition. State space models are based on a hidden continuous state evolution process and an observation process wh... |

206 | Some statistical issues in the comparison of speech recognition algorithms
- Gillick, Cox
- 1989
Citation Context ...ords in the correct transcription [127]. (2.32) When comparing the performance of different systems, it is useful to have a measure of confidence in the relative difference in the WER. McNemar’s test [42] is used in this work to yield the percentage probability, $P(\mathrm{MINUUE} \mid \mathrm{TUUE})$, where MINUUE is the minimum number of unique utterance errors of two systems under consideration and TUUE is the total numbe... |

196 | Matrix Algebra from a Statistician's Perspective
- Harville
- 1997
Citation Context ...ore memory efficient implementation requires the computation of the inverses and determinants for each time instant. These can be efficiently obtained using the following equality for matrix inverses [50] $(C_j\Sigma^{(x)}_{jn}C_j' + \Sigma^{(o)}_{jm})^{-1} = \Sigma^{(o)-1}_{jm} - \Sigma^{(o)-1}_{jm}C_j\left(C_j'\Sigma^{(o)-1}_{jm}C_j + \Sigma^{(x)-1}_{jn}\right)^{-1}C_j'\Sigma^{(o)-1}_{jm}$ (5.7) where the inverses of the covariance matrices $\Sigma^{(o)}_{jm}$ and $\Sigma^{(x)}_{jn}$ are trivial t... |
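
The matrix inverse equality in this excerpt can be checked numerically with a small sketch. It assumes a one-dimensional state, so the observation matrix reduces to a single column `c`, and a diagonal observation noise covariance; both are simplifications chosen here so that the example needs only the Python standard library, and the function names are hypothetical.

```python
def full_covariance(c, var_x, var_o):
    """Build var_x * c c' + diag(var_o) explicitly."""
    p = len(c)
    return [[var_x * c[i] * c[j] + (var_o[i] if i == j else 0.0)
             for j in range(p)] for i in range(p)]

def lemma_inverse(c, var_x, var_o):
    """Invert var_x * c c' + diag(var_o) without a general matrix inverse.

    Only element-wise reciprocals of the diagonal noise variances and one
    scalar reciprocal (the state dimension is 1 here) are needed.
    """
    p = len(c)
    inv_o = [1.0 / v for v in var_o]                     # diagonal inverse
    u = [inv_o[i] * c[i] for i in range(p)]              # Sigma_o^{-1} c
    inner = sum(c[i] * u[i] for i in range(p)) + 1.0 / var_x
    return [[(inv_o[i] if i == j else 0.0) - u[i] * u[j] / inner
             for j in range(p)] for i in range(p)]

def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

# Illustrative values: 3-dimensional observation, 1-dimensional state.
c = [1.0, 2.0, -1.0]
var_x = 0.5
var_o = [1.0, 2.0, 4.0]
product = matmul(full_covariance(c, var_x, var_o), lemma_inverse(c, var_x, var_o))
```

`product` should equal the identity matrix up to rounding. The saving comes from inverting a matrix of the state dimension rather than the observation dimension.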

193 | Semi-tied covariance matrices for hidden Markov models
- Gales
- 1999
Citation Context ...the FAHMM does not address the independence assumption. However, it generalises many standard covariance modelling schemes such as the shared factor analysis [46] and semi-tied covariance matrix HMMs [32]. Algorithms to optimise the FAHMM parameters and to use FAHMMs for speech recognition are presented together with various schemes to improve their efficiency. Second, a model based on linear first-or... |

190 | Minimum Phone Error and I-smoothing for improved discriminative training
- Povey, Woodland
- 2002
Citation Context ...rmance than ML. Discriminative optimisation criteria include maximum mutual information (MMI) [4], minimum classification error rate (MCE) [63], frame discrimination [67] and minimum phone error rate [104] of which MMI and MPE have been the most successful in speech recognition [47, 126]. In comparison to ML training, the discriminative methods require recognition runs to be carried out during training... |

189 | Continuous Speech Recognition by Statistical Methods
- Jelinek
- 1976
Citation Context ...ly complex models, suggests that there may be inherent deficiencies in the modelling paradigm. This work concentrates on the problems associated with acoustic modelling. The Hidden Markov model (HMM) [60] is the most popular and successful choice of acoustic model in modern speech recognisers. However, the HMM is based on assumptions which are not appropriate for modelling speech signals [35, 58]. The... |

170 | Maximum mutual information estimation of hidden Markov model parameters for speech recognition
- Bahl, Brown, et al.
- 1986
Citation Context ...hich finds the global maximum. However, it has been found that discriminative training yields better performance than ML. Discriminative optimisation criteria include maximum mutual information (MMI) [4], minimum classification error rate (MCE) [63], frame discrimination [67] and minimum phone error rate [104] of which MMI and MPE have been the most successful in speech recognition [47, 126]. In comp... |

168 | A Monte Carlo implementation of the EM algorithm and the poor man’s data augmentation algorithms
- Wei, Tanner
- 1990
Citation Context ...d expectation maximisation algorithm [19] can be generalised by using Monte Carlo approximation in the E step. In the case where multiple samples are drawn, this is known as the Monte Carlo EM (MCEM) [124]. The auxiliary function of SLDS for the EM algorithm can be expressed as $Q(\theta, \theta^{(k)}) = \sum_{\forall Q}\int p(X, Q \mid O, \theta^{(k)})\,\log p(O, X, Q \mid \theta)\,dX$ (6.21) Using standa... |

164 | Parameter estimation for linear dynamical systems
- Ghahramani, Hinton
- 1996
Citation Context ...namic Models Dynamic linear Gaussian models and the corresponding static models are illustrated in Figure 3.8. Dynamic models with factor analysis observation process include linear dynamical systems [22, 23, 39, 87, 100, 117], mixture of linear dynamical systems [111] and switching state space model [40, 91] as well as different variations of factor analysed HMMs presented later in this [footnote 4: Mixture weights are not included ... |

152 | Introduction to Monte Carlo Methods
- MacKay
- 1998
Citation Context ...distribution or in a discrete probability mass function. Otherwise, Monte Carlo methods including rejection sampling, importance sampling or Markov chain Monte Carlo (MCMC) algorithms have to be used [84]. As an example of classical Monte Carlo methods, importance sampling is considered. Importance sampling is also of fundamental importance in modern sequential Monte Carlo and particle filter theory [... |

148 | Variational learning for switching state-space models
- Ghahramani, Hinton
- 2000
Citation Context ...ure 3.8. Dynamic models with factor analysis observation process include linear dynamical systems [22, 23, 39, 87, 100, 117], mixture of linear dynamical systems [111] and switching state space model [40, 91] as well as different variations of factor analysed HMMs presented later in this [footnote 4: Mixture weights are not included for brevity.] Factor Analysed Hidden... |

143 | Propagation of probabilities, means, and variances in mixed graphical association models
- Lauritzen
- 1992
Citation Context ...sterior, $\tilde{q}_n(m_1)$, is found by matching the first two moments, $E_{\tilde{q}_n}\{m_1\} = E_{\tilde{p}_n}\{m_1\}$ and $E_{\tilde{q}_n}\{m_1 m_1\} = E_{\tilde{p}_n}\{m_1 m_1\}$. This is also known as weak marginalisation and is the best approximation in the KL sense [72]. It may be shown that the moment matching is equivalent to minimising the KL distance, $D(\tilde{p}_n(m_1), \tilde{q}_n(m_1))$. The ADF can be summarised as algorithm 2. The exact formulae for the example may be found i... |

142 | Speech Database Development: Design and Analysis of the Acoustic-Phonetic Corpus
- Lamel, Kassel, et al.
- 1986
Citation Context ...single set of parameters per phone but a linear time warping was used to normalise segment durations before the model was applied. Some gains compared to HMMs in phone classification task using TIMIT [71] corpus were reported. The application of LDSs for speech recognition has recently been addressed elsewhere [27, 28]. Most of the other segment models presented in the literature may be regarded as a ... |

142 | Speech recognition by machines and humans
- Lippmann
Citation Context ...lts for the 13 and 39-dimensional single and two observation noise component baseline FAHMMs on the 1200 utterance test set and 300 utterance train set. The best test set result in literature is 3.6% [80]. 7.7 The “oracle – idiot” word error rates for the 13 and 39-dimensional baseline FAHMMs. These give the limits for the word error rates that may be obtained by rescoring the... |

140 | Inferring parameters and structure of latent variable models by variational Bayes
- Attias
- 1999
Citation Context ... be bounded as follows $\log p(O) \ge \sum_{\forall Z}\int q(Z, m_1)\,\log\frac{p(O, Z, m_1)}{q(Z, m_1)}\,dm_1$ (4.15) where $Z$ is a matrix of mixture indicators $z_{nm}$ defined in Appendix C. Following the derivations in various studies [2, 90], the variational approximation is constrained to factor $q(Z, m_1) = q(Z)\,q(m_1)$ where the mixture indicator probability may be further simplified to $q(z_{nm}) \propto P(\omega = m)\exp\{\int q(m_1)\log p(o_n \mid \omega = m, m_1$... |

137 | Learning dynamic Bayesian networks
- Ghahramani
- 1997
Citation Context ...ar models [12]. In this work, the observation process is an important part of the correlation model for the high dimensional observation vectors. 3.2 Bayesian Networks In this work, Bayesian networks [37, 94] are used to illustrate the statistical independence assumptions between different random variables in probabilistic models. Bayesian networks are directed acyclic graphs, also known as graphical mode... |

136 | Minimum classification error rate methods for speech recognition
- Juang, Chou, et al.
- 1997
Citation Context ...s been found that discriminative training yields better performance than ML. Discriminative optimisation criteria include maximum mutual information (MMI) [4], minimum classification error rate (MCE) [63], frame discrimination [67] and minimum phone error rate [104] of which MMI and MPE have been the most successful in speech recognition [47, 126]. In comparison to ML training, the discriminative meth... |

132 | Maximum likelihood estimates of linear dynamic systems
- Rauch, Tung, et al.
- 1965
Citation Context ...tions and matrix algebra as described in Appendix E. Traditionally, the statistics of the smoothed state vector, $p(x_t \mid O) = \mathcal{N}(x_t; \hat{x}_t, \hat{\Sigma}_t)$, are obtained using the Rauch-Tung-Striebel (RTS) smoother [107, 108]. The RTS smoothing algorithm requires the above Kalman filter statistics be known. The recursion can be written as follows $\hat{\Sigma}_t = \Sigma_{t|t} + \Sigma_{t|t}A'\Sigma_{t+1|t}^{-1}\left(\hat{\Sigma}_{t+1} - \Sigma_{t+1|t}\right)\Sigma_{t+1|t}^{-1}A\Sigma_{t|t}$ (3.39) $\hat{x}_t$... |