## Bayesian adaptive inference and adaptive training (2007)

Venue: IEEE Transactions on Audio, Speech, and Language Processing

Citations: 9 (7 self)

### BibTeX

```bibtex
@ARTICLE{Yu07bayesianadaptive,
  author  = {Kai Yu and Mark J. F. Gales},
  title   = {Bayesian adaptive inference and adaptive training},
  journal = {IEEE Transactions on Audio, Speech, and Language Processing},
  year    = {2007},
  pages   = {1932--1943}
}
```

### Abstract

Large-vocabulary speech recognition systems are often built using found data, such as broadcast news. In contrast to carefully collected data, found data normally contains multiple acoustic conditions, such as different speakers or environmental noise. Adaptive training is a powerful approach to building systems on such data. Here, transforms are used to represent the different acoustic conditions, and a canonical model is then trained given this set of transforms. This paper describes a Bayesian framework for adaptive training and inference that addresses some limitations of standard maximum-likelihood approaches. In contrast to the standard approach, the adaptively trained system can be used directly in unsupervised inference, rather than having to rely on initial hypotheses being available. In addition, robust recognition performance can be obtained with limited adaptation data. The limited-data problem often occurs in testing, as there is no control over the amount of adaptation data available. In contrast, for adaptive training it is possible to control the system complexity to reflect the available data, so the standard point estimates may be used. As the integral associated with Bayesian adaptive inference is intractable, various marginalization approximations are described, including a variational Bayes approximation. Both batch and incremental modes of adaptive inference are discussed. These approaches are applied to adaptive training of maximum-likelihood linear regression and evaluated on a large-vocabulary speech recognition task. Bayesian adaptive inference is shown to significantly outperform standard approaches.

Index Terms: Adaptive training, Bayesian adaptation, Bayesian inference, incremental, variational Bayes.
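The marginalization and lower bound described in the abstract can be sketched generically as follows. Notation here is an assumption (O the observation sequence, H the hypothesis, W the transform, q(W) a variational distribution), not the paper's exact equations:

```latex
% Marginal likelihood: the transform W is integrated out under its prior p(W).
p(O \mid H) = \int p(O \mid W, H)\, p(W)\, dW

% Jensen's inequality with a variational distribution q(W) gives the lower
% bound optimised in variational Bayes; point estimates (ML, MAP) correspond
% to restricting q(W) to a Dirac delta.
\log p(O \mid H) \;\ge\; \int q(W) \log \frac{p(O \mid W, H)\, p(W)}{q(W)}\, dW
```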

### Citations

1921 | Pattern Classification - Duda, Hart, et al. - 2001

Citation Context: ... estimate of transform for the target domain. Equation (15) may then be re-expressed as (18) where is the entropy of . For all point estimates of , the entropy of the Dirac delta function is the same [27]. As is a negative constant with infinite value, it can be ignored without affecting the rank ordering of the lower bound. The rank ordering of the lower bound is then determined by MAP (19) Equation ...

1163 | Error Bounds for Convolutional Codes and an Asymptotically Optimum Decoding Algorithm - Viterbi - 1967

Citation Context: ...aptation [3]. In this paper, supervised mode will not be further discussed as there is no supervision data available for the tasks considered. In recognition with standard HMMs, the Viterbi algorithm [19] is usually used to efficiently calculate the likelihood of the observation sequence. This relies on the conditional independence assumption of HMMs to make the inference efficient. However, this condition...
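As a reminder of the dynamic-programming recursion the context refers to, here is a minimal Viterbi sketch for a discrete-state HMM in the log domain. This is a generic textbook illustration, not the paper's adapted-HMM setup; all array shapes and names are assumptions:

```python
import numpy as np

def viterbi(log_A, log_B, log_pi):
    """Most likely state sequence for an HMM (all quantities in log domain).

    log_A:  (S, S) log transition matrix, log_A[i, j] = log P(j | i)
    log_B:  (T, S) per-frame log output probabilities
    log_pi: (S,)   log initial state distribution
    """
    T, S = log_B.shape
    delta = log_pi + log_B[0]            # best log score ending in each state
    psi = np.zeros((T, S), dtype=int)    # back-pointers
    for t in range(1, T):
        scores = delta[:, None] + log_A  # (S, S): previous state x next state
        psi[t] = np.argmax(scores, axis=0)
        delta = scores[psi[t], np.arange(S)] + log_B[t]
    # Trace the best path back through the stored pointers.
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1], float(np.max(delta))
```

The point made in the context is that this recursion relies on the HMM conditional-independence assumption, which a shared transform across the whole utterance breaks.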

903 | Monte Carlo Statistical Methods - Robert, Casella - 2004

Citation Context: ...s do not involve an iterative process and approximate the marginal likelihood directly. Hence, they are referred to as direct approximations. Sampling approaches are one form of direct approximations [10]. The FI assumption has previously been investigated for adaptation and also referred to as Bayesian predictive adaptation [11]–[13]. Though a distribution over the transform parameters, rather than a...
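A minimal sketch of a sampling-based direct approximation: the marginal likelihood of one observation under a Gaussian prior over a mean parameter is estimated by drawing from the prior and averaging the likelihood, then checked against the closed form. The one-dimensional model and all names are illustrative assumptions, far simpler than transform marginalization over a whole utterance:

```python
import math, random

def mc_marginal_likelihood(x, m, tau, sigma, n_samples=100_000, seed=0):
    """Estimate p(x) = \int N(x; theta, sigma^2) N(theta; m, tau^2) dtheta
    by sampling theta from the prior and averaging the likelihood."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        theta = rng.gauss(m, tau)
        total += math.exp(-0.5 * ((x - theta) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
    return total / n_samples

def exact_marginal(x, m, tau, sigma):
    """Closed form: the Gaussian-Gaussian marginal is N(x; m, sigma^2 + tau^2)."""
    v = sigma ** 2 + tau ** 2
    return math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2 * math.pi * v)
```

In high dimensions and over long observation sequences this naive prior sampling becomes very inefficient, which is one reason the paper also considers lower-bound approximations.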

592 | Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models - Leggetter, Woodland - 1995

Citation Context: ...ihood in inference. An iterative process is used to make this lower bound as tight as possible to the marginal likelihood. Point estimates of transforms, such as maximum a posteriori (MAP) [5] and ML [6], sit within this class. Variational Bayes (VB) [7] is another lower bound-based Bayesian approximation approach. In VB, a distribution over the parameters, rather than a point estimate, is used. This ...

275 | Variational Algorithms for Approximate Bayesian Inference - Beal - 2003

Citation Context: ... make this lower bound as tight as possible to the marginal likelihood. Point estimates of transforms, such as maximum a posteriori (MAP) [5] and ML [6], sit within this class. Variational Bayes (VB) [7] is another lower bound-based Bayesian approximation approach. In VB, a distribution over the parameters, rather than a point estimate, is used. This should lead to more robust recognition performance ...

196 | Some statistical issues in the comparison of speech recognition algorithms - Gillick, Cox - 1989

Citation Context: ... significant is used, a pair-wise significance test has been done using NIST-provided software sctk-1.2, which uses a standard approach to conduct significance tests with the significance level of 5% [31]. ... [Table II: WER (%) comparison between 1-best and N-best supervision (N = 150)] ... using the MAP estimation, a 1% absolute gain over th...

179 | Minimum phone error and I-smoothing for improved discriminative training - Povey, Woodland - 2002

Citation Context: ... discussed in detail in Section III-B. The Bayesian framework described before is based on the likelihood criterion. To obtain state-of-the-art performance, the discriminative criterion is often used [22]. Discriminative adaptive training and inference can also be interpreted from the Bayesian perspective [16]. In this paper, the training procedure adopted is to only discriminatively update the canoni...

145 | A compact model for speaker-adaptive training - Anastasakos - 1996

Citation Context: ...nce is shown to significantly outperform standard approaches. Index Terms—Adaptive training, Bayesian adaptation, Bayesian inference, incremental, variational Bayes. I. INTRODUCTION ADAPTIVE training [1], [2] has become increasingly popular as greater use has been made of found data, such as broadcast news. For these forms of data, it is not possible to control the “nonspeech” acoustic conditions, su...

81 | Flexible speaker adaptation using maximum likelihood linear regression - Leggetter, Woodland - 1995

Citation Context: ...cumulated. This is the most efficient form. The standard incremental adaptation scheme uses a similar strategy, where the alignments of the previous utterances are fixed and the statistics propagated [29]. However, in the standard approach, only one transform is estimated for decoding the current utterance. In a Bayesian inference framework, a distinct transform is estimated for each possible hypothes...
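The propagate-and-fix idea in this context can be illustrated with a toy accumulator. This is a hypothetical sketch only: real incremental MLLR accumulates matrix-valued sufficient statistics per transform, whereas here a scalar running mean stands in for the estimate:

```python
class IncrementalStats:
    """Toy accumulator: running sufficient statistics for an ML mean estimate.

    Mirrors the idea of fixing past alignments and propagating their
    statistics: each utterance contributes (count, sum) once and is never
    revisited when later utterances arrive.
    """
    def __init__(self):
        self.count = 0.0
        self.total = 0.0

    def add_utterance(self, frames):
        # Statistics from this utterance are accumulated and then fixed.
        self.count += len(frames)
        self.total += sum(frames)

    def ml_mean(self):
        return self.total / self.count if self.count else 0.0
```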

75 | Structural maximum a posteriori linear regression for fast HMM adaptation - Siohan, Myrvoll, et al. - 2002

70 | An empirical Bayes approach to statistics - Robbins - 1955

57 | Cluster Adaptive Training of Hidden Markov Models - Gales - 2000

Citation Context: ...s shown to significantly outperform standard approaches. Index Terms—Adaptive training, Bayesian adaptation, Bayesian inference, incremental, variational Bayes. I. INTRODUCTION ADAPTIVE training [1], [2] has become increasingly popular as greater use has been made of found data, such as broadcast news. For these forms of data, it is not possible to control the “nonspeech” acoustic conditions, such as...

35 | The empirical Bayes approach to statistical decision problems - Robbins - 1964

Citation Context: ...te prior to the complete data set [15]. The second issue is the estimation of the hyper-parameters, once the prior form is determined. They may be estimated using the empirical Bayes approach [17], [18]. The basic idea is to maximize the marginal likelihood in (1) and (2) with respect to the hyper-parameters of both priors. Directly optimizing these equations is highly complex due to the existence o...
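For intuition on the empirical Bayes idea of maximizing marginal likelihood over hyper-parameters, here is the standard Gaussian-Gaussian textbook example. It is illustrative only, not the paper's transform prior; the model, names, and known observation variance are assumptions:

```python
import statistics

def empirical_bayes_gaussian(ys, sigma):
    """Empirical Bayes hyper-parameters for y_i ~ N(theta_i, sigma^2),
    theta_i ~ N(m, tau^2), with sigma known.

    Marginally y_i ~ N(m, sigma^2 + tau^2), so maximizing the marginal
    likelihood over (m, tau^2) gives m_hat = mean(y) and
    tau2_hat = max(0, var(y) - sigma^2).
    """
    m_hat = statistics.fmean(ys)
    var_hat = statistics.pvariance(ys, mu=m_hat)
    tau2_hat = max(0.0, var_hat - sigma ** 2)
    return m_hat, tau2_hat
```

The clipping at zero reflects that the prior variance cannot be negative; when the observed spread is entirely explained by observation noise, the fitted prior collapses.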

28 | Iterative Unsupervised Adaptation Using Maximum Likelihood Linear Regression - Woodland, Pye, et al. - 1996

Citation Context: ... viewpoint of tightening the lower bound during adaptive inference. It is also interesting to compare N-best supervision to the standard 1-best supervision adaptation approaches such as iterative MLLR [24]. In iterative MLLR, a transform is estimated using the 1-best hypothesis of the test data as supervision. This transform is then used to calculate inference evidence for all possible hypotheses and t...

19 | Robust speech recognition based on a Bayesian prediction approach - Jiang, Hirose, et al. - 1999

Citation Context: ... frame-independent assumption. This assumption has been implicitly used in the Bayesian prediction approaches for HMM parameters, where the resultant distribution is called the Bayesian predictive distribution [28]. In [8] and [9], this approach was used as the inference scheme for parameter distributions trained using the VB approach. The assumption has also been investigated for Bayesian adaptation [11], [12], [15...

19 | The HTK Book (for HTK Version 3.0) - Young - 2000

Citation Context: ...in Fig. 3 corresponds to the unadapted ML-SI baseline. As an additional baseline for incremental adaptive inference, the ML-SI model was also adapted using the standard robust ML adaptation technique [32]. Here, a threshold was used to determine the minimum posterior occupancy to estimate a robust ML transform. This is the SI-ML+Thrd line in Fig. 3. From Fig. 3, the SI-ML+Thrd line always shows better...

17 | Application of variational Bayesian approach to speech recognition - Watanabe, Minami, et al.

Citation Context: ...rather than a point estimate is used. This should lead to more robust recognition performance than the point estimates. VB has previously been applied to train distributions over HMM model parameters [8]. As an application to simple adaptation, VB was also used in [9] to train distributions of a mean bias vector and a scaling factor in supervised adaptation on an isolated-word recognition task. Howe...

16 | The Use of Confidence Measures in Unsupervised Adaptation of Speech Recognizers - Anastasakos, Balakrishnan - 1998

Citation Context: ...forms or short sentences as shown in Section IV. A number of other schemes have previously been proposed to address the 1-best bias problem. Two such schemes are lattice MLLR [25] and confidence MLLR [26]. In contrast to the N-best supervision framework, these schemes do not directly address the problem, but rather use some form of measure of the confidence of a particular transcription. The disadvanta...

15 | Speaker Adaptation Using Lattice-based MLLR - Uebel, Woodland - 2001

Citation Context: ...ecially for complex transforms or short sentences as shown in Section IV. A number of other schemes have previously been proposed to address the 1-best bias problem. Two such schemes are lattice MLLR [25] and confidence MLLR [26]. In contrast to the N-best supervision framework, these schemes do not directly address the problem, but rather use some form of measure of the confidence of a particular tran...

11 | Adaptive training for robust ASR - Gales - 2001

Citation Context: ...und training data is thus highly nonhomogeneous with multiple acoustic conditions being present in the training corpus. One approach for building systems on nonhomogeneous data is multistyle training [3]. Here, all training data are treated as a single block to train the hidden Markov models (HMMs), for example, speaker-independent training. These multistyle systems model both speech ...

9 | Acoustic model adaptation based on coarse-fine training of transfer vectors and its application to speaker adaptation task - Watanabe, Nakamura

Citation Context: ...bust recognition performance than the point estimates. VB has previously been applied to train distributions over HMM model parameters [8]. As an application to simple adaptation, VB was also used in [9] to train distributions of a mean bias vector and a scaling factor in supervised adaptation on an isolated-word recognition task. However, in contrast to this work, the VB approaches in [8] and [9] w...

8 | Bayesian adaptation and adaptively trained systems - Yu, Gales

Citation Context: ...L estimates of transforms are not reliable and may be overly “tuned” to the initial hypothesis. These problems may be addressed by interpreting adaptive training and inference in a Bayesian framework [4]. Here, the parameters of the system are treated as random variables. The likelihood of the observation sequence is then obtained by marginalizing out over the parameter distributions. Though this app...

8 | Acoustic factorization - Gales - 2001

Citation Context: ...esult in tractable mathematical formulas. For example, for a mean-based transform such as MLLR [6], a Gaussian distribution over the transform parameters is the conjugate prior to the complete data set [15]. The second issue is the estimation of the hyper-parameters, once the prior form is determined. They may be estimated using the empirical Bayes approach [17], [18]. The basic idea is to maximize th...

8 | Transformation-based Bayesian predictive classification for online environmental learning and robust speech recognition - Chien, Liao - 2000

Citation Context: ...s is a special case of the integrated Bayesian inference process. This is discussed in Section III-A. In contrast to some previously investigated Bayesian predictive adaptation (BPA) approaches [11], [21], Bayesian adaptive inference strictly deals with the Bayesian integral over the whole observation sequence, while the BPA approaches implicitly assume the Bayesian integral is performed at every time...

7 | Transformation based Bayesian prediction for adaptation of HMMs - Surendran, Lee - 2001

Citation Context: ...pproximations. Sampling approaches are one form of direct approximations [10]. The FI assumption has previously been investigated for adaptation and also referred to as Bayesian predictive adaptation [11]–[13]. Though a distribution over the transform parameters, rather than a point estimate, is used, the transform is allowed to effectively change from frame to frame, possibly limiting performance gai...

7 | Linear regression based Bayesian predictive classification for speech recognition - Chien

Citation Context: ...tion [28]. In [8] and [9], this approach was used as the inference scheme for parameter distributions trained using the VB approach. The assumption has also been investigated for Bayesian adaptation [11], [12], [15]. Using this approximation in (3) yields (27), where (28) is the Bayesian predictive distribution at . With an appropriate form of , this frame-level integral is tractable. For example, in MLLR a...
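The tractable frame-level integral mentioned here can be sketched for the simplest case, a mean-bias transform with a Gaussian prior. The symbols below are illustrative assumptions (a bias b rather than a full MLLR matrix):

```latex
% Frame-level Bayesian predictive distribution under the frame-independent
% assumption: with a Gaussian prior over a mean-offset transform b, the
% per-frame integral is itself Gaussian with an inflated covariance.
p(o_t) = \int \mathcal{N}(o_t;\, \mu + b,\, \Sigma)\,
              \mathcal{N}(b;\, \mu_b,\, \Sigma_b)\, db
       = \mathcal{N}(o_t;\, \mu + \mu_b,\, \Sigma + \Sigma_b)
```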

6 | Bayesian Adaptation Revisited - Kenny, Boulianne, et al.

Citation Context: ...imations. Sampling approaches are one form of direct approximations [10]. The FI assumption has previously been investigated for adaptation and also referred to as Bayesian predictive adaptation [11]–[13]. Though a distribution over the transform parameters, rather than a point estimate, is used, the transform is allowed to effectively change from frame to frame, possibly limiting performance gains. T...

6 | N-best-based unsupervised speaker adaptation for speech recognition - Matsui, Furui - 1998

Citation Context: ...t to get a tight lower bound for . In order to achieve this, it is necessary to optimize the lower bound with respect to every possible hypothesis, which is similar to N-best supervision [23]. In contrast to the work in [23], where no theoretical justification was proposed, the work here motivates it from a viewpoint of tightening the lower bound during adaptive inference. It is also inter...

4 | Incremental adaptation using Bayesian inference - Yu, Gales

Citation Context: ...stimate, is used, the transform is allowed to effectively change from frame to frame, possibly limiting performance gains. This paper examines both lower bound and direct approaches. Both incremental [14] and batch mode [4] Bayesian adaptive inference are discussed. These general Bayesian approximations are then applied to a specific transform: maximum-likelihood linear regression (MLLR) [6]. This pa...

1 | Maximum a-posteriori linear regression with elliptical symmetric matrix variate priors - Chou - 1999

1 | Bayesian adaptation and adaptive training - Yu, Gales - 2006

Citation Context: ... issue will be addressed later. The estimation of the transform prior is complicated due to the homogeneity constraint. A separate variational transform ... For discussion about mixture priors, refer to [16]. In the general case, where a conjugate prior does not exist, it is not possible to set the KL divergence to zero in the lower bound (5). Optimizing the bound is still valid; however, the optimum wi...

1 | The N-best algorithm: An efficient and exact procedure for finding the N most likely sentence hypotheses - Schwartz, Chow, et al. - 1990

Citation Context: ...ce assumption is not valid for adaptive HMMs due to the additional dependence on the transform. Hence, the Viterbi algorithm is not suitable for Bayesian adaptive inference. Instead, N-best rescoring [20] is used in this work to reflect the nature of adaptive HMMs. Though the N-best rescoring may limit the performance gain, and loss, due to the limited number of candidate hypothesis sequences, given suf...
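N-best rescoring itself reduces to scoring each candidate hypothesis with its own adapted model and re-ranking. A skeletal sketch, where the scoring callable is a hypothetical placeholder for the per-hypothesis adapted (Bayesian) score:

```python
def nbest_rescore(hypotheses, adapted_score):
    """Rescore an N-best list: each candidate gets its own adapted score
    (in the Bayesian scheme, its own transform posterior), and the best
    rescored candidate is returned along with the full ranking.

    hypotheses:    list of hypothesis strings
    adapted_score: callable, hypothesis -> log score under the adapted model
    """
    scored = [(adapted_score(h), h) for h in hypotheses]
    scored.sort(reverse=True)          # highest score first
    return scored[0][1], scored
```

The limitation noted in the context shows up here directly: the search is exact only over the candidates supplied, so the list must be long enough to contain the true best hypothesis.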