## Stochastic Feature Transformation with Divergence-Based Out-of-Handset Rejection for Robust Speaker Verification (2004)

Venue: EURASIP J. on Applied Signal Processing

Citations: 10 - 6 self

### BibTeX

@ARTICLE{Mak04stochasticfeature,
  author = {Man-Wai Mak and Chi-leung Tsang and Sun-yuan Kung},
  title = {Stochastic Feature Transformation with Divergence-Based Out-of-Handset Rejection for Robust Speaker Verification},
  journal = {EURASIP J. on Applied Signal Processing},
  year = {2004},
  volume = {4},
  pages = {452--465}
}

### Abstract

The performance of telephone-based speaker verification systems can be severely degraded by linear and non-linear acoustic distortion caused by telephone handsets. This paper proposes to combine a handset selector with stochastic feature transformation to reduce the distortion. Specifically, a GMM-based handset selector is trained to identify the most likely handset used by the claimants, and then handset-specific stochastic feature transformations are applied to the distorted feature vectors. This paper also proposes a divergence-based handset selector with out-of-handset (OOH) rejection capability to identify the ‘unseen’ handsets. This is achieved by measuring the Jensen difference between the selector's output and a constant vector with identical elements. The resulting handset selector is combined with the proposed feature transformation technique for telephone-based speaker verification. Experimental results based on 150 speakers of the HTIMIT corpus show that the handset selector, either with or without OOH rejection capability, is able to identify the ‘seen’ handsets accurately (98.3% in both cases). Results also demonstrate that feature transformation performs significantly better than the classical cepstral mean normalization approach. Finally, by using the transformation parameters of the ‘seen’ handsets to transform the utterances with correctly identified handsets and processing those utterances with ‘unseen’ handsets by cepstral mean subtraction, verification error rates are reduced significantly (from 12.41% to 6.59% on average).
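The divergence-based rejection step described in the abstract can be illustrated with a minimal Python sketch. It assumes the selector's output is a vector of non-negative per-handset scores that can be normalized to sum to one, that the entropy function is the Shannon entropy, and that a small Jensen difference from the constant (uniform) vector means the selector is undecided, so the handset is treated as ‘unseen’. The function names, the normalization step, and the threshold value are hypothetical and are not taken from the paper.

```python
import math


def shannon_entropy(z):
    """Shannon entropy of a probability vector, treating 0 * log(0) as 0."""
    return -sum(p * math.log(p) for p in z if p > 0)


def jensen_difference(alpha, r):
    """J(alpha, r) = S((alpha + r) / 2) - 0.5 * [S(alpha) + S(r)]."""
    mid = [(a + b) / 2 for a, b in zip(alpha, r)]
    return shannon_entropy(mid) - 0.5 * (shannon_entropy(alpha) + shannon_entropy(r))


def select_handset(scores, threshold):
    """Return (handset_index or None, J).

    scores: non-negative selector outputs, one per known handset (assumed form).
    threshold: hypothetical decision threshold; None signals OOH rejection.
    """
    total = sum(scores)
    alpha = [s / total for s in scores]      # normalize so the entries sum to 1
    r = [1.0 / len(alpha)] * len(alpha)      # constant vector with identical elements
    j = jensen_difference(alpha, r)
    if j >= threshold:
        # Selector output is far from uniform: accept the most likely handset.
        best = max(range(len(alpha)), key=alpha.__getitem__)
        return best, j
    # Selector output is close to uniform: treat the handset as 'unseen'.
    return None, j
```

In a system along the lines of the abstract, a `None` result would route the utterance to cepstral mean subtraction, while an accepted index would pick the handset-specific feature transformation.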

### Citations

823 | C4.5: Programs for
- Quinlan
- 1993
Citation Context ...ct these dissimilar, ‘unseen’ handsets enables the verification system to maintain the error rate at a level achievable by the CMS method. We are currently looking at tree-based clustering algorithms [28] to register any dissimilar, ‘unseen’ handsets into the handset database. With the ability to register new handsets, the speaker verification system will eventually be able to identify almost all hand...

628 | Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models
- Leggetter, Woodland
- 1995
Citation Context ...include (1) stochastic matching [5] and stochastic additive transformation [6] where the models’ means and variances are adjusted by stochastic biases, (2) maximum likelihood linear regression (MLLR) [7] where the mean vectors of clean speech models are linearly transformed, and (3) the constrained reestimation of Gaussian mixtures [8] where both mean vectors and covariance matrices are transformed. ...

427 | Maximum likelihood linear transformations for HMM-based speech recognition
- Gales
- 1998
Citation Context ...(3) the constrained reestimation of Gaussian mixtures [8] where both mean vectors and covariance matrices are transformed. Recently, MLLR has been extended to maximum-likelihood linear transformation [9], in which the transformation matrices for the variances can be different from those for the mean vectors. Meanwhile, the constrained transformation in [8] has been extended to piecewise-linear stocha...

409 | Robust Text-Independent Speaker Identification Using Gaussian Mixture Models
- Reynolds
- 1995
Citation Context ...old to make a verification decision. In this work, the threshold for each speaker was adjusted to determine an equal error rate (EER), i.e. speaker-dependent thresholds were used. Similar to [25] and [26], the vector sequence was divided into overlapping segments to increase the resolution of the error rates. B. Results Table II compares different stochastic feature transformation approaches against c...

267 | The DET curve in assessment of detection task performance
- Martin, Doddington, et al.
- 1997
Citation Context ...dset were rejected by the selector (for 450 utterances, 369 of them were rejected). To better illustrate the detection performance of our verification system, we plot the DET curves, as introduced in [27], for the three approaches. The speaker detection performance, using the ‘seen’ handset cb1 and the ‘unseen’ handset cb3 in verification sessions, are shown in Figure 3 and Figure 4 respectively. The ...

194 | Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification
- Atal
- 1974
Citation Context ...approaches attempt to modify the distorted features so that the resulting features fit the clean speech models better. These approaches include cepstral mean subtraction (CMS) [1] and signal bias removal [2], which approximate a linear channel by the long-term average of distorted cepstral vectors. These approaches, however, do not consider the effect of background noise. A mo...

180 | Acoustical and environmental robustness in automatic speech recognition
- Acero
- 1993
Citation Context ...ackground noise. A more general approach, in which additive noise and convolutive distortion are modeled as codeword-dependent cepstral biases, is the codeword-dependent cepstral normalization (CDCN) [3]. The CDCN, however, only works well when the background noise level is low. When stereo corpora are available, channel distortion can be estimated directly by comparing the clean feature vectors agai...

110 | A Maximum Likelihood Approach to Stochastic Matching for Robust Speech Recognition
- Sankar, Lee
- 1996
Citation Context ... rely on the availability of stereo corpora. The requirement of stereo corpora can be avoided by making use of the information embedded in the clean speech models. For example, in stochastic matching [5], the transformation parameters are determined by maximizing the likelihood of observing the distorted features given the clean models. Instead of transforming the distorted features to fit the clean ...

64 | A study on speaker adaptation of the parameters of continuous density hidden Markov models
- Lee, Lin, et al.
- 1991
Citation Context ...he model parameters via a small number of transformations, they may not be able to capture the fine structure of the distortion. While this limitation can be overcome by the Bayesian techniques [12], [13] where model parameters are adjusted “directly”, the Bayesian approach requires a large amount of adaptation data to be effective. As both direct and indirect adaptations have their own strengths and ...

58 | On the convexity of some divergence measures based on entropy functions
- Burbea, Rao
- 1982
Citation Context ...ion recovered features Fig. 1. Speaker verification system with handset identification, out-of-handset rejection, and handset-dependent feature transformation. where $J(\vec{\alpha}, \vec{r})$ is the Jensen difference [23], [24] between $\vec{\alpha}$ and $\vec{r}$ (whose values will be discussed next) and $\varphi$ is a decision threshold. $J(\vec{\alpha}, \vec{r})$ can be computed as $J(\vec{\alpha}, \vec{r}) = S\left(\frac{\vec{\alpha} + \vec{r}}{2}\right) - \frac{1}{2}\left[S(\vec{\alpha}) + S(\vec{r})\right]$ (10) where $S(\vec{z})$, called th...

49 | Integrated models of signal and background with application to speaker identification in noise
- Rose, Hofstetter, et al.
- 1994
Citation Context ...distorted data better. This is known as the model-based transformation in the literature. Influential model-based approaches include (1) stochastic matching [5] and stochastic additive transformation [6] where the models’ means and variances are adjusted by stochastic biases, (2) maximum likelihood linear regression (MLLR) [7] where the mean vectors of clean speech models are linearly transformed, an...

49 | HTIMIT and LLHDB: Speech corpora for the study of handset transducer effects
- Reynolds
- 1997
Citation Context ...rmation is designed to work with a handset selector for robust speaker verification. Some researchers have proposed to use handset selectors for solving the handset identification problem [20], [21], [22]. Most existing handset selectors, however, simply select the most likely handset from a set of known handsets even for speech coming from an ‘unseen’ handset. If a claimant uses a handset that has no...

38 | Signal bias removal by maximum likelihood estimation for robust telephone speech recognition
- Rahim, Juang
Citation Context ... the distorted features so that the resulting features fit the clean speech models better. These approaches include cepstral mean subtraction (CMS) [1] and signal bias removal [2], which approximate a linear channel by the long-term average of distorted cepstral vectors. These approaches, however, do not consider the effect of background noise. A more general approach, in whic...

35 | Speaker adaptation using constrained reestimation of gaussian mixtures
- Digalakis, Rtischev, et al.
- 1995
Citation Context ...tochastic biases, (2) maximum likelihood linear regression (MLLR) [7] where the mean vectors of clean speech models are linearly transformed, and (3) the constrained reestimation of Gaussian mixtures [8] where both mean vectors and covariance matrices are transformed. Recently, MLLR has been extended to maximum-likelihood linear transformation [9], in which the transformation matrices for the varianc...

30 | On-line adaptive learning of the continuous density hidden Markov model based on approximate recursive Bayes estimate
- Huo, Lee
- 1997
Citation Context ...just the model parameters via a small number of transformations, they may not be able to capture the fine structure of the distortion. While this limitation can be overcome by the Bayesian techniques [12], [13] where model parameters are adjusted “directly”, the Bayesian approach requires a large amount of adaptation data to be effective. As both direct and indirect adaptations have their own strength...

23 | Estimation of Handset Nonlinearity With Application to Speaker Recognition
- Quatieri, Reynolds, et al.
Citation Context ... energy-dependent frequency responses [16] for which a linear filter may be a poor approximation. Recently, this problem has been addressed by considering the distortion as a non-linear mapping [17], [18]. However, these methods rely on the availability of stereo corpora with accurate time alignment. To address the above problems, we have proposed a method in which non-linear transformations can be es...

22 | Estimation of elliptical basis function parameters by the EM algorithm with application to speaker verification
- Mak, Kung
- 2000
Citation Context ... a threshold to make a verification decision. In this work, the threshold for each speaker was adjusted to determine an equal error rate (EER), i.e. speaker-dependent thresholds were used. Similar to [25] and [26], the vector sequence was divided into overlapping segments to increase the resolution of the error rates. B. Results Table II compares different stochastic feature transformation approaches ...

20 | Joint maximum a posteriori adaptation of transformation and HMM parameters
- Siohan, Chesta, et al.
- 2001
Citation Context ...a to be effective. As both direct and indirect adaptations have their own strengths and weaknesses, a natural extension is to combine them so that these two approaches can complement each other [14], [15]. Although the above methods have been successful in reducing channel mismatches, most of them operate on the assumption that the channel effect can be approximated by a linear filter. Most telephone ...

18 | The effect of telephone transmission degradations on speaker recognition performance
- Reynolds, Zissman, et al.
- 1995
Citation Context ...annel mismatches, most of them operate on the assumption that the channel effect can be approximated by a linear filter. Most telephone handsets, in fact, exhibit energy-dependent frequency responses [16] for which a linear filter may be a poor approximation. Recently, this problem has been addressed by considering the distortion as a non-linear mapping [17], [18]. However, these methods rely on the a...

12 | Divergence-based out-of-class rejection for telephone handset identification
- Tsang, Mak, et al.
Citation Context ...near transformation is designed to work with a handset selector for robust speaker verification. Some researchers have proposed to use handset selectors for solving the handset identification problem [20], [21], [22]. Most existing handset selectors, however, simply select the most likely handset from a set of known handsets even for speech coming from an ‘unseen’ handset. If a claimant uses a handset...

11 | Non-linear compensation for stochastic matching
- Surendran, Lee, et al.
- 1999
Citation Context ...where a collection of linear transformations are shared by all the Gaussians in each mixture. The random bias in [5] has also been replaced by a neural network to compensate for non-linear distortion [11]. All these extensions show improvement in recognition accuracy. As the above methods “indirectly” adjust the model parameters via a small number of transformations, they may n...

11 | Combining stochastic feature transformation and handset identification for telephone-based speaker verification
- Mak, Kung
Citation Context ...lity of stereo corpora with accurate time alignment. To address the above problems, we have proposed a method in which non-linear transformations can be estimated under a maximum likelihood framework [19], thus eliminating the need for accurately aligned stereo corpora. The only requirement is to record a few utterances uttered by a few speakers using different handsets. These speakers do not need to ...

9 | Maximum-likelihood stochastic-transformation adaptation of hidden Markov models
- Diakoloukas, Digalakis
- 1999
Citation Context ...formation matrices for the variances can be different from those for the mean vectors. Meanwhile, the constrained transformation in [8] has been extended to piecewise-linear stochastic transformation [10], where a collection of linear transformations are shared by all the Gaussians in each mixture. The random bias in [5] has also been replaced by a neural network to compensate for non-linear distortio...

5 | Probabilistic Optimal Filtering for Robust Speech Recognition, ICASSP
- Neumeyer, Weintraub
- 1994
Citation Context ...or example, in SNR-dependent cepstral normalization (SDCN) [3], cepstral biases for different signal-to-noise ratios are estimated in a maximum likelihood framework. In probabilistic optimum filtering [4], the transformation is a set of multi-dimensional least-squares filters whose outputs are probabilistically combined. These methods, however, rely on the availability of stereo corpora. The requireme...

5 | Online Adaptation of HMMs to Real-Life Conditions: A Unified Framework
- Mokbel
Citation Context ...on data to be effective. As both direct and indirect adaptations have their own strengths and weaknesses, a natural extension is to combine them so that these two approaches can complement each other [14], [15]. Although the above methods have been successful in reducing channel mismatches, most of them operate on the assumption that the channel effect can be approximated by a linear filter. Most tele...

4 | On the use of some divergence measures in speaker recognition
- Vergin, O’Shaughnessy
- 1999
Citation Context ...covered features Fig. 1. Speaker verification system with handset identification, out-of-handset rejection, and handset-dependent feature transformation. where $J(\vec{\alpha}, \vec{r})$ is the Jensen difference [23], [24] between $\vec{\alpha}$ and $\vec{r}$ (whose values will be discussed next) and $\varphi$ is a decision threshold. $J(\vec{\alpha}, \vec{r})$ can be computed as $J(\vec{\alpha}, \vec{r}) = S\left(\frac{\vec{\alpha} + \vec{r}}{2}\right) - \frac{1}{2}\left[S(\vec{\alpha}) + S(\vec{r})\right]$ (10) where $S(\vec{z})$, called the Shan...

3 | Robust speaker verification over the telephone by feature recuperation
- Li, Mak, et al.
Citation Context ...xhibit energy-dependent frequency responses [16] for which a linear filter may be a poor approximation. Recently, this problem has been addressed by considering the distortion as a non-linear mapping [17], [18]. However, these methods rely on the availability of stereo corpora with accurate time alignment. To address the above problems, we have proposed a method in which non-linear transformations can...

3 | A GMM-based handset selector for channel mismatch compensation with applications to speaker identification
- Yiu, Mak, et al.
- 2001
Citation Context ...ransformation is designed to work with a handset selector for robust speaker verification. Some researchers have proposed to use handset selectors for solving the handset identification problem [20], [21], [22]. Most existing handset selectors, however, simply select the most likely handset from a set of known handsets even for speech coming from an ‘unseen’ handset. If a claimant uses a handset that ...