Reducing Audible Spectral Discontinuities, 2001
Cited by 35 (5 self)
In this paper, a common problem in diphone synthesis is discussed, viz., the occurrence of audible discontinuities at diphone boundaries. Informal observations show that spectral mismatch is most likely the cause of this phenomenon. We first set out to find an objective spectral measure for discontinuity. To this end, several spectral distance measures are related to the results of a listening experiment. Then, we studied the feasibility of extending the diphone database with context-sensitive diphones to reduce the occurrence of audible discontinuities. The number of additional diphones is limited by clustering consonant contexts that have a similar effect on the surrounding vowels on the basis of the best performing distance measure. A listening experiment has shown that the addition of these context-sensitive diphones significantly reduces the number of audible discontinuities.
Index Terms: audible discontinuities, context-sensitive diphones, spectral distance measures.
Syrdal, “Perceptual and objective detection of discontinuities in concatenative speech synthesis”
- in Proceedings IEEE Acoustics, Speech, and Signal Processing (ICASSP)
Cited by 30 (2 self)
Concatenative speech synthesis systems attempt to minimize audible signal discontinuities between two successive concatenated units. An objective distance measure which is able to predict audible discontinuities is therefore very important, particularly in unit selection synthesis, for which units are selected from among a large inventory at run time. In this paper, we describe a perceptual test to measure the detection rate of concatenation discontinuity by humans, and then we evaluate 13 different objective distance measures based on their ability to predict the human results. Criteria used to classify these distances include the detection rate, the Bhattacharyya measure of separability of two distributions, and Receiver Operating Characteristic (ROC) curves. Results show that the Kullback-Leibler distance on power spectra has the highest detection rate, followed by the Euclidean distance on Mel-Frequency Cepstral Coefficients (MFCC).
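As a rough illustration of the two best-performing measures named in the abstract above, the sketch below computes a Kullback-Leibler distance between two power spectra and a Euclidean distance between two feature vectors such as MFCCs. The symmetric form of the KL distance, the normalization of each spectrum to unit sum, the `eps` floor, and the function names are all assumptions made for this example; the abstract does not state the exact formulation used in the paper.

```python
import math


def symmetric_kl(p, q, eps=1e-12):
    """Symmetric KL distance between two power spectra (assumed form).

    Each spectrum is normalized to sum to 1 and floored at eps so that
    the logarithm is always defined.
    """
    sum_p, sum_q = sum(p), sum(q)
    p_norm = [max(x / sum_p, eps) for x in p]
    q_norm = [max(x / sum_q, eps) for x in q]
    # KL(p||q) + KL(q||p) simplifies to sum((p - q) * log(p / q)).
    return sum((a - b) * math.log(a / b) for a, b in zip(p_norm, q_norm))


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy power spectra on either side of a hypothetical concatenation point.
left_frame = [4.0, 2.0, 1.0, 1.0]
right_frame = [1.0, 1.0, 2.0, 4.0]

kl_distance = symmetric_kl(left_frame, right_frame)
cepstral_distance = euclidean([0.0, 0.0], [3.0, 4.0])
```

A larger value of either distance would be read as a stronger predicted discontinuity; where to place a decision threshold is exactly what the perceptual test in the paper is meant to calibrate.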
HMM-based strategies for enhancement of speech signals embedded in nonstationary noise
- IEEE Trans. on Speech and Audio Processing, 1998
Cited by 18 (1 self)
An improved hidden Markov model-based (HMM-based) speech enhancement system designed using the minimum mean square error principle is implemented and compared with a conventional spectral subtraction system. The improvements to the system are: 1) incorporation of mixture components in the HMM for noise in order to handle noise nonstationarity in a more flexible manner, 2) two efficient methods in the speech enhancement system design that make the system implementable in real time, and 3) an adaptation method to the noise type in order to accommodate a wide variety of noises expected under the enhancement system’s operating environment. The results of the experiments designed to evaluate the performance of the HMM-based speech enhancement systems in comparison with spectral subtraction are reported. Three types of noise were used to corrupt the test speech signals: white noise, simulated helicopter noise, and multitalker (cocktail party) noise. Both objective (global SNR) and subjective mean opinion score (MOS) evaluations demonstrate consistent superiority of the HMM-based enhancement systems that incorporate the innovations described in this paper over the conventional spectral subtraction method.
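The objective measure mentioned in the abstract above, global SNR, can be sketched as the ratio of clean-signal energy to residual-error energy over the whole utterance. This is the common textbook definition and is an assumption here; the abstract does not spell out the formula, and the function name `global_snr_db` is hypothetical.

```python
import math


def global_snr_db(clean, processed):
    """Global SNR in dB over an entire utterance (assumed definition).

    clean and processed are equal-length, time-aligned sample sequences.
    """
    signal_energy = sum(s * s for s in clean)
    error_energy = sum((s - y) ** 2 for s, y in zip(clean, processed))
    if error_energy == 0.0:
        return math.inf  # processed signal is identical to the clean one
    return 10.0 * math.log10(signal_energy / error_energy)


# Toy example: the processed signal is the clean signal scaled by 0.9.
clean_signal = [1.0, -1.0, 1.0, -1.0]
enhanced_signal = [0.9, -0.9, 0.9, -0.9]
snr_db = global_snr_db(clean_signal, enhanced_signal)
```

Because the measure is computed over the whole signal, loud segments dominate it, which is one reason such evaluations are usually paired with listening tests such as the MOS scores reported in the paper.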
Enhancement and Recognition of Noisy Speech within an Autoregressive Hidden Markov Model Framework Using Noise Estimates from the Noisy Signal
- In Proc. ICASSP, 1997
Cited by 10 (3 self)
This paper describes a new algorithm to enhance and recognise noisy speech when only the noisy signal is available. The system uses autoregressive hidden Markov models (HMMs) to model the clean speech and noise and combines these to form a model for the noisy speech. The probability framework developed is then used to reestimate the noise models from the corrupted speech waveform and the process is repeated. Enhancement is performed using the Wiener filters formed from the final clean speech models and noise estimates. Results are presented for additive stationary Gaussian and coloured noise.
1. INTRODUCTION
The task of speech enhancement has been investigated by many researchers [1, 2, 3, 4]. Much of this work requires estimates of the statistics of the clean speech and the interfering noise. While training databases are available to make models of clean speech, the noise may only be available as part of the noisy signal. Recently, researchers have considered estimating the noise dir...
Quality-Enhanced Voice Morphing Using Maximum Likelihood Transformations
- IEEE Trans. on Speech and Audio Processing, 2006
Cited by 10 (0 self)
Voice morphing is a technique for modifying a source speaker’s speech to sound as if it were spoken by some designated target speaker. The core process in a voice morphing system is the transformation of the spectral envelope of the source speaker to match that of the target speaker, and linear transformations estimated from time-aligned parallel training data are commonly used to achieve this. However, the naive application of envelope transformation combined with the necessary pitch and duration modifications will result in noticeable artifacts. This paper studies the linear transformation approach to voice morphing and investigates these two specific issues. Firstly, a general maximum likelihood framework is proposed for transform estimation which avoids the need for parallel training data inherent in conventional least mean square approaches. Secondly, the main causes of artifacts are identified as being due to glottal coupling, unnatural phase dispersion and the high spectral variance of unvoiced sounds, and compensation techniques are developed to mitigate these. The resulting voice morphing system is evaluated using both subjective and objective measures. These tests show that the proposed approaches are capable of effectively transforming speaker identity whilst maintaining high quality. Furthermore, they do not require carefully prepared parallel training data.
Fast Nearest Neighbor Retrieval for Bregman Divergences
Cited by 10 (1 self)
We present a data structure enabling efficient nearest neighbor (NN) retrieval for Bregman divergences. The family of Bregman divergences includes many popular dissimilarity measures including KL-divergence (relative entropy), Mahalanobis distance, and Itakura-Saito divergence. These divergences present a challenge for efficient NN retrieval because they are not, in general, metrics, for which most NN data structures are designed. The data structure introduced in this work shares the same basic structure as the popular metric ball tree, but employs convexity properties of Bregman divergences in place of the triangle inequality. Experiments demonstrate speedups over brute-force search of up to several orders of magnitude.
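To make the family named in the abstract above concrete, the sketch below evaluates the standard Bregman divergence D_f(x, y) = f(x) - f(y) - <grad f(y), x - y> for three generating functions f, giving the squared Euclidean distance, the (generalized) KL-divergence, and the Itakura-Saito divergence. It illustrates the definition only, not the paper's ball-tree data structure; all function names are hypothetical.

```python
import math


def bregman(f, grad_f, x, y):
    """Bregman divergence D_f(x, y) = f(x) - f(y) - <grad f(y), x - y>."""
    inner = sum(g * (a - b) for g, a, b in zip(grad_f(y), x, y))
    return f(x) - f(y) - inner


# f(x) = sum x_i^2 generates the squared Euclidean distance.
def sq_norm(x):
    return sum(v * v for v in x)

def sq_norm_grad(x):
    return [2.0 * v for v in x]


# f(x) = sum x_i log x_i generates the (generalized) KL-divergence.
def neg_entropy(x):
    return sum(v * math.log(v) for v in x)

def neg_entropy_grad(x):
    return [math.log(v) + 1.0 for v in x]


# f(x) = -sum log x_i generates the Itakura-Saito divergence.
def burg(x):
    return -sum(math.log(v) for v in x)

def burg_grad(x):
    return [-1.0 / v for v in x]


x = [0.5, 0.5]
y = [0.9, 0.1]

sq_xy = bregman(sq_norm, sq_norm_grad, x, y)
kl_xy = bregman(neg_entropy, neg_entropy_grad, x, y)
kl_yx = bregman(neg_entropy, neg_entropy_grad, y, x)
is_xy = bregman(burg, burg_grad, x, y)
```

Here `kl_xy` and `kl_yx` differ, which shows the asymmetry that keeps these divergences from being metrics and rules out pruning based on the triangle inequality.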
Evaluation of Speech Dereverberation Algorithms Using the MARDY Database
Cited by 9 (8 self)
Dereverberation is a growing area of research with many new algorithms appearing in the literature. However, there are still no unanimously accepted tools for evaluation of these algorithms. In this paper, we introduce the Multichannel Acoustic Reverberation Database at York (MARDY) containing real measured multichannel room impulse responses. We demonstrate its use for the evaluation of dereverberation algorithms using three recent multichannel methods. Furthermore, psychoacoustic issues regarding the performance evaluation of dereverberation algorithms are discussed.
Robust Feature-Estimation and Objective Quality Assessment for Noisy Speech Recognition using the Credit Card Corpus, 1994
Cited by 7 (2 self)
It is well known that the introduction of acoustic background distortion into speech causes recognition algorithms to fail. In order to improve the environmental robustness of speech recognition in adverse conditions, a novel constrained-iterative feature-estimation algorithm, which was previously formulated for speech enhancement, is considered and shown to produce improved feature characterization in a variety of actual noise conditions such as computer fan, large crowd, and voice communications channel noise. In addition, an objective-measure-based MAP estimator is formulated as a means of predicting changes in robust recognition performance at the speech feature extraction stage. The four measures considered are (i) NIST SNR, (ii) Itakura-Saito log-likelihood, (iii) log-area-ratio, and (iv) the weighted-spectral slope measure. A continuous-distribution, monophone-based, hidden Markov model recognition algorithm is used for objective-measure-based MAP estimator analysis and reco...
Speech enhancement based on perceptually motivated Bayesian estimators of the magnitude spectrum
- IEEE Trans. Speech Audio Proc., 2005
Cited by 7 (5 self)
The traditional minimum mean-square error (MMSE) estimator of the short-time spectral amplitude is based on the minimization of the Bayesian squared-error cost function. The squared-error cost function, however, is not subjectively meaningful in that it does not necessarily produce estimators that emphasize spectral peak (formant) information or estimators which take into account auditory masking effects. To overcome the shortcomings of the MMSE estimator, we propose in this paper Bayesian estimators of the short-time spectral magnitude of speech based on perceptually motivated cost functions. In particular, we use variants of speech distortion measures, such as the Itakura–Saito and weighted likelihood-ratio distortion measures, which have been used successfully in speech recognition. Three classes of Bayesian estimators of the speech magnitude spectrum are derived. The first class of estimators emphasizes spectral peak information, the second class uses a weighted-Euclidean cost function that implicitly takes into account auditory masking effects, and the third class of estimators is designed to penalize spectral attenuation. Of the three classes of Bayesian estimators, the estimators that implicitly take into account auditory masking effects performed the best in terms of having less residual noise and better speech quality.
Index Terms: minimum mean-square error (MMSE) estimators, perceptually-motivated speech enhancement, speech distortion measures, speech enhancement.

