## Continuous Probabilistic Transform for Voice Conversion (1998)

Venue: IEEE Transactions on Speech and Audio Processing

Citations: 129 (4 self)

### BibTeX

    @ARTICLE{Stylianou98continuousprobabilistic,
      author = {Yannis Stylianou and Eric Moulines},
      title = {Continuous Probabilistic Transform for Voice Conversion},
      journal = {IEEE Transactions on Speech and Audio Processing},
      year = {1998},
      volume = {6},
      pages = {131--142}
    }

### Abstract

Voice conversion, as considered in this paper, is defined as modifying the speech signal of one speaker (source speaker) so that it sounds as if it had been pronounced by a different speaker (target speaker). Our contribution includes the design of a new methodology for representing the relationship between two sets of spectral envelopes. The proposed method is based on the use of a Gaussian mixture model of the source speaker spectral envelopes. The conversion itself is represented by a continuous parametric function which takes into account the probabilistic classification provided by the mixture model. The parameters of the conversion function are estimated by least squares optimization on the training data. This conversion method is implemented in the context of the HNM (harmonic + noise model) system, which allows high-quality modifications of speech signals. Compared to earlier methods based on vector quantization, the proposed conversion scheme results in a much better match between the converted envelopes and the target envelopes. Evaluation by objective tests and formal listening tests shows that the proposed transform greatly improves the quality and naturalness of the converted speech signals compared with previously proposed conversion methods.
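To make the idea in the abstract concrete, the following is a minimal sketch of a conversion function weighted by Gaussian mixture posteriors. It is an illustration under stated assumptions, not the authors' implementation: it assumes diagonal covariances, and it assumes each mixture class contributes an affine map of the source vector. The names `nu` and `gamma` are hypothetical placeholders for the parameters that would be fitted by least squares on time-aligned source and target vectors; that fitting step is not shown.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance at point x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def posteriors(x, weights, means, variances):
    """Posterior probability of each mixture class given x (Bayes' rule).

    The maximum log term is subtracted before exponentiating for stability.
    """
    logs = [
        math.log(w) + log_gauss_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(logs)
    unnorm = [math.exp(value - top) for value in logs]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def convert(x, weights, means, variances, nu, gamma):
    """Posterior-weighted sum of per-class affine maps of x.

    nu[i] and gamma[i] are hypothetical per-class parameters (assumed
    diagonal here); in practice they would be estimated by least squares.
    """
    post = posteriors(x, weights, means, variances)
    y = [0.0] * len(x)
    for p, m, v, n, g in zip(post, means, variances, nu, gamma):
        for d in range(len(x)):
            y[d] += p * (n[d] + g[d] / v[d] * (x[d] - m[d]))
    return y


# Toy two-class model in two dimensions (all values are made up).
weights = [0.5, 0.5]
means = [[0.0, 0.0], [4.0, 4.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
nu = [[1.0, 1.0], [5.0, 5.0]]
gamma = [[1.0, 1.0], [1.0, 1.0]]
converted = convert([2.0, 2.0], weights, means, variances, nu, gamma)
```

Because the class weights are posteriors rather than hard assignments, the output varies smoothly as the source vector moves between classes, which is the contrast with a vector-quantization codebook that the abstract draws.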

### Citations

8143 | Maximum likelihood from incomplete data via the EM algorithm
- Dempster, Laird, et al.
- 1977
Citation Context ...on of Bayes’ rule [8] as Substituting (2) in (3) yields the classic expression The parameters of the GMM are estimated from the set of source vectors using the expectation-maximization (EM) algorithm =-=[6]-=-. The EM algorithm iteratively increases the likelihood of the model parameters by successive maximizations of an intermediate quantity which, in the case of a GMM, is entirely defined by the conditio... |

3928 |
Pattern Classification and Scene Analysis
- Duda, Hart
- 1973
Citation Context ...e source vectors under the form of a continuous probability distribution provided by a GMM. A. Gaussian Mixture Model The GMM is a classic parametric model used in many pattern recognition techniques =-=[8]-=- whose efficiency for text-independent speaker recognition has been illustrated by recent studies [39], [40], [47]. The GMM assumes that the probability distribution of the observed parameters takes th... |

1495 |
Fundamentals of Speech Recognition
- Rabiner, Juang
- 1993
Citation Context ... of the observations (in our case the time index ) is believed to be irrelevant. The GMM can thus be thought of as a simplified hidden Markov model (HMM) with Gaussian state-conditional distributions =-=[36]-=- in which all states are connected (ergodic model) and all the transition probabilities leading to a given state are equal. In our case, the choice of the GMM is justified because we are interested in... |

1219 |
An algorithm for vector quantizer design
- Linde, Buzo, et al.
- 1980
Citation Context ...r all of the clean acoustic space. In [33] optimum probabilistic filtering has been used to map noisy speech features to clean features; the clean feature space is quantized using the Lloyd algorithm =-=[26]-=- and a conditional error is minimized in each VQ region. The method described in this paper is inspired by the mapping codebook approach and attempts to convert the whole spectral envelope without ext... |

808 |
Fundamentals of Statistical Signal Processing: Estimation Theory
- Kay
- 1993
Citation Context ...umed that the source vectors follow a Gaussian distribution and that the source and target vectors are jointly Gaussian, the minimum mean square error (MMSE) estimate of the target vector is given by =-=[22]-=-, [5] where denotes expectation, and and are, respectively, the mean target vector and the cross-covariance matrix of the source and target vectors where the superscript denotes transposition. In the ... |

508 | Mixture Densities, Maximum Likelihood and the EM Algorithm - Redner, Walker - 1984 |

383 |
Robust text-independent speaker identification using Gaussian mixture speaker models
- Reynolds, Rose
- 1995
Citation Context ...ian Mixture Model The GMM is a classic parametric model used in many pattern recognition techniques [8] whose efficiency for text-independent speaker recognition has been illustrated by recent studies =-=[39]-=-, [40], [47]. The GMM assumes that the probability distribution of the observed parameters takes the following parametric form [8], [39] where denotes the -dimensional normal distribution with mean ve... |

327 |
On the convergence properties of the EM algorithm
- Wu
- 1983
Citation Context ...9]. An important implementation issue associated with the EM algorithm is its initialization. The EM algorithm is only guaranteed to converge toward a stationary point of the likelihood function [6], =-=[49]-=-. In practice, the initialization of the EM algorithm affects its convergence rate but can also modify the final estimate [37]. For GMM speaker models with diagonal covariance matrices, it was found i... |

205 |
Finite Mixture Distributions
- Everitt, Hand
- 1981
Citation Context ...ons of an intermediate quantity which, in the case of a GMM, is entirely defined by the conditional probabilities of (4). The EM reestimation formulas in the case of Gaussian mixtures can be found in =-=[10]-=- or [39]. An important implementation issue associated with the EM algorithm is its initialization. The EM algorithm is only guaranteed to converge toward a stationary point of the likelihood function... |

121 |
Introduction to multivariate analysis
- Chatfield, Collins
- 1980
Citation Context ...hat the source vectors follow a Gaussian distribution and that the source and target vectors are jointly Gaussian, the minimum mean square error (MMSE) estimate of the target vector is given by [22], =-=[5]-=- where denotes expectation, and and are, respectively, the mean target vector and the cross-covariance matrix of the source and target vectors where the superscript denotes transposition. In the joint... |

85 |
Text-Independent Speaker Identification
- Gish, Schmidt
- 1994
Citation Context ...recognition techniques are based on the characterization of the statistical distribution of [p. 132, IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 6, NO. 2, MARCH 1998] the spectral envelopes [7], =-=[14]-=-, [41]. It is generally admitted that the overall shape of the envelope together with the formant characteristics are the major speaker-identifying features of the spectral envelope [15], [17], [23]. ... |

79 |
Voice Conversion Through Vector Quantization
- Abe, Nakamura, et al.
- 1988
Citation Context ...teristics, we will refer to the control of the spectral envelope as spectral conversion. One of the earliest approaches to the spectral conversion problem is the mapping codebook method of Abe et al. =-=[1]-=-, [2], which was originally introduced by Shikano et al. for speaker adaptation [43]. In this approach, a clustering procedure—vector quantization (VQ) is applied to the spectral parameters of both th... |

64 |
A Gaussian mixture modeling approach to text-independent speaker identification
- Reynolds
- 1992
Citation Context ...In practice, the initialization of the EM algorithm affects its convergence rate but can also modify the final estimate [37]. For GMM speaker models with diagonal covariance matrices, it was found in =-=[38]-=-, and [39] that the initialization of the EM algorithm only has a small influence. In the present work, the GMM parameters are initialized by use of a standard binary splitting VQ procedure [36]: the ... |

53 |
Analytical expressions for critical-band rate and critical bandwidth as a function of frequency
- Zwicker, Terhardt
- 1980
Citation Context ... of the harmonics determined by the HNM analysis are expressed in the log domain. 2) The frequencies of the harmonics are converted to a Bark frequency scale using the analytical formulas reported in =-=[50]-=-. The obtained values are normalized in order to ensure that the upper limit of the band (4 kHz) corresponds to a value of on the normalized warped frequency axis. 3) The real cepstrum parameters that... |

51 |
Harmonic plus noise models for speech, combined with statistical methods, for speech and speaker modification
- Stylianou
- 1996
Citation Context ...applying the DTW procedure between the converted envelopes and the target envelopes. The spectral conversion method is tested on speech signals analyzed by the harmonic noise model system (HNM) [24], =-=[44]-=-, [46]. The HNM system performs a time-varying harmonic plus (modulated) noise decomposition which allows for spectral transformations and for time and pitch modifications. The spectral envelope is de... |

43 |
Probabilistic optimum filtering for robust speech recognition
- Neumeyer, Weintraub
- 1994
Citation Context ... [29], [48]. Spectral conversion techniques have been also proposed for speaker/environment adaptation that map speech features of the same speaker between clean and noisy acoustic spaces [16], [30], =-=[33]-=-. In [30], noisy references have been simulated by transforming clean utterances using the linear multiple regression (LMR) algorithm with one translation vector and one rotation matrix for all of the... |

34 |
Speaker recognition—identifying people by their voice
- Doddington
- 1986
Citation Context ...aker recognition techniques are based on the characterization of the statistical distribution of the spectral envelopes =-=[7]-=-, [14], [41]. It is generally admitted that the overall shape of the envelope together with the formant characteristics are the major speaker-identifying features of the spectral envelope [15], [17], ... |

33 |
HNS: Speech modification based on a harmonic + noise model
- Laroche, Stylianou, et al.
- 1993
Citation Context ... by reapplying the DTW procedure between the converted envelopes and the target envelopes. The spectral conversion method is tested on speech signals analyzed by the harmonic noise model system (HNM) =-=[24]-=-, [44], [46]. The HNM system performs a time-varying harmonic plus (modulated) noise decomposition which allows for spectral transformations and for time and pitch modifications. The spectral envelope... |

32 |
Research on individuality features in speech waves and automatic speaker recognition techniques
- Furui
- 1986
Citation Context ...ng these factors, suprasegmental speech characteristics such as the speaking rate, the pitch contour or the duration of the pauses have been shown to contribute greatly to speaker individuality [17], =-=[12]-=-, [21], [42]. In many cases, it also appears that specific characteristics of the perceived voice are influenced by the linguistic style of the speech [9], [17]. In the current state of our knowledge,... |

30 |
Voice transformation using PSOLA technique
- Valbret, Moulines, et al.
- 1992
Citation Context ... suggest that a possible way to improve the quality of the converted speech consists of modifying only some specific aspects of the spectral envelope, such as the location of its formants [28], [29], =-=[48]-=-. Spectral conversion techniques have been also proposed for speaker/environment adaptation that map speech features of the same speaker between clean and noisy acoustic spaces [16], [30], [33]. In [3... |

29 |
Acoustic characteristics of speaker individuality: Control and conversion, Speech Communication 16(2)
- Kuwabara, Sagisaka
- 1995
Citation Context ...iori. Fortunately, it turns out that the average values of these features (average pitch frequency, overall speech dynamics) already carry a great deal of the speaker-specific information [12], [17], =-=[23]-=-, [42]. There is also strong evidence that distinct speakers can be efficiently discriminated at the segmental level by comparing their respective spectral envelopes [12], [18]. Accordingly, most curr... |

24 |
Recent Research in Automatic Speaker Recognition
- Soong
- 1992
Citation Context ...ition techniques are based on the characterization of the statistical distribution of the spectral envelopes [7], [14], =-=[41]-=-. It is generally admitted that the overall shape of the envelope together with the formant characteristics are the major speaker-identifying features of the spectral envelope [15], [17], [23]. Howeve... |

23 |
Selection of Acoustic Features for Speaker Identification
- Sambur
- 1975
Citation Context ...tors, suprasegmental speech characteristics such as the speaking rate, the pitch contour or the duration of the pauses have been shown to contribute greatly to speaker individuality [17], [12], [21], =-=[42]-=-. In many cases, it also appears that specific characteristics of the perceived voice are influenced by the linguistic style of the speech [9], [17]. In the current state of our knowledge, the process... |

16 |
Voice spectrograms as a function of age, voice disguise, and voice imitation
- Endres, Bambach, et al.
- 1971
Citation Context ...ibute greatly to speaker individuality [17], [12], [21], [42]. In many cases, it also appears that specific characteristics of the perceived voice are influenced by the linguistic style of the speech =-=[9]-=-, [17]. In the current state of our knowledge, the processing of such features of speech by an automatic system is difficult because high-level considerations are involved. In particular, the fact tha... |

16 |
Solving least squares problems. Englewood Cliffs
- Lawson, Hanson
- 1974
Citation Context ...he two matrices and are the unknown parameters of the conversion function. The form of (11) is that of a standard least-squares problem whose solution is given by the normal equations =-=[25]-=-, [22] (14) or (15). The matrix that is to be inverted [leftmost matrix in (15)] is symmetric and positive definite so that the normal equations can be solved using the Cholesky decom... |

15 |
Speech spectrum conversion based on speaker interpolation and multi-functional representation with weighting by radial basis function networks
- Iwahashi, Sagisaka
- 1995
Citation Context ..., although it provides voice conversion effect which is sometimes impressive, is plagued by its poor quality and its lack of robustness [29]. The spectral interpolation approach described in [19] and =-=[20]-=- solves these problems by interpolating between the spectra of several speakers to determine the converted spectrum. However, the practical use of this method is limited by the fact that it requires t... |

15 |
Speaker adaptation through vector quantisation
- Shikano, Lee, et al.
- 1986
Citation Context ...sion. One of the earliest approaches to the spectral conversion problem is the mapping codebook method of Abe et al. [1], [2], which was originally introduced by Shikano et al. for speaker adaptation =-=[43]-=-. In this approach, a clustering procedure—vector quantization (VQ) is applied to the spectral parameters of both the source and the target speakers. The two resulting VQ codebooks are used to obtain ... |

14 | Text Independent Speaker Identification using Automatic Acoustic Segmentation - Rose - 1990 |

13 |
Voice conversion: state of the art and perspectives
- Moulines, Sagisaka
- 1995
Citation Context ... existing speaker (the so-called target speaker). This problem—how to modify the speech of one speaker so that it sounds as if it was uttered by another speaker—is generally known as voice conversion =-=[32]-=-. In daily life, the individuality of voices is useful because it enables us to differentiate between speakers. If all voices sounded alike it would, for instance, be almost impossible to follow a rad... |

12 |
Regularized estimation of cepstrum envelope from discrete frequency points
- Cappe, Laroche, et al.
- 1995
Citation Context ... for spectral transformations and for time and pitch modifications. The spectral envelope is determined from the parameters of the HNM model by application of the regularized discrete cepstrum method =-=[3]-=-, [4], using a warped Bark frequency scale. This technique makes it possible to obtain a representation of the signal spectrum that is accurate enough to allow a resynthesis of transparent quality wit... |

12 |
Regularization techniques for discrete cepstrum estimation
- Cappé, Moulines
- 1996
Citation Context ...spectral transformations and for time and pitch modifications. The spectral envelope is determined from the parameters of the HNM model by application of the regularized discrete cepstrum method [3], =-=[4]-=-, using a warped Bark frequency scale. This technique makes it possible to obtain a representation of the signal spectrum that is accurate enough to allow a resynthesis of transparent quality with a n... |

8 |
Continuous Probabilistic Acoustic MAP for Speaker Recognition
- Tseng, Soong
- 1992
Citation Context ...Model The GMM is a classic parametric model used in many pattern recognition techniques [8] whose efficiency for text-independent speaker recognition has been illustrated by recent studies [39], [40], =-=[47]-=-. The GMM assumes that the probability distribution of the observed parameters takes the following parametric form [8], [39] where denotes the -dimensional normal distribution with mean vector and cov... |

6 |
Energy Onset Times for Speaker Identification
- Quatieri, Jankowski, et al.
- 1994
Citation Context ...s features to the individuality of the speaker’s voice. Recent studies suggest that some effective speaker-specific features can also be extracted directly from the speech waveform in the time domain =-=[35]-=-. In this paper, we focus on the control of the spectral envelope characteristics at the segmental level. More specifically, our aim is to represent by an appropriate model, trained from experimental ... |

5 |
Speaker-identifying features based on formant tracks
- Goldstein
- 1976
Citation Context ...nvelopes [7], [14], [41]. It is generally admitted that the overall shape of the envelope together with the formant characteristics are the major speaker-identifying features of the spectral envelope =-=[15]-=-, [17], [23]. However, some uncertainty remains about the respective contributions of these acoustics features to the individuality of the speaker’s voice. Recent studies suggest that some effective s... |

4 |
On the asymptotic statistical behavior of empirical cepstral coefficients
- Merhav, Lee
- 1993
Citation Context ...ated with this kind of model [39], [47]. In the case of cepstral parameters, this modification is believed to be appropriate since the correlation between distinct cepstral coefficients is very small =-=[27]-=-, [36]. In our case, the computational load associated with the training of the conversion function is reduced when both the covariance matrices of the GMM and the conversion matrices are constrained ... |

3 |
Generalized functional approximation for source-filter system modeling
- Galas, Rodet
- 1991
Citation Context ...obtained are similar to the usual MFCC’s [36] except for the fact that they are obtained from the minimization of a discrete set of frequency constraints. Such parameters were originally mentioned in =-=[13]-=- as discrete MFCC’s and are known to provide a better envelope fit (at the specified frequency points) than LPC-based methods [3]. The synthetic signals obtained by use of the envelope representation ... |

3 |
Effects of Acoustical Feature Parameters on Perceptual Speaker Identity
- Itoh, Saito
- 1988
Citation Context ...information [12], [17], [23], [42]. There is also strong evidence that distinct speakers can be efficiently discriminated at the segmental level by comparing their respective spectral envelopes [12], =-=[18]-=-. Accordingly, most current speaker recognition techniques are based on the characterization of the statistical distribution of... |

3 |
Speech spectrum transformation by speaker interpolation
- Iwahashi, Sagisaka
- 1994
Citation Context ... approach, although it provides voice conversion effect which is sometimes impressive, is plagued by its poor quality and its lack of robustness [29]. The spectral interpolation approach described in =-=[19]-=- and [20] solves these problems by interpolating between the spectra of several speakers to determine the converted spectrum. However, the practical use of this method is limited by the fact that it r... |

3 |
Voice conversion based on piecewise linear conversion rules of formant frequency and spectrum tilt
- Mizuno, Abe
- 1994
Citation Context ...recent works suggest that a possible way to improve the quality of the converted speech consists of modifying only some specific aspects of the spectral envelope, such as the location of its formants =-=[28]-=-, [29], [48]. Spectral conversion techniques have been also proposed for speaker/environment adaptation that map speech features of the same speaker between clean and noisy acoustic spaces [16], [30],... |

3 |
Speech recognition in adverse environments: speech enhancement and spectral transformations
- Mokbel, Chollet
- 1991
Citation Context ... [28], [29], [48]. Spectral conversion techniques have been also proposed for speaker/environment adaptation that map speech features of the same speaker between clean and noisy acoustic spaces [16], =-=[30]-=-, [33]. In [30], noisy references have been simulated by transforming clean utterances using the linear multiple regression (LMR) algorithm with one translation vector and one rotation matrix for all ... |

2 |
Speaker identification utilizing selected temporal speech features
- Johnson, Hollien, et al.
- 1984
Citation Context ...se factors, suprasegmental speech characteristics such as the speaking rate, the pitch contour or the duration of the pauses have been shown to contribute greatly to speaker individuality [17], [12], =-=[21]-=-, [42]. In many cases, it also appears that specific characteristics of the perceived voice are influenced by the linguistic style of the speech [9], [17]. In the current state of our knowledge, the p... |

1 |
Voice conversion through vector quantization
- 1990
Citation Context ...tics, we will refer to the control of the spectral envelope as spectral conversion. One of the earliest approaches to the spectral conversion problem is the mapping codebook method of Abe et al. [1], =-=[2]-=-, which was originally introduced by Shikano et al. for speaker adaptation [43]. In this approach, a clustering procedure—vector quantization (VQ) is applied to the spectral parameters of both the sou... |

1 |
Iterative normalization for speaker-adaptive training in continuous speech recognition
- Feng, Kubala, et al.
Citation Context ...g the time-alignment between the converted envelopes and the target envelopes. Iterative procedures have also been used in the literature for speaker-adaptive training in continuous speech recognition =-=[11]-=-. These optional “incremental learning” steps are only intended to refine the time alignment path. The GMM estimation and the least squares (LS) optimization are of course always performed using the s... |

1 |
Speaker normalization via a linear transformation on a perceptual feature space and its benefits
- Gu, Mason
- 1989
Citation Context ...rmants [28], [29], [48]. Spectral conversion techniques have been also proposed for speaker/environment adaptation that map speech features of the same speaker between clean and noisy acoustic spaces =-=[16]-=-, [30], [33]. In [30], noisy references have been simulated by transforming clean utterances using the linear multiple regression (LMR) algorithm with one translation vector and one rotation matrix fo... |

1 |
The Acoustics of Crime—The New Science of Forensic Phonetics
- Hollien
- 1990
Citation Context ...s. Among these factors, suprasegmental speech characteristics such as the speaking rate, the pitch contour or the duration of the pauses have been shown to contribute greatly to speaker individuality =-=[17]-=-, [12], [21], [42]. In many cases, it also appears that specific characteristics of the perceived voice are influenced by the linguistic style of the speech [9], [17]. In the current state of our know... |

1 |
Voice conversion algorithm based on piecewise linear conversion rule of formant frequency and spectrum tilt
- 1995
Citation Context ...y VQ [23]. Most authors agree that the mapping codebook approach, although it provides voice conversion effect which is sometimes impressive, is plagued by its poor quality and its lack of robustness =-=[29]-=-. The spectral interpolation approach described in [19] and [20] solves these problems by interpolating between the spectra of several speakers to determine the converted spectrum. However, the practi... |

1 |
Techniques for pitch-scale and time-scale transformation of speech, Part I: Nonparametric methods
- Moulines, Laroche
- 1995
Citation Context ... this mode enables higher quality time-scale and pitch-scale modifications [46]. These modifications are PSOLA-like in that they mostly consist in recomputing the pitch-synchronous synthesis instants =-=[31]-=-. However, an important difference with the usual nonparametric TD-PSOLA (time domain pitch-synchronous overlap-add) processing is that the amplitude of the harmonics are computed explicitly using the... |

1 |
A pitch and maximum voiced frequency estimation technique adapted to harmonic models of speech
- 1996
Citation Context ... divided into two bands delimited by the so-called maximum voiced frequency. Both the pitch of the signal and the maximum voiced frequency are determined beforehand using a time-domain pitch detector =-=[45]-=-. The lower band of the spectrum (below the maximum voiced frequency) is represented solely by harmonically related sine waves. The upper band is modeled as a noise component modulated by a time-domai... |