Results 1 - 10 of 199
Statistical parametric speech synthesis
In Proc. ICASSP, 2007
Cited by 179 (18 self)
This paper gives a general overview of techniques in statistical parametric speech synthesis. One instance of these techniques, called HMM-based generation synthesis (or simply HMM-based synthesis), has recently been shown to be very effective in generating acceptable synthetic speech. The paper also contrasts these techniques with the more conventional unit-selection technology that has dominated speech synthesis over the last ten years. Advantages and disadvantages of statistical parametric synthesis are highlighted, and the areas where we expect the key developments to appear in the immediate future are identified. Index Terms: speech synthesis, hidden Markov models.
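For readers unfamiliar with HMM-based generation synthesis, the standard maximum-likelihood parameter generation step (a textbook formulation, not quoted from this abstract) can be written as

    \hat{c} = \arg\max_{c}\; \mathcal{N}\!\left(Wc \mid \mu, \Sigma\right)
            = \left(W^{\top}\Sigma^{-1}W\right)^{-1} W^{\top}\Sigma^{-1}\mu,

where c is the static feature trajectory, W is the matrix that appends dynamic (delta) features, and μ and Σ are the state-aligned HMM means and covariances.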
Voice Conversion Based on Maximum Likelihood Estimation of Spectral Parameter Trajectory
"... In this paper, we describe a novel spectral conversion ..."
Abstract
-
Cited by 113 (34 self)
- Add to MetaCart
In this paper, we describe a novel spectral conversion
Statistical Mapping between Articulatory Movements and Acoustic Spectrum Using a Gaussian Mixture Model
2007
Cited by 62 (5 self)
In this paper, we describe a statistical approach to both an articulatory-to-acoustic mapping and an acoustic-to-articulatory inversion mapping without using phonetic information. The joint probability density of an articulatory parameter and an acoustic parameter is modeled using a Gaussian mixture model (GMM) based on a parallel acoustic-articulatory speech database. We apply the GMM-based mapping using the minimum mean-square error (MMSE) criterion, which has been proposed for voice conversion, to the two mappings. Moreover, to improve the mapping performance, we apply maximum likelihood estimation (MLE) to the GMM-based mapping method. A target parameter trajectory with appropriate static and dynamic properties is determined by imposing an explicit relationship between static and dynamic features in the MLE-based mapping. Experimental results demonstrate that the MLE-based mapping with dynamic features can significantly improve the mapping performance compared with the MMSE-based mapping in both the articulatory-to-acoustic mapping and the inversion mapping.
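A minimal sketch of the GMM-based MMSE mapping stage described above, assuming a joint GMM trained on stacked articulatory/acoustic frames; function and variable names are illustrative, and the MLE trajectory step with dynamic features is omitted:

    import numpy as np
    from scipy.stats import multivariate_normal
    from sklearn.mixture import GaussianMixture

    def train_joint_gmm(X, Y, n_components=32):
        # Fit a GMM on stacked [articulatory, acoustic] frames from parallel data.
        Z = np.hstack([X, Y])                      # shape (frames, dx + dy)
        gmm = GaussianMixture(n_components=n_components, covariance_type="full")
        return gmm.fit(Z)

    def mmse_map(gmm, X, dx):
        # MMSE mapping: E[y | x] under the joint GMM, mixed over mixture posteriors.
        mu, S, w = gmm.means_, gmm.covariances_, gmm.weights_
        M = len(w)

        # Posterior p(m | x) from the marginal GMM over the input features x.
        log_px = np.stack(
            [multivariate_normal.logpdf(X, mean=mu[m, :dx], cov=S[m, :dx, :dx])
             for m in range(M)], axis=1)           # (frames, M)
        log_post = np.log(w) + log_px
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)

        # Mix the per-component conditional means E[y | x, m].
        Y_hat = np.zeros((X.shape[0], mu.shape[1] - dx))
        for m in range(M):
            A = S[m, dx:, :dx] @ np.linalg.inv(S[m, :dx, :dx])   # Sigma_yx Sigma_xx^-1
            Y_hat += post[:, [m]] * (mu[m, dx:] + (X - mu[m, :dx]) @ A.T)
        return Y_hat

The MLE-based extension would replace this frame-wise conditional expectation with a trajectory-level optimization that couples static and delta features.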
Speaker Adaptation for HMM-Based Speech Synthesis System Using MLLR
1998
Cited by 35 (15 self)
This paper describes a voice characteristics conversion technique for an HMM-based text-to-speech synthesis system. The system uses phoneme HMMs as the speech synthesis units, and voice characteristics conversion is achieved by changing the HMM parameters appropriately. To transform the voice characteristics of the synthetic speech to those of the target speaker, we apply MLLR (Maximum Likelihood Linear Regression), one of the speaker adaptation techniques, to the system. Results of objective and subjective tests show that the characteristics of the synthetic speech are close to the target speaker's voice, and that the speech generated from the model set adapted using 5 sentences has almost the same DMOS score as that from the speaker-dependent model set.
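For reference, MLLR mean adaptation (a standard formulation, not text from the paper) applies a shared affine transform to the Gaussian mean vectors:

    \hat{\mu}_m = A\,\mu_m + b = W\,\xi_m, \qquad \xi_m = [\,1,\ \mu_m^{\top}\,]^{\top},

where W = [b, A] is tied across a regression class of Gaussians and estimated by maximizing the likelihood of the target speaker's adaptation data.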
High Quality Voice Morphing
2004
Cited by 25 (1 self)
Voice morphing is a technique for modifying a source speaker's speech to sound as if it were spoken by some designated target speaker. Most recent approaches to voice morphing apply a linear transformation to the spectral envelope and use pitch scaling to modify the prosody. Whilst these methods are effective, they also introduce artifacts arising from the effects of glottal coupling, phase incoherence, unnatural phase dispersion and the high spectral variance of unvoiced sounds. A practical voice morphing system must account for these if high audio quality is to be preserved. This paper describes a complete voice morphing system and the enhancements needed to deal with the various artifacts, including a novel method for synthesising natural phase dispersion. Each technique is assessed individually, and the overall performance of the system is evaluated using listening tests. Overall, it is found that the enhancements significantly improve speaker identification scores and perceived audio quality.
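The abstract does not spell out the pitch-scaling rule; a common hedged sketch is a Gaussian-normalized transform of log F0 from source to target statistics (names below are illustrative, not the paper's implementation):

    import numpy as np

    def convert_f0(f0_src, src_stats, tgt_stats):
        # Map source F0 (Hz) into the target speaker's range by matching the
        # mean and standard deviation of log F0 over voiced frames.
        mu_s, sd_s = src_stats
        mu_t, sd_t = tgt_stats
        f0_out = np.zeros_like(f0_src, dtype=float)
        voiced = f0_src > 0                        # unvoiced frames stay at 0
        lf0 = np.log(f0_src[voiced])
        f0_out[voiced] = np.exp((lf0 - mu_s) / sd_s * sd_t + mu_t)
        return f0_out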
Vulnerability of Speaker Verification Systems Against Voice Conversion Spoofing Attacks: the Case of Telephone Speech
"... Voice conversion – the methodology of automatically converting one’s utterances to sound as if spoken by another speaker – presents a threat for applications relying on speaker verification. We study vulnerability of text-independent speaker verification systems against voice conversion attacks usin ..."
Abstract
-
Cited by 23 (10 self)
- Add to MetaCart
(Show Context)
Voice conversion, the methodology of automatically converting one's utterances to sound as if spoken by another speaker, presents a threat to applications relying on speaker verification. We study the vulnerability of text-independent speaker verification systems to voice conversion attacks using telephone speech. We implemented a voice conversion system with two types of features and non-parallel frame alignment methods, and five speaker verification systems ranging from simple Gaussian mixture models (GMMs) to a state-of-the-art joint factor analysis (JFA) recognizer. Experiments on a subset of the NIST 2006 SRE corpus indicate that the JFA method is the most resilient against conversion attacks, but even it experiences a more than five-fold increase in the false acceptance rate, from 3.24% to 17.33%. Index Terms: speaker verification, voice conversion, security.
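To make the reported numbers concrete, the false acceptance rate is simply the fraction of impostor trials, here converted-voice trials, whose score exceeds the verification threshold; a small illustrative helper (not from the paper):

    import numpy as np

    def false_acceptance_rate(impostor_scores, threshold):
        # Fraction of impostor (e.g. converted-voice) trials accepted
        # at a fixed verification threshold.
        scores = np.asarray(impostor_scores, dtype=float)
        return float(np.mean(scores >= threshold))

    # A jump from 0.0324 to 0.1733 corresponds to the roughly five-fold increase reported above.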
Subband Based Voice Conversion
In Proc. ICSLP, 2002
Cited by 20 (5 self)
A new voice conversion method that improves the quality of the voice conversion output at higher sampling rates is proposed. The Speaker Transformation Algorithm Using Segmental Codebooks (STASC) is modified to process source and target speech spectra in different subbands. The new method ensures better conversion at sampling rates above 16 kHz. The Discrete Wavelet Transform (DWT) is employed for subband decomposition so that the speech spectrum can be estimated with higher resolution. Faster voice conversion is achieved since the computational complexity decreases at a lower sampling rate. A voice conversion system (VCS) is implemented using the proposed algorithm together with the necessary tools. The performance of the proposed method is demonstrated by both subjective listening tests and applications to film dubbing and looping. In ABX listening tests, the listeners preferred the subband-based output over the full-band output in 92.1% of cases.
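A minimal sketch of the subband idea, assuming PyWavelets for the DWT and a placeholder convert_band standing in for the STASC-style per-band conversion, which the abstract does not detail:

    import pywt

    def subband_voice_conversion(x, convert_band, wavelet="db8", level=3):
        # Decompose speech with a DWT, convert each subband, and resynthesize.
        coeffs = pywt.wavedec(x, wavelet, level=level)   # [cA_L, cD_L, ..., cD_1]
        converted = [convert_band(band, c) for band, c in enumerate(coeffs)]
        return pywt.waverec(converted, wavelet)

    # Example: an identity "conversion" approximately reconstructs the input.
    # y = subband_voice_conversion(x, lambda band, c: c)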
Text-independent voice conversion based on unit selection
In Proc. ICASSP, 2006
Cited by 16 (1 self)
So far, most voice conversion training procedures have been text-dependent, i.e., based on parallel training utterances from the source and target speakers. Since several applications (e.g. speech-to-speech translation or dubbing) require text-independent training, training techniques that use non-parallel data have been proposed over the last two years. In this paper, we present a new approach that applies unit selection to find corresponding time frames in the source and target speech. By means of a subjective experiment, it is shown that this technique achieves the same performance as conventional text-dependent training.
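A hedged sketch of frame-level unit selection for non-parallel alignment: each source frame selects a target frame by trading off a target cost (spectral distance) against a concatenation cost (continuity with the previous selection). The greedy search below is a simplification, not the paper's exact algorithm, and all names are illustrative:

    import numpy as np
    from scipy.spatial.distance import cdist

    def align_frames(src, tgt, w_concat=0.3):
        # Greedy unit selection over frames: for each source frame pick the target
        # frame minimizing a target cost (spectral distance) plus a concatenation
        # cost (distance to the previously selected target frame).
        target_cost = cdist(src, tgt)              # (n_src, n_tgt)
        selected = [int(np.argmin(target_cost[0]))]
        for i in range(1, len(src)):
            concat_cost = np.linalg.norm(tgt - tgt[selected[-1]], axis=1)
            selected.append(int(np.argmin(target_cost[i] + w_concat * concat_cost)))
        return np.asarray(selected)                # pseudo-parallel target frame indices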
Quality-enhanced voice morphing using maximum likelihood transformations
IEEE Trans. Audio, Speech, Lang. Process., 2006
Voice Transformations: From Speech Synthesis to Mammalian Vocalizations
In Proc. EUROSPEECH, 2001
Cited by 15 (2 self)
This paper describes a phase-vocoder-based technique for voice transformation. The method provides a flexible way to manipulate various aspects of the input signal, e.g., fundamental frequency of voicing, duration, energy, and formant positions, without explicit F0 extraction. The modifications to the signal can be specific to any feature dimensions, and can vary dynamically over time. There are many potential applications for this technique. In concatenative speech synthesis, the method can be applied to transform the speech corpus to different voice characteristics, or to smooth pitch or formant discontinuities at concatenation boundaries. The method can also be used as a tool for language learning: we can modify the prosody of a student's own speech to match that of a native speaker, and use the result as guidance for improvement. The technique can also be used to convert other biological signals, such as killer whale vocalizations, into a signal that is more appropriate for human auditory perception. Our initial experiments show encouraging results for all of these applications.
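A minimal sketch of a phase-vocoder manipulation in this spirit, using librosa's STFT-domain phase vocoder for time-scale modification; the paper's own method also modifies F0, energy, and formants, which are not shown here:

    import librosa
    import soundfile as sf

    def time_stretch(path_in, path_out, rate=0.8):
        # Slow down (rate < 1) or speed up (rate > 1) speech without changing pitch.
        y, sr = librosa.load(path_in, sr=None)
        D = librosa.stft(y)                          # complex spectrogram
        D_mod = librosa.phase_vocoder(D, rate=rate)  # phase-coherent time scaling
        sf.write(path_out, librosa.istft(D_mod), sr)

    # time_stretch("student.wav", "student_slow.wav", rate=0.8)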