Results 1 - 10 of 62
Voice Conversion Based on Maximum Likelihood Estimation of Spectral Parameter Trajectory
Cited by 27 (21 self)
In this paper, we describe a novel spectral conversion
Speaker Adaptation for HMM-Based Speech Synthesis System Using MLLR
, 1998
Cited by 18 (8 self)
This paper describes a voice characteristics conversion technique for an HMM-based text-to-speech synthesis system. The system uses phoneme HMMs as the speech synthesis units, and voice characteristics conversion is achieved by changing HMM parameters appropriately. To transform the voice characteristics of synthetic speech to those of the target speaker, we apply an MLLR (Maximum Likelihood Linear Regression) technique, one of the speaker adaptation techniques, to the system. Objective and subjective tests show that the characteristics of the synthetic speech are close to the target speaker's voice, and that the speech generated from the model set adapted using 5 sentences has almost the same DMOS score as that from the speaker-dependent model set.
Statistical Mapping between Articulatory Movements and Acoustic Spectrum Using a Gaussian Mixture Model
, 2007
Cited by 13 (3 self)
In this paper, we describe a statistical approach to both an articulatory-to-acoustic mapping and an acoustic-to-articulatory inversion mapping without using phonetic information. The joint probability density of an articulatory parameter and an acoustic parameter is modeled using a Gaussian mixture model (GMM) based on a parallel acoustic-articulatory speech database. We apply the GMM-based mapping using the minimum mean-square error (MMSE) criterion, which has been proposed for voice conversion, to the two mappings. Moreover, to improve the mapping performance, we apply maximum likelihood estimation (MLE) to the GMM-based mapping method. A target parameter trajectory with appropriate static and dynamic properties is determined by imposing an explicit relationship between static and dynamic features in the MLE-based mapping. Experimental results demonstrate that the MLE-based mapping with dynamic features can significantly improve the mapping performance compared with the MMSE-based mapping in both the articulatory-to-acoustic mapping and the inversion mapping.
High Quality Voice Morphing
, 2004
Cited by 12 (1 self)
Voice morphing is a technique for modifying a source speaker’s speech to sound as if it was spoken by some designated target speaker. Most of the recent approaches to voice morphing apply a linear transformation to the spectral envelope and pitch scaling to modify the prosody. Whilst these methods are effective, they also introduce artifacts arising from the effects of glottal coupling, phase incoherence, unnatural phase dispersion and the high spectral variance of unvoiced sounds. A practical voice morphing system must account for these if high audio quality is to be preserved. This paper describes a complete voice morphing system and the enhancements needed for dealing with the various artifacts, including a novel method for synthesising natural phase dispersion. Each technique is assessed individually and the overall performance of the system evaluated using listening tests. Overall it is found that the enhancements significantly improve speaker identification scores and perceived audio quality.
Radial Basis Function Networks for Conversion of Sound Spectra
- Proc. of the DAFX99 Conf
, 1999
Cited by 10 (1 self)
In many high-level signal processing tasks, such as pitch shifting, voice conversion or sound synthesis, accurate spectral processing is required. Here, the use of Radial Basis Function Networks (RBFN) is proposed for modeling the relationships among sets of spectral envelopes. The identification of such conversion functions is based on a procedure which learns the shape of the conversion from a few pairs of original and target spectra (the training set). The generalization properties of RBFNs provide for interpolation with respect to the pitch range. In the construction of the training set, mel-cepstral encoding of the spectrum is used to capture the perceptually most relevant spectral changes. Moreover, singular value decomposition (SVD) is used to reduce the dimension of conversion functions. The RBFN conversion functions introduced are characterized by a perceptually-based fast training procedure, desirable interpolation properties and computational efficiency.
Quality-Enhanced Voice Morphing Using Maximum Likelihood Transformations
- IEEE TRANS. ON SPEECH AND AUDIO PROCESSING
, 2006
Cited by 10 (0 self)
Voice morphing is a technique for modifying a source speaker’s speech to sound as if it was spoken by some designated target speaker. The core process in a voice morphing system is the transformation of the spectral envelope of the source speaker to match that of the target speaker, and linear transformations estimated from time-aligned parallel training data are commonly used to achieve this. However, the naive application of envelope transformation combined with the necessary pitch and duration modifications will result in noticeable artifacts. This paper studies the linear transformation approach to voice morphing and investigates these two specific issues. Firstly, a general maximum likelihood framework is proposed for transform estimation which avoids the need for parallel training data inherent in conventional least mean square approaches. Secondly, the main causes of artifacts are identified as glottal coupling, unnatural phase dispersion and the high spectral variance of unvoiced sounds, and compensation techniques are developed to mitigate these. The resulting voice morphing system is evaluated using both subjective and objective measures. These tests show that the proposed approaches are capable of effectively transforming speaker identity whilst maintaining high quality. Furthermore, they do not require carefully prepared parallel training data.
TD-PSOLA Versus Harmonic Plus Noise Model in Diphone Based Speech Synthesis
- in Proc. of the International Conf. on Acoustics, Speech, and Signal Processing
, 1998
Cited by 9 (4 self)
In an effort to select a speech representation for our next generation concatenative text-to-speech synthesizer, the use of two candidates is investigated: TD-PSOLA and the Harmonic plus Noise Model, HNM. A formal listening test has been conducted and the two candidates have been rated regarding intelligibility, naturalness and pleasantness. The ability for database compression and the computational load are also discussed. The results show that HNM consistently outperforms TD-PSOLA in all the above features except for computational load. HNM allows for high-quality speech synthesis without smoothing problems at the segmental boundaries and without buzziness or other oddities observed with TD-PSOLA. 1. INTRODUCTION The goal of speech synthesis is to enable a machine to transmit information orally to a user in a man-machine communication context [1]. However, in spite of the long history of speech synthesis, no one speech synthesis system available today is able to produce speech that could be...
Subband Based Voice Conversion
- ICSLP 2002
, 2002
Cited by 9 (2 self)
A new voice conversion method that improves the quality of the voice conversion output at higher sampling rates is proposed. Speaker Transformation Algorithm Using Segmental Codebooks (STASC) is modified to process source and target speech spectra in different subbands. The new method ensures better conversion at sampling rates above 16 kHz. Discrete Wavelet Transform (DWT) is employed for subband decomposition to estimate the speech spectrum better with higher resolution. Faster voice conversion is achieved since the computational complexity decreases at a lower sampling rate. A Voice Conversion System (VCS) is implemented using the proposed algorithm with necessary tools. The performance of the proposed method is demonstrated by both subjective listening tests and applications to film dubbing and looping. In ABX listening tests, the listeners preferred the subband-based output by 92.1% as compared to the full-band-based output.
Nonparallel training for voice conversion based on a parameter adaptation approach
- IEEE Trans. Audio, Speech and Language Processing
, 2006
Cited by 8 (0 self)
Intelligibility of modifications to dysarthric speech
- Proc. of ICASSP 2003
, 2003
Cited by 7 (2 self)
Dysarthria is a motor speech impairment affecting millions of people. Dysarthric speech can be far less intelligible than that of non-dysarthric speakers, causing significant communication difficulties. The goal of this work is to understand the effect that certain modifications have on the intelligibility of dysarthric speech. These modifications are designed to identify aspects of the speech signal or signal processing that may be especially relevant to the effectiveness of a system that transforms dysarthric speech to improve its intelligibility. A result of this study is that dysarthric speech can, in the best case, be modified only at the short-term spectral level to improve intelligibility from 68% to 87%. A baseline transformation system using standard technology, however, does not show improvement in intelligibility. Prosody also has a significant (p < 0.05) effect on intelligibility.

