UNIT SELECTION IN A CONCATENATIVE SPEECH SYNTHESIS SYSTEM USING A LARGE SPEECH DATABASE
, 1996
Cited by 227 (24 self)
One approach to the generation of natural-sounding synthesized speech waveforms is to select and concatenate units from a large speech database. Units (in the current work, phonemes) are selected to produce a natural realisation of a target phoneme sequence predicted from text which is annotated with prosodic and phonetic context information. We propose that the units in a synthesis database can be considered as a state transition network in which the state occupancy cost is the distance between a database unit and a target, and the transition cost is an estimate of the quality of concatenation of two consecutive units. This framework has many similarities to HMM-based speech recognition. A pruned Viterbi search is used to select the best units for synthesis from the database. This approach to waveform synthesis permits training from natural speech: two methods for training from speech are presented which provide weights which produce more natural speech than can be obtained by hand-tuning.
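The search described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function name `select_units`, the `beam` parameter, and the toy pitch-based cost functions are all assumptions made for the example. Each candidate unit pays a target cost (the state occupancy cost) and each pair of consecutive units pays a join cost (the transition cost); a Viterbi pass with optional beam pruning finds the cheapest sequence.

```python
def _prune(layer, beam):
    # Keep only the `beam` cheapest hypotheses (no pruning when beam is None).
    if beam is None:
        return layer
    return sorted(layer, key=lambda h: h[0])[:beam]


def select_units(targets, candidates, target_cost, join_cost, beam=None):
    """Return (total_cost, units) for the cheapest unit sequence.

    targets[i] is the i-th target specification and candidates[i] is a
    non-empty list of database units that could realise it.
    """
    if not targets:
        return 0.0, []
    # Each hypothesis is (accumulated cost, unit, index of best predecessor).
    prev = _prune([(target_cost(targets[0], u), u, -1) for u in candidates[0]], beam)
    layers = [prev]
    for target, cands in zip(targets[1:], candidates[1:]):
        cur = []
        for unit in cands:
            # Cheapest way to reach this unit from any surviving predecessor.
            cost, back = min(
                (p[0] + join_cost(p[1], unit), i) for i, p in enumerate(prev)
            )
            cur.append((cost + target_cost(target, unit), unit, back))
        prev = _prune(cur, beam)
        layers.append(prev)
    # Backtrace from the cheapest final hypothesis.
    idx = min(range(len(prev)), key=lambda i: prev[i][0])
    total = prev[idx][0]
    path = []
    for layer in reversed(layers):
        _, unit, back = layer[idx]
        path.append(unit)
        idx = back
    path.reverse()
    return total, path


# Toy example: units are (phoneme, pitch in Hz) pairs; both costs are hypothetical.
targets = [("a", 100), ("b", 110)]
candidates = [[("a", 90), ("a", 120)], [("b", 95), ("b", 118)]]
pitch_target_cost = lambda t, u: abs(t[1] - u[1])
pitch_join_cost = lambda p, u: 0.5 * abs(p[1] - u[1])
total, path = select_units(targets, candidates, pitch_target_cost, pitch_join_cost)
```

In a real system the two cost functions would be weighted sums of several feature distances; those weights are what the abstract says can be trained from natural speech.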
Trainable Videorealistic Speech Animation
- PROCEEDINGS OF SIGGRAPH 2002, SAN ANTONIO TEXAS
, 2002
Cited by 110 (5 self)
We describe how to create with machine learning techniques a generative, videorealistic, speech animation module. A human subject is first recorded using a videocamera as he/she utters a predetermined speech corpus. After processing the corpus automatically, a visual speech module is learned from the data that is capable of synthesizing the human subject's mouth uttering entirely novel utterances that were not recorded in the original video. The synthesized utterance is re-composited onto a background sequence which contains natural head and eye movement. The final output is videorealistic in the sense that it looks like a video camera recording of the subject. At run time, the input to the system can be either real audio sequences or synthetic audio produced by a text-to-speech system, as long as they have been phonetically aligned. The two key
The Mbrola Project: Towards A Set Of High Quality Speech Synthesizers Free Of Use For Non Commercial Purposes
Cited by 59 (0 self)
The aim of the MBROLA project, recently initiated by the Faculte Polytechnique de Mons (Belgium), is to obtain a set of speech synthesizers for as many voices, languages and dialects as possible, free of use for non-commercial and non-military applications. The ultimate goal is to boost academic research on speech synthesis, and particularly on prosody generation, known as one of the biggest challenges facing Text-to-Speech synthesizers in the years to come.
An Overlap-Add Technique Based on Waveform Similarity (WSOLA) for High Quality Time-Scale Modification of Speech
- proceedings of ICASSP-93
, 1993
Cited by 52 (7 self)
A concept of waveform similarity is proposed for tackling the problem of time-scale modification of speech, and is worked out in the context of short-time Fourier transform representations. The resulting WSOLA algorithm produces high quality speech output, is algorithmically and computationally efficient and robust, and allows for on-line processing with arbitrary time-scaling factors that may be specified in a time-varying fashion and that can be chosen over a wide continuous range of values.
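As a rough illustration of the waveform-similarity idea, the sketch below time-scales a signal by overlap-adding Hann-windowed frames. For each output frame it searches a small neighbourhood of the ideal input position for the segment that best matches the natural continuation of the previously copied segment. This is a simplified time-domain version under stated assumptions: the function name `wsola`, the parameters `frame` and `tolerance`, a fixed rate, and plain cross-correlation as the similarity measure are choices made for the example, not details taken from the abstract.

```python
import math


def wsola(x, rate, frame=128, tolerance=32):
    """Time-scale the sample list x by `rate` (rate > 1 shortens the signal)."""
    if len(x) < frame:
        raise ValueError("signal must be at least one frame long")
    hop = frame // 2
    # Periodic Hann window: copies shifted by frame/2 sum to exactly 1.
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * n / frame) for n in range(frame)]
    out_len = int(len(x) / rate)
    y = [0.0] * (out_len + frame)
    last_start = len(x) - frame  # last valid segment start in the input
    prev = None
    k = 0
    while k * hop < out_len:
        ideal = min(int(round(k * hop * rate)), last_start)
        best = ideal
        if prev is not None and prev + hop <= last_start:
            # What would naturally follow the previously copied segment.
            template = x[prev + hop : prev + hop + frame]
            best_score = -math.inf
            lo = max(0, ideal - tolerance)
            hi = min(last_start, ideal + tolerance)
            for start in range(lo, hi + 1):
                score = sum(a * b for a, b in zip(x[start : start + frame], template))
                if score > best_score:
                    best, best_score = start, score
        # Overlap-add the chosen segment at the regular output position.
        for n in range(frame):
            y[k * hop + n] += window[n] * x[best + n]
        prev = best
        k += 1
    return y[:out_len]


# Toy input: a sine wave with a 40-sample period, played back twice as fast.
signal = [math.sin(2.0 * math.pi * n / 40.0) for n in range(4000)]
faster = wsola(signal, rate=2.0)
```

Because segments are copied unmodified and only their positions are adjusted, the local waveform shape (and hence pitch) is preserved while the duration changes.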
Musical Sound Signal Analysis/Synthesis: Sinusoidal+Residual and Elementary Waveform Models
, 1998
Cited by 48 (2 self)
Several versions of Sinusoidal+Residual analysis/synthesis models have been developed for music applications. They have been very successful and are already found in commercial and experimental tools used by musicians as well as researchers. In this paper, we begin by presenting the principles of this now classical model. However, the standard version of the model suffers from limitations in various cases. Therefore, we discuss some improvements of the standard method designed in order to overcome these difficulties. We then present and compare other analysis techniques which use Elementary Waveforms, i.e. waveforms localized both on the frequency and the time axis. In particular, the High Resolution Matching Pursuit algorithm is proposed as a potentially successful new direction of research. Keywords: Sound, analysis, synthesis, sinusoidal, elementary-waveform. 1. Introduction Some of the first attempts at sound synthesis were based on the method called additive synthesis, that is th...
Visual Speech Synthesis by Morphing Visemes
, 1999
Cited by 44 (7 self)
We present MikeTalk, a text-to-audiovisual speech synthesizer which converts input text into an audiovisual speech stream. MikeTalk is built using visemes, which are a small set of images spanning a large range of mouth shapes. The visemes are acquired from a recorded visual corpus of a human subject which is specifically designed to elicit one instantiation of each viseme. Using optical flow methods, correspondence from every viseme to every other viseme is computed automatically. By morphing along this correspondence, a smooth transition between viseme images may be generated. A complete visual utterance is constructed by concatenating viseme transitions. Finally, phoneme and timing information extracted from a text-to-speech synthesizer is exploited to determine which viseme transitions to use, and the rate at which the morphing process should occur. In this manner, we are able to synchronize the visual speech stream with the audio speech stream, and hence give the impression of a photorealistic talking face.
Applying the Harmonic Plus Noise Model in Concatenative Speech Synthesis
, 2001
Cited by 42 (0 self)
This paper describes the application of the harmonic plus noise model (HNM) for concatenative text-to-speech (TTS) synthesis. In the context of HNM, speech signals are represented as a time-varying harmonic component plus a modulated noise component. The decomposition of a speech signal into these two components allows for more natural-sounding modifications of the signal (e.g., by using different and better adapted schemes to modify each component). The parametric representation of speech using HNM provides a straightforward way of smoothing discontinuities of acoustic units around concatenation points. Formal listening tests have shown that HNM provides high-quality speech synthesis while outperforming other models for synthesis (e.g., TD-PSOLA) in intelligibility, naturalness, and pleasantness.
Natural-Sounding Speech Synthesis Using Variable-Length Units
, 1998
Cited by 33 (4 self)
The goal of this work was to develop a speech synthesis system which concatenates variable-length units to create natural-sounding speech. Our initial work in this area showed that by careful design of system responses to ensure consistent intonation contours, natural-sounding speech synthesis was achievable with word- and phrase-level concatenation. In order to extend the flexibility of this framework, we focused on the problem of generating novel words from a corpus of sub-word units. The design of the sub-word units was motivated by perceptual studies that investigated where speech could be spliced with minimal audible distortion and what contextual constraints were necessary to maintain in order to produce natural-sounding speech. The sub-word corpus is searched during synthesis using a Viterbi search which selects a sequence of units based on how well they individually match the input specification and on how well they sound as an ensemble. This concatenative speech synthesis system, ENVOICE, has been used in a conversational information retrieval system in two application domains to convert meaning representations into speech waveforms.
Robust Unit Selection System For Speech Synthesis
- IN 137TH MEETING OF THE ACOUSTICAL SOCIETY OF AMERICA
, 1999
Cited by 32 (3 self)
There has been much interest for many years in diphone-based concatenative speech synthesis and, recently, a rapidly increasing interest in unit selection based synthesis (as illustrated by the CHATR [2] system). However, the limitations of both types of system are well known. While intelligibility is generally very high for diphone based systems, the resulting signals do not sound completely natural. This happens for several reasons, amongst them the limited number of phone variants present in a typical system, and the potential artifacts introduced by concatenating at diphone boundaries. For unit selection synthesis, typically phone-based, it is possible to produce sentences that sound surprisingly natural and intelligible from a large database. However, quality is often inconsistent, and the main difficulties appear to be selecting acoustically appropriate units with the correct prosodic characteristics. Also, note that typically no prosody modification is done to achieve the highes...
Generating F0 contours from ToBI labels using linear regression
- In ICSLP 96
, 1996
Cited by 31 (1 self)
This paper describes a method for generating F0 contours from ToBI labelled utterances. The method uses linear regression to predict F0 target values for the start, mid-vowel and end of every syllable, using features representing the ToBI labels, stress and syllable position. Contours generated by this method for an English database have a correlation of 0.62 and 34.8 Hz RMS error when compared with originals from test data. These results are significant improvements on a previous rule-driven method (correlation 0.40 and 44.7 Hz RMS error), and the new method's contours are preferred by human listeners. The technique has also been successfully applied to Japanese ToBI with similar improvements.
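The general shape of such a predictor can be sketched as an ordinary least-squares fit, evaluated with the two metrics the abstract reports (correlation and RMS error). Everything concrete here is an assumption for illustration: the three features (pitch accent present, stressed, relative syllable position), the toy data, and the helper names `fit_linear`, `predict`, `rmse` and `correlation` are hypothetical and do not come from the paper.

```python
import math


def fit_linear(X, y):
    """Least-squares weights (bias first) via the normal equations.

    Solved with Gauss-Jordan elimination; assumes the features are not
    linearly dependent (otherwise a ZeroDivisionError is raised).
    """
    rows = [[1.0] + list(r) for r in X]
    d = len(rows[0])
    # Augmented matrix [A^T A | A^T y].
    M = [
        [sum(r[i] * r[j] for r in rows) for j in range(d)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(d)
    ]
    for c in range(d):
        pivot = max(range(c, d), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(d):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    return [M[i][d] / M[i][i] for i in range(d)]


def predict(weights, features):
    return weights[0] + sum(w * f for w, f in zip(weights[1:], features))


def rmse(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)) / len(a))


def correlation(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    va = sum((p - ma) ** 2 for p in a)
    vb = sum((q - mb) ** 2 for q in b)
    return cov / math.sqrt(va * vb)


# Toy syllables: [has_pitch_accent, is_stressed, relative_position_in_phrase].
X = [[0, 0, 0.0], [1, 0, 0.2], [0, 1, 0.4], [1, 1, 0.6], [0, 0, 0.8], [1, 0, 1.0]]
# Synthetic mid-vowel F0 targets in Hz, generated from a known linear rule.
f0 = [110 + 30 * a + 10 * s - 20 * p for a, s, p in X]
weights = fit_linear(X, f0)
predicted = [predict(weights, row) for row in X]
```

A fuller system along the lines of the abstract would fit separate models for the start, mid-vowel and end of each syllable and evaluate on held-out test data rather than on the training set.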

