Results 1 -
4 of
4
A hidden Markov-model-based trainable speech synthesizer
, 1999
"... This paper presents a new approach to speech synthesis in which a set of cross-word decision-tree state-clustered context-dependent hidden Markov models are used to define a set of subphone units to be used in a concatenation synthesizer. The models, trees, waveform segments and other parameters ..."
Abstract
-
Cited by 19 (0 self)
- Add to MetaCart
This paper presents a new approach to speech synthesis in which a set of cross-word decision-tree state-clustered context-dependent hidden Markov models are used to define a set of subphone units to be used in a concatenation synthesizer. The models, trees, waveform segments and other parameters representing each clustered state are obtained completely automatically through training on a 1 hour single-speaker continuous-speech database. During synthesis the required utterance, specified as a string of words of known phonetic pronounciation, is generated as a sequence of these clustered states using a TD-PSOLA waveform concatenation synthesizer. The system produces speech which, though in a monotone, is both natural sounding and highly intelligible. A Modified Rhyme Test conducted to measure segmental intelligibility yielded a 50% error rate. The speech produced by the system mimics the voice of the speaker used to record the training database. The system can be retrained on...
Corpus-Based Speech Synthesis: Methods and Challenges
"... Corpus-based approaches to speech synthesis have been advocated to overcome the limitations of concatenative synthesis from a xed acoustic unit inventory. The frequency of unit concatenations in, e.g., diphone synthesis has been argued to contribute to the perceived lack of naturalness of synthetic ..."
Abstract
-
Cited by 11 (0 self)
- Add to MetaCart
Corpus-based approaches to speech synthesis have been advocated to overcome the limitations of concatenative synthesis from a xed acoustic unit inventory. The frequency of unit concatenations in, e.g., diphone synthesis has been argued to contribute to the perceived lack of naturalness of synthetic speech. The key idea of corpus-based synthesis, or unit selection, is to use an entire speech corpus as the acoustic inventory and to select at run-time from this corpus the longest available strings of phonetic segments that match a sequence of target speech sounds in the utterance to be synthesized, thereby minimizing the number of concatenations and reducing the need for signal processing. This paper reviews the assumptions underlying this synthesis strategy and the dierent approaches to unit selection, as well as the major challenges encountered by corpus-based methods. One of the biggest problems to date is the relative weighting of acoustic distance measures. We further argue agains...
Segment Pre-Selection In Decision-Tree Based Speech Synthesis Systems
, 2000
"... Corpus based approaches to unit selection for concatenative speech synthesis have become popular in recent years due to their improved sensitivity to unit context over their more simple predecessors. These systems usually make use of large speech databases and employ sophisticated search algorithm ..."
Abstract
-
Cited by 10 (0 self)
- Add to MetaCart
Corpus based approaches to unit selection for concatenative speech synthesis have become popular in recent years due to their improved sensitivity to unit context over their more simple predecessors. These systems usually make use of large speech databases and employ sophisticated search algorithms to determine the optimal unit sequence to use to synthesise each sentence. For many applications it is not possible to have the entire database, which may be as large as several hundred megabytes, available to the synthesiser at runtime. What is required is some form of off-line pre-selection algorithm to determine which subset of the database enables the highest quality speech synthesis to be performed for a given runtime system size. This paper describes a pre-selection algorithm developed at IBM for use with decision-tree-based concatenative speech synthesisers. 1. INTRODUCTION In recent years corpus based approaches to unit selection for concatenative speech synthesis have become inc...
High-Quality and Flexible Speech Synthesis with Segment Selection and Voice Conversion
, 2003
"... Text-to-Speech (TTS) is a useful technology that converts any text into a speech signal. It can be utilized for various purposes, e.g. car navigation, announcements in railway stations, response services in telecommunications, and e-mail reading. Corpus-based TTS makes it possible to dramatically im ..."
Abstract
-
Cited by 8 (1 self)
- Add to MetaCart
Text-to-Speech (TTS) is a useful technology that converts any text into a speech signal. It can be utilized for various purposes, e.g. car navigation, announcements in railway stations, response services in telecommunications, and e-mail reading. Corpus-based TTS makes it possible to dramatically improve the naturalness of synthetic speech compared with the early TTS. However, no general-purpose TTS has been developed that can consistently synthesize su#- ciently natural speech. Furthermore, there is not yet enough flexibility in corpusbased TTS.

