From HMM's to Segment Models: A Unified View of Stochastic Modeling for Speech Recognition
, 1996
Whistler: A Trainable Text-To-Speech System
- Proc. ICSLP
Cited by 27 (4 self)
We introduce Whistler, a trainable Text-to-Speech (TTS) system, that automatically learns the model parameters from a corpus. Both prosody parameters and concatenative speech units are derived through the use of probabilistic learning methods that have been successfully used for speech recognition. Whistler can produce synthetic speech that sounds very natural and resembles the acoustic and prosodic characteristics of the original speaker. The underlying technologies used in Whistler can significantly facilitate the process of creating generic TTS systems for a new language, a new voice, or a new speech style.
Recent improvements on Microsoft's trainable text-to-speech synthesizer: Whistler
- In ICASSP-97, volume II
, 1997
Cited by 8 (2 self)
The Whistler Text-to-Speech engine was designed so that we can automatically construct the model parameters from training data [7]. This paper will focus on recent improvements in prosody and acoustic modeling, which are all derived through the use of probabilistic learning methods. Whistler can produce synthetic speech that sounds very natural and resembles the acoustic and prosodic characteristics of the original speaker. The underlying technologies used in Whistler can significantly facilitate the process of creating generic TTS systems for a new language, a new voice, or a new speech style. The Whistler TTS engine supports the Microsoft Speech API [10] and requires less than 3 MB of working memory.
Prosody Prediction For Speech Synthesis Using Transformational Rule-Based Learning
, 1998
Cited by 8 (3 self)
Current speech synthesis systems produce intelligible output under conditions with low background noise and low cognitive load. However, the quality is far from natural, and intelligibility degrades significantly under less than ideal conditions. One widely agreed upon area of improvement in speech synthesis output is prosody. Prosody includes the acoustic characteristics of speech that communicate important syntactic, semantic, and discourse information about the utterance. The acoustic correlates of prosody are the pauses, fundamental frequency contours, energy, and duration changes of utterances. Typically, prosody synthesis is a two-step process, where symbolic prosodic labels such as phrase boundaries and relative emphasis are predicted from annotated text, and then the acoustic correlates are predicted from these labels combined with phonetic information. The goal of this research is to improve the prediction of symbolic prosodic labels for text-to-speech systems, specifically the location of phrase boundaries and phrase-level emphasis (i.e., pitch accents). To date, the most successful algorithms for predicting symbolic prosodic labels are based on either handwritten rules or statistical methods. This research will adopt and modify an alternative algorithm: transformational rule-based learning (TRBL), which has had success in many natural language processing tasks. This learning algorithm is automatically trainable like statistical methods, but is less sensitive to sparse training data conditions than these methods. A second contribution of the thesis is an analysis of the interaction of phrase and accent symbols in prediction. Previous approaches have predicted these prosodic events in a serial fashion, but the order is not agreed upon. In this study, we compare seria...
A Stochastic Model Of Intonation For Text-To-Speech Synthesis
- Proceedings Eurospeech '97 (Rhodes
, 1998
Cited by 5 (2 self)
This paper presents a stochastic model of intonation contours for use in text-to-speech synthesis. The model has two modules, a linguistic module that generates abstract prosodic labels from text, and a phonetic module that generates an F0 curve from the abstract prosodic labels. This model differs from previous work in the abstract prosodic labels used, which can be automatically derived from the training corpus. This feature makes it possible to use large corpora or several corpora of different speech styles, in addition to making it easy to adapt to new languages. The present paper focuses on the linguistic module, which does not require full syntactic analysis of the text but simply relies on part-of-speech tagging. The results were validated on French by means of a perception test. Listeners did not perceive a signif...
Note: This paper is based on a communication presented at Eurospeech '97 (Véronis et al. 1997) and has been recommended by the Editorial Board of Speech Communication.
Generating Synthetic Speech Prosody with Lazy Learning in Tree Structures
- in CoNLL-2000 and LLL-2000, 2000
, 2000
Cited by 5 (1 self)
We present ongoing work on prosody prediction for speech synthesis. This approach considers sentences as tree structures and infers the prosody from a corpus of such structures using machine learning techniques. The prediction is achieved from the prosody of the closest sentence of the corpus through tree similarity measurements, using either the nearest neighbour algorithm or an analogy-based approach. We introduce two different tree structure representations and the tree similarity metrics considered, and then we discuss the different prediction methods. Experiments are currently in progress to qualify this approach.
Objective Methods For Evaluating Synthetic Intonation
, 1999
Cited by 4 (0 self)
This paper describes the development and evaluation of objective methods for testing synthetic intonation. While subjective methods are available for assessing the quality of synthetic intonation, such tests consume time and resources, and are not useful for day-to-day model development. Therefore, objective measures of F0 modelling are necessary. Currently, objective evaluation of synthetic intonation involves the use of Root Mean Squared Error and Correlation. However, it is unclear how large an improvement in either score must be before it is reflected perceptually. It is also unclear how detailed an analysis these metrics provide. Therefore, two other metrics are to be tested, both of which are similar to a basic RMSE measurement. All of the evaluation results are compared to a perceptual study in order to determine how the objective measures relate to perceived differences in the contours. 1. INTRODUCTION One difficulty in building models for synthesizing intonation is determinin...
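The two baseline measures named in this abstract, Root Mean Squared Error and correlation, can be sketched in a few lines. This is only an illustrative sketch, not the paper's implementation: the function names `rmse` and `correlation` and the sample F0 values are hypothetical, and the contours are assumed to be sampled at the same frames and expressed in Hz.

```python
import math


def rmse(reference, predicted):
    """Root mean squared error between two equal-length F0 contours."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("contours must be non-empty and of equal length")
    squared = [(r - p) ** 2 for r, p in zip(reference, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def correlation(reference, predicted):
    """Pearson correlation coefficient between two equal-length F0 contours."""
    if len(reference) != len(predicted) or len(reference) < 2:
        raise ValueError("contours must have equal length of at least 2")
    n = len(reference)
    mean_r = sum(reference) / n
    mean_p = sum(predicted) / n
    cov = sum((r - mean_r) * (p - mean_p) for r, p in zip(reference, predicted))
    var_r = sum((r - mean_r) ** 2 for r in reference)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_r == 0 or var_p == 0:
        raise ValueError("correlation is undefined for a flat contour")
    return cov / math.sqrt(var_r * var_p)


# Hypothetical F0 values in Hz (assumed data, for illustration only).
natural = [120.0, 130.0, 140.0, 130.0, 120.0]
synthetic = [125.0, 135.0, 145.0, 135.0, 125.0]

print(rmse(natural, synthetic))
print(correlation(natural, synthetic))
```

In this made-up example the synthetic contour is the natural one shifted up by a constant 5 Hz, so the correlation is perfect while the RMSE is not zero. That is one simple way the two scores can disagree, which may help explain why the abstract questions how much detail either metric provides on its own.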
A Survey Of Machine Learning Methods For Predicting Prosody In Radio Speech
, 2004
Cited by 4 (1 self)
this paper, Lee found that important words in speech tended to be stressed. The stressed words tended to be verbs, adjectives, adverbs, and nouns, while the more predictable words were reduced. These grammatically predictable words included articles and prepositions. The first category contains the "content" words while the second contains the "function" words.
Flexible Speech Synthesis Using Weighted Finite State Transducers
, 1996
Cited by 3 (1 self)
The main focus of this thesis is on improving the quality of concatenative speech synthesis by taking advantage of the natural (allowable) variability in spoken language, namely, the fact that there are multiple ways of uttering a given sentence and there are several word sequences that can represent a given concept. An architecture for speech generation for constrained domain applications is proposed that tightly integrates language generation and speech synthesis, allowing the choice of words and desired intonation in the system's response to be optimized jointly with the speech output quality. Experiments with a travel planning dialog system have demonstrated that by expanding the space of candidate responses and possible prosodic realizations we achieve higher quality speech output.
Rule Based Generation of Fundamental Frequency Contours for German Utterances
, 1995
Cited by 3 (0 self)
Evaluations of text-to-speech systems have shown that systems with sophisticated control of prosody sound more natural than those with very good control of the sound structure but less elaborate prosody. This calls for broad research in the field of prosody, which in the past has received much less interest than the study of segmental sound structure in the areas of both general phonetics and speech technology. Among the different aspects of prosody (intonation, accent and rhythm), the study of intonation, and its expression, fundamental frequency (F0), plays an outstanding role. In this paper a method is described which allows the generation of F0 contours close to natural ones from an abstract linguistic model. The model's principles define a set of labels from which F0 contours are generated by means of rules. The linguistic assumptions of the prosody generation method discussed in this paper are based on the Tone-Sequence-Model (TSM), an established theory of prosodic phono...

