Results 1 - 10 of 14
A hidden Markov-model-based trainable speech synthesizer
, 1999
Cited by 19 (0 self)
This paper presents a new approach to speech synthesis in which a set of cross-word decision-tree state-clustered context-dependent hidden Markov models is used to define a set of subphone units to be used in a concatenation synthesizer. The models, trees, waveform segments and other parameters representing each clustered state are obtained completely automatically through training on a 1 hour single-speaker continuous-speech database. During synthesis the required utterance, specified as a string of words of known phonetic pronunciation, is generated as a sequence of these clustered states using a TD-PSOLA waveform concatenation synthesizer. The system produces speech which, though in a monotone, is both natural sounding and highly intelligible. A Modified Rhyme Test conducted to measure segmental intelligibility yielded a 5.0% error rate. The speech produced by the system mimics the voice of the speaker used to record the training database. The system can be retrained on...
"Blind" Speech Segmentation: Automatic Segmentation of Speech without Linguistic Knowledge
Cited by 9 (1 self)
A new automatic speech segmentation procedure, called "Blind" speech segmentation, is presented. This procedure allows a speech sample to be segmented into sub-word units without knowledge of any linguistic information (such as orthographic or phonetic transcription). Hence, this procedure involves finding the optimal number of sub-word segments in the given speech sample, before locating the ...
A Comparison of Different Approaches to Automatic Speech Segmentation
- Proceedings of TSD 2002
, 2002
Cited by 7 (0 self)
We compare different methods for obtaining accurate speech segmentations starting from the corresponding orthography. The complete segmentation process can be decomposed into two basic steps. First, a phonetic transcription is automatically produced with the help of large vocabulary continuous speech recognition (LVCSR). Then, the phonetic information and the speech signal serve as input to a speech segmentation tool. We compare two automatic approaches to segmentation, based on the Viterbi and the Forward-Backward algorithm respectively. Further, we develop different techniques to cope with biases between automatic and manual segmentations. Experiments were performed to evaluate the generation of phonetic transcriptions as well as the different speech segmentation methods.
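The Viterbi-based segmentation step named in this abstract can be illustrated with a minimal sketch. It is not the authors' tool: the function name `forced_align` and the precomputed per-frame log-likelihood table are assumptions made for illustration, and the model is reduced to one state per phone with no transition probabilities.

```python
import math

def forced_align(loglik):
    """Viterbi forced alignment of a known phone sequence to frames.

    loglik[t][p] is an assumed, precomputed log-likelihood of frame t
    under the p-th phone of the transcription. The path is left-to-right
    and every phone occupies at least one frame.
    Returns the start frame of each phone.
    """
    T, P = len(loglik), len(loglik[0])
    if T < P:
        raise ValueError("need at least one frame per phone")
    neg_inf = -math.inf
    score = [[neg_inf] * P for _ in range(T)]
    entered = [[False] * P for _ in range(T)]  # True if phone p starts at frame t
    score[0][0] = loglik[0][0]
    for t in range(1, T):
        for p in range(P):
            stay = score[t - 1][p]
            move = score[t - 1][p - 1] if p > 0 else neg_inf
            if move > stay:
                score[t][p] = move + loglik[t][p]
                entered[t][p] = True
            else:
                score[t][p] = stay + loglik[t][p]
    # Backtrace from the last phone at the last frame.
    starts = [0] * P
    p = P - 1
    for t in range(T - 1, 0, -1):
        if entered[t][p]:
            starts[p] = t
            p -= 1
    return starts
```

A Forward-Backward variant would replace the hard maximum with sums over all paths and read boundaries off the resulting state posteriors; that is not shown here.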
Trainable Speech Synthesis Based On Trajectory Modeling Of Line Spectrum Pair Frequencies
- In IEEE Nordic Signal Processing Symposium
, 1998
Cited by 2 (1 self)
In this paper we present a novel speaker-dependent speech synthesis algorithm based on modeling temporal trajectories of the speech Line Spectrum Pair Frequencies (LSFs). The overall approach is integrated into a pitch-synchronous analysis/synthesis framework and is shown to allow the synthesis of speaker-dependent voice characteristics through an automatic parameter learning algorithm.
1. INTRODUCTION
In recent years there has been wide interest in developing flexible speech synthesis algorithms which automatically learn to synthesize a particular voice or vocal characteristic from a set of training data. Such algorithms are useful for personalization of text-to-speech synthesizers as well as for providing very low bit-rate speech coding. To date, the majority of approaches accomplish this task by either building statistical models of the synthesis parameters [1, 2] or by concatenating prestored waveforms [3, 4, 5]. In the work of Donovan and Woodland [1], a trainable speech synthes...
Automatic Labeling of Corpora for Speech Synthesis Development
, 1994
Cited by 2 (0 self)
One of the bottlenecks in the development of text-to-speech synthesizers based on segment concatenation is the need for large, segmented and labeled corpora. Consequently, as manual segmentation and labeling is a tedious and time-consuming task, there is a strong demand for automatic labeling systems which can label speech from many languages. Several systems have been proposed already, but they usually require hand-labeled training utterances before they can be used for a new language. We have developed a system that adapts to a new language without the need for any hand-labeled utterances of that language. Our system contains a segmentation and a broad phonetic classification network, which was originally trained on one task (Flemish continuous speech), and which is subsequently adapted to the new task using an embedded training procedure. The training requires the phonetic transcriptions of the utterances, some structural models describing the phoneme realizations in terms of subpho...
Design, Collection, and Annotation of a Romanian Speech Database
Cited by 1 (1 self)
Speech databases are essential resources for the acquisition of linguistic knowledge and speech technology developments, and both can be facilitated if the collected signal is accompanied by some form of annotation. This paper gives an overview of the design, collection, and annotation of a Romanian speech database including over 10 hours of speech from 100 speakers, labeled in part at the broad phonetic level.
LABELING A ROMANIAN SPEECH DATABASE
Cited by 1 (1 self)
Speech databases are essential for the acquisition of linguistic knowledge and speech technology developments, and both can be facilitated if the collected signal is accompanied by some form of annotation. Following the design and collection of a Romanian speech database including more than 10 hours of speech from 100 speakers, we are now working to label the signal files. This is done at a broad phonetic level, in a semiautomatic manner comprising three steps: manual transcription of the signal files, automatic alignment of the phoneme labels in the transcriptions with the signals, and manual verification and correction of the aligned labels.
A Study on Automatic Phonetic Segmentation for Mandarin Speech/Singing Voice Synthesis
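... model) forced-alignment method to perform the initial segmentation. For the singing-voice corpus, on the other hand, in addition to the former method we also apply dynamic time warping. Because the accuracy of these two initial segmentations is not high, we then use a post-processing segmentation ...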
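Dynamic time warping, which this abstract names as the additional alignment step for the singing-voice corpus, can be sketched as follows. This is the textbook formulation on scalar sequences with an absolute-difference cost, not the authors' implementation; the function name `dtw_path` and the scalar features are assumptions made for illustration.

```python
def dtw_path(a, b):
    """Classic dynamic time warping between two scalar sequences.

    Uses |a[i] - b[j]| as the local cost (an illustrative choice) and
    the standard step pattern (diagonal, up, left).
    Returns (total_cost, path), where path is a list of (i, j) pairs.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # D[i][j] = cost of the best alignment of a[:i] with b[:j].
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])
    # Backtrace from the end of both sequences to recover the warping path.
    i, j = n, m
    path = []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (D[i - 1][j - 1], i - 1, j - 1),
            (D[i - 1][j], i - 1, j),
            (D[i][j - 1], i, j - 1),
        ]
        _, i, j = min(candidates)
    path.reverse()
    return D[n][m], path
```

If the reference sequence carries known segment boundaries, each boundary can be mapped through the returned path to a frame index in the other sequence, which is one way such an alignment yields an initial segmentation.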
Automatic Prosodic Break Labeling For Mandarin Chinese Speech Data
For corpus-based speech synthesis, large quantities of labeled speech are required. Manually labeling speech data is quite labor-intensive. Therefore, automatic speech labeling is highly desired. Prosodic break detection is one of the tasks for automatic speech labeling. In this paper, we propose an automatic break detection algorithm for Mandarin Chinese speech. In this approach, we use the energy contour to normalize the duration of syllables and use the concept of normalized transition time to represent the time interval between two syllables. A recursive algorithm is used to select locally longer intervals as pauses. Language-specific constraint rules are used to make a better judgment. The automatic break labeling results are shown to be good.
Explicit Segmentation Of Speech Using Gaussian Models
- in ICSLP
, 1996
In this paper we investigate an automatic method to segment labeled speech. The method needs an initial estimate of the segmentation, which is provided by an HMM-based alignment. Afterwards, the boundaries are refined by moving the frontier frames to the segment which is more similar to the speech frame. Gaussian pdfs are used as a similarity measure. The performance of the method is evaluated using the TIMIT database. If boundary deviations (from the reference position) larger than 20 ms are counted as errors, then the replacement of the boundaries reduces the error by 30%. Additional experiments show how the proposed method makes the performance largely independent of whether speaker-dependent or speaker-independent data is used to estimate the HMM.
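The boundary refinement idea in this abstract (reassign a frontier frame to whichever neighbouring segment it resembles more under a Gaussian) can be sketched for a single boundary. This is a simplified illustration rather than the paper's procedure: it assumes scalar features, refits the Gaussians after every move, and uses a variance floor; the names `refine_boundary`, `fit`, and `gauss_logpdf` are hypothetical.

```python
import math

def gauss_logpdf(x, mean, var):
    """Log density of a univariate Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def fit(xs, floor=1e-3):
    """Maximum-likelihood mean and variance, with an assumed variance floor."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return mean, max(var, floor)

def refine_boundary(frames, boundary, max_iter=50):
    """Refine one boundary between two adjacent segments.

    frames: scalar feature per frame (real systems use spectral vectors).
    boundary: index of the first frame of the right-hand segment, e.g.
    taken from an initial HMM alignment.
    """
    b = boundary
    for _ in range(max_iter):
        left, right = frames[:b], frames[b:]
        lm, lv = fit(left)
        rm, rv = fit(right)
        last_left, first_right = frames[b - 1], frames[b]
        if len(left) > 1 and gauss_logpdf(last_left, rm, rv) > gauss_logpdf(last_left, lm, lv):
            b -= 1  # frontier frame of the left segment fits the right one better
        elif len(right) > 1 and gauss_logpdf(first_right, lm, lv) > gauss_logpdf(first_right, rm, rv):
            b += 1  # frontier frame of the right segment fits the left one better
        else:
            break  # neither frontier frame prefers the neighbouring segment
    return b
```

The iteration cap `max_iter` is a safeguard against oscillation in this sketch and is not taken from the source.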

