Evaluating the Pronunciation Component of Text-to-Speech Systems for English: A Performance Comparison of Different Approaches
- In Speech and Language Technology (SALT) Club Workshop on Evaluation in Speech and Language Technology
, 1997
Cited by 24 (8 self)
The automatic derivation of word pronunciations from input text is a central task for any text-to-speech system. For general English text at least, this is often thought to be a solved problem, with manually-derived linguistic rules assumed capable of handling 'novel' words missing from the system dictionary. Data-driven methods, based on machine learning of the regularities implicit in a large pronouncing dictionary, have received considerable attention recently but are generally thought to perform less well. However, these tentative beliefs are at best uncertain without powerful methods for comparing text-to-phoneme subsystems. This paper contributes to the development of such methods by comparing the performance of four representative approaches to automatic phonemisation on the same test dictionary. As well as rule-based approaches, three data-driven techniques are evaluated: pronunciation by analogy (PbA), NETspeak and IB1-IG (a modified k-nearest neighbour method). Issues involved in comparative evaluation are detailed and elucidated. The data-driven techniques outperform rules in accuracy of letter-to-phoneme translation by a very significant margin but require aligned text-phoneme training data and are slower. Best translation results are obtained with PbA at approximately 72% words correct on a reasonably large pronouncing dictionary, compared to something like 26% words correct for the rules, indicating that automatic pronunciation of text is not a solved problem.
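The abstract above reports results as "words correct", which is a stricter measure than per-phoneme accuracy. The following is a minimal sketch of both measures, not the paper's evaluation code. It assumes each prediction is aligned one-to-one with its reference (as aligned text-phoneme data would allow); the function names and the toy phoneme strings are hypothetical.

```python
def words_correct(predicted, reference):
    """Fraction of words whose entire phoneme string matches the reference."""
    if len(predicted) != len(reference):
        raise ValueError("prediction and reference lists differ in length")
    hits = sum(1 for p, r in zip(predicted, reference) if p == r)
    return hits / len(reference)


def phonemes_correct(predicted, reference):
    """Fraction of aligned phoneme positions that match.

    Assumes every predicted word has the same number of symbols as its
    reference, i.e. one output symbol per letter position.
    """
    hits = 0
    total = 0
    for p, r in zip(predicted, reference):
        if len(p) != len(r):
            raise ValueError("this sketch needs aligned, equal-length strings")
        hits += sum(1 for a, b in zip(p, r) if a == b)
        total += len(r)
    return hits / total


# Toy data (hypothetical): one symbol per letter position.
predicted = [("k", "ae", "t"), ("d", "o", "g"), ("f", "i", "sh")]
reference = [("k", "ae", "t"), ("d", "aa", "g"), ("f", "ih", "sh")]

print(words_correct(predicted, reference))
print(phonemes_correct(predicted, reference))
```

A single wrong phoneme makes the whole word count as wrong, so the word-level score is never higher than the phoneme-level score on the same data.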
Bi-directional Conversion Between Graphemes and Phonemes Using a Joint N-gram Model
, 2001
Cited by 16 (1 self)
We present in this paper a statistical model for language-independent bi-directional conversion between spelling and pronunciation, based on joint grapheme/phoneme units extracted from automatically aligned data. The model is evaluated on spelling-to-pronunciation and pronunciation-to-spelling conversion on the NetTalk database and the CMU dictionary. We also study the effect of including lexical stress in the pronunciation. Although a direct comparison is difficult to make, our model's performance appears to be as good as or better than that of other data-driven approaches that have been applied to the same tasks.
A Statistical Text-To-Phone Function Using Ngrams And Rules
- in ICASSP
, 1999
Cited by 14 (1 self)
Adopting concepts from statistical language modeling and rule-based transformations can lead to effective and efficient text-to-phone (TTP) functions. We present here the methods and results of one such effort, resulting in a relatively compact and fast set of TTP rules that achieves 94.5% segmental phonemic accuracy.
Improving Pronunciation Accuracy of Proper Names with Language Origin Classes
- in Proc. of the Seventh ESSLLI Student Session
, 2001
Cited by 6 (0 self)
I would like to thank my advisor Alan Black for all his support and dedication, without whom this thesis would not have been possible; Kenji Sagae for the insightful discussions about this thesis and, most importantly, for his patience and support; Guy Lebanon and Christian Monson, LTI colleagues, for the discussion about unsupervised clustering; and Toni Badia for having introduced me to the field of Natural Language Processing and for his support during all these years. This work was supported by a “La Caixa” Fellowship.
Corpus-based unit selection for natural-sounding speech synthesis
, 2003
Cited by 2 (0 self)
Speech synthesis is an automatic encoding process carried out by machine through which symbols conveying linguistic information are converted into an acoustic waveform. In the past decade or so, a recent trend toward a non-parametric, corpus-based approach has focused on using real human speech as source material for producing novel natural-sounding speech. This work proposes a communication-theoretic formulation in which unit selection is a noisy channel through which an input sequence of symbols passes and an output sequence, possibly corrupted due to the coverage limits of the corpus, emerges. The penalty of approximation is quantified by substitution and concatenation costs which grade what unit contexts are interchangeable and where concatenations are not perceivable. These costs are semi-automatically derived from data and are found to agree with acoustic-phonetic knowledge.
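The abstract above describes grading candidate units by substitution and concatenation costs. One common way to search such costs is a Viterbi-style dynamic program, sketched below. This is an illustrative assumption, not the thesis's formulation: the function `select_units`, the toy cost functions and the `(label, corpus_index)` unit representation are all hypothetical.

```python
def select_units(targets, candidates, sub_cost, concat_cost):
    """Pick one candidate unit per target so that the summed substitution
    and concatenation costs are minimal. Returns (total_cost, path).

    candidates[i] must be a non-empty list of units for targets[i].
    """
    if not targets:
        return 0.0, []
    # Each entry is (best cost so far, index of best predecessor).
    prev = [(sub_cost(targets[0], c), None) for c in candidates[0]]
    history = [prev]
    for i in range(1, len(targets)):
        cur = []
        for c in candidates[i]:
            best_cost, best_k = min(
                (prev[k][0] + concat_cost(candidates[i - 1][k], c), k)
                for k in range(len(candidates[i - 1]))
            )
            cur.append((best_cost + sub_cost(targets[i], c), best_k))
        history.append(cur)
        prev = cur
    # Trace back from the cheapest final unit.
    j = min(range(len(prev)), key=lambda k: prev[k][0])
    total = prev[j][0]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = history[i][j][1]
    path.reverse()
    return total, path


# Toy costs (hypothetical): a unit is (label, position in the corpus).
def toy_sub_cost(target, unit):
    return 0.0 if unit[0] == target else 1.0


def toy_concat_cost(left, right):
    # Units that were adjacent in the corpus join without penalty.
    return 0.0 if right[1] == left[1] + 1 else 1.0


targets = ["a", "b", "c"]
candidates = [
    [("a", 0), ("a", 5)],
    [("b", 6), ("b", 9)],
    [("c", 7), ("c", 2)],
]
total, path = select_units(targets, candidates, toy_sub_cost, toy_concat_cost)
print(total, path)
```

With these toy costs the search prefers the run of units that were contiguous in the corpus, since joining them costs nothing.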
TreeTalk: Memory-based word phonemisation
- In Data-Driven Techniques in Speech Synthesis, Kluwer
, 2001
Cited by 2 (0 self)
We propose a memory-based (similarity-based) approach to learning the mapping of words into phonetic representations for use in speech synthesis systems. The main advantage of memory-based data mining techniques is their high accuracy; the main disadvantage is processing speed. We introduce a hybrid between memory-based and decision-tree-based learning (TRIBL) which optimises the trade-off between efficiency and accuracy. TRIBL was used in TREETALK, a methodology for fast engineering of word-to-phonetics conversion systems. We also show that for English, a single TRIBL classifier trained on predicting phonetic transcription and word stress at the same time performs better than a 'modular' approach in which different classifiers corresponding to linguistically relevant representations such as morphological and syllable structure are separately trained and integrated.
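To make the memory-based idea above concrete, here is a minimal sketch of phonemisation by nearest-neighbour lookup over letter windows. It is a plain 1-nearest-neighbour classifier with an unweighted overlap metric, not TRIBL or IB1-IG. It also assumes a lexicon aligned one phoneme per letter, and all names and toy entries are hypothetical.

```python
def windows(word, width=1, pad="_"):
    """One fixed-width letter window per letter, centred on that letter."""
    padded = pad * width + word + pad * width
    return [padded[i:i + 2 * width + 1] for i in range(len(word))]


def train(aligned_lexicon, width=1):
    """Store every (window, phoneme) pair; memory-based learning keeps
    the training instances themselves rather than abstracting rules."""
    memory = []
    for word, phonemes in aligned_lexicon:
        for win, ph in zip(windows(word, width), phonemes):
            memory.append((win, ph))
    return memory


def overlap(a, b):
    """Number of window positions on which two windows agree."""
    return sum(1 for x, y in zip(a, b) if x == y)


def phonemise(word, memory, width=1):
    """Label each letter with the phoneme of its most similar stored window.
    Ties go to the instance stored first."""
    out = []
    for win in windows(word, width):
        _, ph = max(memory, key=lambda item: overlap(win, item[0]))
        out.append(ph)
    return out


# Toy aligned lexicon (hypothetical): one phoneme per letter.
lexicon = [
    ("cat", ["k", "ae", "t"]),
    ("bat", ["b", "ae", "t"]),
    ("cab", ["k", "ae", "b"]),
    ("tap", ["t", "ae", "p"]),
]
memory = train(lexicon)
print(phonemise("tab", memory))
```

Every query scans the whole memory, which illustrates the speed disadvantage the abstract mentions; tree-based indexing is one way to reduce that cost.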
Re-Engineering Letter-to-Sound Rules
, 2001
Cited by 2 (0 self)
Using finite-state automata for the text analysis component in a text-to-speech system is problematic in several respects: the rewrite rules from which the automata are compiled are difficult to write and maintain, and the resulting automata can become very large and therefore inefficient. Converting the knowledge represented explicitly in rewrite rules into a more efficient format is difficult. We take an indirect route, learning an efficient decision tree representation from data and tapping information contained in existing rewrite rules, which increases performance compared to learning exclusively from a pronunciation lexicon.
Computational Complexity of a Fast Viterbi Decoding Algorithm for Stochastic Letter-Phoneme Transduction
, 1998
Cited by 1 (0 self)
This paper describes a modification to, and a fast implementation of, the Viterbi algorithm for use in stochastic letter-to-phoneme conversion. A straightforward (but unrealistic) implementation of the Viterbi algorithm has a linear time complexity with respect to the length of the letter string, but quadratic complexity if we additionally consider the number of letter-to-phoneme correspondences to be a variable determining the problem size. Since the number of correspondences can be large, processing time is long. If the correspondences are precompiled to a deterministic finite-state automaton to simplify the process of matching to determine state survivors, execution time is reduced by a large multiplicative factor. Speedup is inferred indirectly since the straightforward implementation of Viterbi decoding is too slow for practical comparison, and ranges between about 200 and 4000 depending upon the number of letters processed and the particular correspondences employed in the transdu...
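The straightforward decoding that the abstract above contrasts with its faster variant can be sketched as follows. This is a simplified illustration, not the paper's algorithm. It assumes each correspondence carries an independent probability, omits the automaton precompilation, and uses made-up correspondences and probabilities.

```python
import math


def viterbi_transduce(word, correspondences):
    """Find the most probable segmentation of `word` into letter-phoneme
    correspondences. Each correspondence is (letters, phonemes, prob) with
    non-empty `letters` and 0 < prob <= 1.

    Returns (segments, log_prob), or None if the word cannot be covered.
    """
    n = len(word)
    best = [-math.inf] * (n + 1)   # best log-probability of covering word[:i]
    back = [None] * (n + 1)        # (previous position, phonemes emitted)
    best[0] = 0.0
    for i in range(n):
        if best[i] == -math.inf:
            continue
        # Straightforward matching: try every correspondence at every position.
        for letters, phonemes, prob in correspondences:
            if word.startswith(letters, i):
                j = i + len(letters)
                score = best[i] + math.log(prob)
                if score > best[j]:
                    best[j] = score
                    back[j] = (i, phonemes)
    if n > 0 and back[n] is None:
        return None
    segments = []
    j = n
    while j > 0:
        i, phonemes = back[j]
        segments.append(phonemes)
        j = i
    segments.reverse()
    return segments, best[n]


# Hypothetical correspondences; the probabilities are invented.
table = [
    ("ph", ("f",), 0.9),
    ("p", ("p",), 0.8),
    ("h", ("h",), 0.7),
    ("o", ("ow",), 0.6),
    ("n", ("n",), 0.9),
    ("e", (), 0.5),        # silent final e
    ("e", ("eh",), 0.3),
]
segments, log_prob = viterbi_transduce("phone", table)
print([ph for seg in segments for ph in seg], log_prob)
```

The inner loop scans the whole table at each letter position, so the work grows with both the word length and the number of correspondences. The abstract's precompiled automaton targets that matching step.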
Issues in Building General Letter to Sound Rules
, 1998
In general text-to-speech systems, it is not possible to guarantee that a lexicon will contain all words found in the text; therefore, some system for predicting pronunciation from the word itself is necessary.

