A Multi-Strategy Approach to Improving Pronunciation by Analogy
Cited by 25 (3 self)
Pronunciation by analogy (PbA) is a data-driven method for relating letters to sound, with potential application to next-generation text-to-speech systems. This paper extends previous work on PbA in several directions. First, we have included 'full' pattern matching between input letter string and dictionary entries, as well as including lexical stress in letter-to-phoneme conversion. Second, we have extended the method to phoneme-to-letter conversion. Third, and most important, we have experimented with multiple, different strategies for scoring the candidate pronunciations. Individual scores for each strategy are obtained on the basis of rank and either multiplied or summed to produce a final, overall score. Five strategies have been studied and results obtained from all 31 possible combinations. The two combination methods perform comparably, with the product rule only very marginally superior to the sum rule. Nonparametric statistical analysis reveals that performance improves as more strategies are included in the combination: this trend is very highly significant (p < 0.0005). Accordingly, for letter-to-phoneme conversion, best results are obtained when all five strategies are combined: word accuracy is raised to 65.5% relative to 61.7% for our best previous result and 63.0% for the best-performing single strategy. These improvements are very highly significant (p ≈ 0 and p < 0.00011 respectively). Similar results were found for phoneme-to-letter and letter-to-stress conversion, although the former was an easier problem for PbA than letter-to-phoneme conversion and the latter was harder. The main sources of error for the multi-strategy approach are very similar to those for the best single strategy, and mostly involve vowel letters and phonemes.
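The rank-based combination described in this abstract can be illustrated with a minimal Python sketch. The abstract does not give the exact mapping from rank to points, so the mapping below (the best candidate gets as many points as there are candidates, ties share a value) is an assumption, and the names `rank_points`, `combine`, and `strategy_subsets` are hypothetical.

```python
from itertools import combinations
from math import prod


def rank_points(raw_scores):
    """Turn one strategy's raw scores into rank points.

    Assumed mapping (not taken from the paper): a candidate's points equal
    the number of candidates it scores at least as high as, so the best of
    n candidates gets n points and tied candidates get the same points.
    """
    values = list(raw_scores.values())
    return {cand: sum(1 for v in values if v <= score)
            for cand, score in raw_scores.items()}


def combine(strategy_scores, rule="product"):
    """Fuse per-strategy rank points with the product rule or the sum rule."""
    per_strategy = [rank_points(scores) for scores in strategy_scores]
    fuse = prod if rule == "product" else sum
    return {cand: fuse(points[cand] for points in per_strategy)
            for cand in per_strategy[0]}


def strategy_subsets(n_strategies):
    """All non-empty subsets of strategy indices (2**n - 1 of them)."""
    indices = range(n_strategies)
    return [subset
            for size in range(1, n_strategies + 1)
            for subset in combinations(indices, size)]


# Toy raw scores for three candidate pronunciations under three strategies.
s1 = {"A": 0.9, "B": 0.5, "C": 0.1}
s2 = {"A": 2, "B": 7, "C": 1}
s3 = {"A": 10, "B": 3, "C": 5}

print(combine([s1, s2, s3], "product"))  # {'A': 18, 'B': 6, 'C': 2}
print(combine([s1, s2, s3], "sum"))      # {'A': 8, 'B': 6, 'C': 4}
print(len(strategy_subsets(5)))          # 31
```

With five strategies there are 2^5 - 1 = 31 non-empty subsets, which matches the 31 combinations mentioned in the abstract.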
Evaluating the Pronunciation Component of Text-to-Speech Systems for English: A Performance Comparison of Different Approaches
- In Speech and Language Technology (SALT) Club Workshop on Evaluation in Speech and Language Technology
, 1997
Cited by 24 (8 self)
The automatic derivation of word pronunciations from input text is a central task for any text-to-speech system. For general English text at least, this is often thought to be a solved problem, with manually-derived linguistic rules assumed capable of handling 'novel' words missing from the system dictionary. Data-driven methods, based on machine learning of the regularities implicit in a large pronouncing dictionary, have received considerable attention recently but are generally thought to perform less well. However, these tentative beliefs are at best uncertain without powerful methods for comparing text-to-phoneme subsystems. This paper contributes to the development of such methods by comparing the performance of four representative approaches to automatic phonemisation on the same test dictionary. As well as rule-based approaches, three data-driven techniques are evaluated: pronunciation by analogy (PbA), NETspeak and IB1-IG (a modified k-nearest neighbour method). Issues involved in comparative evaluation are detailed and elucidated. The data-driven techniques outperform rules in accuracy of letter-to-phoneme translation by a very significant margin but require aligned text-phoneme training data and are slower. Best translation results are obtained with PbA at approximately 72% words correct on a reasonably large pronouncing dictionary, compared to something like 26% words correct for the rules, indicating that automatic pronunciation of text is not a solved problem.
Pronunciation by Analogy: Impact of Implementational Choices on Performance
, 1997
Cited by 20 (9 self)
Pronunciation by analogy (PbA) is an emerging, data-driven technique with potential application in text-to-speech (TTS) systems, as well as being an influential psychological model of reading aloud. The underlying idea is that a pronunciation for an unknown word (i.e. one not in the dictionary, or lexicon, of the human or machine 'reader') is assembled by matching substrings of the input to substrings of known, lexical words, hypothesising a partial pronunciation for each matched substring from the lexical knowledge of the 'reader', and concatenating the partial pronunciations. This paper assesses the capability of PbA to derive pronunciations for unknown words of English. As a psychological model, PbA is 'underspecified', i.e. the implementor of a simulation of the process faces detailed choices which can only be resolved by trial and error. One goal for this paper is to explore the impact of certain basic implementational choices on the performance of PbA systems. The variables stud...
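The match-and-concatenate idea in this abstract can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the system evaluated in the paper: it assumes a toy lexicon whose letters and phonemes are aligned one-to-one, it joins non-overlapping pieces rather than building a full pronunciation lattice, and it prefers the segmentation with the fewest pieces. The lexicon, the phoneme symbols, and the function names are all hypothetical.

```python
from collections import Counter, defaultdict

# Toy lexicon (hypothetical): letter string -> phoneme string, aligned 1:1.
LEXICON = {"cat": "kAt", "hen": "hEn", "mat": "mAt"}


def collect_matches(word, lexicon):
    """Map each (start, end) span of `word` to the partial pronunciations
    that lexical words suggest for that substring, with their frequencies."""
    matches = defaultdict(Counter)
    for letters, phonemes in lexicon.items():
        for i in range(len(letters)):
            for j in range(i + 1, len(letters) + 1):
                chunk = letters[i:j]
                start = word.find(chunk)
                while start != -1:
                    matches[(start, start + len(chunk))][phonemes[i:j]] += 1
                    start = word.find(chunk, start + 1)
    return matches


def pronounce(word, lexicon):
    """Concatenate partial pronunciations using as few pieces as possible.

    Returns None when the matched substrings cannot cover the whole word.
    """
    matches = collect_matches(word, lexicon)
    # best[k] = (number of pieces, pronunciation) for the prefix word[:k]
    best = {0: (0, "")}
    for end in range(1, len(word) + 1):
        for start in range(end):
            if start in best and (start, end) in matches:
                pieces, pron = best[start]
                partial = matches[(start, end)].most_common(1)[0][0]
                candidate = (pieces + 1, pron + partial)
                if end not in best or candidate[0] < best[end][0]:
                    best[end] = candidate
    return best[len(word)][1] if len(word) in best else None


# "hat" is not in the lexicon: "h" comes from "hen", "at" from "cat"/"mat".
print(pronounce("hat", LEXICON))  # hAt
```

Choices such as how pieces are scored and whether they may overlap are examples of the implementational decisions the abstract says a PbA implementor has to make.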
Phonological Parsing for Bi-directional Letter-to-Sound/Sound-to-Letter Generation
- Journal of Speech Communication
, 1995
Cited by 14 (2 self)
In this paper, we describe a reversible letter-to-sound/sound-to-letter generation system based on an approach which combines a rule-based formalism with data-driven techniques. We adopt a probabilistic parsing strategy to provide a hierarchical lexical analysis of a word, including information such as morphology, stress, syllabification, phonemics and graphemics. Long-distance constraints are propagated by enforcing local constraints throughout the hierarchy. Our training and testing corpora are derived from the high-frequency portion of the Brown Corpus (10,000 words), augmented with markers indicating stress and word morphology. We evaluated our performance based on an unseen test set. The percentages of nonparsable words for letter-to-sound and sound-to-letter generation were 6% and 5% respectively. Of the remaining words, our system achieved a word accuracy of 71.8% and a phoneme accuracy of 92.5% for letter-to-sound generation, and a word accuracy of 55.8% and letter accuracy of 89.4% for sound-to-letter generation. We also compared our hierarchical approach with an alternative, single-layer approach to demonstrate how the hierarchy provides a parsimonious description for English orthographic-phonological regularities, while simultaneously attaining competitive generation accuracy.
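The gap between word accuracy and phoneme accuracy reported in this abstract follows from how the two measures are counted: a single wrong phoneme makes the whole word wrong. The Python sketch below illustrates this under an assumed definition of phoneme accuracy (one minus total edit distance over total reference length); the abstract does not state the exact definition, and the function names and toy data are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two symbol sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def accuracies(pairs):
    """Return (word accuracy, phoneme accuracy) for (reference, output) pairs.

    Word accuracy counts exact matches only. Phoneme accuracy is assumed
    here to be 1 - (total edit distance / total reference length).
    """
    words_correct = sum(ref == hyp for ref, hyp in pairs)
    errors = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    total_phonemes = sum(len(ref) for ref, _ in pairs)
    return words_correct / len(pairs), 1 - errors / total_phonemes


# Four three-phoneme words, one of which has a single wrong vowel.
pairs = [("kAt", "kAt"), ("hEn", "hIn"), ("dOg", "dOg"), ("fIS", "fIS")]
word_acc, phoneme_acc = accuracies(pairs)
print(word_acc)               # 0.75
print(round(phoneme_acc, 3))  # 0.917
```

In this toy case one error among twelve phonemes costs a quarter of the word accuracy, which is why phoneme-level figures are always the higher of the two.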
A Recurrent Network That Learns To Pronounce English Text
, 1996
Cited by 7 (0 self)
Previous attempts to derive connectionist models for text-to-phoneme conversion, such as NETtalk and NETspeak, have generally used pre-aligned training data and purely feedforward networks, both of which represent simplifications of the problem. In this work, we explore the potential of recurrent networks to perform the conversion task when trained on non-aligned data. Initially, our use of a single recurrent network produced disappointing results. This led to the definition of a two-phase model in which the hidden-unit representation of an autoassociative network was fed forward to a recurrent network. Although this model currently does not perform as well as NETspeak, it is solving a harder problem. Also, we propose several possible avenues for improvement.
Pronunciation Modeling in Speech Synthesis
, 1998
Cited by 4 (0 self)
Acknowledgments: I am very pleased to have had the encouragement and support of a committee of three linguists for whom I have the greatest respect and admiration: Mark Liberman, William Labov and Eugene Buckley. Each of them made my transition back to Penn pleasant after what seemed like a long absence. It was a great pleasure to have Mark Randolph both as an external reader and as a colleague at Motorola. Mark’s work at MIT a decade ago has served as an inspiration to me. Orhan Karaali made this dissertation possible in this millennium. As my manager for over two years at Motorola, Orhan insisted on making my dissertation a priority at work. Harry Bliss provided his voice to this project and our whole group is very grateful for his patience and cooperation. My colleagues at Motorola listened to my ideas and provided technical and theoretical assistance at every turn: Noel
TreeTalk: Memory-based word phonemisation
- In Data-Driven Techniques in Speech Synthesis, Kluwer
, 2001
Cited by 2 (0 self)
We propose a memory-based (similarity-based) approach to learning the mapping of words into phonetic representations for use in speech synthesis systems. The main advantage of memory-based data mining techniques is their high accuracy; the main disadvantage is processing speed. We introduce a hybrid between memory-based and decision-tree-based learning (TRIBL) which optimises the trade-off between efficiency and accuracy. TRIBL was used in TREETALK, a methodology for fast engineering of word-to-phonetics conversion systems. We also show that for English, a single TRIBL classifier trained on predicting phonetic transcription and word stress at the same time performs better than a 'modular' approach in which different classifiers corresponding to linguistically relevant representations such as morphological and syllable structure are separately trained and integrated.
A Diphone-Based Text-to-Speech System for Scottish Gaelic
, 1997
Cited by 1 (1 self)
In this thesis, a diphone-based text-to-speech system for Scottish Gaelic, a language spoken by about 80,000 native speakers in Scotland and Canada, is presented. Text-to-speech systems convert orthographic text input into speech output. The present system consists of two main parts: an automatic phonetic transcription module, which produces an orthophonic transcription of the orthographic input text, and a speech synthesis module, which synthesizes an utterance from its transcription by concatenating and modifying previously recorded speech units. Diphones, speech units that cover two sounds and the transition between them, form the basis of the synthesis module. Duration and intonation are modelled on the basis of simple heuristics. The diphone inventory was designed for the Gaelic of Bayble, Lewis. Scottish Gaelic distinguishes four main phonetic settings: velarised, palatalised, nasalised, and neutral. As the domain of these settings is the syllable, they are difficult t...
Computational Complexity of a Fast Viterbi Decoding Algorithm for Stochastic Letter-Phoneme Transduction
, 1998
Cited by 1 (0 self)
This paper describes a modification to, and a fast implementation of, the Viterbi algorithm for use in stochastic letter-to-phoneme conversion. A straightforward (but unrealistic) implementation of the Viterbi algorithm has a linear time complexity with respect to the length of the letter string, but quadratic complexity if we additionally consider the number of letter-to-phoneme correspondences to be a variable determining the problem size. Since the number of correspondences can be large, processing time is long. If the correspondences are precompiled to a deterministic finite-state automaton to simplify the process of matching to determine state survivors, execution time is reduced by a large multiplicative factor. Speedup is inferred indirectly since the straightforward implementation of Viterbi decoding is too slow for practical comparison, and ranges between about 200 and 4000 depending upon the number of letters processed and the particular correspondences employed in the transdu...
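The idea of precompiling correspondences so that matching does not scan the whole correspondence list at every letter position can be sketched in Python as follows. This is a simplified illustration and not the paper's algorithm: a trie stands in for the deterministic finite-state automaton, each correspondence carries an independent probability (no transition probabilities between correspondences), and the correspondence table, its probabilities, and the function names are all hypothetical.

```python
import math

# Hypothetical correspondences: (letter string, phoneme string, probability).
CORRESPONDENCES = [
    ("p", "p", 0.9), ("h", "h", 0.5), ("ph", "f", 0.7),
    ("o", "O", 0.6), ("n", "n", 0.9),
    ("e", "", 0.5),   # silent letter: empty phoneme string
    ("e", "E", 0.4),
]

OUT = "$out"  # key holding the outputs stored at a trie node


def build_trie(correspondences):
    """Compile the letter sides of all correspondences into a trie."""
    root = {}
    for letters, phonemes, prob in correspondences:
        node = root
        for ch in letters:
            node = node.setdefault(ch, {})
        node.setdefault(OUT, []).append((phonemes, math.log(prob)))
    return root


def viterbi(word, trie):
    """Return (log-probability, phonemes) of the best analysis, or None.

    best[k] holds the surviving (highest-scoring) analysis of word[:k].
    From each reachable position the trie is walked along the letters, so
    only correspondences that actually match there are considered.
    """
    n = len(word)
    best = [None] * (n + 1)
    best[0] = (0.0, "")
    for start in range(n):
        if best[start] is None:
            continue
        node = trie
        for end in range(start, n):
            node = node.get(word[end])
            if node is None:
                break
            for phonemes, logp in node.get(OUT, []):
                score = best[start][0] + logp
                if best[end + 1] is None or score > best[end + 1][0]:
                    best[end + 1] = (score, best[start][1] + phonemes)
    return best[n]


trie = build_trie(CORRESPONDENCES)
result = viterbi("phone", trie)
print(result[1])  # fOn
```

In this toy table the analysis ph+o+n+e (0.7 x 0.6 x 0.9 x 0.5 = 0.189) beats p+h+o+n+e (0.9 x 0.5 x 0.6 x 0.9 x 0.5 = 0.1215), so the digraph reading survives.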

