Results 1 - 10
of
87
AIBO's first words. The social learning of language and meaning
, 2001
"... This paper explores the hypothesis that language communication in its very first stage is bootstrapped in a social learning process under the strong influence of culture. A concrete framework for social learning has been developed based on the notion of a language game. Autonomous robots have been p ..."
Abstract
-
Cited by 88 (9 self)
- Add to MetaCart
This paper explores the hypothesis that language communication in its very first stage is bootstrapped in a social learning process under the strong influence of culture. A concrete framework for social learning has been developed based on the notion of a language game. Autonomous robots have been programmed to behave according to this framework. We show experiments that demonstrate why there has to be a causal role of language on category acquisition; partly by showing that it leads effectively to the bootstrapping of communication and partly by showing that other forms of learning do not generate categories usable in communication or make information assumptions which cannot be satisfied.
The Mbrola Project: Towards A Set Of High Quality Speech Synthesizers Free Of Use For Non Commercial Purposes
"... The aim of the MBROLA project, recently initiated by the Faculte Polytechnique de Mons (Belgium), is to obtain a set of speech synthesizers for as many voices, languages and dialects as possible, free of use for non-commercial and non-military applications. The ultimate goal is to boost up academic ..."
Abstract
-
Cited by 59 (0 self)
- Add to MetaCart
The aim of the MBROLA project, recently initiated by the Faculte Polytechnique de Mons (Belgium), is to obtain a set of speech synthesizers for as many voices, languages and dialects as possible, free of use for non-commercial and non-military applications. The ultimate goal is to boost up academic research on speech synthesis, and particularly on prosody generation, known as one of the biggest challenges taken up by Text-to-Speech synthesizers for the years to come.
The German Text-to-Speech synthesis system MARY: A tool for research, development and teaching
- International Journal of Speech Technology
, 2001
"... Abstract. This paper introduces the German text-to-speech synthesis system MARY. The system’s main features, namely a modular design and an XML-based system-internal data representation, are pointed out, and the properties of the individual modules are briefly presented. An interface allowing the us ..."
Abstract
-
Cited by 42 (13 self)
- Add to MetaCart
Abstract. This paper introduces the German text-to-speech synthesis system MARY. The system’s main features, namely a modular design and an XML-based system-internal data representation, are pointed out, and the properties of the individual modules are briefly presented. An interface allowing the user to access and modify intermediate processing steps without the need for a technical understanding of the system is described, along with examples of how this interface can be put to use in research, development and teaching. The usefulness of the modular and transparent design approach is further illustrated with an early prototype of an interface for emotional speech synthesis.
Reducing Audible Spectral Discontinuities
, 2001
"... In this paper, a common problem in diphone synthesis is discussed, viz., the occurrence of audible discontinuities at diphone boundaries. Informal observations show that spectral mismatch is most likely the cause of this phenomenon. We first set out to find an objective spectral measure for disconti ..."
Abstract
-
Cited by 35 (5 self)
- Add to MetaCart
In this paper, a common problem in diphone synthesis is discussed, viz., the occurrence of audible discontinuities at diphone boundaries. Informal observations show that spectral mismatch is most likely the cause of this phenomenon. We first set out to find an objective spectral measure for discontinuity. To this end, several spectral distance measures are related to the results of a listening experiment. Then, we studied the feasibility of extending the diphone database with context-sensitive diphones to reduce the occurrence of audible discontinuities. The number of additional diphones is limited by clustering consonant contexts that have a similar effect on the surrounding vowels on the basis of the best performing distance measure. A listening experiment has shown that the addition of these context-sensitive diphones significantly reduces the amount of audible discontinuities. Index Terms---Audible discontinuities, context-sensitive diphones, spectral distance measures. I. INTROD...
A Multi-Strategy Approach to Improving Pronunciation by Analogy
"... Pronunciation by analogy (PbA) is a data-driven method for relating letters to sound, with potential application to next-generation text-to-speech systems. This paper extends previous work on PbA in several directions. First, we have included `full' pattern matching between input letter string and d ..."
Abstract
-
Cited by 25 (3 self)
- Add to MetaCart
Pronunciation by analogy (PbA) is a data-driven method for relating letters to sound, with potential application to next-generation text-to-speech systems. This paper extends previous work on PbA in several directions. First, we have included `full' pattern matching between input letter string and dictionary entries, as well as including lexical stress in letter-to-phoneme conversion. Second, we have extended the method to phonemeto -letter conversion. Third, and most important, we have experimented with multiple, different strategies for scoring the candidate pronunciations. Individual scores for each strategy are obtained on the basis of rank and either multiplied or summed to produce a final, overall score. Five strategies have been studied and results obtained from all 31 possible combinations. The two combination methods perform comparably, with the product rule only very marginally superior to the sum rule. Nonparametric statistical analysis reveals that performance improves as more strategies are included in the combination: this trend is very highly significant ( p 0 0005). Accordingly for letter-to-phoneme conversion, best results are obtained when all five strategies are combined: word accuracy is raised to 65.5% relative to 61.7% for our best previous result and 63.0% for the best-performing single strategy. These improvements are very highly significant ( p 0 and p 0 00011 respectively). Similar results were found for phoneme-to-letter and letter-to-stress conversion, although the former was an easier problem for PbA than letter-to-phoneme conversion and the latter was harder. The main sources of error for the multi-strategy approach are very similar to those for the best single strategy, and mostly involve vowel letters and phonemes. 1
Evaluating the Pronunciation Component of Text-to-Speech Systems for English: A Performance Comparison of Different Approaches
- IN SPEECH AND LANGUAGE TECHNOLOGY (SALT) CLUB WORKSHOP ON EVALUATION IN SPEECH AND LANGUAGE TECHNOLOGY
, 1997
"... The automatic derivation of word pronunciations from input text is a central task for any text-to-speech system. For general English text at least, this is often thought to be a solved problem, with manually-derived linguistic rules assumed capable of handling `novel' words missing from the system ..."
Abstract
-
Cited by 24 (8 self)
- Add to MetaCart
The automatic derivation of word pronunciations from input text is a central task for any text-to-speech system. For general English text at least, this is often thought to be a solved problem, with manually-derived linguistic rules assumed capable of handling `novel' words missing from the system dictionary. Data-driven methods, based on machine learning of the regularities implicit in a large pronouncing dictionary, have received considerable attention recently but are generally thought to perform less well. However, these tentative beliefs are at best uncertain without powerful methods for comparing text-to-phoneme subsystems. This paper contributes to the development of such methods by comparing the performance of four representative approaches to automatic phonemisation on the same test dictionary. As well as rule-based approaches, three data-driven techniques are evaluated: pronunciation by analogy (PbA), NETspeak and IB1-IG (a modified k-nearest neighbour method). Issues involved in comparative evaluation are detailed and elucidated. The data-driven techniques outperform rules in accuracy of letter-to-phoneme translation by a very significant margin but require aligned text-phoneme training data and are slower. Best translation results are obtained with PbA at approximately 72% words correct on a reasonably large pronouncing dictionary, compared to something like 26% words correct for the rules, indicating that automatic pronunciation of text is not a solved problem.
Diphone Concatenation using a Harmonic plus Noise Model of Speech
- Proc. EUROSPEECH
, 1997
"... In this paper we present a high-quality text-to-speech system using diphones. The system is based on a Harmonic plus Noise (HNM) representation of the speech signal. HNM is a pitch-synchronous analysis-synthesis system but does not require pitch marks to be determined as necessary in PSOLA-based met ..."
Abstract
-
Cited by 14 (6 self)
- Add to MetaCart
In this paper we present a high-quality text-to-speech system using diphones. The system is based on a Harmonic plus Noise (HNM) representation of the speech signal. HNM is a pitch-synchronous analysis-synthesis system but does not require pitch marks to be determined as necessary in PSOLA-based methods. HNM assumes the speech signal to be composed of a periodic part and a stochastic part. As a result, different prosody and spectral envelope modification methods can be applied to each part, yielding more natural-sounding synthetic speech. The fully parametric representation of speech using HNM also provides a straightforward way of smoothing diphone boundaries. Informal listening tests, using natural prosody, have shown that the synthetic speech quality is close to the quality of the original sentences, without smoothing problems and without buzziness or other oddities observed with other speech representations used for TTS. 1. INTRODUCTION Many current Text-To-Speech (TTS) systems a...
Lifelike Gesture Synthesis and Timing for Conversational Agents
- Proc. of Gesture Wokshop 2001
, 2001
"... Synthesis of lifelike gesture is finding growing attention in human-computer interaction. ..."
Abstract
-
Cited by 10 (1 self)
- Add to MetaCart
Synthesis of lifelike gesture is finding growing attention in human-computer interaction.
"May I talk to you? :-)" -- Facial Animation from Text
- IN PROCEEDINGS OF PACIFIC GRAPHICS 2002
, 2002
"... We introduce a facial animation system that produces real-time animation sequences including speech synchronization and non-verbal speech-related facial expressions from plain text input. A state-of-the-art text-to-speech synthesis component performs linguistic analysis of the text input and creates ..."
Abstract
-
Cited by 7 (3 self)
- Add to MetaCart
We introduce a facial animation system that produces real-time animation sequences including speech synchronization and non-verbal speech-related facial expressions from plain text input. A state-of-the-art text-to-speech synthesis component performs linguistic analysis of the text input and creates a speech signal from phonetic and intonation information. The phonetic transcription is additionally used to drive a speech synchronization method for the physically based facial animation. Further high-level information from the linguistic analysis such as different types of accents or pauses as well as the type of the sentence is used to generate non-verbal speech-related facial expressions such as movement of head, eyes, and eyebrows or voluntary eye blinks. Moreover, emoticons are translated into XML markup that triggers emotional facial expressions.
Generating Multilingual Personalized Descriptions of Museum Exhibits - The M-PIRO Project
- In Proceedings of the International Conference on Computer Applications and Quantitative Methods in Archaeology
, 2001
"... This paper provides an overall presentation of the M-PIRO project. M-PIRO is developing technology that will allow museums to generate automatically textual or spoken descriptions of exhibits for collections available over the Web or in virtual reality environments. The descriptions are generated in ..."
Abstract
-
Cited by 5 (0 self)
- Add to MetaCart
This paper provides an overall presentation of the M-PIRO project. M-PIRO is developing technology that will allow museums to generate automatically textual or spoken descriptions of exhibits for collections available over the Web or in virtual reality environments. The descriptions are generated in several languages from information in a language-independent database and small fragments of text, and they can be tailored according to the backgrounds of the users, their ages, and their previous interaction with the system. An authoring tool allows museum curators to update the system's database and to control the language and content of the resulting descriptions. Although the project is still in progress, a Web-based demonstrator that supports English, Greek and Italian is already available, and it is used throughout the paper to highlight the capabilities of the emerging technology.

