Functional Phonology -- Formalizing the interactions between articulatory and perceptual drives
, 1998
Speaking In Shorthand -- A Syllable-Centric Perspective For Understanding Pronunciation Variation
, 1998
Cited by 93 (12 self)
Current-generation automatic speech recognition (ASR) systems model spoken discourse as a linear sequence of words and phones. Because it is unusual for every phone within a word to be pronounced in a standard ("canonical") way, ASR systems often depend on a multi-pronunciation lexicon to match an acoustic sequence with a lexical unit. Since there are, in practice, many different ways for a word to be pronounced, this standard approach adds a layer of complexity and ambiguity to the decoding process which, if modified, could potentially improve recognition performance. Systematic analysis of pronunciation variation in a corpus of spontaneous English discourse (Switchboard) demonstrates that the variation observed is systematic at the level of the syllable. Syllabic onsets are realized in canonical form far more frequently than either coda or nuclear constituents. Prosodic stress also plays an important role in pronunciation. The governing mechanism is likely to involve the informationa...
Speech sound acquisition, coarticulation, and rate effects in a neural network model of speech production
- Psychological Review
, 1995
Cited by 52 (21 self)
This article describes a neural network model of speech motor skill acquisition and speech production that explains a wide range of data on variability, motor equivalence, coarticulation, and rate effects. Model parameters are learned during a babbling phase. To explain how infants learn language-specific variability limits, speech sound targets take the form of convex regions, rather than points, in orosensory coordinates. Reducing target size for better accuracy during slower speech leads to differential effects for vowels and consonants, as seen in experiments previously used as evidence for separate control processes for the 2 sound types. Anticipatory coarticulation arises when targets are reduced in size on the basis of context; this generalizes the well-known look-ahead model of coarticulation. Computer simulations verify the model's properties. The primary goal of the modeling work described in this article is to provide a coherent theoretical framework that provides explanations for a wide range of data concerning the articulator movements used by humans to produce speech sounds. This is carried out by formulating a model that transforms strings of phonemes into continuous articulator movements for
The challenge of spoken language systems: Research directions for the nineties
- IEEE Transactions on Speech and Audio Processing
, 1995
Cited by 34 (5 self)
This article is based on a February 1992 workshop sponsored by the National Science
Control of Spectral Dynamics in Concatenative Speech Synthesis
- IEEE Trans. Speech and Audio Processing
, 2001
Cited by 29 (1 self)
Current speech synthesis methods based on the concatenation of waveform units can produce highly intelligible speech capturing the identity of a particular speaker. However, the quality of concatenated speech often suffers from discontinuities between the acoustic units, due to contextual differences and variations in speaking style across the database. In this paper, we present methods to spectrally modify speech units in a concatenative synthesizer to correspond more closely to the acoustic transitions observed in natural speech. First, a technique called "unit fusion" is proposed to reduce spectral mismatch between units. In addition to concatenation units, a second, independent tier of units is selected that defines the desired spectral dynamics at concatenation points. Both unit tiers are "fused" to obtain natural transitions throughout the synthesized utterance. The unit fusion method is further extended to control the perceived degree of articulation of concatenated units. In the...
The Elements of Functional Phonology
Cited by 23 (6 self)
Phonological structures and processes are determined by the functional principles of minimization of articulatory effort and maximization of perceptual contrast. We can solve many hitherto controversial issues if we are aware of the different roles of articulation and perception in phonology. Traditionally separate devices like the segment, spreading, licensing, underspecification, feature geometry, and OCP effects, are surface phenomena created by the interaction of more fundamental principles.
Pitch targets and their realization: Evidence from Mandarin Chinese
, 2001
Cited by 22 (8 self)
In this paper we propose a preliminary framework for accounting for certain surface F0 variations in speech. The framework consists of definitions for pitch targets and rules of their implementation. Pitch targets are defined as the smallest operable units associated with linguistically functional pitch units, and they are comparable to segmental phones. The implementation rules are based on possible articulatory constraints on the production of surface F0 contours. Due to these constraints, the implementation of a simple pitch target may result in surface F0 forms that only partially reflect the underlying pitch targets. We will also discuss possible implications of this framework on our understanding of various observed F0 patterns, including carryover and anticipatory variations, downstep, declination, and F0 peak alignment. Finally, we will consider possible interactions between local and non-local pitch targets. 1.0 Introduction To understand the acoustic manifestation of s...
Fast Speakers In Large Vocabulary Continuous Speech Recognition: Analysis & Antidotes
- Proceedings of the Eurospeech Conference, Madrid
, 1995
Cited by 20 (2 self)
The performance of automatic speech recognizers (ASR) typically degrades for test speakers with "outlier" characteristics, for example, speakers with a foreign accent or a fast speaking rate. In this work, we concentrate on the latter. Consistent with other researchers, we have observed that for speakers with an exceptionally high speaking rate, the word recognition error is significantly higher. We have investigated two possible causes for this effect. Inherent spectral differences may cause the extracted features for these outliers to be significantly different from those of normal speech. Also, due to phone omissions and duration reduction, the normal word-models may not be suitable for fast speech. Based on our exploratory experiments on TIMIT and WSJ corpora, we believe the spectral differences and duration reduction are both significant sources of the increased error. By adapting our MLP phonetic probability estimator to fast speech, and employing fast speaker word-models, we have been ...
Structured speech modeling
- IEEE Transactions on Audio, Speech and Language Processing (Special Issue on Rich Transcription)
, 2006
Cited by 19 (11 self)
Modeling dynamic structure of speech is a novel paradigm in speech recognition research within the generative modeling framework, and it offers a potential to overcome limitations of the current hidden Markov modeling approach. Analogous to structured language models where syntactic structure is exploited to represent long-distance relationships among words [5], the structured speech model described in this paper makes use of the dynamic structure in the hidden vocal tract resonance space to characterize long-span contextual influence among phonetic units. A general overview is provided first on hierarchically classified types of dynamic speech models in the literature. A detailed account is then given for a specific model type called the hidden trajectory model, and we describe detailed steps of model construction and the parameter estimation algorithms. We show how the use of resonance target parameters and their temporal filtering enables joint modeling of long-span coarticulation and phonetic reduction effects. Experiments on phonetic recognition evaluation demonstrate superior recognizer performance over a modern hidden Markov model-based system. Error analysis shows that the greatest performance gain occurs within the sonorant speech class. Index Terms: Hidden dynamics, hidden trajectory, long span modeling, maximum-likelihood, nonlinear prediction, parameter learning, structured modeling, vocal tract resonance.
Automatic Prosodic Analysis for Computer Aided Pronunciation Teaching
, 1994
Cited by 14 (0 self)
Correct pronunciation of spoken language requires the appropriate modulation of acoustic characteristics of speech to convey linguistic information at a suprasegmental level. Such prosodic modulation is a key aspect of spoken language and is an important component of foreign language learning, for purposes of both comprehension and intelligibility. Computer aided pronunciation teaching involves automatic analysis of the speech of a non-native talker in order to provide a diagnosis of the learner's performance in comparison with the speech of a native talker. This thesis describes research undertaken to automatically analyse the prosodic aspects of speech for computer aided pronunciation teaching. It is necessary to describe the suprasegmental composition of a learner's speech in order to characterise significant deviations from a native-like prosody, and to offer some kind of corrective diagnosis. Phonological theories of prosody aim to describe the suprasegmental composition of speech...

