A Syllable, Articulatory-Feature, and Stress-Accent Model of Speech Recognition
, 2002
Cited by 11 (4 self)
Current-generation automatic speech recognition (ASR) systems assume that words are readily decomposable into constituent phonetic components ("phonemes"). A detailed linguistic dissection of state-of-the-art speech recognition systems indicates that the conventional phonemic "beads-on-a-string" approach is of limited utility, particularly with respect to informal, conversational material. The study shows that there is a significant gap between the observed data and the pronunciation models of current ASR systems. It also shows that many important factors affecting recognition performance are not modeled explicitly in these systems.
Articulatory feature-based methods for acoustic and audio-visual speech recognition: Summary from the 2006 JHU summer workshop
- Johns Hopkins University Center for
, 2007
Cited by 11 (6 self)
We report on investigations, conducted at the 2006 JHU Summer Workshop, of the use of articulatory features in automatic speech recognition. We explore the use of articulatory features for both observation and pronunciation modeling, and for both audio-only and audio-visual speech recognition. In the area of observation modeling, we use the outputs of a set of multilayer perceptron articulatory feature classifiers (1) directly, in an extension of hybrid HMM/ANN models, and (2) as part of the observation vector in a standard Gaussian mixture-based model, an extension of the now popular “tandem” approach. In the area of pronunciation modeling, we explore models consisting of multiple hidden streams of states, each corresponding to a different articulatory feature and having soft synchrony constraints, for both audio-only and audio-visual speech recognition. Our models are implemented as dynamic Bayesian networks, and our
Statistical Modelling in Continuous Speech Recognition (CSR)
- In Conference on Uncertainty in Artificial Intelligence
, 2001
Cited by 7 (1 self)
Automatic continuous speech recognition (CSR) is sufficiently
Techniques for modelling Phonological Processes in Automatic Speech Recognition
, 2001
Cited by 6 (0 self)
Declaration: This dissertation is the result of my own work and includes nothing which is the outcome of work done in collaboration, except where stated. It has not been submitted in whole or part for a degree at any other university. The length of this thesis including footnotes and appendices does not exceed 29,500 words and includes no more than 40 figures.
Systems which automatically transcribe carefully dictated speech are now commercially available, but their performance degrades dramatically when the speaking style of users becomes more relaxed or conversational. This dissertation focuses on techniques that aim to improve the robustness of statistical speech transcription systems to conversational speaking styles. The dissertation shows first that the performance degradation occurring as speech becomes more conversational is severe and is partially attributable to differences in the acoustic realizations of sentences. Hypothesizing that the quantifiably wider range of
PARSING SPEECH INTO ARTICULATORY EVENTS
Cited by 5 (0 self)
In this paper, the speech production process state is defined by a number of categorical articulatory features. We describe a detector that outputs a stream (sequence of classes) for each articulatory feature given the Mel frequency cepstral coefficient (MFCC) representation of the input speech. The detector consists of a bank of recurrent neural network (RNN) classifiers, a variable depth lattice generator and Viterbi decoder. A bank of classifiers has been previously used for articulatory feature detection by many researchers. However, we extend their work first by creating variable depth lattices for each feature and then by combining them into product lattices for rescoring using the Viterbi algorithm. During the rescoring we incorporate language and duration constraints along with the posterior probabilities of classes provided by the RNN classifiers. We present our results for place and manner features using TIMIT data, and compare the results to a baseline system. We report performance improvements both at the frame and segment levels.
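The decoding step in this abstract can be illustrated with a much simplified sketch: plain Viterbi decoding over frame-level class posteriors for a single articulatory feature. This is not the authors' variable depth lattice or product lattice rescoring. The function name `viterbi_decode`, the transition matrix (used here as a crude stand-in for the language and duration constraints), and all numbers are assumptions chosen for illustration.

```python
import math


def viterbi_decode(posteriors, transitions, initial):
    """Most likely class sequence for one articulatory feature stream.

    posteriors[t][k]  : classifier posterior of class k at frame t (assumed input)
    transitions[j][k] : probability of moving from class j to class k
                        (hypothetical stand-in for language/duration constraints)
    initial[k]        : prior probability of starting in class k
    """
    def log(p):
        # Work in the log domain; zero probabilities map to -inf.
        return math.log(p) if p > 0 else float("-inf")

    n_classes = len(initial)
    scores = [log(initial[k]) + log(posteriors[0][k]) for k in range(n_classes)]
    backpointers = []

    for frame in posteriors[1:]:
        new_scores, pointers = [], []
        for k in range(n_classes):
            # Best predecessor class for class k at this frame.
            best_prev = max(
                range(n_classes),
                key=lambda j: scores[j] + log(transitions[j][k]),
            )
            new_scores.append(
                scores[best_prev] + log(transitions[best_prev][k]) + log(frame[k])
            )
            pointers.append(best_prev)
        scores = new_scores
        backpointers.append(pointers)

    # Trace the best path backwards from the best final class.
    state = max(range(n_classes), key=lambda k: scores[k])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Two classes, three frames; the middle frame is weakly in favour of class 1.
frames = [[0.9, 0.1], [0.45, 0.55], [0.9, 0.1]]
sticky = [[0.9, 0.1], [0.1, 0.9]]  # high self-transition favours longer segments
print(viterbi_decode(frames, sticky, [0.5, 0.5]))
```

With these made-up numbers, a frame-by-frame argmax would switch class for the single middle frame, while the high self-transition probabilities make the decoder keep one class throughout. That is the kind of smoothing a duration constraint is meant to provide.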
Structural Representation of Speech for Phonetic Classification
- In: Proc. 17th ICPR. Volume 3
, 2004
Cited by 4 (2 self)
This paper explores the issues involved in using symbolic metric algorithms for automatic speech recognition (ASR), via a structural representation of speech. This representation is based on a set of phonological distinctive features which is a linguistically well-motivated alternative to the "beads-on-a-string" view of speech that is standard in current ASR systems. We report the promising results of phoneme classification experiments conducted on a standard continuous speech task.
Reaching over the gap: A review of efforts to link human and automatic speech recognition research
, 2007
Capturing fine-phonetic variation in speech through automatic classification of articulatory features
- In: Proceedings of the workshop on Speech Recognition and Intrinsic Variation
, 2006
Cited by 3 (2 self)
The ultimate goal of our research is to develop a computational model of human speech recognition that is able to capture the effects of fine-grained acoustic variation on speech recognition behaviour. As part of this work we are investigating automatic feature classifiers that are able to create reliable and accurate transcriptions of the articulatory behaviour encoded in the acoustic speech signal. In the experiments reported here, we compared support vector machines (SVMs) with multilayer perceptrons (MLPs). MLPs have been widely (and rather successfully) used for the task of multi-value articulatory feature classification, while (to the best of our knowledge) SVMs have not. This paper compares the performances of the two classifiers and analyses the results in order to better understand the articulatory representations. It was found that the MLPs outperformed the SVMs, but it is concluded that both classifiers exhibit similar behaviour in terms of patterns of errors.
The Functional Load of Phonological Contrasts
, 2003
Cited by 3 (1 self)
this paper is broader than standard, encompassing phoneme oppositions (binary or not), distinctive features (again, binary or not), suprasegmental features and even phonological rules such as phoneme deletion in certain contexts. This permits researchers with the appropriate corpora to answer questions like these: Is it more important to correctly hear the tone or the vowel in Cantonese? Does Hindi make more use of aspiration or voicing? How much information is lost due to vowel reduction in unstressed syllables? If second-language speakers have trouble learning contrasts that are not present in their native language, e.g. the [l]-[r] distinction in English for Japanese speakers, how badly off are they?
An Elitist Approach to Automatic Articulatory-Acoustic Feature Classification for . . .
Cited by 3 (2 self)
A novel framework for automatic articulatory-acoustic feature extraction has been developed for enhancing the accuracy of place- and manner-of-articulation classification in spoken language. The "elitist" approach provides a principled means of selecting frames for which multilayer perceptron, neural-network classifiers are highly confident. Using this method it is possible to achieve a frame-level accuracy of 93% on "elitist" frames for manner classification on a corpus of American English sentences passed through a telephone network (NTIMIT). Place-of-articulation information is extracted for each manner class independently, resulting in an appreciable gain in place-feature classification relative to performance for a manner-independent system. A comparable enhancement in classification performance for the elitist approach is evidenced when applied to a Dutch corpus of quasi-spontaneous telephone interactions (VIOS). The elitist framework provides a potential means of automatically annotating a corpus at the phonetic level without recourse to a word-level transcript and could thus be of utility for developing training materials for automatic speech recognition and speech synthesis applications, as well as aid the empirical study of spoken language.
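The frame selection idea in this abstract can be sketched as a confidence filter over classifier outputs: keep only frames whose highest class posterior clears a threshold. The abstract does not give the selection criterion or any threshold value, so the function name `select_elitist_frames`, the default of 0.7, and the sample posteriors are assumptions for illustration only.

```python
def select_elitist_frames(posteriors, threshold=0.7):
    """Keep frames where the classifier is highly confident.

    posteriors[t][k] : posterior of class k at frame t (assumed MLP output)
    threshold        : hypothetical confidence cut-off, not taken from the paper

    Returns a list of (frame_index, winning_class) pairs.
    """
    selected = []
    for index, frame in enumerate(posteriors):
        # Class with the highest posterior for this frame.
        best_class = max(range(len(frame)), key=lambda k: frame[k])
        if frame[best_class] >= threshold:
            selected.append((index, best_class))
    return selected


# Three frames over two manner classes; the middle frame is ambiguous.
frames = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]
print(select_elitist_frames(frames))
```

With these sample values the ambiguous middle frame is discarded and the two confident frames are kept with their winning class. Raising the threshold trades coverage for accuracy on the retained frames.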

