Incorporating Information From Syllable-length Time Scales into Automatic Speech Recognition
- In ICASSP
, 1998
Cited by 45 (4 self)
Incorporating the concept of the syllable into speech recognition may improve recognition accuracy through the integration of information over syllable-length time spans. Evidence from psychoacoustics and phonology suggests that humans use the syllable as a basic perceptual unit. Nonetheless, the explicit use of such long-timespan units is comparatively unusual in automatic speech recognition systems for English. The work described in this thesis explored the utility of information collected over syllable-related time-scales. The first approach involved integrating syllable segmentation information into the speech recognition process. The addition of acoustically-based syllable onset estimates [184] resulted in a 10% relative reduction in word-error rate. The second approach began with developing four speech recognition systems based on long-time-span features and units, including modulation spectrogram features [80]. Error analysis suggested the strategy of combining, which led to the implementation of methods that merged the outputs of syllable-based recognition systems with the phone-oriented baseline system at the frame level, the syllable level and the whole-utterance level. These combined systems exhibited relative improvements of 20-40% compared to the baseline system for clean and reverberant speech test cases.
Hidden Markov Models and Neural Networks for Speech Recognition
, 1998
Cited by 19 (1 self)
The Hidden Markov Model (HMM) is one of the most successful modeling approaches for acoustic events in speech recognition, and more recently it has proven useful for several problems in biological sequence analysis. Although the HMM is good at capturing the temporal nature of processes such as speech, it has a very limited capacity for recognizing complex patterns involving more than first order dependencies in the observed data sequences. This is due to the first order state process and the assumption of state conditional independence between observations. Artificial Neural Networks (NNs) are almost the opposite: they cannot model dynamic, temporally extended phenomena very well, but are good at static classification and regression tasks. Combining the two frameworks in a sensible way can therefore lead to a more powerful model with better classification abilities. The overall aim of this work has been to develop a probabilistic hybrid of hidden Markov models and neural networks and ...
Named Entity Tagged Language Models
- IN PROCEEDINGS OF ICASSP-99, VOL. I
, 1999
Cited by 9 (4 self)
We introduce Named Entity (NE) Language Modelling, a stochastic finite state machine approach to identifying both words and NE categories from a stream of spoken data. We provide an overview of our approach to NE tagged language model (LM) generation together with results of the application of such an LM to the task of out-of-vocabulary (OOV) word reduction in large vocabulary speech recognition. Using the Wall Street Journal and Broadcast News corpora, it is shown that the tagged LM was able to reduce the overall word error rate by 14%, detecting up to 70% of previously OOV words. We also describe an example of the direct tagging of spoken data with NE categories.
Hidden Neural Networks: Application To Speech Recognition
- In Proc. IEEE ICASSP
, 1998
Cited by 2 (1 self)
In this paper we evaluate the Hidden Neural Network HMM/NN hybrid presented at last year's ICASSP on two speech recognition benchmark tasks: 1) task-independent isolated word recognition on the PHONEBOOK database, and 2) recognition of broad phoneme classes in continuous speech from the TIMIT database. It is shown how Hidden Neural Networks (HNNs) with far fewer parameters than conventional HMMs and other hybrids can obtain comparable performance, and for the broad class task it is illustrated how the HNN can be applied as a purely transition based system, where acoustic context dependent transition probabilities are estimated by neural networks.
1. INTRODUCTION
Although the HMM is good at capturing the temporal nature of processes such as speech, it has a very limited capacity for recognizing complex patterns involving more than first order dependencies in the observed data. This is primarily due to the first order state process and the assumption of state conditional observation inde...
MT and Topic-Based Techniques to Enhance Speech Recognition Systems for Professional Translators
Cited by 2 (0 self)
Our principal objective was to reduce the error rate of speech recognition systems used by professional translators. Our work concentrated on Spanish-to-English translation.
INtegrating SPEech acoustic and linguistic ConsTraints: Baseline System Development
, 1999
Cited by 1 (1 self)
In this report, we discuss the initial issues addressed in a research project aiming at the development of an advanced natural speech recognition system for the automatic processing of telephone directory requests. This multi-faceted project involves (1) text processing (labeling and tagging) of a large database of telephone-based natural voice requests (including all kinds of peculiarities), (2) development of robust acoustic models, (3) integrating advanced natural language (syntactic and semantic) constraints, (4) detecting and dealing with a large number of out-of-vocabulary words (proper names), and (5) testing of the resulting system on natural queries. All this work will be performed on the basis of a database containing prompted (read) speech and (simulated) natural requests to information service. This report describes the initial steps that were required to set up a reasonable baseline system and a good research and evaluation framework. More specifically, a significant amou...
MT and Topic-Based Techniques to Enhance Speech Recognition Systems for Professional Translators
, 2000
Our principal objective was to reduce the error rate of speech recognition systems used by professional translators. Our work concentrated on Spanish-to-English translation. In a baseline study we estimated the error rate of an off-the-shelf recognizer to be 9.98%. In this paper we describe two independent methods of improving speech recognizers: a machine translation (MT) method and a topic-based one. An evaluation of the MT method suggests that the vocabulary used for recognition cannot be completely restricted to the set of translations produced by the MT system and a more sophisticated constraint system must be used. An evaluation of the topic-based method showed significant error rate reduction, to 5.07%.
Introduction
Our goal is to improve the throughput of professional translators by using speech recognition. The problem with using current off-the-shelf speech recognition systems is that these systems have high error rates for similar tasks. If the task is sim...
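The abstract above reports absolute error rates (9.98% baseline, 5.07% with the topic-based method), whereas other entries in this list quote relative reductions. The short sketch below shows how the two figures might be converted to a relative reduction for comparison. The function name `relative_reduction` is illustrative and does not come from the paper.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Return the relative reduction, as a fraction, from baseline to improved."""
    if baseline <= 0:
        raise ValueError("baseline error rate must be positive")
    return (baseline - improved) / baseline


# Error rates quoted in the abstract, in percent.
reduction = relative_reduction(9.98, 5.07)
print(f"{reduction:.1%}")  # prints 49.2%
```

On these two numbers the topic-based method corresponds to roughly a 49% relative reduction in error rate.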
Presented at ICASSP-97, Munich, vol. 2 pp. 987-990.
- In Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing
, 1997
In this paper we examine the proposition that knowledge of the timing of syllabic onsets may be useful in improving the performance of speech recognition systems. A method of estimating the location of syllable onsets derived from the analysis of energy trajectories in critical band channels has been developed, and a syllable-based decoder has been designed and implemented that incorporates this onset information into the speech recognition process. For a small, continuous speech recognition task the addition of artificial syllabic onset information (derived from advance knowledge of the word transcriptions) lowers the word error rate by 38%. Incorporating acoustically-derived syllabic onset information reduces the word error rate by 10% on the same task. The latter experiment has highlighted representational issues on coordinating acoustic and lexical syllabifications, a topic we are beginning to explore.
A Study of the Use and Evaluation of Confidence . . .
, 1998
Confidence measures have been found to be useful for a number of tasks within the field of Automatic Speech Recognition (ASR). For example, the use of confidence measures has been reported in the utterance verification, keyword spotting and Out-of-Vocabulary (OOV) word spotting literature. In this report, it is shown that so-called 'hybrid Artificial Neural Network/Hidden Markov Model' (HMM/ANN) systems are well suited to the task of generating confidence measures, due to their ability to provide local phone class posterior probability estimates which may be used to generate confidence measures in a computationally efficient manner. A number of evaluation metrics are also described and the performance of five confidence measures derived from the ABBOT hybrid HMM/ANN system for the tasks of utterance verification and OOV word spotting are evaluated using these metrics. Besides the tasks described above, confidence measures may also be used for tasks such as filtering the acoustics for a nu...
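The abstract above does not specify its five confidence measures. As an illustration of how local phone class posteriors might yield a cheap confidence score, the sketch below averages the log posterior of the aligned phone class over the frames of a segment. This duration-normalised form is an assumption chosen for illustration, not necessarily one of the report's measures, and the names `mean_log_posterior`, `posteriors` and `aligned_phones` are hypothetical.

```python
import math


def mean_log_posterior(posteriors, aligned_phones, floor=1e-10):
    """Average log posterior of the aligned phone class over a segment.

    posteriors: one list of per-class probabilities for each frame.
    aligned_phones: the class index the decoder assigned to each frame.
    floor: lower bound applied before taking the log (an assumption,
           used here only to avoid log(0)).
    """
    if not posteriors or len(posteriors) != len(aligned_phones):
        raise ValueError("need one aligned phone index per non-empty frame list")
    total = sum(
        math.log(max(frame[phone], floor))
        for frame, phone in zip(posteriors, aligned_phones)
    )
    return total / len(posteriors)


# Toy two-class example: the first segment's frames strongly support the
# aligned class, the second segment's frames do not.
confident = mean_log_posterior([[0.9, 0.1], [0.8, 0.2]], [0, 0])
doubtful = mean_log_posterior([[0.4, 0.6], [0.3, 0.7]], [0, 0])
print(confident > doubtful)  # prints True
```

Scores closer to zero indicate stronger acoustic support for the hypothesised segment. A threshold on such a score could then be tuned for a task such as utterance verification.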
Informing Multisource Decoding in Robust Automatic Speech Recognition
, 2008
Listeners are remarkably adept at recognising speech in natural multisource environments, while most Automatic Speech Recognition (ASR) technology fails in these conditions. It has been proposed that this human ability is governed by Auditory Scene Analysis (ASA) processes, in which a sound mixture is segregated into perceptual packages, called ‘streams’, by a combination of bottom-up and top-down processing. This thesis examines a novel ASR framework based on the ASA account, Speech Fragment Decoding (SFD). A ‘fragment’ is a spectro-temporal region where energy from a single sound source dominates. SFD employs techniques developed from knowledge about the auditory system to identify fragments. A decoding process using statistical speech models is applied to the fragment representation to simultaneously identify speech evidence and recognise speech. In this study three techniques for improving SFD are investigated. Firstly, explicit duration modelling is exploited to combat the corruption of acoustic data which often causes the decoder to produce word matches with unrealistic durations. Secondly, it is argued that the top-down information in recognition models may be insufficient to mediate the speech

