Results 1 - 9 of 9
Loosely Coupled HMMs for ASR
In Intl. Conf. on Acoustics, Speech and Signal Proc, 2000
Cited by 15 (0 self)
Hidden Markov Models (HMMs) have been successful for modelling the dynamics of carefully dictated speech, but their performance degrades severely when used to model conversational speech. This paper presents a preliminary feasibility study of an alternative class of models: loosely coupled HMMs. Since speech is produced by a system of loosely coupled articulators, stochastic models explicitly representing this parallelism may have advantages for automatic speech recognition (ASR), particularly when trying to model the phonological effects inherent in casual spontaneous speech. The paper evaluates one coupled model on a simple ASR task, using both exact and approximate estimation schemes. We conclude such models merit further investigation.
Perceptually Inspired Signal-processing Strategies for Robust Speech Recognition in Reverberant Environments
1998
Cited by 12 (0 self)
Natural, hands-free interaction with computers is currently one of the great unfulfilled promises of automatic speech recognition (ASR), in part because ASR systems cannot reliably recognize speech under everyday, reverberant conditions that pose no problems for most human listeners. The specific properties of the auditory representation of speech likely contribute to reliable human speech recognition under such conditions. This dissertation explores the use of perceptually inspired signal-processing strategies (critical-band-like frequency analysis, an emphasis on slow changes in the spectral structure of the speech signal, adaptation, integration of phonetic information over syllabic durations, and use of multiple signal representations) for...
A Syllable, Articulatory-Feature, and Stress-Accent Model of Speech Recognition
2002
Cited by 11 (4 self)
Current-generation automatic speech recognition (ASR) systems assume that words are readily decomposable into constituent phonetic components ("phonemes"). A detailed linguistic dissection of state-of-the-art speech recognition systems indicates that the conventional phonemic "beads-on-a-string" approach is of limited utility, particularly with respect to informal, conversational material. The study shows that there is a significant gap between the observed data and the pronunciation models of current ASR systems. It also shows that many important factors affecting recognition performance are not modeled explicitly in these systems.
Techniques for modelling Phonological Processes in Automatic Speech Recognition
2001
Cited by 6 (0 self)
Systems which automatically transcribe carefully dictated speech are now commercially available, but their performance degrades dramatically when the speaking style of users becomes more relaxed or conversational. This dissertation focuses on techniques that aim to improve the robustness of statistical speech transcription systems to conversational speaking styles. The dissertation shows first that the performance degradation occurring as speech becomes more conversational is severe and is partially attributable to differences in the acoustic realizations of sentences. Hypothesizing that the quantifiably wider range of
Comparison of HMM experts with MLP experts in the Full Combination Multi-Band Approach to Robust ASR
2000
Cited by 6 (2 self)
In this paper we apply the Full Combination (FC) multi-band approach, which was originally introduced in the framework of posterior-based HMM/ANN (Hidden Markov Model/Artificial Neural Network) hybrid systems, to systems in which the ANN (or Multilayer Perceptron (MLP)) is itself replaced by a Multi Gaussian HMM (MGM). Both systems represent the most widely used statistical models for robust ASR (automatic speech recognition). It is shown how the FC formula for the likelihood-based MGMs can easily be derived from the posterior-based approach by simply applying Bayes' Rule. The experiments show that the Full Combination multi-band system with MGM experts performs better, in all noise conditions tested, than the simple sum and product rules which are normally used. Compared to the baseline full-band system, the FC system shows increased robustness mainly on band-limited noise. The goal of this article is not a performance comparison between Multilayer Perceptrons and Multi Gaussi...
Sooner Or Later: Exploring Asynchrony In Multi-Band Speech Recognition
Proceedings of Eurospeech-99, Budapest, 1999
Cited by 2 (0 self)
Multi-band speech recognition is an exploratory paradigm in which each frequency region is treated as a distinct source of information and the streams are combined after each is processed independently. A number of researchers have hypothesized that it is advantageous to combine the sub-frequency information in an asynchronous manner. This paper examines this hypothesis, using two different approaches to relaxing synchrony constraints: HMM decomposition/recombination [19] and two-level dynamic programming (DP) [16]. Drawing on this work and that of others [2, 18], we conclude that relaxing the synchrony constraints indiscriminately for all phone-to-phone transitions does not consistently and significantly reduce the word error rate. The optimal permissible asynchrony must depend on both the phone-class transitions and the training-data statistics.
Transformation Streams and the HMM Error Model
Computer Speech and Language, 2001
Cited by 1 (1 self)
The most popular model used in automatic speech recognition is the hidden Markov model (HMM). Though good performance has been obtained with such models, there are well-known limitations in their ability to model speech. A variety of modifications to the standard HMM topology have been proposed to handle these problems. One such scheme is the factorial HMM. This paper introduces a new form of factorial HMM which makes use of transformation streams. This new scheme is a generalisation of the standard factorial HMM and other related schemes in speech processing. A particular form of this model, the HMM error model (HEM), is described in detail. The HEM is evaluated on two standard large vocabulary speaker independent speech recognition tasks. On both tasks significant reductions in word error rate are obtained over standard HMM-based systems.
PUBLISHED AS
2003
Cited by 1 (0 self)
State-of-the-art automatic speech recognition (ASR) techniques are typically based on hidden Markov models (HMMs) for the modeling of temporal sequences of feature vectors extracted from the speech

