From HMM's to Segment Models: A Unified View of Stochastic Modeling for Speech Recognition
, 1996
Graphical models and automatic speech recognition
- Mathematical Foundations of Speech and Language Processing
, 2003
Cited by 49 (10 self)
Graphical models provide a promising paradigm to study both existing and novel techniques for automatic speech recognition. This paper first provides a brief overview of graphical models and their uses as statistical models. It is then shown that the statistical assumptions behind many pattern recognition techniques commonly used as part of a speech recognition system can be described by a graph – this includes Gaussian distributions, mixture models, decision trees, factor analysis, principal component analysis, linear discriminant analysis, and hidden Markov models. Moreover, this paper shows that many advanced models for speech recognition and language processing can also be simply described by a graph, including many at the acoustic-, pronunciation-, and language-modeling levels. A number of speech recognition techniques born directly out of the graphical-models paradigm are also surveyed. Additionally, this paper includes a novel graphical analysis regarding why derivative (or delta) features improve hidden Markov model-based speech recognition by improving structural discriminability. It also includes an example where a graph can be used to represent language model smoothing constraints. As will be seen, the space of models describable by a graph is quite large. A thorough exploration of this space should yield techniques that ultimately will supersede the hidden Markov model.
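The derivative (delta) features mentioned in the abstract above are commonly computed with a regression over neighbouring frames. The sketch below illustrates that widely used formula only; it is not the paper's graphical analysis. The function name `delta_features`, the `window` parameter, and the choice to clamp indices at the utterance edges are assumptions made for this example.

```python
def delta_features(frames, window=2):
    """Return regression-based delta features for a list of feature vectors.

    frames: list of per-frame feature vectors (lists of floats).
    window: number of frames on each side used in the regression (assumed).
    """
    num_frames = len(frames)
    # Normalizer of the standard regression formula: 2 * sum(n^2).
    denom = 2 * sum(n * n for n in range(1, window + 1))
    deltas = []
    for t in range(num_frames):
        delta = [0.0] * len(frames[t])
        for n in range(1, window + 1):
            # Clamp at the edges by repeating the first/last frame (assumption).
            ahead = frames[min(t + n, num_frames - 1)]
            behind = frames[max(t - n, 0)]
            for d in range(len(delta)):
                delta[d] += n * (ahead[d] - behind[d])
        deltas.append([v / denom for v in delta])
    return deltas
```

For a feature that rises by one unit per frame, the delta at an interior frame is 1.0; for a constant feature it is 0.0.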
Language Modeling With Sentence-Level Mixtures
, 1994
Cited by 23 (1 self)
Language models play an important role in improving the accuracy of a continuous speech recognizer. In this thesis, we introduce a new statistical language model which captures long-term topic dependencies of words within and across sentences. The model includes two main contributions. First, we develop a topic-dependent sentence-level mixture language model which takes advantage of the topic constraints in a sentence or a paragraph. Since this language model is not Markov and has a large search space, it is used only in the last stage of a multi-pass search strategy in the recognizer. Second, we introduce topic-dependent dynamic adaptation techniques in the framework of the mixture model. During the course of this thesis, we also investigate robust parameter estimation techniques, which are extremely important in light of the sparse data problems in language modeling. The model is implemented in the BU speech recognition system and provides a significant improvement in recognition accuracy. An important advantage of the framework of our model is that it is a simple extension of existing language modeling techniques that can easily be integrated with other language modeling advances.
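One way to read "sentence-level mixture" is that the mixing happens over whole sentences rather than per word: each topic scores the full sentence with its own n-gram model, and the topic scores are then combined with mixture weights. The sketch below assumes bigram components and a crude probability floor for unseen bigrams; the names `sentence_log_prob`, `topic_weights`, `topic_bigrams`, and `floor` are hypothetical, and the floor is only a stand-in, not the robust estimation technique the thesis investigates.

```python
import math


def sentence_log_prob(words, topic_weights, topic_bigrams, floor=1e-6):
    """Log-probability of a sentence under a sentence-level mixture of bigrams.

    topic_weights: mixture weights, one per topic (assumed to sum to 1).
    topic_bigrams: one dict per topic mapping (previous, word) -> probability.
    floor: probability used for unseen bigrams (illustrative assumption).
    """
    total = 0.0
    for weight, bigram in zip(topic_weights, topic_bigrams):
        # Score the whole sentence under this single topic.
        prob = weight
        prev = "<s>"
        for word in words:
            prob *= bigram.get((prev, word), floor)
            prev = word
        total += prob
    return math.log(total)
```

Because the sum over topics sits outside the product over words, the score of a word depends on the entire sentence, which is consistent with the abstract's remark that the model is not Markov.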
What HMMs can do
, 2002
Cited by 21 (3 self)
Since their inception over thirty years ago, hidden Markov models (HMMs) have become the predominant methodology for automatic speech recognition (ASR) systems; today, most state-of-the-art speech systems are HMM-based. There have been a number of ways to explain HMMs and to list their capabilities, each of these ways having both advantages and disadvantages. In an effort to better understand what HMMs can do, this tutorial analyzes HMMs by exploring a novel way in which an HMM can be defined, namely in terms of random variables and conditional independence assumptions. We prefer this definition as it allows us to reason more thoroughly about the capabilities of HMMs. In particular, it is possible to deduce that there are, in theory at least, no limitations to the class of probability distributions representable by HMMs. This paper concludes that, in the search for a model to supersede the HMM for ASR, rather than trying to correct for HMM limitations in the general case, new models should be found based on their potential for better parsimony, computational requirements, and noise insensitivity.
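As a brief illustration of defining an HMM through random variables and conditional independence, the standard statements can be written as follows. The notation here is an assumption made for this sketch ($q_t$ for the hidden state and $x_t$ for the observation at time $t$, with $T$ frames) and may differ from the tutorial's own symbols.

```latex
% Conditional independence assumptions of an HMM:
%   the state depends on the past only through the previous state, and
%   the observation depends on everything else only through the current state.
q_t \perp \{q_{1:t-2},\, x_{1:t-1}\} \mid q_{t-1},
\qquad
x_t \perp \{q_{\neg t},\, x_{\neg t}\} \mid q_t .

% Resulting factorization of the joint distribution:
p(x_{1:T}, q_{1:T})
  = p(q_1)\, p(x_1 \mid q_1) \prod_{t=2}^{T} p(q_t \mid q_{t-1})\, p(x_t \mid q_t).
```

Here $q_{\neg t}$ and $x_{\neg t}$ denote all states and observations other than those at time $t$.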
Probabilistic-trajectory Segmental HMMs
- Computer Speech and Language
, 1999
Cited by 21 (0 self)
“Segmental hidden Markov models” (SHMMs) are intended to overcome important speech-modelling limitations of the conventional-HMM approach by representing sequences (or segments) of features and incorporating the concept of trajectories to describe how features change over time. A novel feature of the approach presented in this paper is that extra-segmental variability between different examples of a sub-phonemic speech segment is modelled separately from intra-segmental variability within any one example. The extra-segmental component of the model is represented in terms of variability in the trajectory parameters, and these models are therefore referred to as “probabilistic-trajectory segmental HMMs” (PTSHMMs). This paper presents the theory of PTSHMMs using a linear trajectory description characterized by slope and mid-point parameters, and presents theoretical and experimental comparisons between different types of PTSHMMs, simpler SHMMs and conventional HMMs. Experiments have demonstrated that, for any given feature set, a linear PTSHMM can substantially reduce the error rate in comparison with a conventional HMM, both for a connected-digit recognition task and for a phonetic classification task. Performance benefits have been demonstrated from incorporating a linear trajectory description and additionally from modelling variability in the mid-point parameter. © 1999 British Crown Copyright/DERA
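A linear trajectory described by slope and mid-point parameters can be pictured with a small sketch for a single feature dimension. The code below generates the trajectory and recovers the two parameters from one segment by ordinary least squares; it does not implement the probabilistic treatment of trajectory variability that defines a PTSHMM. The function names and the convention of centring time on the middle of the segment are assumptions made for this example.

```python
def linear_trajectory(midpoint, slope, num_frames):
    """Expected feature value at each frame of a segment (one dimension)."""
    center = (num_frames - 1) / 2.0  # time is measured from the segment centre
    return [midpoint + slope * (t - center) for t in range(num_frames)]


def fit_linear_trajectory(observations):
    """Least-squares (midpoint, slope) for one observed segment."""
    num_frames = len(observations)
    center = (num_frames - 1) / 2.0
    # With centred time, the least-squares midpoint is simply the mean.
    midpoint = sum(observations) / num_frames
    denom = sum((t - center) ** 2 for t in range(num_frames))
    if denom == 0.0:
        return midpoint, 0.0  # a single frame carries no slope information
    slope = sum((t - center) * o for t, o in enumerate(observations)) / denom
    return midpoint, slope
```

Fitting each example of a segment separately yields one (mid-point, slope) pair per example; the spread of those pairs across examples corresponds to what the abstract calls extra-segmental variability.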
A novel learning method for hidden Markov models in speech and audio processing
- In: Proc. IEEE MMSP
, 2006
Cited by 9 (7 self)
In recent years, various discriminative learning techniques for HMMs have consistently yielded significant benefits in speech recognition. In this paper, we present a novel optimization technique using the Minimum Classification Error (MCE) criterion to optimize the HMM parameters. Unlike Maximum Mutual Information training, where an Extended Baum-Welch (EBW) algorithm exists to optimize its objective function, for MCE training the original EBW algorithm cannot be directly applied. In this work, we extend the original EBW algorithm and derive a novel method for MCE-based model parameter estimation. Compared with conventional gradient descent methods for MCE learning, the proposed method gives a solid theoretical basis and stable convergence, and it is well suited for the large-scale batch-mode training process essential in large-scale speech recognition and other pattern recognition applications. Evaluation experiments, including model training and speech recognition, are reported on both a small vocabulary task (TI-Digits) and a large vocabulary task (WSJ), where the effectiveness of the proposed method is demonstrated. We expect new future applications and success of this novel learning method in general pattern recognition and multimedia processing, in addition to the speech and audio processing applications we present in this paper.
Keywords: Speech recognition and audio processing, pattern recognition, machine learning, discriminative learning, hidden Markov model, rational-function optimization, growth transformation, extended Baum-Welch algorithm
Topic area: Multimedia processing (speech and audio)
Discriminative Mixture Weight Estimation For Large Gaussian Mixture Models
, 1999
Cited by 6 (0 self)
This paper describes a new approach to acoustic modeling for large vocabulary continuous speech recognition (LVCSR) systems. Each phone is modeled with a large Gaussian mixture model (GMM) whose context-dependent mixture weights are estimated with a sentence-level discriminative training criterion. The estimation problem is cast in a neural network framework, which enables the incorporation of the appropriate constraints on the mixture weight vectors, and allows a straightforward training procedure, based on steepest descent. Experiments conducted on the Callhome-English and Switchboard databases show a significant improvement of the acoustic model performance, and a somewhat lesser improvement with the combined acoustic and language models. 1. INTRODUCTION Many factors contribute to the relatively high error rates observed in LVCSR systems (e.g. diversity of speaking styles, pronunciation variants, variable degrees of articulation, noises, channel effects). By enlarging the set ...
Experimental results for baseline speech recognition performance using input acquired from a linear microphone array
- In Notebook of the Fifth DARPA Workshop on Speech and Natural Language, Arden
, 1992
Cited by 5 (2 self)
In this paper, baseline speech recognition performance is determined both for a single remote microphone and for a signal derived from a delay-and-sum beamformer using an eight-microphone linear array. An HMM-based, connected-speech, 38-word vocabulary (alphabet, digits, 'space', 'period'), talker-independent speech recognition system is used for testing performance. Normal performance, with no language model, i.e., raw word-level performance, is currently about 81% for a set of talkers not in the training set and about 91% for training set data. The system has been trained and tested using a close-talking head-mounted microphone. Since a meaningful comparison requires using the same speech, the existing speech database was appropriately pre-filtered, played out through a transducer (speaker) in the room environment, picked up by the microphone array, and re-stored as a digital file. The resulting file was post-processed and used as input to the recognizer; the recognition performance indicates the effect of the input device. The baseline experiment showed that both a single remote microphone and the beamformed signal reduced performance by 12% in a room with no other talkers. For the array tested, the error is generally attributable to reverberation off the floor and ceiling.
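The delay-and-sum beamformer named in the abstract can be sketched in a few lines: each microphone signal is shifted to undo its arrival lag from the target direction, and the aligned signals are averaged. The sketch below assumes integer sample delays that are already known; the function name `delay_and_sum` and its parameters are hypothetical, and nothing here reflects the actual array geometry or processing used in the paper.

```python
def delay_and_sum(channels, delays):
    """Align each channel by its arrival lag and average the result.

    channels: list of equal-length sample lists, one per microphone.
    delays: integer arrival lag (in samples) of the target at each microphone,
            relative to the earliest microphone (assumed known).
    """
    num_samples = len(channels[0])
    output = [0.0] * num_samples
    for samples, delay in zip(channels, delays):
        for t in range(num_samples):
            src = t + delay  # read ahead to undo this channel's lag
            if 0 <= src < num_samples:  # samples outside the buffer count as 0
                output[t] += samples[src]
    return [value / len(channels) for value in output]
```

Sound from the steered direction adds coherently after alignment, while sound arriving with other delays (including reflections) is averaged across misaligned copies and is therefore attenuated.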
A Maximum-entropy Solution to the Frame-dependency Problem in Speech Recognition
, 2001
Cited by 5 (0 self)
The HMM assumption of conditional independence of observations causes a variety of problems for speech-recognition applications. Previous attempts to construct acoustic models that remove this assumption have suffered from a significant increase in the number of parameters to train. Another weakness of current acoustic models is that they do not account for the origin of derived features (estimated derivatives). We show how to both remove the independence assumption and properly account for derived features, with little or no increase in the number of parameters to train, by applying the principle of maximum entropy. We also show that ignoring the origins of derived features in training HMM acoustic models can lead to severe distortions of the effective language model. Evaluation of our maxent model on a simple problem cuts an already-low error rate in half compared to an equivalent HMM with the same number of parameters.
The Stochastic Segment Model for Continuous Speech Recognition
- In Proceedings The 25th Asilomar Conference on Signals, Systems and Computers
, 1991
Cited by 5 (1 self)
A new direction in speech recognition via statistical methods is to move from frame-based models, such as Hidden Markov Models (HMMs), to segment-based models that provide a better framework for modeling the dynamics of the speech production mechanism. The Stochastic Segment Model (SSM) is a joint model for a sequence of observations, which provides explicit modeling of time correlation as well as a formalism for incorporating segmental features. In this work, the focus is on modeling time correlation within a segment. We consider three Gaussian model variations based on different assumptions about the form of statistical dependency, including a Gauss-Markov model, a dynamical system model and a target state model, all of which can be formulated in terms of the dynamical system model. Evaluation of the different modeling assumptions is in terms of both phoneme classification performance and the predictive power of linear models. 1 Introduction Most of the existing speaker-independent ...

