Results 1 - 10
of
30
Speech Recognition with Dynamic Bayesian Networks
, 1998
"... Dynamic Bayesian networks (DBNs) are a useful tool for representing complex stochastic processes. Recent developments in inference and learning in DBNs allow their use in real-world applications. In this paper, we apply DBNs to the problem of speech recognition. The factored state representation ena ..."
Abstract
-
Cited by 97 (8 self)
- Add to MetaCart
Dynamic Bayesian networks (DBNs) are a useful tool for representing complex stochastic processes. Recent developments in inference and learning in DBNs allow their use in real-world applications. In this paper, we apply DBNs to the problem of speech recognition. The factored state representation enabled by DBNs allows us to explicitly represent long-term articulatory and acoustic context in addition to the phonetic-state information maintained by hidden Markov models (HMMs). Furthermore, it enables us to model the short-term correlations among multiple observation streams within single time-frames. Given a DBN structure capable of representing these long- and short-term correlations, we applied the EM algorithm to learn models with up to 500,000 parameters. The use of structured DBN models decreased the error rate by 12 to 29% on a large-vocabulary isolated-word recognition task, compared to a discrete HMM; it also improved significantly on other published results for the same task. Th...
Hidden-Articulator Markov Models For Speech Recognition
- In Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing
, 2000
"... In traditional speech recognition using Hidden Markov Models (HMMs), each state represents an acoustic portion of a phoneme. We explore the concept of an articulator based HMM, where each state represents a particular articulatory configuration [Erler 1996]. In this paper, we present a novel articul ..."
Abstract
-
Cited by 70 (16 self)
- Add to MetaCart
In traditional speech recognition using Hidden Markov Models (HMMs), each state represents an acoustic portion of a phoneme. We explore the concept of an articulator based HMM, where each state represents a particular articulatory configuration [Erler 1996]. In this paper, we present a novel articulatory feature mapping and a new technique for model initialization. In addition, we use diphone modeling which allows context dependent training of transition probabilities. Our goal is to confirm that articulatory knowledge can assist speech recognition. We demonstrate this by showing that our mapping of articulatory configurations to phonemes performs better than random mappings. Furthermore, we demonstrate the practicality of the model by showing that, in combination with a standard model, a 12-21% relative word error rate decrease occurs relative to the standard model alone. 1. INTRODUCTION Hidden Markov Models (HMMs) are a popular approach for speech recognition. Commonly, a left-to-r...
Dynamic Bayesian Multinets
, 2000
"... In this work, dynamic Bayesian multinets are introduced where a Markov chain state at time t determines conditional independence patterns between random variables lying within a local time window surrounding t. It is shown how information-theoretic criterion functions can be used to induce spa ..."
Abstract
-
Cited by 54 (14 self)
- Add to MetaCart
In this work, dynamic Bayesian multinets are introduced where a Markov chain state at time t determines conditional independence patterns between random variables lying within a local time window surrounding t. It is shown how information-theoretic criterion functions can be used to induce sparse, discriminative, and classconditional network structures that yield an optimal approximation to the class posterior probability, and therefore are useful for the classification task. Using a new structure learning heuristic, the resulting models are tested on a medium-vocabulary isolated-word speech recognition task. It is demonstrated that these discriminatively structured dynamic Bayesian multinets, when trained in a maximum likelihood setting using EM, can outperform both HMMs and other dynamic Bayesian networks with a similar number of parameters. 1 Introduction While Markov chains are sometimes a useful model for sequences, such simple independence assumptions can lead...
Factored sparse inverse covariance matrices
- In Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing
, 2000
"... Most HMM-based speech recognition systems use Gaussian mixtures as observation probability density functions. An important goal in all such systems is to improve parsimony. One method is to adjust the type of covariance matrices used. In this work, factored sparse inverse covariance matrices are int ..."
Abstract
-
Cited by 33 (8 self)
- Add to MetaCart
Most HMM-based speech recognition systems use Gaussian mixtures as observation probability density functions. An important goal in all such systems is to improve parsimony. One method is to adjust the type of covariance matrices used. In this work, factored sparse inverse covariance matrices are introduced. Based on Í �Í factorization, the inverse covariance matrix can be represented using linear regressive coefficients which 1) correspond to sparse patterns in the inverse covariance matrix (and therefore represent conditional independence properties of the Gaussian), and 2), result in a method of partial tying of the covariance matrices without requiring non-linear EM update equations. Results show that the performance of full-covariance Gaussians can be matched by factored sparse inverse covariance Gaussians having significantly fewer parameters. 1.
Hybrid HMM/ANN Systems for Training Independent Tasks: Experiments on Phonebook and Related Improvements
, 1997
"... In this paper, we evaluate multi-Gaussian HMM systems and hybrid HMM/ANN systems in the framework of task independent training for small size (75 words) and medium size (600 words) vocabularies. To do this, we use the Phonebook database [6] which is particularly well suited to this kind of experimen ..."
Abstract
-
Cited by 22 (4 self)
- Add to MetaCart
In this paper, we evaluate multi-Gaussian HMM systems and hybrid HMM/ANN systems in the framework of task independent training for small size (75 words) and medium size (600 words) vocabularies. To do this, we use the Phonebook database [6] which is particularly well suited to this kind of experiments since (1) it is a very large telephone database and (2) the size and content of the test vocabulary is very flexible. For each system, different HMM topologies are compared to test the influence of state tying (with a number of parameters approximately kept constant) on the recognition performance. Two lexica (Phonebook and CMU) are also compared and it is shown that the CMU lexicon is leading to significantly better performance. Finally, it is shown that with a quite simple system and a few adaptations to the basic HMM/ANN scheme, recognition performance of 98.5% and 94.7% can easily be achieved, respectively on a lexicon of 75 and 600 words (isolated words, telephone speech and lexicon ...
Hidden-Articulator Markov Models: Performance Improvements And Robustness To Noise
- in Proc. ICSLP
, 2000
"... A Hidden-Articulator Markov Model (HAMM) is a Hidden Markov Model (HMM) in which each state represents an articulatory configuration. Articulatory knowledge, known to be useful for speech recognition [4], is represented by specifying a mapping of phonemes to articulatory configurations; vocal tract ..."
Abstract
-
Cited by 20 (3 self)
- Add to MetaCart
A Hidden-Articulator Markov Model (HAMM) is a Hidden Markov Model (HMM) in which each state represents an articulatory configuration. Articulatory knowledge, known to be useful for speech recognition [4], is represented by specifying a mapping of phonemes to articulatory configurations; vocal tract dynamics are represented via transitions between articulatory configurations. In previous work [13], we extended the articulatory-feature model introduced by Erler [7] by using diphone units and a new technique for model initialization. By comparing it with a purely random model, we showed that the HAMM can take advantage of articulatory knowledge. In this paper, we extend that work in three ways. First, we decrease the number of parameters, making it comparable in size to standard HMMs. Second, we evaluate our model in noisy contexts, verifying that articulatory knowledge can provide benefits in adverse acoustic conditions. Third, we use a corpus of sideby -side speech and articulator tra...
Hidden Markov Models and Neural Networks for Speech Recognition
, 1998
"... The Hidden Markov Model (HMMs) is one of the most successful modeling approaches for acoustic events in speech recognition, and more recently it has proven useful for several problems in biological sequence analysis. Although the HMM is good at capturing the temporal nature of processes such as spee ..."
Abstract
-
Cited by 19 (1 self)
- Add to MetaCart
The Hidden Markov Model (HMMs) is one of the most successful modeling approaches for acoustic events in speech recognition, and more recently it has proven useful for several problems in biological sequence analysis. Although the HMM is good at capturing the temporal nature of processes such as speech, it has a very limited capacity for recognizing complex patterns involving more than first order dependencies in the observed data sequences. This is due to the first order state process and the assumption of state conditional independence between observations. Artificial Neural Networks (NNs) are almost the opposite: they cannot model dynamic, temporally extended phenomena very well, but are good at static classification and regression tasks. Combining the two frameworks in a sensible way can therefore lead to a more powerful model with better classification abilities. The overall aim of this work has been to develop a probabilistic hybrid of hidden Markov models and neural networks and ...
Modeling Auxiliary Information in Bayesian Network Based ASR
- In 7th European Conference on Speech Communication and Technology
, 2001
"... Automatic speech recognition bases its models on the acoustic features derived from the speech signal. Some have investigated replacing or supplementing these features with information that can not be precisely measured (articulator positions, pitch, gender, etc.) automatically. Consequently, automa ..."
Abstract
-
Cited by 9 (3 self)
- Add to MetaCart
Automatic speech recognition bases its models on the acoustic features derived from the speech signal. Some have investigated replacing or supplementing these features with information that can not be precisely measured (articulator positions, pitch, gender, etc.) automatically. Consequently, automatic estimations of the desired information would be generated. This data can degrade performance due to its imprecisions. In this paper, we describe a system that treats pitch as an auxiliary information within the framework of Bayesian networks, resulting in improved performance. 1.
Data-Driven Vector Clustering for Low-Memory Footprint ASR
- in ICSLP
, 2002
"... It is important to produce automatic speech recognition (ASR) systems that use as few computational and memory resources as possible, especially in low-memory/low-power environments such as for personal digital assistants. One way to achieve this is through parameter quantization. In this work, we c ..."
Abstract
-
Cited by 7 (1 self)
- Add to MetaCart
It is important to produce automatic speech recognition (ASR) systems that use as few computational and memory resources as possible, especially in low-memory/low-power environments such as for personal digital assistants. One way to achieve this is through parameter quantization. In this work, we compare a variety of novel subvector clustering procedures for ASR system parameter quantization. Specifically, we look at systematic data-driven subvector selection techniques based on entropy minimization, and compare performance on a 150-word isolated word speech recognition task. While the optimal entropy-minimizing quantization methods are intractable, we show that although several of our heuristic techniques are elaborate in their attempt to approximate the optimal clustering, a simple scalar quantization scheme using separate codebooks performs remarkably well.
Modeling Linguistic Features in Speech Recognition
, 2003
"... This paper explores a new approach to speech recognition in which su%R ord u9 ts are modeled in terms of lingu stic featu res. Specifically, we have adopted a scheme of modeling separately themanne and articu lation for theseu nits. A novelty of ou r work is theu se of a generalized defin ..."
Abstract
-
Cited by 7 (2 self)
- Add to MetaCart
This paper explores a new approach to speech recognition in which su%R ord u9 ts are modeled in terms of lingu stic featu res. Specifically, we have adopted a scheme of modeling separately themanne and articu lation for theseu nits. A novelty of ou r work is theu se of a generalized definition of place of artic u ation that enables u to map both vowels and consonants into a common lingu stic space. Modeling manner and place separately also allowsu s to explore a mu lti-stage recognition architectu]R in which the search space issuF]((% vely redu4% as more detailed models arebrou]( in. In the 8,000 word PhoneBook isolated word telephone speech recognition task, we show that su ch an approach can achieve a recognition WER that is 10% better than that achieved in the best resu ts reported in theliteratu re. This performance gain comes with improvements in search space andcompu ation time as well.

