Results 1-10 of 62
Speech Recognition with Dynamic Bayesian Networks
1998
Cited by 123 (9 self)
Dynamic Bayesian networks (DBNs) are a useful tool for representing complex stochastic processes. Recent developments in inference and learning in DBNs allow their use in real-world applications. In this paper, we apply DBNs to the problem of speech recognition. The factored state representation enabled by DBNs allows us to explicitly represent long-term articulatory and acoustic context in addition to the phonetic-state information maintained by hidden Markov models (HMMs). Furthermore, it enables us to model the short-term correlations among multiple observation streams within single timeframes. Given a DBN structure capable of representing these long- and short-term correlations, we applied the EM algorithm to learn models with up to 500,000 parameters. The use of structured DBN models decreased the error rate by 12 to 29% on a large-vocabulary isolated-word recognition task, compared to a discrete HMM; it also improved significantly on other published results for the same task. Th...
Hidden-Articulator Markov Models For Speech Recognition
In Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, 2000
Cited by 98 (20 self)
In traditional speech recognition using Hidden Markov Models (HMMs), each state represents an acoustic portion of a phoneme. We explore the concept of an articulator-based HMM, where each state represents a particular articulatory configuration [Erler 1996]. In this paper, we present a novel articulatory feature mapping and a new technique for model initialization. In addition, we use diphone modeling, which allows context-dependent training of transition probabilities. Our goal is to confirm that articulatory knowledge can assist speech recognition. We demonstrate this by showing that our mapping of articulatory configurations to phonemes performs better than random mappings. Furthermore, we demonstrate the practicality of the model by showing that, in combination with a standard model, a 12-21% relative word error rate decrease occurs relative to the standard model alone. 1. INTRODUCTION Hidden Markov Models (HMMs) are a popular approach for speech recognition. Commonly, a left-to-r...
Dynamic Bayesian Multinets
2000
Cited by 65 (19 self)
In this work, dynamic Bayesian multinets are introduced where a Markov chain state at time t determines conditional independence patterns between random variables lying within a local time window surrounding t. It is shown how information-theoretic criterion functions can be used to induce sparse, discriminative, and class-conditional network structures that yield an optimal approximation to the class posterior probability, and therefore are useful for the classification task. Using a new structure learning heuristic, the resulting models are tested on a medium-vocabulary isolated-word speech recognition task. It is demonstrated that these discriminatively structured dynamic Bayesian multinets, when trained in a maximum likelihood setting using EM, can outperform both HMMs and other dynamic Bayesian networks with a similar number of parameters. 1 Introduction While Markov chains are sometimes a useful model for sequences, such simple independence assumptions can lead...
Factored sparse inverse covariance matrices
In Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, 2000
Cited by 44 (10 self)
Most HMM-based speech recognition systems use Gaussian mixtures as observation probability density functions. An important goal in all such systems is to improve parsimony. One method is to adjust the type of covariance matrices used. In this work, factored sparse inverse covariance matrices are introduced. Based on U'DU factorization, the inverse covariance matrix can be represented using linear regressive coefficients which 1) correspond to sparse patterns in the inverse covariance matrix (and therefore represent conditional independence properties of the Gaussian), and 2) result in a method of partial tying of the covariance matrices without requiring nonlinear EM update equations. Results show that the performance of full-covariance Gaussians can be matched by factored sparse inverse covariance Gaussians having significantly fewer parameters.
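The link between a factored inverse covariance and linear regression coefficients can be illustrated with a small numerical sketch. It assumes the factorization has the form K = U^T D U, with U unit upper triangular and D diagonal, and a zero-mean Gaussian; the helper name `udu_factorize` and the toy covariance are hypothetical, and this is not the training procedure from the paper.

```python
import numpy as np

def udu_factorize(precision):
    """Factor a symmetric positive-definite matrix K as U.T @ diag(d) @ U,
    with U unit upper triangular. (Hypothetical helper, for illustration.)"""
    # Cholesky gives K = L @ L.T with L lower triangular.
    L = np.linalg.cholesky(precision)
    scale = np.diag(L)                 # equals sqrt(d)
    d = scale ** 2
    # Dividing each column of L by its diagonal entry yields U.T.
    U = (L / scale[np.newaxis, :]).T
    return U, d

# Toy 3-dimensional covariance, made up for this example.
cov = np.array([[2.0, 0.6, 0.2],
                [0.6, 1.5, 0.4],
                [0.2, 0.4, 1.0]])
K = np.linalg.inv(cov)                 # inverse covariance (precision) matrix
U, d = udu_factorize(K)

# Under the zero-mean assumption, U @ x = e with independent residuals e,
# so x_i = sum_{j > i} B[i, j] * x_j + e_i, where Var(e_i) = 1 / d[i].
B = np.eye(3) - U
print(B)
print(1.0 / d)
```

Under these assumptions, forcing an entry of `B` to zero removes one regression dependency, which is one way to read the sparsity the abstract describes.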
Hybrid HMM/ANN Systems for Training Independent Tasks: Experiments on Phonebook and Related Improvements
1997
Cited by 28 (7 self)
In this paper, we evaluate multi-Gaussian HMM systems and hybrid HMM/ANN systems in the framework of task-independent training for small (75 words) and medium (600 words) vocabularies. To do this, we use the Phonebook database [6], which is particularly well suited to this kind of experiment since (1) it is a very large telephone database and (2) the size and content of the test vocabulary are very flexible. For each system, different HMM topologies are compared to test the influence of state tying (with the number of parameters kept approximately constant) on the recognition performance. Two lexica (Phonebook and CMU) are also compared, and it is shown that the CMU lexicon leads to significantly better performance. Finally, it is shown that with a quite simple system and a few adaptations to the basic HMM/ANN scheme, recognition performance of 98.5% and 94.7% can easily be achieved on lexicons of 75 and 600 words, respectively (isolated words, telephone speech and lexicon ...
Hidden-Articulator Markov Models: Performance Improvements And Robustness To Noise
In Proc. ICSLP, 2000
Cited by 24 (5 self)
A Hidden-Articulator Markov Model (HAMM) is a Hidden Markov Model (HMM) in which each state represents an articulatory configuration. Articulatory knowledge, known to be useful for speech recognition [4], is represented by specifying a mapping of phonemes to articulatory configurations; vocal tract dynamics are represented via transitions between articulatory configurations. In previous work [13], we extended the articulatory-feature model introduced by Erler [7] by using diphone units and a new technique for model initialization. By comparing it with a purely random model, we showed that the HAMM can take advantage of articulatory knowledge. In this paper, we extend that work in three ways. First, we decrease the number of parameters, making it comparable in size to standard HMMs. Second, we evaluate our model in noisy contexts, verifying that articulatory knowledge can provide benefits in adverse acoustic conditions. Third, we use a corpus of side-by-side speech and articulator tra...
Hidden Markov Models and Neural Networks for Speech Recognition
1998
Cited by 21 (2 self)
The Hidden Markov Model (HMM) is one of the most successful modeling approaches for acoustic events in speech recognition, and more recently it has proven useful for several problems in biological sequence analysis. Although the HMM is good at capturing the temporal nature of processes such as speech, it has a very limited capacity for recognizing complex patterns involving more than first-order dependencies in the observed data sequences. This is due to the first-order state process and the assumption of state-conditional independence between observations. Artificial Neural Networks (NNs) are almost the opposite: they cannot model dynamic, temporally extended phenomena very well, but are good at static classification and regression tasks. Combining the two frameworks in a sensible way can therefore lead to a more powerful model with better classification abilities. The overall aim of this work has been to develop a probabilistic hybrid of hidden Markov models and neural networks and ...
Posterior Features Applied to Speech Recognition Tasks with User-Defined Vocabulary
Proceedings of ICASSP, 2009
Cited by 11 (4 self)
This paper presents a novel approach for those applications where vocabulary is defined by a set of acoustic samples. In this approach, the acoustic samples are used as reference templates in a template matching framework. The features used to describe the reference templates and the test utterances are estimates of phoneme posterior probabilities. These posteriors are obtained from an MLP trained on an auxiliary database. Thus, the speech variability present in the features is reduced by applying the speech knowledge captured by the MLP on the auxiliary database. Moreover, information-theoretic dissimilarity measures can be used as local distances between features. When compared to state-of-the-art systems, this approach outperforms acoustic-based techniques and obtains comparable results to orthography-based methods. The proposed method can also be directly combined with other posterior-based HMM systems. This combination successfully exploits the complementarity between templates and parametric models. Index Terms: Speech recognition, template matching, posterior features, Kullback-Leibler divergence
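A minimal sketch of how a Kullback-Leibler divergence could serve as the local distance between posterior feature vectors in template matching. Several choices here are assumptions rather than details from the abstract: dynamic time warping (DTW) as the matching procedure, the direction KL(template || test), the function names `kl_divergence` and `dtw_cost`, and the toy three-class posteriors.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-10):
    """KL(p || q) between two discrete distributions (e.g. phoneme posteriors)."""
    # Clip to avoid log(0), then renormalize so both vectors sum to one.
    p = np.clip(np.asarray(p, dtype=float), eps, None)
    q = np.clip(np.asarray(q, dtype=float), eps, None)
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

def dtw_cost(template, test):
    """Accumulated DTW cost between two posterior sequences,
    using KL divergence as the frame-level (local) distance."""
    n, m = len(template), len(test)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = kl_divergence(template[i - 1], test[j - 1])
            # Standard step pattern: vertical, horizontal, or diagonal move.
            acc[i, j] = local + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return float(acc[n, m])

# Toy posterior sequences over three hypothetical phoneme classes.
template = np.array([[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]])
same_word = np.array([[0.7, 0.2, 0.1], [0.7, 0.2, 0.1],
                      [0.2, 0.7, 0.1], [0.1, 0.2, 0.7]])
other_word = template[::-1]

print(dtw_cost(template, same_word) < dtw_cost(template, other_word))
```

In this toy setup the test sequence that follows the same class order as the template accumulates a lower cost than the reversed one, which is the behavior a template-based recognizer would rely on when picking the closest reference.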
Modeling Auxiliary Information in Bayesian Network Based ASR
In 7th European Conference on Speech Communication and Technology, 2001
Cited by 11 (3 self)
Automatic speech recognition bases its models on the acoustic features derived from the speech signal. Some have investigated replacing or supplementing these features with information that cannot be precisely measured automatically (articulator positions, pitch, gender, etc.). Consequently, automatic estimates of the desired information would be generated. This data can degrade performance due to its imprecision. In this paper, we describe a system that treats pitch as auxiliary information within the framework of Bayesian networks, resulting in improved performance.
EWAVES: an efficient decoding algorithm for lexical tree based speech recognition
In Proc. of ICSLP
Cited by 9 (6 self)
We present an optimized implementation of the Viterbi algorithm suitable for small to large vocabulary, and isolated or continuous speech recognition. The Viterbi algorithm is certainly the most popular dynamic programming algorithm used in speech recognition. In this paper we propose a new algorithm that outperforms the Viterbi algorithm in terms of complexity and memory requirements. It is based on the assumption of strictly left-to-right models and explores the lexical tree in an optimal way, such that bookkeeping computation is minimized. The tree is encoded such that children of a node are placed contiguously and in increasing order of memory heap so that the proposed algorithm also optimizes cache usage. Even though the algorithm is asymptotically two times faster than the conventional Viterbi algorithm, in our experiments