Speech Recognition using SVMs
Advances in Neural Information Processing Systems 14, 2002
Cited by 53 (12 self)
An important issue in applying SVMs to speech recognition is the ability to classify variable length sequences. This paper presents extensions to a standard scheme for handling this variable length data, the Fisher score. A more useful mapping is introduced based on the likelihood-ratio. The score-space defined by this mapping avoids some limitations of the Fisher score. Class-conditional generative models are directly incorporated into the definition of the score-space. The mapping, and appropriate normalisation schemes, are evaluated on a speaker-independent isolated letter task where the new mapping outperforms both the Fisher score and HMMs trained to maximise likelihood.
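The idea of mapping a variable-length sequence to a fixed-length score vector can be illustrated with a toy model. The sketch below assumes each class is a single scalar Gaussian over independent frames rather than an HMM, and it assumes the likelihood-ratio score-space stacks the log-likelihood ratio with its gradient with respect to both models' parameters; the function names, models, and data are hypothetical and not taken from the paper.

```python
import math

def log_likelihood(frames, mu, var):
    """Log-likelihood of independent scalar frames under N(mu, var)."""
    return sum(-0.5 * math.log(2 * math.pi * var) - (o - mu) ** 2 / (2 * var)
               for o in frames)

def grad_log_likelihood(frames, mu, var):
    """Gradient of the log-likelihood with respect to (mu, var)."""
    d_mu = sum((o - mu) / var for o in frames)
    d_var = sum(-0.5 / var + (o - mu) ** 2 / (2 * var ** 2) for o in frames)
    return [d_mu, d_var]

def fisher_score(frames, model):
    """Fisher score: gradient of one generative model's log-likelihood."""
    mu, var = model
    return grad_log_likelihood(frames, mu, var)

def likelihood_ratio_score(frames, model_a, model_b):
    """Assumed form: [log ratio, gradient w.r.t. model_a, gradient w.r.t. model_b]."""
    ratio = log_likelihood(frames, *model_a) - log_likelihood(frames, *model_b)
    grad_a = grad_log_likelihood(frames, *model_a)
    # The ratio subtracts model_b's log-likelihood, so its gradient is negated.
    grad_b = [-g for g in grad_log_likelihood(frames, *model_b)]
    return [ratio] + grad_a + grad_b

# Hypothetical class models (mean, variance) and two sequences of different length.
model_a = (0.0, 1.0)
model_b = (2.0, 1.0)
short_seq = [0.1, -0.2, 0.3]
long_seq = [1.9, 2.2, 2.1, 1.8, 2.0, 2.4]

for seq in (short_seq, long_seq):
    print(len(seq), fisher_score(seq, model_a), likelihood_ratio_score(seq, model_a, model_b))
```

Both sequences map to vectors of the same dimension regardless of their length, which is what lets a fixed-input classifier such as an SVM operate on them.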
Support vector machines for speech recognition
Proceedings of the International Conference on Spoken Language Processing, 1998
Cited by 47 (2 self)
Statistical techniques based on hidden Markov models (HMMs) with Gaussian emission densities have dominated signal processing and pattern recognition literature for the past 20 years. However, HMMs trained using maximum likelihood techniques suffer from an inability to learn discriminative information and are prone to overfitting and over-parameterization. Recent work in machine learning has focused on models, such as the support vector machine (SVM), that automatically control generalization and parameterization as part of the overall optimization process. In this paper, we show that SVMs provide a significant improvement in performance on a static pattern classification task based on the Deterding vowel data. We also describe an application of SVMs to large vocabulary speech recognition, and demonstrate an improvement in error rate on a continuous alphadigit task (OGI Alphadigits) and a large vocabulary conversational speech task (Switchboard). Issues related to the development and optimization of an SVM/HMM hybrid system are discussed.
Syllable-Based Large Vocabulary Continuous Speech Recognition
IEEE Transactions on Speech and Audio Processing, 2001
Cited by 22 (0 self)
Most large vocabulary continuous speech recognition (LVCSR) systems in the past decade have used a context-dependent phone as the fundamental acoustic unit. In this paper, we present one of the first robust LVCSR systems that uses a syllable-level acoustic unit for LVCSR on telephone-bandwidth speech. This effort is motivated by the inherent limitations of phone-based approaches, namely the lack of an easy and efficient way to model long-term temporal dependencies. A syllable unit spans a longer time frame, typically three phones, thereby offering a more parsimonious framework for modeling pronunciation variation in spontaneous speech. We present encouraging results which show that a syllable-based system exceeds the performance of a comparable triphone system both in terms of word error rate (WER) and complexity. The WER of the best syllable system reported here is 49.1% on a standard SWITCHBOARD evaluation, a small improvement over the triphone system. We also report results on a much smaller recognition task, OGI Alphadigits, which was used to validate some of the benefits syllables offer over triphones. The syllable-based system exceeds the performance of the triphone system by nearly 20%, an impressive accomplishment since the alphadigits application consists mostly of phone-level minimal pair distinctions.
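Word error rate, the metric quoted in this abstract, is conventionally the word-level edit distance (substitutions, deletions, insertions) between the reference transcript and the recognizer output, divided by the number of reference words. A minimal sketch, assuming whitespace-tokenized transcripts, unit edit costs, and a non-empty reference; the function name and example strings are illustrative and do not come from the paper.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)

# One substitution (b -> x) and one deletion (d) over four reference words.
print(word_error_rate("a b c d", "a x c"))
```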
Deterministically Annealed Design of Hidden Markov Model Speech Recognizers
2001
Cited by 17 (4 self)
Many conventional speech recognition systems are based on the use of hidden Markov models (HMMs) within the context of discriminant-based pattern classification. While the speech recognition objective is a low rate of misclassification, HMM design has traditionally been approached via maximum likelihood (ML) modeling which is, in general, mismatched with the minimum error objective and hence suboptimal. Direct minimization of the error rate is difficult because of the complex nature of the cost surface, and has only been addressed recently by discriminative design methods such as generalized probabilistic descent (GPD). While existing discriminative methods offer significant benefits, they commonly rely on local optimization via gradient descent, whose performance suffers from the prevalence of shallow local minima. As an alternative, we propose the deterministic annealing (DA) design method that directly minimizes the error rate while avoiding many poor local minima of the cost. DA is derived from fundamental principles of statistical physics and information theory. In DA, the HMM classifier's decision is randomized and its expected error rate is minimized subject to a constraint on the level of randomness, which is measured by the Shannon entropy. The entropy constraint is gradually relaxed, leading in the limit of zero entropy to the design of regular nonrandom HMM classifiers. An efficient forward-backward algorithm is proposed for the DA method. Experiments on synthetic data and on a simplified recognizer for isolated English letters demonstrate that the DA design method can improve recognition error rates over both ML and GPD methods.
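The randomized decision and its two quantities, expected error and Shannon entropy, can be pictured with a small sketch. It assumes the decision follows a Gibbs (softmax) distribution over per-class scores with a scale parameter gamma, which is an illustrative choice rather than the paper's stated formulation; the scores and names are hypothetical, and no HMM training or annealing optimization is performed.

```python
import math

def gibbs_decision(scores, gamma):
    """Randomized decision: P(class j) proportional to exp(gamma * score_j)."""
    peak = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(gamma * (s - peak)) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

def expected_error(probs, true_class):
    """Probability that the randomized decision picks a wrong class."""
    return 1.0 - probs[true_class]

def shannon_entropy(probs):
    """Entropy (in nats) of the decision distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Hypothetical per-class scores for one token whose true class is 0.
scores = [-10.0, -9.5, -12.0]
true_class = 0

for gamma in (0.0, 1.0, 10.0, 100.0):
    probs = gibbs_decision(scores, gamma)
    print(gamma, expected_error(probs, true_class), shannon_entropy(probs))
```

At gamma = 0 the decision is uniform and the entropy is maximal; as gamma grows the entropy falls toward zero and the rule approaches the ordinary hard decision, which here selects the wrong class because class 1 has the highest score.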
Advances In Alphadigit Recognition Using Syllables
Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 1998
Cited by 8 (5 self)
In this paper, we present a set of experiments which explore the use of syllables for recognition of continuous alphadigit utterances. In this system, syllables are used as the primary unit of recognition. This work was motivated by our need to verify and isolate phenomena seen when performing syllable-based experiments on the SWITCHBOARD corpus. The performance of our base syllable system is better than a crossword triphone system while requiring a small portion of the resources necessary for triphone systems. All experiments were performed on the OGI Alphadigits corpus, which consists of telephone-bandwidth alphadigit strings. The WER of the best syllable system (context-independent syllables) reported here is 11.1% compared to 12.2% for a crossword triphone system.
Sub-state Tying in Tied Mixture Hidden Markov Models
2000
Cited by 3 (1 self)
An approach is proposed for partial tying of states of tied-mixture hidden Markov models. To facilitate tying at the sub-state level, the state emission probabilities are constructed in two stages, or equivalently, are viewed as a "mixture of mixtures of Gaussians." This paradigm allows, and is complemented with, an optimization technique to seek the best complexity-accuracy tradeoff solution, which jointly exploits Gaussian density sharing and sub-state tying. Experimental results on the E-set show that the classification error rate is reduced by over 20% compared to standard Gaussian sharing and whole-state tying. The approach is then embedded within the recently developed combined parameter training and reduction procedure. Experiments with the overall technique show that the error rate is further reduced by 8%.
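One way to read the "mixture of mixtures of Gaussians" construction is as a two-stage sum: each state mixes a small set of sub-state densities, and each sub-state density is itself a mixture over a shared Gaussian pool. The sketch below assumes exactly that structure with scalar observations; the pools, weights, and names are hypothetical, and the paper's actual parameterization and optimization procedure are not reproduced.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a scalar Gaussian N(mean, var) at x."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Shared pool of Gaussians as (mean, variance): the tied-mixture part.
gaussian_pool = [(-1.0, 0.5), (0.0, 1.0), (2.0, 0.8)]
# Shared pool of sub-state mixtures: each row holds weights over gaussian_pool.
substate_pool = [[0.7, 0.3, 0.0], [0.0, 0.4, 0.6]]
# Each state holds weights over substate_pool; states may share sub-states.
state_weights = {"state_a": [0.9, 0.1], "state_b": [0.5, 0.5]}

def substate_density(x, k):
    """First stage: mixture over the shared Gaussians."""
    return sum(w * gaussian_pdf(x, mean, var)
               for w, (mean, var) in zip(substate_pool[k], gaussian_pool))

def emission_density(x, state):
    """Second stage: mixture over the shared sub-state densities."""
    return sum(a * substate_density(x, k)
               for k, a in enumerate(state_weights[state]))

def flattened_weights(state):
    """Equivalent single-stage weights over the Gaussian pool."""
    return [sum(a * substate_pool[k][m] for k, a in enumerate(state_weights[state]))
            for m in range(len(gaussian_pool))]

print(emission_density(0.3, "state_a"), flattened_weights("state_a"))
```

Under this reading, two states can share a sub-state row without sharing their whole emission density, which is the sense in which tying happens below the state level.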
Towards A Compact Speech Recognizer: Subspace Distribution Clustering Hidden Markov Model
1998
Cited by 2 (1 self)
1 Introduction
1.1 The Problem: Too Many Parameters
1.2 Proposed Solution: It Is Time to Share More!
1.3 Thesis Summary and Outline
2 Review of Acoustic Modeling Using Hidden Markov Model
2.1 Speech Characteristics
2.2 Selection of Input Speech Space and Speech Model
2.2.1 Cepstral Input
2.2.2 Hidden Markov Model
2.2.3 Our Choice of HMM for Acoustic Modeling
2.3 Speech Unit to Model
...
Spanish Recogniser Of Continuously Spelled Names Over The Telephone
ICSLP, 2000
Cited by 2 (2 self)
In this paper, we present an analysis of the spelling task for Spanish and describe the research and implementation of a Spanish recogniser for continuously spelled names over the telephone. We analyse and compare three different recognition architectures. The first is a two-level architecture consisting of two steps. In the first step, we obtain the most likely letter sequence using the one-pass algorithm. In the second step, to obtain the recognised name, we align the letter sequence with the dictionary names using a dynamic programming (DP) algorithm. The second alternative is an integrated architecture in which a constrained grammar is built from all the names in the dictionary. In this case, the name recognition rate is higher, but processing time increases considerably. Finally, we propose a combined architecture that offers a good compromise between recognition rate and processing time. This approach follows a hypothesis-and-verification strategy. In the hypothesis stage, we obtain the most likely letter sequence (one-pass algorithm) and then select N candidates from the dictionary with a dynamic programming algorithm. In the verification stage, we build a dynamic grammar from the N candidates and run recognition over it. With this system we obtain a 96.1% name recognition rate in real time for a 1,000-name dictionary, and 92.3% and 89.6% for 5,000- and 10,000-name dictionaries respectively. Keywords: Spelled names recognition, Spanish spelling task, Recognition over the telephone.
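The hypothesis stage described in this abstract, aligning a recognised letter sequence against dictionary names and keeping the N closest, can be sketched with a plain edit-distance alignment. This assumes unit costs for substitutions, insertions, and deletions, whereas the paper's DP costs are not given here; the names, the misrecognised input, and the function names are hypothetical, and the verification stage (re-recognition over a dynamic grammar) is not modelled.

```python
def edit_distance(a, b):
    """Dynamic-programming alignment cost between two letter sequences."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            curr.append(min(prev[j - 1] + (ch_a != ch_b),  # match or substitution
                            prev[j] + 1,                   # letter dropped
                            curr[j - 1] + 1))              # letter inserted
        prev = curr
    return prev[-1]

def n_best_names(letters, dictionary, n):
    """Return the n dictionary names closest to the recognised letters."""
    return sorted(dictionary, key=lambda name: edit_distance(letters, name))[:n]

# Hypothetical dictionary and a letter sequence with one recognition error.
dictionary = ["GARCIA", "GARZA", "LOPEZ", "MARTINEZ"]
print(n_best_names("GARSIA", dictionary, 2))
```

Restricting the second recognition pass to a short candidate list is what keeps the combined architecture cheaper than a grammar over the whole dictionary.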
Stream Derivation And Clustering Scheme For Subspace Distribution Clustering Hidden Markov Model
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 1997
Cited by 2 (2 self)
In [1], our novel subspace distribution clustering hidden Markov model (SDCHMM) made its debut as an approximation to the continuous density HMM (CDHMM). Deriving SDCHMMs from CDHMMs requires a definition of multiple streams and a Gaussian clustering scheme. Previously we have tried 4 and 13 streams, which are common but ad hoc choices. Here we present a simple and coherent definition for streams of any dimension: the streams comprise the most correlated features. The new definition is shown to give better performance in two recognition tasks. The clustering scheme in [1] is an O(n^2) algorithm which can be slow when the number of Gaussians in the original CDHMMs is large. Now we have devised a modified k-means clustering scheme using the Bhattacharyya distance as the distance measure between Gaussian clusters. Not only is the new clustering scheme faster, but when combined with the new stream definitions it also yields SDCHMMs which perform at least as well as the original CDHMMs (with bet...
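The Bhattacharyya distance used as the clustering measure has a closed form for Gaussians. The sketch below assumes diagonal covariances, so that each Gaussian is a pair of mean and variance lists, and shows only the distance and a nearest-cluster assignment step; the paper's modified k-means procedure is not reproduced, and the names and numbers are illustrative.

```python
import math

def bhattacharyya_distance(gauss_a, gauss_b):
    """Bhattacharyya distance between two diagonal-covariance Gaussians.

    Each Gaussian is a (means, variances) pair of equal-length lists.
    """
    means_a, vars_a = gauss_a
    means_b, vars_b = gauss_b
    distance = 0.0
    for m_a, v_a, m_b, v_b in zip(means_a, vars_a, means_b, vars_b):
        v_avg = 0.5 * (v_a + v_b)
        # Mean-separation term, scaled by the averaged variance.
        distance += 0.125 * (m_a - m_b) ** 2 / v_avg
        # Covariance-mismatch term; zero when the variances are equal.
        distance += 0.5 * math.log(v_avg / math.sqrt(v_a * v_b))
    return distance

def nearest_cluster(gaussian, centroids):
    """Index of the centroid Gaussian closest to the given Gaussian."""
    return min(range(len(centroids)),
               key=lambda k: bhattacharyya_distance(gaussian, centroids[k]))

# Hypothetical two-dimensional stream Gaussians.
centroids = [([0.0, 0.0], [1.0, 1.0]), ([3.0, 3.0], [1.0, 2.0])]
print(nearest_cluster(([2.5, 2.8], [1.2, 1.5]), centroids))
```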
Advances in Speech Recognition Using Sparse Bayesian Methods
IEEE Transactions on Speech and Audio Processing, 2003
Cited by 2 (2 self)
The prominent modeling technique for speech recognition today is the hidden Markov model with Gaussian emission densities. Such models, though, suffer from an inability to learn discriminative information and are prone to overfitting and overparameterization. Recent work on machine learning has moved toward models such as the support vector machine that automatically control generalization and parameterization as part of the overall optimization process. The support vector machine, however, requires ad hoc (and unreliable) methods to couple it to probabilistic speech recognition systems. In this work, we introduce the use of a probabilistic Bayesian learning machine termed the relevance vector machine as the core pattern recognition unit in a speech recognizer. The relevance vector machine system is compared to previous work using support vector machines and is found to outperform the support vector machine system in terms of both accuracy and sparsity on a continuous alphadigit task.

