B. K. Mak, "Subspace distribution clustering hidden Markov model,"
 IEEE Trans. on Speech and Audio Processing
, 2001
Cited by 28 (1 self)
Most contemporary laboratory recognizers require too much memory to run, and are too slow for mass applications. One major cause of the problem is the large parameter space of their acoustic models. In this paper, we propose a new acoustic modeling methodology which we call subspace distribution clustering hidden Markov modeling (SDCHMM) with the aim of achieving much more compact acoustic models. The theory of SDCHMM is based on tying the parameters of a new unit, namely the subspace distribution, of continuous density hidden Markov models (CDHMMs). SDCHMMs can be converted from CDHMMs by projecting the distributions of the CDHMMs onto orthogonal subspaces, and then tying similar subspace distributions over all states and all acoustic models in each subspace. By exploiting the combinatorial effect of subspace distribution encoding, all original full-space distributions can be represented by combinations of a small number of subspace distribution prototypes. Consequently, there is a great reduction in the number of model parameters, and thus substantial savings in memory and computation. This renders SDCHMM very attractive in the practical implementation of acoustic models. Evaluation on the Airline Travel Information System (ATIS) task shows that in comparison to its parent CDHMM system, a converted SDCHMM system achieves a seven- to 18-fold reduction in memory requirement for acoustic models, and runs 30%–60% faster without any loss of recognition accuracy. Index Terms: distribution clustering, hidden Markov modeling, subspace distribution.
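The conversion described in this abstract (split each full-space Gaussian into subspace slices, then tie similar subspace Gaussians to a small set of prototypes) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: diagonal covariances, contiguous equal-width streams, plain k-means with Euclidean distance on the concatenated mean and variance slices, and the function names. It is not the clustering criterion or procedure of the paper itself.

```python
import math
import random

def split_streams(vec, n_streams):
    """Split a full-space vector into contiguous, equal-width subspace slices."""
    width = len(vec) // n_streams
    return [tuple(vec[i * width:(i + 1) * width]) for i in range(n_streams)]

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means on tuples; returns (centroids, cluster index per point)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        for i, p in enumerate(points):
            assign[i] = min(range(k), key=lambda c: math.dist(p, centroids[c]))
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, assign

def tie_subspaces(gaussians, n_streams, n_prototypes):
    """gaussians: list of (mean, var) pairs with diagonal covariance.
    Returns one codebook per stream and, per Gaussian, a tuple of prototype indices."""
    codebooks, codes = [], [[] for _ in gaussians]
    for s in range(n_streams):
        # A subspace Gaussian is described by its mean slice plus its variance slice.
        pts = [split_streams(m, n_streams)[s] + split_streams(v, n_streams)[s]
               for m, v in gaussians]
        centroids, assign = kmeans(pts, n_prototypes, seed=s)
        codebooks.append(centroids)
        for g, a in enumerate(assign):
            codes[g].append(a)
    return codebooks, [tuple(c) for c in codes]

# Toy data (hypothetical): 200 Gaussians in 12 dimensions, 4 streams, 8 prototypes each.
rng = random.Random(1)
dim, n_streams, n_prototypes = 12, 4, 8
gaussians = [([rng.gauss(0, 1) for _ in range(dim)],
              [rng.uniform(0.5, 2.0) for _ in range(dim)]) for _ in range(200)]
codebooks, codes = tie_subspaces(gaussians, n_streams, n_prototypes)
full_params = len(gaussians) * 2 * dim                        # means + variances
tied_params = sum(len(cb) * len(cb[0]) for cb in codebooks)   # prototype parameters only
print(full_params, tied_params, len(codes))
```

In this toy setting the 200 original Gaussians hold 4800 mean and variance values, while the tied model stores 192 prototype values plus one 4-index code per Gaussian, which is the combinatorial saving the abstract refers to.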
Speech Recognition Using Augmented Conditional Random Fields
Cited by 23 (0 self)
Acoustic modeling based on hidden Markov models (HMMs) is employed by state-of-the-art stochastic speech recognition systems. Although HMMs are a natural choice to warp the time axis and model the temporal phenomena in the speech signal, their conditional independence properties limit their ability to model spectral phenomena well. In this paper, a new acoustic modeling paradigm based on augmented conditional random fields (ACRFs) is investigated and developed. This paradigm addresses some limitations of HMMs while maintaining many of the aspects which have made them successful. In particular, the acoustic modeling problem is reformulated in a data-driven, sparse, augmented space to increase discrimination. Acoustic context modeling is explicitly integrated to handle the sequential phenomena of the speech signal. We present an efficient framework for estimating these models that ensures scalability and generality. In the TIMIT
Generalised linear Gaussian models
, 2001
Cited by 20 (7 self)
This paper addresses the time-series modelling of high-dimensional data. Currently, the hidden Markov model (HMM) is the most popular and successful model, especially in speech recognition. However, there are well-known shortcomings in HMMs, particularly in the modelling of the correlation between successive observation vectors; that is, inter-frame correlation. Standard diagonal covariance matrix HMMs also lack the modelling of the spatial correlation in the feature vectors; that is, intra-frame correlation. Several other time-series models have been proposed recently, especially in the segment model framework, to address the inter-frame correlation problem, such as Gauss-Markov and dynamical system segment models. The lack of intra-frame correlation has been compensated for with transform schemes such as semi-tied full covariance matrices (STC). All these models can be regarded as belonging to the broad class of generalised linear Gaussian models. Linear Gaussian models (LGMs) are popular as many forms may be trained efficiently using the expectation maximisation algorithm. In this paper, several LGMs and generalised LGMs are reviewed. The models can be roughly categorised into four combinations according to two different state evolution processes and two different observation processes. The state evolution process can be based on a discrete finite state machine, as in HMMs, or a linear first-order Gauss-Markov process, as in traditional linear dynamical systems. The observation process can be represented as a factor analysis model or a linear discriminant analysis model. General HMMs and schemes proposed to improve their performance, such as STC, can be regarded as special cases in this framework.
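For orientation, the standard textbook form of a linear Gaussian model with a first-order Gauss-Markov state evolution and a linear observation process can be written as below. The symbols (x_t for the continuous state, o_t for the observation, A and C for the transition and observation matrices, w_t and v_t for the noise terms) are assumed notation for this sketch and are not taken from the abstract.

```latex
\begin{align}
  x_{t+1} &= A\,x_t + w_t, & w_t &\sim \mathcal{N}\bigl(\mu^{(w)}, \Sigma^{(w)}\bigr) \\
  o_t     &= C\,x_t + v_t, & v_t &\sim \mathcal{N}\bigl(\mu^{(v)}, \Sigma^{(v)}\bigr)
\end{align}
% Static factor analysis as a special case: A = 0 and w_t ~ N(0, I),
% with a diagonal observation noise covariance, which gives
\begin{equation}
  o_t \sim \mathcal{N}\bigl(\mu^{(v)},\; C C^{\top} + \Sigma^{(v)}\bigr)
\end{equation}
```

Replacing the continuous state evolution with a discrete finite state machine that selects the observation parameters gives the HMM-style members of the family mentioned in the abstract.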
Linear Gaussian models for speech recognition
 CAMBRIDGE UNIVERSITY
, 2004
Cited by 16 (0 self)
Currently the most popular acoustic model for speech recognition is the hidden Markov model (HMM). However, HMMs are based on a series of assumptions some of which are known to be poor. In particular, the assumption that successive speech frames are conditionally independent given the discrete state that generated them is not a good assumption for speech recognition. State space models may be used to address some shortcomings of this assumption. State space models are based on a continuous state vector evolving through time according to a state evo
Fast Likelihood Computation Methods For Continuous Mixture Densities In Large Vocabulary Speech Recognition
 In Proc. of the European Conf. on Speech Communication and Technology
, 1997
Cited by 14 (10 self)
This paper studies algorithms for reducing the computational effort of the mixture density calculations in HMM-based speech recognition systems. These likelihood calculations take about 70–85% of the total recognition time in the RWTH system for large vocabulary continuous speech recognition. To reduce the computational cost of the likelihood calculations, we investigate several space partitioning methods. A detailed comparison of these techniques is given on the North American Business Corpus (NAB'94) for a 20,000-word task. As a result, the so-called projection search algorithm in combination with the VQ method reduces the cost of likelihood computation by a factor of about 8 with no significant loss in the word recognition accuracy.
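The general idea behind VQ-based space partitioning can be sketched as follows: the feature space is covered by a small codebook, each codeword carries a shortlist of nearby Gaussians, and at run time only the shortlist of the nearest codeword is scored. This is a generic illustration under assumed choices (diagonal covariances, a distance-radius shortlist rule, hypothetical function names); it does not reproduce the projection search algorithm or the RWTH implementation.

```python
import math

def log_gauss(x, mean, var):
    """Log density of a diagonal-covariance Gaussian."""
    return -0.5 * sum(math.log(2 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))

def build_shortlists(codewords, gaussians, radius):
    """For each VQ codeword, keep the Gaussians whose means lie within `radius`."""
    return [[g for g, (mean, _) in enumerate(gaussians)
             if math.dist(cw, mean) <= radius] for cw in codewords]

def best_log_likelihood(x, codewords, shortlists, gaussians, floor=-1e10):
    """Quantize x to its nearest codeword, then score only the shortlisted Gaussians.
    Returns (best log likelihood, number of Gaussians actually evaluated)."""
    cell = min(range(len(codewords)), key=lambda c: math.dist(x, codewords[c]))
    scores = [log_gauss(x, *gaussians[g]) for g in shortlists[cell]]
    return (max(scores) if scores else floor), len(scores)

# Toy model (hypothetical): two clusters of two Gaussians each, unit variances.
gaussians = [((0.0, 0.0), (1.0, 1.0)), ((0.5, 0.0), (1.0, 1.0)),
             ((10.0, 10.0), (1.0, 1.0)), ((10.5, 10.0), (1.0, 1.0))]
codewords = [(0.0, 0.0), (10.0, 10.0)]
shortlists = build_shortlists(codewords, gaussians, radius=2.0)

x = (0.2, 0.1)
fast, n_eval = best_log_likelihood(x, codewords, shortlists, gaussians)
full = max(log_gauss(x, m, v) for m, v in gaussians)  # exhaustive baseline
print(shortlists, n_eval, fast == full)
```

Here half of the Gaussians are skipped while the best score is unchanged. In a real system the shortlist rule trades speed against the risk of dropping the true best-scoring density.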
Gaussian Selection Applied to Text-Independent Speaker Verification
 In Proc. Speaker Odyssey 2001
, 2001
Cited by 9 (1 self)
Fast speaker verification systems can be realised by reducing the computation associated with searching the mixture components within a statistical model such as a Gaussian mixture model (GMM). Several improvements regarding computational efficiency have already been proposed
Feature pruning in likelihood evaluation of HMM-based speech recognition
 in Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU '03)
, 2003
Cited by 4 (0 self)
In this work, we present a novel technique to reduce the likelihood computation in ASR systems that use continuous density HMMs. The proposed method, under certain conditions, only evaluates the component likelihoods of certain features, and approximates those of the remaining features by prediction. We investigate two feature clustering approaches associated with the pruning technique. While the simple sequential clustering works remarkably well, a data-driven approach performs even better in its attempt to save computation while maintaining baseline performance. With the second approach, we can achieve a saving of 45% in the likelihood evaluation for isolated word recognition, and 50% for continuous speech recognition using either monophone or triphone models. The technique is easily incorporated into any recognizer and costs only a negligible additional overhead.
The 1999 CMU 10x real time broadcast news transcription system
 Proc. DARPA workshop on Automatic Transcription of Broadcast News
, 2000
Cited by 3 (1 self)
CMU's 10X real time system is the HMM-based SPHINX-III system with a newly developed fast decoder. The fast decoder uses a subvector clustered version of the acoustic models for Gaussian computation and a lexical tree search structure. It was developed in September 1999, and is currently a first-pass decoder, capable of generating word lattices. It was designed to optimize speed, recognition accuracy and memory requirements. For the 1999 Hub 4 evaluation task, the system used two sets of acoustic models: full-bandwidth and narrow-bandwidth. The acoustic models were 3-state HMMs with 6000 senones and 32 Gaussians per state, with no skips permitted across states. The system used a single 39-dimensional feature stream consisting of cepstra and cepstral differences. The lattices generated were rescored using a DAG algorithm. The DAG-rescored hypotheses were designated as those of the primary system. The contrastive system consisted of the output of the first-pass Viterbi search, with no DAG rescoring of lattices. A trigram language model consisting of 57,000 unigrams, 10 million bigrams and 14.9 million trigrams was used. No adaptation passes were done. In this paper we describe the various components of the primary system. The first-pass word error rate on the 1998 Hub 4 evaluation set was 20.4% with this system. The overall word error rate scored by NIST for the 1999 Hub 4 evaluation set was 27.6%.
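A rough back-of-the-envelope calculation shows why Gaussian computation dominates for a model of the size quoted above. The sketch assumes diagonal covariances (one mean and one variance per dimension) and 4-byte floats, and it ignores mixture weights and transition probabilities; none of these storage details are stated in the abstract.

```python
# Figures taken from the abstract.
senones = 6000              # tied states
gaussians_per_state = 32
feature_dim = 39

# Assumptions for this estimate only: diagonal covariance, 4-byte floats.
bytes_per_value = 4

n_gaussians = senones * gaussians_per_state
n_values = n_gaussians * 2 * feature_dim        # means + variances
size_mib = n_values * bytes_per_value / 2 ** 20

print(n_gaussians, n_values, round(size_mib, 2))
```

Under these assumptions a single acoustic model set holds 192,000 Gaussians and roughly 15 million mean and variance values, all of which a naive decoder would touch for every frame.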
Decision-Tree Based Quantization Of The Feature Space Of A Speech Recognizer
 In Proceedings of the European Conference on Speech Communication and Technology
, 1997
Cited by 3 (0 self)
We present a decision-tree based procedure to quantize the feature space of a speech recognizer, with the motivation of reducing the computation time required for evaluating Gaussians in a speech recognition system. The entire feature space is quantized into non-overlapping regions where each region is bounded by a number of hyperplanes. Further, each region is characterized by the occurrence of only a small number of the total alphabet of allophones (sub-phonetic speech units); by identifying the region in which a test feature vector lies, only the Gaussians that model the density of allophones that exist in that region need be evaluated. The quantization of the feature space is done in a hierarchical manner using a binary decision tree. Each node of the decision tree represents a region of the feature space, and is further characterized by a hyperplane (a vector v_n and a scalar threshold value h_n) that subdivides the region corresponding to the current node into two non-overlapping...
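The lookup step described here (descend a binary tree by comparing the projection of the feature vector onto v_n with the threshold h_n, then evaluate only the Gaussians attached to the leaf region) can be sketched as follows. The class layout, the side convention (left when the projection is below the threshold) and the toy tree are assumptions made for illustration; the abstract does not specify them, and tree construction is not shown.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Node:
    v: Optional[Tuple[float, ...]] = None  # hyperplane normal v_n (None at a leaf)
    h: float = 0.0                         # scalar threshold h_n
    left: Optional["Node"] = None          # assumed: region where v . x <  h
    right: Optional["Node"] = None         # assumed: region where v . x >= h
    gaussians: Tuple[int, ...] = ()        # Gaussian indices to evaluate in this leaf

def find_leaf(node: Node, x: Tuple[float, ...]) -> Node:
    """Walk from the root to the leaf region that contains x."""
    while node.v is not None:
        projection = sum(vi * xi for vi, xi in zip(node.v, x))
        node = node.left if projection < node.h else node.right
    return node

# Toy tree (hypothetical): first split on x[0] at 0, then on x[1] at 1.
tree = Node(
    v=(1.0, 0.0), h=0.0,
    left=Node(gaussians=(0, 1)),
    right=Node(
        v=(0.0, 1.0), h=1.0,
        left=Node(gaussians=(2,)),
        right=Node(gaussians=(3, 4)),
    ),
)

for x in [(-1.0, 5.0), (2.0, 0.5), (2.0, 3.0)]:
    print(x, find_leaf(tree, x).gaussians)
```

Each lookup costs one dot product per tree level, so the saving comes from replacing a scan over every Gaussian with a short descent followed by a small shortlist.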
Gaussian-selection-based non-optimal search for speaker identification
, 2005
Cited by 2 (0 self)
Most speaker identification systems train individual models for each speaker. This is done as individual models often yield better performance and they permit easier adaptation and enrollment. When classifying a speech token, the token is scored against each model and the maximum a priori decision rule is used to decide the classification label. Consequently, the cost of classification grows linearly for each token as the population size grows. When considering that the number of tokens to classify is also likely to grow linearly with the population, the total work load increases exponentially. This paper presents a preclassifier which generates an N-best hypothesis set using a novel application of Gaussian selection, and a transformation of the traditional tail test statistic which lets the implementer specify the tail region in terms of probability. The system is trained using parameters of individual speaker models and does not require the original feature vectors, even when enrolling new speakers or adapting existing ones. As the correct class label need only be in the N-best hypothesis set, it is possible to prune more Gaussians than in a traditional Gaussian selection application. The N-best hypothesis set is then evaluated using individual speaker models, resulting in an overall reduction of workload.