Results 1–10 of 14
Speech Recognition Using Augmented Conditional Random Fields
Abstract

Cited by 22 (0 self)
Acoustic modeling based on hidden Markov models (HMMs) is employed by state-of-the-art stochastic speech recognition systems. Although HMMs are a natural choice to warp the time axis and model the temporal phenomena in the speech signal, their conditional independence properties limit their ability to model spectral phenomena well. In this paper, a new acoustic modeling paradigm based on augmented conditional random fields (ACRFs) is investigated and developed. This paradigm addresses some limitations of HMMs while maintaining many of the aspects which have made them successful. In particular, the acoustic modeling problem is reformulated in a data-driven, sparse, augmented space to increase discrimination. Acoustic context modeling is explicitly integrated to handle the sequential phenomena of the speech signal. We present an efficient framework for estimating these models that ensures scalability and generality. In the TIMIT ...
Generalised linear Gaussian models
, 2001
Abstract

Cited by 20 (7 self)
This paper addresses the time-series modelling of high-dimensional data. Currently, the hidden Markov model (HMM) is the most popular and successful model, especially in speech recognition. However, there are well-known shortcomings in HMMs, particularly in the modelling of the correlation between successive observation vectors, that is, inter-frame correlation. Standard diagonal covariance matrix HMMs also lack the modelling of the spatial correlation in the feature vectors, that is, intra-frame correlation. Several other time-series models have been proposed recently, especially in the segment model framework, to address the inter-frame correlation problem, such as Gauss-Markov and dynamical system segment models. The lack of intra-frame correlation modelling has been compensated for with transform schemes such as semi-tied full covariance matrices (STC). All these models can be regarded as belonging to the broad class of generalised linear Gaussian models. Linear Gaussian models (LGMs) are popular as many forms may be trained efficiently using the expectation-maximisation algorithm. In this paper, several LGMs and generalised LGMs are reviewed. The models can be roughly categorised into four combinations according to two different state evolution and two different observation processes. The state evolution process can be based on a discrete finite state machine, as in HMMs, or a linear first-order Gauss-Markov process, as in traditional linear dynamical systems. The observation process can be represented as a factor analysis model or a linear discriminant analysis model. General HMMs, and schemes proposed to improve their performance such as STC, can be regarded as special cases in this framework.
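The "continuous state evolution plus linear observation" combination described in this abstract can be illustrated with a toy scalar simulation. This is a minimal sketch with made-up parameter values, not code from the paper; the function name and defaults are purely illustrative:

```python
import math
import random

def simulate_lds(T, a=0.9, c=2.0, q=1.0, r=0.5, seed=0):
    """Simulate a scalar linear dynamical system:
       state:       x_t = a * x_{t-1} + w_t,  w_t ~ N(0, q)
       observation: o_t = c * x_t     + v_t,  v_t ~ N(0, r)
    """
    rng = random.Random(seed)
    x, states, obs = 0.0, [], []
    for _ in range(T):
        x = a * x + rng.gauss(0.0, math.sqrt(q))   # first-order Gauss-Markov state evolution
        o = c * x + rng.gauss(0.0, math.sqrt(r))   # linear-Gaussian observation process
        states.append(x)
        obs.append(o)
    return states, obs

states, obs = simulate_lds(100)
```

Replacing the continuous state recursion with a discrete finite state machine recovers the HMM case discussed above.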
Linear Gaussian models for speech recognition
 CAMBRIDGE UNIVERSITY
, 2004
Abstract

Cited by 15 (0 self)
Currently the most popular acoustic model for speech recognition is the hidden Markov model (HMM). However, HMMs are based on a series of assumptions, some of which are known to be poor. In particular, the assumption that successive speech frames are conditionally independent given the discrete state that generated them is not a good assumption for speech recognition. State space models may be used to address some shortcomings of this assumption. State space models are based on a continuous state vector evolving through time according to a state evo ...
Fast Likelihood Computation Methods For Continuous Mixture Densities In Large Vocabulary Speech Recognition
 In Proc. of the European Conf. on Speech Communication and Technology
, 1997
Abstract

Cited by 14 (10 self)
This paper studies algorithms for reducing the computational effort of the mixture density calculations in HMM-based speech recognition systems. These likelihood calculations take about 70–85% of the total recognition time in the RWTH system for large vocabulary continuous speech recognition. To reduce the computational cost of the likelihood calculations, we investigate several space partitioning methods. A detailed comparison of these techniques is given on the North American Business Corpus (NAB'94) for a 20,000-word task. As a result, the so-called projection search algorithm in combination with the VQ method reduces the cost of likelihood computation by a factor of about 8 with no significant loss in word recognition accuracy.
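The space-partitioning idea behind this line of work can be sketched as follows: quantize the feature space with a small VQ codebook, precompute for each codeword a shortlist of mixture components, and at decode time evaluate only the shortlisted Gaussians. This is an illustrative sketch of the general technique, not the RWTH implementation; all function names are hypothetical:

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian."""
    s = 0.0
    for xi, mi, vi in zip(x, mean, var):
        s += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return s

def nearest_codeword(x, codebook):
    """Index of the closest VQ codeword (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(x, codebook[i])))

def gmm_loglik_selected(x, weights, means, vars_, codebook, shortlists):
    """Mixture log-likelihood, evaluating only the shortlist of x's VQ cell."""
    cell = nearest_codeword(x, codebook)
    terms = [math.log(weights[k]) + log_gauss_diag(x, means[k], vars_[k])
             for k in shortlists[cell]]
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))  # log-sum-exp
```

When a cell's shortlist contains every component, the result coincides with the exact mixture likelihood; the savings come from shortlists that drop components whose contribution in that region is negligible.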
Gaussian Selection Applied to Text-Independent Speaker Verification
 In Proc. Speaker Odyssey 2001
, 2001
Abstract

Cited by 7 (1 self)
Fast speaker verification systems can be realised by reducing the computation associated with searching the mixture components within a statistical model such as a Gaussian mixture model (GMM). Several improvements regarding computational efficiency have already been proposed ...
The 1999 CMU 10x real time broadcast news transcription system
 Proc. DARPA workshop on Automatic Transcription of Broadcast News
, 2000
Abstract

Cited by 3 (1 self)
CMU's 10X real-time system is the HMM-based SPHINX-III system with a newly developed fast decoder. The fast decoder uses a sub-vector clustered version of the acoustic models for Gaussian computation and a lexical tree search structure. It was developed in September 1999, and is currently a first-pass decoder, capable of generating word lattices. It was designed to optimize speed, recognition accuracy, and memory requirements. For the 1999 Hub 4 evaluation task, the system used two sets of acoustic models: full-bandwidth and narrow-bandwidth. The acoustic models were 6000-senone, 32-Gaussians-per-state, 3-state HMMs with no skips permitted across states. The system used a single 39-dimensional feature stream consisting of cepstra and cepstral differences. The lattices generated were rescored using a DAG algorithm. The DAG-rescored hypotheses were designated as those of the primary system. The contrastive system consisted of the output of the first-pass Viterbi search, with no DAG rescoring of lattices. A trigram language model consisting of 57,000 unigrams, 10 million bigrams and 14.9 million trigrams was used. No adaptation passes were done. In this paper we describe the various components of the primary system. The first-pass word error rate on the 1998 Hub 4 evaluation set was 20.4% with this system. The overall word error rate scored by NIST for the 1999 Hub 4 evaluation set was 27.6%.
Decision-Tree Based Quantization Of The Feature Space Of A Speech Recognizer
 In Proceedings of the European Conference on Speech Communication and Technology
, 1997
Abstract

Cited by 3 (0 self)
We present a decision-tree based procedure to quantize the feature space of a speech recognizer, with the motivation of reducing the computation time required for evaluating Gaussians in a speech recognition system. The entire feature space is quantized into non-overlapping regions, where each region is bounded by a number of hyperplanes. Further, each region is characterized by the occurrence of only a small number of the total alphabet of allophones (sub-phonetic speech units); by identifying the region in which a test feature vector lies, only the Gaussians that model the density of allophones that exist in that region need be evaluated. The quantization of the feature space is done in a hierarchical manner using a binary decision tree. Each node of the decision tree represents a region of the feature space, and is further characterized by a hyperplane (a vector v_n and a scalar threshold value h_n) that subdivides the region corresponding to the current node into two non-overlapping ...
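The lookup step described here can be sketched as a descent through a binary tree of hyperplane tests v_n · x >= h_n, ending at a leaf whose allophone shortlist is returned. The tree layout and allophone labels below are hypothetical, for illustration only:

```python
class Node:
    def __init__(self, v=None, h=None, left=None, right=None, shortlist=None):
        self.v, self.h = v, h              # hyperplane: compare v . x against threshold h
        self.left, self.right = left, right
        self.shortlist = shortlist         # leaf only: allophones whose Gaussians to evaluate

def find_region(root, x):
    """Descend the tree, applying a hyperplane test at each internal node."""
    node = root
    while node.shortlist is None:
        score = sum(vi * xi for vi, xi in zip(node.v, x))
        node = node.right if score >= node.h else node.left
    return node.shortlist

# Toy tree that splits on the first feature dimension at 0
tree = Node(v=[1.0, 0.0], h=0.0,
            left=Node(shortlist=["aa", "ae"]),
            right=Node(shortlist=["iy", "ih"]))
```

The cost per frame is one dot product per tree level plus the shortlisted Gaussian evaluations, rather than an evaluation of every Gaussian in the system.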
Gaussian-selection-based non-optimal search for speaker identification
, 2005
Abstract

Cited by 2 (0 self)
Most speaker identification systems train individual models for each speaker. This is done as individual models often yield better performance and they permit easier adaptation and enrollment. When classifying a speech token, the token is scored against each model and the maximum a priori decision rule is used to decide the classification label. Consequently, the cost of classification grows linearly for each token as the population size grows. When considering that the number of tokens to classify is also likely to grow linearly with the population, the total workload increases exponentially. This paper presents a pre-classifier which generates an N-best hypothesis using a novel application of Gaussian selection, and a transformation of the traditional tail test statistic which lets the implementer specify the tail region in terms of probability. The system is trained using parameters of individual speaker models and does not require the original feature vectors, even when enrolling new speakers or adapting existing ones. As the correct class label need only be in the N-best hypothesis set, it is possible to prune more Gaussians than in a traditional Gaussian selection application. The N-best hypothesis set is then evaluated using individual speaker models, resulting in an overall reduction of workload.
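The two-stage structure of this approach can be sketched generically: a cheap, heavily Gaussian-pruned scorer keeps only the N best speakers, and the full models rescore just that shortlist. The scoring functions are passed in as parameters because the paper's actual selection statistic is not reproduced here; everything below is an illustrative skeleton:

```python
def identify(x_frames, cheap_score, full_score, speakers, n_best=5):
    """Two-stage speaker identification.

    cheap_score: fast pre-classifier score (e.g. a pruned-Gaussian model)
    full_score:  exact score under the full individual speaker model
    """
    # Stage 1: pre-classifier generates the N-best hypothesis set
    ranked = sorted(speakers, key=lambda s: cheap_score(s, x_frames), reverse=True)
    hypotheses = ranked[:n_best]
    # Stage 2: full models are evaluated only on the shortlist
    return max(hypotheses, key=lambda s: full_score(s, x_frames))
```

Since the correct speaker only needs to survive Stage 1, the pre-classifier can prune far more aggressively than a selection scheme that must itself make the final decision.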
A Speech Interface for a Mobile Robot Controlled by GOLOG
, 2000
Abstract

Cited by 1 (0 self)
With today's high-level plan languages like GOLOG or rpl, it is possible for mobile robots to cope with complex problems. Unfortunately, instructing the robot what to do, or interacting with it, is still awkward. Usually, instructions are given by loading the appropriate program, and interaction amounts to little more than pressing buttons positioned on the robot itself.
A High-Speed, Low-Resource ASR Back-End Based on Custom Arithmetic
Abstract
With the skyrocketing popularity of mobile devices, new processing methods tailored to a specific application have become necessary for low-resource systems. This work presents a high-speed, low-resource speech recognition system using custom arithmetic units, where all system variables are represented by integer indices and all arithmetic operations are replaced by hardware-based table lookups. To this end, several reordering and rescaling techniques, including two accumulation structures for Gaussian evaluation and a novel method for the normalization of Viterbi search scores, are proposed to ensure low entropy for all variables. Furthermore, a discriminatively inspired distortion measure is investigated for scalar quantization of forward probabilities to maximize the recognition rate. Finally, heuristic algorithms are explored to optimize system-wide resource allocation. Our best bit-width allocation scheme only requires 59 kB of ROMs to hold the lookup tables, and its recognition performance with various vocabulary sizes in both clean and noisy conditions is nearly as good as that of a system using a 32-bit floating-point unit. Simulations on various architectures show that, on most modern processor designs, we can expect a cycle-count speedup of at least three times over systems with floating-point units. Additionally, the memory bandwidth is reduced by over 70% and the offline storage for model parameters is reduced by 80%.
Index Terms—Alpha recursion, bit-width allocation, custom arithmetic, discriminative distortion measure, forward probability normalization and scaling, high speed, low resource, normalization, quantization, speech recognition.
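The table-lookup substitution can be illustrated with the classic log-add table used in log-domain likelihood accumulation: quantize the difference of two log scores and fetch the correction log(1 + exp(-d)) from a precomputed ROM instead of calling exp/log at runtime. The step size and table size below are illustrative choices, not the paper's bit-width allocation:

```python
import math

STEP = 0.01
TABLE_SIZE = 2000  # covers differences in [0, 20); beyond that the correction ~ 0
# Precomputed "ROM": log(1 + exp(-d)) sampled on a fixed grid of differences d
LOGADD_TABLE = [math.log1p(math.exp(-i * STEP)) for i in range(TABLE_SIZE)]

def log_add(a, b):
    """log(exp(a) + exp(b)) via an integer-indexed table lookup."""
    hi, lo = (a, b) if a >= b else (b, a)
    idx = int((hi - lo) / STEP)        # quantize the difference to a table index
    if idx >= TABLE_SIZE:
        return hi                      # correction smaller than table resolution
    return hi + LOGADD_TABLE[idx]
```

The accuracy/ROM trade-off is governed by STEP and TABLE_SIZE, which mirrors, in miniature, the bit-width allocation problem the paper optimizes heuristically.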