Signal modeling techniques in speech recognition
- PROCEEDINGS OF THE IEEE
, 1993
Cited by 99 (5 self)
We have seen three important trends develop in the last five years in speech recognition. First, heterogeneous parameter sets that mix absolute spectral information with dynamic, or time-derivative, spectral information, have become common. Second, similarity transform techniques, often used to normalize and decorrelate parameters in some computationally inexpensive way, have become popular. Third, the signal parameter estimation problem has merged with the speech recognition process so that more sophisticated statistical models of the signal’s spectrum can be estimated in a closed-loop manner. In this paper, we review the signal processing components of these algorithms. These algorithms are presented as part of a unified view of the signal parameterization problem in which there are three major tasks: measurement, transformation, and statistical modeling. This paper is by no means a comprehensive survey of all possible techniques of signal modeling in speech recognition. There are far too many algorithms in use today to make an exhaustive survey feasible (and cohesive). Instead, this paper is meant to serve as a tutorial on signal processing in state-of-the-art speech recognition systems and to review those techniques most commonly used. In keeping with this goal, a complete mathematical description of each algorithm has been included in the paper.
Natural Statistical Models for Automatic Speech Recognition
, 1999
Cited by 44 (16 self)
The performance of state-of-the-art speech recognition systems is still far worse than that of humans. This is partly caused by the use of poor statistical models. In a general statistical pattern classification task, the probabilistic models should represent the statistical structure unique to and distinguishing those objects to be classified. In many cases, however, model families are selected without verification of their ability to represent vital discriminative properties. For example, Hidden Markov Models (HMMs) are frequently used in automatic speech recognition systems even though they possess conditional independence properties that might cause inaccuracies when modeling and classifying speech signals. In this work, a new method for automatic speech recognition is developed where the natural statistical properties of speech are used to determine the probabilistic model. Starting from an HMM, new models are created by adding dependencies only if they are not already well captured by the HMM, and only if they increase the
What HMMs can do
, 2002
Cited by 21 (3 self)
Since their inception over thirty years ago, hidden Markov models (HMMs) have become the predominant methodology for automatic speech recognition (ASR) systems; today, most state-of-the-art speech systems are HMM-based. There have been a number of ways to explain HMMs and to list their capabilities, each of these ways having both advantages and disadvantages. In an effort to better understand what HMMs can do, this tutorial analyzes HMMs by exploring a novel way in which an HMM can be defined, namely in terms of random variables and conditional independence assumptions. We prefer this definition as it allows us to reason more thoroughly about the capabilities of HMMs. In particular, it is possible to deduce that there are, in theory at least, no limitations to the class of probability distributions representable by HMMs. This paper concludes that, in search of a model to supersede the HMM for ASR, rather than trying to correct for HMM limitations in the general case, new models should be found based on their potential for better parsimony, computational requirements, and noise insensitivity.
Speech Recognition System Design Based on Automatically Derived Units
, 1999
Cited by 10 (0 self)
In most speech recognition systems today, acoustic modeling and lexical modeling are viewed as separable problems. Currently the most popular approach is to manually define canonical word pronunciations in terms of phonetic units and let the acoustic models capture differences between actual spoken and canonical pronunciations implicitly with Gaussian mixture models. As a result, these models can be very broad, particularly for casual spontaneous speech. An alternative approach, explored in this thesis, is to learn a unit inventory and pronunciation dictionary from training data using a maximum likelihood objective function. In particular,
Hidden Model Sequence Models for Automatic Speech Recognition
, 2001
Cited by 4 (0 self)
Most modern automatic speech recognition systems make use of acoustic models based on hidden Markov models. To obtain reasonable recognition performance within a large vocabulary framework, the acoustic models usually include a pronunciation model, together with complex parameter tying schemes. In many cases the pronunciation model operates on a phoneme level and is derived independently of the underlying models. In contrast, this work is aimed at improving pronunciation modelling on a sub-phone level in a combined framework. The modelling of pronunciation variation is assumed to be of special importance for recognition of spontaneous speech.
Computations and Evaluations of an Optimal Feature-set for an HMM-based Recognizer
, 1996
Cited by 3 (1 self)
The benefits of a speech recognition machine would be many, resulting in the improvement of the quality of life for people. The design of a speech recognition system can be divided into two parts, commonly known as the front-end and back-end. The front-end deals with the conversion of the analog speech signal into features for classification. This thesis investigates optimal feature-sets for speech recognition. The objectives for an optimal feature-set are improved recognition performance, noise robustness, talker insensitivity and efficiency. Three problems that make it difficult to find optimal features are: 1) the amount of resources (time and computations) required to evaluate the performance of a feature-set, 2) the size of the feature space, and 3) the dependence of features upon some words in t...
Order Analysis Of Combined Features In Speaker Recognition
- ICSP-93 Proceedings
, 1993
Cited by 1 (1 self)
This paper discusses the performance of several sets of feature combinations in automatic speaker identification (ASI). A significant reduction in dimension, based on linear discriminant analysis (LDA), is obtained without loss of recognition performance. We show that pre-normalization further enhances performance. Four different combinations are investigated and the best combination is derived from MFCC, RASTA-PLP and their Δ's. 1 INTRODUCTION A speaker recognition system typically consists of a feature extraction stage followed by a statistical pattern classifier. The feature extraction might generate LPC, mel or PLP [1] cepstral features from frames of input speech, each frame typically a few tens of milliseconds long and each feature vector typically of size 10 to 14 coefficients. The ASI performance is mainly affected by the feature itself and its order together with the classifier, see for example [2] and [3]. It is common practice to utilise information from different feature...
Explicitly Modelling Undegraded And Degraded Speech In Speaker Recognition
- ICSP-93 Proceedings
, 1993
Cited by 1 (1 self)
This paper assesses two popular features and their first order dynamic form (Δ) in terms of their sensitivity to noise mismatch between the test and training conditions. We compare the robustness of individual features and features combined via LDA, using VQ models trained first with homogeneous data and then with mixed clean and single noise level data, all in the context of speaker identification. It is found that for a single feature RASTA-PLP and its Δ form give better performance than that of MFCC and Δ-MFCC under cross conditions, and generally combined features via LDA are better than the individual features in their immunity to noise. We show that single noise level training mixtures give good generalization between clean and the given noise level, and that an LDA feature of MFCC plus RASTA-PLP, with mixed training, can approach the performance of explicitly modeled undegraded and degraded speech. 1 INTRODUCTION Cepstral-based features, which are currently the...
A Gaussian Mixture Model Spectral Representation for Speech Recognition
Cited by 1 (0 self)
Summary Most modern speech recognition systems use either Mel-frequency cepstral coefficients or perceptual linear prediction as acoustic features. Recently, there has been some interest in alternative speech parameterisations based on using formant features. Formants are the resonant frequencies in the vocal tract which form the characteristic shape of the speech spectrum. However, formants are difficult to reliably and robustly estimate from the speech signal and in some cases may not be clearly present. Rather than estimating the resonant frequencies, formant-like features can be used instead. Formant-like features use the characteristics of the spectral peaks to represent the spectrum. In this work, novel features are developed based on estimating a Gaussian mixture model (GMM) from the speech spectrum. This approach has previously been used successfully as a speech codec. The EM algorithm is used to estimate the parameters of the GMM. The extracted parameters (the means, standard deviations and component weights) can be related to the formant locations, bandwidths and magnitudes. As the features directly represent the linear spectrum, it is possible to apply techniques for vocal tract length normalisation and additive noise
High Performance Telephone Bandwidth Speaker Independent Continuous Digit Recognition
- In Proceedings Automatic Speech Recognition and Understanding Workshop, Madonna di Campiglio
Cited by 1 (1 self)
The development of a high-performance telephone-bandwidth speaker independent connected digit recognizer for Italian is described. The CSLU Speech Toolkit was used to develop and implement the hybrid ANN/HMM system, which is trained on context-dependent categories to account for coarticulatory variation. Various front-end processing schemes and system architectures were compared and, when the best features (MFCC with CMS + ∆) and network (4-layer fully connected feed-forward network) were considered, there was a 98.92% word recognition accuracy and a 92.62% sentence recognition accuracy on a test set of the FIELD continuous digits recognition task.

