Computations and Evaluations of an Optimal Feature-set for an HMM-based Recognizer
, 1996
Cited by 3 (1 self)
The benefits of a speech recognition machine would be many, resulting in the improvement of the quality of life for people. The design of a speech recognition system can be divided into two parts, commonly known as the front-end and back-end. The front-end deals with the conversion of the analog speech signal into features for classification. This thesis investigates optimal feature-sets for speech recognition. The objectives for an optimal feature-set are improved recognition performance, noise robustness, talker insensitivity and efficiency. Three problems that make it difficult to find optimal features are: 1) the amount of resources (time and computations) required to evaluate the performance of a feature-set, 2) the size of the feature space, and 3) the dependence of features upon some words in t...
Speaker Identification Using A Polynomial-Based Classifier
- in International Symposium on Signal Processing and its Applications
, 1999
Cited by 2 (1 self)
A new set of techniques for using polynomial-based classifiers for speaker identification is examined. This set of techniques makes application of polynomial classifiers practical for speaker identification by enabling discriminative training for large data sets. The training technique is shown to be invariant to fixed liftering and affine transforms of the feature space. Efficient methods for new class addition, low-complexity retraining, and identification across large populations are given. The method is illustrated by application to the YOHO database.
Estimation of the Spectral Envelope of Mixed Spectrum Signals Using a Penalized Likelihood Criterion
- IEEE Trans. Speech and Audio Processing, June
, 1997
Cited by 1 (0 self)
Speech modeling techniques used for analysis and synthesis usually rely on a source-filter representation where the source is a mixed spectrum signal, i.e., one which consists of both sinusoidal and wide-band noise-like components. In such models, it is of prime importance to estimate a spectral envelope which represents the main features of the speech magnitude spectrum. In this paper, we introduce a new performance criterion for spectral envelope fitting which is based on the statistical analysis of the behavior of the empirical sinusoids and noise parameter estimates. We demonstrate the performance of a penalized version of this criterion where the penalization (or regularization) term is designed to control the smoothness of the estimated envelope. This approach can significantly improve the reliability of sinusoidal models when used in applications such as speech coding or enhancement. Index Terms: Spectral estimation, smoothing, mixed spectrum signals, sinusoidal models, speech cod...
A Gaussian Mixture Model Spectral Representation for Speech Recognition
Cited by 1 (0 self)
Summary: Most modern speech recognition systems use either Mel-frequency cepstral coefficients or perceptual linear prediction as acoustic features. Recently, there has been some interest in alternative speech parameterisations based on formant features. Formants are the resonant frequencies in the vocal tract which form the characteristic shape of the speech spectrum. However, formants are difficult to estimate reliably and robustly from the speech signal, and in some cases may not be clearly present. Rather than estimating the resonant frequencies, formant-like features can be used instead. Formant-like features use the characteristics of the spectral peaks to represent the spectrum. In this work, novel features are developed based on estimating a Gaussian mixture model (GMM) from the speech spectrum. This approach has previously been used successfully as a speech codec. The EM algorithm is used to estimate the parameters of the GMM. The extracted parameters (the means, standard deviations and component weights) can be related to the formant locations, bandwidths and magnitudes. As the features directly represent the linear spectrum, it is possible to apply techniques for vocal tract length normalisation and additive noise
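The idea of fitting a GMM to a spectrum with EM can be illustrated with a minimal sketch. It assumes the magnitude spectrum is normalised and treated as a histogram over frequency bins, so each bin contributes to the EM statistics in proportion to its magnitude. The function name `fit_spectrum_gmm`, the evenly spaced initialisation, and the standard-deviation floor are illustrative assumptions, not details taken from the paper.

```python
import math


def fit_spectrum_gmm(magnitudes, n_components=2, n_iter=100):
    """Fit a 1-D GMM to a magnitude spectrum treated as a histogram over bins.

    Returns (weights, means, stds); means and stds are in units of bin index.
    Hypothetical helper for illustration only.
    """
    n_bins = len(magnitudes)
    total = sum(magnitudes)
    # Normalise the spectrum so that it behaves like a probability mass.
    mass = [m / total for m in magnitudes]

    # Assumed initialisation: components spread evenly over the frequency axis.
    means = [(k + 0.5) * n_bins / n_components for k in range(n_components)]
    stds = [n_bins / (2.0 * n_components)] * n_components
    weights = [1.0 / n_components] * n_components

    for _ in range(n_iter):
        # E-step: responsibility of each component for each frequency bin.
        resp = []
        for x in range(n_bins):
            dens = [
                w * math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))
                for w, mu, sd in zip(weights, means, stds)
            ]
            norm = sum(dens) or 1e-300
            resp.append([d / norm for d in dens])

        # M-step: re-estimate parameters, weighting each bin by its spectral mass.
        for k in range(n_components):
            nk = sum(mass[x] * resp[x][k] for x in range(n_bins))
            if nk <= 0.0:
                continue
            mu = sum(mass[x] * resp[x][k] * x for x in range(n_bins)) / nk
            var = sum(mass[x] * resp[x][k] * (x - mu) ** 2 for x in range(n_bins)) / nk
            weights[k] = nk
            means[k] = mu
            stds[k] = max(math.sqrt(var), 0.5)  # floor keeps a component from collapsing

    return weights, means, stds
```

In this sketch the fitted means play the role of formant-like peak locations, the standard deviations the role of bandwidths, and the component weights the role of relative peak magnitudes.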
Improved Noise-Robustness in Distributed Speech Recognition via Perceptually-Weighted Vector Quantisation of Filterbank Energies
Cited by 1 (0 self)
In this paper, we examine a coding scheme for quantising feature vectors in a distributed speech recognition environment that is more robust to noise. It consists of a vector quantiser that operates on the logarithmic filterbank energies (LFBEs). Through the use of a perceptually-weighted Euclidean distance measure, which emphasises the LFBEs that represent the spectral peaks, the vector quantiser codebook provides a priori knowledge of the spectral characteristics of clean speech and is used to quantise features from noise-corrupted speech. Our comparative results from the ETSI Aurora-2 recognition task show that the perceptually-weighted vector quantisation of LFBEs achieves higher recognition accuracies for noisy speech than the unweighted vector quantisation, memoryless and multi-frame GMM-based block quantisation and scalar quantisation of Mel frequency-warped cepstral coefficients.
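A weighted-distance codebook search of this kind might look like the sketch below. The abstract does not give the weighting formula, so `peak_weights` uses an assumed softmax-style rule in which larger log energies (the spectral peaks of the input vector) receive larger weights; that rule, the `sharpness` parameter, and all function names are hypothetical.

```python
import math


def peak_weights(lfbe, sharpness=1.0):
    """Assumed weighting: emphasise the largest log filterbank energies."""
    top = max(lfbe)
    raw = [math.exp(sharpness * (e - top)) for e in lfbe]
    total = sum(raw)
    return [r / total for r in raw]


def weighted_sq_distance(x, codeword, weights):
    """Weighted squared Euclidean distance between a vector and a codeword."""
    return sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, codeword))


def quantise(x, codebook, sharpness=1.0):
    """Return the index of the codeword nearest to x under the weighted distance."""
    weights = peak_weights(x, sharpness)
    return min(
        range(len(codebook)),
        key=lambda i: weighted_sq_distance(x, codebook[i], weights),
    )
```

Setting `sharpness=0.0` makes all weights equal, which reduces the search to ordinary unweighted vector quantisation and gives a baseline for comparison.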
Flooring the Observation Probability for Robust ASR in Impulsive Noise
Impulsive noise usually introduces sudden mismatches between the observation features and the acoustic models trained with clean speech, which drastically degrades the performance of automatic speech recognition (ASR) systems. This paper presents a novel method to directly suppress the adverse effect of impulsive noise on recognition. In this method, according to the noise sensitivity of each feature dimension, the observation vector is divided into several sub-vectors, each of which is assigned a suitable flooring threshold. In the recognition stage, the observation probability of each feature sub-vector is floored at the Gaussian mixture level. Thus, the unreliable relative probability difference caused by impulsive noise is eliminated, and the expected correct state sequence recovers the priority of being chosen in decoding. Experimental evaluations on the Aurora2 database show that the proposed method achieves an average error rate reduction (ERR) of 61.62% and 84.32% in simulated impulsive noise and machine-gun noise environments, respectively, while maintaining high performance for clean speech recognition.
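One way to read "floored at the Gaussian mixture level" is sketched below, under stated assumptions: each mixture component has a diagonal covariance, the log-likelihood of every sub-vector under that component is clipped from below by a per-sub-vector floor, and the clipped terms are summed before the components are combined. The function names, the data layout, and this exact combination rule are illustrative guesses rather than the paper's specification.

```python
import math


def diag_gauss_logpdf(x, mean, var):
    """Log density of a diagonal-covariance Gaussian."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (a - m) ** 2 / v)
        for a, m, v in zip(x, mean, var)
    )


def floored_state_loglik(obs, mixture, subvectors, log_floors):
    """Log observation probability of one HMM state with sub-vector flooring.

    mixture:    list of (weight, mean, var) tuples, one per Gaussian component
    subvectors: list of index lists partitioning the feature dimensions
    log_floors: one log-probability floor per sub-vector (assumed given)
    """
    comp_logs = []
    for weight, mean, var in mixture:
        total = math.log(weight)
        for idx, floor in zip(subvectors, log_floors):
            ll = diag_gauss_logpdf(
                [obs[i] for i in idx],
                [mean[i] for i in idx],
                [var[i] for i in idx],
            )
            # Flooring limits how much a single corrupted sub-vector can
            # penalise this component.
            total += max(ll, floor)
        comp_logs.append(total)
    # Log-sum-exp over mixture components.
    top = max(comp_logs)
    return top + math.log(sum(math.exp(c - top) for c in comp_logs))
```

With this rule, a sub-vector hit by an impulse contributes at most its floor value to the penalty, so competing states are compared mainly on the sub-vectors that remain reliable.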
Variations on Statistical Phoneme Recognition -- A Hybrid Approach
, 1997
Automatic speech recognition (ASR) is rapidly becoming a mature technology leading to an increasing number of commercial applications. Although great advances have been made in the state of the art of speech recognition over the last 10 years, the holy grail of ASR, namely large-vocabulary, speaker-independent continuous speech recognition with an error rate of less than 1%, still eludes researchers. At the heart of most modern speech recognition systems lies an HMM-based phoneme recognition engine which segments and classifies the incoming acoustic signal into a sequence of phonemes. These phonemes are concatenated to form word models which are processed further to arrive at a transcription of the linguistic message encoded in the speech signal. The final recognition accuracy of the speech recognition system can thus be directly linked to the recognition accuracy of the underlying phoneme recogniser. Two types of features extracted from the speech signal are commonly used for phoneme recognition. These are the supra-segmental knowledge-based features derived from phonetic and phonologic theory, and the widely used frame-based cepstral features. Until now, these features have been used separately by researchers, resulting in the loss of valuable discriminative information.
Cepstral Statistics Within Phonetic Subgroups
The identification of aspects of cepstral features that contain a high degree of speaker specificity can potentially simplify the task of speaker recognition. The identification process can be performed both temporally and cepstrally. The temporal analysis determines which phonemes or utterances exhibit the highest degree of speaker specificity, while the cepstral analysis examines individual cepstra within these temporal divisions. This paper aims to complement work that has already been conducted in the temporal domain, by performing analysis upon the individual cepstral coefficients within phonetic subgroups.
1 Introduction
In recent work on speaker recognition, it has been found by Eatock [1], van den Heuvel [2] and Bonastre [3] that certain phonemes contain more speaker-specific information than others. The difference of the speaker specificity between phonemes varies noticeably between the different subgroups, but remains more constant within the subgroup. Therefore classificati...
to Microphone Variations
, 1995
"... in partial fulfillment of the requirements for the degree of ..."

