Results 1 - 10 of 11
Signal modeling techniques in speech recognition
Proceedings of the IEEE, 1993
Cited by 99 (5 self)
We have seen three important trends develop in the last five years in speech recognition. First, heterogeneous parameter sets that mix absolute spectral information with dynamic, or time-derivative, spectral information, have become common. Second, similarity transform techniques, often used to normalize and decorrelate parameters in some computationally inexpensive way, have become popular. Third, the signal parameter estimation problem has merged with the speech recognition process so that more sophisticated statistical models of the signal’s spectrum can be estimated in a closed-loop manner. In this paper, we review the signal processing components of these algorithms. These algorithms are presented as part of a unified view of the signal parameterization problem in which there are three major tasks: measurement, transformation, and statistical modeling. This paper is by no means a comprehensive survey of all possible techniques of signal modeling in speech recognition. There are far too many algorithms in use today to make an exhaustive survey feasible (and cohesive). Instead, this paper is meant to serve as a tutorial on signal processing in state-of-the-art speech recognition systems and to review those techniques most commonly used. In keeping with this goal, a complete mathematical description of each algorithm has been included in the paper.
On the decorrelation of filter-bank energies in speech recognition
Proc. Eurospeech, 1995
Cited by 26 (6 self)
Cepstral coefficients are widely used in speech recognition. In this paper, we claim that they are not the best way of representing the spectral envelope, at least for some common speech recognition systems. In fact, the cepstrum has several disadvantages: poor physical meaning, the need for a transformation, and low capacity of adaptation to some recognition systems. In this paper, we propose a new representation that significantly outperforms both mel-cepstrum and LPC-cepstrum techniques in both recognition rate and computational cost. It consists of filtering the frequency sequence of filter-bank energies with an extremely simple filter that equalizes the variance of the cepstral coefficients. Excellent results of the new technique using a continuous observation density HMM recognition system and two very different recognition tasks, connected digits and phone recognition, are presented.
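The idea of filtering filter-bank energies along the frequency axis can be illustrated with a small sketch. The abstract does not state which filter is used, so the two-tap difference y[k] = x[k+1] - x[k-1] below, the zero padding at the band edges, and the function name `frequency_filter` are all assumptions made for illustration only.

```python
def frequency_filter(log_energies):
    """Filter one frame of log filter-bank energies along the frequency index.

    Assumed filter (not taken from the abstract): y[k] = x[k+1] - x[k-1],
    with zeros assumed beyond the first and last bands.
    """
    padded = [0.0] + list(log_energies) + [0.0]
    # x[k] sits at padded[k + 1], so x[k+1] is padded[k + 2] and x[k-1] is padded[k].
    return [padded[k + 2] - padded[k] for k in range(len(log_energies))]


if __name__ == "__main__":
    frame = [1.0, 2.0, 4.0, 7.0]  # made-up log energies for four bands
    print(frequency_filter(frame))
```

The output has one value per band, so it can stand in for a cepstral vector of the same length without a cosine transform, which is where a computational saving of the kind the abstract mentions could come from.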
Spectral Signal Processing for ASR
Proc. ASRU’99, 1999
Cited by 17 (0 self)
The paper begins by discussing the difficulties in obtaining repeatable results in speech recognition. Theoretical arguments are presented for and against copying human auditory properties in automatic speech recognition. The "standard" acoustic analysis for automatic speech recognition, consisting of mel-scale cepstrum coefficients and their temporal derivatives, is described. Some variations and extensions of the standard analysis (PLP, cepstrum correlation methods, LDA, and variants on log power) are then discussed. These techniques pass the test of having been found useful at multiple sites, especially with noisy speech. The extent to which auditory properties can account for the advantage found for particular techniques is considered. It is concluded that the advantages do not in fact stem from auditory properties, and that there is so far little or no evidence that the study of the human auditory system has contributed to advances in automatic speech recognition. Contributio...
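The "temporal derivatives" of cepstral coefficients mentioned above are commonly computed with a short linear regression over neighbouring frames. The sketch below uses that common regression formula; the window size, the edge handling (repeating the first and last frame), and the function names are assumptions, not details taken from the paper.

```python
def delta_features(frames, window=2):
    """Regression-based time derivative of a sequence of feature vectors.

    delta[t] = sum_{n=1..N} n * (c[t+n] - c[t-n]) / (2 * sum_{n=1..N} n^2)
    Frames beyond either end are replaced by the first or last frame (assumed).
    """
    denom = 2 * sum(n * n for n in range(1, window + 1))
    last = len(frames) - 1
    deltas = []
    for t in range(len(frames)):
        row = []
        for d in range(len(frames[t])):
            num = sum(
                n * (frames[min(t + n, last)][d] - frames[max(t - n, 0)][d])
                for n in range(1, window + 1)
            )
            row.append(num / denom)
        deltas.append(row)
    return deltas


def append_deltas(frames, window=2):
    """Concatenate each static vector with its delta vector."""
    return [c + d for c, d in zip(frames, delta_features(frames, window))]


if __name__ == "__main__":
    ramp = [[0.0], [1.0], [2.0], [3.0], [4.0]]  # toy one-dimensional "cepstrum"
    print(append_deltas(ramp))
```

For a feature that rises by one unit per frame, the interior delta values equal 1, which is a quick sanity check on the regression weights.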
Status Report Of The Finnish Phonetic Typewriter Project
In Artificial Neural Networks, 1991
Cited by 11 (10 self)
In connection with a speech recognizer, the aim of which is to produce phonemic transcriptions of arbitrary spoken utterances, we investigate the combined effect of several improvements at different stages of phoneme recognition. The core of the basic recognition system is Learning Vector Quantization (LVQ1) [1]. This algorithm was originally used to classify FFT-based short-time feature vectors into phonemic classes. The phonemic decoding stage was earlier based on simple durational rules [2] [3]. At the feature level, we now study the effect of using mel-scale cepstral features and concatenating consecutive feature vectors to include context. At the output of vector quantization, a comparison of three approaches to take into account the classifications of feature vectors in local context is presented. The rule-based phonemic decoding is compared to decoding employing Hidden Markov Models (HMMs). As earlier, an optional grammatical post-correction method (DEC) is applied. Experiments co...
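For readers unfamiliar with LVQ1, a single training step can be sketched as follows: the codebook vector nearest to the input is moved toward the input if its class label matches and away from it otherwise. The learning rate, the squared Euclidean distance, and the toy two-dimensional data are assumptions for illustration; the project's actual features and settings are not given in the abstract.

```python
def lvq1_step(codebook, labels, x, x_label, alpha=0.1):
    """Perform one LVQ1 update in place and return the index of the winner.

    codebook: list of codebook vectors (lists of floats)
    labels:   class label attached to each codebook vector
    """
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    winner = min(range(len(codebook)), key=lambda i: dist2(codebook[i], x))
    # Attract the winner for a correct classification, repel it otherwise.
    sign = 1.0 if labels[winner] == x_label else -1.0
    codebook[winner] = [m + sign * alpha * (xi - m)
                        for m, xi in zip(codebook[winner], x)]
    return winner


if __name__ == "__main__":
    codebook = [[0.0, 0.0], [1.0, 1.0]]  # hypothetical codebook
    labels = ["a", "o"]                  # hypothetical phoneme classes
    lvq1_step(codebook, labels, [0.2, 0.0], "a", alpha=0.5)
    print(codebook)
```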
Speaker recognition using hidden Markov models, dynamic time warping and vector quantisation
1995
Cited by 11 (1 self)
This paper evaluates continuous density hidden Markov models (CDHMM), dynamic time warping (DTW) and distortion-based vector quantisation (VQ) for speaker recognition, emphasising the performance of each model structure across incremental amounts of training data. Text-independent (TI) experiments are performed with VQ and CDHMMs, and text-dependent (TD) experiments are performed with DTW, VQ and CDHMMs. We show that for TI speaker recognition, VQ performs better than an equivalent CDHMM with one training version, but is outperformed by CDHMM when trained with ten training versions. For TD experiments we show that DTW outperforms VQ and CDHMMs for sparse amounts of training data, but with more data, the performance of each model is indistinguishable. The performance of the TD procedures is consistently superior to TI, which is attributed to subdividing the speaker recognition problem into smaller speaker-word problems. We also show a large variation in performance across the differen...
Filtering the Time Sequences of Spectral Parameters for Speech Recognition
1997
Cited by 10 (1 self)
In automatic speech recognition, the signal is usually represented by a set of time sequences of spectral parameters (TSSPs) that model the temporal evolution of the spectral envelope frame-to-frame. Those sequences are then filtered either to make them more robust to environmental conditions or to compute differential parameters (dynamic features) which enhance discrimination. In this paper, we apply frequency analysis to TSSPs in order to provide an interpretation framework for the various types of parameter filters used so far. Thus, the analysis of the average long-term spectrum of the successfully filtered sequences reveals a combined effect of equalization and band selection that provides insights into TSSP filtering. Also, we show in the paper that, when supplementary differential parameters are not used, the recognition rate can be improved even for clean speech, just by properly filtering the TSSPs. To support this claim, a number of experimental results are presented, bot...
Cepstrum Derived From Differentiated Power Spectrum for Robust Speech Recognition
2003
Cited by 4 (0 self)
In this paper, cepstral features derived from the differential power spectrum (DPS) are proposed for improving the robustness of a speech recognizer in presence of background noise. These robust features are computed from the speech signal of a given frame through the following four steps. First, the short-time power spectrum of the speech signal is computed from the speech signal through the fast Fourier transform algorithm. Second, DPS is obtained by differentiating the power spectrum with respect to frequency. Third, the magnitude of DPS is projected from linear frequency to the mel scale and smoothed by a filter bank. Finally, the outputs of the filter bank are transformed to cepstral coefficients by the discrete cosine transform after a nonlinear transformation. It is shown that this new feature set can be decomposed as the superposition of the standard cepstrum and its nonlinearly liftered counterpart. While a linear lifter has no effect on continuous density hidden Markov model based speech recognition, we show that the proposed feature set embedded with a nonlinear liftering transformation is quite effective for robust speech recognition. For this, we conduct a number of speech recognition experiments (including isolated word recognition, connected digits recognition and large vocabulary continuous speech recognition) in various operating environments and compare the DPS features with the standard mel-frequency cepstral coefficient features used with cepstral mean normalization and spectral subtraction techniques. © 2003 Elsevier B.V. All rights reserved. Keywords: Robust speech recognition; Hidden Markov model; Differential power spectrum; Linear liftering; Cepstral mean normalization; Spectral subtraction 1. Introduction Speech signals carry information from many sources. But not all information is relevant or important for speech recognition. In speech recognition, the first crucial step i...
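The four steps listed in the abstract can be sketched in a few dozen lines. This is only an illustration under stated assumptions: a direct DFT stands in for the FFT to keep the example dependency-free, the DPS is approximated by the difference of adjacent power-spectrum bins, the nonlinear transformation is assumed to be a logarithm with a small floor, and the triangular mel filter bank, filter count, and function names are all hypothetical choices rather than the authors' settings.

```python
import cmath
import math


def power_spectrum(frame):
    """Step 1: short-time power spectrum (direct DFT, non-negative frequencies)."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))) ** 2
            for k in range(n // 2 + 1)]


def differential_power_spectrum(power):
    """Step 2 (assumed form): magnitude of the adjacent-bin difference."""
    return [abs(power[k] - power[k + 1]) for k in range(len(power) - 1)]


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10 ** (m / 2595.0) - 1.0)


def mel_filter_bank(values, sample_rate, n_filters):
    """Step 3: smooth with triangular filters spaced evenly on the mel scale."""
    bin_hz = (sample_rate / 2.0) / len(values)
    top = hz_to_mel(sample_rate / 2.0)
    edges = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    outputs = []
    for m in range(1, n_filters + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        total = 0.0
        for k, v in enumerate(values):
            f = k * bin_hz
            if lo < f <= mid:
                total += v * (f - lo) / (mid - lo)
            elif mid < f < hi:
                total += v * (hi - f) / (hi - mid)
        outputs.append(total)
    return outputs


def dct_ii(values, n_out):
    """Step 4b: type-II discrete cosine transform, first n_out coefficients."""
    count = len(values)
    return [sum(v * math.cos(math.pi * n * (m + 0.5) / count)
                for m, v in enumerate(values))
            for n in range(n_out)]


def dps_cepstrum(frame, sample_rate, n_filters=8, n_ceps=5):
    power = power_spectrum(frame)
    dps = differential_power_spectrum(power)
    smoothed = mel_filter_bank(dps, sample_rate, n_filters)
    # Step 4a (assumed nonlinearity): log with a floor to avoid log(0).
    compressed = [math.log(v + 1e-10) for v in smoothed]
    return dct_ii(compressed, n_ceps)


if __name__ == "__main__":
    rate = 8000
    frame = [math.sin(2 * math.pi * 440 * t / rate)
             + 0.5 * math.sin(2 * math.pi * 1200 * t / rate) for t in range(64)]
    print(dps_cepstrum(frame, rate))
```

A practical front end would also apply a window function and use an FFT; both are omitted here to keep the four steps visible.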
A Segment-Based Speaker Verification System Using SUMMIT
1997
Cited by 3 (0 self)
This thesis describes the development of a segment-based speaker verification system. Our investigation is motivated by past observations that speaker-specific cues may manifest themselves differently depending on the manner of articulation of the phonemes. By treating the speech signal as a concatenation of phone-sized units, one may be able to capitalize on measurements for such units more readily. A potential side benefit of such an approach is that one may be able to achieve good performance with unit (i.e., phonetic inventory) and feature sizes that are smaller than what would normally be required for a frame-based system, thus deriving the benefit of reduced computation. To carry out our investigation, we started with the segment-based speech recognition system developed in our group called SUMMIT [44], and modified it to suit our needs. The speech signal was first transformed into a hierarchical segment network using frame-based measurements. Next, acoustic models for each speak...
Discriminative Feature Extraction For Speech Recognition
Proc. IEEE NN-SP Workshop, 1993
Cited by 2 (1 self)
Pattern recognition consists of feature extraction and classification over the extracted features. Usually, these two processes are designed separately, so the resulting recognizer is not necessarily optimal in terms of classification accuracy. To overcome this gap in recognizer design, we introduce in this paper a new design concept, named Discriminative Feature Extraction (DFE). DFE is based on a recent discriminative learning theory, the Minimum Classification Error formalization / Generalized Probabilistic Descent method, and provides an innovative way to design the entire process of recognition. A front-end feature extractor as well as a post-end classifier is consistently optimized under a single criterion of minimizing classification errors. The concept is quite general and can be applied to a wide range of pattern recognition tasks. This paper is devoted to the application of DFE to speech recognition. Experiments on a Japanese vowel recognition task show the advantages of...
Using SOMs As Feature Extractors For Speech Recognition
In International Conference on Acoustics, Speech, and Signal Processing. Piscataway, NJ: IEEE, 1992
Cited by 1 (0 self)
In this paper we demonstrate that the Self-Organizing Maps of Kohonen can be used as speech feature extractors that are able to take temporal context into account. We have investigated two alternatives to use SOMs as such feature extractors, one based on tracing the location of highest activity on a SOM, the other on integrating the activity of the whole SOM for a period of time. The experiments indicated that an improvement is achievable by using these methods. 1. INTRODUCTION The Self-Organizing Map (SOM) algorithm of Kohonen [4],[6] is one of the best known artificial neural network algorithms. SOMs have the ability to construct topology-preserving mappings of the kind that are expected to happen also in the mammalian cortex. SOMs have been used in automatic speech recognition tasks most commonly to derive a vector quantizer, or to do initial clustering to construct a static pattern classifier by the Learning Vector Quantization (LVQ) algorithm [6]. In reference [9] a system is pr...

