Signal modeling techniques in speech recognition
- PROCEEDINGS OF THE IEEE
, 1993
Cited by 99 (5 self)
We have seen three important trends develop in the last five years in speech recognition. First, heterogeneous parameter sets that mix absolute spectral information with dynamic, or time-derivative, spectral information, have become common. Second, similarity transform techniques, often used to normalize and decorrelate parameters in some computationally inexpensive way, have become popular. Third, the signal parameter estimation problem has merged with the speech recognition process so that more sophisticated statistical models of the signal’s spectrum can be estimated in a closed-loop manner. In this paper, we review the signal processing components of these algorithms. These algorithms are presented as part of a unified view of the signal parameterization problem in which there are three major tasks: measurement, transformation, and statistical modeling. This paper is by no means a comprehensive survey of all possible techniques of signal modeling in speech recognition. There are far too many algorithms in use today to make an exhaustive survey feasible (and cohesive). Instead, this paper is meant to serve as a tutorial on signal processing in state-of-the-art speech recognition systems and to review those techniques most commonly used. In keeping with this goal, a complete mathematical description of each algorithm has been included in the paper.
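As a rough illustration of the first trend named in this abstract (static spectral parameters mixed with time-derivative ones), the sketch below appends regression-based delta coefficients to each static feature vector. It is not taken from the paper: the function names, the window length of 2, and the edge handling (repeating the first and last frame) are assumptions chosen for the example.

```python
def delta_features(frames, window=2):
    """Regression-based delta (time-derivative) coefficients.

    frames: list of equal-length lists of floats, one static vector per frame.
    Edges are handled by repeating the first/last frame (an assumption).
    """
    num_frames = len(frames)
    # Normalizer of the least-squares slope estimate over +/- window frames.
    denom = 2.0 * sum(n * n for n in range(1, window + 1))
    deltas = []
    for t in range(num_frames):
        row = []
        for k in range(len(frames[t])):
            acc = 0.0
            for n in range(1, window + 1):
                ahead = frames[min(t + n, num_frames - 1)][k]
                behind = frames[max(t - n, 0)][k]
                acc += n * (ahead - behind)
            row.append(acc / denom)
        deltas.append(row)
    return deltas


def append_deltas(frames, window=2):
    """Build a heterogeneous vector: static coefficients followed by deltas."""
    return [s + d for s, d in zip(frames, delta_features(frames, window))]
```

For a feature track that rises linearly over time, the delta of an interior frame equals the slope of the ramp, which is a quick sanity check on the regression weights.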
Extraction of Visual Features for Lipreading
- IEEE Transactions on Pattern Analysis and Machine Intelligence
, 2002
Cited by 36 (1 self)
The multi-modal nature of speech is often ignored in human-computer interaction, but lip deformation and other body motion, such as head and arm motion, all convey additional information. We integrate speech cues from many sources, and this improves intelligibility, especially when the acoustic signal is degraded. This paper shows how this additional, often complementary, visual speech information can be used for speech recognition. Three methods for parameterising lip image sequences for recognition using hidden Markov models are compared. Two of these are top-down approaches that fit a model of the inner and outer lip contours and derive lipreading features from a principal component analysis of shape, or shape and appearance respectively. The third, bottom-up, method uses a non-linear scale-space analysis to form features directly from the pixel intensity. All methods are compared on a multi-talker visual speech recognition task of isolated letters.
Robust Text-Independent Speaker Identification over Telephone Channels
- IEEE Trans. on Speech and Audio Processing
, 1997
Cited by 10 (1 self)
This paper addresses the issue of closed-set text-independent speaker identification from samples of speech recorded over the telephone. It focuses on the effects of acoustic mismatches between training and testing data, and concentrates on two approaches: extracting features that are robust against channel variations, and transforming the speaker models to compensate for channel effects. First, an experimental study shows that optimizing the front end processing of the speech signal can significantly improve speaker recognition performance. A new filterbank design is introduced to improve the robustness of the speech spectrum computation in the front-end unit. Next, a new feature based on spectral slopes is described. Its ability to discriminate between speakers is shown to be superior to that of the traditional cepstrum. This feature can be used alone or combined with the cepstrum. The second part of the paper presents two model transformation methods that further reduce channel effe...
Robust Feature-Estimation and Objective Quality Assessment for Noisy Speech Recognition using the Credit Card Corpus
, 1994
Cited by 7 (2 self)
It is well known that the introduction of acoustic background distortion into speech causes recognition algorithms to fail. In order to improve the environmental robustness of speech recognition in adverse conditions, a novel constrained-iterative feature-estimation algorithm, which was previously formulated for speech enhancement, is considered and shown to produce improved feature characterization in a variety of actual noise conditions such as computer fan, large crowd, and voice communications channel noise. In addition, an objective measure based MAP estimator is formulated as a means of predicting changes in robust recognition performance at the speech feature extraction stage. The four measures considered include (i) NIST SNR, (ii) Itakura-Saito log-likelihood, (iii) log-area-ratio, and (iv) the weighted-spectral slope measure. A continuous distribution, monophone based, hidden Markov model recognition algorithm is used for objective measure based MAP estimator analysis and reco...
Acoustic Backing-Off As An Implementation Of Missing Feature Theory
- Speech Communication
, 1999
Cited by 5 (5 self)
Acoustic backing-off was recently proposed as an operationalisation of missing feature theory for increased recognition robustness. Acoustic backing-off effectively removes the detrimental influence of outlier values from the local decisions in the Viterbi algorithm without any kind of explicit outlier detection. In the context of connected digit recognition over telephone lines, it is shown that with more than 30% of the static mel-frequency cepstral coefficients disturbed, acoustic backing-off is capable of reducing the word error rate by one order of magnitude. Furthermore, our results indicate that the effectiveness of acoustic backing-off is optimal when dispersion of distortions due to acoustic feature transformations is minimal.
1. INTRODUCTION
Recently, it was shown that missing feature theory can be used for improved robustness of automatic speech recognition (ASR) systems [1], [2]. According to missing feature theory, recognition performance in adverse conditions can be mai...
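One way to picture how outliers can lose their influence on local Viterbi decisions without any explicit detection step is to bound the per-coefficient cost. The sketch below does this by mixing the state's Gaussian density with a flat fallback density. The mixture form, the function names, and the parameters eps and floor_density are assumptions made for illustration; the abstract above does not specify the formulation used in the paper.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def gaussian_distance(x, mean, var):
    """Conventional local cost: negative log Gaussian likelihood.

    Computed in the log domain, so it grows without bound for outliers.
    """
    return 0.5 * math.log(2.0 * math.pi * var) + 0.5 * (x - mean) ** 2 / var


def backoff_distance(x, mean, var, eps=0.1, floor_density=0.01):
    """Robust local cost (hypothetical form): negative log of a mixture of
    the Gaussian and a flat fallback density.

    The cost can never exceed -log(eps * floor_density), so a single
    disturbed coefficient cannot dominate the accumulated path score.
    """
    return -math.log((1.0 - eps) * gaussian_pdf(x, mean, var) + eps * floor_density)
```

Near the mean the two costs are close, so undisturbed coefficients are scored almost as before; far from the mean the robust cost saturates while the conventional one keeps growing quadratically.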
Acoustic Features and Distance Measure to Reduce Vulnerability of ASR Performance Due to the Presence of a Communication Channel and/or Background Noise
, 2001
Cited by 3 (2 self)
this paper we will only discuss results for off-line experiments
ICARUS: Source Generator Based Real-time Recognition of Speech in Noisy Stressful and Lombard Effect Environments
, 1995
Cited by 2 (1 self)
The problem of real-time automatic speech recognition in an adverse environment is addressed in this paper. Though much research has been performed in the area of speech recognition, only limited success has been demonstrated for real-time recognition in noisy, stressful environments. The primary reason for this is that the performance of present-day recognition algorithms is predicated on the assumptions of the environmental settings in which the algorithms have been formulated and implemented. In this paper, we discuss the effects of additive background noise on speech quality and recognition parameters, and propose a source generator based framework to address stress and noise. Using this framework, a computationally efficient real-time recognition system called ICARUS is developed. The speech recognition system incorporates direct processing steps to address the effects of additive noise on the speech signal and stress on the speech production system. Central issues which are addre...
A Probabilistic Method for Tracking a Vocalist
, 1998
Cited by 2 (0 self)
When a musician gives a recital or concert, the music performed generally includes accompaniment. To render a good performance, the soloist and the accompanist must know the musical score and must follow the other musician's performance. Both performing and rehearsing are limited by constraints on the time and money available for bringing musicians together. Computer systems that automatically provide musical accompaniment offer an inexpensive, readily available alternative. Effective computer accompaniment requires software that can listen to live performers and follow along in a musical score. This work presents an implemented system and method for automatically accompanying a singer given a musical score. Specifically, I offer a method for robust, real-time detection of a singer's score position and tempo. Robust score following requires combining information obtained both from analyzing a complex signal (the singer's performance) and from processing symbolic notation (the score). Unfortunately, the mapping from the available information to score position does not define a function. Consequently, this work investigated a statistical characterization of a singer's score position and a model that combines the available musical information to produce a probabilistic position estimate. By making
RECNET - the speech recognition system
, 1996
words @INIT and @QUIT are desired by the parser. Dictionary should be lexicographically sorted.
MAP-Based Perceptual Modeling for Noisy Speech Recognition
This study presents a maximum a posteriori (MAP) based perceptual modeling approach to deal with recognition degradation in noisy environments. In this approach, MAP-based noise detection is first applied to identify the noise segments in an utterance. A subtractive-type enhancement algorithm that uses the masking properties of the human auditory system is then applied to reduce the effect of the noise. Finally, MAP-based incremental noise model adaptation is developed to overcome the model inconsistencies between training and testing environments. To evaluate the proposed approach, a Mandarin keyword recognition system was constructed. The experimental results show that the proposed approach achieves a better recognition rate than the audible noise suppression (ANS) and parallel model combination (PMC) methods.

