Should recognizers have ears? - Speech Communication, 1998
Cited by 44 (3 self)
The paper discusses the author’s experience with applying auditory knowledge to automatic recognition of speech. It indirectly argues against blindly implementing scattered accidental knowledge which may be irrelevant to a speech recognition task. It advances the notion that the reason for applying knowledge of human auditory perception in engineering applications should be the ability of perception to suppress some parts of information in the speech message. Three properties of human speech perception are discussed in some detail: limited spectral resolution, use of information from about syllable-length segments, and the ability to alleviate unreliable cues. Overall, we are advocating selective use of auditory knowledge, optimized on real speech data. [Fig. I: A good hard working man. Fig. II: A foolish man?]
Exploring Temporal Domain for Robustness in Speech Recognition - Proc. of 15th International Congress on Acoustics, 1995
Cited by 10 (5 self)
The paper reviews several techniques which are used in conjunction with short-term analysis and which are reported to be more robust in the presence of noise or other non-linguistic factors. We show that one property common to all such techniques is that they effectively extract speech features from segments of speech longer than 10-20 ms. II. Introduction to the Problem: The communication channel and its noise level most often remain fixed or vary only rather slowly during the conversation. On the other hand, steady configurations of the vocal tract are rare and carry only a little linguistic information. The description of the speech signal as a succession of equally spaced short-term samples originated in speech coding. It assumes that short-term (about 10-20 ms) segments of speech are independent samples from different and unrelated stationary processes. The fundamental linguistic unit is likely to be longer than 10 ms and one frame of short-term analysis result provides descri...
Filtering the Time Sequences of Spectral Parameters for Speech Recognition, 1997
Cited by 10 (1 self)
In automatic speech recognition, the signal is usually represented by a set of time sequences of spectral parameters (TSSPs) that model the temporal evolution of the spectral envelope frame-to-frame. Those sequences are then filtered either to make them more robust to environmental conditions or to compute differential parameters (dynamic features) which enhance discrimination. In this paper, we apply frequency analysis to TSSPs in order to provide an interpretation framework for the various types of parameter filters used so far. Thus, the analysis of the average long-term spectrum of the successfully filtered sequences reveals a combined effect of equalization and band selection that provides insights into TSSP filtering. Also, we show in the paper that, when supplementary differential parameters are not used, the recognition rate can be improved even for clean speech, just by properly filtering the TSSPs. To support this claim, a number of experimental results are presented, bot...
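As a rough illustration of what "filtering a TSSP" can mean, the sketch below treats one spectral parameter as a sequence over frames and applies two simple filters along time: mean removal (which suppresses the zero-frequency component of the sequence) and a regression-style differential ("delta") filter. The function names, the window half-width of 2 frames, and the edge handling are assumptions chosen for the example; they are not taken from the paper.

```python
def remove_mean(trajectory):
    """Subtract the time average from one parameter trajectory (one value per frame)."""
    mean = sum(trajectory) / len(trajectory)
    return [value - mean for value in trajectory]


def delta(trajectory, half_width=2):
    """Regression-style differential parameters over a window of 2*half_width+1 frames.

    Frames beyond either end of the sequence are replaced by the nearest
    edge frame (an assumption made here to keep the output length equal).
    """
    n_frames = len(trajectory)
    denom = 2 * sum(n * n for n in range(1, half_width + 1))
    out = []
    for t in range(n_frames):
        acc = 0.0
        for n in range(1, half_width + 1):
            ahead = trajectory[min(t + n, n_frames - 1)]
            behind = trajectory[max(t - n, 0)]
            acc += n * (ahead - behind)
        out.append(acc / denom)
    return out


# Hypothetical trajectory of a single spectral parameter: a linear ramp.
ramp = [3.0 * t for t in range(8)]
ramp_deltas = delta(ramp)
shifted_deltas = delta([v + 5.0 for v in ramp])  # constant offset added
centered = remove_mean(ramp)
```

Away from the sequence edges the delta filter returns the slope of the ramp, and adding a constant offset to the trajectory leaves the deltas unchanged, which is one way to see both filters as acting on the temporal (frame-to-frame) frequency content of the sequence.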
Model Transformation For Robust Speaker Recognition From Telephone Data - in ICASSP-97, 1997
Cited by 9 (0 self)
In the context of automatic speaker recognition, we propose a model transformation technique that renders speaker models more robust to acoustic mismatches and to data scarcity by appropriately increasing their variances. We use a stereo database containing speech recorded simultaneously under different acoustic conditions to derive a synthetic variance distribution. This distribution is then used to modify the variances of other speaker models from other telephone databases. The technique is illustrated with experiments conducted on a locally collected database and on the NIST'95 and '96 subsets of the Switchboard Corpus. 1. INTRODUCTION Many applications of speaker identification systems (speaker-ID for short) assume that the users access the system remotely. Typically, the channel involved in the communication is that of the telephone. Because the handset and the line can vary from call to call, there is often an acoustic mismatch between the data collected to train the speaker mo...
Auditory Modeling In Automatic Recognition Of Speech, 1997
Cited by 8 (0 self)
The paper argues against blindly implementing scattered accidental knowledge which may be irrelevant to a speech recognition task and advances the notion that the reason for applying knowledge of human auditory perception in engineering applications should be the ability of perception to suppress some parts of information in the speech message. In general, it advocates selective use of auditory knowledge, optimized on real speech data. 1 Ignorance and Knowledge in Handling The Nonlinguistic Variability in ASR: With the advent of powerful stochastic classification techniques, many believe that the task of alleviating the irrelevant variability should be left to the classifier. In the training stage the recognizer would be presented with all the information that is available (i.e., both "signal" and "noise"). Such exhaustive training should allow (during recognition) for a separation of the desired signal from the noise. However, for this to be true, the classifier would have to be a mode...
Model-based scene analysis - Computational Auditory Scene Analysis: Principles, Algorithms, and Applications, chapter 4, 2006
Cited by 5 (2 self)
When multiple sound sources are mixed together into a single channel (or a small number of channels) it is in general impossible to recover the exact waveforms that were mixed – indeed, without some kind of constraints on the form of the component signals it is impossible to separate them at all. These constraints could take several
On the Effects of Short-Term Spectrum Smoothing in Channel Normalization
Cited by 4 (4 self)
We present a simple analysis showing that channel normalization techniques are less effective when applied to spectral energies obtained by (weighted) summation of components of the short-time Fourier power spectrum of speech. We show that applying channel normalization processing prior to critical band integration or linear predictive all-pole modeling improves the effectiveness of the techniques.
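The effect described in this abstract can be reproduced on toy numbers. The sketch below assumes log-spectral mean subtraction as the channel normalization, a fixed channel that multiplies each power-spectrum bin by a constant gain, and unweighted two-bin bands standing in for critical band integration; all data values and function names are hypothetical and are not taken from the paper.

```python
import math


def log_mean_normalize(frames):
    """Take logs and subtract, per dimension, the average over all frames."""
    n_frames = len(frames)
    n_dims = len(frames[0])
    logs = [[math.log(v) for v in frame] for frame in frames]
    means = [sum(row[k] for row in logs) / n_frames for k in range(n_dims)]
    return [[row[k] - means[k] for k in range(n_dims)] for row in logs]


def integrate_bands(frames, bands):
    """Sum power-spectrum bins into bands; each band is a list of bin indices."""
    return [[sum(frame[k] for k in band) for band in bands] for frame in frames]


def apply_channel(frames, gains):
    """Fixed linear channel: every bin is scaled by a constant power gain."""
    return [[v * g for v, g in zip(frame, gains)] for frame in frames]


def normalize_after_integration(frames, bands):
    return log_mean_normalize(integrate_bands(frames, bands))


def normalize_before_integration(frames, bands):
    per_bin = log_mean_normalize(frames)
    linear = [[math.exp(v) for v in frame] for frame in per_bin]
    return integrate_bands(linear, bands)


def max_abs_diff(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


# Toy short-time power spectra: 3 frames, 4 bins (hypothetical values).
clean = [[1.0, 4.0, 2.0, 8.0], [3.0, 1.0, 6.0, 2.0], [2.0, 2.0, 1.0, 5.0]]
channel = [0.5, 2.0, 1.0, 4.0]   # gain varies inside each band
bands = [[0, 1], [2, 3]]
filtered = apply_channel(clean, channel)

err_after = max_abs_diff(normalize_after_integration(clean, bands),
                         normalize_after_integration(filtered, bands))
err_before = max_abs_diff(normalize_before_integration(clean, bands),
                          normalize_before_integration(filtered, bands))
```

With per-bin normalization the channel gain cancels exactly in the log domain, so `err_before` is zero up to rounding. Once bins with different gains have been summed, the gain no longer factors out of the band energy, so `err_after` stays clearly non-zero.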
Noise Suppression And Loudness Normalization In An Auditory Model-Based Acoustic Front-End, 1996
Cited by 2 (0 self)
It is commonly acknowledged that the presence of additive and convolutional noise and speech level variations can seriously deteriorate the performance of a speech recognizer. When an auditory model is used as the acoustic front-end, it turns out that compensation techniques such as spectral subtraction and log-spectral mean subtraction can be outperformed by time-domain techniques operating on the band-pass filtered signals which are supplied to the haircell models. In [1] we showed that additive noise could be removed effectively by means of center clippers put in front of the haircell models. This technique, which was called linear noise magnitude subtraction (NMS), is further improved in this paper. The nonlinear NMS proposed here outperforms the linear one, especially for low signal-to-noise ratios. To compensate for speech level variations and convolutional noise, we have adopted the same philosophy: remove the effects before the signal is supplied to the haircell models. This is accomplished by introducing normalization gains in front of the haircell models. It is shown that this loudness mean normalization (LMN) technique, when used in combination with NMS, offers a highly robust speech representation.
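For readers unfamiliar with center clipping, the following is a minimal sketch of one common form of center clipper applied to the samples of a single band-pass filtered signal. It assumes the variant that zeroes samples below a threshold and shrinks larger samples toward zero by that threshold; the threshold value and the sample data are hypothetical, and the abstract does not specify how the paper estimates the noise magnitude or how its nonlinear NMS differs.

```python
def center_clip(samples, threshold):
    """Zero samples with magnitude <= threshold; shrink the rest toward zero by threshold."""
    clipped = []
    for x in samples:
        if x > threshold:
            clipped.append(x - threshold)
        elif x < -threshold:
            clipped.append(x + threshold)
        else:
            clipped.append(0.0)
    return clipped


# Hypothetical band-pass signal: low-level noise plus a few stronger speech samples.
band_signal = [0.05, -0.02, 0.80, -0.60, 0.01, 0.30]
noise_magnitude = 0.1  # assumed estimate of the noise magnitude in this band
clipped = center_clip(band_signal, noise_magnitude)
```

Samples whose magnitude stays below the assumed noise level are removed entirely, while stronger samples are reduced by that level, which is the sense in which the operation subtracts a noise magnitude in the time domain.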
A Unified Spectral Transformation Adaptation Approach for Robust Speech Recognition
In this paper, Canonical Correlation Based Compensation (CCBC) is proposed as a unified approach to cope with the mismatch between training and test sets. The mismatch between training and test conditions can be grouped into three classes: differences between speakers, changes of recording channel, and effects of a noisy environment. In previous work, we successfully used the CCBC approach, with some modifications, to make our speech recognizer robust to noisy environments [1]. Recently, the same approach has been extended to speaker and channel adaptation. The results of our experiments show that the CCBC approach compensates well for all three kinds of distortion between training and test conditions. In order to compare the performance of CCBC with that of some conventional adaptation approaches, cepstral mean normalization, RASTA and Lin-Log RASTA are also tested. We find that CCBC performs better than these techniques. The selection of appropriate reference speech data, a very important problem in the CCBC approach, is also discussed in this paper.
[dsp EDUCATION] Research Developments and Directions in Speech Recognition and Understanding, Part 1
To advance research, it is important to identify promising future research directions, especially those that have not been adequately pursued or funded in the past. The working group producing this article was charged to elicit from the human language technology (HLT) community a set of well-considered directions or rich areas for future research that could lead to major paradigm shifts in the field of automatic speech recognition (ASR) and understanding. ASR has been an area of great interest and activity to the signal processing and HLT communities over the past several decades. As a first step, this group reviewed major developments in the field and the circumstances that led to their success and then focused on areas it deemed especially fertile for future research. Part 1 of this article will focus on historically significant developments in the ASR area, including several major research efforts that were guided by different funding agencies, and suggest general areas in which to focus research. Part 2 (to appear in the next issue) will explore in more detail several new avenues holding promise for substantial improvements in ASR performance. These entail cross-disciplinary research and specific approaches to address three-to-five-year grand challenges aimed at stimulating advanced research by dealing with realistic tasks of broad interest.

