Decoding Speech In The Presence Of Other Sources
- SPEECH COMMUNICATION, 2002
Cited by 34 (11 self)
Acoustic interference is arguably the most serious problem facing current speech recognisers. The maturation of statistical pattern recognition techniques has brought very low word error rates when both training and test material consist solely of speech. However, in real-world situations, any speech signal of interest will be mixed with background noises coming from the full range of sources encountered in our acoustic environment. In this paper,
Decoding Speech In The Presence Of Other Sound Sources
- IN PROC. ICSLP ’00, 2000
Cited by 15 (6 self)
Conventional speech recognition is notoriously vulnerable to additive noise, and even the best compensation methods are defeated if the noise is nonstationary. To address this problem, we propose a new integration of bottom-up techniques to identify 'coherent fragments' of spectro-temporal energy (based on local features), with the top-down hypothesis search of conventional speech recognition, extended to search also across possible assignments of each fragment as speech or interference. Initial tests demonstrate the feasibility of this approach, and achieve a reduction in word error rate of more than 25% relative at 5 dB SNR over stationary noise missing data recognition.
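The idea of searching across fragment assignments can be illustrated with a brute-force sketch. This is not the decoder described in the paper, which integrates the search into recognition; here every speech/interference labelling is simply enumerated and scored. The names `candidate_masks` and `best_assignment`, the representation of a fragment as a list of (frame, channel) cells, and the `score` callback are all assumptions made for illustration.

```python
from itertools import product


def candidate_masks(fragments, n_frames, n_channels):
    """Yield (assignment, mask) for every speech/interference labelling.

    fragments: list of fragments, each a list of (frame, channel) cells.
    A mask cell is True when it belongs to a fragment labelled as speech.
    """
    for assignment in product((True, False), repeat=len(fragments)):
        mask = [[False] * n_channels for _ in range(n_frames)]
        for is_speech, cells in zip(assignment, fragments):
            if is_speech:
                for t, f in cells:
                    mask[t][f] = True
        yield assignment, mask


def best_assignment(fragments, n_frames, n_channels, score):
    """Return the labelling whose mask maximises a caller-supplied score.

    `score` is a hypothetical stand-in for the recogniser's likelihood of
    the best word sequence given the mask.
    """
    return max(
        candidate_masks(fragments, n_frames, n_channels),
        key=lambda pair: score(pair[1]),
    )[0]
```

With N fragments the enumeration visits 2^N labellings, so it is only practical for very small N.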
Mask estimation for missing data speech recognition based on statistics of binaural interaction
- IEEE Transactions on Speech and Audio Processing, 2006
Cited by 12 (0 self)
This paper describes a perceptually motivated computational auditory scene analysis (CASA) system that combines sound separation according to spatial location with the “missing data” approach for robust speech recognition in noise. Missing data time–frequency masks are created using probability distributions based on estimates of interaural time and level differences (ITD and ILD) for mixed utterances in reverberated conditions; these masks indicate which regions of the spectrum constitute reliable evidence of the target speech signal. A number of experiments compare the relative efficacy of the binaural cues when used individually and in combination. We also investigate the ability of the system to generalize to acoustic conditions not encountered during training. Performance on a continuous digit recognition task using this method is found to be good, even in a particularly challenging environment with three concurrent male talkers.
Index Terms: automatic speech recognition, binaural, computational auditory scene analysis (CASA), interaural level differences (ILD), interaural time differences (ITD), missing data, reverberation.
From Missing Data To Maybe Useful Data: Soft Data Modelling For Noise Robust ASR
- 2001
Cited by 12 (4 self)
Much research has been focused on the problem of achieving automatic speech recognition (ASR) which approaches human recognition performance in its level of robustness to noise and channel distortion. We present here a new approach to data modelling which has the potential to combine complementary existing state-of-the-art techniques for speech enhancement and noise adaptation into a single process. In the "missing feature theory" (MFT) based approach to noise robust ASR, misinformative spectral data is detected and then ignored. Recent work has shown that MFT ASR greatly improves when the usual hard decision to exclude data features is softened by a continuous weighting between the likelihood contributions normally used for "good" and "bad" data. The new model presented here can be seen as arising from a generalisation of this "soft missing data" approach, in which the implicit good-bad mixture pdf is modelled explicitly as the data posterior pdf. Initial "soft data" experiments compar...
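The "continuous weighting" between good and bad likelihood contributions can be sketched for a single spectral feature and a single Gaussian. The abstract does not give the form of either contribution, so this is an assumption: the good-data term is the Gaussian density at the observed value, and the bad-data term is a bounded marginalisation (the Gaussian mass between 0 and the observed value, divided by the width of that interval). The function names are hypothetical.

```python
import math


def gauss_pdf(x, mean, var):
    """Density of a univariate Gaussian at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def gauss_cdf(x, mean, var):
    """Cumulative distribution of a univariate Gaussian at x."""
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * var)))


def soft_likelihood(x, weight, mean, var):
    """Blend 'good' and 'bad' data contributions for one feature.

    weight: belief in [0, 1] that the feature is reliable.
    Assumes x > 0 (for example, a spectral energy), since the bad-data
    term averages the density over the interval [0, x].
    """
    good = gauss_pdf(x, mean, var)
    bad = (gauss_cdf(x, mean, var) - gauss_cdf(0.0, mean, var)) / x
    return weight * good + (1.0 - weight) * bad
```

Setting `weight` to 1 or 0 recovers the hard "present" and "missing" decisions respectively; intermediate values interpolate between them.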
Enhanced robot speech recognition based on microphone array source separation and missing feature theory
- In Proceedings of IEEE International Conference on Robotics and Automation (ICRA 2005). IEEE, 2005
Cited by 11 (7 self)
A humanoid robot in a real-world environment usually hears mixtures of sounds, so three capabilities are essential for robot audition: sound source localization, separation, and recognition of separated sounds. While the first two are frequently addressed, the last has received less attention. We present a system that gives a humanoid robot the ability to localize, separate and recognize simultaneous sound sources. A microphone array is used along with a real-time dedicated implementation of Geometric Source Separation (GSS) and a multi-channel post-filter that further reduces interference from other sources. An automatic speech recognizer (ASR) based on the Missing Feature Theory (MFT) recognizes separated sounds in real time by generating missing feature masks automatically from the post-filtering step. The main advantage of this approach for humanoid robots is that the ASR, using a clean acoustic model, can adapt to the distortion of the separated sound by consulting the post-filter feature masks. Recognition rates are presented for three simultaneous speakers located 2 m from the robot. Use of both the post-filter and the missing feature mask results in an average reduction in error rate of 42% (relative).
Detection of reliable features for speech recognition in noisy conditions using a statistical criterion
- Proc. CRAC’01, 2001
Cited by 11 (0 self)
This paper addresses the problem of integration of missing data theory in the context of robust speech recognition in additive noise. It shows that techniques based on statistical estimation and thresholding of a posteriori signal-to-noise ratio (SNR) can be used for the detection of reliable (not much affected by noise) features as opposed to unreliable or missing (masked by noise) features. In the paper, a statistical detector for reliable features is proposed and tested for several values of deterministic and probabilistic thresholds at very low SNRs (from 20 to -10 dB). The limitations of the detector are also studied and measures for the evaluation of the performance of such a detection are proposed.
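A minimal sketch of thresholding the a posteriori SNR with a fixed (deterministic) threshold follows. It is not the detector proposed in the paper. It assumes the noise power in each channel can be estimated by averaging a few leading frames that contain no speech, and it takes the a posteriori SNR to be the ratio of observed noisy power to that noise estimate. The function names and parameters are hypothetical.

```python
import math


def estimate_noise(frames, n_lead):
    """Average the first n_lead frames per channel (assumed speech-free)."""
    n_channels = len(frames[0])
    return [
        sum(frame[c] for frame in frames[:n_lead]) / n_lead
        for c in range(n_channels)
    ]


def reliability_mask(frames, noise, threshold_db):
    """Mark a feature reliable when its a posteriori SNR exceeds the threshold.

    frames: per-frame lists of noisy spectral power, one value per channel.
    noise: estimated noise power per channel.
    """
    mask = []
    for frame in frames:
        row = []
        for power, noise_power in zip(frame, noise):
            if power > 0 and noise_power > 0:
                snr_db = 10.0 * math.log10(power / noise_power)
            else:
                snr_db = float("-inf")  # treat degenerate cells as unreliable
            row.append(snr_db > threshold_db)
        mask.append(row)
    return mask
```

Raising `threshold_db` trades more missed reliable features for fewer noise-dominated features wrongly marked reliable.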
On binary and ratio time-frequency masks for robust speech recognition
- In ICSLP, Jeju, Korea, 2004
Cited by 11 (4 self)
A time-varying Wiener filter extracts the speech signal from a noisy mixture using the a priori signal-to-noise ratio in a local time-frequency unit. We estimate this ratio using a binaural processor and derive a ratio time-frequency mask. This mask is used to extract the speech signal, which is then fed to a conventional speech recognizer operating in the cepstral domain. We compare the performance of this system with a missing data recognizer that operates in the spectral domain using the time-frequency units dominated by speech. For use by the missing data recognizer, the same processor is used to estimate an ideal time-frequency binary mask, which selects the speech signal if it is stronger than the interference in a local time-frequency unit. We find that the performance of the missing data recognizer is better on a small vocabulary recognition task but the performance of the conventional recognizer is substantially better when the vocabulary size is larger.
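The two mask types can be contrasted in a short sketch. It assumes the speech and interference powers in each time-frequency unit are known directly, whereas the paper estimates them with a binaural processor. With a priori SNR ξ = S/N, the Wiener gain ξ / (1 + ξ) reduces to S / (S + N). The function names are hypothetical.

```python
def ratio_mask(speech_power, noise_power):
    """Wiener-style gain S / (S + N) for each time-frequency unit."""
    return [
        [s / (s + n) if (s + n) > 0 else 0.0 for s, n in zip(s_row, n_row)]
        for s_row, n_row in zip(speech_power, noise_power)
    ]


def binary_mask(speech_power, noise_power):
    """1 where speech is stronger than the interference, else 0."""
    return [
        [1 if s > n else 0 for s, n in zip(s_row, n_row)]
        for s_row, n_row in zip(speech_power, noise_power)
    ]
```

Under these assumptions the binary mask equals the ratio mask thresholded at 0.5, which corresponds to a local SNR of 0 dB.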
A Neural Oscillator Sound Separator for Missing Data Speech Recognition
- IN PROC. IJCNN, 2001
Cited by 11 (5 self)
In order to recognise speech in a background of other sounds, human listeners must solve two perceptual problems. First, the mixture of sounds reaching the ears must be parsed to recover a description of each acoustic source, a process termed 'auditory scene analysis'. Second, recognition of speech must be robust even when the acoustic evidence is missing due to masking by other sounds. This paper describes an automatic speech recognition system that addresses both of these issues, by combining a neural oscillator model of auditory scene analysis with a framework for 'missing data' recognition of speech.

