Results 11 - 20 of 20
Audio-Visual Graphical Models for Speech Processing
Perceiving sounds in a noisy environment is a challenging problem. Visual lip-reading can provide relevant information but is also challenging because lips are moving and a tracker must deal with a variety of conditions. Typically, audio-visual systems have been assembled from individually engineered modules. We propose to fuse audio and video in a probabilistic generative model that implements cross-modal self-supervised learning, enabling adaptation to audio-visual data. The video model features a Gaussian mixture model embedded in a linear subspace of a sprite which translates in the video. The system can learn to detect and enhance speech in noise given only a short (30-second) sequence of audio-visual data. We show some results for speech detection and enhancement, and discuss extensions to the model that are under investigation.
Graphical Models for Robust Speech Recognition in Adverse Environments
, 2008
Robust speech recognition in acoustic environments that contain multiple speech sources and/or complex non-stationary noise is a difficult problem, but one of great practical interest. The formalism of probabilistic graphical models constitutes a relatively new and very powerful tool for better understanding and extending existing models, learning, and inference algorithms; and a bedrock for the creative, quasi-systematic development of new ones. In this thesis a collection of new graphical models and inference algorithms for robust speech recognition are presented. The problem of speech separation using multiple microphones is first treated. A family of variational algorithms for tractably combining multiple acoustic models of speech with observed sensor likelihoods is presented. The algorithms recover high-quality estimates of the speech sources even when there are more sources than microphones, and have improved upon the state of the art in terms of SNR gain by over 10 dB. Next the problem of background compensation in non-stationary acoustic environments is treated. A new dynamic noise adaptation (DNA) algorithm for robust noise compensation is presented, and shown to outperform several existing state-of-the-art front-end denoising systems on the new DNA + Aurora II and Aurora II-M extensions of the Aurora II task. Finally, the problem of recognizing speech in the presence of competing speech using a single microphone is treated. The Iroquois system for multi-talker speech separation and recognition is presented. The system won the 2006 Pascal International Speech Separation Challenge and, remarkably, achieved super-human recognition performance on a majority of test cases in the task. The result marks a significant first in automatic speech recognition, and a milestone in computing.
1 Audiovisual Speech Recognition: Introduction and an Approach to Multimodal Fusion with Uncertain Features Research Area: Artificial Intelligence and Human-Computer Interaction
Multimodal Fusion and Learning with Uncertain Features Applied to Audiovisual Speech Recognition
We study the effect of uncertain feature measurements and show how classification and learning rules should be adjusted to compensate for it. Our approach is particularly fruitful in multimodal fusion scenarios, such as audio-visual speech recognition, where multiple streams of complementary features whose reliability is time-varying are integrated. For such applications, by taking the measurement noise uncertainty of each feature stream into account, the proposed framework leads to highly adaptive multimodal fusion rules for classification and learning which are widely applicable and easy to implement. We further show that previous multimodal fusion methods relying on stream weights fall under our scheme under certain assumptions; this provides novel insights into their applicability for various tasks and suggests new practical ways for estimating the stream weights adaptively. The potential of our approach is demonstrated in audio-visual speech recognition experiments.
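The variance-inflation idea behind uncertainty compensation can be illustrated with a minimal one-dimensional Gaussian classifier. This is only a sketch of the general principle, not the paper's implementation; all function and parameter names here are made up. The measurement-noise variance of a feature is simply added to the class-model variance before scoring:

```python
import math

def gaussian_loglik(x, mean, var):
    """Log-likelihood of x under a 1-D Gaussian N(mean, var)."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def compensated_loglik(x, mean, var, noise_var):
    """Uncertainty-compensated score: the measurement-noise variance
    is added to the model variance before scoring (variance inflation)."""
    return gaussian_loglik(x, mean, var + noise_var)

def classify(x, noise_var, classes):
    """Pick the class with the highest compensated likelihood.
    `classes` maps label -> (mean, var); purely illustrative."""
    return max(classes, key=lambda c: compensated_loglik(x, *classes[c], noise_var))
```

With a large `noise_var` the class likelihoods flatten toward each other, which is one way to see why unreliable streams should contribute less to a fused decision.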
Speech Enhancement Using Gaussian Scale Mixture Models
This paper presents a novel probabilistic approach to speech enhancement. Instead of a deterministic logarithmic relationship, we assume a probabilistic relationship between the frequency coefficients and the log-spectra. The speech model in the log-spectral domain is a Gaussian mixture model (GMM). The frequency coefficients follow a zero-mean Gaussian whose variance equals the exponential of the log-spectra. This results in a Gaussian scale mixture model (GSMM) for the speech signal in the frequency domain, since the log-spectra can be regarded as scaling factors. The probabilistic relation between frequency coefficients and log-spectra allows these to be treated as two random variables, both to be estimated from the noisy signals. Expectation-maximization (EM) was used to train the GSMM and Bayesian inference was used to compute the posterior signal distribution. Because exact inference of this full probabilistic model is computationally intractable, we developed two approaches to enhance the efficiency: the Laplace method and a variational approximation. The proposed methods were applied to enhance speech corrupted by Gaussian noise and speech-shaped noise (SSN). For both approximations, signals reconstructed from the estimated frequency coefficients provided higher signal-to-noise ratio (SNR), and those reconstructed from the estimated log-spectra produced lower word recognition error rate, because the log-spectra fit the inputs to the recognizer better. Our algorithms effectively reduced the SSN, which algorithms based on spectral analysis were not able to suppress.
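The generative story of a GSMM can be sketched in a few lines. This is a toy sampler for a single frequency coefficient with hypothetical parameter names, assuming a diagonal GMM prior on the log-spectrum; the EM training and the Laplace/variational inference from the paper are not shown:

```python
import math
import random

def sample_gsmm(weights, means, variances, rng=None):
    """Draw one frequency coefficient from a Gaussian scale mixture:
    first a log-spectrum s from a GMM prior, then x ~ N(0, exp(s)),
    so exp(s) acts as a random variance (scale) on the coefficient."""
    rng = rng or random.Random()
    # pick a mixture component for the log-spectrum
    k = rng.choices(range(len(weights)), weights=weights)[0]
    s = rng.gauss(means[k], math.sqrt(variances[k]))
    # coefficient is zero-mean Gaussian with variance exp(s)
    x = rng.gauss(0.0, math.sqrt(math.exp(s)))
    return x, s
```

Because the variance `exp(s)` is itself random, the marginal distribution of `x` is heavier-tailed than any single Gaussian, which is the property the GSMM exploits.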
Model-Based Fusion of Bone and Air Sensors for Speech Enhancement and Robust Speech Recognition
, 2004
We present a probabilistic framework that uses a bone sensor and air microphone to perform speech enhancement for robust speech recognition. The system exploits advantages of both sensors: the noise resistance of the bone sensor, and the linearity of the air microphone. In this paper we describe the general properties of the bone sensor relative to conventional air sensors. We propose a model capable of adapting to the noise conditions, and evaluate performance using a commercial speech recognition system. We demonstrate considerable improvements in recognition – from a baseline of 57% up to nearly 80% word accuracy – for four subjects on a difficult condition with background speaker interference.
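The idea of letting each sensor contribute in proportion to its reliability can be illustrated with generic inverse-variance (precision-weighted) fusion of two noisy measurements of the same clean-speech quantity. This is a standard textbook sketch under a Gaussian-noise assumption, not the paper's adaptive model:

```python
def fuse_estimates(x_air, var_air, x_bone, var_bone):
    """Precision-weighted fusion of two noisy estimates of one value:
    each sensor is weighted by the inverse of its noise variance, so
    the less noisy sensor dominates; the fused variance is always
    smaller than either input variance."""
    w_air = 1.0 / var_air
    w_bone = 1.0 / var_bone
    fused = (w_air * x_air + w_bone * x_bone) / (w_air + w_bone)
    fused_var = 1.0 / (w_air + w_bone)
    return fused, fused_var
```

In the bone/air setting the roles shift with conditions: in loud background noise the bone sensor's variance stays low while the air microphone's rises, so the fusion weight moves toward the bone channel.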
Robust Speech Separation Using Time-Frequency Masking
- Proceedings of the 2003 IEEE Conference on Multimedia and Expo (ICME 2003)
, 2003
A multi-microphone time-frequency speech masking technique is proposed. This technique utilizes both the time-frequency magnitude and phase information in order to estimate the Signal-to-Noise Ratio (SNR) maximizing masking coefficients for each time-frequency block, given that the direction (or, alternatively, the time-delay of arrival) of the speaker of interest is known. Using this masking algorithm, speech features (such as formants) from the direction of interest are preserved while features from other directions are severely degraded. Digit recognition experiments indicate that the proposed technique can result in a substantial increase in the digit recognition accuracy rate. At 0 dB, for example, the proposed technique results in a digit recognition accuracy rate improvement of 26% over the single microphone case and an improvement of 12% over the two-microphone superdirective beamforming case.
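A Wiener-style soft mask conveys the flavor of per-bin SNR-based masking. Note this is a single-channel toy: the paper's coefficients are estimated from multi-microphone magnitude and phase given the speaker direction, which is not reproduced here:

```python
def wiener_style_mask(snr_estimates):
    """Per-bin soft mask from estimated SNR: m = SNR / (SNR + 1),
    approaching 1 where the target dominates and 0 where noise does.
    Illustrative stand-in for the paper's SNR-maximizing coefficients."""
    return [s / (s + 1.0) for s in snr_estimates]

def apply_mask(spectrum, mask):
    """Scale each time-frequency magnitude by its mask coefficient."""
    return [m * x for m, x in zip(mask, spectrum)]
```

Bins dominated by the target pass through nearly untouched, while bins dominated by interference are attenuated, which is why formant structure from the look direction survives the masking.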
Glossary
, 2004
This report presents uncertainty decoding as a method for robust automatic speech recognition for the Noise Robust Automatic Speech Recognition project funded by Toshiba Research Europe Limited. The effects of noise on speech recognition are reviewed and a general framework for noise robust speech recognition introduced. Common and related noise robustness techniques are described in the context of this framework. Uncertainty decoding is also presented in this framework, with the goal of providing fast noise compensation through the propagation of uncertainty to the decoder. Two forms are discussed, the Joint and SPLICE methods, and evaluated on the medium vocabulary Resource Management corpus at a range of artificially produced noise levels. It was found that the uncertainty decoding algorithms did not meet the performance of a matched system, but were more accurate than the baseline SPLICE enhancement technique and systems using small numbers of CMLLR transforms.
VARIATIONAL BAYESIAN LEARNING OF SPEECH GMMS FOR FEATURE ENHANCEMENT BASED ON ALGONQUIN
Many feature enhancement methods make use of probabilistic models of speech and noise in order to improve performance of speech recognizers in the presence of background noise. The traditional approach for training such models is maximum likelihood estimation. This paper investigates the novel application of variational Bayesian learning for front-end models under the Algonquin denoising framework. Compared to maximum likelihood training, it is shown that variational Bayesian learning has advantages both in terms of increased robustness with respect to choice of model complexity, as well as increased performance.
New Strategies for Single-channel Speech Separation
, 2010