Results 1–10 of 11
Single microphone source separation using high resolution signal reconstruction
In Proc. ICASSP ’04, 2004
Cited by 23 (3 self)
Abstract
We present a method for separating two speakers from a single microphone channel. The method exploits the fine structure of male and female speech and relies on a strong high-frequency-resolution model for the source signals. The algorithm is able to identify the correct combination of male and female speech that best explains an observation and to reconstruct the component signals, relying on prior knowledge to ‘fill in’ regions that are masked by the other speaker. The two-speaker, single-microphone source separation problem is one of the most challenging source separation scenarios, and few quantitative results have been reported in the literature. We provide a test set based on the Aurora 2 data set and report performance numbers on a portion of this set. We achieve an average SNR increase of 6.59 dB for female speakers and 5.51 dB for male speakers.
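The combination search the abstract describes (pick the pair of speaker-model states that best explains the mixture, then fill masked regions from the prior) can be sketched with a toy log-max model. The codebooks, their sizes, and the binary-mask reconstruction below are illustrative assumptions, not the paper's actual high-resolution models:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy per-speaker codebooks of log-power spectra (stand-ins for trained priors).
F, K = 8, 4                                   # frequency bins, codewords per speaker
codebook_a = rng.normal(0.0, 1.0, (K, F))     # speaker A model
codebook_b = rng.normal(0.0, 1.0, (K, F))     # speaker B model

# Under the log-max approximation, the mixture log-spectrum is roughly the
# element-wise max of the two source log-spectra.
true_a, true_b = codebook_a[1], codebook_b[2]
mix = np.maximum(true_a, true_b)

# Identify the codeword pair that best explains the observation.
best, best_err = None, np.inf
for i in range(K):
    for j in range(K):
        err = np.sum((mix - np.maximum(codebook_a[i], codebook_b[j])) ** 2)
        if err < best_err:
            best, best_err = (i, j), err

# Reconstruct speaker A: bins where A dominates come from the mixture, and
# bins masked by speaker B are 'filled in' from A's prior (the codeword).
i, j = best
est_a = np.where(codebook_a[i] >= codebook_b[j], mix, codebook_a[i])
```

Here the true pair is recovered exactly because the mixture was generated from the codebooks; with real speech the codebooks only approximate the sources, and the search runs per frame.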
Model-based scene analysis
In , 2006
Cited by 10 (2 self)
Abstract
The general problem of separating multiple sources mixed into a single channel is ill-posed; in order to solve it, additional constraints must be applied. One very general class of constraints is prior knowledge of the source signals: limitations on the waveforms possible for each source can disambiguate the mixture, since there may be only one set of ‘legal’ source signals that would add together to give the observed mixture. A particularly attractive aspect of this approach is that the models themselves may be learned from the environment by generalizing across observations of isolated sources. This chapter examines the general idea of using models in source separation, including what form these models can take and how they can be acquired, and describes examples of several systems that can be described within this framework.
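A minimal sketch of the ‘legal signals’ constraint, assuming each source is restricted to a tiny hand-picked set of waveforms (the numbers are arbitrary): without the priors, infinitely many decompositions explain the mixture; with them, enumeration leaves exactly one.

```python
import numpy as np

# Each source is constrained to a small set of "legal" waveforms (its prior).
legal_a = np.array([[1.0, 0.0, 2.0],
                    [0.0, 1.0, 1.0]])
legal_b = np.array([[3.0, 1.0, 0.0],
                    [1.0, 2.0, 2.0]])

mixture = legal_a[1] + legal_b[0]   # observed single channel

# Unconstrained, any a with b = mixture - a explains the observation.
# Constrained to the priors, enumerate the pairs that add up to the mixture.
consistent = [(i, j)
              for i in range(len(legal_a))
              for j in range(len(legal_b))
              if np.allclose(legal_a[i] + legal_b[j], mixture)]
```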
Speech Enhancement, Gain, and Noise Spectrum Adaptation Using Approximate Bayesian Estimation
Cited by 3 (0 self)
Abstract
This paper presents a new approximate Bayesian estimator for enhancing a noisy speech signal. The speech model is assumed to be a Gaussian mixture model (GMM) in the log-spectral domain, in contrast to most current models, which work in the frequency domain. Exact signal estimation is a computationally intractable problem, so we derive three approximations to enhance the efficiency of signal estimation. The Gaussian approximation transforms the log-spectral-domain GMM into the frequency domain using a minimal Kullback–Leibler (KL) divergence criterion. The frequency-domain Laplace method computes the maximum a posteriori (MAP) estimator for the spectral amplitude; correspondingly, the log-spectral-domain Laplace method computes the MAP estimator for the log-spectral amplitude. Further, gain and noise spectrum adaptation are implemented using the expectation–maximization (EM) algorithm within the GMM under the Gaussian approximation. The proposed algorithms are evaluated by applying them to enhance speech corrupted by speech-shaped noise (SSN). The experimental results demonstrate that the proposed algorithms offer an improved signal-to-noise ratio, a lower word recognition error rate, and less spectral distortion.
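The transformation of a log-spectral model into the power domain can be illustrated by moment matching, which is what a minimal-KL Gaussian approximation reduces to in one direction. A single Gaussian stands in for one GMM component, and the parameter values are arbitrary:

```python
import numpy as np

# One component models the log-spectrum as l ~ N(mu, s^2), so the power
# p = exp(l) is log-normal.  Matching its first two moments gives the
# Gaussian approximation's mean and variance in the power domain.
mu, s = 0.5, 0.3
p_mean = np.exp(mu + s**2 / 2)
p_var = (np.exp(s**2) - 1.0) * np.exp(2 * mu + s**2)

# Monte-Carlo check that the matched moments are the log-normal's moments.
rng = np.random.default_rng(1)
samples = np.exp(rng.normal(mu, s, 200_000))
```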
Audio-Visual Graphical Models for Speech Processing
Cited by 1 (0 self)
Abstract
Perceiving sounds in a noisy environment is a challenging problem. Visual lip-reading can provide relevant information but is also challenging, because lips are moving and a tracker must deal with a variety of conditions. Typically, audio-visual systems have been assembled from individually engineered modules. We propose to fuse audio and video in a probabilistic generative model that implements cross-modal self-supervised learning, enabling adaptation to audio-visual data. The video model features a Gaussian mixture model embedded in a linear subspace of a sprite which translates in the video. The system can learn to detect and enhance speech in noise given only a short (30-second) sequence of audio-visual data. We show some results for speech detection and enhancement, and discuss extensions to the model that are under investigation.
Nonnegative source-filter dynamical system for speech enhancement
, 2014
Cited by 1 (1 self)
Abstract
Model-based speech enhancement methods, which rely on separately modeling the speech and the noise, have been shown to be powerful in many different problem settings. When the structure of the noise can be arbitrary, which is often the case in practice, model-based methods have to focus on developing good speech models, whose quality is key to their performance. In this study, we propose a novel probabilistic model for speech enhancement which precisely models the speech by taking into account the underlying speech production process as well as its dynamics. The proposed model follows a source-filter approach where the excitation and filter parts are modeled as nonnegative dynamical systems. We present convergence-guaranteed update rules for each latent factor. In order to assess performance, we evaluate our model on a challenging speech enhancement task where the speech is observed under nonstationary noises recorded in a car. We show that our model outperforms state-of-the-art methods in terms of objective measures.
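The flavor of convergence-guaranteed multiplicative updates for nonnegative factors can be shown with plain NMF under squared error (the classic Lee–Seung rules) on arbitrary synthetic data; the paper's actual model adds source-filter structure and temporal dynamics on top of this:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy nonnegative "spectrogram" V, factorized as V ≈ W @ H.
F, T, K = 6, 20, 3
V = rng.random((F, K)) @ rng.random((K, T))   # exactly rank K, nonnegative
W = rng.random((F, K)) + 0.1                  # nonnegative initial factors
H = rng.random((K, T)) + 0.1

def loss(V, W, H):
    return float(np.sum((V - W @ H) ** 2))

# Multiplicative updates: each step keeps the factors nonnegative and is
# guaranteed not to increase the squared-error objective.
losses = [loss(V, W, H)]
for _ in range(200):
    H *= (W.T @ V) / (W.T @ W @ H + 1e-12)
    W *= (V @ H.T) / (W @ H @ H.T + 1e-12)
    losses.append(loss(V, W, H))
```

Because the updates are ratios of nonnegative terms, no projection step is needed to maintain nonnegativity, which is what makes this family attractive for latent factors like excitation and filter gains.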
Speech Enhancement by Indirect VTS
, 2012
Abstract
Model-based speech enhancement methods, such as vector Taylor series (VTS)-based methods, share a common methodology: they estimate the speech using the expected value of the clean speech given the noisy speech under a statistical model. We show that it may be better to use the expected value of the noise under the model and subtract it from the noisy observation to form an indirect estimate of the speech. Interestingly, for VTS, this methodology turns out to be related to the application of an SNR-dependent gain to the direct VTS speech estimate. In results obtained on an automotive noise task, this methodology produces an average improvement of 1.6 dB in signal-to-noise ratio relative to conventional methods.
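The direct/indirect distinction can be made concrete with a Monte-Carlo toy in the log-spectral domain, where the mixing y = log(exp(x) + exp(n)) is nonlinear and the two estimators genuinely differ. The priors, the observed value, and the rejection-sampling conditioning are all illustrative assumptions, not the paper's VTS machinery:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy log-spectral mixing y = log(exp(x) + exp(n)) with Gaussian priors on
# the clean log-spectrum x and the noise log-spectrum n.
x = rng.normal(2.0, 0.5, 100_000)
n = rng.normal(0.0, 0.5, 100_000)
y = np.log(np.exp(x) + np.exp(n))

# Condition on a narrow band around one observed value (crude rejection).
y_obs = 2.3
sel = np.abs(y - y_obs) < 0.02

# Direct estimate: posterior mean of the speech, E[x | y].
direct = x[sel].mean()

# Indirect estimate: posterior mean of the noise, subtracted from the
# observation in the power domain.
indirect = np.log(np.exp(y_obs) - np.exp(n[sel].mean()))
```

Because the mixing is nonlinear, the two estimates do not coincide; the abstract's claim is that the indirect route may be the better of the two.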
Speech Enhancement Using Gaussian Scale Mixture Models
Abstract
This paper presents a novel probabilistic approach to speech enhancement. Instead of a deterministic logarithmic relationship, we assume a probabilistic relationship between the frequency coefficients and the log-spectra. The speech model in the log-spectral domain is a Gaussian mixture model (GMM). The frequency coefficients obey a zero-mean Gaussian whose covariance equals the exponential of the log-spectra. This results in a Gaussian scale mixture model (GSMM) for the speech signal in the frequency domain, since the log-spectra can be regarded as scaling factors. The probabilistic relation between frequency coefficients and log-spectra allows these to be treated as two random variables, both to be estimated from the noisy signals. Expectation–maximization (EM) was used to train the GSMM, and Bayesian inference was used to compute the posterior signal distribution. Because exact inference in this full probabilistic model is computationally intractable, we developed two approaches to enhance the efficiency: the Laplace method and a variational approximation. The proposed methods were applied to enhance speech corrupted by Gaussian noise and speech-shaped noise (SSN). For both approximations, signals reconstructed from the estimated frequency coefficients provided a higher signal-to-noise ratio (SNR), while those reconstructed from the estimated log-spectra produced a lower word recognition error rate, because the log-spectra better fit the inputs to the recognizer. Our algorithms effectively reduced the SSN, which algorithms based on spectral analysis were not able to suppress. Index Terms—Gaussian scale mixture model (GSMM), Laplace method, speech enhancement, variational approximation.
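The generative structure described here (a GMM over log-spectra whose exponential scales a zero-mean Gaussian on the frequency coefficients) can be sketched directly; the two-component GMM below is an arbitrary stand-in for a trained speech model:

```python
import numpy as np

rng = np.random.default_rng(0)

# GMM over the log-spectrum z (two components, arbitrary parameters).
weights = np.array([0.4, 0.6])
means = np.array([-1.0, 1.0])
stds = np.array([0.3, 0.3])

# Sample the GSMM: z from the GMM, then x ~ N(0, exp(z)), so exp(z) acts
# as a random scale on the variance of the frequency coefficient.
N = 100_000
comp = rng.choice(2, size=N, p=weights)
z = rng.normal(means[comp], stds[comp])
x = rng.normal(0.0, np.sqrt(np.exp(z)))

# The marginal power matches E[exp(z)] = sum_k w_k exp(mu_k + s_k^2 / 2).
expected_power = float(np.sum(weights * np.exp(means + stds**2 / 2)))
```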
unknown title
Abstract
This paper introduces the Laplace algorithm for denoising in the cepstrum domain, with applications to speech recognition. Our method uses Gaussian mixture priors for the clean speech and noise cepstra and assumes that speech and noise mix linearly in the spectrum domain. The Laplace algorithm involves two steps: (a) computing the mode of the posterior given the observed noisy cepstra, and (b) forming a Gaussian approximation of the posterior around the mode. We show that the Algonquin algorithm is a special case of our approach in which a Newton method is used for (a); interestingly, this observation also proves that the Algonquin algorithm does not converge in general. We propose the use of the BFGS method for (a), which also allows us to apply the Laplace algorithm efficiently in the cepstral domain. Denoising in the cepstral domain gives more than a 31% relative reduction in word error rate on average on the Aurora 2 task.
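The two steps can be sketched on a deliberately simple 1-D problem with a single Gaussian prior (the paper uses Gaussian mixtures and a nonlinear mixing model); since this toy posterior is Gaussian, the Laplace approximation is exact, which makes the BFGS mode-finding step easy to check:

```python
import numpy as np
from scipy.optimize import minimize

# Toy setup: Gaussian prior on a clean cepstral value x, Gaussian likelihood
# for the noisy observation y | x ~ N(x, sn^2).
mu0, s0 = 1.0, 0.5
sn = 0.2
y = 1.4

def neg_log_post(x):
    return 0.5 * ((x - mu0) / s0) ** 2 + 0.5 * ((y - x) / sn) ** 2

# Step (a): find the posterior mode with BFGS.
res = minimize(lambda v: neg_log_post(v[0]), x0=[0.0], method="BFGS")
mode = res.x[0]

# Step (b): Gaussian approximation around the mode, with variance given by
# the inverse curvature of the negative log-posterior.
curv = 1.0 / s0**2 + 1.0 / sn**2
laplace_var = 1.0 / curv

# Closed-form check (this toy posterior is exactly Gaussian).
exact_mode = (mu0 / s0**2 + y / sn**2) / curv
```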
Perceptual Inference in Generative Models (Thesis Excerpts, Draft)
, 2004
Abstract
“Is the problem that we can’t see? Or is it that the problem is beautiful to me?” —David C. Berman

Everything we learn about the world arrives via the senses. We act in the