

A general flexible framework for the handling of prior information in audio source separation (2012)

by A Ozerov, E Vincent, F Bimbot
Results 1 - 10 of 45

Automatic music transcription: challenges and future directions

by Emmanouil Benetos, Simon Dixon, Dimitrios Giannoulis, Holger Kirchhoff, Anssi Klapuri - J INTELL INF SYST , 2013
Abstract - Cited by 15 (4 self)
Automatic music transcription is considered by many to be a key enabling technology in music signal processing. However, the performance of transcription systems is still significantly below that of a human expert, and accuracies reported in recent years seem to have reached a limit, although the field is still very active. In this paper we analyse limitations of current methods and identify promising directions for future research. Current transcription methods use general purpose models which are unable to capture the rich diversity found in music signals. One way to overcome the limited performance of transcription systems is to tailor algorithms to specific use-cases. Semi-automatic approaches are another way of achieving a more reliable transcription. Also, the wealth of musical scores and corresponding

Towards scaling up classification-based speech separation

by Yuxuan Wang, Deliang Wang - IEEE Transactions on Audio, Speech, and Language Processing , 2013
Abstract - Cited by 12 (2 self)
Abstract—Formulating speech separation as a binary classification problem has been shown to be effective. While good separation performance is achieved in matched test conditions using kernel support vector machines (SVMs), separation in unmatched conditions involving new speakers and environments remains a big challenge. A simple yet effective method to cope with the mismatch is to include many different acoustic conditions into the training set. However, large-scale training is almost intractable for kernel machines due to computational complexity. To enable training on relatively large datasets, we propose to learn more linearly separable and discriminative features from raw acoustic features and train linear SVMs, which are much easier and faster to train than kernel SVMs. For feature learning, we employ standard pre-trained deep neural networks (DNNs). The proposed DNN-SVM system is trained on a variety of acoustic conditions within a reasonable amount of time. Experiments on various test mixtures demonstrate good generalization to unseen speakers and background noises. Index Terms—Computational auditory scene analysis (CASA), deep belief networks, feature learning, monaural speech separation, support vector machines. I.
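The training strategy this abstract describes — replacing kernel SVMs with linear SVMs over learned features so that large training sets become tractable — can be illustrated with a toy linear SVM trained by stochastic subgradient descent on the hinge loss. This is a minimal, hypothetical sketch, not the authors' code; the DNN feature-learning stage is abstracted away and all names and data are mine:

```python
# A minimal, hypothetical sketch (not the authors' code): a linear SVM
# trained by Pegasos-style stochastic subgradient descent on the hinge
# loss. The toy 2-D vectors below simply stand in for learned, linearly
# separable DNN features.
import random

def train_linear_svm(data, labels, lam=0.01, epochs=200, seed=0):
    """Train a linear SVM (no bias term). labels must be +1 or -1."""
    rng = random.Random(seed)
    dim = len(data[0])
    w = [0.0] * dim
    t = 0
    for _ in range(epochs):
        for i in rng.sample(range(len(data)), len(data)):
            t += 1
            eta = 1.0 / (lam * t)  # decaying step size
            margin = labels[i] * sum(wj * xj for wj, xj in zip(w, data[i]))
            w = [(1.0 - eta * lam) * wj for wj in w]  # regularization shrink
            if margin < 1.0:  # hinge loss active: take a subgradient step
                w = [wj + eta * labels[i] * xj for wj, xj in zip(w, data[i])]
    return w

# Toy stand-ins for DNN-learned features.
X = [[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -1.0]]
y = [1, 1, -1, -1]
w = train_linear_svm(X, y)
preds = [1 if sum(wj * xj for wj, xj in zip(w, x)) > 0 else -1 for x in X]
```

Each update costs O(dim) regardless of the training-set size, which is what makes this style of training scale where kernel machines do not.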

Citation Context

... speech enhancement methods, such as stationarity which is hard to satisfy for general acoustic environments. Recent model-based methods separate target speech by estimating Wiener gains (e.g., [15], [36]), but statistical source models are usually required or need to be adapted. Inspired by human auditory processing, computational auditory scene analysis (CASA) [45] has the potential to deal with mor...
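The Wiener-gain step this snippet refers to reduces, per time-frequency bin, to scaling the mixture by the ratio of target power to total power. A minimal sketch, assuming per-bin power estimates are already available from some source model; function and variable names are mine, not from the cited work:

```python
# Minimal sketch of Wiener-gain enhancement, assuming per-bin power
# estimates are already available; names are mine, not the cited work's.
def wiener_gain(speech_power, noise_power):
    """Per time-frequency-bin Wiener gain: target power over total power."""
    return speech_power / (speech_power + noise_power)

def enhance(mixture_bins, speech_power, noise_power):
    """Scale each mixture bin by its Wiener gain."""
    return [x * wiener_gain(s, n)
            for x, s, n in zip(mixture_bins, speech_power, noise_power)]

# A speech-dominated bin is mostly kept; a noise-dominated one is suppressed.
gains = [wiener_gain(9.0, 1.0), wiener_gain(1.0, 9.0)]  # 0.9 and 0.1
```

In practice the two power estimates come from the statistical source models the snippet mentions; the gain tends to 1 where the target dominates and to 0 where the noise dominates.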

The signal separation evaluation campaign (2007–2010): Achievements and remaining challenges

by Emmanuel Vincent, Shoko Araki, Fabian J. Theis, Guido Nolte, Pau Bofill, Hiroshi Sawada, Alexey Ozerov, B. Vikrham Gowreesunker, Dominik Lutter, Ngoc Q. K. Duong - Signal Processing
Abstract - Cited by 12 (7 self)
HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

Citation Context

...that were able to separate all sources within all mixtures of the dataset.

Performance                  SASSEC   SiSEC   SiSEC   Binary masking oracle
                             2007     2008    2010    (2008 and 2010)
Instantaneous mixtures
  Method                     [21]     [22]    [23]    [19]
  SDR (dB)                   10.3     14.0    13.4    10.4
  ISR (dB)                   19.2     23.3    23.4    19.4
  SIR (dB)                   16.0     20.4    20.0    21.1
  SAR (dB)                   12.2     15.4    14.9    11.4
Live recordings with 5 cm microphone spacing
  Method                     [25]     [26]    [25]    [19]
  SDR (d...

AUTOMATIC MUSIC TRANSCRIPTION: BREAKING THE GLASS CEILING

by Emmanouil Benetos, Simon Dixon, Dimitrios Giannoulis, Holger Kirchhoff, Anssi Klapuri , 2012
Abstract - Cited by 11 (6 self)
Automatic music transcription is considered by many to be the Holy Grail in the field of music signal analysis. However, the performance of transcription systems is still significantly below that of a human expert, and accuracies reported in recent years seem to have reached a limit, although the field is still very active. In this paper we analyse limitations of current methods and identify promising directions for future research. Current transcription methods use general purpose models which are unable to capture the rich diversity found in music signals. In order to overcome the limited performance of transcription systems, algorithms have to be tailored to specific use-cases. Semi-automatic approaches are another way of achieving a more reliable transcription. Also, the wealth of musical scores and corresponding audio data now available are a rich potential source of training data, via forced alignment of audio to scores, but large scale utilisation of such data has yet to be attempted. Other promising approaches include the integration of information across different methods and musical aspects.

Citation Context

... such as the number or types of instruments in the recording, not many published systems explicitly incorporate prior information from a human user. In the context of source separation, Ozerov et al. [36] proposed a framework that enables the incorporation of prior knowledge about the number and types of sources, and the mixing model. The authors showed that by using prior information, a better separa...

Deep unfolding: Model-based inspiration of novel deep architectures. arXiv preprint arXiv:1409.2574

by John R. Hershey, Jonathan Le Roux, Felix Weninger , 2014
Abstract - Cited by 8 (2 self)
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem-domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
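The unfolding idea described above — running inference iterations as network layers whose parameters are untied and then trained — can be sketched for multiplicative NMF activation updates. This is a hypothetical pure-Python illustration under my own assumptions, not the authors' implementation:

```python
# Hypothetical illustration of deep unfolding (not the authors' code):
# one multiplicative KL-NMF update of the activations H for V ≈ W·H,
# treated as one network "layer". In deep unfolding each layer carries
# its own untied dictionary, trained discriminatively afterwards.
def nmf_activation_layer(V, W, H):
    """One multiplicative update H <- H * (W^T (V / WH)) / (W^T 1)."""
    F, K, N = len(V), len(W[0]), len(V[0])
    WH = [[sum(W[f][k] * H[k][n] for k in range(K)) for n in range(N)]
          for f in range(F)]
    H_new = [[0.0] * N for _ in range(K)]
    for k in range(K):
        for n in range(N):
            num = sum(W[f][k] * V[f][n] / max(WH[f][n], 1e-12)
                      for f in range(F))
            den = sum(W[f][k] for f in range(F)) or 1e-12
            H_new[k][n] = H[k][n] * num / den
    return H_new

def unfolded_inference(V, W_layers, H0):
    """Run a fixed number of update 'layers', one (untied) W per layer."""
    H = H0
    for W in W_layers:
        H = nmf_activation_layer(V, W, H)
    return H

V = [[2.0, 3.0], [1.0, 4.0]]
W_id = [[1.0, 0.0], [0.0, 1.0]]  # toy dictionary, one copy per layer
H = unfolded_inference(V, [W_id, W_id], [[1.0, 1.0], [1.0, 1.0]])
```

In the unfolded network the per-layer dictionaries are no longer tied to a single generative model, which is what provides the extra capacity the abstract refers to.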

A general framework for online audio source separation

by Laurent S. R. Simon, Emmanuel Vincent - Latent Variable Analysis and Source Separation: 10th International Conference (LVA/ICA 2012), Tel-Aviv, pp. 364–371 , 2012
Abstract - Cited by 7 (2 self)
Abstract not found

Citation Context

...he centre and several voices or several guitars are present. In order to address this issue, we consider the general flexible source separation framework in [9]. This framework generalises a wide range of algorithms such as certain forms of ICA, local Gaussian modeling and NMF, and enables the specification of additional constraints on the sources such as ha...

Coding-based Informed Source Separation: Nonnegative Tensor Factorization Approach

by Alexey Ozerov, Antoine Liutkus, Gaël Richard , 2013
Abstract - Cited by 6 (3 self)
Abstract—Informed source separation (ISS) aims at reliably recovering sources from a mixture. For this purpose, it relies on the assumption that the original sources are available during an encoding stage. Given both sources and mixture, a side-information may be computed and transmitted along with the mixture, whereas the original sources are not available any longer. During a decoding stage, both mixture and side-information are processed to recover the sources. ISS is motivated by a number of specific applications including active listening and remixing of music, karaoke, audio gaming, etc. Most ISS techniques proposed so far rely on a source separation strategy and cannot achieve better results than oracle estimators. In this study, we introduce Coding-based ISS (CISS) and draw the connection between ISS and source coding. CISS amounts to encode the sources using not only a model as in source coding but also the observation of the mixture. This strategy has several advantages over conventional ISS methods. First, it can reach any quality, provided sufficient bandwidth is available as in source coding. Second, it makes use of the mixture in order to reduce the bitrate required to transmit the sources, as in classical ISS. Furthermore, we introduce Nonnegative Tensor Factorization as a very efficient model for CISS and report rate-distortion results that strongly outperform the state of the art. Index Terms—Informed source separation, spatial audio object coding, source coding, constrained entropy quantization, probabilistic model, nonnegative tensor factorization.

Citation Context

... to produce I signals (the mixtures) X. The goal of source separation is to estimate the sources S given their mixtures X. Many advances were recently made in the area of audio source separation [5], [6]. However, the problem remains challenging in the underdetermined setting (I < J), including the single-channel case (I = 1), and for convolutive mixtures [7]. It is now quite clear that audio source sep...

Uncertainty-based learning of acoustic models from noisy data, in Computer Speech and Language

by Alexey Ozerov, Mathieu Lagrange, Emmanuel Vincent , 2012
Abstract - Cited by 5 (2 self)
We consider the problem of acoustic modeling of noisy speech data, where the uncertainty over the data is given by a Gaussian distribution. While this uncertainty has been exploited at the decoding stage via uncertainty decoding, its usage at the training stage remains limited to static model adaptation. We introduce a new Expectation Maximisation (EM) based technique, which we call uncertainty training, that allows us to train Gaussian mixture models (GMMs) or hidden Markov models (HMMs) directly from noisy data with dynamic uncertainty. We evaluate the potential of this technique for a GMM-based speaker recognition task on speech data corrupted by real-world domestic background noise, using a state-of-the-art signal enhancement technique and various uncertainty estimation techniques as a front-end. Compared to conventional training, the proposed training algorithm results in 1 % to 2 % absolute improvement in speaker recognition accuracy by training from either matched, unmatched or multi-condition noisy data. This algorithm is also applicable with minor modifications to maximum a posteriori (MAP) or maximum likelihood linear regression (MLLR) acoustic model adaptation from noisy data and to other data than audio.
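The "uncertainty training" idea can be illustrated on the simplest possible case — a single Gaussian instead of a GMM or HMM: each noisy observation carries its own known noise variance, and the E-step shrinks it toward the current mean accordingly. A toy sketch under my own simplifications; names are mine:

```python
# Toy single-Gaussian analogue of uncertainty training (my own
# simplification; the paper handles GMMs/HMMs): each noisy observation
# y[n] = x[n] + e[n] has a known noise variance u[n].
def uncertainty_em(y, u, iters=500):
    """Fit N(mu, s2) to the clean values underlying noisy observations y."""
    mu = sum(y) / len(y)
    s2 = sum((yi - mu) ** 2 for yi in y) / len(y) or 1.0
    for _ in range(iters):
        # E-step: Gaussian posterior of each clean value given its observation
        m = [mu + s2 / (s2 + un) * (yn - mu) for yn, un in zip(y, u)]
        v = [s2 * un / (s2 + un) for un in u]
        # M-step: re-estimate, counting the posterior uncertainty v into s2
        mu = sum(m) / len(m)
        s2 = sum((mn - mu) ** 2 + vn for mn, vn in zip(m, v)) / len(m)
    return mu, s2

mu0, s20 = uncertainty_em([1.0, 2.0, 3.0, 4.0], [0.0] * 4)  # plain ML fit
mu1, s21 = uncertainty_em([1.0, 2.0, 3.0, 4.0], [1.0] * 4)  # deflated variance
```

With all u[n] = 0 this reduces to the ordinary maximum-likelihood fit; with nonzero uncertainties the variance estimate deflates, roughly by the average observation noise, so the noise is no longer absorbed into the model.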

Citation Context

... 4.1.2. Signal enhancement. Signal enhancement is performed via the state-of-the-art algorithm of Ozerov and Vincent (2011), as implemented using the Flexible Audio Source Separation Toolbox (FASST) (Ozerov et al., 2012b). This toolbox allows the user to specify the desired spectral and spatial signal models for each sound source from a library of models. Contrary to the use of speaker-dependent models in (Ozerov and...

Structured Sparsity Models for Reverberant Speech Separation

by Afsaneh Asaei, Mohammad Golbabaee, Herve Bourlard, Volkan Cevher , 2010
Abstract - Cited by 5 (5 self)
We tackle the speech separation problem through modeling the acoustics of the reverberant chambers. Our approach exploits structured sparsity models to perform speech recovery and room acoustic modeling from recordings of concurrent unknown sources. The speakers are assumed to lie on a two-dimensional plane and the multipath channel is characterized using the image model. We propose an algorithm for room geometry estimation relying on localization of the early images of the speakers by sparse approximation of the spatial spectrum of the virtual sources in a free-space model. The images are then clustered exploiting the low-rank structure of the spectro-temporal components belonging to each source. This enables us to identify the early support of the room impulse response function and its unique map to the room geometry. To further tackle the ambiguity of the reflection ratios, we propose a novel formulation of the reverberation model and estimate the absorption coefficients through a convex optimization exploiting joint sparsity model formulated upon spatio-spectral sparsity of concurrent speech representation. The acoustic parameters are then incorporated for separating individual speech signals through either structured sparse recovery or inverse filtering the acoustic channels. The experiments conducted on real data recordings of spatially stationary sources demonstrate the effectiveness of the proposed approach for multi-party speech recovery and recognition.

NOISE ROBUST DISTANT AUTOMATIC SPEECH RECOGNITION UTILIZING NMF BASED SOURCE SEPARATION AND AUDITORY FEATURE EXTRACTION

by Niko Moritz, Marc René Schädler, Kamil Adiloglu, Bernd T. Meyer, Tim Jürgens, Timo Gerkmann, Birger Kollmeier, Simon Doclo, Stefan Goetze
Abstract - Cited by 4 (0 self)
This paper describes our contribution to the 2nd CHiME challenge and focuses on the small vocabulary task, i.e. track one. We present a robust system combination that involves source separation, auditory feature extraction and a modified automatic speech recognition back-end. The source separation code is based on a non-negative matrix factorization approach and the presented auditory feature extraction method uses 2D Gabor filter functions to extract spectral, temporal and spectro-temporal information of the speech signals. In addition we describe the modifications to our classification back-end and discuss the achieved results. On the final CHiME test set the proposed system achieves a maximum keyword recognition rate improvement of 50.25 % for the −6 dB SNR condition, for instance. Index Terms — CHiME challenge, non-negative matrix factorization, Gabor feature extraction, source separation, automatic speech recognition

Citation Context

...ion with variance v_{j,fn}, given as s_{j,fn} ~ N(0, v_{j,fn}) (2). The source variances v_{j,fn}, which encode the spectral power, are decomposed via an excitation-filter model using a multilevel NMF model [8]. This framework makes it possible to incorporate a wide range of constraints about the sources. For more details about how to constrain spectral and temporal structures, see [8]. In a fully Bayesian ...
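The local Gaussian model in this snippet — sources s_{j,fn} ~ N(0, v_{j,fn}) with NMF-structured variances — leads to Wiener-style separation masks v_j / Σ_j v_j per bin. A minimal sketch assuming a plain one-level NMF variance model (the cited framework uses a richer multilevel, excitation-filter decomposition); function names and toy numbers are mine:

```python
# Minimal sketch of the local Gaussian model: per-source variances
# v[f][n] from a one-level NMF (the cited framework is multilevel),
# turned into per-bin Wiener masks. Names and toy numbers are mine.
def nmf_variances(W, H):
    """Source variance model: v[f][n] = sum_k W[f][k] * H[k][n]."""
    F, K, N = len(W), len(W[0]), len(H[0])
    return [[sum(W[f][k] * H[k][n] for k in range(K)) for n in range(N)]
            for f in range(F)]

def wiener_masks(variances):
    """Per-source Wiener masks v_j / sum_j v_j for each (f, n) bin."""
    J, F, N = len(variances), len(variances[0]), len(variances[0][0])
    total = [[sum(variances[j][f][n] for j in range(J)) or 1e-12
              for n in range(N)] for f in range(F)]
    return [[[variances[j][f][n] / total[f][n] for n in range(N)]
             for f in range(F)] for j in range(J)]

# Two sources in a single bin with variances 3 and 1 get masks 0.75 / 0.25.
masks = wiener_masks([nmf_variances([[3.0]], [[1.0]]), [[1.0]]])
```

Applying each mask to the mixture spectrogram yields the minimum mean-square-error source estimates under this Gaussian model.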

