Results 1 - 10 of 127
Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria
- IEEE Transactions on Audio, Speech, and Language Processing, 2007
"... Abstract—An unsupervised learning algorithm for the separation of sound sources in one-channel music signals is presented. The algorithm is based on factorizing the magnitude spectrogram of an input signal into a sum of components, each of which has a fixed magnitude spectrum and a time-varying gain ..."
Cited by 189 (30 self)
Abstract—An unsupervised learning algorithm for the separation of sound sources in one-channel music signals is presented. The algorithm is based on factorizing the magnitude spectrogram of an input signal into a sum of components, each of which has a fixed magnitude spectrum and a time-varying gain. Each sound source, in turn, is modeled as a sum of one or more components. The parameters of the components are estimated by minimizing the reconstruction error between the input spectrogram and the model, while restricting the component spectrograms to be nonnegative and favoring components whose gains are slowly varying and sparse. Temporal continuity is favored by a cost term equal to the sum of squared differences between the gains in adjacent frames, and sparseness is favored by penalizing nonzero gains. The proposed iterative estimation algorithm is initialized with random values, and the gains and the spectra are then alternately updated using multiplicative update rules until the values converge. Simulation experiments were carried out on generated mixtures of pitched musical instrument samples and drum sounds. The performance of the proposed method was compared with independent subspace analysis and basic nonnegative matrix factorization, which are based on the same linear model. According to these simulations, the proposed method achieves better separation quality than the previous algorithms. In particular, the temporal continuity criterion improved the detection of pitched musical sounds; the sparseness criterion did not produce significant improvements. Index Terms—Acoustic signal analysis, audio source separation, blind source separation, music, nonnegative matrix factorization, sparse coding, unsupervised learning.
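To make the cost described above concrete, the following is a minimal sketch of NMF with a squared-error reconstruction term plus the temporal-continuity penalty (sum of squared gain differences between adjacent frames), using heuristic multiplicative updates. The penalty weight alpha, the edge clamping, and the omission of the paper's normalization and sparseness terms are all assumptions of this sketch, not the published algorithm.

import numpy as np

def nmf_temporal_continuity(V, n_components, n_iter=200, alpha=0.1, eps=1e-9, seed=0):
    # Factorize a magnitude spectrogram V (freq x time) as W @ G, where each
    # column of W is a fixed component spectrum and each row of G is that
    # component's time-varying gain.
    rng = np.random.default_rng(seed)
    F, T = V.shape
    W = rng.random((F, n_components)) + eps
    G = rng.random((n_components, T)) + eps
    for _ in range(n_iter):
        # Standard multiplicative update for the spectra (squared error).
        W *= (V @ G.T) / (W @ G @ G.T + eps)
        # Gains: the gradient of alpha * sum_t (g_t - g_{t-1})^2 splits into
        # a positive part (4*alpha*G) and a negative part
        # (2*alpha*(G_prev + G_next)); edges are clamped.
        G_prev = np.hstack([G[:, :1], G[:, :-1]])
        G_next = np.hstack([G[:, 1:], G[:, -1:]])
        num = W.T @ V + 2 * alpha * (G_prev + G_next)
        den = W.T @ (W @ G) + 4 * alpha * G + eps
        G *= num / den
    return W, G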
On ideal binary mask as the computational goal of auditory scene analysis
- In Speech Separation by Humans and Machines, 2005
"... In a natural environment, a target sound, such as speech, is usually mixed with acoustic interference. A sound separation system that removes or attenuates acoustic interference has many important applications, such as automatic speech recognition (ASR) and speaker identification in real ..."
Cited by 99 (40 self)
In a natural environment, a target sound, such as speech, is usually mixed with acoustic interference. A sound separation system that removes or attenuates acoustic interference has many important applications, such as automatic speech recognition (ASR) and speaker identification in real environments.
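The ideal binary mask proposed in this entry as the computational goal has a simple definition: a time-frequency unit is retained exactly when the target energy exceeds the interference energy within it. A minimal sketch follows; the 0 dB local criterion is the common convention and is assumed here.

import numpy as np

def ideal_binary_mask(target_tf, interference_tf, lc_db=0.0):
    # 1 where the local target-to-interference ratio exceeds lc_db,
    # 0 elsewhere; inputs are complex or magnitude T-F representations.
    eps = 1e-12
    snr_db = 10 * np.log10((np.abs(target_tf) ** 2 + eps)
                           / (np.abs(interference_tf) ** 2 + eps))
    return (snr_db > lc_db).astype(float)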
Separation of singing voice from music accompaniment for monaural recordings
- IEEE Transactions on Audio, Speech, and Language Processing, 2006
"... Abstract—Separating singing voice from music accompaniment is very useful in many applications, such as lyrics recognition and alignment, singer identification, and music information retrieval. Although speech separation has been extensively studied for decades, singing voice separation has been lit ..."
Cited by 41 (1 self)
Abstract—Separating singing voice from music accompaniment is very useful in many applications, such as lyrics recognition and alignment, singer identification, and music information retrieval. Although speech separation has been extensively studied for decades, singing voice separation has been little investigated. We propose a system to separate singing voice from music accompaniment for monaural recordings. Our system consists of three stages. The singing voice detection stage partitions and classifies an input into vocal and nonvocal portions. For vocal portions, the predominant pitch detection stage detects the pitch of the singing voice, and then the separation stage uses the detected pitch to group the time-frequency segments of the singing voice. Quantitative results show that the system performs the separation task successfully. Index Terms—Predominant pitch detection, singing voice detection, sound separation.
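As a rough illustration of the third stage, the sketch below keeps the T-F units lying near harmonics of the detected pitch in each vocal frame. The harmonic tolerance and the hard binary mask are illustrative assumptions; the paper's actual grouping rules are more elaborate.

import numpy as np

def group_by_pitch(mixture_stft, freqs, f0_track, tol_hz=25.0):
    # mixture_stft: freq x time; freqs: bin center frequencies in Hz;
    # f0_track: detected singing-voice pitch per frame (<= 0 = nonvocal).
    mask = np.zeros(mixture_stft.shape)
    for t, f0 in enumerate(f0_track):
        if f0 <= 0:
            continue  # nonvocal or unvoiced frame: nothing is grouped
        k = np.round(freqs / f0)
        near = (k >= 1) & (np.abs(freqs - k * f0) < tol_hz)
        mask[near, t] = 1.0
    return mask * mixture_stft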
Sound Source Separation in Monaural Music Signals
, 2006
"... Sound source separation refers to the task of estimating the signals produced by individual sound sources from a complex acoustic mixture. It has several applications, since monophonic signals can be processed more efficiently and flexibly than polyphonic mixtures. This thesis deals with the separat ..."
Cited by 36 (4 self)
Sound source separation refers to the task of estimating the signals produced by individual sound sources from a complex acoustic mixture. It has several applications, since monophonic signals can be processed more efficiently and flexibly than polyphonic mixtures. This thesis deals with the separation of monaural, i.e., one-channel, music recordings. We concentrate on separation methods in which the sources to be separated are not known beforehand. Instead, the separation is enabled by utilizing properties common to real-world sound sources: their continuity, sparseness, and repetition in time and frequency, and their harmonic spectral structure. One of the separation approaches taken here uses unsupervised learning, and the other uses model-based inference built on sinusoidal modeling. Most of the existing unsupervised separation algorithms are based on a linear instantaneous signal model, where each frame of the input mixture signal is modeled as a weighted sum of fixed basis spectra.
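The linear instantaneous model mentioned at the end of this abstract can be written as follows; the notation is assumed here, consistent with the NMF entry above.

% Frame t of the mixture spectrogram as a weighted sum of J fixed
% basis spectra s_j with nonnegative, time-varying gains g_{j,t}.
\mathbf{x}_t \approx \sum_{j=1}^{J} g_{j,t}\,\mathbf{s}_j,
\qquad g_{j,t} \ge 0,\quad \mathbf{s}_j \ge \mathbf{0}.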
Separation of synchronous pitched notes by spectral filtering of harmonics
- IEEE Transactions on Audio, Speech, and Language Processing
"... Abstract—This paper discusses the separation of two or more simultaneously excited pitched notes from a mono sound file into separate tracks. In fact, this is an intermediate stage in the longer-term goal of separating out at least two interweaving melodies of different sound sources from a mono fil ..."
Cited by 31 (3 self)
Abstract—This paper discusses the separation of two or more simultaneously excited pitched notes from a mono sound file into separate tracks. This is an intermediate stage in the longer-term goal of separating at least two interweaving melodies of different sound sources from a mono file. The approach is essentially to filter the set of harmonics of each note from the mixed spectrum in each time frame of audio. A major consideration has been the separation of overlapping harmonics, and three filter designs are proposed for splitting a spectral peak into its constituent partials given the rough frequency and amplitude estimates of each partial contained within. The overall separation quality has been good for mixes of up to seven orchestral notes, as confirmed by measured average signal-to-residual ratios of around 10–20 dB. Index Terms—Music note separation, partial extraction, separation of overlapping harmonics.
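A crude baseline for the overlapping-harmonic problem discussed here is to divide a mixed peak's spectrum between the colliding partials in proportion to their rough amplitude estimates. This amplitude-proportional rule is only an illustration; it does not reproduce any of the paper's three filter designs.

import numpy as np

def split_overlapping_peak(peak_bins, partial_amps):
    # peak_bins: complex or magnitude spectrum of one mixed peak;
    # partial_amps: rough amplitude estimate of each colliding partial.
    amps = np.asarray(partial_amps, dtype=float)
    weights = amps / (amps.sum() + 1e-12)
    # Each partial receives a copy of the peak scaled by its share.
    return [w * np.asarray(peak_bins) for w in weights]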
A Computational Auditory Scene Analysis System for Robust Speech Recognition
"... We present a computational auditory scene analysis system for separating and recognizing target speech in the presence of competing speech or noise. We estimate, in two stages, the ideal binary time-frequency (T-F) mask which retains the mixture in a local T-F unit if and only if the target is stron ..."
Cited by 27 (5 self)
We present a computational auditory scene analysis system for separating and recognizing target speech in the presence of competing speech or noise. We estimate, in two stages, the ideal binary time-frequency (T-F) mask which retains the mixture in a local T-F unit if and only if the target is stronger than the interference within the unit. In the first stage, we use harmonicity to segregate the voiced portions of individual sources in each time frame based on multipitch tracking. Additionally, unvoiced portions are segmented based on an onset/offset analysis. In the second stage, speaker characteristics are used to group the T-F units across time frames. The resulting T-F masks are used in conjunction with missing-data methods for recognition. Systematic evaluations on a speech separation challenge task show significant improvement over the baseline performance. Index Terms—Speech segregation, computational auditory scene analysis, binary time-frequency mask, robust speech recognition.
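The final step, feeding a binary T-F mask to a missing-data recognizer, amounts to flagging mask-1 units as reliable target evidence and mask-0 units as missing. A minimal sketch follows; the returned tuple format is an assumption, not the system's actual interface.

import numpy as np

def missing_data_features(mixture_tf, binary_mask):
    # Units flagged 1 are treated as reliable target-dominant evidence;
    # units flagged 0 are treated as missing during recognition.
    reliable = binary_mask.astype(bool)
    features = np.where(reliable, np.abs(mixture_tf), 0.0)
    return features, reliable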
A supervised learning approach to monaural segregation of reverberant speech
- In Proc. IEEE ICASSP, 2009
"... A major source of signal degradation in real environments is room reverberation. Monaural speech segregation in reverberant environments is a particularly challenging problem. Although inverse filtering has been proposed to partially restore the harmonicity of reverberant speech before segregation, ..."
Cited by 27 (18 self)
A major source of signal degradation in real environments is room reverberation. Monaural speech segregation in reverberant environments is a particularly challenging problem. Although inverse filtering has been proposed to partially restore the harmonicity of reverberant speech before segregation, this approach is sensitive to specific source/receiver and room configurations. This paper proposes a supervised learning approach to monaural segregation of reverberant voiced speech, which learns to map from a set of pitch-based auditory features to a grouping cue encoding the posterior probability of a time–frequency (T–F) unit being target dominant given observed features. We devise a novel objective function for the learning process, which directly relates to the goal of maximizing signal-to-noise ratio. The models trained using this objective function yield significantly better T–F unit labeling. A segmentation and grouping framework is utilized to form reliable segments under reverberant conditions and organize them into streams. Systematic evaluations show that our approach produces very promising results under various reverberant conditions and generalizes well to new utterances and new speakers.
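The core learning step, mapping per-unit pitch-based features to the posterior probability of target dominance, can be sketched with a plain logistic model trained by gradient descent. Plain cross-entropy is used below; the paper's SNR-derived objective function is not reproduced here.

import numpy as np

def train_unit_labeler(features, labels, lr=0.1, n_iter=500):
    # features: n_units x n_features pitch-based descriptors per T-F unit;
    # labels: 1 if the unit is target dominant, else 0.
    w = np.zeros(features.shape[1])
    b = 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(features @ w + b)))  # P(target dominant)
        w -= lr * (features.T @ (p - labels)) / len(labels)
        b -= lr * np.mean(p - labels)
    return w, b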
Transforming Binary Uncertainties for Robust Speech Recognition
"... Abstract—Recently, several algorithms have been proposed to enhance noisy speech by estimating a binary mask that can be used to select those time–frequency regions of a noisy speech signal that contain more speech energy than noise energy. This binary mask encodes the uncertainty associated with en ..."
Cited by 22 (8 self)
Abstract—Recently, several algorithms have been proposed to enhance noisy speech by estimating a binary mask that can be used to select those time-frequency regions of a noisy speech signal that contain more speech energy than noise energy. This binary mask encodes the uncertainty associated with enhanced speech in the linear spectral domain. The use of the cepstral transformation smears the information from the noise dominant time-frequency regions across all the cepstral features. We propose a supervised approach using regression trees to learn the nonlinear transformation of the uncertainty from the linear spectral domain to the cepstral domain. This uncertainty is used by a decoder that exploits the variance associated with the enhanced cepstral features to improve robust speech recognition. Systematic evaluations on a subset of the Aurora4 task using the estimated uncertainty show substantial improvement over the baseline performance across various noise conditions. Index Terms—Binary time-frequency mask, computational auditory scene analysis (CASA), robust automatic speech recognition, spectrogram reconstruction, uncertainty decoding.
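The regression-tree step can be sketched with scikit-learn: learn a mapping from spectral-domain inputs (e.g., the binary mask and enhanced spectra) to the true cepstral-domain uncertainty measured on training data. The shapes and random placeholder arrays below are purely illustrative, not the paper's features.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_train = rng.random((5000, 64))  # placeholder spectral-domain inputs
y_train = rng.random(5000)        # placeholder cepstral variance (one dimension)

# One tree per cepstral dimension; the depth is an illustrative choice.
tree = DecisionTreeRegressor(max_depth=8)
tree.fit(X_train, y_train)
cepstral_variance = tree.predict(rng.random((10, 64)))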
Feasibility of Single Channel Speaker Separation Based on Modulation Frequency Analysis
"... We explore the use of the modulation frequency domain for single channel speaker separation. We discuss features of the modulation spectrogram of speech signals that suggest that multiple speakers are highly separable in this space. In a preliminary experiment, we separate a target speaker from an i ..."
Cited by 19 (2 self)
We explore the use of the modulation frequency domain for single channel speaker separation. We discuss features of the modulation spectrogram of speech signals that suggest that multiple speakers are highly separable in this space. In a preliminary experiment, we separate a target speaker from an interfering speaker by manually masking out modulation spectral features of the interferer. We extend this experiment into a new automatic speaker separation algorithm, and show that it achieves an acceptable level of separation. The new algorithm only needs a rough estimate of the target speaker's pitch range. Index Terms—Speech enhancement, separation, modulation, spectral analysis, time-varying filters.
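The representation used here can be computed as an STFT followed by a second Fourier transform across frames within each subband, yielding acoustic frequency versus modulation frequency. The window and hop sizes below are illustrative choices, not the paper's settings.

import numpy as np
from scipy.signal import stft

def modulation_spectrogram(x, fs, frame_len=512, hop=128):
    # First transform: subband envelopes from the STFT magnitude.
    _, _, Z = stft(x, fs, nperseg=frame_len, noverlap=frame_len - hop)
    env = np.abs(Z)
    # Second transform: FFT along time within each subband gives the
    # modulation-frequency axis.
    return np.abs(np.fft.rfft(env, axis=1))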
A Tandem Algorithm for Pitch Estimation and Voiced Speech Segregation
"... Abstract—A lot of effort has been made in computational auditory scene analysis (CASA) to segregate speech from monaural mixtures. The performance of current CASA systems on voiced speech segregation is limited by lacking a robust algorithm for pitch estimation. We propose a tandem algorithm that pe ..."
Cited by 19 (9 self)
Abstract—Much effort has been made in computational auditory scene analysis (CASA) to segregate speech from monaural mixtures. The performance of current CASA systems on voiced speech segregation is limited by the lack of a robust pitch estimation algorithm. We propose a tandem algorithm that performs pitch estimation of a target utterance and segregation of the voiced portions of target speech jointly and iteratively. The algorithm first obtains a rough estimate of the target pitch, and then uses this estimate to segregate the target speech using harmonicity and temporal continuity. It then improves both pitch estimation and voiced speech segregation iteratively. Novel methods are proposed for performing segregation with a given pitch estimate and pitch determination with given segregation. Systematic evaluation shows that the tandem algorithm extracts a majority of target speech without including much interference, and it performs substantially better than previous systems for either pitch extraction or voiced speech segregation. Index Terms—Computational auditory scene analysis (CASA), iterative procedure, pitch estimation, speech segregation, tandem algorithm.
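The alternation at the heart of the tandem algorithm can be caricatured in a few lines: segregate with the current pitch estimate, then re-estimate pitch from the segregated spectrum, and repeat. The crude peak-picking pitch estimator and hard harmonic mask below stand in for the paper's multipitch tracking and grouping; they are assumptions of this sketch.

import numpy as np

def harmonic_mask(freqs, f0, tol_hz=20.0):
    # Keep bins within tol_hz of a harmonic of f0 (hard grouping).
    k = np.round(freqs / f0)
    return ((k >= 1) & (np.abs(freqs - k * f0) < tol_hz)).astype(float)

def tandem_iteration(spectrum, freqs, f0_init, n_iter=5):
    # spectrum: magnitude spectrum of one frame; f0_init: rough pitch (Hz).
    f0 = f0_init
    for _ in range(n_iter):
        seg = harmonic_mask(freqs, f0) * spectrum   # (a) segregate
        band = (freqs > 50) & (freqs < 500)         # plausible f0 range
        f0 = freqs[band][np.argmax(seg[band])]      # (b) re-estimate pitch
    return f0, harmonic_mask(freqs, f0)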