Results 1 - 10
of
51
Speech Segregation Based on Sound Localization
- J. Acoust. Soc. Am
, 2003
"... We study the cocktail-party effect, which refers to the ability of a listener to attend to a single talker in the presence of adverse acoustical conditions. It has been observed that this ability improves in the presence of binaural cues. In this paper, we explore a technique for speech segrega ..."
Abstract
-
Cited by 61 (22 self)
- Add to MetaCart
We study the cocktail-party effect, which refers to the ability of a listener to attend to a single talker in the presence of adverse acoustical conditions. It has been observed that this ability improves in the presence of binaural cues. In this paper, we explore a technique for speech segregation based on sound localization cues. The auditory masking phenomenon motivates an "ideal" binary mask in which time-frequency regions that correspond to the weak signal are canceled. In our model we estimate this binary mask by observing that systematic changes of the interaural time differences and intensity differences occur as the energy ratio of the original signals is modified. The performance of our model is comparable with results obtained using the ideal binary mask and it shows a large improvement over existing pitch-based algorithms. 1
Monaural speech segregation based on pitch tracking and amplitude modulation
- IEEE Trans. Neural Networks
, 2004
"... Speech segregation is an important task of auditory scene analysis (ASA), in which the speech of a certain speaker is separated from other interfering signals. Wang and Brown proposed a multistage neural model for speech segregation, the core of which is a two-layer oscillator network. In this paper ..."
Abstract
-
Cited by 55 (23 self)
- Add to MetaCart
Speech segregation is an important task of auditory scene analysis (ASA), in which the speech of a certain speaker is separated from other interfering signals. Wang and Brown proposed a multistage neural model for speech segregation, the core of which is a two-layer oscillator network. In this paper, we extend their model by adding further processes based on psychoacoustic evidence to improve the performance. These processes include pitch tracking and grouping based on amplitude modulation (AM). Our model is systematically evaluated and compared with the Wang-Brown model, and it yields significantly better performance. 1.
A Multi-Pitch Tracking Algorithm for Noisy Speech
- IEEE Transactions on Speech and Audio Processing
, 2002
"... We present a robust algorithm for multi-pitch tracking of noisy speech. Our approach integrates an improved channel and peak selection method, a new integration method for extracting periodicity information across different frequency channels, and a hidden Markov model (HMM) for forming continuous p ..."
Abstract
-
Cited by 49 (11 self)
- Add to MetaCart
We present a robust algorithm for multi-pitch tracking of noisy speech. Our approach integrates an improved channel and peak selection method, a new integration method for extracting periodicity information across different frequency channels, and a hidden Markov model (HMM) for forming continuous pitch tracks, and as a result, our algorithm can reliably track single and double pitch tracks in a noisy environment. The proposed algorithm is evaluated on a database of speech utterances mixed with various interferences and the results show that our algorithm outperforms existing algorithms significantly.
On ideal binary mask as the computational goal of auditory scene analysis
- in Speech Separation by Humans and Machines
, 2005
"... What is the computational goal of auditory scene analysis? This is a key issue to address in the Marrian information-processing framework. It is also an important question for researchers in computational auditory scene analysis (CASA) because it bears directly on how a CASA system should be evaluat ..."
Abstract
-
Cited by 40 (20 self)
- Add to MetaCart
What is the computational goal of auditory scene analysis? This is a key issue to address in the Marrian information-processing framework. It is also an important question for researchers in computational auditory scene analysis (CASA) because it bears directly on how a CASA system should be evaluated. In this chapter I discuss different objectives used in CASA. I suggest as a main CASA goal the use of the ideal time-frequency (T-F) binary mask whose value is one for a T-F unit where the target energy is greater than the interference energy and is zero otherwise. The notion of the ideal binary mask is motivated by the auditory masking phenomenon. Properties of the ideal binary mask are discussed, including their relationship to automatic speech recognition and human speech intelligibility. This CASA goal has led to algorithms that directly estimate the ideal binary mask in monaural and binaural conditions, and these algorithms have substantially advanced the state-of-the-art performance in speech separation. 1.
Separation of sound sources by convolutive sparse coding
- in Proc. ISCA Tutorial and Research Workshop on Statistical and Perceptual Audio Processing, 2004. [Online] Available: http://journal.speech.cs.cmu.edu/SAPA2004
, 2004
"... An algorithm for the separation of sound sources is presented. Each source is parametrized as a convolution between a time-frequency magnitude spectrogam and an onset vector. The source model is able to represent several types of sounds, for example repetitive drum sounds and harmonic sounds with mo ..."
Abstract
-
Cited by 26 (3 self)
- Add to MetaCart
An algorithm for the separation of sound sources is presented. Each source is parametrized as a convolution between a time-frequency magnitude spectrogam and an onset vector. The source model is able to represent several types of sounds, for example repetitive drum sounds and harmonic sounds with modulations. An iterative algorithm is proposed for the estimation the parameters. The algorithm is based on minimizing the reconstruction error and the number of onsets. The number of onsets is minimized by applying the sparse coding scheme for onset vectors. A way of modeling the loudness perception of the human auditory system is proposed. The method compresses high-energy sources, and enables the separation of lowenergy sources which are perceptually significant. The algorithm is able to separate meaningful sources from real-world signals. Simulation experiments were carried out using mixtures of harmonic instruments. Demonstration signals are available at
Automatic music transcription as we know it today
- Journal of New Music Research
, 2004
"... The aim of this overview is to describe methods for the automatic transcription of Western polyphonic music. The transcription task is here understood as transforming an acoustic musical signal into a MIDI-like symbolic representation. Only pitched musical instruments are considered: recognizing the ..."
Abstract
-
Cited by 18 (0 self)
- Add to MetaCart
The aim of this overview is to describe methods for the automatic transcription of Western polyphonic music. The transcription task is here understood as transforming an acoustic musical signal into a MIDI-like symbolic representation. Only pitched musical instruments are considered: recognizing the sounds of drum instruments is not discussed. The main emphasis is laid on estimating the multiple fundamental frequencies of several concurrent sounds. Various approaches to solve this problem are discussed, including methods that are based on modelling the human auditory periphery, methods that mimic the human auditory scene analysis function, signal model-based Bayesian inference methods, and data-adaptive methods. Another subproblem addressed is the rhythmic parsing of acoustic musical signals. From the transcription point of view, this amounts to the temporal segmentation of music signals at different time scales. The relationship between the two subproblems and the general structure of the transcription problem is discussed. 1.
A maximum likelihood approach to single-channel source separation
- Journal of Machine Learning Research
, 2003
"... This paper presents a new technique for achieving blind signal separation when given only a single channel recording. The main concept is based on exploiting a priori sets of time-domain basis functions learned by independent component analysis (ICA) to the separation of mixed source signals observe ..."
Abstract
-
Cited by 16 (0 self)
- Add to MetaCart
This paper presents a new technique for achieving blind signal separation when given only a single channel recording. The main concept is based on exploiting a priori sets of time-domain basis functions learned by independent component analysis (ICA) to the separation of mixed source signals observed in a single channel. The inherent time structure of sound sources is reflected in the ICA basis functions, which encode the sources in a statistically efficient manner. We derive a learning algorithm using a maximum likelihood approach given the observed single channel data and sets of basis functions. For each time point we infer the source parameters and their contribution factors. This inference is possible due to prior knowledge of the basis functions and the associated coefficient densities. A flexible model for density estimation allows accurate modeling of the observation and our experimental results exhibit a high level of separation performance for simulated mixtures as well as real environment recordings employing mixtures of two different sources.
A Comparison of Auditory and Blind Separation Techniques for Speech Segregation
- IEEE Transactions on Speech and Audio Processing
, 2000
"... A fundamental problem in auditory and speech processing is the segregation of speech from concurrent sounds. This problem has been a focus of study in computational auditory scene analysis (CASA), and it has also been recently investigated from the perspective of blind source separation. Using a sta ..."
Abstract
-
Cited by 16 (4 self)
- Add to MetaCart
A fundamental problem in auditory and speech processing is the segregation of speech from concurrent sounds. This problem has been a focus of study in computational auditory scene analysis (CASA), and it has also been recently investigated from the perspective of blind source separation. Using a standard corpus of voiced speech mixed with interfering sounds, we report a comparison between CASA and blind source separation techniques, which have been developed independently. Our comparison reveals that they perform well under very dierent conditions. A number of conclusions are drawn with respect to their relative strengths and weaknesses in speech segregation applications as well as in modeling auditory function. A. J. W. van der Kouwe is with the Nuclear Magnetic Resonance Center, Massachusetts General Hospital, 149 13th Street, Charlestown, MA 02129. D. L. Wang is with the Department of Computer and Information Science and the Center for Cognitive Science, The Ohio State University...
Auditory Segmentation Based on Onset and Offset Analysis
, 2007
"... A typical auditory scene in a natural environment contains multiple sources. Auditory scene analysis (ASA) is the process in which the auditory system segregates a scene into streams corresponding to different sources. Segmentation is a major stage of ASA by which an auditory scene is decomposed int ..."
Abstract
-
Cited by 13 (7 self)
- Add to MetaCart
A typical auditory scene in a natural environment contains multiple sources. Auditory scene analysis (ASA) is the process in which the auditory system segregates a scene into streams corresponding to different sources. Segmentation is a major stage of ASA by which an auditory scene is decomposed into segments, each containing signal mainly from one source. We propose a system for auditory segmentation by analyzing onsets and offsets of auditory events. The proposed system first detects onsets and offsets, and then generates segments by matching corresponding onset and offset fronts. This is achieved through a multiscale approach. A quantitative measure is suggested for segmentation evaluation. Systematic evaluation shows that most of target speech, including unvoiced speech, is correctly segmented, and target speech and interference are well separated into different segments.

