A real-time SRP-PHAT source location implementation using stochastic region contraction (SRC) on a large-aperture microphone array
- in Proc. of ICASSP 2007
Cited by 13 (5 self)
In most microphone array applications, it is essential to localize sources in a noisy, reverberant environment. It has been shown that computing the steered response power (SRP) is more robust than faster, two-stage, direct time-difference of arrival methods. The problem with computing SRP is that the SRP space has many local maxima, so computationally intensive grid-search methods are used to find the global maximum. Grid search is too expensive for a real-time system. Several papers have addressed this issue. In this paper we propose using stochastic region contraction (SRC) to make computing the SRP practical. We discuss one important SRP method, computing it from the phase transform (SRP-PHAT), review SRC, and show the computational savings. Using real data from human talkers, we show that SRC reduces computation by more than two orders of magnitude with almost no loss in accuracy. Index Terms – Optimization methods, microphones, arrays, acoustic position measurement
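The search strategy summarized above can be illustrated with a short sketch. It assumes a generic objective `f` over a rectangular box; `src_maximize`, its parameters (`n_samples`, `n_keep`, `min_width`, `max_iter`) and the two-peak test surface `toy_srp` are hypothetical stand-ins for a real SRP-PHAT functional, not the authors' implementation. Each iteration samples the current region at random, keeps the best points, and contracts the region to their bounding box.

```python
import math
import random


def src_maximize(f, lo, hi, n_samples=200, n_keep=20, min_width=1e-3, max_iter=50, seed=0):
    """Maximize f over the box [lo, hi] with stochastic region contraction.

    Each iteration draws n_samples uniform points in the current box, keeps the
    n_keep best ones, and shrinks the box to their bounding box.
    """
    rng = random.Random(seed)
    lo, hi = list(lo), list(hi)
    best_x, best_val = None, -math.inf
    for _ in range(max_iter):
        points = [[rng.uniform(a, b) for a, b in zip(lo, hi)] for _ in range(n_samples)]
        scored = sorted(((f(p), p) for p in points), key=lambda s: s[0], reverse=True)
        if scored[0][0] > best_val:  # remember the best point seen so far
            best_val, best_x = scored[0]
        kept = [p for _, p in scored[:n_keep]]
        # Contract the search region to the bounding box of the kept points.
        lo = [min(p[d] for p in kept) for d in range(len(lo))]
        hi = [max(p[d] for p in kept) for d in range(len(hi))]
        if all(b - a < min_width for a, b in zip(lo, hi)):
            break
    return best_x, best_val


def toy_srp(p):
    """Hypothetical two-peak surface: global maximum at (1, 2), weaker peak at (3, 0.5)."""
    x, y = p
    main = math.exp(-((x - 1.0) ** 2 + (y - 2.0) ** 2) / 0.1)
    side = 0.5 * math.exp(-((x - 3.0) ** 2 + (y - 0.5) ** 2) / 0.1)
    return main + side


if __name__ == "__main__":
    x, val = src_maximize(toy_srp, [0.0, 0.0], [4.0, 4.0])
    print(x, val)
```

Because the region can stay wide while several peaks compete, the sketch also tracks the best point seen so far; the sample counts here are arbitrary and would need tuning for a real array.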
Short-Term Spatio–Temporal Clustering Applied to Multiple Moving Speakers
Cited by 3 (0 self)
Distant microphones allow spontaneous multiparty speech to be processed with very few constraints on speakers, as opposed to close-talking microphones. Minimizing the constraints on speakers permits a large diversity of applications, including meeting summarization and browsing, surveillance, hearing aids, and more natural human–machine interaction. Such applications of distant microphones require determining where and when the speakers are talking. This is inherently a multisource problem, because of background noise sources as well as the natural tendency of multiple speakers to talk over each other. Moreover, spontaneous speech utterances are highly discontinuous, which makes it difficult to track multiple speakers with classical filtering approaches such as Kalman filters or particle filters. As an alternative, this paper proposes a probabilistic framework to determine the trajectories of multiple moving speakers in the short term only, i.e., only while they speak. Instantaneous location estimates that are close in space and time are grouped into “short-term clusters” in a principled manner. Each short-term cluster determines the precise start and end times of an utterance and a short-term spatial trajectory. Contrastive experiments on real indoor recordings, with seated speakers in meetings as well as multiple moving speakers, clearly show the benefit of using short-term clustering. Index Terms – Localization, multiple acoustic sources, short-term clustering, speech segmentation, tracking.
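The abstract does not give the clustering rule, so the following is only a simplified, deterministic illustration of grouping location estimates that are close in space and time; it is not the paper's probabilistic framework. The `Cluster` class, the function `short_term_clusters`, and the thresholds `max_gap_s` and `max_dist_m` are hypothetical.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """A short-term cluster: time-ordered (t, x, y) location estimates."""
    points: list = field(default_factory=list)

    @property
    def start(self):
        return self.points[0][0]

    @property
    def end(self):
        return self.points[-1][0]


def short_term_clusters(estimates, max_gap_s=0.3, max_dist_m=0.5):
    """Greedily group (t, x, y) estimates that are close in time and space."""
    clusters = []
    for t, x, y in sorted(estimates):
        best, best_dist = None, math.inf
        for c in clusters:
            last_t, last_x, last_y = c.points[-1]
            dist = math.hypot(x - last_x, y - last_y)
            # Extend a cluster only if this estimate follows it closely in time and space.
            if t - last_t <= max_gap_s and dist <= max_dist_m and dist < best_dist:
                best, best_dist = c, dist
        if best is None:  # nothing nearby: start a new utterance
            best = Cluster()
            clusters.append(best)
        best.points.append((t, x, y))
    return clusters
```

Each cluster's `start` and `end` then act as utterance boundaries, and its points form a short spatial trajectory, which is the role the abstract assigns to short-term clusters.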
Extracting and Re-rendering Structured Auditory Scenes from Field Recordings
Cited by 2 (0 self)
We present an approach to automatically extract and re-render a structured auditory scene from field recordings obtained with a small set of microphones, freely positioned in the environment. From the recordings and the calibrated position of the microphones, the 3D location of various auditory events can be estimated together with their corresponding content. This structured description is reproduction-setup independent. We propose solutions to classify foreground, well-localized sounds and more diffuse background ambiance and adapt our rendering strategy accordingly. Warping the original recordings during playback allows for simulating smooth changes in the listening point or position of sources. Comparisons to reference binaural and B-format recordings show that our approach achieves good spatial rendering while remaining independent of the reproduction setup and offering extended authoring capabilities.
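As a rough illustration of what re-rendering from a moved listening point involves, the sketch below computes the extra propagation delay and the amplitude change for one localized source, assuming a point source, free-field 1/r spreading and a speed of sound of 343 m/s. The function `rerender_params` is hypothetical; the abstract does not describe the warping procedure in enough detail to reproduce it.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at about 20 °C


def rerender_params(source, mic, listener, fs=48000):
    """Extra delay (in samples) and gain to move the listening point from mic to listener.

    Assumes a point source and free-field spherical spreading (amplitude ~ 1/r).
    """
    d_rec = math.dist(source, mic)        # source-to-microphone distance during recording
    d_new = math.dist(source, listener)   # source-to-listener distance at playback
    extra_delay_samples = (d_new - d_rec) / SPEED_OF_SOUND * fs
    gain = d_rec / d_new
    return extra_delay_samples, gain
```

Under these assumptions, moving the listener from 1 m to 2 m from a source adds about 140 samples of delay at 48 kHz and halves the amplitude.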
3D-Audio Matting, Post-editing and Re-rendering from Field Recordings
Cited by 2 (1 self)
Figure 1: Left: We use multiple arbitrarily positioned microphones (circled in yellow) to simultaneously record real-world auditory environments. Middle: We analyze the recordings to extract the positions of various sound components through time. Right: This high-level representation allows for post-editing and re-rendering the acquired soundscape within generic 3D-audio rendering architectures.
We present a novel approach to real-time spatial rendering of realistic auditory environments and sound sources recorded live, in the field. Using a set of standard microphones distributed throughout a real-world environment, we record the sound field simultaneously from several locations. After spatial calibration, we segment from this set of recordings a number of auditory components, together with their locations. We compare existing time-delay-of-arrival estimation techniques between pairs of widely spaced microphones and introduce a novel, efficient hierarchical localization algorithm. Using the high-level representation thus obtained, we can edit and re-render the acquired auditory scene over a variety of listening setups. In particular, we can move or alter the different sound sources and arbitrarily choose the listening position. We can also composite elements of different scenes together in a spatially consistent way. Our approach provides efficient rendering of complex soundscapes which would be challenging to model using discrete point sources and traditional virtual acoustics techniques. We demonstrate a wide range of possible applications for games, virtual and augmented reality, and audio-visual post-production.
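The abstract compares pairwise time-delay-of-arrival estimators. As an illustration of one widely used estimator, GCC-PHAT (not necessarily one of the techniques compared in the paper), the sketch below whitens the cross-spectrum so that only phase remains and takes the peak of its inverse transform. It assumes a circular delay and uses a naive O(N^2) DFT to stay dependency-free; `gcc_phat_lag`, `dft` and `idft` are hypothetical names.

```python
import cmath
import math
import random


def dft(x):
    """Naive O(N^2) discrete Fourier transform (keeps the sketch dependency-free)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)) for k in range(n)]


def idft(spectrum):
    """Naive inverse DFT."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def gcc_phat_lag(x, y, eps=1e-12):
    """Estimate the lag d (in samples) such that y[n] ~ x[n - d], using GCC-PHAT."""
    X, Y = dft(x), dft(y)
    cross = [yk * xk.conjugate() for xk, yk in zip(X, Y)]
    # PHAT weighting: discard magnitude so that only the phase (i.e. the delay) remains.
    weighted = [c / (abs(c) + eps) for c in cross]
    corr = [v.real for v in idft(weighted)]
    n = len(corr)
    peak = max(range(n), key=lambda i: corr[i])
    return peak if peak <= n // 2 else peak - n  # map wrap-around indices to negative lags


if __name__ == "__main__":
    rng = random.Random(1)
    x = [rng.gauss(0.0, 1.0) for _ in range(64)]
    y = x[-5:] + x[:-5]  # circular delay by 5 samples: y[n] = x[n - 5]
    print(gcc_phat_lag(x, y))
```

With real, non-circular recordings one would zero-pad the frames and use an FFT rather than this naive transform.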
Parametrization of Linear Systems Using Diffusion Kernels
, 2011
Cited by 2 (2 self)
Modeling natural and artificial systems plays a key role in various applications and has long attracted enormous effort. In this work, instead of exploring predefined models, we aim to implicitly identify the system's degrees of freedom. This approach circumvents the dependence on a specific predefined model for a specific task or system, and enables a generic data-driven method to characterize a system based solely on its output observations. We claim that each system can be viewed as a black box controlled by several independent parameters. Moreover, we assume that the perceptual characterization of the system output is determined by these independent parameters. Consequently, by recovering the independent controlling parameters, we in fact obtain a generic model of the system. In this work, we propose a supervised algorithm to recover the controlling parameters of natural and artificial linear systems. The proposed algorithm relies on nonlinear independent component analysis using diffusion kernels and spectral analysis. Experiments applying the proposed algorithm to both synthetic and real examples show accurate recovery of the parameters.
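The diffusion-kernel step can be illustrated with a generic diffusion map, which is much simpler than the supervised nonlinear ICA described above and is not the authors' algorithm. The sketch builds a Gaussian kernel with an assumed bandwidth `eps`, normalizes it, and extracts the first non-trivial eigenvector by power iteration; for samples of a one-parameter system this coordinate typically varies monotonically with the hidden parameter. `diffusion_coordinate` is a hypothetical name.

```python
import math
import random


def diffusion_coordinate(points, eps, iters=500, seed=0):
    """First non-trivial diffusion-map coordinate of `points` (a list of tuples).

    Builds a Gaussian kernel, normalizes it symmetrically, projects out the trivial
    top eigenvector, and finds the next eigenvector by power iteration.
    """
    n = len(points)
    K = [[math.exp(-math.dist(p, q) ** 2 / eps) for q in points] for p in points]
    deg = [sum(row) for row in K]
    # A = D^-1/2 K D^-1/2 is symmetric positive semidefinite; its top eigenvector is sqrt(deg).
    A = [[K[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    top = [math.sqrt(d) for d in deg]
    norm = math.sqrt(sum(t * t for t in top))
    top = [t / norm for t in top]

    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    for _ in range(iters):
        proj = sum(a * b for a, b in zip(v, top))
        v = [a - proj * b for a, b in zip(v, top)]  # remove the trivial direction
        v = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
        length = math.sqrt(sum(a * a for a in v))
        v = [a / length for a in v]
    # Map back to a right eigenvector of the Markov matrix P = D^-1 K.
    return [a / math.sqrt(d) for a, d in zip(v, deg)]
```

For points sampled along a quarter circle, the returned coordinate should track the arc angle up to sign and scale, i.e. it recovers the single controlling parameter from the observations alone.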
Direction Estimation Based on Sound Intensity Vectors
The direction of a sound source in an enclosure can be estimated with a microphone array and appropriate signal processing. Previously, time delay estimation methods such as cross-correlation have been popular in both applications and research. Recently, direction estimation techniques based on sound intensity vectors have been developed and used in applications such as teleconferencing. Unlike time delay estimation methods, these methods have not been widely compared. In this article, five direction estimation methods within the sound intensity vector framework are compared using real data from a concert hall. The results indicate that the methods based on convolutive mixture models perform slightly better than some of the simple averaging methods. The convolutive-mixture-model methods are also more robust against additive noise.
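A minimal sketch of the simplest, averaging-style estimator mentioned in the comparison, assuming ideal pressure and particle-velocity signals for a single plane wave in the horizontal plane; the convolutive-mixture methods that performed better are not shown. `source_azimuth_deg` and the test-signal generator `simulate_plane_wave` are hypothetical.

```python
import math
import random


def source_azimuth_deg(p, vx, vy):
    """Source azimuth (degrees) from the time-averaged active intensity I = mean(p * v)."""
    n = len(p)
    ix = sum(a * b for a, b in zip(p, vx)) / n
    iy = sum(a * b for a, b in zip(p, vy)) / n
    # Intensity points along the propagation direction, i.e. away from the source.
    return math.degrees(math.atan2(-iy, -ix))


def simulate_plane_wave(azimuth_deg, n=2000, noise=0.1, seed=0):
    """Ideal pressure and velocity signals for one plane wave plus independent noise."""
    rng = random.Random(seed)
    ux = math.cos(math.radians(azimuth_deg))
    uy = math.sin(math.radians(azimuth_deg))
    p, vx, vy = [], [], []
    for _ in range(n):
        s = rng.gauss(0.0, 1.0)
        p.append(s + noise * rng.gauss(0.0, 1.0))
        # Velocity normalized by rho*c, pointing away from the source.
        vx.append(-s * ux + noise * rng.gauss(0.0, 1.0))
        vy.append(-s * uy + noise * rng.gauss(0.0, 1.0))
    return p, vx, vy
```

Averaging the product of pressure and velocity over time suppresses the uncorrelated noise terms; in a reverberant hall, reflections are correlated with the direct sound, which is one reason more elaborate models can help.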
An Investigation into the Use of Inexpensive Audio Equipment as a Method of Real-time 3D Sound Source Localization
, 2011
Localising Speech, Footsteps and Other Sounds using Resource-Constrained Devices
While a number of acoustic localisation systems have been proposed over the last few decades, these have typically either relied on expensive dedicated microphone arrays and workstation-class processing, or have been developed to detect a very specific type of sound in a particular scenario. However, as people live and work indoors, they generate a wide variety of sounds as they interact and move about. These human-generated sounds can be used to infer the positions of people without requiring them to wear trackable tags. In this paper, we take a practical yet general approach to localising a number of human-generated sounds. Drawing on the signal processing literature, we identify methods that allow resource-constrained devices in a sensor network to detect, classify and locate acoustic events such as speech, footsteps and objects being placed onto tables. We evaluate the classification and time-of-arrival estimation algorithms using a data set of human-generated sounds captured with sensor nodes in a controlled setting. We show that despite the variety and complexity of the sounds, their localisation is feasible for sensor networks, with typical accuracies of half a metre or better. We specifically discuss the processing and networking considerations, and explore the performance trade-offs that can be made to further conserve resources.
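The abstract does not detail the position solver. One common approach, shown here only as an assumption, is to fit the arrival-time differences between nodes with a brute-force grid search, which avoids needing the emission time. `locate_tdoa`, the grid size and step, and the speed of sound are hypothetical choices.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed


def locate_tdoa(nodes, arrivals, size=(5.0, 5.0), step=0.05):
    """Brute-force 2-D source search from arrival times at known node positions.

    Only differences relative to the first node are used, so the unknown emission
    time cancels out.
    """
    ref_node, ref_t = nodes[0], arrivals[0]
    nx, ny = round(size[0] / step), round(size[1] / step)
    best, best_err = None, math.inf
    for i in range(nx + 1):
        for j in range(ny + 1):
            cand = (i * step, j * step)
            d_ref = math.dist(cand, ref_node)
            err = 0.0
            for node, t in zip(nodes[1:], arrivals[1:]):
                predicted = (math.dist(cand, node) - d_ref) / SPEED_OF_SOUND
                err += (predicted - (t - ref_t)) ** 2  # squared TDOA residual
            if err < best_err:
                best, best_err = cand, err
    return best
```

An exhaustive search like this costs on the order of (grid points x nodes) distance evaluations per event, which is exactly the kind of cost a resource-constrained deployment would try to reduce.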
Improving Hands-Free Speech Recognition in a Car Through Audio-Visual Voice Activity Detection
, 2011
In this work, we show how speech recognition performance in a noisy car environment can be improved by combining audio-visual voice activity detection (VAD) with microphone array processing techniques. This is accomplished by enhancing the multi-channel audio signal in the speaker localization step through per-channel power spectral subtraction, whose noise estimates are obtained from the non-speech segments identified by VAD. This noise reduction step improves the accuracy of the estimated speaker positions and thereby the quality of the beamformed signal in the subsequent array processing step. Audio-visual voice activity detection has the advantage of being more robust in acoustically demanding environments. This claim is substantiated through speech recognition experiments on the AVICAR corpus, where the proposed localization framework combined with delay-and-sum beamforming gave a word error rate (WER) of 7.1 %. This compares to a WER of 8.9 % for speaker localization with audio-only VAD, 11.6 % without VAD, and 15.6 % for a single distant channel. Index Terms – microphone arrays, audio-visual systems, acoustic signal detection, time of arrival estimation, automatic speech recognition
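The noise-reduction step described above can be sketched per channel as follows, assuming the VAD supplies one speech/non-speech flag per frame and that the noise estimate is the mean power of the non-speech frames. The function name, the averaging rule and the spectral `floor` are assumptions; the abstract does not specify them.

```python
def spectral_subtraction(power_frames, is_speech, floor=0.01):
    """Per-channel power spectral subtraction with a VAD-driven noise estimate.

    power_frames: list of frames, each a list of per-bin power values |X(k)|^2.
    is_speech: one boolean per frame from a voice activity detector.
    """
    noise_frames = [f for f, s in zip(power_frames, is_speech) if not s]
    if not noise_frames:
        raise ValueError("need at least one non-speech frame to estimate noise")
    n_bins = len(power_frames[0])
    # Noise power per bin: average over the frames the VAD marked as non-speech.
    noise = [sum(f[k] for f in noise_frames) / len(noise_frames) for k in range(n_bins)]
    # Subtract the estimate and clamp to a fraction of the input to avoid negative power.
    return [[max(f[k] - noise[k], floor * f[k]) for k in range(n_bins)] for f in power_frames]
```

In the pipeline the abstract describes, this would run on each microphone channel before localization; a VAD that misses speech frames would leak speech energy into the noise estimate, which is one reason a more robust (audio-visual) VAD can help.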
3D-Audio Matting, Post-editing and Re-rendering from Field Recordings
DOI: 10.1155/2007/47970

