Analysis of multimodal sequences using geometric video representations
- Signal Processing, in press, 2006, [Online] Available: http://lts2www.epfl.ch, 2005
Cited by 15 (13 self)
This paper presents a novel method to correlate audio and visual data generated by the same physical phenomenon, based on a sparse geometric representation of video sequences. The video signal is modeled as a sum of geometric primitives evolving through time that jointly describe the geometric and motion content of the scene. The displacement through time of relevant visual features, like the mouth of a speaker, can thus be compared with the evolution of an audio feature to assess the correspondence between acoustic and visual signals. Experiments show that the proposed approach can detect and track the speaker’s mouth when several persons are present in the scene, in the presence of distracting motion, and without prior face or mouth detection.
Extraction of audio features specific to speech production for multimodal speaker detection
- IEEE Trans. Multimedia, 2008
Cited by 6 (2 self)
A method that exploits an information theoretic framework to extract optimized audio features using video information is presented. A simple measure of mutual information (MI) between the resulting audio and video features allows the detection of the active speaker among different candidates. This method involves the optimization of an MI-based objective function. No approximation is needed to solve this optimization problem, either for the estimation of the probability density functions (pdfs) of the features or for the cost function itself. The pdfs are estimated from the samples using a nonparametric approach. The challenging optimization problem is solved using a global method: the differential evolution algorithm. Two information theoretic optimization criteria are compared and their ability to extract audio features specific to speech production is discussed. Using these specific audio features, candidate video features are then classified as members of the “speaker” or “non-speaker” class, resulting in a speaker detection scheme. As a result, our method achieves a speaker detection rate of 100% on in-house test sequences, and of 85% on most commonly used sequences. Index Terms: Audio features, differential evolution, multimodal, mutual information, speaker detection, speech.
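The abstract above describes ranking candidate speakers by the mutual information between audio and video features. A minimal Python sketch of that ranking step follows. It assumes a simple equal-width histogram estimate of the densities, which is only a stand-in for the nonparametric estimator mentioned in the abstract, and it omits the differential evolution step that optimizes the audio features. The function names (`quantize`, `mutual_information`, `detect_speaker`) and the bin count are illustrative and do not come from the paper.

```python
import math
from collections import Counter


def quantize(values, n_bins=8):
    """Map real values to equal-width bin indices in [0, n_bins - 1]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 for constant input
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def mutual_information(x, y, n_bins=8):
    """Histogram (plug-in) estimate of I(X;Y) in bits from paired samples."""
    qx, qy = quantize(x, n_bins), quantize(y, n_bins)
    n = len(qx)
    joint = Counter(zip(qx, qy))
    px, py = Counter(qx), Counter(qy)
    mi = 0.0
    for (a, b), count in joint.items():
        # p(a, b) * log2( p(a, b) / (p(a) * p(b)) ), written with raw counts
        mi += (count / n) * math.log2(count * n / (px[a] * py[b]))
    return mi


def detect_speaker(audio_feature, video_candidates, n_bins=8):
    """Return (index of the candidate with the highest MI, all MI scores)."""
    scores = [mutual_information(audio_feature, v, n_bins)
              for v in video_candidates]
    return scores.index(max(scores)), scores
```

Here `audio_feature` is one scalar audio feature per frame and each entry of `video_candidates` is one scalar video feature per frame for a candidate region. The candidate whose feature shares the most information with the audio is taken as the active speaker.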
Learning Multi-Modal Dictionaries
- IEEE Transactions on Image Processing, 2006
Cited by 4 (2 self)
Real-world phenomena involve complex interactions between multiple signal modalities. As a consequence, humans are used to integrating, at each instant, perceptions from all their senses in order to enrich their understanding of the surrounding world. This paradigm can also be extremely useful in many signal processing and computer vision problems involving mutually related signals. The simultaneous processing of multimodal data can, in fact, reveal information that is otherwise hidden when considering the signals independently. However, in natural multimodal signals, the statistical dependencies between modalities are in general not obvious. Learning fundamental multimodal patterns could offer deep insight into the structure of such signals. In this paper, we present a novel model of multimodal signals based on their sparse decomposition over a dictionary of multimodal structures. An algorithm for iteratively learning multimodal generating functions that can be shifted to all positions in the signal is also proposed. The learning is defined in such a way that it can be accomplished by iteratively solving a generalized eigenvector problem, which makes the algorithm fast, flexible, and free of user-defined parameters. The proposed algorithm is applied to audiovisual sequences and is able to discover underlying structures in the data. The detection of such audio-video patterns in audiovisual clips makes it possible to effectively localize the sound source in the video in the presence of substantial acoustic and visual distractors, outperforming state-of-the-art audiovisual localization algorithms. Index Terms: Audiovisual source localization, dictionary learning, multimodal data processing, sparse representation.
Blind Audio-Visual Source Separation based on Sparse Redundant Representations
Cited by 3 (2 self)
In this paper we propose a novel method which is able to detect and separate audio-visual sources present in a scene. Our method exploits the correlation between the video signal captured with a camera and a synchronously recorded one-microphone audio track. In a first stage, audio and video modalities are decomposed into relevant basic structures using redundant representations. Next, synchrony between relevant events in audio and video modalities is quantified. Based on this co-occurrence measure, audio-visual sources are counted and located in the image using a robust clustering algorithm that groups video structures exhibiting strong correlations with the audio. Next, periods where each source is active alone are determined and used to build spectral Gaussian Mixture Models (GMMs) characterizing the sources' acoustic behavior. Finally, these models are used to separate the audio signal in periods during which several sources are mixed. The proposed approach has been extensively tested on synthetic and natural sequences composed of speakers and musical instruments. Results show that the proposed method is able to successfully detect, localize, separate and reconstruct the audio-visual sources present. Index Terms: Audio-visual processing, blind source separation, sparse signal representation, Gaussian Mixture Models.
Relevant Feature Selection for Audio-Visual Speech Recognition
Cited by 3 (2 self)
We present a feature selection method based on information theoretic measures, targeted at multimodal signal processing, showing how we can quantitatively assess the relevance of features from different modalities. We are able to find the features with the highest amount of information relevant for the recognition task, and at the same time having minimal redundancy. Our application is audio-visual speech recognition, and in particular selecting relevant visual features. Experimental results show that our method outperforms other feature selection algorithms from the literature by improving recognition accuracy even with a significantly reduced number of features.
Multimodal speaker localization in a probabilistic framework
- In Proc. of EUSIPCO, 2006
Cited by 2 (0 self)
A multimodal probabilistic framework is proposed for the problem of finding the active speaker in a video sequence. We localize the current speaker’s mouth in the image by using the video and the audio channels together. We propose a novel visual feature that is well-suited for the analysis of the movement of the mouth. After estimating the joint probability density of the audio and visual features, we can find the most probable location of the current speaker’s mouth in a sequence of images. The proposed method is tested on the CUAVE audio-visual database, yielding improved results compared to other approaches from the literature.
Blind Audiovisual Source Separation Using Sparse Representations
Cited by 2 (2 self)
In this work we present a method to jointly separate active audio and visual structures in a given mixture. Blind Audiovisual Source Separation is achieved by exploiting the coherence between a video signal and a one-microphone audio track. The efficient representation of audio and video sequences makes it possible to build relationships between correlated structures in both modalities. Video structures that exhibit strong correlations with the audio signal and are spatially close are grouped using a robust clustering algorithm that can count and localize audiovisual sources. Using such information and exploiting audio-video correlation, audio sources are also localized and separated. To the best of our knowledge this is the first blind audiovisual source separation algorithm conceived to deal with a video sequence and the corresponding mono audio signal. Index Terms: Audiovisual processing, blind source separation, sparse signal representation.
Dynamic Modality Weighting for Multi-Stream HMMs in Audio-Visual Speech Recognition
Cited by 2 (1 self)
Merging decisions from different modalities is a crucial problem in Audio-Visual Speech Recognition. To solve this, state synchronous multi-stream HMMs have been proposed for their important advantage of incorporating stream reliability in their fusion scheme. This paper focuses on stream weight adaptation based on modality confidence estimators. We assume different and time-varying environment noise, as can be encountered in realistic applications, and, for this, adaptive methods are best-suited. Stream reliability is assessed directly through classifier outputs since they are not specific to either noise type or level. The influence of constraining the weights to sum to one is also discussed. Categories and Subject Descriptors I.5.4 [Pattern Recognition]: Applications—signal processing, computer vision
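The stream weighting idea in the abstract above can be illustrated with a short Python sketch. In a state synchronous multi-stream HMM, the per-state emission log-likelihood is commonly a weighted sum of the audio and video stream log-likelihoods. The confidence measure below (lower posterior entropy gives a higher weight) is an assumption chosen for illustration, not necessarily the estimator used in the paper, and the names `entropy`, `stream_weights`, and `fused_log_likelihood` are hypothetical.

```python
import math


def entropy(posteriors):
    """Shannon entropy (in nats) of a class posterior distribution."""
    return -sum(p * math.log(p) for p in posteriors if p > 0)


def stream_weights(audio_post, video_post):
    """Assumed confidence mapping: a peaked (low-entropy) posterior is
    treated as more reliable. Returned weights are constrained to sum to one."""
    h_max = math.log(len(audio_post))  # entropy of a uniform posterior
    conf_a = max(0.0, h_max - entropy(audio_post))
    conf_v = max(0.0, h_max - entropy(video_post))
    total = conf_a + conf_v
    if total <= 0.0:
        return 0.5, 0.5  # both streams uninformative: weight them equally
    lam_a = conf_a / total
    return lam_a, 1.0 - lam_a


def fused_log_likelihood(log_b_audio, log_b_video, lam_audio, lam_video):
    """Weighted multi-stream emission score for one HMM state:
    log b(o) = lam_audio * log b_a(o_a) + lam_video * log b_v(o_v)."""
    return lam_audio * log_b_audio + lam_video * log_b_video
```

Recomputing the weights at every frame from the current classifier outputs lets the fusion follow time-varying noise without knowing the noise type or level. Dropping the sum-to-one normalization in `stream_weights` would give the unconstrained variant whose influence the abstract mentions.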
Information theoretic feature extraction for audio-visual speech recognition
- IEEE Transactions on Signal Processing, 2009
Cited by 2 (0 self)
The problem of feature selection has been thoroughly analyzed in the context of pattern classification, with the purpose of avoiding the curse of dimensionality. However, in the context of multimodal signal processing, this problem has been studied less. Our approach to feature extraction is based on information theory, with an application to multimodal classification, in particular audio–visual speech recognition. Contrary to previous work in information theoretic feature selection applied to multimodal signals, our proposed methods penalize features for their redundancy, achieving more compact feature sets and better performance. We propose two greedy selection algorithms for the selection of visual features for audio–visual speech recognition: one penalizes a proportion of feature redundancy, while the other uses conditional mutual information as an evaluation measure. Our features perform better than linear discriminant analysis, the most usual transform for dimensionality reduction in the field, across a wide range of dimensionality values and combined with audio at different quality levels. Index Terms: Audio–visual speech recognition, feature selection, mutual information.
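A greedy selection that rewards relevance and penalizes redundancy, as described in the abstract above, can be sketched in a few lines of Python. The scoring rule used here (relevance minus `beta` times the mean mutual information with the features already selected) is a generic criterion assumed for illustration; the exact penalty in the paper may differ, and the conditional mutual information variant is not shown. Features are assumed to be already discretized, and the names `greedy_select` and `beta` are hypothetical.

```python
import math
from collections import Counter


def mutual_information(x, y):
    """Plug-in estimate of I(X;Y) in bits for two discrete sequences."""
    n = len(x)
    joint, px, py = Counter(zip(x, y)), Counter(x), Counter(y)
    return sum((c / n) * math.log2(c * n / (px[a] * py[b]))
               for (a, b), c in joint.items())


def greedy_select(features, labels, k, beta=1.0):
    """Pick up to k feature names from `features` (dict: name -> values).

    Each step adds the feature maximizing
        I(f; labels) - beta * mean_{s in selected} I(f; s),
    so informative features are preferred and redundant ones are penalized.
    """
    relevance = {name: mutual_information(col, labels)
                 for name, col in features.items()}
    selected = []

    def score(name):
        if not selected:
            return relevance[name]
        redundancy = sum(mutual_information(features[name], features[s])
                         for s in selected) / len(selected)
        return relevance[name] - beta * redundancy

    while len(selected) < min(k, len(features)):
        remaining = [name for name in features if name not in selected]
        selected.append(max(remaining, key=score))
    return selected
```

With `beta=0` the procedure reduces to ranking features by relevance alone, which tends to pick near-duplicate features. A positive `beta` steers the selection toward a more compact set.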
Tracking Atoms with Particles for Audio-Visual Source Localization
Cited by 1 (0 self)
We present a general framework and an efficient algorithm for tracking relevant video structures. The structures to be tracked are implicitly defined by a Matching Pursuit procedure that extracts and ranks the most important image contours. Based on the ranking, the contours are automatically selected to initialize a Particle Filtering tracker. The proposed algorithm deals with salient video entities whose behavior has an intuitive meaning, related to the physics of the signal. Moreover, as the interactions between such structures are easily defined, the inference of higher level signal configurations can be made intuitive. The proposed algorithm improves the performance of existing video structure trackers, while reducing the computational complexity. The algorithm is demonstrated on audiovisual source localization. Index Terms: Video signal processing, tracking, feature extraction, audiovisual processing.
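The Matching Pursuit step mentioned in the abstract above extracts atoms one at a time, which also yields a natural ranking by order of selection. The Python sketch below shows the generic greedy procedure on a one-dimensional signal with a small dictionary of unit-norm atoms. The paper applies the idea to images with geometric atoms and follows it with a particle filter, neither of which is shown here; the function names are illustrative.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def matching_pursuit(signal, dictionary, n_atoms):
    """Greedy sparse decomposition of `signal` over unit-norm atoms.

    Returns (picks, residual), where picks is a list of
    (atom_index, coefficient) pairs in order of selection, so earlier
    entries correspond to the structures that capture the most energy.
    """
    residual = list(signal)
    picks = []
    for _ in range(n_atoms):
        # Project the current residual onto every atom.
        coeffs = [dot(residual, atom) for atom in dictionary]
        # Keep the atom with the largest absolute projection.
        best = max(range(len(dictionary)), key=lambda i: abs(coeffs[i]))
        picks.append((best, coeffs[best]))
        # Remove that atom's contribution from the residual.
        residual = [r - coeffs[best] * a
                    for r, a in zip(residual, dictionary[best])]
    return picks, residual
```

In a tracking setting like the one described, the first few picks would be the natural candidates to initialize a tracker, since they correspond to the most salient structures in the frame.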

