Data Fusion and Multicue Data Matching by Diffusion Maps
- IEEE Transactions on Pattern Analysis and Machine Intelligence, 2006
Cited by 22 (2 self)
Data fusion and multi-cue data matching are fundamental tasks of high-dimensional data analysis. In this paper, we apply the recently introduced diffusion framework to address these tasks. Our contribution is three-fold. First, we present the Laplace-Beltrami approach for computing density-invariant embeddings, which are essential for integrating different sources of data. Second, we describe a refinement of the Nyström extension algorithm called “geometric harmonics”. We also explain how to use this tool for data assimilation. Finally, we introduce a multi-cue data matching scheme based on nonlinear spectral graph alignment. The effectiveness of the presented schemes is validated by applying them to the problems of lip-reading and image sequence alignment.
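As an illustration of what a density-invariant diffusion embedding can look like, here is a minimal sketch. It assumes a Gaussian kernel and the alpha = 1 kernel normalization commonly associated with the Laplace-Beltrami case in the diffusion maps literature; the function name `diffusion_map` and its parameters are hypothetical, and this is not necessarily the exact procedure of the paper.

```python
import numpy as np

def diffusion_map(points, epsilon, n_components=2, t=1):
    """Sketch of a diffusion-map embedding with alpha = 1 density normalization.

    Assumptions (not taken from the paper): Gaussian kernel with scale
    `epsilon`, dense matrices, and a small data set.
    """
    x = np.asarray(points, dtype=float)
    # Pairwise squared Euclidean distances.
    sq_dist = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    kernel = np.exp(-sq_dist / epsilon)

    # alpha = 1 normalization: divide out the kernel density estimate q
    # so that the sampling density does not shape the embedding.
    q = kernel.sum(axis=1)
    kernel_tilde = kernel / np.outer(q, q)

    # The Markov matrix is P = D^-1 K~. Work with its symmetric conjugate
    # S = D^-1/2 K~ D^-1/2, which has the same eigenvalues.
    d = kernel_tilde.sum(axis=1)
    sym = kernel_tilde / np.sqrt(np.outer(d, d))
    evals, evecs = np.linalg.eigh(sym)

    # Sort eigenpairs in descending order of eigenvalue.
    order = np.argsort(evals)[::-1]
    evals, evecs = evals[order], evecs[:, order]

    # Right eigenvectors of P are D^-1/2 times the eigenvectors of S.
    psi = evecs / np.sqrt(d)[:, None]

    # Drop the trivial constant eigenvector (eigenvalue 1).
    keep = slice(1, n_components + 1)
    return evals[keep], psi[:, keep] * evals[keep] ** t
```

Calling `diffusion_map(points, epsilon=0.5)` returns the leading nontrivial eigenvalues and an `(n_points, n_components)` array of diffusion coordinates. The choice of `epsilon` is data dependent and is left to the caller.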
Articulatory Features for Robust Visual Speech Recognition
, 2004
Cited by 12 (4 self)
Visual information has been shown to improve the performance of speech recognition systems in noisy acoustic environments. However, most audio-visual speech recognizers rely on a clean visual signal. In this paper, we explore a novel approach to visual speech modeling, based on articulatory features, which has potential benefits under visually challenging conditions. The idea is to use a set of parallel SVM classifiers to extract different articulatory attributes from the input images, and then combine their decisions to obtain higher-level units, such as visemes or words. We evaluate our approach in a preliminary experiment on a small audio-visual database, using several image noise conditions, and compare it to the standard viseme-based modeling approach.
Real-time face detection and lip feature extraction using field-programmable gate arrays
- IEEE Trans. Syst. Man Cybern. B, Cybern, 2006
Cited by 5 (0 self)
This paper proposes a new technique for face detection and lip feature extraction. A real-time field-programmable gate array (FPGA) implementation of the two proposed techniques is also presented. Face detection is based on a naive Bayes classifier that classifies an edge-extracted representation of an image. Using an edge representation significantly reduces the model’s size to only 5184 B, which is 2417 times smaller than a comparable statistical modeling technique, while achieving an 86.6% correct detection rate under various lighting conditions. Lip feature extraction uses the contrast around the lip contour to extract the height and width of the mouth, metrics that are useful for speech filtering. The proposed FPGA system occupies only 15 050 logic cells, or about six times fewer than a current comparable FPGA face detection system. Index Terms: edge information, face detection, field-programmable gate array (FPGA), lip feature extraction, lip
Ricci flow for 3D shape analysis
- In Proceedings of ICCV ’07, 2007
Cited by 5 (3 self)
Ricci flow is a powerful curvature flow method in geometric analysis. This work is the first application of surface Ricci flow in computer vision. We show that previous methods based on conformal geometries, such as harmonic maps and least-square conformal maps, which can only handle 3D shapes with simple topology, are subsumed by our Ricci flow based method, which can handle surfaces with arbitrary topology. Because the Ricci flow method is intrinsic and depends on the surface metric only, it is invariant to rigid motion, scaling, and isometric and conformal deformations. The solution to Ricci flow is unique and its computation is robust to noise. Our Ricci flow based method can convert all 3D problems into 2D domains and offers a general framework for 3D surface analysis. Large non-rigid deformations can be registered with feature constraints; hence we introduce a method that constrains Ricci flow computation using feature points and feature curves. Finally, we demonstrate the applicability of this intrinsic shape representation through standard shape analysis problems, such as 3D shape matching and registration.
Reconstructing tongue movements from audio and video
- in Interspeech, 2006
Cited by 4 (3 self)
This paper presents an approach to articulatory inversion using audio and video of the user’s face, requiring no special markers. The video is stabilized with respect to the face, and the mouth region is cropped out. The mouth image is projected into a learned independent component subspace to obtain a low-dimensional representation of the mouth appearance. The inversion problem is treated as one of regression; a non-linear regressor using relevance vector machines is trained with a dataset of simultaneous images of a subject’s face, acoustic features and positions of magnetic coils glued to the subject’s tongue. The results show the benefit of using both cues for inversion. We envisage the inversion method to be part of a pronunciation training system with articulatory feedback. Index Terms: audio-visual to articulatory inversion.
Facial Analysis and Synthesis
- Vrije Universiteit Brussel, Dept, 2006
Cited by 2 (1 self)
To my son to remind me of my dreams; to my husband to support me in pursuing my dreams; to my mother to guide me towards my dreams; to my family and friends to tell me to believe in my dreams; to my colleagues to help me realize my dreams on professional level.
Function-Based Classification from 3D Data and Audio
- Proceedings of the IEEE/RSJ Int. Conf. on Intelligent Robots and Systems, 2006
Cited by 2 (0 self)
We propose a novel scheme for fusion between two types of modalities to support function-based classification. While the first modality targets functional classification from sounds registered at impact, the second one aims at classification of objects in 3D images. Using audio, one can answer functional questions such as what material the analyzed objects are built of, whether the objects are full or hollow, whether they are heavy, and whether they are rigidly linked to their supports. Audio-based signatures are used to label parts of the object under analysis. Different parts of any object can be partitioned into generic multi-level hierarchical descriptions of functional components. Functionality, in the visual modality reasoning scheme, is derived from a large set of geometric attributes and relationships between object parts. These geometric properties serve as labeling signatures for the primitive and functional parts of the analyzed classes. The fusion between the two modalities relies on shared cooperation among audio and visual signatures of the functional and primitive parts. The scheme does not require a priori knowledge about any class. We tested the proposed scheme on a database of about one thousand different 3D objects. The results show high accuracy in classification.
Fitting a single active appearance model simultaneously to multiple images
- In Proceedings of the British Machine Vision Conference, 2004
Cited by 2 (1 self)
Active Appearance Models (AAMs) are a well-studied 2D deformable model. One recently proposed extension of AAMs to multiple images is the Coupled-View AAM. Coupled-View AAMs model the 2D shape and appearance of a face in two or more views simultaneously. The major limitation of Coupled-View AAMs, however, is that they are specific to a particular set of cameras, both in geometry and in photometric response. In this paper, we describe how a single AAM can be fit to multiple images, captured simultaneously by cameras with arbitrary geometry and response functions. Our algorithm retains the major benefits of Coupled-View AAMs: the integration of information from multiple images into a single model, and improved fitting robustness.
Robust Lip-Tracking using Rigid Flocks of Selected Linear Predictors
Cited by 2 (1 self)
This paper proposes a learnt data-driven approach to the accurate, real-time tracking of lip shapes using only intensity information, i.e. grey-scale images. This has the advantage that constraints such as a priori shape models or temporal models for dynamics are not required or used. Tracking the lip shape is simply the independent tracking of a set of points that lie on the lip’s contour. This allows us to cope with different lip shapes that were not present in the training data and performs as well as other approaches that have pre-learnt shape models such as the AAM. Tracking is achieved via linear predictors (LPs), where each linear predictor essentially linearly maps sparse template difference vectors to tracked feature position displacements. Multiple linear predictors are grouped into a rigid flock to obtain increased robustness. To achieve accurate tracking, two approaches are proposed for selecting relevant sets of LPs within each flock. Analysis of the selection results shows that the LPs selected for tracking a feature point choose areas that are strongly correlated with that of the tracked target and that these areas are not necessarily the region around the feature point as is commonly assumed in LK based approaches. Experimental results also show that this method is comparable in performance to that of AAMs, despite being much simpler, both in the training and tracking phases, without any a priori shape information and with minimal training examples.
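The linear mapping from template difference vectors to displacements described above can be sketched as an ordinary least-squares fit. This is only an illustrative sketch: the function names are hypothetical, the training data is assumed to be given as matrices, and combining a flock by averaging its predictions is an assumption rather than a detail stated in the abstract.

```python
import numpy as np

def train_linear_predictor(diffs, displacements):
    """Fit H so that H @ diff approximates the 2D displacement.

    diffs:         (n_samples, k) template intensity difference vectors
    displacements: (n_samples, 2) known displacements for each sample
    Plain least squares is an assumption made for this sketch.
    """
    h_transposed, *_ = np.linalg.lstsq(diffs, displacements, rcond=None)
    return h_transposed.T  # shape (2, k)

def predict_displacement(h, diff):
    """Apply one linear predictor to a single difference vector."""
    return h @ diff

def flock_displacement(predictors, diffs):
    """Combine a flock of predictors by averaging their predictions
    (the averaging rule is an assumption)."""
    return np.mean([h @ d for h, d in zip(predictors, diffs)], axis=0)
```

In this sketch each predictor sees its own difference vector, so a flock is simply a list of `(2, k)` matrices paired with a list of length-`k` vectors.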
Use of Vertical Face Profiles for Text Dependent Audio-Visual Biometric Person
, 2002
Cited by 1 (1 self)
In this paper, a technique is proposed for text dependent audio-visual biometric person authentication using Dynamic Time Warping (DTW). A combination of features derived from video and audio is used as a representation of the utterance. The use of mid-face vertical intensity profiles as a representation of facial movements associated with the utterance is proposed. The time varying profiles are extracted from the video sequence after detecting the face using a motion guided template matching technique. The visual feature is combined with Linear Prediction Cepstral Coefficients (LPCC) extracted from the speech waveform to obtain a temporally synchronous joint audio-visual feature. The joint audio-visual feature sequences are matched using the DTW algorithm to obtain the distances between the test and the reference utterances. The performance of the system is evaluated for a database of 25 speakers, and the results are discussed.
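The matching step described above can be illustrated with a textbook Dynamic Time Warping routine. This is a minimal sketch, assuming Euclidean frame-to-frame distance, no warping-window constraint, and visual and audio frames that are already synchronized; the function names are hypothetical and the paper's actual feature extraction (vertical intensity profiles, LPCC) is not reproduced.

```python
import math

def joint_features(visual_frames, audio_frames):
    """Concatenate per-frame visual and audio vectors into joint vectors.
    Assumes both sequences have one vector per synchronized frame."""
    return [list(v) + list(a) for v, a in zip(visual_frames, audio_frames)]

def dtw_distance(seq_a, seq_b):
    """Accumulated DTW cost between two sequences of feature vectors."""
    n, m = len(seq_a), len(seq_b)
    # cost[i][j] is the minimal accumulated cost of aligning
    # the first i frames of seq_a with the first j frames of seq_b.
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # stretch seq_b
                cost[i][j - 1],      # stretch seq_a
                cost[i - 1][j - 1],  # advance both
            )
    return cost[n][m]
```

A test utterance would then be compared against each reference utterance with `dtw_distance`, and the resulting distances used for the accept or reject decision; the decision rule itself is outside this sketch.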

