Results 1 - 10 of 437
Audio-Visual Clustering for Multiple Speaker Localization
"... Abstract. We address the issue of identifying and localizing individuals in a scene that contains several people engaged in conversation. We use a human-like configuration of sensors (binaural and binocular) to gather both auditory and visual observations. We show that the identification and localization problem can be recast as the task of clustering the audio-visual observations into coherent groups. We propose a probabilistic generative model that captures the relations between audio and visual observations. This model maps the data to a representation of the common 3D scene-space, via a pair ..."
Cited by 2 (0 self)
Detection and Localization of 3D Audio-Visual Objects Using Unsupervised Clustering
"... This paper addresses the issues of detecting and localizing objects in a scene that are both seen and heard. We explain the benefits of a human-like configuration of sensors (binaural and binocular) for gathering auditory and visual observations. It is shown that the detection and localization problem can be recast as the task of clustering the audio-visual observations into coherent groups. We propose a probabilistic generative model that captures the relations between audio and visual observations. This model maps the data into a common audio-visual 3D representation via a pair of mixture ..."
Cited by 10 (6 self)
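The joint clustering idea in the two entries above can be illustrated in miniature: a single set of latent source positions explains observations in both modalities, and EM fits them jointly. Everything below (1-D positions, Gaussian noise levels, function names) is an illustrative assumption, not the papers' actual model, which ties a pair of mixture models to 3D scene positions.

```python
# Toy sketch: audio and visual observations generated from shared 1-D
# speaker positions are clustered jointly with EM.  All parameters and
# names here are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)

def em_av_clustering(audio, video, k, iters=60):
    """Jointly cluster two observation streams that share latent
    source positions, via EM on two tied Gaussian mixtures."""
    mu = np.quantile(audio, (np.arange(k) + 0.5) / k)  # shared positions
    sa = sv = 1.0                                      # per-modality noise
    for _ in range(iters):
        # E-step: responsibility of each cluster for each observation
        ra = np.exp(-0.5 * ((audio[:, None] - mu) / sa) ** 2)
        rv = np.exp(-0.5 * ((video[:, None] - mu) / sv) ** 2)
        ra /= ra.sum(axis=1, keepdims=True)
        rv /= rv.sum(axis=1, keepdims=True)
        # M-step: each position must explain BOTH modalities at once
        mu = (ra.T @ audio + rv.T @ video) / (ra.sum(axis=0) + rv.sum(axis=0))
        sa = np.sqrt((ra * (audio[:, None] - mu) ** 2).sum() / ra.sum())
        sv = np.sqrt((rv * (video[:, None] - mu) ** 2).sum() / rv.sum())
    return np.sort(mu)

# Two speakers at positions -2 and +3; audio is noisier than video.
true_pos = np.array([-2.0, 3.0])
audio = np.concatenate([p + 0.8 * rng.standard_normal(200) for p in true_pos])
video = np.concatenate([p + 0.2 * rng.standard_normal(200) for p in true_pos])
print(em_av_clustering(audio, video, k=2))   # close to [-2, 3]
```

Because the cluster positions are shared, the precise modality sharpens the noisy one, which is the intuition behind casting detection and localization as audio-visual clustering.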
Autonomous learning of perceptual categories and symbolic protocols from audio-visual input
- Submitted to the British Machine Vision Conference
, 2004
"... The development of cognitive vision systems that autonomously learn how to interact with their environment through input from sensors is a major challenge for the Computer Vision, Machine Learning and Artificial Intelligence communities. This paper presents a framework in which a symbolic inference engine is integrated with a perceptual system. The rules of the inference engine are learned from audio-visual observation of the world, to form an interactive perceptual agent that can respond suitably to its environment. From the video and audio input streams interesting objects are identified using ..."
A coupled HMM for audio-visual speech recognition
- in International Conference on Acoustics, Speech and Signal Processing (ICASSP '02)
, 2002
"... In recent years several speech recognition systems that use visual together with audio information have shown a significant increase in performance over standard speech recognition systems. The use of visual features is justified by both the bimodality of speech generation and by the need for features that are invariant to acoustic noise perturbation. The audio-visual speech recognition system presented in this paper introduces a novel audio-visual fusion technique that uses a coupled hidden Markov model (HMM). The statistical properties of the coupled HMM allow us to model the state asynchrony ..."
Cited by 42 (2 self)
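The coupling described above can be sketched with a forward pass over the joint state lattice: each chain's next state depends on the previous states of both chains, which is what lets the audio and visual state sequences drift out of sync while remaining coupled. The transition and emission tables below are random placeholders for illustration, not a trained model.

```python
# Minimal coupled-HMM forward algorithm (illustrative, untrained).
import numpy as np

rng = np.random.default_rng(1)

def normalize(x, axis=-1):
    return x / x.sum(axis=axis, keepdims=True)

na, nv = 3, 3                      # states per chain (assumed sizes)
# Ta[i, j, k] = p(audio_t = k | audio_{t-1} = i, video_{t-1} = j)
Ta = normalize(rng.random((na, nv, na)))
Tv = normalize(rng.random((na, nv, nv)))
pi = normalize(rng.random((na, nv)), axis=None)   # joint initial dist.
Ea = normalize(rng.random((na, 4)))               # 4 audio symbols
Ev = normalize(rng.random((nv, 4)))               # 4 visual symbols

def loglik(obs_a, obs_v):
    """Forward algorithm on the joint (audio, video) state lattice."""
    alpha = pi * np.outer(Ea[:, obs_a[0]], Ev[:, obs_v[0]])
    ll = np.log(alpha.sum()); alpha /= alpha.sum()
    for oa, ov in zip(obs_a[1:], obs_v[1:]):
        # the joint transition factorizes into the two coupled chains
        alpha = np.einsum('ij,ijk,ijl->kl', alpha, Ta, Tv)
        alpha *= np.outer(Ea[:, oa], Ev[:, ov])
        ll += np.log(alpha.sum()); alpha /= alpha.sum()
    return ll

print(loglik([0, 1, 2, 1], [0, 0, 1, 2]))   # sequence log-likelihood
```

Per-frame normalization of alpha keeps the recursion numerically stable while accumulating the log-likelihood.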
Detection and Localization of 3D Audio-Visual Objects Using Unsupervised Clustering
- Author manuscript, published in ACM/IEEE International Conference on Multimodal Interfaces (2008)
, 2009
"... This paper addresses the issues of detecting and localizing objects in a scene that are both seen and heard. We explain the benefits of a human-like configuration of sensors (binaural and binocular) for gathering auditory and visual observations. It is shown that the detection and localization problem can be recast as the task of clustering the audio-visual observations into coherent groups. We propose a probabilistic generative model that captures the relations between audio and visual observations. This model maps the data into a common audio-visual 3D representation via a pair of mixture ..."
Asynchrony Modeling for Audio-Visual Speech Recognition
, 2002
"... We investigate the use of multi-stream HMMs in the automatic recognition of audio-visual speech. Multi-stream HMMs allow the modeling of asynchrony between the audio and visual state sequences at a variety of levels (phone, syllable, word, etc.) and are equivalent to product, or composite, HMMs. In ..."
Cited by 33 (2 self)
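The equivalence to product (composite) HMMs noted in the abstract above can be shown concretely: pairing the states of two independent left-to-right chains yields a composite HMM whose transition matrix is the Kronecker product of the per-stream matrices, and the paired states allow the streams to occupy different positions at the same time. The 2- and 3-state chains below are toy examples, not the paper's models.

```python
# Sketch of the multi-stream HMM / product HMM equivalence.
import numpy as np

A_audio = np.array([[0.9, 0.1],
                    [0.0, 1.0]])             # 2-state left-to-right chain
A_video = np.array([[0.8, 0.2, 0.0],
                    [0.0, 0.7, 0.3],
                    [0.0, 0.0, 1.0]])        # 3-state left-to-right chain

# Composite state (i, j) pairs an audio state with a visual state, so the
# streams can sit in different states at the same frame (asynchrony).
A_product = np.kron(A_audio, A_video)        # shape (6, 6)
print(A_product.shape)                       # (6, 6)
assert np.allclose(A_product.sum(axis=1), 1.0)  # still row-stochastic
```

Constraining which (i, j) pairs are reachable is one way to limit asynchrony to a chosen level (phone, syllable, word).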
Stream Confidence Estimation For Audio-Visual Speech Recognition
- in Proc. of ICSLP 2000, Beijing
, 2000
"... We investigate the use of single modality confidence measures as a means of estimating adaptive, local weights for improved audio-visual automatic speech recognition. We limit our work to the toy problem of audio-visual phonetic classification by means of a two-stream Gaussian mixture model (GMM), ..."
Cited by 28 (3 self)
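Confidence-driven stream weighting of the kind described above can be sketched as follows. The entropy-based confidence measure and the class log-likelihood numbers are illustrative assumptions, not the paper's estimator or trained GMMs; the point is only that a peaky (confident) stream earns a larger local weight.

```python
# Hedged sketch of confidence-weighted two-stream classification.
import numpy as np

def posteriors(loglik):
    p = np.exp(loglik - loglik.max())
    return p / p.sum()

def entropy_confidence(loglik):
    """Low-entropy (peaky) posterior => confident stream => weight near 1."""
    p = posteriors(loglik)
    h = -(p * np.log(p + 1e-12)).sum()
    return 1.0 - h / np.log(len(p))          # normalized to [0, 1]

def classify(loglik_a, loglik_v):
    ca, cv = entropy_confidence(loglik_a), entropy_confidence(loglik_v)
    lam = ca / (ca + cv)                      # adaptive, local audio weight
    return int(np.argmax(lam * loglik_a + (1 - lam) * loglik_v))

# Audio stream is confused (flat posterior); video clearly prefers class 2.
print(classify(np.array([-3.0, -3.1, -3.05]),
               np.array([-8.0, -7.5, -1.0])))   # -> 2
```

Because the weight is recomputed per observation, the combination adapts locally, e.g. shifting toward the visual stream in acoustically noisy frames.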
Audio-Visual Speaker Tracking with Importance Particle Filters
- in Proc. IEEE Int. Conf. on Image Processing (ICIP)
, 2003
"... We present a probabilistic method for audio-visual (AV) speaker tracking, using an uncalibrated wide-angle camera and a microphone array. The algorithm fuses 2-D object shape and audio information via importance particle filters (I-PFs), allowing for the asymmetrical integration of AV information in ..."
Cited by 29 (4 self)
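The asymmetric fusion described above can be sketched in one dimension: most particles follow the motion model, but a fraction is drawn from an audio localization cue, so sound can pull the tracker back when the visual likelihood loses the speaker. This is a simplified illustration that omits the full importance-correction of the weights for the mixed proposal; all constants and noise models are invented, not the paper's.

```python
# 1-D toy of audio-driven importance sampling in a particle filter.
import numpy as np

rng = np.random.default_rng(2)

N, FRAMES = 200, 40
true_x = 0.0
particles = rng.normal(0.0, 1.0, N)

def gauss(x, mu, s):
    return np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))

for t in range(FRAMES):
    true_x += 0.3                                   # speaker drifts right
    audio_obs = true_x + rng.normal(0, 0.5)         # coarse audio bearing
    video_obs = true_x + rng.normal(0, 0.1)         # precise visual cue

    # Proposal: 80% of particles from the motion model,
    # 20% from the audio importance function.
    move = rng.random(N) < 0.8
    particles = np.where(move,
                         particles + rng.normal(0, 0.2, N),
                         rng.normal(audio_obs, 0.5, N))

    # Weight by the visual likelihood, then resample.
    weights = gauss(particles, video_obs, 0.1) + 1e-300
    weights /= weights.sum()
    particles = rng.choice(particles, size=N, p=weights)

print(particles.mean(), "vs true", true_x)   # estimate tracks the speaker
```

Injecting audio-proposed particles keeps the filter recoverable after visual failures, which is the motivation for the asymmetric treatment of the two modalities.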
Speaker independent audio-visual continuous speech recognition
- In International Conference on Multimedia and Expo
, 2002
"... The increase in the number of multimedia applications that require robust speech recognition systems has generated large interest in the study of audio-visual speech recognition (AVSR) systems. The use of visual features in AVSR is justified by both the audio and visual modality of speech generation ..."
Cited by 16 (3 self)
Detection of documentary scene changes by audio-visual fusion
- In Proc. CIVR
, 2004
"... Abstract. The concept of a documentary scene was inferred from the audio-visual characteristics of certain documentary videos. It was observed that the amount of information from the visual component alone was not enough to convey a semantic context to most portions of these videos, but a joint obs ..."
Cited by 3 (0 self)