Multimodal Integration - A Statistical View
IEEE Transactions on Multimedia, 1999
Cited by 40 (11 self)
This paper presents a statistical approach to developing multimodal recognition systems and, in particular, to integrating the posterior probabilities of parallel input signals involved in the multimodal system. We first identify the primary factors that influence multimodal recognition performance by evaluating the multimodal recognition probabilities. We then develop two techniques, an estimate approach and a learning approach, which are designed to optimize accurate recognition during the multimodal integration process. We evaluate these methods using Quickset, a speech/gesture multimodal system, and report evaluation results based on an empirical corpus collected with Quickset. From an architectural perspective, the integration technique presented here offers enhanced robustness. It also is premised on more realistic assumptions than previous multimodal systems using semantic fusion. From a methodological standpoint, the evaluation techniques that we describe provide a valuable too...
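As a rough illustration of what integrating the posteriors of parallel recognizers can look like, the sketch below multiplies per-modality posteriors and renormalizes. It assumes the modalities are conditionally independent given the label and that the prior over labels is uniform. This is a generic baseline, not the estimate or learning approach of the paper, whose details the abstract does not give. The function name, labels, and probabilities are hypothetical.

```python
def fuse_posteriors(speech, gesture):
    """Fuse two posterior distributions defined over the same labels.

    Assumption (not from the paper): the modalities are conditionally
    independent given the label and the label prior is uniform, so the
    fused posterior is proportional to the product of the two inputs.
    """
    labels = speech.keys() & gesture.keys()
    if not labels:
        raise ValueError("the two modalities share no labels")
    scores = {label: speech[label] * gesture[label] for label in labels}
    total = sum(scores.values())
    if total == 0:
        raise ValueError("no label has nonzero probability in both modalities")
    # Renormalize so the fused scores form a probability distribution.
    return {label: score / total for label, score in scores.items()}


# Hypothetical recognizer outputs for two candidate commands.
speech_posterior = {"pan": 0.6, "zoom": 0.4}
gesture_posterior = {"pan": 0.3, "zoom": 0.7}
fused_posterior = fuse_posteriors(speech_posterior, gesture_posterior)
print(max(fused_posterior, key=fused_posterior.get), fused_posterior)
```

In this toy case the gesture evidence outweighs the speech evidence, so the fused decision differs from the speech-only decision.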
Visual Tracking for Multimodal Human Computer Interaction
1996
Cited by 33 (6 self)
In this paper, we present visual tracking techniques for multimodal human computer interaction. First, we discuss techniques for tracking human faces in which human skin-color is used as a major feature. An adaptive stochastic model has been developed to characterize the skin-color distributions. Based on the maximum likelihood method, the model parameters can be adapted for different people and different lighting conditions. The feasibility of the model has been demonstrated by the development of a real-time face tracker. The system has achieved a rate of 30+ frames/second using a low-end workstation with a framegrabber and a camera. We also present a top-down approach for tracking facial features such as eyes, nostrils, and lip corners. These real-time visual tracking techniques have been successfully applied to many applications such as gaze tracking, and lipreading. The face tracker has been combined with a microphone array for extracting speech signal from a specific person. The g...
See Me, Hear Me: Integrating Automatic Speech Recognition And Lip-Reading
Proc. Int. Conf. Spoken Lang. Process, 1994
Cited by 32 (6 self)
We present recent work on integration of visual information (automatic lip-reading) with acoustic speech for better overall speech recognition. A Multi-State Time Delay Neural Network performs the recognition of spelled letter sequences taking advantage of lip images from a standard camera. The problems addressed include efficient but effective representation of the visual information and optimum manner of combining the two modalities when rendering a decision. We show results for several alternatives to direct gray level image as the visual evidence. These are: Principal Components, Linear Discriminants, and DFT coefficients. Dimensionality of the input is decreased by a factor of 12 while maintaining recognition rates. Combination of the visual and acoustic information is performed at three different levels of abstraction. Results suggest that integration of higher order input features works best. On a continuous spelling task, visual-alone recognition of 45-55%, when combined with a...
Toward Movement-Invariant Automatic Lip-Reading and Speech Recognition
Proc. of IEEE Int'l Conf. on Acoustics, Speech and Signal Processing, 1995
Cited by 26 (4 self)
We present the development of a modular system for flexible human–computer interaction via speech. The speech recognition component integrates acoustic and visual information (automatic lip-reading) improving overall recognition, especially in noisy environments. The image of the lips, constituting the visual input, is automatically extracted from the camera picture of the speaker’s face by the lip locator module. Finally, the speaker’s face is automatically acquired and followed by the face tracker sub-system. Integration of the three functions results in the first bi-modal speech recognizer allowing the speaker reasonable freedom of movement within a possibly noisy room while continuing to communicate with the computer via voice. Compared to audio-alone recognition, the combined system achieves a 20 to 50 percent error rate reduction for various signal/noise conditions.
Dynamic Bayesian Networks for Information Fusion with Applications to Human-Computer Interfaces
1999
Cited by 23 (1 self)
Recent advances in various display and virtual technologies coupled with an explosion in available computing power have given rise to a number of novel human-computer interaction (HCI) modalities -- speech, vision-based gesture recognition, eye tracking, EEG, etc. However, despite the abundance of novel interaction devices, the naturalness and efficiency of HCI has remained low. This is due in particular to the lack of robust sensory data interpretation techniques. To deal with the task of interpreting single and multiple interaction modalities this dissertation establishes a novel probabilistic approach based on dynamic Bayesian networks (DBNs). As a generalization of the successful hidden Markov models, DBNs are a natural basis for the general temporal action interpretation task. The problem of interpretation of single or multiple interacting modalities can then be viewed as a Bayesian inference task. In this work three complex DBN models are introduced: mixtures of DBNs, mixed-state DBNs, and coupled HMMs. In-depth study of these models yields efficient approximate inference and parameter learning techniques applicable to a wide variety of problems. Experimental validation of the proposed approaches in the domains of gesture and speech recognition confirms the model's applicability to both unimodal and multimodal interpretation tasks.
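To make the "interpretation as Bayesian inference" idea concrete, here is a minimal sketch of the forward algorithm for a discrete hidden Markov model, the simplest special case that DBNs generalize. It is not one of the dissertation's models (mixtures of DBNs, mixed-state DBNs, coupled HMMs). The function name and all parameter values are hypothetical.

```python
def forward_likelihood(init, trans, emit, observations):
    """Return P(observations) under a discrete HMM (forward algorithm).

    init[i]     : probability of starting in hidden state i
    trans[i][j] : probability of moving from state i to state j
    emit[i][o]  : probability that state i emits symbol o
    All values used below are illustrative assumptions.
    """
    if not observations:
        return 1.0
    n_states = len(init)
    # alpha[i] = P(observations so far, current state = i)
    alpha = [init[i] * emit[i][observations[0]] for i in range(n_states)]
    for symbol in observations[1:]:
        alpha = [
            emit[j][symbol] * sum(alpha[i] * trans[i][j] for i in range(n_states))
            for j in range(n_states)
        ]
    # Marginalize over the final hidden state.
    return sum(alpha)


# Hypothetical two-state, two-symbol model.
init = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.9, 0.1], [0.2, 0.8]]
print(forward_likelihood(init, trans, emit, [0, 1, 0]))
```

In a recognition setting one such model would typically be trained per gesture or word class, and the class whose model assigns the highest likelihood to the observed sequence would be chosen.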
A platform for developing Intelligent MultiMedia Applications
1998
Cited by 12 (10 self)
Intelligent multimedia (IntelliMedia), which involves the computer processing and understanding of perceptual input from at least speech, text and visual images, and then reacting to it, is complex and involves signal and symbol processing techniques.
Multimodal Man-Machine Interface for Mission Planning
In Proceedings of the AAAI Spring Symposium on Intelligent Environments, 1998
Cited by 7 (3 self)
This paper presents a multimodal interface featuring fusion of multiple modalities for natural human-computer interaction. The architecture of the interface and the methods applied are described, and the results of the real-time multimodal fusion are analyzed. The research in progress concerning a mission planning scenario is discussed and other possible future directions are also presented.
Keywords: Multimodal interfaces, speech recognition, microphone array, force-feedback tactile glove, gaze tracking, military maps
INTRODUCTION
Current human-machine communication systems predominantly use keyboard and mouse inputs that inadequately approximate human abilities for communication. More natural communication technologies such as speech, sight and touch, are capable of freeing computer users from the constraints of keyboard and mouse. Although they are not sufficiently advanced to be used individually for robust human-machine communication, they have adequately advanced to serve simul...
Modeling and Interpreting Multimodal Inputs: A Semantic Integration Approach
1997
Cited by 7 (1 self)
Modern user interfaces can take advantage of multiple input modalities such as speech, gestures, handwriting... to increase robustness and flexibility. The construction of such multimodal interfaces would be greatly facilitated by a unified framework that provides methods to characterize and interpret multimodal inputs. In this paper we describe a semantic model and a multimodal grammar structure for a broad class of multimodal applications. We also present a set of grammar-based Java tools that facilitate the construction of multimodal input processing modules, including a connectionist network for multimodal semantic integration.
An Architecture for Multimodal Information Fusion
Proceedings of the Workshop on Perceptual User Interfaces (PUI’97), 1997
Cited by 6 (0 self)
This paper presents a multimodal interface featuring fusion of multiple modalities for natural human-computer interaction. The architecture of the interface and the methods applied are described, and the results of the real-time multimodal fusion are analyzed. The research in progress concerning a mission planning scenario is discussed and other possible future directions are also presented.
1 Introduction
Current human/machine communication systems predominantly use keyboard and mouse inputs that inadequately approximate human abilities for communication. More natural communication technologies such as speech, sight and touch, are capable of freeing computer users from the keyboard and mouse. Although they are not sufficiently advanced to be used individually for robust human/machine communication, they have adequately advanced to serve simultaneous multisensory information exchange [2], [6]. The challenge is to properly combine these technologies to replicate the natural style of h...

