Gaze Tracking Based on Face-Color
- In International Workshop on Automatic Face- and Gesture-Recognition
, 1995
Cited by 74 (9 self)
In many practical situations, a desirable user interface to a computer system should have a model of where a person is looking and what he/she is paying attention to. This is particularly important if a system is providing multimodal communication cues (speech, gesture, lipreading, etc.) [2, 3, 8] and the system must identify whether the cues are aimed at it or at someone else in the room. This paper describes a system that identifies user focus of attention by visually determining where a person is looking. While other attempts at gaze tracking usually assume a fixed or limited location of a person's face, the approach presented here allows for complete freedom of movement in a room. The gaze-tracking system uses several connectionist modules that track a person's face using a software-controlled pan-tilt camera with zoom and identify the focus of attention from the orientation and direction of the face. 1 Introduction One major impediment to user acceptance of speech inte...
Audio-Visual Integration In Multimodal Communication
- Proc. IEEE
, 1998
Cited by 54 (5 self)
In this paper, we review recent research that examines audio-visual integration in multimodal communication. The topics include bimodality in human speech, human and automated lip-reading, facial animation, lip synchronization, joint audio-video coding, and bimodal speaker verification. We also study the enabling technologies for these research topics, including automatic facial feature tracking and audio-to-visual mapping. Recent progress in audio-visual research shows that joint processing of audio and video provides advantages that are not available when the audio and video are processed independently. Keywords: Multimedia communication, Speech processing, Speech communication, Video signal processing, Image analysis 1. Introduction Multimedia is more than simply the combination of various forms of data: text, speech, audio, music, images, graphics, and video. When we discuss multimedia signal processing, it is the integration and interaction among these different media types t...
Extraction of Visual Features for Lipreading
- IEEE Transactions on Pattern Analysis and Machine Intelligence
, 2002
Cited by 36 (1 self)
The multi-modal nature of speech is often ignored in human-computer interaction, but lip deformation and other body movements, such as head and arm motion, all convey additional information. We integrate speech cues from many sources and this improves intelligibility, especially when the acoustic signal is degraded. This paper shows how this additional, often complementary, visual speech information can be used for speech recognition. Three methods for parameterising lip image sequences for recognition using hidden Markov models are compared. Two of these are top-down approaches that fit a model of the inner and outer lip contours and derive lipreading features from a principal component analysis of shape, or shape and appearance respectively. The third, bottom-up, method uses a non-linear scale-space analysis to form features directly from the pixel intensity. All methods are compared on a multi-talker visual speech recognition task of isolated letters.
Adaptive Bimodal Sensor Fusion For Automatic Speechreading
, 1996
Cited by 27 (2 self)
We present recent work on improving the performance of automated speech recognizers by using additional visual information (Lip-/Speechreading), achieving error reduction of up to 50%. This paper focuses on different methods of combining the visual and acoustic data to improve the recognition performance. We show this on an extension of an existing state-of-the-art speech recognition system, a modular MS-TDNN. We have developed adaptive combination methods at several levels of the recognition network. Additional information such as estimated signal-to-noise ratio (SNR) is used in some cases. The results of the different combination methods are shown for clean speech and data with artificial noise (white, music, motor). The new combination methods adapt automatically to varying noise conditions making hand-tuned parameters unnecessary. 1. INTRODUCTION Automated speech recognition systems still perform poorly in real-world applications. Most approaches are very sensitive to background n...
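The idea of weighting acoustic and visual evidence by an estimated SNR can be illustrated with a minimal sketch. This is not the combination method of the paper above, whose details are not given in the abstract: the linear ramp from SNR to weight, the thresholds `snr_low` and `snr_high`, and all function names are hypothetical choices made only for illustration.

```python
def acoustic_weight(snr_db, snr_low=0.0, snr_high=30.0):
    """Map an estimated SNR in dB to an acoustic weight in [0, 1].

    Hypothetical linear ramp: 0 at or below snr_low, 1 at or above snr_high.
    """
    if snr_high <= snr_low:
        raise ValueError("snr_high must be greater than snr_low")
    w = (snr_db - snr_low) / (snr_high - snr_low)
    return min(1.0, max(0.0, w))


def fuse_scores(acoustic, visual, snr_db):
    """Combine per-class log-scores as a weighted sum of the two modalities."""
    if len(acoustic) != len(visual):
        raise ValueError("score lists must have the same length")
    w = acoustic_weight(snr_db)
    return [w * a + (1.0 - w) * v for a, v in zip(acoustic, visual)]


def best_class(scores):
    """Return the index of the highest combined score."""
    return max(range(len(scores)), key=scores.__getitem__)
```

With clean audio the acoustic scores dominate the decision; as the estimated SNR drops, the same call shifts the decision toward the visual scores without any hand-tuned per-condition parameter.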
Eye Controlled Media: Present and Future State
, 1995
Cited by 21 (0 self)
Today, the human eye-gaze can be recorded by relatively unobtrusive techniques. This thesis argues that it is possible to use the eye-gaze of a computer user in the interface to aid the control of the application. Care must be taken, though, that eye-gaze tracking data is used in a sensible way, since the nature of human eye-movements is a combination of several voluntary and involuntary cognitive processes. The main reason for eye-gaze based user interfaces being attractive is that the direction of the eye-gaze can express the interests of the user -- it is a potential porthole into the current cognitive processes -- and communication through the direction of the eyes is faster than any other mode of human communication. It is argued that eye-gaze tracking data is best used in multimodal interfaces where the user interacts with the data instead of the interface, in so-called noncommand user interfaces. Furthermore, five usability criteria for eye-gaze media are given. This thesis also sugges...
Accurate, Real-Time, Unadorned Lip Tracking
- in ICCV
, 1998
Cited by 15 (0 self)
Human speech is inherently multi-modal, consisting of both audio and visual components. Recently, researchers have shown that the incorporation of information about the position of the lips into acoustic speech recognisers enables robust recognition of noisy speech. In the case of Hidden Markov Model recognition, we show that this happens because the visual signal stabilises the alignment of states. It is also shown that unadorned lips, both the inner and outer contours, can be robustly tracked in real time on general-purpose workstations. To accomplish this, efficient algorithms are employed which contain three key components: shape models, motion models, and focused colour feature detectors, all of which are learnt from examples.
Preprocessing Of Visual Speech Under Real World Conditions
- In Proceedings of European Tutorial & Research Workshop on Audio-Visual Speech Processing
, 1997
Cited by 10 (4 self)
In this paper we present recent work on integration of visual information (automatic lip-reading) with acoustic speech for better overall speech recognition. We have developed a modular system for flexible human-computer interaction via speech. In order to give the speaker reasonable freedom of movement within a room, the speaker's face is automatically acquired and followed by a face tracker subsystem, which delivers constant-size, centered images of the face in real time. The image of the lips is automatically extracted from the camera image of the speaker's face by the lip tracker module, which can track the lips in real time. Furthermore, we show how the system deals with problems in real environments such as different illuminations and image sizes, and how the system adapts automatically to different noise conditions. 1. INTRODUCTION Most approaches to automated speech recognition (ASR) that consider solely acoustic information are very sensitive to background noise or fail totall...
Statistical chromaticity models for lip tracking with B-splines
- In Proceedings of the First International Conference on Audio- and Video-based Biometric Person Authentication, Lecture Notes in Computer Science
, 1997
Cited by 10 (1 self)
A method for lip tracking intended to support personal verification is presented in this paper. Lip contours are represented by means of quadratic B-splines. The lips are automatically localised in the original image and an elliptic B-spline is generated to start up tracking. Lip localisation exploits grey-level gradient projections as well as chromaticity models to find the lips in an automatically segmented region corresponding to the face area. Tracking proceeds by estimating new lip contour positions according to a statistical chromaticity model for the lips. The current tracker implementation follows a deterministic second-order model for the spline motion based on a Lagrangian formulation of contour dynamics. The method has been tested on the M2VTS database [1]. Lips were accurately tracked on sequences consisting of more than a hundred frames. localisation 1 Introduction INT. CONF. ON AUDIO- AND VIDEO-BASED BIOMETRIC PERSON AUTHENTICATION, CRANS MONTANA, SWITZERLAND, 1997. Lip tr...
Improved Roi And Within Frame Discriminant Features For Lipreading
- Proc. International Conference on Image Processing, Thessaloniki, Greece
, 2001
Cited by 9 (3 self)
We study three aspects of designing appearance-based visual features for automatic lipreading: (a) the choice of the video region of interest (ROI), on which image transform features are obtained; (b) the extraction of speech discriminant features at each frame; and (c) the use of temporal information to improve visual speech modeling. In particular, with respect to (a), we propose a ROI that includes the speaker's jaw and cheeks, in addition to the traditionally used mouth/lip region; with respect to (b) and (c), we propose the use of a two-stage linear discriminant analysis, both within frame and across a large number of frames. On a large-vocabulary, continuous speech audio-visual database, the proposed visual features result in a 13% absolute reduction in visual-only word error rate over a baseline visual front end, and in an additional 28% relative improvement in audio-visual over audio-only phonetic classification accuracy.
Linear Discriminant Analysis For Speechreading
- Proc. Work. Multimedia Signal Process
, 1998
Cited by 8 (4 self)
This paper investigates the use of Fisher-Rao linear discriminant analysis (LDA) as a means of visual feature extraction for hidden Markov model based automatic speechreading. For every video frame, a three-dimensional region of interest containing the speaker's mouth over a sequence of adjacent frames is lexicographically arranged into a data vector. Such vectors are then projected onto the space of the most discriminant "eigensequences", estimated by means of LDA on a training set of image sequence vectors, labeled from a set of a priori chosen classes. The resulting projections, as well as their first and second derivatives over time, are used as features for automatic speechreading. The proposed method is applied to single-speaker, multi-speaker, and speaker-independent visual-only recognition tasks, consistently outperforming principal component analysis and discrete wavelet transform based visual features. Specific issues relevant to LDA are also discussed, namely, class selection, aut...
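The feature pipeline summarised in the abstract above (stack adjacent mouth-region frames into one vector, project it, then append time derivatives) can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the projection matrix is taken as given rather than estimated by LDA, edge frames are clamped, derivatives are approximated by central differences, and all function names are hypothetical.

```python
def stack_frames(frames, center, half_width):
    """Arrange a window of ROI frames into one vector, frame by frame, row by row.

    frames is a list of 2-D pixel grids (lists of rows). Indices outside the
    sequence are clamped to the first or last frame (an assumption).
    """
    vec = []
    for t in range(center - half_width, center + half_width + 1):
        t = min(max(t, 0), len(frames) - 1)
        for row in frames[t]:
            vec.extend(row)
    return vec


def project(matrix, vec):
    """Project vec onto the rows of matrix (assumed to be precomputed, e.g. by LDA)."""
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]


def deltas(seq):
    """Central-difference time derivative of a sequence of feature vectors."""
    n = len(seq)
    out = []
    for t in range(n):
        prev = seq[max(t - 1, 0)]
        nxt = seq[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out
```

Applying `deltas` twice gives an approximation of the second derivative, so a per-frame feature vector could be formed by concatenating the projection with `deltas(proj)[t]` and `deltas(deltas(proj))[t]`.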

