Results 1 - 10
of
79
Recent advances in the automatic recognition of audio-visual speech
- PROC. IEEE
, 2003
"... Visual speech information from the speaker’s mouth region has been successfully shown to improve noise robustness of automatic speech recognizers, thus promising to extend their usability in the human computer interface. In this paper, we review the main components of audio-visual automatic speech r ..."
Abstract
-
Cited by 64 (10 self)
- Add to MetaCart
Visual speech information from the speaker’s mouth region has been successfully shown to improve noise robustness of automatic speech recognizers, thus promising to extend their usability in the human computer interface. In this paper, we review the main components of audio-visual automatic speech recognition and present novel contributions in two main areas: First, the visual front end design, based on a cascade of linear image transforms of an appropriate video region-of-interest, and subsequently, audio-visual speech integration. On the latter topic, we discuss new work on feature and decision fusion combination, the modeling of audio-visual speech asynchrony, and incorporating modality reliability estimates to the bimodal recognition process. We also briefly touch upon the issue of audio-visual adaptation. We apply our algorithms to three multi-subject bimodal databases, ranging from small- to large-vocabulary recognition tasks, recorded in both visually controlled and challenging environments. Our experiments demonstrate that the visual modality improves automatic speech recognition over all conditions and data considered, though less so for visually challenging environments and large vocabulary tasks.
Audio-Visual Integration In Multimodal Communication
- Proc. IEEE
, 1998
"... : In this paper, we review recent research that examines audio-visual integration in multimodal communication. The topics include bimodality in human speech, human and automated lip-reading, facial animation, lip synchronization, joint audio-video coding, and bimodal speaker verification. We also st ..."
Abstract
-
Cited by 54 (5 self)
- Add to MetaCart
: In this paper, we review recent research that examines audio-visual integration in multimodal communication. The topics include bimodality in human speech, human and automated lip-reading, facial animation, lip synchronization, joint audio-video coding, and bimodal speaker verification. We also study the enabling technologies for these research topics, including automatic facial feature tracking and audio-to-visual mapping. Recent progress in audio-visual research shows that joint processing of audio and video provides advantages that are not available when the audio and video are processed independently. Keywords: Multimedia communication, Speech processing, Speech communication, Video signal processing, Image analysis 1. Introduction Multimedia is more than simply the combination of various forms of data: text, speech, audio, music, images, graphics, and video. When we discuss multimedia signal processing, it is the integration and interaction among these different media types t...
Audio-visual automatic speech recognition: An overview
- Issues in Visual and Audio-visual Speech Processing
, 2004
"... We have made significant progress in automatic speech recognition (ASR) for well-defined applications like dictation and medium vocabulary transaction processing tasks in relatively controlled environments. However, ASR performance has yet to reach the level required for speech to become a truly per ..."
Abstract
-
Cited by 41 (0 self)
- Add to MetaCart
We have made significant progress in automatic speech recognition (ASR) for well-defined applications like dictation and medium vocabulary transaction processing tasks in relatively controlled environments. However, ASR performance has yet to reach the level required for speech to become a truly pervasive user interface. Indeed, even in “clean ” acoustic environments, and for a variety of tasks, state of the art ASR system
Extraction of Visual Features for Lipreading
- IEEE Transactions on Pattern Analysis and Machine Intelligence
, 2002
"... The multi-modal nature of speech is often ignored in human-computer interaction but lip deformation, and other body such as head and arm motion all convey additional infor-mation. We integrate speech cues from many sources and this improves intelligibility, es-pecially when the acoustic signal is de ..."
Abstract
-
Cited by 36 (1 self)
- Add to MetaCart
The multi-modal nature of speech is often ignored in human-computer interaction but lip deformation, and other body such as head and arm motion all convey additional infor-mation. We integrate speech cues from many sources and this improves intelligibility, es-pecially when the acoustic signal is degraded. This paper shows how this additional, often complementary, visual speech information can be used for speech recognition. Three meth-ods for parameterising lip image sequences for recognition using hidden Markov models are compared. Two of these are top-down approaches that fit a model of the inner and outer lip contours and derive lipreading features from a principal component analysis of shape, or shape and appearance respectively. The third, bottom-up, method uses a non-linear scale-space analysis to form features directly from the pixel intensity. All methods are compared on a multi-talker visual speech recognition task of isolated letters.
Acoustic-labial speaker verification
- Audio and Video based Person Authentication - AVBPA97, volume LNCS-1206
, 1997
"... defined), by Ben Gold and Nelson Morgan. ..."
The challenge of spoken language systems: Research directions for the nineties
- IEEE Transactions on Speech and Audio Processing
, 1995
"... Footnote This article is based on a February, 1992workshop sponsored by the National Science ..."
Abstract
-
Cited by 34 (5 self)
- Add to MetaCart
Footnote This article is based on a February, 1992workshop sponsored by the National Science
See Me, Hear Me: Integrating Automatic Speech Recognition And Lip-Reading
- Proc. Int. Conf. Spoken Lang. Process
, 1994
"... We present recent work on integration of visual information (automatic lip-reading) with acoustic speech for better overall speech recognition. A Multi-State Time Delay Neural Network performs the recognition of spelled letter sequences taking advantage of lip images from a standard camera. The prob ..."
Abstract
-
Cited by 32 (6 self)
- Add to MetaCart
We present recent work on integration of visual information (automatic lip-reading) with acoustic speech for better overall speech recognition. A Multi-State Time Delay Neural Network performs the recognition of spelled letter sequences taking advantage of lip images from a standard camera. The problems addressed include efficient but effective representation of the visual information and optimum manner of combining the two modalities when rendering a decision. We show results for several alternatives to direct gray level image as the visual evidence. These are: Principal Components, Linear Discriminants, and DFT coefficients. Dimensionality of the input is decreased by a factor of 12 while maintaining recognition rates. Combination of the visual and acoustic information is performed at three different levels of abstraction. Results suggest that integration of higher order input features works best. On a continuous spelling task, visual-alone recognition of 45-55%, when combined with a...
Integration of acoustic and visual speech signals using neural networks
- IEEE Communications Magazine
, 1989
"... rely almost exclusively on the acoustic speech signal and, consequently, these systems often perform poorly in noisy environments [I]. Attempts to clean up the acoustic input have had limited success [2]. Another approach is to use other sources of speech information, such as visual speech signals. ..."
Abstract
-
Cited by 27 (0 self)
- Add to MetaCart
rely almost exclusively on the acoustic speech signal and, consequently, these systems often perform poorly in noisy environments [I]. Attempts to clean up the acoustic input have had limited success [2]. Another approach is to use other sources of speech information, such as visual speech signals. The perception of acoustic speech by humans can be affected by the visible speech signals [3-51. Specifically, when the acoustic signal is degraded by noise, the visual signal can provide supplemental speech information that improves speech perception [6-81. When no acoustic signal is available, as for the profoundly deaf, the visual signal alone can provide speech information through lip reading [9- 1 I]. Here we answer two questions: Can the speech information conveyed by visual speech signals be extracted automatically? How can this information be combined with information from the acoustic signal to improve automat
Adaptive Bimodal Sensor Fusion For Automatic Speechreading
, 1996
"... We present recent work on improving the performance of automated speech recognizers by using additional visual information (Lip-/Speechreading), achieving error reduction of up to 50%. This paper focuses on different methods of combining the visual and acoustic data to improve the recognition perfor ..."
Abstract
-
Cited by 27 (2 self)
- Add to MetaCart
We present recent work on improving the performance of automated speech recognizers by using additional visual information (Lip-/Speechreading), achieving error reduction of up to 50%. This paper focuses on different methods of combining the visual and acoustic data to improve the recognition performance. We show this on an extension of an existing state-of-the-art speech recognition system, a modular MS-TDNN. We have developed adaptive combination methods at several levels of the recognition network. Additional information such as estimated signal-to-noise ratio (SNR) is used in some cases. The results of the different combination methods are shown for clean speech and data with artificial noise (white, music, motor). The new combination methods adapt automatically to varying noise conditions making hand-tuned parameters unnecessary. 1. INTRODUCTION Automated speech recognition systems still perform poorly in real-world applications. Most approaches are very sensitive to background n...

