Recent advances in the automatic recognition of audio-visual speech
- Proc. IEEE, 2003
Cited by 64 (10 self)
Visual speech information from the speaker’s mouth region has been successfully shown to improve noise robustness of automatic speech recognizers, thus promising to extend their usability in the human-computer interface. In this paper, we review the main components of audio-visual automatic speech recognition and present novel contributions in two main areas: First, the visual front end design, based on a cascade of linear image transforms of an appropriate video region-of-interest, and subsequently, audio-visual speech integration. On the latter topic, we discuss new work on feature and decision fusion combination, the modeling of audio-visual speech asynchrony, and the incorporation of modality reliability estimates into the bimodal recognition process. We also briefly touch upon the issue of audio-visual adaptation. We apply our algorithms to three multi-subject bimodal databases, ranging from small- to large-vocabulary recognition tasks, recorded in both visually controlled and challenging environments. Our experiments demonstrate that the visual modality improves automatic speech recognition over all conditions and data considered, though less so for visually challenging environments and large vocabulary tasks.
CUAVE: A new audio-visual database for multimodal human-computer interface research
- In Proc. ICASSP, 2002
Cited by 44 (1 self)
Multimodal signal processing has become an important topic of research for overcoming certain problems of audio-only speech processing. Audio-visual speech recognition is one area with great potential. Difficulties due to background noise and multiple speakers are significantly reduced by the additional information provided by extra visual features. Despite a few efforts to create databases in this area, none has emerged as a standard for comparison for several possible reasons. This paper seeks to introduce a new audiovisual database that is flexible and fairly comprehensive, yet easily available to researchers on one DVD. The CUAVE database is a speaker-independent corpus of over 7,000 utterances of both connected and isolated digits. It is designed to meet several goals that are discussed in this paper. The most notable are availability of the database, flexibility for use of
Audio-visual automatic speech recognition: An overview
- Issues in Visual and Audio-visual Speech Processing, 2004
Cited by 41 (0 self)
We have made significant progress in automatic speech recognition (ASR) for well-defined applications like dictation and medium vocabulary transaction processing tasks in relatively controlled environments. However, ASR performance has yet to reach the level required for speech to become a truly pervasive user interface. Indeed, even in “clean” acoustic environments, and for a variety of tasks, state-of-the-art ASR system
Moving-talker speaker-independent feature study and baseline results using the CUAVE multimodal speech corpus
- EURASIP Journal on Applied Signal Processing, 2002
Cited by 25 (0 self)
Strides in computer technology and the search for deeper, more powerful techniques in signal processing have brought multimodal research to the forefront in recent years. Audio-visual speech processing has become an important part of this research because it holds great potential for overcoming certain problems of traditional audio-only methods. Difficulties due to background noise and multiple speakers in an application environment are significantly reduced by the additional information provided by visual features. This paper presents information on a new audio-visual database, a feature study on moving speakers, and baseline results for the whole speaker group. Although a few databases have been collected in this area, none has emerged as a standard for comparison. Also, efforts to date have often been limited, focusing on cropped video or stationary speakers. This paper seeks to introduce a challenging audio-visual database that is flexible and fairly comprehensive, yet easily available to researchers on one DVD. The CUAVE database is a speaker-independent corpus of both connected and continuous digit strings totaling over 7,000 utterances. It contains a wide variety of speakers, and is designed to meet several goals discussed in this paper. One of these goals is to allow testing of adverse conditions such as moving talkers and speaker pairs. For information on obtaining CUAVE, please visit our webpage
Noise Adaptive Stream Weighting in Audio-Visual Speech Recognition
- EURASIP J. Appl. Signal Processing, 2002
Cited by 21 (0 self)
When trying to overcome the significant performance drops of ASR systems in the presence of noise, one road to follow is the integration of the information present in the lip movement of the speaker. Comparisons showed that integration of audio and video data on the decision level yields the best recognition results. This raises the question of how to weight the two modalities in different noise conditions. Throughout this article we develop a weighting process adaptive to various background noise situations. Firstly
Large-Vocabulary Audio-Visual Speech Recognition by Machines and Humans
- of the Johns Hopkins Summer 2000 Workshop,” in Proc. Works. Signal Processing, 2001
Cited by 20 (2 self)
We compare automatic recognition with human perception of audio-visual speech, in the large-vocabulary, continuous speech recognition (LVCSR) domain. Specifically, we study the benefit of the visual modality for both machines and humans, when combined with audio degraded by speech-babble noise at various signal-to-noise ratios (SNRs). We first consider an automatic speechreading system with a pixel-based visual front end that uses feature fusion for bimodal integration, and we compare its performance with an audio-only LVCSR system. We then describe results of human speech perception experiments, where subjects are asked to transcribe audio-only and audio-visual utterances at various SNRs. For both machines and humans, we observe approximately a 6 dB effective SNR gain compared to the audio-only performance at 10 dB; however, such gains significantly diverge at other SNRs. Furthermore, automatic audio-visual recognition outperforms human audio-only speech perception at low SNRs.
Articulatory Features for Robust Visual Speech Recognition
- 2004
Cited by 12 (4 self)
Visual information has been shown to improve the performance of speech recognition systems in noisy acoustic environments. However, most audio-visual speech recognizers rely on a clean visual signal. In this paper, we explore a novel approach to visual speech modeling, based on articulatory features, which has potential benefits under visually challenging conditions. The idea is to use a set of parallel SVM classifiers to extract different articulatory attributes from the input images, and then combine their decisions to obtain higher-level units, such as visemes or words. We evaluate our approach in a preliminary experiment on a small audio-visual database, using several image noise conditions, and compare it to the standard viseme-based modeling approach.
Feature Analysis for Automatic Speechreading
- In Proc. Int’l Workshop Multimedia Signal Processing, 2001
Cited by 11 (2 self)
Audio-Visual Automatic Speech Recognition systems use visual information to enhance ASR systems in clean and noisy environments. This paper compares a number of different visual feature extraction methods. When performing visual speech recognition, the visual feature vector requires a base level of detail for optimum recognition. Geometric feature extraction provides lower recognition than pixel-based methods due to the loss of characteristic speech information such as f-tuck, protrusion, etc. Downsampling of images reduces visual recognition scores due to the loss of detail in the images. Also, the role of dynamic features was investigated for improved recognition. It was observed that the use of static features only provided higher recognition scores than a feature vector of the same length containing both static and dynamic features. These results illustrate the need for a base level of detail in the feature vector for improved visual recognition scores.
A Cascade Visual Front End for Speaker Independent Automatic Speechreading
- International Journal of Speech Technology, 2001
Cited by 7 (4 self)
We propose a three-stage pixel-based visual front end for automatic speechreading (lipreading) that results in significantly improved recognition performance of spoken words or phonemes. The proposed algorithm is a cascade of three transforms applied on a three-dimensional video region-of-interest that contains the speaker's mouth area. The first stage is a typical image compression transform that achieves a high-energy, reduced-dimensionality representation of the video data. The second stage is a linear discriminant analysis based data projection, which is applied on a concatenation of a small number of consecutive image transformed video data. The third stage is a data rotation by means of a maximum likelihood linear transform that optimizes the likelihood of the observed data under the assumption of their class-conditional multi-variate normal distribution with diagonal covariance. We apply the algorithm to visual-only 52-class phonetic and 27-class visemic classification on a 162-subject, 8-hour long, large-vocabulary, continuous speech audio-visual database. We demonstrate significant classification accuracy gains by each added stage of the proposed algorithm, which, when combined, can reach up to 27% improvement. Overall, we achieve a 60% (49%) visual-only frame-level visemic classification accuracy with (without) use of test set viseme boundaries. In addition, we report improved audio-visual phonetic classification over the use of a single-stage image transform visual front end. Finally, we discuss preliminary speech recognition results.
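As a rough illustration of the first stage and of the frame concatenation that feeds the second stage, the sketch below applies a 2-D DCT to each mouth-region frame, keeps the highest-energy coefficients, and stacks neighbouring frames. The choice of the DCT as the "image compression transform", the energy criterion summed over the given frames, the edge clamping, and all function names are assumptions made for illustration; the abstract does not specify them. The LDA projection and the maximum likelihood linear transform both need labelled training data and are omitted here.

```python
import math


def dct2(block):
    """Naive orthonormal 2-D DCT-II of a rectangular list-of-lists block."""
    rows, cols = len(block), len(block[0])

    def alpha(k, n):
        # Orthonormal scaling factor for coefficient index k along an axis of length n.
        return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

    out = [[0.0] * cols for _ in range(rows)]
    for u in range(rows):
        for v in range(cols):
            s = 0.0
            for x in range(rows):
                for y in range(cols):
                    s += (block[x][y]
                          * math.cos(math.pi * (2 * x + 1) * u / (2 * rows))
                          * math.cos(math.pi * (2 * y + 1) * v / (2 * cols)))
            out[u][v] = alpha(u, rows) * alpha(v, cols) * s
    return out


def top_energy_positions(frames_coeffs, k):
    """Return the k coefficient positions with the highest energy summed over frames.

    Assumption: energy is accumulated over the frames passed in.
    """
    rows, cols = len(frames_coeffs[0]), len(frames_coeffs[0][0])
    energy = {
        (u, v): sum(c[u][v] ** 2 for c in frames_coeffs)
        for u in range(rows)
        for v in range(cols)
    }
    return sorted(energy, key=energy.get, reverse=True)[:k]


def stage_one_features(frames, k):
    """Map each frame to a k-dimensional vector of its selected DCT coefficients."""
    coeffs = [dct2(f) for f in frames]
    positions = top_energy_positions(coeffs, k)
    return [[c[u][v] for (u, v) in positions] for c in coeffs]


def stack_frames(features, context):
    """Concatenate each vector with `context` neighbours per side (edges clamped)."""
    n = len(features)
    stacked = []
    for t in range(n):
        vec = []
        for dt in range(-context, context + 1):
            vec.extend(features[min(max(t + dt, 0), n - 1)])
        stacked.append(vec)
    return stacked
```

With `k` retained coefficients and `context` neighbours on each side, every stacked vector has `k * (2 * context + 1)` entries; in the cascade described above, such a vector would be the input to the discriminant projection.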
OPTIMAL WEIGHTING OF POSTERIORS FOR AUDIO-VISUAL SPEECH RECOGNITION
- 2001
Cited by 7 (3 self)
We investigate the fusion of audio and video a posteriori phonetic probabilities in a hybrid ANN/HMM audio-visual speech recognition system. Three basic conditions to the fusion process are stated and implemented in a linear and a geometric weighting scheme. These conditions are the assumption of conditional independence of the audio and video data and the contribution of only one of the two paths when the SNR is very high or very low, respectively. In the case of the geometric weighting a new weighting scheme is developed whereas the linear weighting follows the Full Combination approach as employed in multi-stream recognition. We compare these two new concepts in audio-visual recognition to a rather standard approach known from the literature. Recognition tests were performed in a continuous number recognition task on a single speaker database containing 1712 utterances with two different types of noise added.
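The difference between linear and geometric weighting can be sketched with the generic textbook forms below: a weighted sum of the audio and video posterior vectors, and a weighted product that is renormalized. This is not the paper's scheme (neither the Full Combination variant nor the new geometric weighting is reproduced here). The function names, the single audio weight `lam`, and the example posteriors are hypothetical.

```python
def linear_fusion(p_audio, p_video, lam):
    """Weighted sum of two posterior vectors; lam in [0, 1] is the audio weight."""
    return [lam * a + (1.0 - lam) * v for a, v in zip(p_audio, p_video)]


def geometric_fusion(p_audio, p_video, lam):
    """Weighted product of two posterior vectors, renormalized to sum to 1.

    Assumption: at least one class has non-zero probability in both streams,
    so the normalizer is non-zero.
    """
    unnorm = [(a ** lam) * (v ** (1.0 - lam)) for a, v in zip(p_audio, p_video)]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# Hypothetical posteriors over three phone classes for a single frame.
p_audio = [0.7, 0.2, 0.1]
p_video = [0.2, 0.5, 0.3]

fused_linear = linear_fusion(p_audio, p_video, 0.5)
fused_geometric = geometric_fusion(p_audio, p_video, 0.5)
```

Setting `lam` to 1 or 0 reduces both rules to the audio-only or video-only posteriors, which matches the stated condition that only one path contributes when the SNR is very high or very low. How `lam` should depend on the SNR is left open in this sketch.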

