Results 1 - 10 of 76
Robust automatic speech recognition with missing and unreliable acoustic data
- Speech Communication
, 2001
Recent advances in the automatic recognition of audio-visual speech
- PROC. IEEE
, 2003
Cited by 64 (10 self)
Visual speech information from the speaker’s mouth region has been successfully shown to improve noise robustness of automatic speech recognizers, thus promising to extend their usability in the human-computer interface. In this paper, we review the main components of audio-visual automatic speech recognition and present novel contributions in two main areas: first, the visual front end design, based on a cascade of linear image transforms of an appropriate video region-of-interest, and subsequently, audio-visual speech integration. On the latter topic, we discuss new work on feature and decision fusion combination, the modeling of audio-visual speech asynchrony, and incorporating modality reliability estimates into the bimodal recognition process. We also briefly touch upon the issue of audio-visual adaptation. We apply our algorithms to three multi-subject bimodal databases, ranging from small- to large-vocabulary recognition tasks, recorded in both visually controlled and challenging environments. Our experiments demonstrate that the visual modality improves automatic speech recognition over all conditions and data considered, though less so for visually challenging environments and large-vocabulary tasks.
Audio-visual automatic speech recognition: An overview
- Issues in Visual and Audio-visual Speech Processing
, 2004
Cited by 41 (0 self)
We have made significant progress in automatic speech recognition (ASR) for well-defined applications like dictation and medium vocabulary transaction processing tasks in relatively controlled environments. However, ASR performance has yet to reach the level required for speech to become a truly pervasive user interface. Indeed, even in “clean” acoustic environments, and for a variety of tasks, state-of-the-art ASR system
Heterogeneous Acoustic Measurement And Multiple Classifiers For Speech Recognition
, 1998
Cited by 29 (1 self)
The acoustic-phonetic modeling component of most current speech recognition systems calculates a small set of homogeneous frame-based measurements at a single, fixed time-frequency resolution. This thesis presents evidence indicating that recognition performance can be significantly improved through a contrasting approach using more detailed and more diverse acoustic measurements, which we refer to as heterogeneous measurements.
Uncertainty decoding for noise robust speech recognition
- in Proc. Interspeech
, 2004
Cited by 26 (8 self)
This dissertation is the result of my own work and includes nothing which is the outcome of work done in collaboration. It has not been submitted in whole or in part for a degree at any other university. Some of the work has been published previously in conference proceedings
Large-Vocabulary Audio-Visual Speech Recognition by Machines and Humans
- of the Johns Hopkins Summer 2000 Workshop,” in Proc. Works. Signal Processing
, 2001
Cited by 20 (2 self)
We compare automatic recognition with human perception of audio-visual speech, in the large-vocabulary, continuous speech recognition (LVCSR) domain. Specifically, we study the benefit of the visual modality for both machines and humans, when combined with audio degraded by speech-babble noise at various signal-to-noise ratios (SNRs). We first consider an automatic speechreading system with a pixel-based visual front end that uses feature fusion for bimodal integration, and we compare its performance with an audio-only LVCSR system. We then describe results of human speech perception experiments, where subjects are asked to transcribe audio-only and audio-visual utterances at various SNRs. For both machines and humans, we observe approximately a 6 dB effective SNR gain compared to the audio-only performance at 10 dB; however, such gains diverge significantly at other SNRs. Furthermore, automatic audio-visual recognition outperforms human audio-only speech perception at low SNRs.
Spectral Signal Processing for ASR
- Proc. ASRU’99
, 1999
Cited by 17 (0 self)
The paper begins by discussing the difficulties in obtaining repeatable results in speech recognition. Theoretical arguments are presented for and against copying human auditory properties in automatic speech recognition. The "standard" acoustic analysis for automatic speech recognition, consisting of mel-scale cepstrum coefficients and their temporal derivatives, is described. Some variations and extensions of the standard analysis --- PLP, cepstrum correlation methods, LDA, and variants on log power --- are then discussed. These techniques pass the test of having been found useful at multiple sites, especially with noisy speech. The extent to which auditory properties can account for the advantage found for particular techniques is considered. It is concluded that the advantages do not in fact stem from auditory properties, and that there is so far little or no evidence that the study of the human auditory system has contributed to advances in automatic speech recognition. Contributio...
Asynchrony Modeling for Audio-Visual Speech Recognition
, 2002
Cited by 15 (2 self)
We investigate the use of multi-stream HMMs in the automatic recognition of audio-visual speech. Multi-stream HMMs allow the modeling of asynchrony between the audio and visual state sequences at a variety of levels (phone, syllable, word, etc.) and are equivalent to product, or composite, HMMs. In this paper, we consider such models synchronized at the phone boundary level, allowing various degrees of audio and visual state-sequence asynchrony. Furthermore, we investigate joint training of all product HMM parameters, instead of just composing the model from separately trained audio- and visual-only HMMs. We report experiments on a multi-subject connected digit recognition task, as well as on a more complex, speaker-independent large-vocabulary dictation task. Our results demonstrate that in both cases, joint multi-stream HMM training is superior to separate training of single-stream HMMs. In addition, we observe that allowing state-sequence asynchrony between the HMM audio and visual components improves connected digit recognition significantly; however, it degrades performance on the dictation task. The resulting multi-stream models dramatically improve speech recognition robustness to noise by successfully exploiting the speech information in the visual modality. For example, at 11 dB SNR, they reduce connected digit word error rate from the audio-only 2.3% to 0.77% audio-visual, and, for the large-vocabulary task, from 28.3% to 19.5%. Compared to the audio-only performance at 10 dB SNR, the use of multi-stream HMMs achieves an effective SNR gain of up to 9 dB and 7 dB respectively, for the two recognition tasks considered.
A comparison of the data requirements of automatic speech recognition systems and human listeners
- Proc. Eurospeech, Geneva
, 2003
Cited by 14 (6 self)
Since the introduction of hidden Markov modelling there has been an increasing emphasis on data-driven approaches to automatic speech recognition. This derives from the fact that systems trained on substantial corpora readily outperform those that rely on more phonetic or linguistic priors. Similarly, extra training data almost always results in a reduction in word error rate: “there's no data like more data”. However, despite this progress, contemporary systems are not able to fulfill the requirements demanded by many potential applications, and performance is still significantly short of the capabilities exhibited by human listeners. For these reasons, the R&D community continues to call for even greater quantities of data in order to train their systems. This paper addresses the issue of just how much data might be required in order to bring the performance of an automatic speech recognition system up to that of a human listener.
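
Note on the word error rates quoted in the "Asynchrony Modeling for Audio-Visual Speech Recognition" abstract above: the abstract gives absolute figures only (2.3% to 0.77% for connected digits, 28.3% to 19.5% for large-vocabulary dictation). The sketch below derives the corresponding relative reductions, a commonly used derived metric that the abstract itself does not state. The function name `relative_reduction` is hypothetical and chosen only for illustration.

```python
def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline word error rate removed by the new system.

    Hypothetical helper for illustration; both arguments are WERs in percent.
    """
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Absolute WERs (in percent) as quoted in the abstract, audio-only vs. audio-visual.
digits = relative_reduction(2.3, 0.77)      # connected digits
dictation = relative_reduction(28.3, 19.5)  # large-vocabulary dictation

print(f"connected digits: {digits * 100:.1f}% relative WER reduction")
print(f"dictation:        {dictation * 100:.1f}% relative WER reduction")
```

Under these figures the relative reduction is roughly two thirds on the digit task and a little under one third on the dictation task, which matches the abstract's remark that the visual modality helps less on the harder large-vocabulary task.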

