On Designing a Visual System (Towards a Gibsonian computational model of vision)
- Journal of Experimental and Theoretical AI, 1989
- Cited by 54 (41 self)
This paper contrasts the standard (in AI) "modular" theory of the nature of vision with a more general (labyrinthine) theory of vision as involving multiple functions and multiple relationships with other sub-systems of an intelligent system.
Audio-visual automatic speech recognition: An overview
- Issues in Visual and Audio-visual Speech Processing, 2004
- Cited by 41 (0 self)
We have made significant progress in automatic speech recognition (ASR) for well-defined applications like dictation and medium vocabulary transaction processing tasks in relatively controlled environments. However, ASR performance has yet to reach the level required for speech to become a truly pervasive user interface. Indeed, even in “clean” acoustic environments, and for a variety of tasks, state-of-the-art ASR system
Adaptive Bimodal Sensor Fusion For Automatic Speechreading
- 1996
- Cited by 27 (2 self)
We present recent work on improving the performance of automated speech recognizers by using additional visual information (Lip-/Speechreading), achieving error reduction of up to 50%. This paper focuses on different methods of combining the visual and acoustic data to improve the recognition performance. We show this on an extension of an existing state-of-the-art speech recognition system, a modular MS-TDNN. We have developed adaptive combination methods at several levels of the recognition network. Additional information such as estimated signal-to-noise ratio (SNR) is used in some cases. The results of the different combination methods are shown for clean speech and data with artificial noise (white, music, motor). The new combination methods adapt automatically to varying noise conditions, making hand-tuned parameters unnecessary.
1. INTRODUCTION Automated speech recognition systems still perform poorly in real-world applications. Most approaches are very sensitive to background n...
Towards unrestricted lip reading
- Second International Conference on Multimedia Interfaces, Hong Kong, 1999
- Cited by 25 (1 self)
Lip reading provides useful information in speech perception and language understanding, especially when the auditory speech is degraded. However, many current automatic lip reading systems impose some restrictions on users. In this paper, we present our research efforts, in the Interactive System Laboratory, towards unrestricted lip reading. We first introduce a top-down approach to automatically track and extract lip regions. This technique makes it possible to acquire visual information in real time without limiting the user's freedom of movement. We then discuss normalization algorithms to preprocess images for different lighting conditions (global illumination and side illumination). We also compare different visual preprocessing methods such as raw image, Linear Discriminant Analysis (LDA), and Principal Component Analysis (PCA). We demonstrate the feasibility of the proposed methods by development of a modular system for flexible human-computer interaction via both visual and acoustic speech. The system is based on an extension of an existing state-of-the-art speech recognition system, a modular Multiple State-Time Delayed Neural Network (MS-TDNN) system. We have developed adaptive combination methods at several different levels of the recognition network. The system can automatically track a speaker and extract his/her lip region in real time. The system has been evaluated under different noisy conditions such as white noise, music, and mechanical noise. The experimental results indicate that the system can achieve up to 55% error reduction using additional visual information.
A coupled HMM for audio-visual speech recognition
- International Conference on Acoustics, Speech and Signal Processing (ICASSP'02), 2002
- Cited by 23 (3 self)
In recent years, several speech recognition systems that use visual together with audio information have shown a significant increase in performance over the standard speech recognition systems. The use of visual features is justified both by the bimodality of speech generation and by the need for features that are invariant to acoustic noise perturbation. The audio-visual speech recognition system presented in this paper introduces a novel audio-visual fusion technique that uses a coupled hidden Markov model (HMM). The statistical properties of the coupled HMM allow us to model the state asynchrony of the audio and visual observation sequences while still preserving their natural correlation over time. The experimental results show that the coupled HMM outperforms the multistream HMM in audio-visual speech recognition.
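As a rough illustration of the coupling idea in this abstract, the sketch below computes a sequence likelihood for a toy two-chain model in which each chain's next state is conditioned on the previous states of both chains. This is only a minimal sketch of one common coupled-HMM formulation, not the paper's system: the function name `chmm_forward`, the discrete observation symbols, and every probability value are hypothetical.

```python
from itertools import product

def chmm_forward(obs_a, obs_v, init_a, init_v, trans_a, trans_v, emit_a, emit_v):
    """Likelihood of paired audio/visual symbol sequences under a toy coupled HMM.

    trans_a[a][v][a2] = P(audio state a2 | previous audio state a, previous visual state v)
    trans_v[a][v][v2] = P(visual state v2 | previous audio state a, previous visual state v)
    """
    states = list(product(range(len(init_a)), range(len(init_v))))
    # Initialisation: the two chains start independently.
    alpha = {
        (a, v): init_a[a] * init_v[v] * emit_a[a][obs_a[0]] * emit_v[v][obs_v[0]]
        for a, v in states
    }
    for t in range(1, len(obs_a)):
        new_alpha = {}
        for a2, v2 in states:
            # Each chain's transition depends on BOTH previous states (the coupling).
            reach = sum(
                alpha[(a, v)] * trans_a[a][v][a2] * trans_v[a][v][v2]
                for a, v in states
            )
            new_alpha[(a2, v2)] = reach * emit_a[a2][obs_a[t]] * emit_v[v2][obs_v[t]]
        alpha = new_alpha
    return sum(alpha.values())

# Hypothetical parameters: two states and two observation symbols per chain.
init_a = [0.6, 0.4]
init_v = [0.5, 0.5]
trans_a = [[[0.8, 0.2], [0.6, 0.4]], [[0.3, 0.7], [0.1, 0.9]]]
trans_v = [[[0.7, 0.3], [0.4, 0.6]], [[0.5, 0.5], [0.2, 0.8]]]
emit_a = [[0.9, 0.1], [0.2, 0.8]]
emit_v = [[0.7, 0.3], [0.4, 0.6]]

likelihood = chmm_forward([0, 1, 1], [0, 0, 1],
                          init_a, init_v, trans_a, trans_v, emit_a, emit_v)
```

Because the audio and visual chains keep separate states, they may disagree at a given frame (asynchrony), while the shared conditioning in the transitions keeps them correlated over time.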
Improved ROI and Within Frame Discriminant Features for Lipreading
- Proc. International Conference on Image Processing, Thessaloniki, Greece, 2001
- Cited by 9 (3 self)
We study three aspects of designing appearance based visual features for automatic lipreading: (a) The choice of the video region of interest (ROI), on which image transform features are obtained; (b) The extraction of speech discriminant features at each frame; and (c) The use of temporal information to improve visual speech modeling. In particular, with respect to (a), we propose a ROI that includes the speaker's jaw and cheeks, in addition to the traditionally used mouth/lip region; with respect to (b) and (c), we propose the use of a two-stage linear discriminant analysis, both within frame, as well as across a large number of frames. On a large-vocabulary, continuous speech audio-visual database, the proposed visual features result in a 13% absolute reduction in visual-only word error rate over a baseline visual front end, and in an additional 28% relative improvement in audio-visual over audio-only phonetic classification accuracy.
Integration strategies for audiovisual speech processing: Applied to text-dependent speaker recognition
- IEEE Trans. Multimedia, 2005
- Cited by 8 (0 self)
In this paper, an in-depth analysis is undertaken into effective strategies for integrating the audio-visual speech modalities with respect to two major questions. Firstly, at what level should integration occur? Secondly, given a level of integration, how should this integration be implemented? Our work is based around the well-known hidden Markov model (HMM) classifier framework for modelling speech. A novel framework for modelling the mismatch between train and test observation sets is proposed, so as to provide effective classifier combination performance between the acoustic and visual HMM classifiers. From this framework, it can be shown that strategies for combining independent classifiers, such as the weighted product or sum rules, naturally emerge depending on the influence of the mismatch. Based on the assumption that poor performance in most AVSP applications can be attributed to train/test mismatches, we propose that the main impetus of practical audio-visual integration is to dampen the independent errors resulting from the mismatch, rather than trying to model any bimodal speech dependencies. To this end, a strategy is recommended, based on theory and empirical evidence, using a hybrid between the weighted product and weighted sum rules in the presence of varying acoustic noise for the task of text-dependent speaker recognition.
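The two combination rules named in this abstract can be sketched in their generic textbook form as follows. This is an illustrative sketch only: the function names, the single weight `w`, and the example posteriors are assumptions, and the hybrid strategy the paper recommends is not specified in the abstract, so it is not reproduced here.

```python
def weighted_sum(p_audio, p_video, w):
    """Weighted sum rule: w * P_audio(c) + (1 - w) * P_video(c) for each class c."""
    return [w * a + (1.0 - w) * v for a, v in zip(p_audio, p_video)]

def weighted_product(p_audio, p_video, w):
    """Weighted product rule: P_audio(c)**w * P_video(c)**(1 - w), renormalised to sum to 1."""
    raw = [(a ** w) * (v ** (1.0 - w)) for a, v in zip(p_audio, p_video)]
    total = sum(raw)
    return [r / total for r in raw]

# Hypothetical class posteriors from an acoustic and a visual classifier.
p_audio = [0.7, 0.2, 0.1]
p_video = [0.3, 0.6, 0.1]

fused_sum = weighted_sum(p_audio, p_video, 0.5)
fused_prod = weighted_product(p_audio, p_video, 0.5)
```

With `w` close to 1 both rules trust the acoustic classifier; lowering `w` as acoustic noise increases shifts the decision towards the visual classifier.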
Polysp: a polysystemic, phonetically-rich approach to speech understanding
- Italian Journal of Linguistics - Rivista di Linguistica, 2001
A Cascade Visual Front End for Speaker Independent Automatic Speechreading
- International Journal of Speech Technology, 2001
- Cited by 7 (4 self)
We propose a three-stage pixel based visual front end for automatic speechreading (lipreading) that results in significantly improved recognition performance of spoken words or phonemes. The proposed algorithm is a cascade of three transforms applied on a three-dimensional video region-of-interest that contains the speaker's mouth area. The first stage is a typical image compression transform that achieves a high-energy, reduced-dimensionality representation of the video data. The second stage is a linear discriminant analysis based data projection, which is applied on a concatenation of a small number of consecutive image transformed video data. The third stage is a data rotation by means of a maximum likelihood linear transform that optimizes the likelihood of the observed data under the assumption of their class-conditional multi-variate normal distribution with diagonal covariance. We apply the algorithm to visual-only 52-class phonetic and 27-class visemic classification on a 162-subject, 8-hour long, large-vocabulary, continuous speech audio-visual database. We demonstrate significant classification accuracy gains by each added stage of the proposed algorithm, which, when combined, can reach up to 27% improvement. Overall, we achieve a 60% (49%) visual-only frame-level visemic classification accuracy with (without) use of test set viseme boundaries. In addition, we report improved audio-visual phonetic classification over the use of a single-stage image transform visual front end. Finally, we discuss preliminary speech recognition results.
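The first stage of the cascade (an image compression transform followed by keeping a high-energy subset of coefficients) might look roughly like the sketch below. The abstract does not name the transform, so the discrete cosine transform used here is an assumption, as are the function names, the one-dimensional pixel row standing in for the three-dimensional region of interest, and the sample values. The LDA and maximum likelihood linear transform stages are not shown.

```python
import math

def dct_ii(signal):
    """Unnormalised DCT-II: X[k] = sum_i x[i] * cos(pi / N * (i + 0.5) * k)."""
    n = len(signal)
    return [
        sum(x * math.cos(math.pi / n * (i + 0.5) * k) for i, x in enumerate(signal))
        for k in range(n)
    ]

def keep_high_energy(coeffs, m):
    """Return (index, value) pairs for the m largest-magnitude coefficients, ordered by index."""
    top = sorted(range(len(coeffs)), key=lambda k: abs(coeffs[k]), reverse=True)[:m]
    return [(k, coeffs[k]) for k in sorted(top)]

# Hypothetical row of grey-level pixels from a mouth region of interest.
row = [52.0, 55.0, 61.0, 66.0, 70.0, 61.0, 64.0, 73.0]

# Reduced-dimensionality representation: 3 coefficients instead of 8 pixels.
features = keep_high_energy(dct_ii(row), 3)
```

In a trained front end the retained coefficient positions would normally be fixed in advance from training data rather than chosen per frame; per-frame selection is used here only to keep the sketch short.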
Factors Influencing Audiovisual Fission and Fusion Illusions
- Cognitive Brain Research
- Cited by 6 (2 self)
Information processing in auditory and visual modalities interacts in many circumstances. Spatially and temporally coincident acoustic and visual information are often bound together to form multisensory percepts [13,16]. Shams and coworkers recently reported a multisensory fission illusion where a single flash is perceived as two flashes when two rapid tone beeps are presented concurrently [11,12]. The absence of a fusion illusion, where two flashes would fuse to one when accompanied by one beep, indicated a perceptual rather than cognitive nature of the illusion. Here we report both fusion and fission illusions using stimuli very similar to those used by Shams et al. By instructing subjects to count beeps rather than flashes and decreasing the sound intensity to near threshold we also created a corresponding visually induced auditory illusion. We discuss our results in light of four hypotheses of multisensory integration, each advocating a condition for modality dominance. According to the discontinuity hypothesis [12], the modality in which stimulation is discontinuous dominates. The modality appropriateness hypothesis [16] states that the modality more appropriate for the task at hand dominates. The information reliability hypothesis [10] claims that the modality providing more reliable information dominates. In strong forms, none of these three hypotheses applies to our data. We re-state the hypotheses in weak forms so that discontinuity, modality appropriateness and information reliability are factors which increase a modality’s tendency to dominate. All these factors are important in explaining our data. Finally, we interpret the effect of instructions in light of the directed attention hypothesis which states that the attended modality is dominant [16].

