Multimedia Content Analysis Using Both Audio and Visual Cues
, 2000
Cited by 70 (0 self)
Including all the scenes/shots that contain special events may generate too long an abstract. Also, simply staggering them together may not be visually or aurally appealing. In the MoCA project, it was determined that only 50% of the abstract should contain special events. The remaining part should be left for filler clips. The special event clips to be included are chosen uniformly and randomly from different types of events. The selection of a short clip from a scene is subject to some additional criteria, such as the amount of action and the similarity to the overall color composition of the movie. Closeness to the desired AV characteristics of certain scene types is also considered. The filler clips are chosen so that they do not overlap with the content of chosen special event clips, to ensure a good coverage of all parts of a movie. MPEG-7 Standard for Multimedia Content Description Interface MPEG-7 is an on-going standardization effort for content description of AV documen...
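The selection rule described in this abstract can be illustrated with a minimal Python sketch. It is not the MoCA implementation: the function name, the `(start, end)` clip representation, and the reading of "chosen uniformly and randomly from different types of events" as "pick an event type uniformly, then a clip of that type" are all assumptions, and the action and color-similarity criteria are left out.

```python
import random


def select_abstract_clips(event_clips, filler_clips, total_slots,
                          event_share=0.5, seed=0):
    """Pick clips for a video abstract (hypothetical helper).

    event_clips:  dict mapping event type -> list of (start, end) clips
    filler_clips: list of (start, end) candidate filler clips
    total_slots:  number of clips the abstract may contain
    event_share:  fraction of slots reserved for special events (50% here)
    """
    rng = random.Random(seed)
    n_event = int(total_slots * event_share)

    # Copy the candidate lists so the caller's data is not modified.
    pool = {etype: list(clips) for etype, clips in event_clips.items() if clips}
    chosen = []
    while len(chosen) < n_event and pool:
        etype = rng.choice(sorted(pool))        # uniform over event types
        clips = pool[etype]
        chosen.append(clips.pop(rng.randrange(len(clips))))
        if not clips:
            del pool[etype]

    def overlaps(a, b):
        # Two half-open intervals overlap if each starts before the other ends.
        return a[0] < b[1] and b[0] < a[1]

    # Fillers must not overlap any chosen special-event clip.
    candidates = [f for f in filler_clips
                  if not any(overlaps(f, e) for e in chosen)]
    rng.shuffle(candidates)
    fillers = candidates[: total_slots - len(chosen)]
    return sorted(chosen), sorted(fillers)


# Illustrative data only; times are in seconds.
example_events = {
    "explosion": [(10, 15), (100, 105)],
    "dialog": [(40, 45)],
    "gunfire": [(200, 205)],
}
example_fillers = [(12, 18), (60, 65), (150, 155), (300, 305), (42, 44)]
events, fillers = select_abstract_clips(example_events, example_fillers, 4)
```

Choosing the event type first keeps a frequent event type from dominating the abstract; sampling clips directly from the pooled list would be an equally plausible reading of the text.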
Recent advances in the automatic recognition of audio-visual speech
- PROC. IEEE
, 2003
Cited by 64 (10 self)
Visual speech information from the speaker’s mouth region has been successfully shown to improve noise robustness of automatic speech recognizers, thus promising to extend their usability in the human computer interface. In this paper, we review the main components of audio-visual automatic speech recognition and present novel contributions in two main areas: First, the visual front end design, based on a cascade of linear image transforms of an appropriate video region-of-interest, and subsequently, audio-visual speech integration. On the latter topic, we discuss new work on feature and decision fusion combination, the modeling of audio-visual speech asynchrony, and incorporating modality reliability estimates to the bimodal recognition process. We also briefly touch upon the issue of audio-visual adaptation. We apply our algorithms to three multi-subject bimodal databases, ranging from small- to large-vocabulary recognition tasks, recorded in both visually controlled and challenging environments. Our experiments demonstrate that the visual modality improves automatic speech recognition over all conditions and data considered, though less so for visually challenging environments and large vocabulary tasks.
Audio-visual automatic speech recognition: An overview
- Issues in Visual and Audio-visual Speech Processing
, 2004
Cited by 41 (0 self)
We have made significant progress in automatic speech recognition (ASR) for well-defined applications like dictation and medium vocabulary transaction processing tasks in relatively controlled environments. However, ASR performance has yet to reach the level required for speech to become a truly pervasive user interface. Indeed, even in “clean” acoustic environments, and for a variety of tasks, state of the art ASR system
A Segment-Based Audio-Visual Speech Recognizer: Data Collection, Development, and Initial Experiments
- In Proc. ICMI
, 2004
Cited by 13 (6 self)
This paper presents the development and evaluation of a speaker-independent audio-visual speech recognition (AVSR) system that utilizes a segment-based modeling strategy. To support this research, we have collected a new video corpus, called Audio-Visual TIMIT (AV-TIMIT), which consists of 4 total hours of read speech collected from 223 different speakers. This new corpus was used to evaluate our new AVSR system which incorporates a novel audio-visual integration scheme using segment-constrained Hidden Markov Models (HMMs). Preliminary experiments have demonstrated improvements in phonetic recognition performance when incorporating visual information into the speech recognition process.
Automatic Facial Expression Recognition Using Facial Animation Parameters And Multi-Stream HMMs
- Information Forensics and Security, IEEE Transactions on Volume 1, Issue 1, March 2006 Page(s):3
, 2005
Cited by 12 (1 self)
The performance of an automatic facial expression recognition system can be significantly improved by modeling the reliability of different streams of facial expression information utilizing multistream hidden Markov models (HMMs). In this paper, we present an automatic multistream HMM facial expression recognition system and analyze its performance. The proposed system utilizes facial animation parameters (FAPs), supported by the MPEG-4 standard, as features for facial expression classification. Specifically, the FAPs describing the movement of the outer-lip contours and eyebrows are used as observations. Experiments are first performed employing single-stream HMMs under several different scenarios, utilizing outer-lip and eyebrow FAPs individually and jointly. A multistream HMM approach is proposed for introducing facial expression and FAP group dependent stream reliability weights. The stream weights are determined based on the facial expression recognition results obtained when FAP streams are utilized individually. The proposed multistream HMM facial expression system, which utilizes stream reliability weights, achieves a relative reduction of the facial expression recognition error of 44% compared to the single-stream HMM system. Index Terms: facial expression recognition, multistream HMMs, facial animation parameters.
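To make the role of stream reliability weights concrete, here is a small Python sketch. It is a simplification and not the paper's system: it combines whole-sequence log-likelihoods per stream, whereas a multistream HMM applies the weights to state-level emission scores. The function name, the expression labels, and all numbers are invented for illustration.

```python
def classify_multistream(stream_loglik, stream_weights):
    """Weighted combination of per-stream HMM scores (hypothetical helper).

    stream_loglik[expr][stream]  = log-likelihood of that stream's
                                   observations under the model for expr
    stream_weights[expr][stream] = reliability weight, dependent on both
                                   the expression and the FAP group
    """
    scores = {}
    for expr, per_stream in stream_loglik.items():
        # Weighted sum of log-likelihoods, i.e. a weighted product of
        # the stream likelihoods.
        scores[expr] = sum(stream_weights[expr][s] * ll
                           for s, ll in per_stream.items())
    best = max(scores, key=scores.get)
    return best, scores


# Made-up scores for two expressions and two FAP streams.
loglik = {
    "joy": {"lips": -120.0, "eyebrows": -200.0},
    "surprise": {"lips": -150.0, "eyebrows": -140.0},
}
# Assumed weights: lips are treated as more reliable for "joy".
weights = {
    "joy": {"lips": 0.8, "eyebrows": 0.2},
    "surprise": {"lips": 0.4, "eyebrows": 0.6},
}
equal = {expr: {"lips": 0.5, "eyebrows": 0.5} for expr in loglik}

best_weighted, _ = classify_multistream(loglik, weights)   # "joy"
best_equal, _ = classify_multistream(loglik, equal)        # "surprise"
```

With these numbers the decision flips between equal and expression-dependent weights, which is the kind of effect the reliability weights are meant to capture.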
Articulatory Features for Robust Visual Speech Recognition
, 2004
Cited by 12 (4 self)
Visual information has been shown to improve the performance of speech recognition systems in noisy acoustic environments. However, most audio-visual speech recognizers rely on a clean visual signal. In this paper, we explore a novel approach to visual speech modeling, based on articulatory features, which has potential benefits under visually challenging conditions. The idea is to use a set of parallel SVM classifiers to extract different articulatory attributes from the input images, and then combine their decisions to obtain higher-level units, such as visemes or words. We evaluate our approach in a preliminary experiment on a small audio-visual database, using several image noise conditions, and compare it to the standard viseme-based modeling approach.
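One way to picture the "combine their decisions" step is the Python sketch below. It assumes each articulatory classifier's output has already been converted to a probability distribution over that feature's values, and it scores each viseme with a naive product rule. The feature names, the viseme table, and the combination rule are illustrative assumptions, not the paper's method.

```python
import math

# Hypothetical mapping from visemes to articulatory feature values.
VISEME_TABLE = {
    "closed": {"lip_opening": "closed", "lip_rounding": "no"},
    "rounded": {"lip_opening": "narrow", "lip_rounding": "yes"},
    "open": {"lip_opening": "wide", "lip_rounding": "no"},
}


def combine_feature_decisions(posteriors, table=VISEME_TABLE):
    """Score visemes from per-feature classifier outputs.

    posteriors[feature][value] = probability that the feature takes
    that value in the current frame, one distribution per classifier.
    """
    scores = {}
    for viseme, spec in table.items():
        # Sum of log-probabilities = product rule under an
        # independence assumption between features.
        scores[viseme] = sum(
            math.log(max(posteriors[feat].get(val, 0.0), 1e-12))
            for feat, val in spec.items()
        )
    best = max(scores, key=scores.get)
    return best, scores


# Made-up classifier outputs for a single frame.
frame = {
    "lip_opening": {"closed": 0.1, "narrow": 0.6, "wide": 0.3},
    "lip_rounding": {"yes": 0.7, "no": 0.3},
}
viseme, viseme_scores = combine_feature_decisions(frame)
```

Because each feature classifier is scored separately, a noisy frame that confuses one classifier lowers only that term of the sum rather than corrupting a single monolithic viseme classifier.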
Effective Tracking through Tree-Search
, 2003
Cited by 4 (0 self)
A new contour tracking algorithm is presented. Tracking is posed as a matching problem between curves constructed out of edges in the image, and some shape space describing the class of objects of interest. The main contributions of the paper are to present an algorithm which solves this problem accurately and efficiently, in a provable manner. In particular, the algorithm's efficiency derives from a novel tree-search algorithm through the shape space, which allows for much of the shape space to be explored with very little effort. This latter property makes the algorithm effective in highly cluttered scenes, as is demonstrated in an experimental comparison with a condensation tracker.
Integration Of Multimodal Features For Video Scene Classification Based On HMM
- In IEEE Workshop on Multimedia Signal Processing
, 1999
Along with the advance in multimedia and internet technology, a huge amount of data, including digital video and audio, are generated daily. Tools for efficient indexing and retrieval are indispensable. With multi-modal information present in the data, effective integration is necessary and is still a challenging problem. In this paper, we present four different methods for integrating audio and visual information for video classification based on Hidden Markov Model. Our results have shown significant improvement over using single modality. INTRODUCTION Along with the advancement in multimedia and internet technology, the amount of digital data, including TV programs, conference archives, and movies, grows exponentially. For efficient access, understanding, and retrieval of digital video, tools that can automatically understand the semantic content in a video are becoming indispensable. In this paper, we consider the classification of a video sequence into one of a few predetermined scene types. ...
HMM-Based Audio-Visual Speech Recognition Integrating Geometric- And Appearance-Based Visual Features
- in Proc. Works. Multimedia Signal Processing
, 2001
A good front end for visual feature extraction is an important element of audio-visual speech recognition systems. We propose a new visual feature representation that combines both geometric- and pixel-based features. Using our previously developed contour-based lip-tracking algorithm, geometric features including the height and width of the lips are automatically extracted. Lip boundary tracking allows accurate determination of a region of interest from which we construct pixel-based features that are robust to variation in scale and translation. Motivated by computational considerations, we selected a subset of the pixels in the center of the inner mouth area that was found to capture sufficient details of the appearance of the teeth and tongue for assisting in the discrimination of spoken words. We show the advantage of the combination of these visual features for visual-only and audio-visual speech recognition of isolated digits.
Genetic Snakes: Application on Lipreading
, 2003
Contrary to Genetic Snakes, the current methods of mouth modeling are very sensitive to initialization (position of a snake or a deformable contour before convergence) and fall easily into local minima. We propose in this article to make two snakes converge in parallel via a genetic algorithm. The coding of the chromosome takes into account at the same time gradients and region type information contained in the image. In addition we introduce the use of STM (Sparse Template Matching) into the field of lipreading. Thanks to a temporal filter, word signatures (stored in Sparse Templates) make it possible to recognize various words pronounced several times at one-week intervals.

