Results 1 - 10 of 21
Modeling Coarticulation in Synthetic Visual Speech
- Models and Techniques in Computer Animation, 1993
Cited by 172 (13 self)
After describing the importance of visual information in speech perception and sketching the history of visual speech synthesis, we consider a number of theories of coarticulation in human speech. An implementation of Löfqvist’s (1990) gestural theory of speech production is described for visual speech synthesis along with a description of the graphically controlled development system. We conclude with some plans for future work.
The Cog project: Building a humanoid robot
- Lecture Notes in Computer Science, 1999
Cited by 125 (7 self)
To explore issues of developmental structure, physical embodiment, integration of multiple sensory and motor systems, and social interaction, we have constructed an upper-torso humanoid robot called Cog. The robot has twenty-one degrees of freedom and a variety of sensory systems, including visual, auditory, vestibular, kinesthetic, and tactile senses. This chapter gives a background on the methodology that we have used in our investigations, highlights the research issues that have been raised during this project, and provides a summary of both the current state of the project and our long-term goals. We report on a variety of implemented visual-motor routines (smooth-pursuit tracking, saccades, binocular vergence, and vestibular-ocular and opto-kinetic reflexes), orientation behaviors, motor control techniques, and social behaviors (pointing to a visual target, recognizing joint attention through face and eye finding, imitation of head nods, and regulating interaction through expressive feedback). We further outline a number of areas for future research that will be necessary to build a complete embodied system.
Alternative essences of intelligence
- 1998
Cited by 56 (11 self)
We present a novel methodology for building humanlike artificially intelligent systems. We take as a model the only existing systems which are universally accepted as intelligent: humans. We emphasize building intelligent systems which are not masters of a single domain, but, like humans, are adept at performing a variety of complex tasks in the real world. Using evidence from cognitive science and neuroscience, we suggest four alternative essences of intelligence to those held by classical AI. These are the parallel themes of development, social interaction, embodiment, and integration. Following a methodology based on these themes, we have built a physical humanoid robot. In this paper we present our methodology and the insights it affords for facilitating learning, simplifying the computation underlying rich behavior, and building systems that can scale to more complex tasks in more challenging environments.
Multimedia content processing through cross-modal association
- In MULTIMEDIA ’03: Proceedings of the eleventh ACM international conference on Multimedia, 2003
Cited by 22 (1 self)
Multimodal information processing has received considerable attention in recent years. The focus of existing research in this area has been predominantly on the use of fusion technology. In this paper, we suggest that cross-modal association can provide a new set of powerful solutions in this area. We investigate different cross-modal association methods using the linear correlation model. We also introduce a novel method for cross-modal association called Cross-modal Factor Analysis (CFA). Our earlier work on Latent Semantic Indexing (LSI) is extended for applications that use offline supervised training. As a promising research direction and practical application of cross-modal association, cross-modal information retrieval, where queries from one modality are used to search for content in another modality using low-level features, is then discussed in detail. Different association methods are tested and compared using the proposed cross-modal retrieval system. All these methods achieve significant dimensionality reduction. Among them, CFA gives the best retrieval performance. Finally, this paper addresses the use of cross-modal association to detect talking heads. The CFA method achieves 91.1% detection accuracy, while LSI and Canonical Correlation Analysis (CCA) achieve 66.1% and 73.9% accuracy, respectively. As shown by experiments, cross-modal association provides many useful benefits, such as robust noise resistance and effective feature selection. Compared to CCA and LSI, the proposed CFA shows several advantages in analysis performance and feature usage. Its capability in feature selection and noise resistance also makes CFA a promising tool for many multimedia analysis applications.
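The cross-modal retrieval idea in this abstract can be illustrated with a small sketch. It is not the authors' code: the abstract does not spell out the CFA objective, so the formulation used here (orthonormal transforms that minimize the Frobenius distance between the two projected modalities, obtained from an SVD of the cross-product matrix) is an assumption, and the names `fit_cfa` and `retrieve` are hypothetical. The sketch assumes NumPy is available.

```python
import numpy as np


def fit_cfa(X, Y, k):
    """Fit a k-dimensional linear association between paired features.

    X has shape (n, p) and Y has shape (n, q); row i of X is paired
    with row i of Y. ASSUMPTION: orthonormal A and B minimizing
    ||XA - YB||_F are taken from the SVD of the centered X^T Y.
    """
    mean_x, mean_y = X.mean(axis=0), Y.mean(axis=0)
    cross = (X - mean_x).T @ (Y - mean_y)            # shape (p, q)
    U, _, Vt = np.linalg.svd(cross, full_matrices=False)
    A = U[:, :k]                                     # transform for modality X
    B = Vt[:k].T                                     # transform for modality Y
    return A, B, mean_x, mean_y


def retrieve(query_x, candidates_y, model):
    """Return the index of the Y candidate closest to an X query
    after both are projected into the shared k-dimensional space."""
    A, B, mean_x, mean_y = model
    q = (query_x - mean_x) @ A
    C = (candidates_y - mean_y) @ B
    return int(np.argmin(np.linalg.norm(C - q, axis=1)))
```

Retrieval here is a nearest-neighbor search in the shared space, so a query from one modality (for example, audio features) can be matched against candidates from another (for example, visual features) without comparing raw feature vectors directly.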
A Critique of Pure Audition
- Proceedings of the Computational Auditory Scene Analysis Workshop, Joint International Conference on Artificial Intelligence, 1995
Cited by 21 (0 self)
All sound separation systems based on perception assume a bottom-up or Marr-like view of the world. Sound is processed by a cochlear model, passed to an analysis system, grouped into objects, and then passed to higher level processing systems. The information flow is strictly bottom up, with no information flowing down from higher level expectations. Is this the right approach? This paper summarizes the existing bottom-up perceptual models, and the evidence for more top-down processing. This paper describes many of the auditory and visual effects that indicate top-down information flow. Hopefully this paper will generate discussion about the role of top-down processing, whether this information should be included in sound separation models, and how to build testable architectures.
The dynamics of audiovisual behavior in speech
- Speechreading by Humans and Machines: Models, Systems, and Applications, volume 150 of NATO ASI Series. Series F: Computer and Systems Sciences, 1996
Cited by 15 (4 self)
While it is well known that faces provide linguistically relevant information during communication, most efforts to identify the visual correlates of the acoustic signal have focused on the shape, position and luminance of the oral aperture. In this work, we extend the analysis to full facial motion under the assumption that the process of producing speech acoustics generates linguistically salient visual information, which is distributed over large portions of the face. Support for this is drawn from our recent studies of the eye movements of perceivers during a variety of audiovisual speech perception tasks. These studies suggest that perceivers detect visual information at low spatial frequencies and that such information may not be restricted to the region of the oral aperture. Since the biomechanical linkage between the facial and vocal tract systems is one of close proximity and shared physiology, we propose that physiological models of speech and facial motion be integrated into one audiovisual model of speech production. In addition to providing a coherent account of audiovisual motor control, the proposed model could become a useful experimental tool, providing synthetic audiovisual stimuli with realistic control parameters.
A Text-To-Audiovisual-Speech Synthesizer For French
- In Proceedings of the International Conference on Spoken Language Processing (ICSLP), 1996
Cited by 15 (0 self)
An audiovisual speech synthesizer from unlimited French text is presented here. It uses a 3-D parametric model of the face. The facial model is controlled by eight parameters. Target values have been assigned to the parameters, for each French viseme, based upon measurements made on a human speaker. Parameter trajectories are modeled by means of dominance functions associated with each parameter and each viseme. A dominance function is characterized by three coefficients so that coarticulation finally depends on the phonetic context, the speech rate, and a "hypo-hyper articulation" coefficient adjustable by the user. Finally, the visual and audiovisual intelligibility of our visual synthesizer has been evaluated in its first version, and compared to that of the acoustic synthesizer on which it was implemented.
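The dominance-function blending described in this abstract can be sketched as follows. This is an illustrative sketch only, not the authors' implementation: the abstract does not give the functional form, so the negative-exponential shape, the coefficient names `alpha`, `theta`, and `c`, and the helper names `dominance` and `blend` are all assumptions, chosen because that shape is common in dominance-based coarticulation models.

```python
import math


def dominance(t, center, alpha, theta, c):
    """Dominance of one viseme over one facial parameter at time t.

    ASSUMED form: alpha * exp(-theta * |t - center|**c), where alpha is
    the peak strength, theta the decay rate, and c the shape exponent.
    """
    return alpha * math.exp(-theta * abs(t - center) ** c)


def blend(t, segments):
    """Parameter value at time t as a dominance-weighted average.

    Each segment is (target, center, alpha, theta, c): the viseme's
    target value for the parameter, the time at which its dominance
    peaks, and the three assumed dominance coefficients.
    """
    weighted_sum = 0.0
    total_dominance = 0.0
    for target, center, alpha, theta, c in segments:
        d = dominance(t, center, alpha, theta, c)
        weighted_sum += d * target
        total_dominance += d
    return weighted_sum / total_dominance
```

Because every viseme contributes at every instant, a neighboring viseme with a slowly decaying dominance pulls the trajectory toward its own target, which is how this kind of model produces coarticulation. In such a scheme, a faster speech rate or a change in articulation effort would be expressed by rescaling the coefficients.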
Audio-visual and Multimodal Speech Systems
- In D. Gibbon (Ed.) Handbook of Standards and Resources for Spoken Language Systems - Supplement Volume
Cited by 12 (0 self)
[Figure 13: Multimodal Design Space (adapted from [224]); axis labels include Signal Level and Semantic Level.]
... system in the design space is the pivotal center of its features. According to the characterization of an interaction along the two dimensions, fusion and use of modalities, four basic types of multimodal interactions can be distinguished: alternative, synergistic, exclusive, and concurrent multimodal interaction, as shown in Figure 13. Obviously, synergistic systems subsume the other three classes of multimodal systems. Therefore, architectural models of multimodal integration (as presented in the next subsection and in Section 9) are sufficient if they are able to model synergistic cooperation of modalities.
6.2.2 Fusion of Multimodal Input
Fusion of multimodal input events can occur on different levels, ranging from signal-level to semantic-level. Signal-level fusion (or lexical fusion [224]) performs the combination of multimodal input at the level of the input signal. Signal-level fusion has...
Issues In Measuring The Benefits Of Multimodal Interfaces
- Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'97), 1997
Cited by 8 (5 self)
Multimedia interfaces are rapidly evolving to facilitate human/machine communication. Most of the technologies on which they are based are, as yet, imperfect. But the interfaces do begin to allow information exchange in ways familiar and comfortable to the human, principally through natural actions in the sensory dimensions of sight, sound and touch. Further, as digital networking becomes ubiquitous, the opportunity grows for collaborative work through conferenced computing. In this context the machine takes on the role of mediator in human/machine/human communication, the ideal being to extend the intellectual abilities of humans through access to distributed information resources and collective decision making. The challenge is how to design machine mediation so that it extends, not impedes, human abilities. This report describes evolving work to incorporate multimodal interfaces into a networked system for collaborative distributed computing. It also addresses strategies for qu...
3D Models of the Lips for Realistic Speech Animation
- In Computer Animation'96, 1996
Cited by 7 (1 self)
3D models of the lips have been developed in the framework of an audiovisual articulatory speech synthesizer. Unlike most of the regions of the human face, the lips are essentially characterized by their border contours. The internal and external contours of the vermilion zone can be fitted by means of algebraic equations. The coefficients of these equations must be controlled so that the lip shape can be adapted to various speakers' conformations and to any speech gesture. To reach this goal, a 3D model of the lips has been worked out from geometrical analysis of the natural lips of a French speaker. Our lip model was developed to adjust a set of continuous functions best fitting the contours of 22 reference lip shapes. Only five parameters are necessary to predict all the equations of the contours of the lip model. From this model, a volumetric model based on implicit surfaces was also developed to take into account lip contact.
1 Introduction Over the last score years, many researchers at...

