Learning words from sights and sounds: a computational model
, 2002
Cited by 182 (29 self)
This paper presents an implemented computational model of word acquisition which learns directly from raw multimodal sensory input. Set in an information theoretic framework, the model acquires a lexicon by finding and statistically modeling consistent cross-modal structure. The model has been implemented in a system using novel speech processing, computer vision, and machine learning algorithms. In evaluations the model successfully performed speech segmentation, word discovery and visual categorization from spontaneous infant-directed speech paired with video images of single objects. These results demonstrate the possibility of using state-of-the-art techniques from sensory pattern recognition and machine learning to implement cognitive models which can process raw sensor data without the need for human transcription or labeling.
Grounded semantic composition for visual scenes
- Journal of Artificial Intelligence Research
, 2004
Cited by 70 (21 self)
We present a visually-grounded language understanding model based on a study of how people verbally describe objects in scenes. The emphasis of the model is on the combination of individual word meanings to produce meanings for complex referring expressions. The model has been implemented, and it is able to understand a broad range of spatial referring expressions. We describe our implementation of word level visually-grounded semantics and their embedding in a compositional parsing framework. The implemented system selects the correct referents in response to natural language expressions for a large percentage of test cases. In an analysis of the system’s successes and failures we reveal how visual context influences the semantics of utterances and propose future extensions to the model that take such context into account.
Semiotic Schemas: A Framework for Grounding Language in Action and Perception
, 2005
Cited by 58 (10 self)
A theoretical framework for grounding language is introduced that provides a computational path from sensing and motor action to words and speech acts. The approach combines concepts from semiotics and schema theory to develop a holistic approach to linguistic meaning. Schemas serve as structured beliefs that are grounded in an agent’s physical environment through a causal-predictive cycle of action and perception. Words and basic speech acts are interpreted in terms of grounded schemas. The framework reflects lessons learned from implementations of several language processing robots. It provides a basis for the analysis and design of situated, multimodal communication systems that straddle symbolic and non-symbolic realms.
When Push comes to Shove: A Computational Model of the Role of Motor Control in the Acquisition of Action Verbs
, 1997
Cited by 57 (1 self)
Children learn a variety of verbs for hand actions starting in their second year of life. The semantic distinctions can be subtle, and they vary across languages, yet they are learned quickly. How is this possible? This dissertation explores the hypothesis that to explain the acquisition and use of action verbs, motor control must be taken into account. It presents a model of embodied semantics, based on the principles of neural computation in general and on the human motor system in particular, which takes a set of labelled actions and learns both to label novel actions and to obey verbal commands. A key feature of the model is the executing schema, an active controller mechanism which, by actually driving behavior, allows the model to carry out verbal commands. A hard-wired mechanism links the activity of executing schemas to a set of linguistically important features including hand posture, joint motions, force, aspect and goals. The feature set is relatively small and is fixed, helping to make learning tractable. Moreover, the use of traditional feature structures facilitates the use of model merging, a Bayesian probabilistic learning algorithm which rapidly learns plausible word meanings, automatically determines an appropriate number of senses for each verb, and can plausibly be mapped to a connectionist recruitment
Modeling Embodied Lexical Development
- IN PROCEEDINGS OF THE 19TH COGNITIVE SCIENCE SOCIETY CONFERENCE
, 1997
Cited by 54 (8 self)
This paper presents an implemented computational model of lexical development for the case of action verbs. A simulated agent is trained by an informant giving labels to the agent's actions (here hand motions) and the system learns to both label and carry out similar actions. Computationally, the system employs a novel form of active representation and is explicitly intended to be neurally plausible. The learning methodology is a version of Bayesian model merging (Omohundro, 1992). The verb learning model is placed in the broader context of the L0 project on embodied natural language and its acquisition.
Embodied Construction Grammar in Simulation-Based Language Understanding
- EDS): CONSTRUCTION GRAMMAR(S): COGNITIVE AND CROSS-LANGUAGE DIMENSIONS. JOHN BENJAMINS PUBLISHING COMPANY
, 2003
Cited by 41 (12 self)
We present Embodied Construction Grammar, a formalism for linguistic analysis designed specifically for integration into a simulation-based model of language understanding. As in other construction grammars, linguistic constructions serve to map between phonological forms and conceptual representations.
Mental Imagery for a Conversational Robot
, 2004
Cited by 36 (17 self)
To build robots that engage in fluid face-to-face spoken conversations with people, robots must have ways to connect what they say to what they see. A critical aspect of how language connects to vision is that language encodes points of view. The meaning of my left and your left differs due to an implied shift of visual perspective. The connection of language to vision also relies on object permanence. We can talk about things that are not in view. For a robot to participate in situated spoken dialog, it must have the capacity to imagine shifts of perspective, and it must maintain object permanence. We present a set of representations and procedures that enable a robotic manipulator to maintain a “mental model” of its physical environment by coupling active vision to physical simulation. Within this model, “imagined” views can be generated from arbitrary perspectives, providing the basis for situated language comprehension and production. An initial application of mental imagery for spatial language understanding for an interactive robot is described.
A Trainable Spoken Language Understanding System For Visual Object Selection
- In Proceedings of the International Conference of Spoken Language Processing
, 2002
Cited by 32 (16 self)
We present a trainable, visually-grounded, spoken language understanding system. The system acquires a grammar and vocabulary from a "show-and-tell" procedure in which visual scenes are paired with verbal descriptions. The system is embodied in a table-top mounted active vision platform. During training, a set of objects is placed in front of the vision system. Using a laser pointer, the system points to objects in random sequence, prompting a human teacher to provide spoken descriptions of the selected objects. The descriptions are transcribed and used to automatically acquire a visually-grounded vocabulary and grammar. Once trained, a person can interact with the system by verbally describing objects placed in front of the system. The system recognizes and robustly parses the speech and points, in real-time, to the object which best fits the visual semantics of the spoken description.
Learning Visually-Grounded Words and Syntax for a Scene Description Task
Cited by 30 (16 self)
A spoken language generation system has been developed that learns to describe objects in computer-generated visual scenes. The system is trained by a `show-and-tell' procedure in which visual scenes are paired with natural language descriptions. Learning algorithms acquire probabilistic structures which encode the visual semantics of phrase structure, word classes, and individual words. Using these structures, a planning algorithm integrates syntactic, semantic, and contextual constraints to generate natural and unambiguous descriptions of objects in novel scenes.

