Mental Imagery for a Conversational Robot
2004
Cited by 36 (17 self)
To build robots that engage in fluid face-to-face spoken conversations with people, robots must have ways to connect what they say to what they see. A critical aspect of how language connects to vision is that language encodes points of view. The meaning of my left and your left differs due to an implied shift of visual perspective. The connection of language to vision also relies on object permanence. We can talk about things that are not in view. For a robot to participate in situated spoken dialog, it must have the capacity to imagine shifts of perspective, and it must maintain object permanence. We present a set of representations and procedures that enable a robotic manipulator to maintain a “mental model” of its physical environment by coupling active vision to physical simulation. Within this model, “imagined” views can be generated from arbitrary perspectives, providing the basis for situated language comprehension and production. An initial application of mental imagery for spatial language understanding for an interactive robot is described.
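The point about "my left" and "your left" can be illustrated with a small coordinate transform. This is only a sketch under simple assumptions (a flat 2D scene, headings in radians, hypothetical function names); it is not the representation used in the paper, which couples active vision to physical simulation.

```python
import math

def to_viewer_frame(point, viewer_pos, heading):
    """Express a world-frame 2D point in a viewer's frame.

    The viewer stands at viewer_pos and faces along `heading` (radians,
    measured counter-clockwise from the world x-axis). Returns
    (forward, left) distances relative to that viewer.
    """
    dx = point[0] - viewer_pos[0]
    dy = point[1] - viewer_pos[1]
    c, s = math.cos(heading), math.sin(heading)
    forward = dx * c + dy * s
    left = -dx * s + dy * c
    return forward, left

def side_of(point, viewer_pos, heading):
    """Return 'left' or 'right' from the given viewer's perspective."""
    _, left = to_viewer_frame(point, viewer_pos, heading)
    return "left" if left > 0 else "right"

# Hypothetical scene: robot and person face each other across a table.
cup = (1.0, 1.0)
robot_side = side_of(cup, viewer_pos=(0.0, 0.0), heading=0.0)
person_side = side_of(cup, viewer_pos=(2.0, 0.0), heading=math.pi)
```

In this toy scene the same cup lies on the robot's left and on the person's right, which is the shift of perspective the abstract refers to.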
Multimodal new vocabulary recognition through speech and handwriting in a whiteboard scheduling application
In Proceedings of the International Conference on Intelligent User Interfaces, 2005
Cited by 14 (2 self)
Our goal is to automatically recognize and enroll new vocabulary in a multimodal interface. To accomplish this, our technique aims to leverage the mutually disambiguating aspects of co-referenced, co-temporal handwriting and speech. The co-referenced semantics are spatially and temporally determined by our multimodal interface for schedule chart creation. This paper motivates and describes our technique for recognizing out-of-vocabulary (OOV) terms and enrolling them dynamically in the system. We report results for the detection and segmentation of OOV words within a small multimodal test set. On the same test set we also report utterance-, word- and pronunciation-level error rates, both over individual input modes and multimodally. We show that combining information from handwriting and speech yields significantly better results than achievable by either mode alone.
T.: Active information selection: Visual attention through the hands
IEEE Transactions on Autonomous Mental Development
Cited by 11 (9 self)
An important goal in studying both human intelligence and artificial intelligence is to understand how a natural or an artificial learning system deals with the uncertainty and ambiguity of the real world. For a natural intelligence system such as a human toddler, the relevant aspects in a learning environment are only those that make contact with the learner’s sensory system. In real-world interactions, what the child perceives critically depends on his own actions as these actions bring information into and out of the learner’s sensory field. The present analyses indicate how, in the case of a toddler playing with toys, these perception-action loops may simplify the learning environment by selecting relevant information and filtering irrelevant information. This paper reports new findings using a novel method that seeks to describe the visual learning environment from a young child’s point of view and measures the visual information that a child perceives in real-time toy play with a parent. The main results are 1) what the child perceives primarily depends on his own actions but also his social partner’s actions; 2) manual actions, in particular, play a critical role in creating visual experiences in which one object dominates; 3) this selecting and filtering of visual objects through the actions of the child provides more constrained and clean input that seems likely to facilitate cognitive learning processes. These findings have broad implications for how one studies and thinks about human and artificial learning systems.
A Unified Model of Early Word Learning: Integrating Statistical and Social Cues
Cited by 9 (0 self)
Previous work on early language acquisition has shown that word meanings can be acquired by an associative procedure that maps perceptual experience onto linguistic labels based on cross-situational observation. A new trend termed social-pragmatic theory [27] focuses on the effect of the child’s social-cognitive capacities, such as joint attention and intention reading. In this paper, we argue that statistical and social cues can be seamlessly integrated to facilitate early word learning. To support this idea, we first introduce a statistical learning mechanism that provides a formal account of cross-situational observation. The main part of this paper then presents a unified model that is able to make use of different kinds of social cues, such as joint attention and prosody in maternal speech, in the statistical learning framework. In a computational analysis of infant data, we report the quantitative results of our unified model in computing word-meaning associations, which outperforms the purely statistical learning method.
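Cross-situational observation, as described in this abstract, can be sketched as co-occurrence counting over ambiguous situations, with an optional per-object weight standing in for a social cue such as joint attention. The abstract does not spell out the paper's actual statistical mechanism, so the counting scheme, the `attention` weights, the function names and the toy data below are all assumptions made for illustration.

```python
from collections import defaultdict

def cross_situational_counts(situations, attention=None):
    """Accumulate word/object association scores across situations.

    situations: list of (words, objects) pairs, each an ambiguous scene.
    attention:  optional list (parallel to situations) of dicts mapping an
                object to a weight; a hypothetical stand-in for a social
                cue. Objects without an entry get weight 1.0.
    """
    scores = defaultdict(lambda: defaultdict(float))
    for i, (words, objects) in enumerate(situations):
        weights = attention[i] if attention is not None else {}
        for word in words:
            for obj in objects:
                scores[word][obj] += weights.get(obj, 1.0)
    return scores

def best_referent(scores, word):
    """Return the object most strongly associated with the word."""
    candidates = scores[word]
    return max(candidates, key=candidates.get)

# Hypothetical toy data: each scene pairs two words with two objects.
situations = [
    (["ball", "dog"], ["BALL", "DOG"]),
    (["ball", "cup"], ["BALL", "CUP"]),
    (["dog", "cup"], ["DOG", "CUP"]),
]
scores = cross_situational_counts(situations)

# One ambiguous scene, disambiguated only by the attention weight.
cued = cross_situational_counts(
    [(["ball"], ["BALL", "DOG"])],
    attention=[{"BALL": 2.0}],
)
```

No single scene in the toy data identifies a referent, but the counts across scenes do. In the single-scene case, the attention weight alone breaks the tie.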
Computational models in the debate over language learnability
2007
Cited by 5 (2 self)
Computational models have played a central role in the debate over language learnability. This article discusses how they have been used in different “stances”, from generative views to more recently introduced explanatory frameworks based on embodiment, cognitive development and cultural evolution. By digging into the details of certain specific models, we show how they organize, transform and rephrase defining questions about what makes language learning possible for children. Finally, we present a tentative synthesis to recast the debate using the notion of learning bias.
P.: Investigating Multimodal Real-Time Patterns of Joint Attention in an HRI Word Learning Task
In: 5th ACM/IEEE International Conference on Human-Robot Interaction (2010)
Cited by 4 (4 self)
Joint attention – the idea that humans make inferences from observable behaviors of other humans by attending to the objects and events that these other humans attend to – has been recognized as a critical component in human-robot interactions. While various HRI studies showed that having robots behave in ways that support human recognition of joint attention leads to better behavioral outcomes on the human side, there are no studies that investigate the detailed time course of interactive joint attention processes. In this paper, we present the results from an HRI study that investigates the exact time course of human multi-modal attentional processes during an HRI word learning task in an unprecedented way. Using novel data analysis techniques, we are able to demonstrate that the temporal details of human attentional behavior are critical for understanding human expectations of joint attention in HRI and that failing to do so can force humans into assuming unnatural behaviors. Keywords: human-robot interaction; joint attention
Robot Developmental Learning of an Object Ontology Grounded in Sensorimotor Experience
2007
Cited by 1 (0 self)
This work would not have been possible without the support of many individuals. My advisor Ben Kuipers has been a source of constant support and stimulation. Without his guidance, my research would have been both less ambitious and less interesting. His relentless pursuit of clear answers to important questions in science is tempered with joviality and compassion. These qualities have shaped my approach to research. My thesis committee has significantly influenced this work, both in my research and by the demand that lofty claims are supported by empirical evidence. Ray Mooney introduced me to the importance of evaluation in empirical AI. Brian Stankiewicz gave me an understanding of how psychologists can evaluate human performance, and this allowed me to evaluate robot performance. David Kortenkamp inspired me to think about how a robot can pay attention to events outside of a narrowly specified task with a talk at AAAI that mentioned the difficulty of having a planetary rover notice a dinosaur bone. Peter Stone encouraged me to consider the relevance of reasoning about agents and actions on physical robots. My lab colleagues have patiently endured many discussions, thoughtfully prodded on the weak spots in my research, suggested improvements, critiqued documents, and aided in experimental setups. Special thanks go to Patrick Beeson, Aniket Murarka, Matt MacMahon, Subramanian Ramamoorthy, Jefferson Provost, Harold Chaput, Jonathan Mugan, Changhai Xu, and Shilpa Gulati. My many friends in Austin have preserved my sanity over the years. Special thanks go out to Mikhail Bilenko, Matt Horstman, Prem Melville, Amol Nayate, and Lucas Wilcox for being
Combining Background Knowledge and Learned Topics
Cited by 1 (0 self)
Statistical topic models provide a general data-driven framework for automated discovery of high-level knowledge from large collections of text documents. Although topic models can potentially discover a broad range of themes in a data set, the interpretability of the learned topics is not always ideal. Human-defined concepts, however, tend to be semantically richer due to careful selection of words that define the concepts, but they may not span the themes in a data set exhaustively. In this study, we review a new probabilistic framework for combining a hierarchy of human-defined semantic concepts with a statistical topic model to seek the best of both worlds. Results indicate that this combination leads to systematic improvements in generalization performance as well as enabling new techniques for inferring and visualizing the content of a document.
What you learn is what you see: using eye movements to study infant cross-situational word learning
Cited by 1 (1 self)
Recent studies show that both adults and young children possess powerful statistical learning capabilities to solve the word-to-world mapping problem. However, the underlying mechanisms that make statistical learning possible and powerful are not yet known. With the goal of providing new insights into this issue, the research reported in this paper used an eye tracker to record the moment-by-moment eye movement data of 14-month-old babies in statistical learning tasks. Various measures are applied to such fine-grained temporal data, such as looking duration and shift rate (the number of shifts in gaze from one visual object to the other) trial by trial, showing different eye movement patterns between strong and weak statistical learners. Moreover, an information-theoretic measure is developed and applied to gaze data to quantify the degree of learning uncertainty trial by trial. Next, a simple associative statistical learning model is applied to eye movement data and these simulation results are compared with empirical results from young children, showing strong correlations between these two. This suggests that an associative learning mechanism with selective attention can provide a cognitively plausible model of cross-situational statistical learning. The work represents the first steps in using eye movement data to infer underlying real-time processes in statistical word learning.
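Two of the per-trial gaze measures mentioned in this abstract can be sketched briefly. Shift rate follows the abstract's own definition (the number of gaze shifts from one object to another). The abstract does not specify the information-theoretic measure, so Shannon entropy of the looking-time distribution is used here purely as an assumed stand-in. The function names and the sample trial are hypothetical.

```python
import math
from collections import Counter

def shift_rate(gaze):
    """Count gaze shifts: consecutive samples that land on different objects."""
    return sum(1 for a, b in zip(gaze, gaze[1:]) if a != b)

def looking_entropy(gaze):
    """Shannon entropy (bits) of the proportion of samples on each object.

    Assumed measure of uncertainty: 0 when gaze stays on one object,
    higher when looking time is spread evenly across objects.
    """
    if not gaze:
        return 0.0
    total = len(gaze)
    counts = Counter(gaze)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical trial: one gaze label per eye-tracker sample.
trial = ["left", "left", "right", "left", "right", "right"]
shifts = shift_rate(trial)
uncertainty = looking_entropy(trial)
```

For this trial the gaze shifts three times and looking time is split evenly between the two objects, so the entropy is one bit.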
The Effects of Deictic Pointing in Word Learning
Previous research suggested that eye gaze as a social cue plays a crucial role in early word learning. In light of this, we investigated another kind of embodied social cue, pointing, and asked how it relates to word learning in young children, as it is ubiquitous in day-to-day parent-child interactions. Parents were asked to narrate a story book displayed on a computer screen. Each page of the story contained pictures of multiple objects, and the novel spoken names of those objects were introduced during the narration. Word learning was measured at the end of the story. The three learning conditions were (1) pointing to the correct object while labeling it, (2) no pointing, and (3) general pointing to the center of the screen but not to a specific object. The results showed that embodied pointing actions significantly increase word learning. Moreover, a touch screen panel placed over the computer screen was used to record the time and location of each pointing action. We developed and implemented various approaches to measure the spatial and temporal correlations of parental speech and pointing actions. The results of detailed analyses suggest that exact synchrony and degree of overlap of speech and pointing streams of action are not directly relevant to learning efficiency. Overall, this work suggests both that social cues, such as pointing, are embedded in a system of correlations relating the speech stream to the physical world of objects and events and that the human word-learning system is robust. Index Terms: word learning, social cues, language learning, and intermodal synchrony

