Grounding the lexical semantics of verbs in visual perception using force dynamics and event logic
- Journal of Artificial Intelligence Research
, 2001
- Cited by 75 (2 self)
This paper presents an implemented system for recognizing the occurrence of events described by simple spatial-motion verbs in short image sequences. The semantics of these verbs is specified with event-logic expressions that describe changes in the state of force-dynamic relations between the participants of the event. An efficient finite representation is introduced for the infinite sets of intervals that occur when describing liquid and semi-liquid events. Additionally, an efficient procedure using this representation is presented for inferring occurrences of compound events, described with event-logic expressions, from occurrences of primitive events. Using force dynamics and event logic to specify the lexical semantics of events allows the system to be more robust than prior systems based on motion profile.
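To make the idea of inferring compound events from primitive ones concrete, here is a minimal sketch. It is not the paper's representation or procedure: it assumes that a liquid event (one that holds on every subinterval of an interval on which it holds) is stored as a list of maximal closed frame intervals, and it composes two events with a simple "meets" relation. The names `holds`, `meets`, `on_table`, and `in_hand` are hypothetical.

```python
from typing import List, Tuple

# Assumption: an interval is a closed range of frame indices (start, end).
Interval = Tuple[int, int]


def holds(maximal: List[Interval], query: Interval) -> bool:
    """A liquid event holds on `query` if some maximal interval contains it.

    Storing only maximal intervals gives a finite stand-in for the
    infinite (or very large) set of subintervals on which the event holds.
    """
    qs, qe = query
    return any(s <= qs and qe <= e for s, e in maximal)


def meets(first: List[Interval], second: List[Interval]) -> List[Interval]:
    """Occurrences of `first` immediately followed by `second`.

    Assumption: on discrete frames, "immediately followed" means the
    first interval ends on the frame just before the second one starts.
    """
    return sorted(
        (s1, e2)
        for s1, e1 in first
        for s2, e2 in second
        if e1 + 1 == s2
    )


# Hypothetical primitive occurrences for one short clip.
on_table = [(0, 9)]    # object supported by the table
in_hand = [(10, 20)]   # object supported by the hand

# A crude "pick up" occurrence: table support, then hand support.
pick_up = meets(on_table, in_hand)
```

With these toy inputs, `holds(on_table, (3, 7))` is true because (3, 7) lies inside (0, 9), and `pick_up` spans the union of the two primitive intervals.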
Movement, Activity, and Action: The Role of Knowledge in the Perception of Motion
- Royal Society Workshop on Knowledge-based Vision in Man and Machine
, 1997
- Cited by 39 (3 self)
We present several approaches to the machine perception of motion and discuss the role and levels of knowledge in each. In particular we describe different techniques of motion understanding as focusing on one of movement, activity, or action. Movements are the most atomic primitives, requiring no contextual or sequence knowledge to be recognized; movement is often addressed using either view-invariant or view-specific geometric techniques. Activity refers to sequences of movements or states, where the only real knowledge required is the statistics of the sequence; much of the recent work in gesture understanding falls within this category of motion perception. Finally, actions are larger scale events which typically include interaction with the environment and causal relationships; action understanding straddles the gray division between perception and cognition, computer vision and artificial intelligence. We illustrate these levels with examples drawn mostly from our work in unders...
Understanding Manipulation in Video
- In AFGR96
, 1996
- Cited by 37 (2 self)
Manipulations are a significant subset of human gestures that are distinguished by the fact that their logic and meaning are particularly clear, being heavily constrained by physical causality. We present techniques and causal semantics for interpreting video of manipulation tasks such as disassembly. Psychologically-based causal constraints are used to detect meaningful changes in the integrity and motions of foreground-segmented blobs; a small causal model of manipulation is used to disambiguate and parse these into a coherent account of the video's action. The causal constraints are drawn from studies of infant perceptual development; as with infants, they precede and may possibly even bootstrap the ability to reliably segment still objects. Our implementation produces a script of the causal evolution of the scene---output that supports cartoon summary, automated editing, and higher-level reasoning. 1 Understanding manipulation Much visual experience is devoted to watching humans manipu...
A Maximum-Likelihood Approach to Visual Event Classification
- In Proceedings of the Fourth European Conference on Computer Vision
- Cited by 33 (7 self)
This paper presents a novel framework, based on maximum likelihood, for training models to recognise simple spatial-motion events, such as those described by the verbs pick up, put down, push, pull, drop, and throw, and classifying novel observations into previously trained classes. The model that we employ does not presuppose prior recognition or tracking of 3D object pose, shape, or identity. We describe our general framework for using maximum-likelihood techniques for visual event classification, the details of the generative model that we use to characterise observations as instances of event types, and the implemented computational techniques used to support training and classification for this generative model. We conclude by illustrating the operation of our implementation on a small example.
Observing Human-Object Interactions: Using Spatial and Functional Compatibility for Recognition
- Cited by 23 (4 self)
Interpretation of images and videos containing humans interacting with different objects is a daunting task. It involves understanding scene/event, analyzing human movements, recognizing manipulable objects, and observing the effect of the human movement on those objects. While each of these perceptual tasks can be conducted independently, recognition rate improves when interactions between them are considered. Motivated by psychological studies of human perception, we present a Bayesian approach which integrates various perceptual tasks involved in understanding human-object interactions. Previous approaches to object and action recognition rely on static shape/appearance feature matching and motion analysis, respectively. Our approach goes beyond these traditional approaches and applies spatial and functional constraints on each of the perceptual elements for coherent semantic interpretation. Such constraints allow us to recognize objects and actions when the appearances are not discriminative enough. We also demonstrate the use of such constraints in recognition of actions from static images without using any motion information. Index Terms: Action recognition, object recognition, functional recognition.
Summarization of Video-taped Presentations: Automatic Analysis of Motion and Gesture
- IEEE Trans. on Circuits and Systems for Video Technology
, 1998
- Cited by 15 (0 self)
This paper presents an automatic system for analyzing and annotating video sequences of technical talks. Our method uses a robust motion estimation technique to detect key frames and segment the video sequence into subsequences containing a single overhead slide. The subsequences are stabilized to remove motion that occurs when the speaker adjusts their slides. Any changes remaining between frames in the stabilized sequences may be due to speaker gestures such as pointing or writing and we use active contours to automatically track these potential gestures. Given the constrained domain we define a simple set of actions that can be recognized based on the active contour shape and motion. The recognized actions provide an annotation of the sequence that can be used to access a condensed version of the talk from a web page.
Physics-Based Visual Understanding
- Computer Vision and Image Understanding
, 1996
- Cited by 12 (0 self)
An understanding of a scene's causal physics---how scene elements interact and respond to forces---is a precondition to reasoning about how the scene came to be, how it may evolve in time, and how it will respond to manipulation. We propose a computationally inexpensive method for recovering causal structure from images, in which a scene model is built incrementally through interleaved sensing and analysis. Reasoning uses generic qualitative knowledge about rigid-body interactions, reusable between domains and similar to concepts thought to be acquired or activated during child development. Causal constraint propagation reveals anomalous degrees-of-freedom in the scene model; prediction yields sensory plans to resolve them. Sensing operations are highly directed and local in scope, e.g., visual routines and proprioception. Inference-depth and the number of pixels "touched" are bounded by the complexity of the scene. We present algorithms and semantics that have been successfully...
The "Inverse Hollywood Problem": From video to scripts and storyboards via causal analysis
- In Proceedings, AAAI97
, 1997
- Cited by 11 (2 self)
We address the problem of visually detecting causal events and fitting them together into a coherent story of the action witnessed by the camera. We show that this can be done by reasoning about the motions and collisions of surfaces, using high-level causal constraints derived from psychological studies of infant visual behavior. These constraints are naive forms of basic physical laws governing substantiality, contiguity, momentum, and acceleration. We describe two implementations. One system parses instructional videos, extracting plans of action and key frames suitable for storyboarding. Since learning will play a role in making such systems robust, we introduce a new framework for coupling hidden Markov models and demonstrate its use in a second system that segments stereo video into actions in near real-time. Rather than attempt accurate low-level vision, both systems use high-level causal analysis to integrate fast but sloppy pixel-based representations over time. The output is su...
Issues in automated visual surveillance
- Proc. VIIth Digital Image
, 2003
- Cited by 10 (0 self)
The usefulness of networks of surveillance cameras is primarily limited by the demand placed on human supervisors to monitor many real-time video feeds simultaneously. The goal of automated visual surveillance is to reduce the burden on operators by including software in a surveillance system that can analyse video content automatically. This paper reviews progress in the field and considers some of the major remaining problems in automated video surveillance.
Visual Event Perception
- In Proceedings of the NEC Research Symposium
, 1999
- Cited by 6 (1 self)
This paper presents a novel framework for training models to recognise simple spatial-motion events, such as those described by the verbs pick up, put down, push, pull, drop, tip, and tap, and classifying novel observations into previously trained classes. Simple colour- and motion-based segmentation and tracking techniques are used to produce a time series of feature vectors constructed from the 2D object positions, orientations, shapes, and sizes. Hidden Markov models are trained on this time series data and used to classify novel occurrences into previously trained classes. The particular choice of features used allows the system to construct meaningful semantic representations of the event classes that it has learned. KEYWORDS: Event classification, Motion analysis, Segmentation, Tracking, Learning, Hidden Markov models, Lexical semantics 1 Introduction People can describe what they see. If I were to pick up a block and ask you what you saw, you could say Jeff picked up the...
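The classification step described in this abstract (one hidden Markov model per event class, with a new observation assigned to the best-scoring class) can be illustrated with a small sketch. This is not the authors' implementation: it assumes the feature vectors have already been quantized to discrete symbols (0 for "object low", 1 for "object high"), and the two toy models and all their probabilities are invented for illustration.

```python
import math


def log_likelihood(obs, start, trans, emit):
    """Scaled forward algorithm: log P(obs | model) for a discrete HMM.

    start[s]    : probability of starting in state s
    trans[r][s] : probability of moving from state r to state s
    emit[s][o]  : probability that state s emits symbol o
    """
    n = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n)]
    total = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [
                sum(alpha[r] * trans[r][s] for r in range(n)) * emit[s][obs[t]]
                for s in range(n)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # the model cannot produce this sequence
        total += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return total


def classify(obs, models):
    """Return the name of the model that gives `obs` the highest likelihood."""
    return max(models, key=lambda name: log_likelihood(obs, *models[name]))


# Hypothetical two-state, left-to-right models (all numbers are made up).
LEFT_TO_RIGHT = [[0.7, 0.3], [0.0, 1.0]]
MODELS = {
    "pick up": ([1.0, 0.0], LEFT_TO_RIGHT, [[0.9, 0.1], [0.1, 0.9]]),
    "put down": ([1.0, 0.0], LEFT_TO_RIGHT, [[0.1, 0.9], [0.9, 0.1]]),
}
```

Under these assumed models, a sequence that starts low and ends high, such as `[0, 0, 1, 1]`, scores higher under "pick up", and the reversed sequence scores higher under "put down". A real system would estimate the model parameters from training sequences rather than write them by hand.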

