Results 1 - 10 of 94
Actions as space-time shapes
- In ICCV, 2005
Cited by 192 (3 self)
Human action in video sequences can be seen as silhouettes of a moving torso and protruding limbs undergoing articulated motion. We regard human actions as three-dimensional shapes induced by the silhouettes in the space-time volume. We adopt a recent approach [14] for analyzing 2D shapes and generalize it to deal with volumetric space-time action shapes. Our method utilizes properties of the solution to the Poisson equation to extract space-time features such as local space-time saliency, action dynamics, shape structure and orientation. We show that these features are useful for action recognition, detection and clustering. The method is fast, does not require video alignment and is applicable in (but not limited to) many scenarios where the background is known. Moreover, we demonstrate the robustness of our method to partial occlusions, non-rigid deformations, significant changes in scale and viewpoint, high irregularities in the performance of an action, and low quality video.
Index Terms: action representation, action recognition, space-time analysis, shape analysis, Poisson equation
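As a rough illustration of the Poisson-equation idea in the abstract above, the sketch below solves the equation on a single 2D silhouette rather than on a full space-time volume. The specific form used here (Laplacian of U equal to -1 inside the shape, U = 0 outside) and the plain Jacobi iteration are assumptions made for the example, not details stated in the abstract; `solve_poisson` and `square_mask` are hypothetical names.

```python
def solve_poisson(mask, iterations=500):
    """Jacobi iteration for laplacian(U) = -1 inside the mask, U = 0 elsewhere.

    `mask` is a list of rows of 0/1 values. The outermost rows and columns
    are assumed to be background (0), so they are never updated.
    """
    rows, cols = len(mask), len(mask[0])
    u = [[0.0] * cols for _ in range(rows)]
    for _ in range(iterations):
        nxt = [[0.0] * cols for _ in range(rows)]
        for r in range(1, rows - 1):
            for c in range(1, cols - 1):
                if mask[r][c]:
                    # Discrete 5-point Laplacian with unit grid spacing:
                    # (sum of 4 neighbours - 4 * u) = -1
                    neighbours = (u[r - 1][c] + u[r + 1][c]
                                  + u[r][c - 1] + u[r][c + 1])
                    nxt[r][c] = (neighbours + 1.0) / 4.0
        u = nxt
    return u


# A 5x5 square silhouette padded with one background pixel on each side.
square_mask = [[1 if 1 <= r <= 5 and 1 <= c <= 5 else 0 for c in range(7)]
               for r in range(7)]
u = solve_poisson(square_mask)
```

In this sketch the solution is largest at the centre of the square and falls off toward the boundary, which is the kind of shape-dependent quantity that features could be derived from.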
Unsupervised learning of human action categories using spatial-temporal words
- In Proc. BMVC, 2006
Cited by 163 (4 self)
Imagine a video taken on a sunny beach: can a computer automatically tell what is happening in the scene? Can it identify different human activities in the video, such as water surfing, people walking and lying on the beach? Automatically classifying or localizing different actions in video sequences is very useful for a variety of tasks, such as video surveillance, object-level video summarization, video indexing, digital library organization, etc. However, it remains a challenging task for computers to achieve robust action recognition due to cluttered background, camera motion, occlusion, and geometric and photometric variances of objects. For example, in a live video of a skating competition, the skater moves rapidly across the rink, and the camera also moves to follow the skater. With a moving camera, a non-stationary background, and a moving target, few vision algorithms could identify, categorize and
Retrieving actions in movies
Cited by 51 (4 self)
We address recognition and localization of human actions in realistic scenarios. In contrast to the previous work studying human actions in controlled settings, here we train and test algorithms on real movies with substantial variation of actions in terms of subject appearance, motion, surrounding scenes, viewing angles and spatio-temporal extents. We introduce a new annotated human action dataset and use it to evaluate several existing methods. We in particular focus on boosted space-time window classifiers and introduce “keyframe priming” that combines discriminative models of human motion and shape within an action. Keyframe priming is shown to significantly improve the performance of action detection. We present detection results for the action class “drinking” evaluated on two episodes of the movie “Coffee and Cigarettes”.
Learning human action via information maximization
- In CVPR, 2008
Cited by 43 (7 self)
In this paper, we present a novel approach for automatically learning a compact and yet discriminative appearance-based human action model. A video sequence is represented by a bag of spatiotemporal features called video-words by quantizing the extracted 3D interest points (cuboids) from the videos. Our proposed approach is able to automatically discover the optimal number of video-word clusters by utilizing Maximization of Mutual Information (MMI). Unlike the k-means algorithm, which is typically used to cluster spatiotemporal cuboids into video-words based on their appearance similarity, MMI clustering further groups video-words that are highly correlated with some group of actions. To capture the structural information of the learnt optimal video-word clusters, we explore the correlation of the compact video-word clusters. We use the modified correlogram, which is not only translation and rotation invariant, but also somewhat scale invariant. We extensively test our proposed approach on two publicly available challenging datasets: the KTH dataset and IXMAS multiview dataset. To the best of our knowledge, we are the first to try the bag of video-words related approach on the multiview dataset. We have obtained very impressive results on both datasets.
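The bag-of-video-words step mentioned in the abstract above can be sketched as follows: each descriptor is assigned to its nearest codeword and the video becomes a normalized histogram of codeword counts. This is only a minimal sketch. The codebook is assumed to be given (for example from k-means), the MMI grouping and the correlogram are not shown, and `nearest_word`, `bag_of_video_words`, and the toy data are hypothetical.

```python
def nearest_word(descriptor, codebook):
    """Index of the codeword closest to `descriptor` (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((d - c) ** 2 for d, c in zip(descriptor, codebook[k])),
    )


def bag_of_video_words(descriptors, codebook):
    """Normalized histogram of codeword assignments for one video."""
    hist = [0] * len(codebook)
    for descriptor in descriptors:
        hist[nearest_word(descriptor, codebook)] += 1
    total = sum(hist)
    return [h / total for h in hist] if total else hist


# Toy 2D descriptors and a two-word codebook (illustrative values only).
codebook = [[0.0, 0.0], [1.0, 1.0]]
descriptors = [[0.1, 0.0], [0.9, 1.0], [1.2, 0.8], [0.0, 0.2]]
histogram = bag_of_video_words(descriptors, codebook)
```

Here two descriptors fall nearest to each codeword, so the histogram splits evenly between the two video-words.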
Action recognition by learning mid-level motion features
- In CVPR, 2008
Cited by 32 (6 self)
This paper presents a method for human action recognition based on patterns of motion. Previous approaches to action recognition use either local features describing small patches or large-scale features describing the entire human figure. We develop a method constructing mid-level motion features which are built from low-level optical flow information. These features are focused on local regions of the image sequence and are created using a variant of AdaBoost. These features are tuned to discriminate between different classes of action, and are efficient to compute at run-time. A battery of classifiers based on these mid-level features is created and used to classify input sequences. State-of-the-art results are presented on a variety of standard datasets.
Machine recognition of human activities: A survey
- 2008
Cited by 31 (0 self)
The past decade has witnessed a rapid proliferation of video cameras in all walks of life and has resulted in a tremendous explosion of video content. Several applications such as content-based video annotation and retrieval, highlight extraction and video summarization require recognition of the activities occurring in the video. The analysis of human activities in videos is an area with increasingly important consequences from security and surveillance to entertainment and personal archiving. Several challenges at various levels of processing (robustness against errors in low-level processing, view- and rate-invariant representations at mid-level processing, and semantic representation of human activities at higher-level processing) make this problem hard to solve. In this review paper, we present a comprehensive survey of efforts in the past couple of decades to address the problems of representation, recognition, and learning of human activities from video and related applications. We discuss the problem at two major levels of complexity: 1) “actions” and 2) “activities.” “Actions” are characterized by simple motion patterns typically executed by a single human. “Activities” are more complex and involve coordinated actions among a small number of humans. We will discuss several approaches and classify them according to their ability to handle varying degrees of complexity as interpreted above. We begin with a discussion of approaches to model the simplest of action classes known as atomic or primitive actions that do not require sophisticated dynamical modeling. Then, methods to model actions with more complex dynamics are discussed. The discussion then leads naturally to methods for higher level representation of complex activities.
Learning motion categories using both semantics and structural information
- In CVPR, 2007
Cited by 27 (1 self)
Current approaches to motion category recognition typically focus on either full spatiotemporal volume analysis (holistic approach) or analysis of the content of spatiotemporal interest points (part-based approach). Holistic approaches tend to be more sensitive to noise, e.g. geometric variations, while part-based approaches usually ignore structural dependencies between parts. This paper presents a novel generative model, which extends probabilistic latent semantic analysis (pLSA), to capture both semantic (content of parts) and structural (connection between parts) information for motion category recognition. The structural information learnt can also be used to infer the location of motion for the purpose of motion detection. We test our algorithm on challenging datasets involving human actions, facial expressions and hand gestures and show its performance is better than existing unsupervised methods in both tasks of motion localisation and recognition.
Discriminative subsequence mining for action classification
- In: 11th IEEE International Conference on Computer Vision, Los Alamitos, 2007
Cited by 24 (1 self)
Recent approaches to action classification in videos have used sparse spatio-temporal words encoding local appearance around interesting movements. Most of these approaches use a histogram representation, discarding the temporal order among features. But this ordering information can contain important information about the action itself, e.g. consider the sport disciplines of hurdle race and long jump, where the global temporal order of motions (running, jumping) is important to discriminate between the two. In this work we propose to use a sequential representation which retains this temporal order. Further, we introduce Discriminative Subsequence Mining to find optimal discriminative subsequence patterns. In combination with the LPBoost classifier, this amounts to simultaneously learning a classification function and performing feature selection in the space of all possible feature sequences. The resulting classifier linearly combines a small number of interpretable decision functions, each checking for the presence of a single discriminative pattern. The classifier is benchmarked on the KTH action classification data set and outperforms the best known results in the literature.
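The contrast drawn in the abstract above, between a histogram that discards temporal order and a sequential representation that keeps it, can be illustrated with a toy example. This sketch only checks whether an ordered pattern occurs in a sequence, with gaps allowed (an assumption made for the example). It does not implement the paper's subsequence mining or the LPBoost classifier, and the word labels and function names are hypothetical.

```python
from collections import Counter


def histogram(words):
    """Bag-of-words representation: counts only, temporal order is lost."""
    return Counter(words)


def contains_subsequence(words, pattern):
    """True if the items of `pattern` occur in `words` in order (gaps allowed)."""
    remaining = iter(words)
    # `p in remaining` consumes the iterator up to the first match,
    # so each later pattern item must be found after the earlier one.
    return all(p in remaining for p in pattern)


# Two toy sequences with identical word counts but different ordering.
hurdles = ["run", "jump", "run", "jump", "run"]
long_jump = ["run", "run", "run", "jump", "jump"]
```

Both sequences produce the same histogram, but only the hurdle sequence contains the ordered pattern `["jump", "run"]`, so an order-aware feature can separate them where a histogram cannot.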
Extracting spatiotemporal interest points using global information
- IEEE Proc. International Conference on Computer Vision (ICCV), 2007
Cited by 23 (0 self)
Local spatiotemporal features or interest points provide compact but descriptive representations for efficient video analysis and motion recognition. Current local feature extraction approaches involve either local filtering or entropy computation which ignore global information (e.g. large blobs of moving pixels) in video inputs. This paper presents a novel extraction method which utilises global information from each video input so that moving parts such as a moving hand can be identified and are used to select relevant interest points for a condensed representation. The proposed method involves obtaining a small set of subspace images, which can synthesise frames in the video input from their corresponding coefficient vectors, and then detecting interest points from the subspaces and the coefficient vectors. Experimental results indicate that the proposed method can yield a sparser set of interest points for motion recognition than existing methods.
A scalable approach to activity recognition based on object use
- In Proceedings of the International Conference on Computer Vision (ICCV), Rio de, 2007
Cited by 23 (3 self)
We propose an approach to activity recognition based on detecting and analyzing the sequence of objects that are being manipulated by the user. In domains such as cooking, where many activities involve similar actions, object-use information can be a valuable cue. In order for this approach to scale to many activities and objects, however, it is necessary to minimize the amount of human-labeled data that is required for modeling. We describe a method for automatically acquiring object models from video without any explicit human supervision. Our approach leverages sparse and noisy readings from RFID-tagged objects, along with common-sense knowledge about which objects are likely to be used during a given activity, to bootstrap the learning process. We present a dynamic Bayesian network model which combines RFID and video data to jointly infer the most likely activity and object labels. We demonstrate that our approach can achieve activity recognition rates of more than 80% on a real-world dataset consisting of 16 household activities involving 33 objects with significant background clutter. We show that the combination of visual object recognition with RFID data is significantly more effective than the RFID sensor alone. Our work demonstrates that it is possible to automatically learn object models from video of household activities and employ these models for activity recognition, without requiring any explicit human labeling.

