Results 1 - 10
of
31
A spatio-temporal descriptor based on 3d-gradients
- In BMVC’08
"... In this work, we present a novel local descriptor for video sequences. The proposed descriptor is based on histograms of oriented 3D spatio-temporal gradients. Our contribution is four-fold. (i) To compute 3D gradients for arbitrary scales, we develop a memory-efficient algorithm based on integral v ..."
Abstract
-
Cited by 34 (2 self)
- Add to MetaCart
In this work, we present a novel local descriptor for video sequences. The proposed descriptor is based on histograms of oriented 3D spatio-temporal gradients. Our contribution is four-fold. (i) To compute 3D gradients for arbitrary scales, we develop a memory-efficient algorithm based on integral videos. (ii) We propose a generic 3D orientation quantization which is based on regular polyhedrons. (iii) We perform an in-depth evaluation of all descriptor parameters and optimize them for action recognition. (iv) We apply our descriptor to various action datasets (KTH, Weizmann, Hollywood) and show that we outperform the state-of-the-art. 1
Recognizing Actions by Shape-Motion Prototype Trees
"... A prototype-based approach is introduced for action recognition. The approach represents an action as a sequence of prototypes for efficient and flexible action matching in long video sequences. During training, first, an action prototype tree is learned in a joint shape and motion space via hierarc ..."
Abstract
-
Cited by 22 (4 self)
- Add to MetaCart
A prototype-based approach is introduced for action recognition. The approach represents an action as a sequence of prototypes for efficient and flexible action matching in long video sequences. During training, first, an action prototype tree is learned in a joint shape and motion space via hierarchical k-means clustering; then a lookup table of prototype-to-prototype distances is generated. During testing, based on a joint likelihood model of the actor location and action prototype, the actor is tracked while a frame-to-prototype correspondence is established by maximizing the joint likelihood, which is efficiently performed by searching the learned prototype tree; then actions are recognized using dynamic prototype sequence matching. Distance matrices used for sequence matching are rapidly obtained by look-up table indexing, which is an order of magnitude faster than brute-force computation of frame-to-frame distances. Our approach enables robust action matching in very challenging situations (such as moving cameras, dynamic backgrounds) and allows automatic alignment of action sequences. Experimental results demonstrate that our approach achieves recognition rates of 91.07 % on a large gesture dataset (with dynamic backgrounds), 100 % on the Weizmann action dataset and 95.77 % on the KTH action dataset. 1.
Pose search: retrieving people using their pose
- In CVPR
, 2009
"... We describe a method for retrieving shots containing a particular 2D human pose from unconstrained movie and TV videos. The method involves first localizing the spatial layout of the head, torso and limbs in individual frames using pictorial structures, and associating these through a shot by tracki ..."
Abstract
-
Cited by 15 (2 self)
- Add to MetaCart
We describe a method for retrieving shots containing a particular 2D human pose from unconstrained movie and TV videos. The method involves first localizing the spatial layout of the head, torso and limbs in individual frames using pictorial structures, and associating these through a shot by tracking. A feature vector describing the pose is then constructed from the pictorial structure. Shots can be retrieved either by querying on a single frame with the desired pose, or through a pose classifier trained from a set of pose examples. Our main contribution is an effective system for retrieving people based on their pose, and in particular we propose and investigate several pose descriptors which are person, clothing, background and lighting independent. As a second contribution, we improve the performance over existing methods for localizing upper body layout on unconstrained video. We compare the spatial layout pose retrieval to a baseline method where poses are retrieved using a HOG descriptor. Performance is assessed on five episodes of the TV series ’Buffy the Vampire Slayer’, and pose retrieval is demonstrated also on three Hollywood movies. 1.
Local Trinary Patterns for Human Action Recognition
"... We present a novel action recognition method which is based on combining the effective description properties of Local Binary Patterns with the appearance invariance and adaptability of patch matching based methods. The resulting method is extremely efficient, and thus is suitable for real-time uses ..."
Abstract
-
Cited by 15 (1 self)
- Add to MetaCart
We present a novel action recognition method which is based on combining the effective description properties of Local Binary Patterns with the appearance invariance and adaptability of patch matching based methods. The resulting method is extremely efficient, and thus is suitable for real-time uses of simultaneous recovery of human action of several lengths and starting points. Tested on all publicity available datasets in the literature known to us, our system repeatedly achieves state of the art performance. Lastly, we present a new benchmark that focuses on uncut motion recognition in broadcast sports video. 1.
Learning a hierarchy of discriminative space-time neighborhood features for human action recognition
- In IEEE Conference on Computer Vision and Pattern Recognition (CVPR
, 2010
"... Recent work shows how to use local spatio-temporal features to learn models of realistic human actions from video. However, existing methods typically rely on a predefined spatial binning of the local descriptors to impose spatial information beyond a pure “bag-of-words ” model, and thus may fail to ..."
Abstract
-
Cited by 15 (0 self)
- Add to MetaCart
Recent work shows how to use local spatio-temporal features to learn models of realistic human actions from video. However, existing methods typically rely on a predefined spatial binning of the local descriptors to impose spatial information beyond a pure “bag-of-words ” model, and thus may fail to capture the most informative space-time relationships. We propose to learn the shapes of space-time feature neighborhoods that are most discriminative for a given action category. Given a set of training videos, our method first extracts local motion and appearance features, quantizes them to a visual vocabulary, and then forms candidate neighborhoods consisting of the words associated with nearby points and their orientation with respect to the central interest point. Rather than dictate a particular scaling of the spatial and temporal dimensions to determine which points are near, we show how to learn the class-specific distance functions that form the most informative configurations. Descriptors for these variable-sized neighborhoods are then recursively mapped to higher-level vocabularies, producing a hierarchy of space-time configurations at successively broader scales. Our approach yields state-of-theart performance on the UCF Sports and KTH datasets. 1.
Trajectons: Action Recognition Through the Motion Analysis of Tracked Features
, 2009
"... The defining feature of video compared to still images is motion, and as such the selection of good motion features for action recognition is crucial, especially for bag of words techniques that rely heavily on their features. Existing motion techniques either assume that a difficult problem like ba ..."
Abstract
-
Cited by 7 (3 self)
- Add to MetaCart
The defining feature of video compared to still images is motion, and as such the selection of good motion features for action recognition is crucial, especially for bag of words techniques that rely heavily on their features. Existing motion techniques either assume that a difficult problem like background/foreground segmentation has already been solved (contour/silhouette based techniques) or are computationally expensive and prone to noise (optical flow). We present a technique for motion based on quantized trajectory snippets of tracked features. These quantized snippets, or trajectons, rely only on simple feature tracking and are computationally efficient. We demonstrate that within a bag of words framework trajectons can match state of the art results, slightly outperforming histogram of optical flow features on the Hollywood Actions dataset. Additionally, we present qualitative results in a video search task on a custom dataset of challenging YouTube videos.
Exploiting Multi-level Parallelism for Lowlatency Activity Recognition
- IN STREAMING VIDEO; PROC. ACM MULTIMEDIA SYSTEMS (MMSYS) CONFERENCE, 2010
"... Video understanding is a computationally challenging task that is critical not only for traditionally throughput-oriented applications such as search but also latency-sensitive interactive applications such as surveillance, gaming, videoconferencing, and vision-based user interfaces. Enabling these ..."
Abstract
-
Cited by 5 (2 self)
- Add to MetaCart
Video understanding is a computationally challenging task that is critical not only for traditionally throughput-oriented applications such as search but also latency-sensitive interactive applications such as surveillance, gaming, videoconferencing, and vision-based user interfaces. Enabling these types of video processing applications will require not only new algorithms and techniques, but new runtime systems that optimize latency as well as throughput. In this paper, we present a runtime system called Sprout that achieves low latency by exploiting the parallelism inherent in video understanding applications. We demonstrate the utility of our system on an activity recognition application that employs a robust new descriptor called MoSIFT, which explicitly augments appearance features with motion information. MoSIFT outperforms previous recognition techniques, but like other state-of-the-art techniques, it is computationally expensive — a sequential implementation runs 100 times slower than real time. We describe the implementation of the activity recognition application on Sprout, and show that it can accurately recognize activities at full frame rate (25 fps) and low latency on a challenging airport surveillance video corpus.
Recognising Action as Clouds of Space-Time Interest Points
"... Much of recent action recognition research is based on space-time interest points extracted from video using a Bag of Words (BOW) representation. It mainly relies on the discriminative power of individual local space-time descriptors, whilst ignoring potentially valuable information about the global ..."
Abstract
-
Cited by 5 (0 self)
- Add to MetaCart
Much of recent action recognition research is based on space-time interest points extracted from video using a Bag of Words (BOW) representation. It mainly relies on the discriminative power of individual local space-time descriptors, whilst ignoring potentially valuable information about the global spatio-temporal distribution of interest points. In this paper, we propose a novel action recognition approach which differs significantly from previous interest points based approaches in that only the global spatiotemporal distribution of the interest points are exploited. This is achieved through extracting holistic features from clouds of interest points accumulated over multiple temporal scales followed by automatic feature selection. Our approach avoids the non-trivial problems of selecting the optimal space-time descriptor, clustering algorithm for constructing a codebook, and selecting codebook size faced by previous interest points based methods. Our model is able to capture smooth motions, robust to view changes and occlusions at a low computation cost. Experiments using the KTH and WEIZMANN datasets demonstrate that our approach outperforms most existing methods. 1.
Real-time motion-based gesture recognition using the gpu
- in Proc. of the IAPR conf. on Machine Vision Applications
, 2009
"... In this paper we describe a real-time system for gesture recognition. Given an input video, we derive a set of motion features based on smoothed optical flow estimates. A user-centric representation of these features is obtained using face detection, and an efficient classifier is learned to discrim ..."
Abstract
-
Cited by 3 (2 self)
- Add to MetaCart
In this paper we describe a real-time system for gesture recognition. Given an input video, we derive a set of motion features based on smoothed optical flow estimates. A user-centric representation of these features is obtained using face detection, and an efficient classifier is learned to discriminate between gestures. We develop a real-time system using GPU programming for implementing the classifier. We provide experimental results demonstrating the speed and efficacy of our system. 1
Human Action Recognition from a Single Clip per Action
"... Learning-based approaches for human action recognition often rely on large training sets. Most of these approaches do not perform well when only a few training samples are available. In this paper, we consider the problem of human action recognition from a single clip per action. Each clip contains ..."
Abstract
-
Cited by 2 (1 self)
- Add to MetaCart
Learning-based approaches for human action recognition often rely on large training sets. Most of these approaches do not perform well when only a few training samples are available. In this paper, we consider the problem of human action recognition from a single clip per action. Each clip contains at most 25 frames. Using a patch based motion descriptor and matching scheme, we can achieve promising results on three different action datasets with a single clip as the template. Our results are comparable to previously published results using much larger training sets. We also present a method for learning a transferable distance function for these patches. The transferable distance function learning extracts generic knowledge of patch weighting from previous training sets, and can be applied to videos of new actions without further learning. Our experimental results show that the transferable distance function learning not only improves the recognition accuracy of the single clip action recognition, but also significantly enhances the efficiency of the matching scheme. 1.

