Results 1 - 8 of 8
Binarized normed gradients for objectness estimation at 300fps
- in IEEE CVPR, 2014
"... Training a generic objectness measure to produce a small set of candidate object windows, has been shown to speed up the classical sliding window object detection paradigm. We observe that generic objects with well-defined closed boundary can be discriminated by looking at the norm of gradients, wit ..."
Abstract - Cited by 25 (6 self)
Training a generic objectness measure to produce a small set of candidate object windows has been shown to speed up the classical sliding-window object detection paradigm. We observe that generic objects with well-defined closed boundaries can be discriminated by looking at the norm of gradients, with a suitable resizing of their corresponding image windows into a small fixed size. Based on this observation, and for computational reasons, we propose to resize the window to 8 × 8 and use the norm of the gradients as a simple 64D feature to describe it, for explicitly training a generic objectness measure. We further show how the binarized version of this feature, namely binarized normed gradients (BING), can be used for efficient objectness estimation, which requires only a few atomic operations (e.g. ADD, BITWISE SHIFT, etc.). Experiments on the challenging PASCAL VOC 2007 dataset show that our method efficiently (300fps on a single laptop CPU) generates a small set of category-independent, high-quality object windows, yielding a 96.2% object detection rate (DR) with 1,000 proposals. By increasing the number of proposals and the number of color spaces used for computing BING features, our performance can be further improved to 99.5% DR.
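
The core descriptor is simple enough to sketch directly. Below is a minimal NumPy sketch of the normed-gradient (NG) feature the abstract describes (resize a window to 8 × 8, use gradient magnitudes as a 64D feature, score it linearly); the nearest-neighbour resize, the clamp at 255, and the weight vector w_vec are illustrative assumptions, and the binarized BING approximation built on bitwise operations is not reproduced here.

    import numpy as np

    def ng_feature(window):
        """64D normed-gradient (NG) feature for a 2-D image window."""
        h, w = window.shape
        # Nearest-neighbour resize to 8 x 8 (a stand-in for proper resampling).
        ys = np.arange(8) * h // 8
        xs = np.arange(8) * w // 8
        small = window[np.ix_(ys, xs)].astype(np.float64)
        # Gradient magnitude approximated by the l1 norm of finite
        # differences, clamped at 255 (assumed detail).
        gy, gx = np.gradient(small)
        ng = np.minimum(np.abs(gx) + np.abs(gy), 255.0)
        return ng.ravel()

    def objectness_score(window, w_vec):
        """Linear objectness score <w_vec, NG(window)> over the 64D feature."""
        return float(ng_feature(window) @ w_vec)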
Fast action proposals for human action detection and search
- In CVPR, 2015
"... In this paper we target at generating generic action pro-posals in unconstrained videos. Each action proposal cor-responds to a temporal series of spatial bounding boxes, i.e., a spatio-temporal video tube, which has a good poten-tial to locate one human action. Assuming each action is performed by ..."
Abstract - Cited by 2 (0 self)
In this paper we aim to generate generic action proposals in unconstrained videos. Each action proposal corresponds to a temporal series of spatial bounding boxes, i.e., a spatio-temporal video tube, which has good potential to locate one human action. Assuming each action is performed by a human with meaningful motion, both appearance and motion cues are utilized to measure the actionness of the video tubes. After picking those spatio-temporal paths with high actionness scores, our action proposal generation is formulated as a maximum set coverage problem, where greedy search is performed to select a set of action proposals that maximizes the overall actionness score. Compared with existing action proposal approaches, our action proposals do not rely on video segmentation and can be generated in nearly real time. Experimental results on two challenging datasets, MSRII and UCF 101, validate the superior performance of our action proposals as well as competitive results on action detection and search.
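
The selection step maps onto the standard greedy algorithm for maximum set coverage. A minimal sketch, assuming each candidate tube is represented as a set of covered elements (e.g. spatio-temporal cells) with per-element actionness scores; these data structures are placeholders, not the paper's.

    def greedy_proposals(candidates, actionness, k):
        """Greedy maximum set coverage over candidate action tubes.
        candidates: {proposal_id: set of covered elements}
        actionness: {element: actionness score}
        k: number of proposals to select
        """
        covered, chosen = set(), []
        for _ in range(k):
            best, gain = None, 0.0
            for pid, elems in candidates.items():
                if pid in chosen:
                    continue
                # Marginal gain: actionness mass not yet covered.
                g = sum(actionness[e] for e in elems - covered)
                if g > gain:
                    best, gain = pid, g
            if best is None:
                break  # no remaining candidate adds coverage
            chosen.append(best)
            covered |= candidates[best]
        return chosen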
Can Humans Fly? Action Understanding with Multiple Classes of Actors
"... Can humans fly? Emphatically no. Can cars eat? Again, absolutely not. Yet, these absurd inferences result from the current disregard for particular types of actors in action understanding. There is no work we know of on simulta-neously inferring actors and actions in the video, not to mention a data ..."
Abstract - Cited by 1 (0 self)
Can humans fly? Emphatically no. Can cars eat? Again, absolutely not. Yet these absurd inferences result from the current disregard for particular types of actors in action understanding. There is no work we know of on simultaneously inferring actors and actions in video, not to mention a dataset to experiment with. Our paper hence marks the first effort in the computer vision community to jointly consider various types of actors undergoing various actions. To start on the problem, we collect a dataset of 3782 videos from YouTube and label both pixel-level actors and actions in each video. We formulate the general actor-action understanding problem and instantiate it at various granularities: both video-level single- and multiple-label actor-action recognition and pixel-level actor-action semantic segmentation. Our experiments demonstrate that inference jointly over actors and actions outperforms inference independently over them, supporting our argument for the explicit consideration of various actors in comprehensive action understanding.
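
The joint-versus-independent point can be seen with a toy example: given a compatibility prior over (actor, action) pairs, the joint argmax can differ from picking each label separately. All scores below are made-up numbers for illustration only, not results from the paper.

    import numpy as np

    actors = ["human", "car", "bird"]
    actions = ["fly", "eat", "roll"]
    unary_actor = np.array([0.5, 0.3, 0.2])     # invented per-label evidence
    unary_action = np.array([0.45, 0.35, 0.20])
    # Feasible (actor, action) pairs: humans don't fly, cars don't eat.
    compat = np.array([[0, 1, 0],
                       [0, 0, 1],
                       [1, 1, 0]], dtype=float)

    independent = (actors[int(unary_actor.argmax())],
                   actions[int(unary_action.argmax())])
    joint = np.outer(unary_actor, unary_action) * compat
    i, j = np.unravel_index(joint.argmax(), joint.shape)
    print("independent:", independent)        # ('human', 'fly') -- absurd
    print("joint:", (actors[i], actions[j]))  # ('human', 'eat') -- feasible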
Joint action recognition and pose estimation from video
- in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2015
"... Action recognition and pose estimation from video are closely related tasks for understanding human motion, most methods, however, learn separate models and combine them sequentially. In this paper, we propose a framework to in-tegrate training and testing of the two tasks. A spatial-temporal And-Or ..."
Abstract - Cited by 1 (0 self)
Action recognition and pose estimation from video are closely related tasks for understanding human motion; most methods, however, learn separate models and combine them sequentially. In this paper, we propose a framework to integrate training and testing of the two tasks. A spatial-temporal And-Or graph model is introduced to represent action at three scales. Specifically, the action is decomposed into poses, which are further divided into mid-level ST-parts and then into parts. The hierarchical structure of our model captures the geometric and appearance variations of pose at each frame, and lateral connections between ST-parts at adjacent frames capture the action-specific motion information. The model parameters at the three scales are learned discriminatively, and action labels and poses are efficiently inferred by dynamic programming. Experiments demonstrate that our approach achieves state-of-the-art accuracy in action recognition while also improving pose estimation.
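
The dynamic-programming inference over frames is Viterbi-like. A minimal sketch, assuming per-frame pose-candidate scores and a pairwise temporal consistency function; these stand in for the ST-AOG potentials, which the abstract does not spell out.

    def best_pose_path(unary, pairwise):
        """Viterbi-style DP over frames.
        unary: list over frames; unary[t][j] is the appearance score of
        pose candidate j at frame t.
        pairwise(t, i, j): temporal consistency score for moving from
        candidate i at frame t to candidate j at frame t + 1.
        Returns the best candidate index for each frame."""
        T = len(unary)
        score, back = [list(unary[0])], []
        for t in range(1, T):
            prev, cur, bp = score[-1], [], []
            for j, u in enumerate(unary[t]):
                best_i = max(range(len(prev)),
                             key=lambda i: prev[i] + pairwise(t - 1, i, j))
                cur.append(prev[best_i] + pairwise(t - 1, best_i, j) + u)
                bp.append(best_i)
            score.append(cur)
            back.append(bp)
        # Backtrack from the best final state.
        path = [max(range(len(score[-1])), key=lambda s: score[-1][s])]
        for bp in reversed(back):
            path.append(bp[path[-1]])
        return list(reversed(path))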
Human Action Segmentation with Hierarchical Supervoxel Consistency
"... Detailed analysis of human action, such as action classi-fication, detection and localization has received increasing attention from the community; datasets like JHMDB have made it plausible to conduct studies analyzing the impact that such deeper information has on the greater action un-derstanding ..."
Abstract
Detailed analysis of human action, such as action classification, detection and localization, has received increasing attention from the community; datasets like JHMDB have made it plausible to conduct studies analyzing the impact that such deeper information has on the greater action understanding problem. However, detailed automatic segmentation of human action has been comparatively unexplored. In this paper, we take a step in that direction and propose a hierarchical MRF model to bridge low-level video fragments with high-level human motion and appearance; novel higher-order potentials connect different levels of the supervoxel hierarchy to enforce the consistency of the human segmentation by pulling from different segment scales. Our single-layer model significantly outperforms the current state of the art on actionness, and our full model improves upon the single-layer baselines in action segmentation.
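
A schematic of what such an energy can look like, under assumed potentials (the abstract does not give the exact higher-order terms): a unary term per supervoxel, a pairwise smoothness term within a level, and a cross-level term that charges children for disagreeing with their parent supervoxel's label. The Potts-style cross-level term below is a simplification of true higher-order cliques.

    def mrf_energy(labels, unary, edges, parents, pair_w=1.0, hier_w=1.0):
        """Hierarchical MRF energy over human/background labels (0/1).
        labels: {supervoxel_id: 0 or 1}, covering all hierarchy levels
        unary: {supervoxel_id: (cost_background, cost_human)}
        edges: iterable of (i, j) spatio-temporally adjacent supervoxels
        parents: {child_id: parent_id} links across hierarchy levels
        """
        e = sum(unary[i][labels[i]] for i in labels)
        # Pairwise smoothness within a hierarchy level.
        e += pair_w * sum(labels[i] != labels[j] for i, j in edges)
        # Cross-level consistency (assumed Potts form, not the paper's exact
        # higher-order potential).
        e += hier_w * sum(labels[c] != labels[p] for c, p in parents.items())
        return e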
Action Detection by Implicit Intentional Motion Clustering
"... Explicitly using human detection and pose estimation has found limited success in action recognition problems. This may be due to the complexity in the articulated mo-tion human exhibit. Yet, we know that action requires an actor and intention. This paper hence seeks to understand the spatiotemporal ..."
Abstract
Explicitly using human detection and pose estimation has found limited success in action recognition problems. This may be due to the complexity of the articulated motion humans exhibit. Yet we know that action requires an actor and intention. This paper hence seeks to understand the spatiotemporal properties of intentional movement and how to capture such intentional movement without relying on challenging human detection and tracking. We conduct a quantitative analysis of intentional movement, and our findings motivate a new approach for implicit intentional movement extraction that is based on spatiotemporal trajectory clustering, leveraging the properties of intentional movement. The intentional movement clusters are then used as action proposals for detection. Our results on three action detection benchmarks indicate the relevance of focusing on intentional movement for action detection; our method significantly outperforms the state of the art on the challenging MSR-II multi-action video benchmark.
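
A toy stand-in for the trajectory-clustering step: embed each point trajectory by its mean position and mean velocity, then cluster with DBSCAN so that coherent motion groups emerge as candidate intentional movements. The feature choice and DBSCAN parameters are assumptions, not the paper's criteria.

    import numpy as np
    from sklearn.cluster import DBSCAN

    def cluster_trajectories(trajs, eps=10.0, min_samples=5):
        """trajs: list of (T, 2) arrays of (x, y) points over time, T >= 2.
        Returns one cluster label per trajectory; -1 marks noise."""
        feats = []
        for tr in trajs:
            tr = np.asarray(tr, dtype=np.float64)
            vel = np.diff(tr, axis=0).mean(axis=0)  # mean velocity
            feats.append(np.concatenate([tr.mean(axis=0), vel]))
        return DBSCAN(eps=eps, min_samples=min_samples).fit_predict(np.array(feats))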
Action Localization in Videos through Context Walk
"... This paper presents an efficient approach for localizing actions by learning contextual relations, in the form of rel-ative locations between different video regions. We begin by over-segmenting the videos into supervoxels, which have the ability to preserve action boundaries and also reduce the com ..."
Abstract
This paper presents an efficient approach for localizing actions by learning contextual relations, in the form of relative locations between different video regions. We begin by over-segmenting the videos into supervoxels, which preserve action boundaries and also reduce the complexity of the problem. Context relations are learned during training; they capture displacements from all the supervoxels in a video to those belonging to foreground actions. Then, given a testing video, we select a supervoxel randomly and use the context information acquired during training to estimate the probability of each supervoxel belonging to the foreground action. The walk proceeds to a new supervoxel and the process is repeated for a few steps. This “context walk” generates a conditional distribution of an action over all the supervoxels. A Conditional Random Field is then used to find action proposals in the video, whose confidences are obtained using SVMs. We validated the proposed approach on several datasets and show that context in the form of relative displacements between supervoxels can be extremely useful for action localization. This also results in significantly fewer evaluations of the classifier, in sharp contrast to alternative sliding-window approaches.
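
A simplified sketch of one context walk, assuming supervoxels are summarized by (x, y, t) centroids and the learned context is a bag of training displacements toward foreground supervoxels; the voting model and nearest-centroid assignment are stand-ins for the paper's learned context relations.

    import numpy as np

    def context_walk(centroids, displacements, steps=10, seed=None):
        """centroids: (N, 3) supervoxel centres (x, y, t).
        displacements: (M, 3) training offsets from context supervoxels
        to foreground-action supervoxels.
        Returns a distribution over the N supervoxels."""
        rng = np.random.default_rng(seed)
        scores = np.zeros(len(centroids))
        cur = rng.integers(len(centroids))  # random starting supervoxel
        for _ in range(steps):
            # Cast votes: predicted foreground locations seen from `cur`.
            preds = centroids[cur] + displacements
            # Assign each vote to its nearest supervoxel centre.
            d = np.linalg.norm(centroids[None, :, :] - preds[:, None, :], axis=2)
            scores += np.bincount(d.argmin(axis=1), minlength=len(centroids))
            cur = int(scores.argmax())  # walk to the most-voted supervoxel
        return scores / scores.sum()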
Compositional Structure Learning for Action Understanding
- 2014
"... The focus of the action understanding literature has predominately been classification, how-ever, there are many applications demanding richer action understanding such as mobile robotics and video search, with solutions to classification, localization and detection. In this paper, we propose a comp ..."
Abstract
The focus of the action understanding literature has predominantly been classification; however, there are many applications demanding richer action understanding, such as mobile robotics and video search, with solutions to classification, localization and detection. In this paper, we propose a compositional model that leverages a new mid-level representation called compositional trajectories and a locally articulated spatiotemporal deformable parts model (LASTDPM) for full action understanding. Our method is advantageous in capturing the variable structure of dynamic human activity over a long range. First, the compositional trajectories capture long-ranging, frequently co-occurring groups of trajectories in space-time and represent them in discriminative hierarchies, where human motion is largely separated from camera motion; second, LASTDPM learns a structured model with multi-layer deformable parts to capture multiple levels of articulated motion. We implement our methods and demonstrate state-of-the-art performance on all three problems: action detection, localization, and recognition.
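
A minimal sketch of the "frequently co-occurring groups of trajectories" idea: count how often pairs of trajectory-cluster ids are active in the same spatio-temporal window, the raw statistic from which such compositional hierarchies could be built. The windowing scheme is an assumption; the paper's discriminative hierarchy construction is not reproduced.

    from collections import Counter
    from itertools import combinations

    def cooccurrence_counts(windows):
        """windows: list of sets of trajectory-cluster ids, one set per
        spatio-temporal window. Returns a Counter mapping (id_a, id_b)
        pairs to how often the two clusters appear together."""
        counts = Counter()
        for active in windows:
            counts.update(combinations(sorted(active), 2))
        return counts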