Results 1 - 10 of 44
A database for fine grained activity detection of cooking activities. In CVPR, 2012. Cited by 40 (5 self).
While activity recognition is a current focus of research, the challenging problem of fine-grained activity recognition is largely overlooked. We thus propose a novel database of 65 cooking activities, continuously recorded in a realistic setting. Activities are distinguished by fine-grained body motions that have low inter-class variability and high intra-class variability due to diverse subjects and ingredients. We benchmark two approaches on our dataset, one based on articulated pose tracks and the second using holistic video features. While the holistic approach outperforms the pose-based approach, our evaluation suggests that fine-grained activities are more difficult to detect and the body model can help in those cases. By providing high-resolution videos as well as an intermediate pose representation, we hope to foster research in fine-grained activity recognition.
Sum-product networks for modeling activities with stochastic structure. In CVPR, 2012. Cited by 21 (5 self).
This paper addresses recognition of human activities with stochastic structure, characterized by variable space-time arrangements of primitive actions and conducted by a variable number of actors. We demonstrate that modeling aggregate counts of visual words is, surprisingly, expressive enough for such a challenging recognition task. An activity is represented by a sum-product network (SPN). An SPN is a mixture of bags-of-words (BoWs) with exponentially many mixture components, where subcomponents are reused by larger ones. An SPN consists of terminal nodes representing BoWs, and product and sum nodes organized in a number of layers. The products are aimed at encoding particular configurations of primitive actions, and the sums serve to capture their alternative configurations. The connectivity of the SPN and the parameters of the BoW distributions are learned under weak supervision using the EM algorithm. SPN inference amounts to parsing the SPN graph, which yields the most probable explanation (MPE) of the video in terms of activity detection and localization. SPN inference has linear complexity in the number of nodes under fairly general conditions, enabling fast and scalable recognition. A new Volleyball dataset is compiled and annotated for evaluation. Our classification accuracy and localization precision and recall are superior to those of the state of the art on the benchmark and our Volleyball datasets.
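A toy numerical sketch may make the layered structure concrete. Assuming multinomial bag-of-words terminals, product nodes over disjoint word scopes, and a single sum node at the root (all node counts, scopes, weights, and word probabilities below are invented for illustration, not taken from the paper), evaluation and MPE parsing look roughly like this:

```python
import numpy as np

def bow_loglik(counts, theta):
    """Terminal node: log-likelihood of a bag-of-words count vector under a
    multinomial with word probabilities theta (constant term dropped)."""
    return float(np.sum(counts * np.log(theta)))

def logsumexp(x):
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

# Toy setup: a 4-word vocabulary split into two disjoint scopes so that the
# product nodes stay decomposable. All numbers are invented.
counts = np.array([6, 1, 2, 1])
left, right = counts[:2], counts[2:]        # scopes of the two terminal groups

theta_l1 = np.array([0.8, 0.2]); theta_l2 = np.array([0.3, 0.7])  # left-scope BoWs
theta_r1 = np.array([0.6, 0.4]); theta_r2 = np.array([0.2, 0.8])  # right-scope BoWs

# Product nodes encode particular configurations of primitive actions;
# their children cover disjoint scopes, so log-likelihoods add.
prod = np.array([
    bow_loglik(left, theta_l1) + bow_loglik(right, theta_r1),
    bow_loglik(left, theta_l2) + bow_loglik(right, theta_r2),
])

# Sum node (the root here) mixes the alternative configurations.
w = np.array([0.6, 0.4])
root_loglik = logsumexp(np.log(w) + prod)

# MPE "parsing": replace the sum with a max to recover the best configuration.
best = int(np.argmax(np.log(w) + prod))
print(f"root log-likelihood: {root_loglik:.3f}  best configuration: {best}")
```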
Social behavior recognition in continuous videos. In CVPR, 2012. Cited by 12 (4 self).
We present a novel method for analyzing social behavior. Continuous videos are segmented into action ‘bouts’ by building a temporal context model that combines features from spatio-temporal energy and agent trajectories. The method is tested on an unprecedented dataset of videos of interacting pairs of mice, which was collected as part of a state-of-the-art neurophysiological study of behavior. The dataset comprises over 88 hours (8 million frames) of annotated videos. We find that our novel trajectory features, used in a discriminative framework, are more informative than widely used spatio-temporal features; furthermore, temporal context plays an important role for action recognition in continuous videos. Our approach may be seen as a baseline method on this dataset, reaching a mean recognition rate of 61.2% compared to the expert’s agreement rate of about 70%.
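As a rough illustration of the kind of pairwise trajectory features and temporal context the abstract refers to, the sketch below computes a few simple per-frame descriptors for two tracked agents and stacks them over a short window; the specific features, window radius, and synthetic trajectories are assumptions, not the paper's actual descriptor set:

```python
import numpy as np

def pair_trajectory_features(pos_a, pos_b):
    """Per-frame features for two tracked agents given (T, 2) position arrays:
    inter-agent distance, each agent's speed, and closing speed.
    (Illustrative features only.)"""
    dist = np.linalg.norm(pos_a - pos_b, axis=1)
    speed_a = np.linalg.norm(np.gradient(pos_a, axis=0), axis=1)
    speed_b = np.linalg.norm(np.gradient(pos_b, axis=0), axis=1)
    closing = np.gradient(dist)                 # negative when agents approach
    return np.stack([dist, speed_a, speed_b, closing], axis=1)   # (T, 4)

def add_temporal_context(feats, radius=5):
    """Stack features from a window of +/- radius frames around each frame,
    so a discriminative classifier sees temporal context, not a single frame."""
    T, _ = feats.shape
    padded = np.pad(feats, ((radius, radius), (0, 0)), mode="edge")
    return np.stack([padded[t:t + 2 * radius + 1].ravel() for t in range(T)])

# Toy trajectories for two "mice" (random walks, purely synthetic).
rng = np.random.default_rng(0)
pos_a = np.cumsum(rng.normal(size=(200, 2)), axis=0)
pos_b = np.cumsum(rng.normal(size=(200, 2)), axis=0)

X = add_temporal_context(pair_trajectory_features(pos_a, pos_b))
print(X.shape)   # (200, 44): 11 frames x 4 features per frame
```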
Context-aware activity recognition and anomaly detection in video. In IEEE Journal of Selected Topics in Signal Processing, 2013. Cited by 8 (2 self).
In this paper, we propose a mathematical framework to jointly model related activities with both motion and context information for activity recognition and anomaly detection. This is motivated by the observation that activities related in space and time rarely occur independently and can serve as context for each other. The spatial and temporal distribution of different activities provides useful cues for the understanding of these activities. We denote the activities occurring with high frequencies in the database as normal activities. Given training data which contains labeled normal activities, our model aims to automatically capture frequent motion and context patterns for each activity class, as well as each pair of classes, from sets of predefined patterns during the learning process. Then, the learned model is used to generate globally optimum labels for activities in the testing videos. We show how to learn the model parameters via an unconstrained convex optimization problem and how to predict the correct labels for a testing instance consisting of multiple activities. The learned model and generated labels are used to detect anomalies whose motion and context patterns deviate from the learned patterns. We show promising results on the VIRAT Ground Dataset that demonstrate the benefit of joint modeling and recognition of activities in a wide-area scene and the effectiveness of the proposed method in anomaly detection.
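A minimal sketch of the joint-labeling idea, assuming hand-picked unary (motion) and pairwise (context) scores and brute-force search over labelings; the paper's actual learning is a convex optimization and its inference is structured, so everything below, including the anomaly threshold, is illustrative only:

```python
import numpy as np
from itertools import product

# Toy scores (all numbers invented): unary[i, c] is how well instance i's
# motion features match activity class c; pairwise[c1, c2] is how compatible
# classes c1 and c2 are when they co-occur nearby in space and time.
unary = np.array([[2.0, 0.5, 0.1],
                  [0.3, 1.8, 0.2],
                  [0.2, 0.4, 1.5]])
pairwise = np.array([[0.0, 1.0, 0.2],
                     [1.0, 0.0, 0.1],
                     [0.2, 0.1, 0.0]])

n_inst, n_cls = unary.shape

def joint_score(labels):
    """Sum of unary scores plus context scores over all instance pairs."""
    s = sum(unary[i, c] for i, c in enumerate(labels))
    s += sum(pairwise[labels[i], labels[j]]
             for i in range(n_inst) for j in range(i + 1, n_inst))
    return s

# Globally optimal labeling by exhaustive search (fine for a toy problem).
best = max(product(range(n_cls), repeat=n_inst), key=joint_score)
best_score = joint_score(best)

# A simple anomaly flag: even the best joint explanation scores poorly,
# i.e. the observed motion/context patterns deviate from the learned ones.
threshold = 4.0          # hypothetical, would be set on validation data
print("labels:", best, "score:", round(best_score, 2),
      "anomalous:", best_score < threshold)
```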
Patch to the future: Unsupervised visual prediction. In CVPR, 2014. Cited by 8 (0 self).
In this paper we present a conceptually simple but surprisingly powerful method for visual prediction which combines the effectiveness of mid-level visual elements with temporal modeling. Our framework can be learned in a completely unsupervised manner from a large collection of videos. More importantly, because our approach models the prediction framework on these mid-level elements, we can not only predict the possible motion in the scene but also predict visual appearances, that is, how appearances are going to change with time. This yields a visual “hallucination” of probable events on top of the scene. We show that our method is able to accurately predict and visualize simple future events; we also show that our approach is comparable to supervised methods for event prediction.
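One crude way to picture temporal modeling over mid-level elements is a first-order transition table over discrete element labels; the sketch below is only an analogy under that assumption (element ids, counts, and the prediction rule are invented) and is not the paper's prediction framework:

```python
import numpy as np

# Pretend we observed sequences of mid-level element ids (e.g. cluster ids of
# patch appearances) at tracked locations and accumulated transition counts
# from a large unlabeled video collection. All counts here are synthetic.
n_elements = 4
rng = np.random.default_rng(1)
transition_counts = rng.integers(1, 20, size=(n_elements, n_elements))
P = transition_counts / transition_counts.sum(axis=1, keepdims=True)

def predict_next(element_id, horizon=3):
    """Roll the transition model forward and return, for each future step,
    the most probable mid-level element (a crude appearance 'hallucination')."""
    dist = np.eye(n_elements)[element_id]
    preds = []
    for _ in range(horizon):
        dist = dist @ P
        preds.append(int(np.argmax(dist)))
    return preds

print(predict_next(element_id=2))
```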
Modeling Human Activities as Speech. In CVPR, 2011. Cited by 6 (2 self).
Human activity recognition and speech recognition appear to be two loosely related research areas. However, on careful thought, there are several analogies between activity and speech signals with regard to the way they are generated, propagated, and perceived. In this paper, we propose a novel action representation, the action spectrogram, which is inspired by a common spectrographic representation of speech. Different from a sound spectrogram, an action spectrogram is a space-time-frequency representation which characterizes the short-time spectral properties of body parts’ movements. While the essence of the speech signal is the variation of air pressure in time, our method models activities as the likelihood time series of action-associated local interest patterns. This low-level process is realized by learning boosted window classifiers from spatially quantized spatio-temporal interest features. We have tested our algorithm on a variety of human activity datasets and achieved superior results.
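To illustrate what a space-time-frequency view of a likelihood time series might look like, the sketch below computes short-time Fourier magnitudes of a synthetic per-body-region likelihood signal; the window length, hop size, and signal itself are assumptions for illustration, not the paper's exact construction:

```python
import numpy as np

def short_time_spectrum(likelihood, win=32, hop=16):
    """Short-time Fourier magnitudes of a 1-D likelihood time series, giving a
    spectrogram-like time-frequency view of how an action pattern fluctuates."""
    frames = []
    window = np.hanning(win)
    for start in range(0, len(likelihood) - win + 1, hop):
        seg = likelihood[start:start + win] * window
        frames.append(np.abs(np.fft.rfft(seg)))
    return np.array(frames)            # shape: (num_frames, win // 2 + 1)

# Synthetic likelihood series: a periodic limb movement plus noise.
rng = np.random.default_rng(0)
t = np.arange(512)
likelihood = 0.5 + 0.4 * np.sin(2 * np.pi * t / 20) + 0.05 * rng.normal(size=t.size)

spec = short_time_spectrum(likelihood)
print(spec.shape)                      # (31, 17)
```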
Monocular Object Detection Using 3D Geometric Primitives. Cited by 4 (4 self).
Multiview object detection methods achieve robustness in adverse imaging conditions by exploiting projective consistency across views. In this paper, we present an algorithm that achieves performance comparable to multiview methods from a single camera by employing geometric primitives as proxies for the true 3D shape of objects, such as pedestrians or vehicles. Our key insight is that for a calibrated camera, geometric primitives produce predetermined location-specific patterns in occupancy maps. We use these to define spatially-varying kernel functions of projected shape. This leads to an analytical formation model of occupancy maps as the convolution of locations and projected shape kernels. We estimate object locations by deconvolving the occupancy map using an efficient template similarity scheme. The number of objects and their positions are determined using the mean shift algorithm. The approach is highly parallel because the occupancy probability of a particular geometric primitive at each ground location is an independent computation. The algorithm extends to multiple cameras without requiring significant bandwidth. We demonstrate comparable performance to multiview methods and show robust, real-time object detection on full-resolution HD video in a variety of challenging imaging conditions.
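A toy version of the occupancy-map formation and deconvolution idea, assuming a single isotropic Gaussian stands in for the location-specific projected-shape kernels; the grid size, noise level, thresholds, and mean-shift bandwidth below are invented for illustration:

```python
import numpy as np

# An occupancy map is modeled as true ground locations convolved with a
# projected-shape kernel; here one isotropic Gaussian is used everywhere.
H, W, sigma = 60, 60, 2.0
yy, xx = np.mgrid[0:H, 0:W]

def kernel_at(cy, cx):
    return np.exp(-((yy - cy) ** 2 + (xx - cx) ** 2) / (2 * sigma ** 2))

true_locs = [(20, 15), (40, 45)]
occupancy = sum(kernel_at(*p) for p in true_locs)
occupancy += 0.05 * np.random.default_rng(0).random((H, W))   # clutter

# "Deconvolution" by template similarity: score each ground location by the
# normalized correlation between the occupancy map and the shifted kernel.
score = np.zeros((H, W))
for cy in range(H):
    for cx in range(W):
        k = kernel_at(cy, cx)
        score[cy, cx] = (occupancy * k).sum() / np.linalg.norm(k)

# Mean shift on high-scoring cells to find the number of objects and positions.
pts = np.argwhere(score > 0.8 * score.max()).astype(float)
for _ in range(20):                       # fixed number of mean-shift iterations
    for i, p in enumerate(pts):
        w = np.exp(-np.sum((pts - p) ** 2, axis=1) / (2 * 3.0 ** 2))
        pts[i] = (w[:, None] * pts).sum(axis=0) / w.sum()

# Approximately recovers the two planted ground locations.
print("detected object locations:", np.unique(np.round(pts), axis=0).tolist())
```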
Incremental activity modeling and recognition in streaming videos. In CVPR, 2014. Cited by 4 (3 self).
Most of the state-of-the-art approaches to human activity recognition in video need an intensive training stage and assume that all of the training examples are labeled and available beforehand. But these assumptions are unrealistic for many applications where we have to deal with streaming videos. In these videos, as new activities are seen, they can be leveraged to improve the current activity recognition models. In this work, we develop an incremental activity learning framework that is able to continuously update the activity models and learn new ones as more videos are seen. Our proposed approach leverages state-of-the-art machine learning tools, most notably active learning systems. It does not require tedious manual labeling of every incoming example of each activity class. We perform rigorous experiments on challenging human activity datasets, which demonstrate that the incremental activity modeling framework can achieve performance very close to the cases when all examples are available a priori.
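The incremental, label-efficient flavor of such a framework can be sketched with a nearest-class-mean model that updates online and asks for labels only when uncertain; the classifier, thresholds, and simulated stream below are stand-ins, not the paper's actual learning machinery:

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, dim = 3, 16
prototypes = 3.0 * rng.normal(size=(n_classes, dim))     # "true" activity signatures
means = prototypes + rng.normal(size=(n_classes, dim))   # rough initial models
counts = np.ones(n_classes)

def predict(x):
    """Return the nearest class and a margin used as an uncertainty proxy."""
    d = np.linalg.norm(means - x, axis=1)
    order = np.argsort(d)
    return order[0], d[order[1]] - d[order[0]]

def update(c, x):
    """Incrementally move class c's mean toward the new example x."""
    counts[c] += 1
    means[c] += (x - means[c]) / counts[c]

QUERY_MARGIN = 1.0                                       # hypothetical threshold
queries = 0
for _ in range(500):                                     # simulated video stream
    true_c = rng.integers(n_classes)
    x = prototypes[true_c] + 0.5 * rng.normal(size=dim)  # synthetic feature vector
    pred_c, margin = predict(x)
    if margin < QUERY_MARGIN:     # uncertain: request a label (active learning)
        queries += 1
        update(true_c, x)
    else:                         # confident: update with the predicted label
        update(pred_c, x)

print(f"labels requested for {queries} of 500 streaming examples")
```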
Exploiting Spatio-Temporal Scene Structure for Wide-Area Activity Analysis in Unconstrained Environments. Cited by 4 (2 self).
Surveillance videos typically consist of long-duration sequences of activities which occur at different spatio-temporal locations and can involve multiple people acting simultaneously. Often, the activities have contextual relationships with one another. Although context has been studied in the past for the purpose of activity recognition to a certain extent, the use of context in the recognition of activities in such challenging environments is relatively unexplored. In this paper, we propose a novel method for capturing the spatio-temporal context between activities in a Markov random field (MRF). The structure of the MRF is determined at test time rather than pre-defined, unlike many approaches that model the contextual relationships between activities. Given a collection of videos and a set of weak classifiers for individual activities, the spatio-temporal relationships between activities are represented as probabilistic edge weights in the MRF. This model provides a generic representation for an activity sequence that can extend to any number of objects and interactions in a video. We show that the recognition of activities in a video can be posed as an inference problem on the graph. We conduct experiments on the publicly available UCLA office dataset and the VIRAT dataset to demonstrate the improvement in recognition accuracy using our proposed model, as opposed to recognition using state-of-the-art features on individual activity regions. Index Terms: Context-aware activity recognition, Markov random field, wide-area activity analysis.
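A small sketch of the general recipe, assuming synthetic weak-classifier scores, edges added by spatio-temporal proximity, and iterated conditional modes for inference; the paper builds the graph at test time with its own learned edge weights and inference procedure, so all quantities below are illustrative:

```python
import numpy as np

# Nodes are detected activity regions with weak-classifier scores; an edge is
# added when two regions are close in space and time. Numbers are synthetic.
rng = np.random.default_rng(2)
n_nodes, n_cls = 5, 3

unary = rng.normal(size=(n_nodes, n_cls))       # weak per-node log-scores
loc = rng.uniform(0, 10, size=(n_nodes, 3))     # (time, x, y) of each region

# Spatio-temporal edges: connect regions within a proximity radius.
edges = [(i, j) for i in range(n_nodes) for j in range(i + 1, n_nodes)
         if np.linalg.norm(loc[i] - loc[j]) < 7.0]

# Pairwise compatibility of activity labels on connected regions
# (in the paper these edge weights are estimated from data).
pairwise = rng.uniform(0, 1, size=(n_cls, n_cls))
pairwise = (pairwise + pairwise.T) / 2

def local_score(i, c, labels):
    """Score of assigning class c to node i given its neighbors' labels."""
    s = unary[i, c]
    s += sum(pairwise[c, labels[j]] for a, j in edges if a == i)
    s += sum(pairwise[labels[a], c] for a, j in edges if j == i)
    return s

# Iterated conditional modes: greedily relabel each node given its neighbors.
labels = np.argmax(unary, axis=1)               # initialize from weak classifiers
for _ in range(10):
    for i in range(n_nodes):
        labels[i] = int(np.argmax([local_score(i, c, labels) for c in range(n_cls)]))

print("refined activity labels:", labels.tolist())
```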