Results 1 - 10 of 45
A database for fine grained activity detection of cooking activities - In CVPR, 2012
"... While activity recognition is a current focus of research the challenging problem of fine-grained activity recognition is largely overlooked. We thus propose a novel database of 65 cooking activities, continuously recorded in a realistic setting. Activities are distinguished by fine-grained body mot ..."
Abstract - Cited by 40 (5 self)
While activity recognition is a current focus of research, the challenging problem of fine-grained activity recognition is largely overlooked. We thus propose a novel database of 65 cooking activities, continuously recorded in a realistic setting. Activities are distinguished by fine-grained body motions that have low inter-class variability and high intra-class variability due to diverse subjects and ingredients. We benchmark two approaches on our dataset, one based on articulated pose tracks and the second using holistic video features. While the holistic approach outperforms the pose-based approach, our evaluation suggests that fine-grained activities are more difficult to detect and that the body model can help in those cases. By providing high-resolution videos as well as an intermediate pose representation, we hope to foster research in fine-grained activity recognition.
Action recognition with multiscale spatio-temporal contexts - In Proc. IEEE Conf. CVPR, 2011
"... The popular bag of words approach for action recogni-tion is based on the classifying quantized local features den-sity. This approach focuses excessively on the local features but discards all information about the interactions among them. Local features themselves may not be discriminative enough, ..."
Abstract - Cited by 20 (0 self)
The popular bag-of-words approach for action recognition is based on classifying the density of quantized local features. This approach focuses excessively on the local features themselves but discards all information about the interactions among them. Local features on their own may not be discriminative enough, but combined with their contexts they can be very useful for recognizing some actions. In this paper, we present a novel representation that captures contextual interactions between interest points, based on the density of all features observed in each interest point's multiscale spatio-temporal contextual domain. We demonstrate that augmenting local features with our contextual feature significantly improves recognition performance.
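A minimal sketch of the contextual-density idea this abstract describes: for each interest point, histogram the codewords of all features falling in spatio-temporal neighbourhoods of several sizes, and append that to the local descriptor. The function name, radii, and normalization below are illustrative assumptions, not the paper's exact formulation.

import numpy as np

def context_descriptor(points, labels, n_words,
                       scales=((20, 10), (40, 20), (80, 40))):
    # points: (n, 3) array of (x, y, t); labels: (n,) codeword indices.
    # For each point, build one codeword histogram per (spatial, temporal)
    # scale over its neighbours, then concatenate across scales.
    xy, t = points[:, :2], points[:, 2]
    descs = []
    for i in range(len(points)):
        per_scale = []
        for r_xy, r_t in scales:
            near = (np.linalg.norm(xy - xy[i], axis=1) <= r_xy) \
                   & (np.abs(t - t[i]) <= r_t)
            hist = np.bincount(labels[near], minlength=n_words).astype(float)
            per_scale.append(hist / max(hist.sum(), 1.0))  # density, not raw count
        descs.append(np.concatenate(per_scale))
    return np.vstack(descs)  # shape (n_points, n_words * n_scales)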
Sampling strategies for real-time action recognition - In CVPR, 2013
"... Local spatio-temporal features and bag-of-features rep-resentations have become popular for action recognition. A recent trend is to use dense sampling for better perfor-mance. While many methods claimed to use dense feature sets, most of them are just denser than approaches based on sparse interest ..."
Abstract - Cited by 16 (1 self)
Local spatio-temporal features and bag-of-features representations have become popular for action recognition. A recent trend is to use dense sampling for better performance. While many methods claim to use dense feature sets, most are merely denser than approaches based on sparse interest point detectors. In this paper, we explore high-density sampling for action recognition. We also investigate the impact of random sampling over a dense grid for computational efficiency. We present a real-time action recognition system which integrates a fast random sampling method with local spatio-temporal features extracted from a Local Part Model. A new method based on the histogram intersection kernel is proposed to combine multiple channels of different descriptors. Our technique achieves high accuracy on the simple KTH dataset (93%), and state-of-the-art results on two very challenging real-world datasets: 83.3% on UCF50 and 47.6% on HMDB51.
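The abstract mentions combining descriptor channels through a histogram intersection kernel. A minimal sketch of that idea, assuming L1-normalized histograms and equal channel weights (the weighting scheme is an assumption, not the paper's exact method):

import numpy as np
from sklearn.svm import SVC

def intersection_kernel(A, B):
    # K[i, j] = sum_k min(A[i, k], B[j, k]) for histogram rows of A and B.
    return np.minimum(A[:, None, :], B[None, :, :]).sum(axis=2)

def multichannel_kernel(channels_a, channels_b):
    # Average the intersection kernels of each descriptor channel.
    return np.mean([intersection_kernel(a, b)
                    for a, b in zip(channels_a, channels_b)], axis=0)

# Usage with a precomputed-kernel SVM (X_* are lists of per-channel
# histogram matrices; y_train is hypothetical):
# K_train = multichannel_kernel(X_train, X_train)
# clf = SVC(kernel="precomputed").fit(K_train, y_train)
# K_test = multichannel_kernel(X_test, X_train)
# y_pred = clf.predict(K_test)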
Feature detector and descriptor evaluation in human action recognition - In Proceedings of the ACM International Conference on Image and Video Retrieval, 2010
"... In this paper, we evaluate and compare different feature detection and feature description methods for part-based approaches in human action recognition. Different methods have been proposed in the literature for both feature detection of space-time interest points and description of local video pat ..."
Abstract - Cited by 12 (2 self)
In this paper, we evaluate and compare different feature detection and feature description methods for part-based approaches in human action recognition. Different methods have been proposed in the literature for both the detection of space-time interest points and the description of local video patches. It is, however, unclear which method performs better in the field of human action recognition. In the feature detection section, we compare Dollar's method [18], Laptev's method [22], a bank of 3D Gabor filters [6] and a method based on space-time Differences of Gaussians. We also compare and evaluate different descriptors such as Gradient [18], HOG-HOF [22], 3D SIFT [24] and an enhanced version of LBP-TOP [15]. We show the combination of Dollar's detection method and the improved LBP-TOP descriptor to be computationally efficient and to reach the best recognition accuracy on the KTH database.
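For context, a minimal sketch of the detector of Dollar et al. [18] compared above: its response is R = (I * g * h_ev)^2 + (I * g * h_od)^2, where g is a 2D spatial Gaussian and h_ev, h_od are a temporal quadrature pair of 1D Gabor filters with omega = 4/tau; the parameter values and support length below are illustrative.

import numpy as np
from scipy.ndimage import gaussian_filter, convolve1d

def cuboid_response(video, sigma=2.0, tau=3.0):
    # video: (T, H, W) float array; returns the per-pixel response map
    # whose local maxima are the space-time interest points.
    omega = 4.0 / tau  # frequency/scale coupling used in the original paper
    t = np.arange(-2 * int(tau), 2 * int(tau) + 1)
    h_ev = -np.cos(2 * np.pi * omega * t) * np.exp(-t**2 / tau**2)
    h_od = -np.sin(2 * np.pi * omega * t) * np.exp(-t**2 / tau**2)
    smoothed = gaussian_filter(video, sigma=(0, sigma, sigma))  # spatial only
    even = convolve1d(smoothed, h_ev, axis=0)  # temporal filtering
    odd = convolve1d(smoothed, h_od, axis=0)
    return even**2 + odd**2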
Action Recognition with Actons
"... With the improved accessibility to an exploding amoun-t of video data and growing demands in a wide range of video analysis applications, video-based action recogni-tion/classification becomes an increasingly important task in computer vision. In this paper, we propose a two-layer structure for acti ..."
Abstract - Cited by 6 (2 self)
With the improved accessibility to an exploding amount of video data and growing demands in a wide range of video analysis applications, video-based action recognition/classification becomes an increasingly important task in computer vision. In this paper, we propose a two-layer structure for action recognition that automatically exploits a mid-level "acton" representation. The weakly supervised actons are learned via a new max-margin multi-channel multiple instance learning framework, which can capture multiple mid-level action concepts simultaneously. The learned actons (which require no detailed manual annotations) are compact, informative, discriminative, and easy to scale. The experimental results demonstrate the effectiveness of applying the learned actons in our two-layer structure, and show state-of-the-art recognition performance on two challenging action datasets, i.e., YouTube and HMDB51.
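The paper's max-margin multi-channel MIL formulation is more involved than can be shown here; as a stand-in, the classic mi-SVM alternation (Andrews et al.) below illustrates the basic multiple-instance idea of treating a video as a bag of mid-level segment descriptors with only a bag-level label. All names and parameters are illustrative.

import numpy as np
from sklearn.svm import LinearSVC

def mi_svm(bags, bag_labels, n_iter=10, C=1.0):
    # bags: list of (n_i, d) arrays; bag_labels: iterable over {0, 1}.
    X = np.vstack(bags)
    y = np.concatenate([np.full(len(b), lab) for b, lab in zip(bags, bag_labels)])
    clf = LinearSVC(C=C)
    for _ in range(n_iter):
        clf.fit(X, y)
        scores = clf.decision_function(X)
        start = 0
        for b, lab in zip(bags, bag_labels):
            if lab == 1:
                s = scores[start:start + len(b)]
                # re-label instances in positive bags by the current model,
                # keeping at least the top-scoring instance positive
                y[start:start + len(b)] = (s > 0).astype(int)
                y[start + np.argmax(s)] = 1
            start += len(b)
    return clf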
Dynamic scene classification: Learning motion descriptors with slow features analysis - In CVPR, 2013
"... In this paper, we address the challenging problem of categorizing video sequences composed of dynamic natural scenes. Contrarily to previous methods that rely on hand-crafted descriptors, we propose here to represent videos us-ing unsupervised learning of motion features. Our method encompasses thre ..."
Abstract - Cited by 5 (2 self)
In this paper, we address the challenging problem of categorizing video sequences composed of dynamic natural scenes. Contrary to previous methods that rely on hand-crafted descriptors, we propose to represent videos using unsupervised learning of motion features. Our method encompasses three main contributions: 1) Based on the Slow Feature Analysis principle, we introduce a learned local motion descriptor which represents the principal and most stable motion components of training videos. 2) We integrate our local motion feature into a global coding/pooling architecture in order to provide an effective signature for each video sequence. 3) We report state-of-the-art classification performance on two challenging natural scene datasets. In particular, an outstanding improvement of 11% in classification score is reached on a dataset introduced in 2012.
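For readers unfamiliar with Slow Feature Analysis, a minimal sketch of its linear form: whiten the time-ordered signal, then find the directions in which the temporal derivative has the least variance, solved by an eigendecomposition. This is textbook SFA, not the paper's full learned descriptor pipeline; variable names are illustrative.

import numpy as np

def linear_sfa(X, n_components):
    # X: (T, d) time-ordered signal; returns projection W whose columns
    # give the n_components slowest-varying features.
    Xc = X - X.mean(axis=0)
    # whiten via the eigendecomposition of the covariance
    evals, evecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    keep = evals > 1e-10
    whiten = evecs[:, keep] / np.sqrt(evals[keep])
    Z = Xc @ whiten
    # slowness: smallest-variance directions of the temporal difference
    dZ = np.diff(Z, axis=0)
    devals, devecs = np.linalg.eigh(np.cov(dZ, rowvar=False))
    return whiten @ devecs[:, :n_components]  # eigh sorts ascending: slowest first

# Usage: W = linear_sfa(X, 16); slow features of a sample x are (x - X.mean(0)) @ W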
Human Action Recognition and Localization in Video Using Structured Learning of Local Space-Time Features - In Proceedings of the IEEE International Conference on Advanced Video and Signal Based Surveillance, 2010
"... This paper presents a unified framework for human ac-tion classification and localization in video using structured learning of local space-time features. Each human action class is represented by a set of its own compact set of lo-cal patches. In our approach, we first use a discriminative hierarch ..."
Abstract - Cited by 4 (0 self)
This paper presents a unified framework for human action classification and localization in video using structured learning of local space-time features. Each human action class is represented by its own compact set of local patches. In our approach, we first use a discriminative hierarchical Bayesian classifier to select those space-time interest points that are constructive for each particular action. Those concise local features are then passed to a Support Vector Machine with a Principal Component Analysis projection for the classification task. Meanwhile, action localization is done using Dynamic Conditional Random Fields developed to incorporate the spatial and temporal structure constraints of superpixels extracted around those features. Each superpixel in the video is defined by the shape and motion information of its corresponding feature region. Compelling results obtained from experiments on the KTH [22], Weizmann [1], HOHA [13] and TRECVid [23] datasets demonstrate the efficiency and robustness of our framework for the task of human action recognition and localization in video.
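A minimal sketch of the classification stage named in the abstract (PCA projection followed by an SVM); the feature selection and CRF localization stages are not reproduced, and the component count and SVM parameters are illustrative assumptions.

from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

def build_classifier(n_components=100):
    # project descriptors onto the top principal components, then classify
    return make_pipeline(PCA(n_components=n_components), SVC(kernel="rbf", C=10.0))

# Usage (X_train/y_train hypothetical):
# clf = build_classifier(); clf.fit(X_train, y_train); clf.predict(X_test)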
A Dataset for Movie Description - In CVPR, 2015
"... Descriptive video service (DVS) provides linguistic de-scriptions of movies and allows visually impaired people to follow a movie along with their peers. Such descriptions are by design mainly visual and thus naturally form an inter-esting data source for computer vision and computational linguistic ..."
Abstract - Cited by 3 (0 self)
Descriptive video service (DVS) provides linguistic descriptions of movies and allows visually impaired people to follow a movie along with their peers. Such descriptions are by design mainly visual and thus naturally form an interesting data source for computer vision and computational linguistics. In this work we propose a novel dataset which contains transcribed DVS that is temporally aligned to full-length HD movies. In addition, we collected the aligned movie scripts which have been used in prior work and compare the two different sources of descriptions. In total, the Movie Description dataset contains a parallel corpus of over 54,000 sentences and video snippets from 72 HD movies. We characterize the dataset by benchmarking different approaches for generating video descriptions. Comparing DVS to scripts, we find that DVS is far more visual and describes precisely what is shown rather than what should happen according to the scripts created prior to movie production.
Relevance Feedback for Real World Human Action Retrieval - In Intelligent Multimedia Interactivity, 2011
"... a b s t r a c t Content-based video retrieval is an increasingly popular research field, in large part due to the quickly growing catalogue of multimedia data to be found online. Even though a large portion of this data concerns humans, however, retrieval of human actions has received relatively li ..."
Abstract - Cited by 3 (1 self)
Content-based video retrieval is an increasingly popular research field, in large part due to the quickly growing catalogue of multimedia data to be found online. Even though a large portion of this data concerns humans, retrieval of human actions has received relatively little attention. Presented in this paper is a video retrieval system that can be used to perform a content-based query on a large database of videos very efficiently. Furthermore, it is shown that by using ABRS-SVM, a technique for incorporating relevance feedback (RF) on the search results, it is possible to quickly achieve useful results even when dealing with very complex human action queries, such as those in Hollywood movies.
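A minimal sketch of one generic SVM-based relevance-feedback round: the user marks a few results as relevant or irrelevant, a classifier is fit on that feedback, and the database is re-ranked by decision score. This is the generic RF loop, not the specific ABRS-SVM variant the paper uses; names and parameters are illustrative.

import numpy as np
from sklearn.svm import SVC

def refine_ranking(db_features, fb_indices, fb_labels):
    # db_features: (N, d) video descriptors; fb_indices: judged items;
    # fb_labels: {0, 1} with both classes present in the feedback.
    clf = SVC(kernel="rbf", gamma="scale").fit(db_features[fb_indices], fb_labels)
    scores = clf.decision_function(db_features)
    return np.argsort(-scores)  # video indices, most relevant first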