Unsupervised learning of human action categories using spatial-temporal words
In Proc. BMVC, 2006
Cited by 163 (4 self)
Imagine a video taken on a sunny beach: can a computer automatically tell what is happening in the scene? Can it identify different human activities in the video, such as water surfing, people walking and lying on the beach? Automatically classifying or localizing different actions in video sequences is very useful for a variety of tasks, such as video surveillance, object-level video summarization, video indexing, digital library organization, etc. However, it remains a challenging task for computers to achieve robust action recognition due to cluttered background, camera motion, occlusion, and geometric and photometric variances of objects. For example, in a live video of a skating competition, the skater moves rapidly across the rink, and the camera also moves to follow the skater. With a moving camera, a non-stationary background, and a moving target, few vision algorithms could identify, categorize and
Recovering 3D Human Pose from Monocular Images
Cited by 95 (0 self)
We describe a learning based method for recovering 3D human body pose from single images and monocular image sequences. Our approach requires neither an explicit body model nor prior labelling of body parts in the image. Instead, it recovers pose by direct nonlinear regression against shape descriptor vectors extracted automatically from image silhouettes. For robustness against local silhouette segmentation errors, silhouette shape is encoded by histogram-of-shape-contexts descriptors. We evaluate several different regression methods: ridge regression, Relevance Vector Machine (RVM) regression and Support Vector Machine (SVM) regression over both linear and kernel bases. The RVMs provide much sparser regressors without compromising performance, and kernel bases give a small but worthwhile improvement in performance. Loss of depth and limb labelling information often makes the recovery of 3D pose from single silhouettes ambiguous. We propose two solutions to this: the first embeds the method in a tracking framework, using dynamics from the previous state estimate to disambiguate the pose; the second uses a mixture of regressors framework to return multiple solutions for each silhouette. We show that the resulting system tracks long sequences stably, and is also capable of accurately reconstructing 3D human pose from single images, giving multiple possible solutions in ambiguous cases. For realism and good generalization over a wide range of viewpoints, we train the regressors on images resynthesized from real human motion capture data. The method is demonstrated on a 54-parameter full body pose model, both quantitatively on independent but similar test data, and qualitatively on real image sequences. Mean angular errors of 4–5 degrees are obtained — a factor of 3 better than the current state of the art for the much simpler upper body problem.
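As a rough illustration of the ridge regression variant named in this abstract, the following minimal sketch fits a regularized linear map from a descriptor vector to a single pose value. It is not the authors' implementation: the function names `ridge_fit` and `ridge_predict`, the two-dimensional toy descriptors, and the single scalar output (standing in for one joint angle) are all assumptions made for illustration.

```python
def ridge_fit(X, y, lam):
    """Solve (X^T X + lam * I) w = X^T y by Gaussian elimination.

    X is a list of descriptor vectors, y a list of scalar targets
    (a hypothetical single joint angle), lam the ridge penalty.
    """
    d = len(X[0])
    # Normal-equation matrix A = X^T X + lam * I and right-hand side b = X^T y.
    A = [[sum(row[i] * row[j] for row in X) + (lam if i == j else 0.0)
          for j in range(d)] for i in range(d)]
    b = [sum(row[i] * t for row, t in zip(X, y)) for i in range(d)]
    # Forward elimination with partial pivoting.
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, d):
            f = A[r][col] / A[col][col]
            for c in range(col, d):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution.
    w = [0.0] * d
    for i in reversed(range(d)):
        s = sum(A[i][j] * w[j] for j in range(i + 1, d))
        w[i] = (b[i] - s) / A[i][i]
    return w


def ridge_predict(w, x):
    """Predict the scalar target for one descriptor vector x."""
    return sum(wi * xi for wi, xi in zip(w, x))


# Toy data (assumed): targets generated by the weights [2, 3].
X_toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y_toy = [2.0, 3.0, 5.0]
w_plain = ridge_fit(X_toy, y_toy, lam=0.0)   # unregularized least squares
w_ridge = ridge_fit(X_toy, y_toy, lam=1.0)   # weights shrunk toward zero
```

With `lam=0.0` the fit recovers the generating weights; increasing `lam` shrinks them, which is the usual trade of some bias for stability when descriptors are noisy. A full pose vector would need one such regressor per output dimension, or a matrix-valued solve.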
HumanEva: Synchronized video and motion capture dataset for evaluation of articulated human motion
2006
Cited by 93 (8 self)
While research on articulated human motion and pose estimation has progressed rapidly in the last few years, there has been no systematic quantitative evaluation of competing methods to establish the current state of the art. We present data obtained using a hardware system that is able to capture synchronized video and ground-truth 3D motion. The resulting HUMANEVA datasets contain multiple subjects performing a set of predefined actions with a number of repetitions. On the order of 40,000 frames of synchronized motion capture and multi-view video (resulting in over one quarter million image frames in total) were collected at 60 Hz with an additional 37,000 time instants of pure motion capture data. A standard set of error measures is defined for evaluating both 2D and 3D pose estimation and tracking algorithms. We also describe a baseline algorithm for 3D articulated tracking that uses a relatively standard Bayesian framework with optimization in the form of Sequential Importance Resampling and Annealed Particle Filtering. In the context of this baseline algorithm we explore a variety of likelihood functions, prior models of human motion and the effects of algorithm parameters. Our experiments suggest that image observation models and motion priors play important roles in performance, and that in a multi-view laboratory environment, where initialization is available, Bayesian filtering tends to perform well. The datasets and the software are made available to the research community. This infrastructure will support the development of new articulated motion and pose estimation algorithms, will provide a baseline for the evaluation and comparison of new methods, and will help establish the current state of the art in human pose estimation and tracking.
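The Sequential Importance Resampling step mentioned in this abstract can be sketched in a few lines. This is a generic textbook SIR update under simplifying assumptions, not the HumanEva baseline code: the state is a single scalar rather than an articulated pose, and the names `sir_step`, `box_likelihood`, and `motion_noise` are hypothetical.

```python
import random


def sir_step(particles, observation, likelihood, motion_noise, rng):
    """One predict / weight / resample cycle over scalar particles."""
    # Predict: diffuse each particle with Gaussian motion noise (assumed prior).
    predicted = [p + rng.gauss(0.0, motion_noise) for p in particles]
    # Weight: score each predicted particle against the observation.
    weights = [likelihood(observation, p) for p in predicted]
    total = sum(weights)
    if total == 0.0:
        # Degenerate case: fall back to uniform weights.
        weights = [1.0 / len(predicted)] * len(predicted)
    else:
        weights = [w / total for w in weights]
    # Resample: draw with replacement in proportion to the weights.
    return rng.choices(predicted, weights=weights, k=len(predicted))


def box_likelihood(observation, state):
    """Toy likelihood (assumed): 1 inside a 0.5-wide window, else 0."""
    return 1.0 if abs(observation - state) < 0.5 else 0.0


rng = random.Random(0)
# With zero motion noise, only the particle at 1.0 matches the observation,
# so every resampled particle is a copy of it.
survivors = sir_step([0.0, 1.0, 2.0], 1.0, box_likelihood, 0.0, rng)
```

Annealed particle filtering, also named in the abstract, would repeat the weight and resample stages over several layers with progressively sharper likelihoods; that refinement is omitted here.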
Predicting 3D People from 2D Pictures
In IV Conference on Articulated Motion and Deformable Objects, AMDO, 2006
Cited by 30 (2 self)
We propose a hierarchical process for inferring the 3D pose of a person from monocular images. First we infer a learned view-based 2D body model from a single image using non-parametric belief propagation. This approach integrates information from bottom-up body-part proposal processes and deals with self-occlusion to compute distributions over limb poses. Then, we exploit a learned Mixture of Experts model to infer a distribution of 3D poses conditioned on 2D poses. This approach is more general than recent work on inferring 3D pose directly from silhouettes since the 2D body model provides a richer representation that includes the 2D joint angles and the poses of limbs that may be unobserved in the silhouette. We demonstrate the method in a laboratory setting where we evaluate the accuracy of the 3D poses against ground truth data. We also estimate 3D body pose in a monocular image sequence. The resulting 3D estimates are sufficiently accurate to serve as proposals for the Bayesian inference of 3D human motion over time.
Monocular Human Motion Capture with a Mixture of Regressors
IEEE Workshop on Vision for Human-Computer Interaction, 2005
Cited by 25 (1 self)
We address 3D human motion capture from monocular images, taking a learning based approach to construct a probabilistic pose estimation model from a set of labelled human silhouettes. To compensate for ambiguities in the pose reconstruction problem, our model explicitly calculates several possible pose hypotheses. It uses locality on a manifold in the input space and connectivity in the output space to identify regions of multi-valuedness in the mapping from silhouette to 3D pose. This information is used to fit a mixture of regressors on the input manifold, giving us a global model capable of predicting the possible poses with corresponding probabilities. These are then used in a dynamical-model-based tracker that automatically detects tracking failures and re-initializes in a probabilistically correct manner. The system is trained on conventional motion capture data, using both the corresponding real human silhouettes and silhouettes synthesized artificially from several different models for improved robustness to inter-person variations. Static pose estimation is illustrated on a variety of silhouettes. The robustness of the method is demonstrated by tracking on a real image sequence requiring multiple automatic re-initializations.
Discriminative subsequence mining for action classification
In: 11th IEEE International Conference on Computer Vision, Los Alamitos, 2007
Cited by 24 (1 self)
Recent approaches to action classification in videos have used sparse spatio-temporal words encoding local appearance around interesting movements. Most of these approaches use a histogram representation, discarding the temporal order among features. But this ordering information can contain important information about the action itself, e.g. consider the sport disciplines of hurdle race and long jump, where the global temporal order of motions (running, jumping) is important to discriminate between the two. In this work we propose to use a sequential representation which retains this temporal order. Further, we introduce Discriminative Subsequence Mining to find optimal discriminative subsequence patterns. In combination with the LPBoost classifier, this amounts to simultaneously learning a classification function and performing feature selection in the space of all possible feature sequences. The resulting classifier linearly combines a small number of interpretable decision functions, each checking for the presence of a single discriminative pattern. The classifier is benchmarked on the KTH action classification data set and outperforms the best known results in the literature.
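The point about histograms discarding temporal order can be made concrete with a toy example. The sketch below is illustrative only: the word labels, the two sequences, and the gap-tolerant matching rule in `contains_subsequence` are assumptions, and it does not reproduce the paper's mining procedure or the LPBoost classifier.

```python
from collections import Counter


def histogram(words):
    """Bag-of-words representation: counts only, order discarded."""
    return Counter(words)


def contains_subsequence(words, pattern):
    """True if pattern occurs in words in order (gaps allowed, by assumption)."""
    it = iter(words)
    # `p in it` consumes the iterator up to the first match, so later
    # pattern items can only match later positions in the sequence.
    return all(p in it for p in pattern)


# Hypothetical quantized spatio-temporal words for two actions.
hurdles = ["run", "jump", "run", "jump", "run"]
long_jump = ["run", "run", "run", "jump", "jump"]

same_histogram = histogram(hurdles) == histogram(long_jump)
pattern = ("jump", "run")
in_hurdles = contains_subsequence(hurdles, pattern)
in_long_jump = contains_subsequence(long_jump, pattern)
```

Both sequences yield identical histograms, yet the ordered pattern `("jump", "run")` appears only in the hurdle sequence, so a feature that tests for it separates the two classes where the histogram cannot.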
Temporal motion models for monocular and multiview 3D human body tracking
CVIU
Cited by 14 (4 self)
We explore an approach to 3D people tracking with learned motion models and deterministic optimization. The tracking problem is formulated as the minimization of a differentiable criterion whose differential structure is rich enough for optimization to be accomplished via hill-climbing. This avoids the computational expense of Monte Carlo methods, while yielding good results under challenging conditions. To demonstrate the generality of the approach we show that we can learn and track cyclic motions such as walking and running, as well as acyclic motions such as a golf swing. We also show results from both monocular and multi-camera tracking. Finally, we provide results with a motion model learned from multiple activities, and show how these models might be used for recognition.
Hierarchical Space-time Model Enabling Efficient Search for Human Actions
2008
Cited by 13 (2 self)
We propose a five-layer hierarchical space-time model (HSTM) for representing and searching human actions in videos. From a feature point of view, both invariance and selectivity are desirable characteristics, which seem to contradict each other. To make these characteristics coexist, we introduce a coarse-to-fine search and verification scheme based on the HSTM model for action searching. Because going through layers of the hierarchy corresponds to progressively turning the knob between invariance and selectivity, this strategy enables the search for human actions ranging from rapid movements of sports to subtle motions of facial expressions. The introduction of the HIstogram of Gabor Orientations (HIGO) feature makes searching for actions go smoothly across the hierarchical layers of the HSTM model. The matching efficiency is enhanced by applying integral histograms to compute the features in the top two layers. The HSTM model was tested on three selected challenging video sequences and on the KTH human action database and achieved improvement over other state-of-the-art algorithms. These results validate that the HSTM model is both selective and robust for searching human actions.
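The integral histogram technique mentioned in this abstract lets the histogram of any axis-aligned rectangle be read off with four lookups per bin. The sketch below shows the general idea on a 2D grid of bin indices; it is not the HSTM implementation, and the function names, the toy grid, and the assumption that each pixel carries one quantized orientation bin are hypothetical.

```python
def build_integral_histogram(labels, n_bins):
    """H[y][x][b] = number of cells with bin b in rows [0, y) and cols [0, x)."""
    rows, cols = len(labels), len(labels[0])
    H = [[[0] * n_bins for _ in range(cols + 1)] for _ in range(rows + 1)]
    for y in range(rows):
        for x in range(cols):
            for b in range(n_bins):
                # Inclusion-exclusion over the three already-filled neighbours.
                H[y + 1][x + 1][b] = (H[y][x + 1][b] + H[y + 1][x][b]
                                      - H[y][x][b])
            # Add the current cell's own bin.
            H[y + 1][x + 1][labels[y][x]] += 1
    return H


def region_histogram(H, y0, x0, y1, x1):
    """Histogram of rows [y0, y1) and cols [x0, x1) from four corner lookups."""
    n_bins = len(H[0][0])
    return [H[y1][x1][b] - H[y0][x1][b] - H[y1][x0][b] + H[y0][x0][b]
            for b in range(n_bins)]


# Toy 3x3 grid of quantized orientation bins (assumed values).
grid = [[0, 1, 1],
        [2, 0, 1],
        [2, 2, 0]]
H = build_integral_histogram(grid, n_bins=3)
top_right = region_histogram(H, 0, 1, 2, 3)   # rows 0-1, cols 1-2
whole = region_histogram(H, 0, 0, 3, 3)
```

Building the table costs one pass over the grid; after that, each region query is independent of the region's area, which is what makes exhaustive window search affordable.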
Regression-based human motion capture from voxel data
BMVC, 2006
Cited by 7 (0 self)
A regression based method is proposed to recover human body pose from 3D voxel data. In order to do this we need to convert the voxel data into a feature vector. This is done using a Bayesian approach based on Mixture of Probabilistic PCA that transforms a collection of 3D shape context descriptors, extracted from the voxels, to a compact feature vector. For the regression, the newly-proposed Multi-Variate Relevance Vector Machine is explored to learn a single mapping from this feature vector to a low-dimensional representation of full body pose. We demonstrate the effectiveness and robustness of our method with experiments on both synthetic data and real sequences.
Shared Kernel Information Embedding for Discriminative Inference
Cited by 6 (3 self)
Latent Variable Models (LVM), like the Shared-GPLVM and the Spectral Latent Variable Model, help mitigate overfitting when learning discriminative methods from small or moderately sized training sets. Nevertheless, existing methods suffer from several problems: 1) complexity; 2) the lack of explicit mappings to and from the latent space; 3) an inability to cope with multi-modality; and 4) the lack of a well-defined density over the latent space. We propose an LVM called the Shared Kernel Information Embedding (sKIE). It defines a coherent density over a latent space and multiple input/output spaces (e.g., image features and poses), and it is easy to condition on a latent state, or on combinations of the input/output states. Learning is quadratic, and it works well on small datasets. With datasets too large to learn a coherent global model, one can use sKIE to learn local online models. sKIE permits missing data during inference, and partially labelled data during learning. We use sKIE for human pose inference.

