Results 1 - 10 of 179
1 Articulated Human Detection with Flexible Mixtures-of-Parts
Cited by 65 (2 self)
We describe a method for articulated human detection and human pose estimation in static images based on a new representation of deformable part models. Rather than modeling articulation using a family of warped (rotated and foreshortened) templates, we use a mixture of small, non-oriented parts. We describe a general, flexible mixture model that jointly captures spatial relations between part locations and co-occurrence relations between part mixtures, augmenting standard pictorial structure models that encode just spatial relations. Our models have several notable properties: (1) they efficiently model articulation by sharing computation across similar warps, (2) they efficiently model an exponentially large set of global mixtures through composition of local mixtures, and (3) they capture the dependency of global geometry on local appearance (parts look different at different locations). When relations are tree-structured, our models can be efficiently optimized with dynamic programming. We learn all parameters, including local appearances, spatial relations, and co-occurrence relations (which encode local rigidity), with a structured SVM solver. Because our model is efficient enough to be used as a detector that searches over scales and image locations, we introduce novel criteria for evaluating pose estimation and human detection, both separately and jointly. We show that currently used evaluation criteria may conflate these two issues. Most previous approaches model limbs with rigid and articulated templates that are trained independently of each other, while we present an extensive diagnostic evaluation that suggests that flexible structure and joint training are crucial for strong performance. We present experimental results on standard benchmarks that suggest our approach is the state-of-the-art system for pose estimation, improving past work on the challenging Parse and Buffy datasets, while being orders of magnitude faster.
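The abstract above notes that tree-structured models can be optimized with dynamic programming. The following is a minimal sketch of max-sum dynamic programming on a tree, not the authors' implementation. It assumes each part has a small explicit list of candidate states encoded as `(x, y, mixture_type)`, that part 0 is the root, and that every parent index is smaller than its children's. The names `best_configuration`, `toy_unary`, `toy_pairwise` and all toy scores are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

State = Tuple[int, int, int]  # (x, y, mixture_type): a hypothetical encoding


def best_configuration(
    parents: Sequence[int],
    states: Sequence[Sequence[State]],
    unary: Callable[[int, State], float],
    pairwise: Callable[[int, State, State], float],
) -> Tuple[float, List[State]]:
    """Max-sum dynamic programming over a tree of parts.

    Assumes part 0 is the root (parents[0] == -1) and parents[i] < i otherwise.
    """
    n = len(parents)
    # score[i][k]: best score of the subtree rooted at part i in its k-th state.
    score = [[unary(i, s) for s in states[i]] for i in range(n)]
    # back[i][pk]: best state index of part i given its parent's state index pk.
    back = [[0] * len(states[parents[i]]) if i > 0 else [] for i in range(n)]
    for i in range(n - 1, 0, -1):  # children are processed before their parents
        p = parents[i]
        for pk, p_state in enumerate(states[p]):
            value, child_k = max(
                (score[i][ck] + pairwise(i, c_state, p_state), ck)
                for ck, c_state in enumerate(states[i])
            )
            score[p][pk] += value
            back[i][pk] = child_k
    # Pick the best root state, then read the remaining choices top-down.
    root_k = max(range(len(states[0])), key=lambda k: score[0][k])
    choice = [root_k] + [0] * (n - 1)
    for i in range(1, n):
        choice[i] = back[i][choice[parents[i]]]
    return score[0][root_k], [states[i][choice[i]] for i in range(n)]


# Toy three-part chain (root -> part 1 -> part 2); all numbers are made up.
TOY_PARENTS = [-1, 0, 1]
TOY_UNARY = [
    {(0, 0, 0): 1.0, (0, 0, 1): 0.5},
    {(0, 1, 0): 0.2, (2, 1, 1): 1.5},
    {(0, 2, 0): 0.3, (3, 2, 1): 0.4},
]
TOY_STATES = [list(d) for d in TOY_UNARY]


def toy_unary(i: int, s: State) -> float:
    return TOY_UNARY[i][s]


def toy_pairwise(i: int, child: State, parent: State) -> float:
    # Quadratic deformation cost around an assumed ideal offset of (0, +1),
    # plus an assumed bonus when child and parent share a mixture type.
    dx = child[0] - parent[0]
    dy = child[1] - parent[1] - 1
    cooccurrence = 0.5 if child[2] == parent[2] else 0.0
    return cooccurrence - 0.25 * (dx * dx + dy * dy)


if __name__ == "__main__":
    print(best_configuration(TOY_PARENTS, TOY_STATES, toy_unary, toy_pairwise))
```

The cost of this sketch grows with the number of parts times the square of the number of states per part, whereas brute-force enumeration grows exponentially in the number of parts.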
Discovering localized attributes for fine-grained recognition
- In CVPR. IEEE, 2012
Cited by 52 (1 self)
Attributes are visual concepts that can be detected by machines, understood by humans, and shared across categories. They are particularly useful for fine-grained domains where categories are closely related to one another (e.g. bird species recognition). In such scenarios, relevant attributes are often local (e.g. “white belly”), but the question of how to choose these local attributes remains largely unexplored. In this paper, we propose an interactive approach that discovers local attributes that are both discriminative and semantically meaningful from image datasets annotated only with fine-grained category labels and object bounding boxes. Our approach uses a latent conditional random field model to discover candidate attributes that are detectable and discriminative, and then employs a recommender system that selects attributes likely to be semantically meaningful. Human interaction is used to provide semantic names for the discovered attributes. We demonstrate our method on two challenging datasets, Caltech-UCSD Birds-200-2011 and Leeds Butterflies, and find that our discovered attributes outperform those generated by traditional approaches.
Object detection using strongly supervised deformable part models
- In ECCV, 2012
Cited by 42 (3 self)
Deformable part-based models [1, 2] achieve state-of-the-art performance for object detection, but rely on heuristic initialization during training due to the optimization of a non-convex cost function. This paper investigates limitations of such an initialization and extends earlier methods using additional supervision. We explore strong supervision in terms of annotated object parts and use it to (i) improve model initialization, (ii) optimize model structure, and (iii) handle partial occlusions. Our method is able to deal with sub-optimal and incomplete annotations of object parts and is shown to benefit from semi-supervised learning setups where part-level annotation is provided for a fraction of positive examples only. Experimental results are reported for the detection of six animal classes in PASCAL VOC 2007 and 2010 datasets. We demonstrate significant improvements in detection performance compared to the LSVM [1] and the Poselet [3] object detectors.
A database for fine grained activity detection of cooking activities
- In CVPR, 2012
Cited by 40 (5 self)
While activity recognition is a current focus of research, the challenging problem of fine-grained activity recognition is largely overlooked. We thus propose a novel database of 65 cooking activities, continuously recorded in a realistic setting. Activities are distinguished by fine-grained body motions that have low inter-class variability and high intra-class variability due to diverse subjects and ingredients. We benchmark two approaches on our dataset, one based on articulated pose tracks and the second using holistic video features. While the holistic approach outperforms the pose-based approach, our evaluation suggests that fine-grained activities are more difficult to detect and the body model can help in those cases. Providing high-resolution videos as well as an intermediate pose representation, we hope to foster research in fine-grained activity recognition.
Detecting actions, poses, and objects with relational phraselets
- In ECCV, 2012
Cited by 35 (3 self)
We present a novel approach to modeling human pose, together with interacting objects, based on compositional models of local visual interactions and their relations. Skeleton models, while flexible enough to capture large articulations, fail to accurately model self-occlusions and interactions. Poselets and Visual Phrases address this limitation, but do so at the expense of requiring a large set of templates. We combine all three approaches with a compositional model that is flexible enough to model detailed articulations but still captures occlusions and object interactions. Unlike much previous work on action classification, we do not assume test images are labeled with a person, and instead present results for “action detection” in an unlabeled image. Notably, for each detection, our model reports back a detailed description including an action label, articulated human pose, object poses, and occlusion flags. We demonstrate that modeling occlusion is crucial for recognizing human-object interactions. We present results on the PASCAL Action
Joint deep learning for pedestrian detection
- In ICCV, 2013
Cited by 34 (11 self)
Feature extraction, deformation handling, occlusion handling, and classification are four important components in pedestrian detection. Existing methods learn or design these components either individually or sequentially. The interaction among these components is not yet well explored. This paper proposes that they should be jointly learned in order to maximize their strengths through cooperation. We formulate these four components into a joint deep learning framework and propose a new deep network architecture. By establishing automatic, mutual interaction among components, the deep model achieves a 9% reduction in the average miss rate compared with the current best-performing pedestrian detection approaches on the largest Caltech benchmark dataset.
Parsing Clothing in Fashion Photographs
Cited by 33 (3 self)
In this paper we demonstrate an effective method for parsing clothing in fashion photographs, an extremely challenging problem due to the large number of possible garment items, variations in configuration, garment appearance, layering, and occlusion. In addition, we provide a large novel dataset and tools for labeling garment items, to enable future research on clothing estimation. Finally, we present intriguing initial results on using clothing estimates to improve pose identification, and demonstrate a prototype application for pose-independent visual garment retrieval.
Human Pose Estimation using Body Parts Dependent Joint Regressors
Cited by 31 (6 self)
In this work, we address the problem of estimating 2d human pose from still images. Recent methods that rely on discriminatively trained deformable parts organized in a tree model have been shown to be very successful in solving this task. Within such a pictorial structure framework, we address the problem of obtaining good part templates by proposing novel, non-linear joint regressors. In particular, we employ two-layered random forests as joint regressors. The first layer acts as a discriminative, independent body part classifier. The second layer takes the estimated class distributions of the first one into account and is thereby able to predict joint locations by modeling the interdependence and co-occurrence of the parts. This results in a pose estimation framework that already takes dependencies between body parts into account during joint localization and is thus able to circumvent typical ambiguities of tree structures, such as for legs and arms. In the experiments, we demonstrate that our body parts dependent joint regressors achieve a higher joint localization accuracy than tree-based state-of-the-art methods.
Joint training of a convolutional network and a graphical model for human pose estimation
- 2014
Cited by 31 (2 self)
This paper proposes a new hybrid architecture that consists of a deep Convolutional Network and a Markov Random Field. We show how this architecture is successfully applied to the challenging problem of articulated human pose estimation in monocular images. The architecture can exploit structural domain constraints such as geometric relationships between body joint locations. We show that joint training of these two model paradigms improves performance and allows us to significantly outperform existing state-of-the-art techniques.
Histograms of Sparse Codes for Object Detection
Cited by 28 (2 self)
Object detection has seen huge progress in recent years, much thanks to the heavily-engineered Histograms of Oriented Gradients (HOG) features. Can we go beyond gradients and do better than HOG? We provide an affirmative answer by proposing and investigating a sparse representation for object detection, Histograms of Sparse Codes (HSC). We compute sparse codes with dictionaries learned from data using K-SVD, and aggregate per-pixel sparse codes to form local histograms. We intentionally keep true to the sliding window framework (with mixtures and parts) and only change the underlying features. To keep training (and testing) efficient, we apply dimension reduction by computing SVD on learned models, and adopt supervised training where latent positions of roots and parts are given externally, e.g. from a HOG-based detector. By learning and using local representations that are much more expressive than gradients, we demonstrate large improvements over the state of the art on the PASCAL benchmark for both root-only and part-based models.
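The aggregation step mentioned in the abstract above (per-pixel sparse codes pooled into local histograms) can be illustrated with a short sketch. It is an illustration under assumptions, not the paper's pipeline: it takes the sparse codes as given, pools absolute code values over non-overlapping square cells, and L2-normalizes each cell. The function name `cell_histograms`, the use of absolute values, the cell layout, and the normalization are all choices made here for illustration.

```python
from math import sqrt
from typing import List

Codes = List[List[List[float]]]  # codes[y][x] is a K-dimensional sparse code


def cell_histograms(codes: Codes, cell: int) -> Codes:
    """Pool per-pixel codes into one K-bin histogram per cell x cell block.

    Assumptions for this sketch: absolute values are summed, cells do not
    overlap, leftover border pixels are ignored, and each histogram is
    L2-normalized.
    """
    height, width, k = len(codes), len(codes[0]), len(codes[0][0])
    rows, cols = height // cell, width // cell
    hists = [[[0.0] * k for _ in range(cols)] for _ in range(rows)]
    for y in range(rows * cell):
        for x in range(cols * cell):
            hist = hists[y // cell][x // cell]
            for j, activation in enumerate(codes[y][x]):
                hist[j] += abs(activation)
    for row in hists:
        for hist in row:
            norm = sqrt(sum(v * v for v in hist))
            if norm > 0.0:
                for j in range(k):
                    hist[j] /= norm
    return hists


if __name__ == "__main__":
    # A 2x2 image with 2-dimensional codes, pooled into a single cell.
    toy = [[[1.0, 0.0], [0.0, -2.0]], [[0.0, 0.0], [2.0, 0.0]]]
    print(cell_histograms(toy, cell=2))
```

Because most entries of a sparse code are zero, each pixel contributes to only a few histogram bins.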